
CILVR at NYU



Oct 07 2020

jiant is an NLP toolkit: Introducing jiant 2.0

Author(s): Jason Phang and Jesse Swanson
Publication date: October 7 2020
Reviewer: Jason Lee 
Editor: Kyunghyun Cho

We are excited to release jiant 2.0: a natural language understanding toolkit built to reflect the evolving needs of NLP researchers.

Summary

  • jiant is a research library for natural language understanding, with support for multi-task training and transfer learning
  • jiant supports a wide range of tasks (including the GLUE, SuperGLUE, and XTREME benchmarks) and many of the newest transformers models (BERT, RoBERTa, ALBERT, ELECTRA, XLM-R, etc.)
  • jiant is designed to be modular and easily adaptable to different experimental or training workflows
  • Check out jiant 2.0 here: https://github.com/nyu-mll/jiant

What is jiant?

jiant is an NLP research library written in Python. jiant is built for researching general-purpose text understanding models and transfer learning. jiant 1.x has been used in several published research papers. jiant 2.0 was rebuilt from the ground up to support cutting-edge NLP research.

What’s new in 2.0?

The NLP research landscape has changed dramatically since the initial release of jiant 1.0 two years ago. In that time, we have witnessed the rise of BERT-style fine-tuning, the development of a whole sub-field studying BERTology, and community adoption of libraries such as Hugging Face’s transformers, tokenizers, and datasets (formerly nlp).

To support the changing needs of NLP researchers, jiant has undergone major changes and is now built with transformers models and datasets. jiant has a new training pipeline to facilitate modern experimental workflows. jiant is no longer a framework. jiant is a modular library. At the same time, we expanded task and model support.

Why should I use jiant?

With transformers’ large and expanding library of examples, is there really a need for another wrapper library just to run some experiments?

We think so.

When doing research, you want a unified API for running experiments with different models, tasks, or run configurations, rather than making ad-hoc tweaks for each experiment or maintaining separate scripts that duplicate code. jiant is designed to facilitate large-scale, replicable, configuration-driven experiments from a single standardized interface.

To further explain what jiant brings to the table, let’s go over each of the major components of the jiant library.

Tasks

jiant supports more than 50 natural language understanding tasks built in part on Hugging Face’s datasets. These include the GLUE, SuperGLUE, and XTREME benchmark tasks, edge probing tasks, and many more. Each task implementation handles all the preprocessing from the task’s raw data format to transformers model inputs. If you’re using one of the currently supported tasks, you should not need to write any code to start training. If you intend to add a task, that’s easy too.  

Models

jiant supports BERT, RoBERTa, XLM-R, ALBERT, BART, mBART, and ELECTRA, based on Hugging Face’s transformers and tokenizers libraries. jiant models comprise a text encoder (e.g., BERT) and task-specific heads corresponding to each task. Multiple heads can share a single encoder, supporting multi-task training. jiant supports a wide variety of task heads, including classification, regression, multiple choice, span comparison, span prediction, multi-label span comparison, tagging, question answering, and masked language modeling. It is also possible to add support for encoder models beyond those currently supported.
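As a rough conceptual sketch (not jiant’s actual class structure), a multi-task model of this kind can be pictured as one shared encoder routed to a dictionary of task heads; the task names and label counts below are purely illustrative:

import torch.nn as nn
from transformers import AutoModel

class SharedEncoderMultiTaskModel(nn.Module):
    """Illustrative only: one shared encoder, one lightweight head per task."""

    def __init__(self, encoder_name="roberta-base", task_num_labels=None):
        super().__init__()
        # Shared text encoder from Hugging Face transformers.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden_size = self.encoder.config.hidden_size
        # One classification head per task; heads are cheap, the encoder is shared.
        task_num_labels = task_num_labels or {"mrpc": 2, "rte": 2}
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n) for task, n in task_num_labels.items()}
        )

    def forward(self, task_name, input_ids, attention_mask):
        # Encode once, then route the first-token representation to the requested task's head.
        hidden_states = self.encoder(input_ids, attention_mask=attention_mask)[0]
        return self.heads[task_name](hidden_states[:, 0])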

Runner

With task data and models in hand, we need to train and evaluate our models. jiant provides a Runner implementation, which handles the training loop for an experiment (similar to the transformers Trainer). jiant’s Runner natively supports multi-task training, including considerations like different task-sampling methods and different batch sizes or gradient accumulation steps per task. All of this is exposed through a single run script that you can call to start training your model.
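To make the default task-sampling behavior concrete, here is a simplified sketch (not the Runner’s actual code) of proportional sampling: each training step draws a task with probability proportional to its training-set size, and that task supplies the next batch. The MRPC/RTE sizes below are approximate and only for illustration.

import random

def proportional_task_sampler(task_sizes, num_steps, seed=0):
    """Yield a task name for each training step, in proportion to training-set size."""
    rng = random.Random(seed)
    tasks = list(task_sizes)
    weights = [task_sizes[t] for t in tasks]
    for _ in range(num_steps):
        yield rng.choices(tasks, weights=weights, k=1)[0]

# Roughly 60% of steps would draw an MRPC batch and 40% an RTE batch.
task_sizes = {"mrpc": 3700, "rte": 2500}  # approximate training-set sizes, for illustration
schedule = list(proportional_task_sampler(task_sizes, num_steps=1000))
print({task: schedule.count(task) for task in task_sizes})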

How do I start using jiant?

The following is a multi-task training example using RoBERTa on the MRPC and RTE tasks. jiant samples proportionally from each task by default (this is configurable).

Python version:

from jiant.proj.simple import runscript as run
import jiant.scripts.download_data.runscript as downloader

# Download the data
downloader.download_data(["mrpc", "rte"], "/content/data")

# Set up the arguments for the Simple API
args = run.RunConfiguration(
    run_name="simple",
    exp_dir="/content/exp",
    data_dir="/content/data",
    model_type="roberta-base",
    tasks="mrpc,rte",
    train_batch_size=16,
    num_train_epochs=3
)

# Run the experiment
run.run_simple(args)

Bash version:

python jiant/scripts/download_data/runscript.py \
    download \
    --tasks mrpc rte \
    --output_path /content/data

python jiant/proj/simple/runscript.py \
    run \
    --run_name simple \
    --exp_dir /content/exp \
    --data_dir /content/data \
    --model_type roberta-base \
    --tasks mrpc,rte \
    --train_batch_size 16 \
    --num_train_epochs 3

As you can see above, this is pretty simple! The simple API abstracts the complete jiant pipeline for quick evaluation on tasks. For more complex experimental workflows, we provide a more in-depth introduction to jiant here.

Beyond multi-task training, we have examples demonstrating training workflows including:

  • Sequential transfer (STILTs)
  • Zero-shot evaluation on XNLI from a model trained on MNLI

These examples can be found here.

Design Philosophy

A major departure from jiant 1.0 is the change in our design philosophy to build a library, not a framework. In other words, jiant 2.0 is designed to be used piecewise and to support workflows that we haven’t thought of yet. jiant’s task preprocessing can be wrapped and fed into Transformer models. jiant’s models can be used in a different training workflow than the one we provide.

In short, jiant is built to support NLP research across a wide swathe of models and tasks, while remaining flexible enough to support more complex experimental workflows. If there is a new idea you want to experiment with and quickly evaluate against a large variety of tasks or models, jiant is the perfect place to start.

Final thoughts: Building a research library is hard!

In typical software libraries, one aims to draw a clean line between implementation and interface. In a research library like jiant, everything must be exposed and configurable, since researchers require maximum flexibility to implement new methods. Scope creep == research! The trade-off between flexibility and ease of use is a difficult line that we try to walk in jiant.

jiant is under active development, and there are many features we are excited to work on in the near future. We want to support use cases the NLP research community is interested in. We are actively seeking feedback and contributions from the community!

Check out jiant here: https://github.com/nyu-mll/jiant

Revision history

  • October 7 2020: Initial publication


Sep 24 2020

Representation quality and the complexity of learning

Author(s): Will Whitney, Min Jae Song, David Brandfonbrener, Jaan Altosaar, Kyunghyun Cho
Publication date: September 24 2020
Reviewer: Cheolhyoung Lee 
Editor: Kyunghyun Cho

In the last few years, there’s been an explosion of work on learning good representations of data. From NLP [1, 2, 3] to computer vision [4, 5, 6] to reinforcement learning [7, 8, 9], the field has never been hotter. However, defining precisely what we mean by a good representation can be tricky. This has led to a somewhat ad-hoc approach to evaluation in the literature, with each paper choosing its own measure or set of measures and a general sense that our evaluation methods aren’t very robust.

In a recent paper, Evaluating representations by the complexity of learning low-loss predictors [10], we show that many notions of the quality of a representation for a task can be expressed as a function of the loss-data curve. This perspective allows us to see the limitations of existing measures and propose new ones that are more robust.

We think that evaluation is crucially important in the field right now, and we don’t want the measures that we and others have proposed to languish as purely theoretical exercises. Since these measures aren’t trivial to implement or to compute, we are releasing a library called Reprieve for representation evaluation that aims to standardize the evaluation of representation quality. Whether you’re using the measures that we proposed or several others, and no matter what ML library you use, you can evaluate representations with Reprieve.

The Reprieve library

Loss-data curves and existing measures

The loss-data curve, with the size of the training set on the X axis and validation loss on the Y axis, describes how an algorithm’s performance varies based on the amount of training data it’s given. Intuitively, the curve for a representation that allows the algorithm to learn efficiently (with little data) will lie to the left of the curve for a representation that makes learning less efficient. Meanwhile a representation that contains more predictive information will lead to a curve that goes lower as the training set size goes to infinity.
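In practice, a loss-data curve can be estimated by training the same simple probe on nested subsets of the training data and recording its validation loss at each size. Here is a minimal sketch with scikit-learn, where encode stands in for whichever representation function is being evaluated:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def loss_data_curve(encode, X_train, y_train, X_val, y_val,
                    sizes=(64, 256, 1024, 4096)):
    """Validation loss of a fixed probe as a function of training-set size."""
    Z_train, Z_val = encode(X_train), encode(X_val)
    curve = []
    for n in sizes:
        # Assumes every class appears in even the smallest subset.
        probe = LogisticRegression(max_iter=1000).fit(Z_train[:n], y_train[:n])
        curve.append((n, log_loss(y_val, probe.predict_proba(Z_val))))
    return curve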

Loss-data curves and representation quality measures. The red and blue curves are the result of using the same learning algorithm with two different representations of the data.

On the loss-data curve we can graphically show the meaning of several existing evaluation measures for representation quality (left panel).

Validation accuracy with limited data (VA) is the simplest measure. VA corresponds to picking some \(n\) for the dataset size and looking only at a vertical slice of the loss-data curve at that \(n\).

Mutual information (MI) attempts to measure the quality of a representation by its mutual information with the labels [11]. MI is equivalent to considering only the validation loss with infinite training data.

Minimum description length (MDL) is an interesting measure recently proposed by Voita et al. (2020) [12]. Given a fixed dataset, MDL measures the description length of the dataset’s labels (the vector of all the Ys) given its observations (the vector of all the Xs) according to a particular encoding scheme. In the prequential or online coding scheme, a model is trained on the first \(k\) points to predict \(p(Y \mid X)\), and then used to encode the \((k+1)^{\mathrm{th}}\) point. MDL corresponds to the area under the loss-data curve up to \(n\), the full size of the dataset.
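Writing \(L(\mathcal{A}_\phi, i)\) for the expected loss of the learning algorithm after training on \(i\) points (the notation we also use for SDL below), the expected prequential code length is roughly the accumulated per-step loss, which is exactly the area under the loss-data curve up to \(n\): \[ \mathrm{MDL}(n) \approx \sum_{i=1}^{n} L(\mathcal{A}_\phi, i). \]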

An interesting feature of all these methods is that they depend on (or specify, for MI) a particular dataset size. This can be a bit tricky: how much data should an algorithm need to solve a new task? Provide too little data and no representation will allow any learning, but provide too much and only asymptotic loss will matter, not efficiency.

Instead, we will construct an evaluation procedure that measures a property of the data distribution and the learning algorithm, not a particular dataset or dataset size.

Surplus Description Length

We’re going to build on the MDL idea to make a measure of representation quality. To do this, we measure the complexity of learning for a given data distribution and learning algorithm. We have two main goals for this representation evaluation measure:

  1. It should measure a fundamental property of the data distribution and learning algorithm.
  2. The measure shouldn’t depend on a particular sample of a dataset from the data distribution, the size of the dataset, or the order of the points.

Defining surplus description length

To start with, imagine trying to efficiently encode a large number of samples of some random variable \(\mathbf{e}\) which takes discrete values in \(\{1 \ldots K\}\) with probability \(p(\mathbf{e})\). The best possible code for each sample leverages knowledge of the probability of observing that sample, and assigns a code length of \(- \log p(e_i)\) to each sampled value \(e_i\). This results in an expected length per sample of \[ \mathbb{E}_\mathbf{e} [\ell_p(\mathbf{e})] = \mathbb{E}_\mathbf{e} [- \log p(\mathbf{e})] = H(\mathbf{e}) \] where we use \(\ell_p\) to denote the negative log-likelihood loss for the distribution \(p\).

If instead \(\mathbf{e}\) was encoded using some other distribution \(\hat p\), the expected length becomes \(H(\mathbf{e}) + D_{\mathrm{KL}}(p~||~\hat p)\). We call \(D_{\mathrm{KL}}(p~||~\hat p)\) the surplus description length (SDL) from encoding according to \(\hat p\) instead of \(p\). We can also write it as \[ \mathrm{SDL}(\hat p) = D_{\mathrm{KL}}(p~||~\hat p) = \mathbb{E}_{\mathbf{e} \sim p} \left[ \log p(\mathbf{e}) - \log \hat p(\mathbf{e}) \right] \] to highlight how SDL measures only the extra entropy that comes from not having the correct model.

SDL as a measure of representation quality

As our model learns we get a new \(\hat p\) at every training step. Similarly to MDL with online codes [12], we measure the SDL of the learned model at each step and then sum them up. Writing the expected loss of running algorithm \(\mathcal{A}\) on a dataset with \(i\) points as \(L(\mathcal{A}_\phi, i)\), the SDL measure of representation quality is \[ m_{\mathrm{SDL}}(\phi, \mathcal{D}, \mathcal{A}) = \sum_{i=1}^\infty \Big[ L(\mathcal{A}_\phi, i) - H(\mathbf{Y} \mid \mathbf{X}) \Big]. \]

We show in the paper that MDL is a special case of SDL which assumes that the true distribution of \(\mathbf{e}\) is a delta mass, which is to say that \(\mathbf{e}\) has no randomness at all. This leads to some odd properties with real data, which typically has noise. MDL goes to infinity with the size of the dataset even for algorithms which learn the true data distribution, which makes numbers hard to compare. More worryingly, if we rank the quality of two representations using MDL, that ranking can (and in practice does) switch as we change the dataset size. That means our conclusions about which representation is better are totally dependent on how much data we have to evaluate them!

Since in practice we don’t know the true entropy of the data distribution, we also propose a version of the SDL measure where we set some threshold \(\varepsilon\) as a criterion for success instead of using the true entropy of the data. As long as \(\varepsilon > H(\mathbf{Y} \mid \mathbf{X})\), this still has most of the same nice properties. A good way to set \(\varepsilon\) would be to run the learning algorithm on a large amount of data using the raw representation of the data, then set \(\varepsilon\) to the loss of that model plus a small slack term for estimation error.

We also propose a simpler measure called \(\varepsilon\) sample complexity, or \(\varepsilon\)SC, which is the number of training points required for the expected loss to drop below \(\varepsilon\). For full details on that check out the paper!
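As a rough numerical sketch (not the Reprieve implementation), given a loss-data curve estimated at an increasing grid of dataset sizes, the thresholded SDL and \(\varepsilon\)SC can be approximated as follows; each grid point stands in for the integer sizes up to it, and the surplus is clipped at zero so that sizes whose loss is already below \(\varepsilon\) contribute nothing:

import numpy as np

def surplus_description_length(sizes, losses, epsilon):
    """Approximate SDL from a loss-data curve; epsilon should upper-bound H(Y | X)."""
    sizes, losses = np.asarray(sizes), np.asarray(losses)
    widths = np.diff(np.concatenate([[0], sizes]))  # integer sizes each grid point represents
    surplus = np.clip(losses - epsilon, 0.0, None)
    return float(np.sum(widths * surplus))

def epsilon_sample_complexity(sizes, losses, epsilon):
    """Smallest measured dataset size whose expected loss is at or below epsilon."""
    for n, loss in zip(sizes, losses):
        if loss <= epsilon:
            return n
    return float("inf")  # not enough data to evaluate at this tolerance

Returning infinity when the curve never reaches \(\varepsilon\) corresponds to the “not enough data to evaluate” verdict discussed below.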

Representation evaluation in practice

With our tools in hand, we can examine some practical representations. Looking first at MNIST, we compare using the raw pixels to using neural encoders pretrained on supervised CIFAR classification or trained without supervision as a low-dimensional VAE on MNIST.

Results on MNIST. Since SDL measures a property of the data distribution, not a particular dataset, its values don’t change as the dataset grows.

As you can see from the loss-data curve (right), these representations perform very differently! While the VAE representation allows the quickest learning at first, it makes achieving very low loss hard. Meanwhile the CIFAR pretrained representation supports learning that’s more efficient than raw pixels for any loss.

Looking at the evaluation measures, we see that the existing measures like validation loss and MDL tend to switch their rankings when larger datasets are used for evaluation. Meanwhile SDL and \(\varepsilon\)SC know when there isn’t enough data available to evaluate a representation, and once they make a judgement, it sticks.

To show that this phenomenon isn’t just limited to vision tasks or small datasets, we also provide experiments on a part of speech classification task using pretrained representations from ELMo [2]. Just like on MNIST, validation loss and MDL make very different predictions with small evaluation datasets than with large ones.

Results on part of speech classification.

Better representation evaluation for everyone

Existing measures of representation quality, which are functions of a particular dataset rather than the data distribution, can have some tricky behavior. Whether you use our measures or not, we urge our fellow members of the representation learning community to think carefully about the measures and procedures that you use to evaluate representations.

Reprieve, our library for representation evaluation, is one tool that we think can help. By using the powerful program transformations provided by JAX, Reprieve is able to train the roughly 100 small networks required to construct a loss-data curve in parallel on one GPU in about two minutes. From there it can compute all of the measures that we mentioned today.

We hope that by standardizing on one codebase for evaluation, we in the representation learning community can move faster while producing results that are more comparable and more reproducible. If Reprieve is missing a measure that you think is important, submit a pull request!

Revision history

  • September 24 2020: Initial publication

  1. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  2. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer. Deep contextualized word representations.
  3. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach.
  4. Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton. A Simple Framework for Contrastive Learning of Visual Representations.
  5. Aaron van den Oord, Yazhe Li, Oriol Vinyals. Representation Learning with Contrastive Predictive Coding.
  6. Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning.
  7. Aravind Srinivas, Michael Laskin, Pieter Abbeel. CURL: Contrastive Unsupervised Representations for Reinforcement Learning.
  8. Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, Marc G. Bellemare. DeepMDP: Learning Continuous Latent Space Models for Representation Learning.
  9. Amy Zhang, Rowan McAllister, Roberto Calandra, Yarin Gal, Sergey Levine. Learning Invariant Representations for Reinforcement Learning without Reconstruction.
  10. William F. Whitney, Min Jae Song, David Brandfonbrener, Jaan Altosaar, Kyunghyun Cho. Evaluating representations by the complexity of learning low-loss predictors.
  11. Note that actually measuring the mutual information between the random variables of the representation and the data requires arbitrarily large models, infinite data, and unbounded computation. Mutual information is not a nice quantity to compute with.
  12. Elena Voita, Ivan Titov. Information-Theoretic Probing with Minimum Description Length.


Aug 27 2020

Influence Functions Do Not Seem to Predict Usefulness in NLP Transfer Learning

Author(s): Vid Kocijan and Samuel R. Bowman
Publication date: August 27 2020
Reviewer: Alex Wang 
Editor: Kyunghyun Cho

One of the ways to improve the performance of a neural model on a task with scarce data is to pre-train it on a related task first. Large-scale language models, such as RoBERTa (Liu et al. ’19), are pre-trained on large corpora of text using unsupervised training objectives, such as masked token prediction. Fine-tuning such a pre-trained language model on the target task usually outperforms a model trained on the target task data only. In certain cases, e.g., when the target task’s training set is small, it can be beneficial to train the pre-trained language model on an “intermediate task” first and only then fine-tune it on the target task (Pruksachatkun et al. ’20). In this project, we took a closer look at this scenario. Since the intermediate task may come from a domain unrelated to that of the target task, we suspect that not all examples in the training set of the intermediate task benefit the target task.

To identify and filter out the examples that do not positively impact the performance of the model, we used influence functions (Cook et al. ’80). Influence functions are a method from robust statistics that measures the dependence of an estimator on the value of a single point in the sample. Informally, they estimate how the removal (or addition) of a training example impacts the predictions of the model, making them an obvious choice for this problem. In deep learning, they have already been successfully applied as explanations (Koh et al. ’17) and as an estimate of the quality of training samples (Wang et al. ’18, Yang et al. ’20). Using them in the context of transfer learning thus seems like a natural progression.
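Concretely, for a model with trained parameters \(\hat\theta\) and per-example loss \(\ell\), the formulation popularized by Koh et al. ’17 approximates the effect of up-weighting a training example \(z\) on the loss of a test example \(z_{\mathrm{test}}\) as \[ \mathcal{I}(z, z_{\mathrm{test}}) = - \nabla_\theta \ell(z_{\mathrm{test}}, \hat\theta)^\top H_{\hat\theta}^{-1} \nabla_\theta \ell(z, \hat\theta), \] where \(H_{\hat\theta}\) is the Hessian of the training loss at \(\hat\theta\). Summing this quantity over a validation set gives each training example a single score for its estimated effect on the validation loss, which is the kind of score used for filtering in the experiments below.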

There are two potential problems with using influence functions in the context of transfer learning for deep neural networks. Firstly, they have only been proven to work for models with a convex optimization criterion. Secondly, they were not designed for two-stage training in which the model architecture changes between stages (the last layer is a task-specific classification layer, so it changes between tasks). A few previous works have successfully used influence functions with neural networks despite the non-convex optimization criterion (Koh et al. ’17, Wang et al. ’18, Yang et al. ’20). We designed our experiments in a way that avoids the second issue: the model was trained on a single training set (the SNLI training set; Bowman et al. ’15) and evaluated on a validation set with the same output format (the MNLI matched validation set; Williams et al. ’18), avoiding the need to change the last layer.

We found that training RoBERTa (Liu et al. ’19) on the SNLI training set, filtered using influence functions w.r.t. the MNLI validation set, resulted in a large performance drop on that same validation set. Moreover, a dataset consisting of examples with either the highest or the lowest influence scores resulted in a similar drop in performance. Using middle-ranked examples, on the other hand, resulted in performance similar to, or slightly better than, random downsampling. However, none of the results outperformed training on the full training set.

Experiments

We trained an instance of RoBERTa on the SNLI training set while validating it on the MNLI (matched) validation set. We used influence functions to estimate the influence of each training example on the validation set loss. In simplified terms, each example is assigned a real number estimating how much it contributes to the validation loss. By retaining only the examples with positive influence, we retain approximately 40% of the training set. The distribution of all influences can be found in Figure 1.

Figure 1: Distribution of influences of examples in the training set. We can see that the large majority of the examples are centred around 0.

To gain more insight into the impact of the influence functions, we additionally experimented with keeping only the 25%, 50%, and 75% best-ranked examples. We also conducted experiments with the 25%, 50%, and 75% worst-ranked examples, as well as an experiment with exactly the examples that were estimated to be detrimental (60% of the dataset). For additional comparison, two more series of experiments were conducted: one with randomly downsampled examples and one keeping exactly 50% and 25% of the dataset by taking the middle-ranked examples. The results can be found in Figure 2.

Using subsets with only detrimental or only beneficial examples significantly reduces the performance of the model. All experiments were conducted with the same set of hyperparameters that worked best for the full dataset. We ran an additional hyperparameter search to check whether training on a filtered dataset requires different hyperparameters; however, the accuracy of the re-trained models improved only marginally or not at all.

Figure 2: Performance of RoBERTa on the MNLI matched validation set, fine-tuned on subsets of the SNLI according to the influence of examples. We can see that the computed influence of examples does not correlate to the performance of the trained model as subsets of either only beneficial or detrimental examples give results significantly worse than random downsampling.

Discussion

The results of the experiments showed that the signal from the influence functions is not noise, as the difference from random downsampling is too large to be a coincidence. It is, however, unclear what causes this and how it could be useful. There are several potential explanations for why these experiments could fail to yield positive results; e.g., Basu et al. (2020) note that the fragility of influence functions grows with the size of the network, and RoBERTa most definitely constitutes a very large neural network by the standards of that paper. Moreover, it is well known that an increase or decrease in validation loss does not always correspond to an increase or decrease in classification accuracy on the validation set.

However, the potential explanations in the previous paragraph can only account for noise in the results, not for such a large drop in performance. We manually inspected and analysed the data, but we were not able to spot any obvious patterns in the filtered datasets that could explain this phenomenon. Since we could not find a practical use for this behaviour, we concluded the project and leave further investigation to future work.

Revision history

  • August 27 2020: Initial publication


Jul 02 2020

The MiniBERTas: Testing what RoBERTa learns with varying amounts of pretraining

Author(s): Yian Zhang, Haokun Liu, Haau-Sing Li, Alex Warstadt, Samuel R. Bowman
Publication date: July 2 2020
Reviewer: Iacer Calixto 
Editor: Kyunghyun Cho

Big pretrained MLMs like BERT and RoBERTa are the bread and butter of NLP, and clearly have learned a good deal both about English grammar and about the outside world. Lots of recent work tries to understand what these models have and haven’t learned for both practical and scientific purposes (Tenney et al. ’19; Rogers et al. ’20; Ettinger et al. ’19; Warstadt et al. ’20). For either purpose, it is valuable to investigate the impact of pretraining data size on the model’s knowledge or bias. When do we need a model that has been pretrained on internet-scale corpora, and when does anything with some MLM training suffice?

To answer this question, one needs to vary or control the amount of pretraining data models are exposed to, which is usually complex and somewhat expensive. To help reduce the challenge faced by researchers interested in this topic, we decided to release the MiniBERTas. The MiniBERTas are a family of models based on RoBERTa that we pretrained with different amounts of data for part of our own ongoing work. They are available now through HuggingFace Transformers or jiant. 

The models are pretrained on datasets containing 1M, 10M, 100M, and 1B tokens sampled proportionally from Wikipedia and a reproduction of Toronto BookCorpus, the two datasets that make up the original pretraining dataset of BERT. We did not use the full RoBERTa training set because some subsets of it were too difficult to get or reproduce, and the language of the BERT training set is decently representative of the full RoBERTa training set. The model checkpoints and the user manual are available at https://huggingface.co/nyu-mll, and we provide instructions below to use them through our jiant multitask and transfer learning toolkit. 
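For example, the checkpoints load through Hugging Face Transformers like any other RoBERTa model; here is a minimal sketch using the roberta-base-1B-3 checkpoint that also appears in the probing tutorial below:

from transformers import AutoModel, AutoTokenizer

model_name = "nyu-mll/roberta-base-1B-3"  # one of the released MiniBERTa checkpoints

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a sentence and inspect the contextualized representations.
inputs = tokenizer("The MiniBERTas are pretrained on 1M to 1B tokens.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)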

Pretraining the MiniBERTas

The models are pretrained using the codebase shared by Liu et al. ’19. We reproduced RoBERTa pretraining exactly, except that we used the pretraining datasets of BERT, trained with smaller batch sizes, and varied the size of the pretraining data. For each pretraining size, we selected and released 3 model checkpoints based on validation perplexity out of 25 runs (or 10 runs for the 1B dataset), each with different hyperparameters. The hyperparameters searched over include model size, max steps, peak learning rate, and batch size. Our best models pretrained on 1M, 10M, 100M, and 1B tokens reach validation perplexities of 134.18, 10.78, 4.61, and 3.84, compared with RoBERTa-base’s 3.41 and RoBERTa-large’s 2.69. Details of the hyperparameters can be found here.

Probing the MiniBERTas

As an example of what can be done with the MiniBERTas, we run edge probing (Tenney et al. ’19) with two NLP tasks to observe the effect of pretraining data size on the quality of contextualized representations encoded by RoBERTa. Edge Probing is a probing method that involves training a simple neural network (an MLP in our case) on top of a frozen pretrained encoder to perform an NLP labeling task. The figure below illustrates a forward pass when the model is performing the task of semantic role labeling. The MLP takes contextualized representations of “eat” and “strawberry ice cream”, and predicts label A1 as positive and others as negative. Since the classification network is simple, its performance on the target tasks largely reflects the quality of the contextualized representations.

Edge Probing architecture (src: Tenney et al. ’19)
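As a conceptual sketch of this setup (simplified, and not jiant’s actual edge-probing code, which uses more sophisticated span pooling), the probe below mean-pools a frozen encoder’s token representations over each span, concatenates the two span vectors, and scores every label with a small MLP:

import torch
import torch.nn as nn

class EdgeProbingHead(nn.Module):
    """Simplified edge-probing classifier trained on top of a frozen encoder."""

    def __init__(self, hidden_size, num_labels, mlp_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_size, mlp_dim),
            nn.Tanh(),
            nn.Linear(mlp_dim, num_labels),
        )

    @staticmethod
    def pool_span(hidden_states, span):
        # Mean-pool token representations inside [start, end); for simplicity the same
        # span indices are used for every example in the batch.
        start, end = span
        return hidden_states[:, start:end].mean(dim=1)

    def forward(self, hidden_states, span1, span2):
        # hidden_states come from a frozen pretrained encoder (e.g., a MiniBERTa),
        # so only the MLP parameters receive gradient updates.
        pooled = torch.cat([self.pool_span(hidden_states, span1),
                            self.pool_span(hidden_states, span2)], dim=-1)
        return self.mlp(pooled)  # one logit per candidate label, e.g. per semantic role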

We probe the MiniBERTas and RoBERTa-base on dependency labeling (a syntactic task) and relation classification (a semantic task), two of the eight tasks adopted in the original paper. The results are shown below. On both tasks, it takes only 100M tokens of pretraining for the model to outperform ELMo, which is pretrained on 1B tokens.

Dependency labeling and relation classification results with the MiniBERTas and RoBERTa-base

The MiniBERTas also paint a more nuanced picture of how these tasks are learned. The model’s performance on dependency labeling stabilizes at 10M tokens, while its performance on relation classification stabilizes at around 100M tokens. Also, RoBERTa-1M’s performance on relation classification is 24.3 f1 points worse than RoBERTa-base’s, while its f1 on dependency labeling is only 7.1 points lower. Both contrasts suggest that the syntactic knowledge needed for dependency labeling can be acquired with less pretraining data than the semantic knowledge needed for relation classification.

Tutorial: Probing MiniBERTas with jiant

Edge probing with MiniBERTas is now supported by the latest GitHub development version of  jiant. You can use the commands below to reproduce the experiments in the previous section. We encourage you to set up jiant first following this tutorial if you have not used it before. 

The commands below can be used to download and preprocess the data for dependency labeling. Change directory to the root directory of the jiant repository and activate your jiant environment before running the commands.

mkdir data

mkdir data/edges

probing/data/get_ud_data.sh data/edges/dep_ewt

python probing/get_edge_data_labels.py -o data/edges/dep_ewt/labels.txt -i data/edges/dep_ewt/*.json

python probing/retokenize_edge_data.py -t nyu-mll/roberta-base-1B-3  data/edges/dep_ewt/*.json

If you have not used jiant before, you will probably need to set two critical environment variables: 

$JIANT_PROJECT_PREFIX: the directory where logs and model checkpoints will be saved.

$JIANT_DATA_DIR: the data directory. Set it to PATH/TO/LOCAL/REPO/data

You can now run the probing experiment by:

python main.py --config_file jiant/config/edgeprobe/edgeprobe_miniberta.conf --overrides "exp_name=DL_tutorial, target_tasks=edges-dep-ud-ewt, transformers_output_mode=mix, input_module=nyu-mll/roberta-base-1B-3, target_train_val_interval=1000, batch_size=32, target_train_max_vals=130, lr=0.0005"

A logging message will be printed out after each validation. You should expect validation f1 to exceed 90 in only a few validations.

The final validation result will be printed after the experiment is finished, and can also be found in $JIANT_PROJECT_PREFIX/DL_tutorial/results.tsv. You should expect the final validation f1 to be around 95.

Revision history

  • July 2 2020: Initial publication
  • July 9 2020: Changed some font styles. No edits made to the content.

 


Jul 01 2020

Welcome to CILVR Blog

Welcome to CILVR Blog!

CILVR is a machine learning lab at NYU, led by 11 faculty members from five units (computer science, mathematics, linguistics, psychology, and data science) and 7 affiliated faculty members from the broader NYU community, including physics, radiology, and NYU Shanghai. CILVR bustles with more than 50 members, including postdoctoral fellows, PhD students, MSc students, and software engineers, in addition to the faculty members, and broadly covers the core areas of artificial intelligence, from theoretical foundations and algorithmic research to application areas including computer vision, natural language processing, neuroscience, physics, psychology, and robotics.

In this blog, we feature three types of blog posts. Research posts feature our own research and explain it in a more accessible manner so that it can be more easily consumed by the general public. In review posts, our members summarize and review a particular research subject to provide a broader overview of it. We also host perspective posts in which our members share their views on particular research subjects as well as broader fields.

Each post is prepared by one or a group of CILVR members and is lightly reviewed by another CILVR member, with the goal of broad dissemination rather than meeting the standard of academic publications.

For any concerns or feedback, please contact Kyunghyun Cho, who is the temporary editor of the blog.

Please enjoy!

