Author(s): Yian Zhang, Haokun Liu, Haau-Sing Li, Alex Warstadt, Samuel R. Bowman
Publication date: July 2 2020
Reviewer: Iacer Calixto
Editor: Kyunghyun Cho
Big pretrained MLMs like BERT and RoBERTa are the bread and butter of NLP, and clearly have learned a good deal both about English grammar and about the outside world. Lots of recent work tries to understand what these models have and haven’t learned for both practical and scientific purposes (Tenney et al. ’19; Rogers et al. ’20; Ettinger et al. ’19; Warstadt et al. ’20). For either purpose, it is valuable to investigate the impact of pretraining data size on the model’s knowledge or bias. When do we need a model that has been pretrained on internet-scale corpora, and when does anything with some MLM training suffice?
To answer this question, one needs to vary or control the amount of pretraining data that models are exposed to, which is usually complicated and somewhat expensive. To lower this barrier for researchers interested in the topic, we decided to release the MiniBERTas: a family of models based on RoBERTa that we pretrained on different amounts of data as part of our own ongoing work. They are available now through HuggingFace Transformers or jiant.
The models are pretrained on datasets containing 1M, 10M, 100M, and 1B tokens, sampled proportionally from Wikipedia and a reproduction of Toronto BookCorpus, the two datasets that make up the original BERT pretraining corpus. We did not use the full RoBERTa training set because some of its subsets are difficult to obtain or reproduce, and because the language of the BERT training set is reasonably representative of the full RoBERTa training set. The model checkpoints and the user manual are available at https://huggingface.co/nyu-mll, and we provide instructions below for using them through our jiant multitask and transfer learning toolkit.
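As a quick sanity check, the checkpoints can be loaded directly with HuggingFace Transformers. The sketch below (not part of the release itself; the sentence is an arbitrary example) loads one of the released 1B-token checkpoints and encodes a sentence:

import torch
from transformers import AutoModel, AutoTokenizer

# Load one of the released checkpoints (here, the third 1B-token model).
model_name = "nyu-mll/roberta-base-1B-3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

# Encode a sentence and grab the contextualized representations.
inputs = tokenizer("The MiniBERTas are pretrained on 1M to 1B tokens.", return_tensors="pt")
with torch.no_grad():
    last_hidden_state = model(**inputs)[0]  # shape: (1, sequence_length, hidden_size)
print(last_hidden_state.shape)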
Pretraining the MiniBERTas
The models are pretrained using the codebase shared by Liu et al. ’19. We exactly reproduced RoBERTa pretraining, except that we used the pretraining datasets of BERT, trained with smaller batch sizes, and varied the amount of pretraining data. For each pretraining size, we selected and released the 3 best model checkpoints by validation perplexity out of 25 runs (10 runs for the 1B dataset), each with different hyperparameters. The hyperparameter search covered model size, max steps, peak learning rate, and batch size. Our best models pretrained on 1M, 10M, 100M, and 1B tokens reach validation perplexities of 134.18, 10.78, 4.61, and 3.84, compared with RoBERTa-base’s 3.41 and RoBERTa-large’s 2.69. Details of the hyperparameters can be found here.
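For context, masked-LM validation perplexity is just the exponential of the mean cross-entropy over masked tokens. The sketch below (not our actual validation script; the sentence and masked position are arbitrary) estimates it for a single mask with one of the released checkpoints:

import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "nyu-mll/roberta-base-1B-3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

enc = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
input_ids = enc["input_ids"]

# Mask a single content token and only score that position (-100 = ignore in the loss).
position = 4
masked = input_ids.clone()
labels = torch.full_like(input_ids, -100)
labels[0, position] = input_ids[0, position]
masked[0, position] = tokenizer.mask_token_id

with torch.no_grad():
    loss = model(input_ids=masked, attention_mask=enc["attention_mask"], labels=labels)[0]

# Perplexity is exp(cross-entropy); in practice it is averaged over a large validation set.
print(f"single-mask perplexity estimate: {math.exp(loss.item()):.2f}")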
Probing the MiniBERTas
As an example of what can be done with the MiniBERTas, we run edge probing (Tenney et al. ’19) on two NLP tasks to observe the effect of pretraining data size on the quality of the contextualized representations that RoBERTa encodes. Edge probing trains a simple neural network (an MLP in our case) on top of a frozen pretrained encoder to perform an NLP labeling task. The figure below illustrates a forward pass when the model performs semantic role labeling: the MLP takes the contextualized representations of “eat” and “strawberry ice cream” and predicts the label A1 as positive and all other labels as negative. Since the classification network is simple, its performance on the target task largely reflects the quality of the contextualized representations.
Edge Probing architecture (src: Tenney et al. ’19)
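To make the setup concrete, here is a heavily simplified sketch of an edge probing classifier. It is not Tenney et al.’s exact architecture (their probe uses a learned span-pooling operator and a mix of encoder layers, among other details); this version just mean-pools the frozen encoder’s token representations over each span, concatenates the two span vectors, and feeds them to a small MLP that scores every label:

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

# Frozen pretrained encoder.
encoder_name = "nyu-mll/roberta-base-1B-3"
tokenizer = AutoTokenizer.from_pretrained(encoder_name)
encoder = AutoModel.from_pretrained(encoder_name)
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

# The trainable probe: a small MLP over a pair of concatenated span vectors.
num_labels = 40  # size of the task's label set (illustrative)
hidden = encoder.config.hidden_size
probe = nn.Sequential(nn.Linear(2 * hidden, 256), nn.Tanh(), nn.Linear(256, num_labels))

def span_repr(states, start, end):
    # Mean-pool the encoder states over a [start, end) token span.
    return states[0, start:end].mean(dim=0)

enc = tokenizer("I eat strawberry ice cream.", return_tensors="pt")
with torch.no_grad():
    states = encoder(**enc)[0]  # (1, sequence_length, hidden_size)

# Score every label for the pair ("eat", "strawberry ice cream"); the token
# indices below are illustrative and depend on the tokenizer's segmentation.
pair = torch.cat([span_repr(states, 2, 3), span_repr(states, 3, 6)])
logits = probe(pair)  # trained with a per-label binary loss in edge probing
print(torch.sigmoid(logits).shape)  # (num_labels,)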
We probe the MiniBERTas and RoBERTa-base on two of the eight tasks adopted in the original paper: dependency labeling, a syntactic task, and relation classification, a semantic task. The results are shown below. On both tasks, it takes only 100M tokens of pretraining data for the model to outperform ELMo, which is pretrained on 1B tokens.
Dependency labeling and relation classification results with the MiniBERTas and RoBERTa-base
The MiniBERTas also paint a more nuanced picture of how these tasks are learned. Performance on dependency labeling stabilizes at 10M tokens, while performance on relation classification only stabilizes at around 100M tokens. Also, RoBERTa-1M trails RoBERTa-base by 24.3 f1 points on relation classification, but by only 7.1 points on dependency labeling. Both contrasts suggest that the syntactic knowledge needed for dependency labeling can be acquired with less pretraining data than the semantic knowledge needed for relation classification.
Tutorial: Probing MiniBERTas with jiant
Edge probing with MiniBERTas is now supported by the latest GitHub development version of jiant. You can use the commands below to reproduce the experiments in the previous section. We encourage you to set up jiant first following this tutorial if you have not used it before.
The commands below can be used to download and preprocess the data for dependency labeling. Change directory to the root directory of the jiant repository and activate your jiant environment before running the commands.
# Create the directories for the edge probing data
mkdir data
mkdir data/edges
# Download the dependency labeling data (UD English Web Treebank) and convert it to the edge probing format
probing/data/get_ud_data.sh data/edges/dep_ewt
# Collect the task's label set
python probing/get_edge_data_labels.py -o data/edges/dep_ewt/labels.txt -i data/edges/dep_ewt/*.json
# Retokenize the data with the tokenizer of the MiniBERTa checkpoint used below
python probing/retokenize_edge_data.py -t nyu-mll/roberta-base-1B-3 data/edges/dep_ewt/*.json
If you have not used jiant before, you will probably need to set two critical environment variables:
$JIANT_PROJECT_PREFIX: the directory where logs and model checkpoints will be saved.
$JIANT_DATA_DIR: the data directory. Set it to PATH/TO/LOCAL/REPO/data.
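For example (both values are placeholders; point the first wherever you want experiment outputs to go, and the second at the data directory created above):

export JIANT_PROJECT_PREFIX=/path/to/experiment/outputs
export JIANT_DATA_DIR=PATH/TO/LOCAL/REPO/data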
You can now run the probing experiment with:
python main.py --config_file jiant/config/edgeprobe/edgeprobe_miniberta.conf --overrides "exp_name=DL_tutorial, target_tasks=edges-dep-ud-ewt, transformers_output_mode=mix, input_module=nyu-mll/roberta-base-1B-3, target_train_val_interval=1000, batch_size=32, target_train_max_vals=130, lr=0.0005"
A logging message will be printed out after each validation. You should expect validation f1 to exceed 90 in only a few validations.
The final validation result will be printed after the experiment is finished, and can also be found in $JIANT_PROJECT_PREFIX/DL_tutorial/results.tsv. You should expect the final validation f1 to be around 95.
Revision history
- July 2 2020: Initial publication
- July 9 2020: Changed some font styles. No edits made to the content.