Author(s): Vid Kocijan and Samuel R. Bowman
Publication date: August 27, 2020
Reviewer: Alex Wang
Editor: Kyunghyun Cho
One of the ways to improve the performance of a neural model on a task with scarce data is to pre-train it on a related task first. Large-scale language models, such as RoBERTa (Liu et al. ’19), are pre-trained on large corpora of text using unsupervised training objectives, such as masked token prediction. Fine-tuning such a pre-trained language model on a target task usually outperforms a model trained on the target-task data only. In certain cases, e.g. when the target task's training set is small, it can be beneficial to train the pre-trained language model on an “intermediate task” first and only then fine-tune it on the target task (Pruksachatkun et al. ’20). In this project, we took a closer look at this scenario. Since the intermediate task may come from a domain unrelated to that of the target task, we suspect that not all examples in the intermediate task's training set benefit the target task.
To identify and filter out the examples that do not positively impact the performance of the model, we used influence functions (Cook et al. ’80). Influence functions are a method from robust statistics that measures how much an estimator depends on the value of a single point in the sample. Informally, they estimate how the removal (or addition) of a training example impacts the predictions of the model, which makes them a natural choice for this problem. In deep learning, they have already been successfully applied to explain model predictions (Koh et al. ’17) and to estimate the quality of training samples (Wang et al. ’18, Yang et al. ’20). Using them in the context of transfer learning thus seems like a natural progression.
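For reference, the standard formulation from Koh et al. (’17), which this line of work builds on, estimates the effect of upweighting a training example z on the loss at a test point z_test as follows (this is the textbook definition, not a statement of our exact implementation):

```latex
% Influence of upweighting training example z on the loss at test point z_test
% (Koh et al. '17); \hat{\theta} are the trained parameters and H is the Hessian
% of the average training loss.
\mathcal{I}_{\mathrm{up,loss}}(z, z_{\mathrm{test}})
  = -\nabla_{\theta} L(z_{\mathrm{test}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}\,
     \nabla_{\theta} L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta}).
```

Removing z from a training set of size n then changes the test loss by approximately -(1/n) of this quantity; in our setting, the test points are the examples of the validation set on which the influence is measured.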
There are two potential problems with using influence functions in the context of transfer learning of deep neural networks. Firstly, they have only been proven to work for models with a convex optimization criterion. Secondly, they were not designed for two-stage training in which the model architecture changes between stages (the last layer is a task-specific classification layer, so it changes between tasks). A few previous works have successfully used influence functions with neural networks despite the non-convex optimization criterion (Koh et al. ’17, Wang et al. ’18, Yang et al. ’20). We designed our experiments in a way that avoids the second issue: the model was trained on a single training set (the SNLI training set; Bowman et al. ’15) and evaluated on a validation set with the same output format (the MNLI matched validation set; Williams et al. ’18), avoiding the need to change the last layer.
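To illustrate why the last layer can stay fixed: SNLI and MNLI are both three-way natural language inference tasks (entailment, neutral, contradiction), so a single classification head covers both. The sketch below is only illustrative, not our training code; the checkpoint name, the label ordering, and the example sentence pair are placeholders, and the API details follow recent versions of the Hugging Face transformers library.

```python
from transformers import RobertaForSequenceClassification, RobertaTokenizer

# Both SNLI and MNLI use these three labels (ordering here is illustrative).
LABELS = ["entailment", "neutral", "contradiction"]

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-large", num_labels=len(LABELS)
)

# The same three-way head scores an SNLI training pair and an MNLI validation
# pair alike, so no architectural change is needed between the two datasets.
inputs = tokenizer(
    "A man is playing a guitar.",   # premise (made-up example)
    "A person is making music.",    # hypothesis (made-up example)
    return_tensors="pt",
)
logits = model(**inputs).logits     # shape: (1, 3), one score per NLI label
```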
We found that training RoBERTa (Liu et al. ’19) on the SNLI training set, filtered using influence functions computed w.r.t. the MNLI validation set, resulted in a large performance drop on that same validation set. Moreover, datasets consisting of the examples with either the highest or the lowest influence scores seemed to result in a similar drop in performance. Using middle-ranked examples, on the other hand, resulted in performance similar to, or slightly better than, random downsampling. However, none of the results outperformed training on the full training set.
Experiments
We trained an instance of RoBERTa on the SNLI training set while validating it on the MNLI (matched) validation set. We used influence functions to estimate the influence of each training example on the validation set loss. In simplified terms, each training example is assigned a real number estimating how much it contributes to the validation loss. By retaining only the examples with positive influence, we retain approximately 40% of the training set. The distribution of all influences can be found in Figure 1.
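The sketch below shows, at a high level, how such per-example scores can be estimated in PyTorch, using a LiSSA-style approximation of the inverse-Hessian-vector product in the spirit of Koh et al. (’17). It is a simplified sketch under stated assumptions, not our exact implementation: `model`, `loss_fn`, `params`, `train_loader`, and the damping/scaling/step hyperparameters are all placeholders.

```python
import torch

def flat_grad(loss, params, create_graph=False):
    """Gradient of `loss` w.r.t. `params`, flattened into a single vector."""
    grads = torch.autograd.grad(loss, params, create_graph=create_graph)
    return torch.cat([g.reshape(-1) for g in grads])

def hvp(loss, params, vec):
    """Hessian-vector product H @ vec, where H is the Hessian of `loss`."""
    flat = flat_grad(loss, params, create_graph=True)
    return flat_grad((flat * vec).sum(), params)

def inverse_hvp(model, loss_fn, params, g_val, train_loader,
                damping=0.01, scale=500.0, steps=1000):
    """LiSSA-style estimate of H^{-1} g_val using stochastic Hessian estimates.
    Hyperparameter values here are illustrative placeholders."""
    h = g_val.clone()
    data = iter(train_loader)
    for _ in range(steps):
        try:
            inputs, labels = next(data)
        except StopIteration:
            data = iter(train_loader)
            inputs, labels = next(data)
        loss = loss_fn(model(inputs), labels)
        # Recursion: h <- g_val + (I - (H + damping * I) / scale) h
        h = g_val + h - (hvp(loss, params, h) + damping * h) / scale
    return h / scale

def influence_scores(model, loss_fn, params, s_val, train_examples):
    """Score each training example. Under this sign convention (conventions
    vary in the literature), a positive score means removing the example is
    estimated to increase the validation loss, i.e. the example is beneficial."""
    scores = []
    for inputs, labels in train_examples:
        g = flat_grad(loss_fn(model(inputs), labels), params)
        scores.append(torch.dot(g, s_val).item())
    return scores

# Rough usage outline (all names are placeholders):
# params = [p for p in model.parameters() if p.requires_grad]
# g_val  = flat_grad(validation_loss, params)            # grad of the val loss
# s_val  = inverse_hvp(model, loss_fn, params, g_val, train_loader)
# scores = influence_scores(model, loss_fn, params, s_val, train_set)
```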
Figure 1: Distribution of the influence scores of examples in the training set. We can see that the large majority of examples are centred around 0.
To gain more insight into the impact of the influence functions, we additionally experimented with keeping only the 25%, 50%, and 75% best-ranked examples. We also conducted experiments with the 25%, 50%, and 75% worst-ranked examples, as well as an experiment with exactly the examples that were estimated to be detrimental (60% of the dataset). For further comparison, two more series of experiments were conducted: one with randomly downsampled examples and one keeping exactly the middle-ranked 50% and 25% of the dataset. The results can be found in Figure 2.
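To make the subset definitions concrete, the following sketch shows how such index sets can be derived from an array of per-example scores. Here `scores` is a hypothetical NumPy array in which higher values mean "more beneficial"; the function mirrors the selections described above rather than reproducing our exact scripts.

```python
import numpy as np

def subset_indices(scores, fraction, mode, seed=0):
    """Return indices of a `fraction`-sized subset of the training set.

    mode: 'best'   -> highest-ranked (most beneficial) examples
          'worst'  -> lowest-ranked (most detrimental) examples
          'middle' -> examples ranked closest to the middle of the ranking
          'random' -> uniform random downsampling
    """
    n = len(scores)
    k = int(round(fraction * n))
    order = np.argsort(scores)          # ascending: most detrimental first
    if mode == "best":
        return order[-k:]
    if mode == "worst":
        return order[:k]
    if mode == "middle":
        start = (n - k) // 2
        return order[start:start + k]
    if mode == "random":
        rng = np.random.default_rng(seed)
        return rng.choice(n, size=k, replace=False)
    raise ValueError(f"unknown mode: {mode}")

# Example: keep only the examples estimated to be beneficial (positive score).
# beneficial_indices = np.flatnonzero(scores > 0)
```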
Using subsets with only detrimental or only beneficial examples significantly reduces the performance of the model. All experiments were conducted with the same set of hyperparameters that worked best for the full dataset. We ran an additional hyperparameter search to check whether training on the filtered datasets requires different hyperparameters; however, the accuracy of the re-trained models improved only marginally or not at all.
Figure 2: Performance of RoBERTa on the MNLI matched validation set, fine-tuned on subsets of the SNLI training set selected according to the influence of examples. We can see that the computed influence of examples does not correlate with the performance of the trained model, as subsets of either only beneficial or only detrimental examples give results significantly worse than random downsampling.
Discussion
The results of the experiments showed that the signal from the influence functions is not noise, as the difference from random downsampling is too large to be a coincidence. It is, however, unclear what causes this and how it could be useful. There are several potential explanations for why these experiments could fail to yield positive results; e.g., Basu et al. (2020) note that the fragility of influence functions grows with the size of the network, and RoBERTa certainly constitutes a very large neural network by the standards of that paper. Moreover, it is well known that a decrease in validation loss does not always correspond to an increase in classification accuracy on the validation set.
However, the potential explanations in the previous paragraph can only account for noise in the results, not for such a large drop in performance. We manually inspected and analysed the data, but we were not able to spot any obvious patterns in the filtered datasets that could explain this phenomenon. Since we could not find a practical use for this behaviour, we concluded the project and leave a deeper investigation to future work.
Revision history
- August 27, 2020: Initial publication