Parameter-Efficient Transfer Learning For NLP
Neil Houlsby 1 Andrei Giurgiu 1 * Stanisław Jastrzȩbski 2 * Bruna Morrone 1 Quentin de Laroussilhe 1
Andrea Gesmundo 1 Mona Attariyan 1 Sylvain Gelly 1
Abstract

Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task.
Both feature-based transfer and fine-tuning require a new set of weights for each task. Fine-tuning is more parameter efficient if the lower layers of a network are shared between tasks. However, our proposed adapter tuning method is even more parameter efficient. Figure 1 demonstrates this trade-off. The x-axis shows the number of parameters trained per task; this corresponds to the marginal increase in the model size required to solve each additional task. Adapter-based tuning requires training two orders of magnitude fewer parameters than fine-tuning, while attaining similar performance.

Adapters are new modules added between layers of a pre-trained network. Adapter-based tuning differs from feature-based transfer and fine-tuning in the following way. Consider a function (neural network) with parameters w: φ_w(x). Feature-based transfer composes φ_w with a new function, χ_v, to yield χ_v(φ_w(x)). Only the new, task-specific parameters, v, are then trained. Fine-tuning involves adjusting the original parameters, w, for each new task, limiting compactness. For adapter tuning, a new function, ψ_{w,v}(x), is defined, where parameters w are copied over from pre-training. The initial parameters v_0 are set such that the new function resembles the original: ψ_{w,v_0}(x) ≈ φ_w(x). During training, only v are tuned. For deep networks, defining ψ_{w,v} typically involves adding new layers to the original network, φ_w. If one chooses |v| ≪ |w|, the resulting model requires ~|w| parameters for many tasks. Since w is fixed, the model can be extended to new tasks without affecting previous ones.
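As an illustration of this split between w and v, the following minimal PyTorch sketch (not from the paper; the backbone and head are generic placeholders) marks which parameters receive gradients under each strategy:

```python
import torch
from torch import nn

# Generic stand-ins for the pre-trained network phi_w and a new task head.
backbone = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
head = nn.Linear(768, 2)

def configure_trainable(strategy, backbone, head, adapter_params=()):
    """Set requires_grad according to the transfer strategy."""
    for p in backbone.parameters():
        # Only fine-tuning updates the original parameters w; feature-based
        # transfer and adapter tuning keep them frozen and shareable.
        p.requires_grad = (strategy == "fine-tuning")
    for p in head.parameters():
        p.requires_grad = True  # the small task-specific head is always trained
    for p in adapter_params:
        # The adapter parameters v exist only under adapter tuning (Section 2.1).
        p.requires_grad = (strategy == "adapter")

configure_trainable("adapter", backbone, head)
```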
Adapter-based tuning relates to multi-task and continual learning. Multi-task learning also results in compact models. However, multi-task learning requires simultaneous access to all tasks, which adapter-based tuning does not. Continual learning systems aim to learn from an endless stream of tasks. This paradigm is challenging because networks forget previous tasks after re-training (McCloskey & Cohen, 1989; French, 1999). Adapters differ in that the tasks do not interact and the shared parameters are frozen. This means that the model has perfect memory of previous tasks using a small number of task-specific parameters.

We demonstrate on a large and diverse set of text classification tasks that adapters yield parameter-efficient tuning for NLP. The key innovation is to design an effective adapter module and its integration with the base model. We propose a simple yet effective bottleneck architecture. On the GLUE benchmark, our strategy almost matches the performance of the fully fine-tuned BERT, but uses only 3% task-specific parameters, while fine-tuning uses 100% task-specific parameters. We observe similar results on a further 17 public text datasets, and on SQuAD extractive question answering. In summary, adapter-based tuning yields a single, extensible model that attains near state-of-the-art performance in text classification.

2. Adapter tuning for NLP

We present a strategy for tuning a large text model on several downstream tasks. Our strategy has three key properties: (i) it attains good performance, (ii) it permits training on tasks sequentially, that is, it does not require simultaneous access to all datasets, and (iii) it adds only a small number of additional parameters per task. These properties are especially useful in the context of cloud services, where many models need to be trained on a series of downstream tasks, so a high degree of sharing is desirable.

To achieve these properties, we propose a new bottleneck adapter module. Tuning with adapter modules involves adding a small number of new parameters to a model, which are trained on the downstream task (Rebuffi et al., 2017). When performing vanilla fine-tuning of deep networks, a modification is made to the top layer of the network. This is required because the label spaces and losses for the upstream and downstream tasks differ. Adapter modules perform more general architectural modifications to re-purpose a pre-trained network for a downstream task. In particular, the adapter tuning strategy involves injecting new layers into the original network. The weights of the original network are untouched, whilst the new adapter layers are initialized at random. In standard fine-tuning, the new top layer and the original weights are co-trained. In contrast, in adapter tuning, the parameters of the original network are frozen and therefore may be shared by many tasks.

Adapter modules have two main features: a small number of parameters, and a near-identity initialization. The adapter modules need to be small compared to the layers of the original network. This means that the total model size grows relatively slowly when more tasks are added. A near-identity initialization is required for stable training of the adapted model; we investigate this empirically in Section 3.6. By initializing the adapters to a near-identity function, the original network is unaffected when training starts. During training, the adapters may then be activated to change the distribution of activations throughout the network. The adapter modules may also be ignored if not required; in Section 3.6 we observe that some adapters have more influence on the network than others. We also observe that if the initialization deviates too far from the identity function, the model may fail to train.

2.1. Instantiation for Transformer Networks

We instantiate adapter-based tuning for text Transformers. These models attain state-of-the-art performance in many NLP tasks, including translation, extractive QA, and text classification problems (Vaswani et al., 2017; Radford et al., 2018; Devlin et al., 2018). We consider the standard Transformer architecture, as proposed in Vaswani et al. (2017).
Parameter-Efficient Transfer Learning for NLP
Adapter modules present many architectural choices. We provide a simple design that attains good performance. We experimented with a number of more complex designs, see Section 3.6, but we found the following strategy performed as well as any other that we tested, across many datasets.

Figure 2 shows our adapter architecture, and its application to the Transformer. Each layer of the Transformer contains two primary sub-layers: an attention layer and a feedforward layer. Both layers are followed immediately by a projection that maps the feature size back to the size of the layer's input. A skip-connection is applied across each of the sub-layers. The output of each sub-layer is fed into layer normalization. We insert two serial adapters after each of these sub-layers. The adapter is always applied directly to the output of the sub-layer, after the projection back to the input size, but before adding the skip connection back. The output of the adapter is then passed directly into the following layer normalization.

To limit the number of parameters, we propose a bottleneck architecture. The adapters first project the original d-dimensional features into a smaller dimension, m, apply a nonlinearity, then project back to d dimensions. The total number of parameters added per layer, including biases, is 2md + d + m. By setting m ≪ d, we limit the number of parameters added per task; in practice, we use around 0.5-8% of the parameters of the original model. The bottleneck dimension, m, provides a simple means to trade off performance with parameter efficiency. The adapter module itself has a skip-connection internally. With the skip-connection, if the parameters of the projection layers are initialized to near-zero, the module is initialized to an approximate identity function.

Alongside the layers in the adapter module, we also train new layer normalization parameters per task. This technique, similar to conditional batch normalization (De Vries et al., 2017), FiLM (Perez et al., 2018), and self-modulation (Chen et al., 2019), also yields parameter-efficient adaptation of a network, with only 2d parameters per layer. However, training the layer normalization parameters alone is insufficient for good performance, see Section 3.4.
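To make this concrete, here is a minimal PyTorch sketch of the bottleneck module and of its placement after a Transformer sub-layer, before the residual addition and the layer normalization. It is an illustration of the description above, not the authors' implementation; the choice of nonlinearity, the initialization scale, and the simplified sub-layer wrapper are assumptions.

```python
import torch
from torch import nn

class BottleneckAdapter(nn.Module):
    """Feedforward bottleneck: project d -> m, nonlinearity, project m -> d."""
    def __init__(self, d, m):
        super().__init__()
        self.down = nn.Linear(d, m)   # m*d weights + m biases
        self.up = nn.Linear(m, d)     # m*d weights + d biases => 2md + d + m in total
        self.act = nn.GELU()          # illustrative choice of nonlinearity
        for proj in (self.down, self.up):
            # Near-zero projection weights make the module an approximate
            # identity at the start of training, as required for stability.
            nn.init.normal_(proj.weight, std=1e-3)
            nn.init.zeros_(proj.bias)

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))  # internal skip-connection

class AdaptedSubLayer(nn.Module):
    """A Transformer sub-layer (attention or feedforward, including its output
    projection) with an adapter applied to its output, before the residual add."""
    def __init__(self, sublayer, d, m):
        super().__init__()
        self.sublayer = sublayer                # pre-trained and kept frozen
        self.adapter = BottleneckAdapter(d, m)  # trained per task
        self.norm = nn.LayerNorm(d)             # layer-norm parameters also trained per task

    def forward(self, x):
        h = self.sublayer(x)    # output already projected back to size d
        h = self.adapter(h)     # adapter acts directly on the sub-layer output
        return self.norm(x + h) # then the skip-connection and layer normalization
```

For example, with d = 768 (assuming the usual BERT_BASE hidden size) and m = 64, each adapter adds 2·64·768 + 768 + 64 ≈ 99k parameters.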
3. Experiments

We show that adapters achieve parameter-efficient transfer for text tasks. On the GLUE benchmark (Wang et al., 2018), adapter tuning is within 0.4% of full fine-tuning of BERT, but it adds only 3% of the number of parameters trained by fine-tuning. We confirm this result on a further 17 public classification tasks and SQuAD question answering. Analysis shows that adapter-based tuning automatically focuses on the higher layers of the network.

3.1. Experimental Settings

We use the public, pre-trained BERT Transformer network as our base model. To perform classification with BERT, we follow the approach in Devlin et al. (2018). The first token in each sequence is a special "classification token". We attach a linear layer to the embedding of this token to predict the class label.

Our training procedure also follows Devlin et al. (2018). We optimize using Adam (Kingma & Ba, 2014), whose learning rate is increased linearly over the first 10% of the steps, and then decayed linearly to zero. All runs are trained on 4 Google Cloud TPUs with a batch size of 32. For each dataset and algorithm, we run a hyperparameter sweep and select the best model according to accuracy on the validation set. For the GLUE tasks, we report the test metrics provided by the submission website [1]. For the other classification tasks, we report test-set accuracy.

[1] https://gluebenchmark.com/
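A sketch of that learning-rate schedule (a generic re-implementation of linear warmup over the first 10% of steps followed by linear decay, not the authors' training code; the model and step count are placeholders):

```python
import torch

def linear_warmup_then_decay(step, total_steps, warmup_frac=0.1):
    """LR multiplier: linear increase over the first 10% of steps, then linear decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Example wiring with Adam; the tiny model and total_steps are placeholders.
model = torch.nn.Linear(768, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: linear_warmup_then_decay(step, total_steps=10000))
# Call scheduler.step() once per training step after optimizer.step().
```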
Parameter-Efficient Transfer Learning for NLP
We compare to fine-tuning, the current standard for transfer of large pre-trained models, and the strategy successfully used by BERT. For N tasks, full fine-tuning requires N× the number of parameters of the pre-trained model. Our goal is to attain performance equal to fine-tuning, but with fewer total parameters, ideally near to 1×.

3.2. GLUE benchmark

We first evaluate on GLUE [2]. For these datasets, we transfer from the pre-trained BERT_LARGE model, which contains 24 layers and a total of 330M parameters; see Devlin et al. (2018) for details. We perform a small hyperparameter sweep for adapter tuning: we sweep learning rates in {3·10^-5, 3·10^-4, 3·10^-3}, and number of epochs in {3, 20}. We test both using a fixed adapter size (number of units in the bottleneck) and selecting the best size per task from {8, 64, 256}. The adapter size is the only adapter-specific hyperparameter that we tune. Finally, due to training instability, we re-run 5 times with different random seeds and select the best model on the validation set.

Table 1 summarizes the results. Adapters achieve a mean GLUE score of 80.0, compared to 80.4 achieved by full fine-tuning. The optimal adapter size varies per dataset. For example, 256 is chosen for MNLI, whereas for the smallest dataset, RTE, 8 is chosen. Restricting always to size 64 leads to a small decrease in average accuracy to 79.6. To solve all of the datasets in Table 1, fine-tuning requires 9× the total number of BERT parameters [3]. In contrast, adapters require only 1.3× parameters.

[2] We omit WNLI as in Devlin et al. (2018) because no current algorithm beats the baseline of predicting the majority class.
[3] We treat MNLI_m and MNLI_mm as separate tasks with individually tuned hyperparameters. However, they could be combined into one model, leaving 8× overall.
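As a rough cross-check of these totals, the per-layer formula from Section 2.1 gives the following back-of-the-envelope estimates. This sketch assumes a hidden size of d = 1024 for BERT_LARGE and one adapter after each of the two sub-layers per layer; the paper's exact figures additionally count the per-task layer-norm parameters and classification heads.

```python
d = 1024                 # assumed BERT_LARGE hidden size
layers = 24              # stated in Section 3.2
adapters_per_layer = 2   # assumed: one per sub-layer (attention and feedforward)
bert_large_params = 330e6

def adapter_params(m):
    """Total new adapter parameters per task for bottleneck size m."""
    return layers * adapters_per_layer * (2 * m * d + d + m)

for m in (8, 64, 256):
    print(f"m={m:>3}: {adapter_params(m) / 1e6:5.2f}M new parameters "
          f"(~{100 * adapter_params(m) / bert_large_params:.1f}% of BERT_LARGE)")

# Fine-tuning instead stores a full 330M-parameter model for each of the 9
# GLUE tasks, i.e. 9x the pre-trained model size in total.
```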
3.3. Additional Classification Tasks

To further validate that adapters yield compact, performant models, we test on additional, publicly available, text classification tasks. This suite contains a diverse set of tasks: the number of training examples ranges from 900 to 330k, the number of classes ranges from 2 to 157, and the average text length ranges from 57 to 1.9k characters. We supply statistics and references for all of the datasets in the appendix.

For these datasets, we use a batch size of 32. The datasets are diverse, so we sweep a wide range of learning rates: {1·10^-5, 3·10^-5, 1·10^-4, 3·10^-3}. Due to the large number of datasets, we select the number of training epochs from the set {20, 50, 100} manually, from inspection of the validation set learning curves. We select the optimal values for both fine-tuning and adapters; the exact values are in the appendix.

We test adapter sizes in {2, 4, 8, 16, 32, 64}. Since some of the datasets are small, fine-tuning the entire network may be sub-optimal. Therefore, we run an additional baseline: variable fine-tuning. For this, we fine-tune only the top n layers, and freeze the remainder. We sweep n ∈ {1, 2, 3, 5, 7, 9, 11, 12}. In these experiments, we use the BERT_BASE model with 12 layers; therefore, variable fine-tuning subsumes full fine-tuning when n = 12.

Unlike the GLUE tasks, there is no comprehensive set of state-of-the-art numbers for this suite of tasks. Therefore, to confirm that our BERT-based models are competitive, we collect our own benchmark performances. For this, we run a large-scale hyperparameter search over standard network topologies. Specifically, we run the single-task Neural AutoML algorithm, similar to Zoph & Le (2017); Wong et al. (2018). This algorithm searches over a space of feedforward and convolutional networks, stacked on pre-trained text embedding modules publicly available via TensorFlow Hub [4]. The embeddings coming from the TensorFlow Hub modules may be frozen or fine-tuned. The full search space is described in the appendix. For each task, we run AutoML for one week on CPUs, using 30 machines. In this time the algorithm explores over 10k models on average per task. We select the best final model for each task according to validation set accuracy.

The results for the AutoML benchmark ("no BERT baseline"), fine-tuning, variable fine-tuning, and adapter tuning are reported in Table 2. The AutoML baseline demonstrates that the BERT models are competitive: this baseline explores thousands of models, yet the BERT models perform better on average. We see a similar pattern of results to GLUE. The performance of adapter tuning is close to full fine-tuning (0.4% behind). Fine-tuning requires 17× the number of parameters of BERT_BASE to solve all tasks. Variable fine-tuning performs slightly better than fine-tuning, whilst training fewer layers. The optimal setting of variable fine-tuning results in training 52% of the network on average per task, reducing the total to 9.9× parameters. Adapters, however, offer a much more compact model. They introduce 1.14% new parameters per task, resulting in 1.19× parameters for all 17 tasks.

[4] https://www.tensorflow.org/hub

3.4. Parameter/Performance trade-off

The adapter size controls the parameter efficiency: smaller adapters introduce fewer parameters, at a possible cost to performance. To explore this trade-off, we consider different adapter sizes, and compare to two baselines: (i) fine-tuning only the top k layers of BERT_BASE; (ii) tuning only the layer normalization parameters. The learning rate is tuned using the range presented in Section 3.2.
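Both baselines amount to toggling requires_grad on the relevant parameter groups. The sketch below assumes Hugging Face-style parameter names for a 12-layer BERT_BASE encoder (e.g. "encoder.layer.11."), which is a naming assumption rather than something specified in the paper:

```python
from torch import nn

def variable_finetuning(model, top_n, num_layers=12):
    """Train only the top `top_n` encoder layers (plus the classifier head);
    freeze the remainder, as in the variable fine-tuning baseline."""
    trainable = tuple(f"encoder.layer.{i}." for i in range(num_layers - top_n, num_layers))
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable) or "classifier" in name

def layernorm_only(model):
    """Train only the layer normalization scales and shifts (roughly 2d
    parameters per normalization; ~40k in total for BERT_BASE, see Section 3.4)."""
    for module in model.modules():
        for param in module.parameters(recurse=False):
            param.requires_grad = isinstance(module, nn.LayerNorm)
```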
Table 1. Results on GLUE test sets scored using the GLUE evaluation server. MRPC and QQP are evaluated using F1 score. STS-B is evaluated using Spearman's correlation coefficient. CoLA is evaluated using Matthews correlation. The other tasks are evaluated using accuracy. Adapter tuning achieves a comparable overall score (80.0) to full fine-tuning (80.4) using 1.3× parameters in total, compared to 9×. Fixing the adapter size to 64 leads to a slightly decreased overall score of 79.6 and a slightly smaller model.
Table 2. Test accuracy for additional classification tasks. In these experiments we transfer from the BERT_BASE model. For each task and algorithm, the model with the best validation set accuracy is chosen. We report the mean test accuracy and s.e.m. across runs with different random seeds.
Figure 3 shows the parameter/performance trade-off aggregated over all classification tasks in each suite (GLUE and "additional"). On GLUE, performance decreases dramatically when fewer layers are fine-tuned. Some of the additional tasks benefit from training fewer layers, so the performance of fine-tuning decays much less. In both cases, adapters yield good performance across a range of sizes two orders of magnitude fewer than fine-tuning.

Figure 4 shows more details for two GLUE tasks: MNLI_m and CoLA. Tuning the top layers trains more task-specific parameters for all k > 2. When fine-tuning using a comparable number of task-specific parameters, the performance decreases substantially compared to adapters. For instance, fine-tuning just the top layer yields approximately 9M trainable parameters and 77.8% ± 0.1% validation accuracy on MNLI_m. In contrast, adapter tuning with size 64 yields approximately 2M trainable parameters and 83.7% ± 0.1% validation accuracy. For comparison, full fine-tuning attains 84.4% ± 0.02% on MNLI_m. We observe a similar trend on CoLA.

As a further comparison, we tune the parameters of layer normalization alone. These layers only contain point-wise additions and multiplications, so introduce very few trainable parameters: 40k for BERT_BASE. However, this strategy performs poorly: performance decreases by approximately 3.5% on CoLA and 4% on MNLI.

To summarize, adapter tuning is highly parameter-efficient, and produces a compact model with strong performance, comparable to full fine-tuning. Training adapters with sizes 0.5-5% of the original model, performance is within 1% of the competitive published results on BERT_LARGE.
Figure 3. Accuracy versus the number of trained parameters, aggregated across tasks. We compare adapters of different sizes (orange)
with fine-tuning the top n layers, for varying n (blue). The lines and shaded areas indicate the 20th, 50th, and 80th percentiles across
tasks. For each task and algorithm, the best model is selected for each point along the curve. For GLUE, the validation set accuracy is
reported. For the additional tasks, we report the test-set accuracies. To remove the intra-task variance in scores, we normalize the scores
for each model and task by subtracting the performance of full fine-tuning on the corresponding task.
Figure 4. Validation set accuracy versus number of trained parameters for three methods: (i) Adapter tuning with adapter sizes 2^n for n = 0...9 (orange). (ii) Fine-tuning the top k layers for k = 1...12 (blue). (iii) Tuning the layer normalization parameters only (green). Error bars indicate ±1 s.e.m. across three random seeds.
3.5. SQuAD Extractive Question Answering

Finally, we confirm that adapters work on tasks other than classification by running on SQuAD v1.1 (Rajpurkar et al., 2016). Given a question and a Wikipedia paragraph, this task requires selecting the answer span to the question from the paragraph. Figure 5 displays the parameter/performance trade-off of fine-tuning and adapters on the SQuAD validation set. For fine-tuning, we sweep the number of trained layers, learning rate in {3·10^-5, 5·10^-5, 1·10^-4}, and number of epochs in {2, 3, 5}. For adapters, we sweep the adapter size, learning rate in {3·10^-5, 1·10^-4, 3·10^-4, 1·10^-3}, and number of epochs in {3, 10, 20}. As for classification, adapters attain performance comparable to full fine-tuning, while training many fewer parameters. Adapters of size 64 (2% parameters) attain a best F1 of 90.4, while fine-tuning attains 90.7. SQuAD performs well even with very small adapters: those of size 2 (0.1% parameters) attain an F1 of 89.9.

3.6. Analysis and Discussion

We perform an ablation to determine which adapters are influential. For this, we remove some trained adapters and re-evaluate the model (without re-training) on the validation set. Figure 6 shows the change in performance when removing adapters from all continuous layer spans. The experiment is performed on BERT_BASE with adapter size 64 on MNLI and CoLA.

First, we observe that removing any single layer's adapters has only a small impact on performance; the elements on the diagonals of Figure 6 correspond to these single-layer ablations.
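A sketch of this ablation, assuming the adapted model keeps its per-layer adapters in an nn.ModuleList attribute (a hypothetical layout, not the authors' code). Because each adapter wraps an internal skip-connection, replacing it with nn.Identity() removes its contribution without any re-training:

```python
from torch import nn

def ablate_adapter_span(model, first, last):
    """Bypass the trained adapters in layers [first, last] (inclusive)."""
    for i in range(first, last + 1):
        model.adapters[i] = nn.Identity()   # hypothetical per-layer adapter list
    return model

# Re-evaluating the ablated model on the validation set, once per (first, last)
# pair, produces one heatmap cell of Figure 6 per span.
```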
Figure 6. Left, Center: Ablation of trained adapters from continuous layer spans. The heatmap shows the relative decrease in validation accuracy compared to the fully trained adapted model. The y and x axes indicate the first and last layers ablated (inclusive), respectively. The diagonal cells, highlighted in green, indicate ablation of a single layer's adapters. The cell in the top-right indicates ablation of all adapters. Cells in the lower triangle are meaningless, and are set to 0%, the best possible relative performance. Right: Performance of BERT_BASE using adapters with different initial weight magnitudes. The x-axis is the standard deviation of the initialization distribution.
Howard & Ruder, 2018; Radford et al., 2018) In NLP, the with the number of tasks, since adapters are very small, our
upstream model is usually a neural language model (Ben- models scale much more favorably.
gio et al., 2003). Recent state-of-the-art results on ques-
tion answering (Rajpurkar et al., 2016) and text classi-
fication (Wang et al., 2018) have been attained by fine- Transfer Learning in Vision Fine-tuning models pre-
tuning a Transformer network (Vaswani et al., 2017) with a trained on ImageNet (Deng et al., 2009) is ubiquitous when
Masked Language Model loss (Devlin et al., 2018). Perfor- building image recognition models (Yosinski et al., 2014;
mance aside, an advantage of fine-tuning is that it does not Huh et al., 2016). This technique attains state-of-the-art per-
require task-specific model design, unlike representation- formance on many vision tasks, including classification (Ko-
based transfer. However, vanilla fine-tuning does require a rnblith et al., 2018), fine-grained classifcation (Hermans
new set of network weights for every new task. et al., 2017), segmentation (Long et al., 2015), and de-
tection (Girshick et al., 2014). In vision, convolutional
Multi-task Learning Multi-task learning (MTL) involves adapter modules have been studied (Rebuffi et al., 2017;
training on tasks simultaneously. Early work shows that 2018; Rosenfeld & Tsotsos, 2018). These works perform
sharing network parameters across tasks exploits task reg- incremental learning in multiple domains by adding small
ularities, yielding improved performance (Caruana, 1997). convolutional layers to a ResNet (He et al., 2016) or VGG
The authors share weights in lower layers of a network, net (Simonyan & Zisserman, 2014). Adapter size is lim-
and use specialized higher layers. Many NLP systems have ited using 1 × 1 convolutions, whilst the original networks
exploited MTL. Some examples include: text processing typically use 3 × 3. This yields 11% increase in overall
systems (part of speech, chunking, named entity recogni- model size per task. Since the kernel size cannot be further
tion, etc.) (Collobert & Weston, 2008), multilingual mod- reduced other weight compression techniques must be used
els (Huang et al., 2013), semantic parsing (Peng et al., 2017), to attain further savings. Our bottleneck adapters can be
machine translation (Johnson et al., 2017), and question an- much smaller, and still perform well.
swering (Choi et al., 2017). MTL yields a single model
Concurrent work explores similar ideas for BERT (Stickland
to solve all problems. However, unlike our adapters, MTL
& Murray, 2019). The authors introduce Projected Atten-
requires simultaneous access to the tasks during training.
tion Layers (PALs), small layers with a similar role to our
adapters. The main differences are i) Stickland & Murray
Continual Learning As an alternative to simultaneous
(2019) use a different architecture, and ii) they perform mul-
training, continual, or lifelong, learning aims to learn from a
titask training, jointly fine-tuning BERT on all GLUE tasks.
sequence of tasks (Thrun, 1998). However, when re-trained,
Sina Semnani (2019) perform an emprical comparison of
deep networks tend to forget how to perform previous tasks;
our bottleneck Adpaters and PALs on SQuAD v2.0 (Ra-
a challenge termed catastrophic forgetting (McCloskey &
jpurkar et al., 2018).
Cohen, 1989; French, 1999). Techniques have been pro-
posed to mitigate forgetting (Kirkpatrick et al., 2017; Zenke
ACKNOWLEDGMENTS
et al., 2017), however, unlike for adapters, the memory is
imperfect. Progressive Networks avoid forgetting by instan- We would like to thank Andrey Khorlin, Lucas Beyer, Noé
tiating a new network “column” for each task (Rusu et al., Lutz, and Jeremiah Harmsen for useful comments and dis-
2016). However, the number of parameters grows linearly cussions.
References

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. Journal of Machine Learning Research, 2003.

Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. ACL, 2017.

Brown, P. F., deSouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. Class-based n-gram models of natural language. Computational Linguistics, 1992.

Caruana, R. Multitask learning. Machine Learning, 1997.

Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., St. John, R., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Strope, B., and Kurzweil, R. Universal sentence encoder for English. In EMNLP, 2019.

Chen, T., Lucic, M., Houlsby, N., and Gelly, S. On self modulation for generative adversarial networks. ICLR, 2019.

Choi, E., Hewlett, D., Uszkoreit, J., Polosukhin, I., Lacoste, A., and Berant, J. Coarse-to-fine question answering for long documents. In ACL, 2017.

Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.

Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, 2017.

Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In NIPS, 2015.

De Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O., and Courville, A. C. Modulating early visual processing by language. In NIPS, 2017.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 1999.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hermans, A., Beyer, L., and Leibe, B. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.

Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. In ACL, 2018.

Huang, J.-T., Li, J., Yu, D., Deng, L., and Gong, Y. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In ICASSP, 2013.

Huh, M., Agrawal, P., and Efros, A. A. What makes ImageNet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.

Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. Google's multilingual neural machine translation system: Enabling zero-shot translation. ACL, 2017.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. ICLR, 2014.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In NIPS, 2015.

Kornblith, S., Shlens, J., and Le, Q. V. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974, 2018.

Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In ICML, 2014.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: Contextualized word vectors. In NIPS, 2017.

McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, 1989.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.

Peng, H., Thomson, S., and Smith, N. A. Deep multitask learning for semantic dependency parsing. In ACL, 2017.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In EMNLP, 2014.

Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A. C. FiLM: Visual reasoning with a general conditioning layer. AAAI, 2018.

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In NAACL, 2018.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

Rajpurkar, P., Jia, R., and Liang, P. Know what you don't know: Unanswerable questions for SQuAD. In ACL, 2018.

Turian, J., Ratinov, L., and Bengio, Y. Word representations: A simple and general method for semi-supervised learning. In ACL, 2010.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In NIPS, 2017.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. ICLR, 2018.

Wong, C., Houlsby, N., Lu, Y., and Gesmundo, A. Transfer learning with neural AutoML. In NeurIPS, 2018.

Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In NIPS, 2014.

Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. ICML, 2017.

Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In ICLR, 2017.