
Parameter-Efficient Transfer Learning for NLP

Neil Houlsby¹, Andrei Giurgiu¹*, Stanisław Jastrzȩbski²*, Bruna Morrone¹, Quentin de Laroussilhe¹,
Andrea Gesmundo¹, Mona Attariyan¹, Sylvain Gelly¹

Abstract

Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate the adapter's effectiveness, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.

[Figure 1 omitted: plot of accuracy delta (%) versus the number of trainable parameters per task, comparing adapters (ours) with fine-tuning the top layers.]

Figure 1. Trade-off between accuracy and number of trained task-specific parameters, for adapter tuning and fine-tuning. The y-axis is normalized by the performance of full fine-tuning; details in Section 3. The curves show the 20th, 50th, and 80th performance percentiles across nine tasks from the GLUE benchmark. Adapter-based tuning attains a similar performance to full fine-tuning with two orders of magnitude fewer trained parameters.

1. Introduction

Transfer from pre-trained models yields strong performance on many NLP tasks (Dai & Le, 2015; Howard & Ruder, 2018; Radford et al., 2018). BERT, a Transformer network trained on large text corpora with an unsupervised loss, attained state-of-the-art performance on text classification and extractive question answering (Devlin et al., 2018).

In this paper we address the online setting, where tasks arrive in a stream. The goal is to build a system that performs well on all of them, but without training an entire new model for every new task. A high degree of sharing between tasks is particularly useful for applications such as cloud services, where models need to be trained to solve many tasks that arrive from customers in sequence. For this, we propose a transfer learning strategy that yields compact and extensible downstream models. Compact models are those that solve many tasks using a small number of additional parameters per task. Extensible models can be trained incrementally to solve new tasks, without forgetting previous ones. Our method yields such models without sacrificing performance.

The two most common transfer learning techniques in NLP are feature-based transfer and fine-tuning. Instead, we present an alternative transfer method based on adapter modules (Rebuffi et al., 2017). Feature-based transfer involves pre-training real-valued embedding vectors. These embeddings may be at the word (Mikolov et al., 2013), sentence (Cer et al., 2019), or paragraph level (Le & Mikolov, 2014). The embeddings are then fed to custom downstream models. Fine-tuning involves copying the weights from a pre-trained network and tuning them on the downstream task. Recent work shows that fine-tuning often enjoys better performance than feature-based transfer (Howard & Ruder, 2018).

*Equal contribution. ¹Google Research. ²Jagiellonian University. Correspondence to: Neil Houlsby <[email protected]>. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Both feature-based transfer and fine-tuning require a new set of weights for each task. Fine-tuning is more parameter efficient if the lower layers of a network are shared between tasks. However, our proposed adapter tuning method is even more parameter efficient. Figure 1 demonstrates this trade-off. The x-axis shows the number of parameters trained per task; this corresponds to the marginal increase in the model size required to solve each additional task. Adapter-based tuning requires training two orders of magnitude fewer parameters than fine-tuning, while attaining similar performance.

Adapters are new modules added between layers of a pre-trained network. Adapter-based tuning differs from feature-based transfer and fine-tuning in the following way. Consider a function (neural network) with parameters w: φ_w(x). Feature-based transfer composes φ_w with a new function, χ_v, to yield χ_v(φ_w(x)). Only the new, task-specific parameters, v, are then trained. Fine-tuning involves adjusting the original parameters, w, for each new task, limiting compactness. For adapter tuning, a new function, ψ_{w,v}(x), is defined, where parameters w are copied over from pre-training. The initial parameters v_0 are set such that the new function resembles the original: ψ_{w,v_0}(x) ≈ φ_w(x). During training, only v are tuned. For deep networks, defining ψ_{w,v} typically involves adding new layers to the original network, φ_w. If one chooses |v| ≪ |w|, the resulting model requires ∼|w| parameters for many tasks. Since w is fixed, the model can be extended to new tasks without affecting previous ones.

Adapter-based tuning relates to multi-task and continual learning. Multi-task learning also results in compact models. However, multi-task learning requires simultaneous access to all tasks, which adapter-based tuning does not. Continual learning systems aim to learn from an endless stream of tasks. This paradigm is challenging because networks forget previous tasks after re-training (McCloskey & Cohen, 1989; French, 1999). Adapters differ in that the tasks do not interact and the shared parameters are frozen. This means that the model has perfect memory of previous tasks using a small number of task-specific parameters.

We demonstrate on a large and diverse set of text classification tasks that adapters yield parameter-efficient tuning for NLP. The key innovation is the design of an effective adapter module and its integration with the base model. We propose a simple yet effective bottleneck architecture. On the GLUE benchmark, our strategy almost matches the performance of the fully fine-tuned BERT, but uses only 3% task-specific parameters, while fine-tuning uses 100% task-specific parameters. We observe similar results on a further 17 public text datasets, and on SQuAD extractive question answering. In summary, adapter-based tuning yields a single, extensible model that attains near state-of-the-art performance in text classification.

2. Adapter tuning for NLP

We present a strategy for tuning a large text model on several downstream tasks. Our strategy has three key properties: (i) it attains good performance, (ii) it permits training on tasks sequentially, that is, it does not require simultaneous access to all datasets, and (iii) it adds only a small number of additional parameters per task. These properties are especially useful in the context of cloud services, where many models need to be trained on a series of downstream tasks, so a high degree of sharing is desirable.

To achieve these properties, we propose a new bottleneck adapter module. Tuning with adapter modules involves adding a small number of new parameters to a model, which are trained on the downstream task (Rebuffi et al., 2017). When performing vanilla fine-tuning of deep networks, a modification is made to the top layer of the network. This is required because the label spaces and losses for the upstream and downstream tasks differ. Adapter modules perform more general architectural modifications to re-purpose a pre-trained network for a downstream task. In particular, the adapter tuning strategy involves injecting new layers into the original network. The weights of the original network are untouched, whilst the new adapter layers are initialized at random. In standard fine-tuning, the new top layer and the original weights are co-trained. In contrast, in adapter tuning, the parameters of the original network are frozen and therefore may be shared by many tasks.

Adapter modules have two main features: a small number of parameters, and a near-identity initialization. The adapter modules need to be small compared to the layers of the original network. This means that the total model size grows relatively slowly when more tasks are added. A near-identity initialization is required for stable training of the adapted model; we investigate this empirically in Section 3.6. By initializing the adapters to a near-identity function, the original network is unaffected when training starts. During training, the adapters may then be activated to change the distribution of activations throughout the network. The adapter modules may also be ignored if not required; in Section 3.6 we observe that some adapters have more influence on the network than others. We also observe that if the initialization deviates too far from the identity function, the model may fail to train.

2.1. Instantiation for Transformer Networks

We instantiate adapter-based tuning for text Transformers. These models attain state-of-the-art performance in many NLP tasks, including translation, extractive QA, and text classification (Vaswani et al., 2017; Radford et al., 2018; Devlin et al., 2018). We consider the standard Transformer architecture, as proposed in Vaswani et al. (2017).

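To make the training recipe concrete, the following is a minimal PyTorch-style sketch (our illustration, not the authors' code) of the parameter split described above: the pre-trained weights w stay frozen, and only the task-specific parameters v, namely the adapters, the layer normalization parameters, and the classification head, are trained. The name-matching convention below is a hypothetical one.

import torch

def set_adapter_tuning_trainability(model: torch.nn.Module) -> None:
    # Freeze the pre-trained parameters w; train only the task-specific
    # parameters v (adapters, layer norms, and the classification head).
    for name, param in model.named_parameters():
        param.requires_grad = any(
            key in name for key in ("adapter", "layer_norm", "classifier")
        )

# The optimizer then only sees the small trainable subset, e.g.:
# optimizer = torch.optim.Adam(
#     [p for p in model.parameters() if p.requires_grad], lr=3e-4)
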

[Figure 2 omitted: diagram of a Transformer layer (multi-headed attention and feed-forward sub-layers, with layer norms and skip-connections) showing where the adapters are inserted, and a close-up of the bottleneck adapter (feed-forward down-project, nonlinearity, feed-forward up-project, skip-connection).]

Figure 2. Architecture of the adapter module and its integration with the Transformer. Left: We add the adapter module twice to each Transformer layer: after the projection following multi-headed attention and after the two feed-forward layers. Right: The adapter consists of a bottleneck which contains few parameters relative to the attention and feedforward layers in the original model. The adapter also contains a skip-connection. During adapter tuning, the green layers are trained on the downstream data; this includes the adapters, the layer normalization parameters, and the final classification layer (not shown in the figure).

Adapter modules present many architectural choices. We provide a simple design that attains good performance. We experimented with a number of more complex designs, see Section 3.6, but we found that the following strategy performed as well as any other that we tested, across many datasets.

Figure 2 shows our adapter architecture and its application to the Transformer. Each layer of the Transformer contains two primary sub-layers: an attention layer and a feedforward layer. Both layers are followed immediately by a projection that maps the feature size back to the size of the layer's input. A skip-connection is applied across each of the sub-layers. The output of each sub-layer is fed into layer normalization. We insert two serial adapters after each of these sub-layers. The adapter is always applied directly to the output of the sub-layer, after the projection back to the input size, but before adding the skip connection back. The output of the adapter is then passed directly into the following layer normalization.

To limit the number of parameters, we propose a bottleneck architecture. The adapters first project the original d-dimensional features into a smaller dimension, m, apply a nonlinearity, then project back to d dimensions. The total number of parameters added per layer, including biases, is 2md + d + m. By setting m ≪ d, we limit the number of parameters added per task; in practice, we use around 0.5-8% of the parameters of the original model. The bottleneck dimension, m, provides a simple means to trade off performance with parameter efficiency. The adapter module itself has a skip-connection internally. With the skip-connection, if the parameters of the projection layers are initialized to near-zero, the module is initialized to an approximate identity function.

Alongside the layers in the adapter module, we also train new layer normalization parameters per task. This technique, similar to conditional batch normalization (De Vries et al., 2017), FiLM (Perez et al., 2018), and self-modulation (Chen et al., 2019), also yields parameter-efficient adaptation of a network, with only 2d parameters per layer. However, training the layer normalization parameters alone is insufficient for good performance; see Section 3.4.

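As an illustration, here is a minimal PyTorch sketch of the bottleneck adapter described above (ours, not the authors' implementation): a down-projection to m units, a nonlinearity (the choice of GELU here is an assumption), an up-projection back to d units, an internal skip-connection, and a near-zero initialization so that the module starts close to the identity. The assertion checks the 2md + d + m parameter count.

import torch
from torch import nn

class BottleneckAdapter(nn.Module):
    # Inserted after each sub-layer's output projection, before the residual
    # addition and the following layer normalization (see Figure 2).
    def __init__(self, d: int, m: int):
        super().__init__()
        self.down_project = nn.Linear(d, m)
        self.up_project = nn.Linear(m, d)
        self.nonlinearity = nn.GELU()
        # Near-zero weights: together with the skip-connection below, the
        # adapter is initialized to an approximate identity function.
        for layer in (self.down_project, self.up_project):
            nn.init.trunc_normal_(layer.weight, std=1e-2, a=-2e-2, b=2e-2)
            nn.init.zeros_(layer.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up_project(self.nonlinearity(self.down_project(h)))

d, m = 768, 64  # e.g. a BERTBASE-sized hidden dimension and a 64-unit bottleneck
adapter = BottleneckAdapter(d, m)
num_params = sum(p.numel() for p in adapter.parameters())
assert num_params == 2 * m * d + d + m  # the per-adapter count given above
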

3. Experiments

We show that adapters achieve parameter-efficient transfer for text tasks. On the GLUE benchmark (Wang et al., 2018), adapter tuning is within 0.4% of full fine-tuning of BERT, but it adds only 3% of the number of parameters trained by fine-tuning. We confirm this result on a further 17 public classification tasks and on SQuAD question answering. Analysis shows that adapter-based tuning automatically focuses on the higher layers of the network.

3.1. Experimental Settings

We use the public, pre-trained BERT Transformer network as our base model. To perform classification with BERT, we follow the approach in Devlin et al. (2018). The first token in each sequence is a special "classification token". We attach a linear layer to the embedding of this token to predict the class label.

Our training procedure also follows Devlin et al. (2018). We optimize using Adam (Kingma & Ba, 2014), whose learning rate is increased linearly over the first 10% of the steps, and then decayed linearly to zero. All runs are trained on 4 Google Cloud TPUs with a batch size of 32. For each dataset and algorithm, we run a hyperparameter sweep and select the best model according to accuracy on the validation set. For the GLUE tasks, we report the test metrics provided by the submission website. For the other classification tasks we report test-set accuracy.

We compare to fine-tuning, the current standard for transfer of large pre-trained models, and the strategy successfully used by BERT. For N tasks, full fine-tuning requires N× the number of parameters of the pre-trained model. Our goal is to attain performance equal to fine-tuning, but with fewer total parameters, ideally near to 1×.

3.2. GLUE benchmark

We first evaluate on GLUE.² For these datasets, we transfer from the pre-trained BERTLARGE model, which contains 24 layers and a total of 330M parameters; see Devlin et al. (2018) for details. We perform a small hyperparameter sweep for adapter tuning: we sweep learning rates in {3·10⁻⁵, 3·10⁻⁴, 3·10⁻³}, and number of epochs in {3, 20}. We test both using a fixed adapter size (number of units in the bottleneck), and selecting the best size per task from {8, 64, 256}. The adapter size is the only adapter-specific hyperparameter that we tune. Finally, due to training instability, we re-run 5 times with different random seeds and select the best model on the validation set.

²We omit WNLI, as in Devlin et al. (2018), because no current algorithm beats the baseline of predicting the majority class.

Table 1 summarizes the results. Adapters achieve a mean GLUE score of 80.0, compared to 80.4 achieved by full fine-tuning. The optimal adapter size varies per dataset. For example, 256 is chosen for MNLI, whereas for the smallest dataset, RTE, 8 is chosen. Restricting always to size 64 leads to a small decrease in average accuracy, to 79.6. To solve all of the datasets in Table 1, fine-tuning requires 9× the total number of BERT parameters.³ In contrast, adapters require only 1.3× parameters.

³We treat MNLIm and MNLImm as separate tasks with individually tuned hyperparameters. However, they could be combined into one model, leaving 8× overall.

3.3. Additional Classification Tasks

To further validate that adapters yield compact, performant models, we test on additional, publicly available, text classification tasks. This suite contains a diverse set of tasks: the number of training examples ranges from 900 to 330k, the number of classes ranges from 2 to 157, and the average text length ranges from 57 to 1.9k characters. We supply statistics and references for all of the datasets in the appendix.

For these datasets, we use a batch size of 32. The datasets are diverse, so we sweep a wide range of learning rates: {1·10⁻⁵, 3·10⁻⁵, 1·10⁻⁴, 3·10⁻³}. Due to the large number of datasets, we select the number of training epochs from the set {20, 50, 100} manually, from inspection of the validation set learning curves. We select the optimal values for both fine-tuning and adapters; the exact values are in the appendix.

We test adapter sizes in {2, 4, 8, 16, 32, 64}. Since some of the datasets are small, fine-tuning the entire network may be sub-optimal. Therefore, we run an additional baseline: variable fine-tuning. For this, we fine-tune only the top n layers, and freeze the remainder. We sweep n ∈ {1, 2, 3, 5, 7, 9, 11, 12}. In these experiments, we use the BERTBASE model with 12 layers; therefore, variable fine-tuning subsumes full fine-tuning when n = 12.

Unlike the GLUE tasks, there is no comprehensive set of state-of-the-art numbers for this suite of tasks. Therefore, to confirm that our BERT-based models are competitive, we collect our own benchmark performances. For this, we run a large-scale hyperparameter search over standard network topologies. Specifically, we run the single-task Neural AutoML algorithm, similar to Zoph & Le (2017) and Wong et al. (2018). This algorithm searches over a space of feedforward and convolutional networks, stacked on pre-trained text embedding modules publicly available via TensorFlow Hub.⁴ The embeddings coming from the TensorFlow Hub modules may be frozen or fine-tuned. The full search space is described in the appendix. For each task, we run AutoML for one week on CPUs, using 30 machines. In this time the algorithm explores over 10k models on average per task. We select the best final model for each task according to validation set accuracy.

⁴https://www.tensorflow.org/hub

The results for the AutoML benchmark ("no BERT baseline"), fine-tuning, variable fine-tuning, and adapter tuning are reported in Table 2. The AutoML baseline demonstrates that the BERT models are competitive. This baseline explores thousands of models, yet the BERT models perform better on average. We see a similar pattern of results to GLUE. The performance of adapter tuning is close to full fine-tuning (0.4% behind). Fine-tuning requires 17× the number of parameters of BERTBASE to solve all tasks. Variable fine-tuning performs slightly better than fine-tuning, whilst training fewer layers. The optimal setting of variable fine-tuning results in training 52% of the network on average per task, reducing the total to 9.9× parameters. Adapters, however, offer a much more compact model. They introduce 1.14% new parameters per task, resulting in 1.19× parameters for all 17 tasks.

3.4. Parameter/Performance trade-off

The adapter size controls the parameter efficiency: smaller adapters introduce fewer parameters, at a possible cost to performance. To explore this trade-off, we consider different adapter sizes, and compare to two baselines: (i) fine-tuning only the top k layers of BERTBASE, and (ii) tuning only the layer normalization parameters. The learning rate is tuned using the range presented in Section 3.2.

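Fine-tuning only the top layers, whether the variable fine-tuning baseline of Section 3.3 or baseline (i) above, amounts to freezing the lower part of the network. A minimal sketch of ours follows; the attributes encoder.layer and classifier are assumptions in the style of common BERT implementations, not the paper's code.

def freeze_all_but_top_n(model, n: int) -> None:
    # Variable fine-tuning: freeze every pre-trained parameter, then unfreeze
    # only the top n Transformer layers and the task-specific head.
    for param in model.parameters():
        param.requires_grad = False
    for layer in model.encoder.layer[-n:]:
        for param in layer.parameters():
            param.requires_grad = True
    for param in model.classifier.parameters():
        param.requires_grad = True
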

                   Total num   Trained         CoLA  SST   MRPC  STS-B  QQP   MNLIm  MNLImm  QNLI  RTE   Total
                   params      params / task
BERTLARGE          9.0×        100%            60.5  94.9  89.3  87.6   72.1  86.7   85.9    91.1  70.1  80.4
Adapters (8-256)   1.3×        3.6%            59.5  94.0  89.5  86.9   71.8  84.9   85.1    90.7  71.5  80.0
Adapters (64)      1.2×        2.1%            56.9  94.2  89.6  87.3   71.8  85.3   84.6    91.4  68.8  79.6

Table 1. Results on GLUE test sets scored using the GLUE evaluation server. MRPC and QQP are evaluated using F1 score. STS-B is evaluated using Spearman's correlation coefficient. CoLA is evaluated using Matthew's correlation. The other tasks are evaluated using accuracy. Adapter tuning achieves a comparable overall score (80.0) to full fine-tuning (80.4) using 1.3× parameters in total, compared to 9×. Fixing the adapter size to 64 leads to a slightly decreased overall score of 79.6 and a slightly smaller model.

Dataset                                  No BERT    BERTBASE    BERTBASE     BERTBASE
                                         baseline   Fine-tune   Variable FT  Adapters
20 newsgroups                            91.1       92.8 ± 0.1  92.8 ± 0.1   91.7 ± 0.2
Crowdflower airline                      84.5       83.6 ± 0.3  84.0 ± 0.1   84.5 ± 0.2
Crowdflower corporate messaging          91.9       92.5 ± 0.5  92.4 ± 0.6   92.9 ± 0.3
Crowdflower disasters                    84.9       85.3 ± 0.4  85.3 ± 0.4   84.1 ± 0.2
Crowdflower economic news relevance      81.1       82.1 ± 0.0  78.9 ± 2.8   82.5 ± 0.3
Crowdflower emotion                      36.3       38.4 ± 0.1  37.6 ± 0.2   38.7 ± 0.1
Crowdflower global warming               82.7       84.2 ± 0.4  81.9 ± 0.2   82.7 ± 0.3
Crowdflower political audience           81.0       80.9 ± 0.3  80.7 ± 0.8   79.0 ± 0.5
Crowdflower political bias               76.8       75.2 ± 0.9  76.5 ± 0.4   75.9 ± 0.3
Crowdflower political message            43.8       38.9 ± 0.6  44.9 ± 0.6   44.1 ± 0.2
Crowdflower primary emotions             33.5       36.9 ± 1.6  38.2 ± 1.0   33.9 ± 1.4
Crowdflower progressive opinion          70.6       71.6 ± 0.5  75.9 ± 1.3   71.7 ± 1.1
Crowdflower progressive stance           54.3       63.8 ± 1.0  61.5 ± 1.3   60.6 ± 1.4
Crowdflower US economic performance      75.6       75.3 ± 0.1  76.5 ± 0.4   77.3 ± 0.1
Customer complaint database              54.5       55.9 ± 0.1  56.4 ± 0.1   55.4 ± 0.1
News aggregator dataset                  95.2       96.3 ± 0.0  96.5 ± 0.0   96.2 ± 0.0
SMS spam collection                      98.5       99.3 ± 0.2  99.3 ± 0.2   95.1 ± 2.2
Average                                  72.7       73.7        74.0         73.3
Total number of params                   —          17×         9.9×         1.19×
Trained params/task                      —          100%        52.9%        1.14%

Table 2. Test accuracy for additional classification tasks. In these experiments we transfer from the BERTBASE model. For each task and algorithm, the model with the best validation set accuracy is chosen. We report the mean test accuracy and s.e.m. across runs with different random seeds.

Figure 3 shows the parameter/performance trade-off aggregated over all classification tasks in each suite (GLUE and "additional"). On GLUE, performance decreases dramatically when fewer layers are fine-tuned. Some of the additional tasks benefit from training fewer layers, so the performance of fine-tuning decays much less. In both cases, adapters yield good performance across a range of sizes, with two orders of magnitude fewer trained parameters than fine-tuning.

Figure 4 shows more details for two GLUE tasks: MNLIm and CoLA. Tuning the top layers trains more task-specific parameters for all k > 2. When fine-tuning using a comparable number of task-specific parameters, the performance decreases substantially compared to adapters. For instance, fine-tuning just the top layer yields approximately 9M trainable parameters and 77.8% ± 0.1% validation accuracy on MNLIm. In contrast, adapter tuning with size 64 yields approximately 2M trainable parameters and 83.7% ± 0.1% validation accuracy. For comparison, full fine-tuning attains 84.4% ± 0.02% on MNLIm. We observe a similar trend on CoLA.

As a further comparison, we tune the parameters of layer normalization alone. These layers contain only point-wise additions and multiplications, so introduce very few trainable parameters: 40k for BERTBASE. However, this strategy performs poorly: performance decreases by approximately 3.5% on CoLA and 4% on MNLI.

To summarize, adapter tuning is highly parameter-efficient, and produces a compact model with strong performance, comparable to full fine-tuning. Training adapters with sizes 0.5-5% of the original model, performance is within 1% of the competitive published results on BERTLARGE.

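As a rough back-of-the-envelope check of the adapter parameter counts quoted above (our arithmetic, assuming the standard BERTBASE dimensions of 12 layers and hidden size d = 768, with two adapters per layer as in Figure 2):

d, m = 768, 64             # hidden size and bottleneck size
layers, per_layer = 12, 2  # BERTBASE Transformer layers, adapters per layer
per_adapter = 2 * m * d + d + m           # = 99,136 parameters per adapter
total = layers * per_layer * per_adapter  # = 2,379,264, i.e. roughly 2M
# This is the same order of magnitude as the figure quoted above; the exact
# number also depends on which task-specific parameters (layer norms,
# classification head) are included in the count.
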

[Figure 3 omitted: two panels, GLUE (BERTLARGE) and Additional Tasks (BERTBASE), plotting accuracy delta (%) against the number of trainable parameters per task for adapters and for fine-tuning the top layers.]

Figure 3. Accuracy versus the number of trained parameters, aggregated across tasks. We compare adapters of different sizes (orange) with fine-tuning the top n layers, for varying n (blue). The lines and shaded areas indicate the 20th, 50th, and 80th percentiles across tasks. For each task and algorithm, the best model is selected for each point along the curve. For GLUE, the validation set accuracy is reported. For the additional tasks, we report the test-set accuracies. To remove the intra-task variance in scores, we normalize the scores for each model and task by subtracting the performance of full fine-tuning on the corresponding task.

[Figure 4 omitted: two panels, MNLIm (BERTBASE) and CoLA (BERTBASE), plotting validation accuracy (%) against the number of trainable parameters per task.]

Figure 4. Validation set accuracy versus number of trained parameters for three methods: (i) Adapter tuning with an adapter size 2ⁿ for n = 0...9 (orange). (ii) Fine-tuning the top k layers for k = 1...12 (blue). (iii) Tuning the layer normalization parameters only (green). Error bars indicate ±1 s.e.m. across three random seeds.

3.5. SQuAD Extractive Question Answering

Finally, we confirm that adapters work on tasks other than classification by running on SQuAD v1.1 (Rajpurkar et al., 2018). Given a question and a Wikipedia paragraph, this task requires selecting the answer span to the question from the paragraph. Figure 5 displays the parameter/performance trade-off of fine-tuning and adapters on the SQuAD validation set. For fine-tuning, we sweep the number of trained layers, the learning rate in {3·10⁻⁵, 5·10⁻⁵, 1·10⁻⁴}, and the number of epochs in {2, 3, 5}. For adapters, we sweep the adapter size, the learning rate in {3·10⁻⁵, 1·10⁻⁴, 3·10⁻⁴, 1·10⁻³}, and the number of epochs in {3, 10, 20}. As for classification, adapters attain performance comparable to full fine-tuning, while training many fewer parameters. Adapters of size 64 (2% parameters) attain a best F1 of 90.4%, while fine-tuning attains 90.7. SQuAD performs well even with very small adapters; those of size 2 (0.1% parameters) attain an F1 of 89.9.

[Figure 5 omitted: F1 score on the SQuAD validation set versus the number of trainable parameters, for adapters and for fine-tuning the top layers.]

Figure 5. Validation accuracy versus the number of trained parameters for SQuAD v1.1. Error bars indicate the s.e.m. across three seeds, using the best hyperparameters.

3.6. Analysis and Discussion

We perform an ablation to determine which adapters are influential. For this, we remove some trained adapters and re-evaluate the model (without re-training) on the validation set. Figure 6 shows the change in performance when removing adapters from all continuous layer spans. The experiment is performed on BERTBASE with adapter size 64, on MNLI and CoLA.

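Because each adapter contains a skip-connection, ablating it is equivalent to replacing it with the identity map at evaluation time, with no re-training involved. A minimal sketch of ours (the attribute names attn_adapter and ffn_adapter are hypothetical, not from the paper):

from torch import nn

def ablate_adapter_span(transformer_layers, first: int, last: int) -> None:
    # Remove the trained adapters of layers first..last (inclusive) by
    # swapping them for identity modules; the rest of the network, including
    # all other adapters, is left untouched. The model is then simply
    # re-evaluated on the validation set.
    for layer in transformer_layers[first:last + 1]:
        layer.attn_adapter = nn.Identity()
        layer.ffn_adapter = nn.Identity()
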

First, we observe that removing any single layer's adapters has only a small impact on performance. The elements on the heatmaps' diagonals show the performances of removing adapters from single layers, where the largest performance drop is 2%. In contrast, when all of the adapters are removed from the network, the performance drops substantially: to 37% on MNLI and 69% on CoLA, scores attained by predicting the majority class. This indicates that although each adapter has a small influence on the overall network, the overall effect is large.

Second, Figure 6 suggests that adapters on the lower layers have a smaller impact than those on the higher layers. Removing the adapters from layers 0-4 on MNLI barely affects performance. This indicates that adapters perform well because they automatically prioritize higher layers. Indeed, focusing on the upper layers is a popular strategy in fine-tuning (Howard & Ruder, 2018). One intuition is that the lower layers extract lower-level features that are shared among tasks, while the higher layers build features that are unique to different tasks. This relates to our observation that for some tasks, fine-tuning only the top layers outperforms full fine-tuning; see Table 2.

Next, we investigate the robustness of the adapter modules to the number of neurons and the initialization scale. In our main experiments the weights in the adapter module were drawn from a zero-mean Gaussian with standard deviation 10⁻², truncated to two standard deviations. To analyze the impact of the initialization scale on performance, we test standard deviations in the interval [10⁻⁷, 1]. Figure 6 summarizes the results. We observe that on both datasets the performance of adapters is robust for standard deviations below 10⁻². However, when the initialization is too large, performance degrades, more substantially on CoLA.

To investigate the robustness of adapters to the number of neurons, we re-examine the experimental data from Section 3.2. We find that the quality of the model across adapter sizes is stable, and a fixed adapter size across all the tasks could be used with small detriment to performance. For each adapter size we calculate the mean validation accuracy across the eight classification tasks by selecting the optimal learning rate and number of epochs.⁵ For adapter sizes 8, 64, and 256, the mean validation accuracies are 86.2%, 85.8%, and 85.7%, respectively. This message is further corroborated by Figures 4 and 5, which show stable performance across a few orders of magnitude.

⁵We treat here MNLIm and MNLImm as separate tasks. For consistency, for all datasets we use the accuracy metric and exclude the regression task STS-B.

Finally, we tried a number of extensions to the adapter's architecture that did not yield a significant boost in performance. We document them here for completeness. We experimented with (i) adding a batch/layer normalization to the adapter, (ii) increasing the number of layers per adapter, (iii) different activation functions, such as tanh, (iv) inserting adapters only inside the attention layer, and (v) adding adapters in parallel to the main layers, possibly with a multiplicative interaction. In all cases we observed the resulting performance to be similar to the bottleneck architecture proposed in Section 2.1. Therefore, due to its simplicity and strong performance, we recommend the original adapter architecture.

[Figure 6 omitted: two heatmaps (MNLIm and CoLA) of the drop in validation accuracy when ablating adapters from continuous layer spans, and a plot of validation accuracy against the standard deviation σ of the adapter weight initialization.]

Figure 6. Left, Center: Ablation of trained adapters from continuous layer spans. The heatmap shows the relative decrease in validation accuracy compared to the fully trained adapted model. The y and x axes indicate the first and last layers ablated (inclusive), respectively. The diagonal cells, highlighted in green, indicate ablation of a single layer's adapters. The cell in the top-right indicates ablation of all adapters. Cells in the lower triangle are meaningless, and are set to 0%, the best possible relative performance. Right: Performance of BERTBASE using adapters with different initial weight magnitudes. The x-axis is the standard deviation of the initialization distribution.

Howard & Ruder, 2018; Radford et al., 2018) In NLP, the with the number of tasks, since adapters are very small, our
upstream model is usually a neural language model (Ben- models scale much more favorably.
gio et al., 2003). Recent state-of-the-art results on ques-
tion answering (Rajpurkar et al., 2016) and text classi-
fication (Wang et al., 2018) have been attained by fine- Transfer Learning in Vision Fine-tuning models pre-
tuning a Transformer network (Vaswani et al., 2017) with a trained on ImageNet (Deng et al., 2009) is ubiquitous when
Masked Language Model loss (Devlin et al., 2018). Perfor- building image recognition models (Yosinski et al., 2014;
mance aside, an advantage of fine-tuning is that it does not Huh et al., 2016). This technique attains state-of-the-art per-
require task-specific model design, unlike representation- formance on many vision tasks, including classification (Ko-
based transfer. However, vanilla fine-tuning does require a rnblith et al., 2018), fine-grained classifcation (Hermans
new set of network weights for every new task. et al., 2017), segmentation (Long et al., 2015), and de-
tection (Girshick et al., 2014). In vision, convolutional
Multi-task Learning Multi-task learning (MTL) involves adapter modules have been studied (Rebuffi et al., 2017;
training on tasks simultaneously. Early work shows that 2018; Rosenfeld & Tsotsos, 2018). These works perform
sharing network parameters across tasks exploits task reg- incremental learning in multiple domains by adding small
ularities, yielding improved performance (Caruana, 1997). convolutional layers to a ResNet (He et al., 2016) or VGG
The authors share weights in lower layers of a network, net (Simonyan & Zisserman, 2014). Adapter size is lim-
and use specialized higher layers. Many NLP systems have ited using 1 × 1 convolutions, whilst the original networks
exploited MTL. Some examples include: text processing typically use 3 × 3. This yields 11% increase in overall
systems (part of speech, chunking, named entity recogni- model size per task. Since the kernel size cannot be further
tion, etc.) (Collobert & Weston, 2008), multilingual mod- reduced other weight compression techniques must be used
els (Huang et al., 2013), semantic parsing (Peng et al., 2017), to attain further savings. Our bottleneck adapters can be
machine translation (Johnson et al., 2017), and question an- much smaller, and still perform well.
swering (Choi et al., 2017). MTL yields a single model
Concurrent work explores similar ideas for BERT (Stickland
to solve all problems. However, unlike our adapters, MTL
& Murray, 2019). The authors introduce Projected Atten-
requires simultaneous access to the tasks during training.
tion Layers (PALs), small layers with a similar role to our
adapters. The main differences are i) Stickland & Murray
Continual Learning As an alternative to simultaneous
(2019) use a different architecture, and ii) they perform mul-
training, continual, or lifelong, learning aims to learn from a
titask training, jointly fine-tuning BERT on all GLUE tasks.
sequence of tasks (Thrun, 1998). However, when re-trained,
Sina Semnani (2019) perform an emprical comparison of
deep networks tend to forget how to perform previous tasks;
our bottleneck Adpaters and PALs on SQuAD v2.0 (Ra-
a challenge termed catastrophic forgetting (McCloskey &
jpurkar et al., 2018).
Cohen, 1989; French, 1999). Techniques have been pro-
posed to mitigate forgetting (Kirkpatrick et al., 2017; Zenke
ACKNOWLEDGMENTS
et al., 2017), however, unlike for adapters, the memory is
imperfect. Progressive Networks avoid forgetting by instan- We would like to thank Andrey Khorlin, Lucas Beyer, Noé
tiating a new network “column” for each task (Rusu et al., Lutz, and Jeremiah Harmsen for useful comments and dis-
2016). However, the number of parameters grows linearly cussions.

References

Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. Journal of Machine Learning Research, 2003.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors with subword information. ACL, 2017.
Brown, P. F., deSouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. Class-based n-gram models of natural language. Computational Linguistics, 1992.
Caruana, R. Multitask learning. Machine Learning, 1997.
Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., St. John, R., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Strope, B., and Kurzweil, R. Universal sentence encoder for English. In EMNLP, 2019.
Chen, T., Lucic, M., Houlsby, N., and Gelly, S. On self modulation for generative adversarial networks. ICLR, 2019.
Choi, E., Hewlett, D., Uszkoreit, J., Polosukhin, I., Lacoste, A., and Berant, J. Coarse-to-fine question answering for long documents. In ACL, 2017.
Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, 2008.
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, 2017.
Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In NIPS, 2015.
De Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O., and Courville, A. C. Modulating early visual processing by language. In NIPS, 2017.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
French, R. M. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences, 1999.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
Hermans, A., Beyer, L., and Leibe, B. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. In ACL, 2018.
Huang, J.-T., Li, J., Yu, D., Deng, L., and Gong, Y. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In ICASSP, 2013.
Huh, M., Agrawal, P., and Efros, A. A. What makes ImageNet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.
Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. Google's multilingual neural machine translation system: Enabling zero-shot translation. ACL, 2017.
Kingma, D. and Ba, J. Adam: A method for stochastic optimization. ICLR, 2014.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. PNAS, 2017.
Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In NIPS, 2015.
Kornblith, S., Shlens, J., and Le, Q. V. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974, 2018.
Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In ICML, 2014.
Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: Contextualized word vectors. In NIPS, 2017.
McCloskey, M. and Cohen, N. J. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, 1989.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
Peng, H., Thomson, S., and Smith, N. A. Deep multitask learning for semantic dependency parsing. In ACL, 2017.
Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In EMNLP, 2014.
Perez, E., Strub, F., de Vries, H., Dumoulin, V., and Courville, A. C. FiLM: Visual reasoning with a general conditioning layer. AAAI, 2018.
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In NAACL, 2018.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.
Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
Rajpurkar, P., Jia, R., and Liang, P. Know what you don't know: Unanswerable questions for SQuAD. In ACL, 2018.
Rebuffi, S.-A., Bilen, H., and Vedaldi, A. Learning multiple visual domains with residual adapters. In NIPS, 2017.
Rebuffi, S.-A., Vedaldi, A., and Bilen, H. Efficient parametrization of multi-domain deep neural networks. In CVPR, 2018.
Rosenfeld, A. and Tsotsos, J. K. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
Rusu, A. A., Rabinowitz, N. C., Desjardins, G., Soyer, H., Kirkpatrick, J., Kavukcuoglu, K., Pascanu, R., and Hadsell, R. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
Semnani, S., Sadagopan, K., et al. BERT-A: Fine-tuning BERT with Adapters and Data Augmentation. http://web.stanford.edu/class/cs224n/reports/default/15848417.pdf, 2019.
Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. ICLR, 2014.
Stickland, A. C. and Murray, I. BERT and PALs: Projected Attention Layers for efficient adaptation in multi-task learning. arXiv preprint arXiv:1902.02671, 2019.
Thrun, S. Lifelong learning algorithms. In Learning to Learn. 1998.
Turian, J., Ratinov, L., and Bengio, Y. Word representations: A simple and general method for semi-supervised learning. In ACL, 2010.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In NIPS, 2017.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. ICLR, 2018.
Wong, C., Houlsby, N., Lu, Y., and Gesmundo, A. Transfer learning with Neural AutoML. In NeurIPS, 2018.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. How transferable are features in deep neural networks? In NIPS, 2014.
Zenke, F., Poole, B., and Ganguli, S. Continual learning through synaptic intelligence. ICML, 2017.
Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In ICLR, 2017.