Qiu et al. - 2020 - Pre-trained Models for Natural Language Processing
Recently, the emergence of pre-trained models (PTMs)∗ has brought natural language processing (NLP) to a new era. In this
survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its
research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next,
we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for
future research. This survey is intended to be a hands-on guide for understanding, using, and developing PTMs for various NLP
tasks.
Deep Learning, Neural Network, Natural Language Processing, Pre-trained Model, Distributed Representation, Word
Embedding, Self-Supervised Learning, Language Modelling
1 Introduction

With the development of deep learning, various neural networks have been widely used to solve Natural Language Processing (NLP) tasks, such as convolutional neural networks (CNNs) [80, 86, 49], recurrent neural networks (RNNs) [173, 109], graph-based neural networks (GNNs) [159, 174, 124] and attention mechanisms [7, 184]. One of the advantages of these neural models is their ability to alleviate the feature engineering problem. Non-neural NLP methods usually rely heavily on discrete handcrafted features, while neural methods usually use low-dimensional and dense vectors (aka. distributed representation) to implicitly represent the syntactic or semantic features of the language. These representations are learned in specific NLP tasks. Therefore, neural methods make it easy for people to develop various NLP systems.

Despite the success of neural models for NLP tasks, the performance improvement may be less significant compared to the Computer Vision (CV) field. The main reason is that current datasets for most supervised NLP tasks are rather small (except machine translation). Deep neural networks usually have a large number of parameters, which makes them overfit on such small training data and generalize poorly in practice. Therefore, the early neural models for many NLP tasks were relatively shallow and usually consisted of only 1∼3 neural layers.

Recently, substantial work has shown that pre-trained models (PTMs) on a large corpus can learn universal language representations, which are beneficial for downstream NLP tasks and can avoid training a new model from scratch. With the development of computational power, the emergence of deep models (i.e., Transformer [184]), and the constant enhancement of training skills, the architecture of PTMs has advanced from shallow to deep. The first-generation PTMs aim to learn good word embeddings.
* Corresponding author (email: [email protected])
∗ PTMs are also known as pre-trained language models (PLMs). In this survey, we use PTMs for NLP instead of PLMs to avoid confusion with the narrow
concept of statistical (or probabilistic) language models.
Since these models themselves are no longer needed by downstream tasks, they are usually very shallow for computational efficiency, such as Skip-Gram [129] and GloVe [133]. Although these pre-trained embeddings can capture semantic meanings of words, they are context-free and fail to capture higher-level concepts in context, such as polysemous disambiguation, syntactic structures, semantic roles, and anaphora. The second-generation PTMs focus on learning contextual word embeddings, such as CoVe [126], ELMo [135], OpenAI GPT [142] and BERT [36]. These learned encoders are still needed by downstream tasks to represent words in context. Besides, various pre-training tasks are also proposed to learn PTMs for different purposes.

The contributions of this survey can be summarized as follows:

2. New taxonomy. We propose a taxonomy of PTMs for NLP, which categorizes existing PTMs from four different perspectives: 1) representation type; 2) model architecture; 3) type of pre-training task; 4) extensions for specific types of scenarios.

3. Abundant resources. We collect abundant resources on PTMs, including open-source implementations of PTMs, visualization tools, corpora, and paper lists.

4. Future directions. We discuss and analyze the limitations of existing PTMs. Also, we suggest possible future research directions.

The rest of the survey is organized as follows. Section 2 outlines the background concepts and commonly used notations of PTMs. Section 3 gives a brief overview of PTMs and clarifies the categorization of PTMs. Section 4 provides extensions of PTMs. Section 5 discusses how to transfer the knowledge of PTMs to downstream tasks. Section 6 gives the related resources on PTMs. Section 7 presents a collection of applications across various NLP tasks. Section 8 discusses the current challenges and suggests future directions. Section 9 summarizes the paper.

2 Background

2.1 Language Representation Learning

As suggested by Bengio et al. [13], a good representation should express general-purpose priors that are not task-specific but would be likely to be useful for a learning machine to solve AI-tasks. When it comes to language, a good representation should capture the implicit linguistic rules and common sense knowledge hidden in text data, such as lexical meanings, syntactic structures, semantic roles, and even pragmatics.

The core idea of distributed representation is to describe the meaning of a piece of text by low-dimensional real-valued vectors, where each dimension of the vector has no corresponding sense while the whole vector represents a concrete concept. Figure 1 illustrates the generic neural architecture for NLP. There are two kinds of word embeddings: non-contextual and contextual embeddings. The difference between them is whether the embedding for a word dynamically changes according to the context it appears in.

Figure 1: Generic Neural Architecture for NLP (a task-specific model stacked on non-contextual embeddings e_x1, ..., e_x7).

Non-contextual Embeddings  The first step of representing language is to map discrete language symbols into a distributed embedding space. Formally, for each word (or sub-word) $x$ in a vocabulary $\mathcal{V}$, we map it to a vector $\mathbf{e}_x \in \mathbb{R}^{D_e}$ with a lookup table $\mathbf{E} \in \mathbb{R}^{D_e \times |\mathcal{V}|}$, where $D_e$ is a hyper-parameter indicating the dimension of token embeddings. These embeddings are trained on task data along with other model parameters.

There are two main limitations to this kind of embedding. The first issue is that the embeddings are static. The embedding for a word is always the same regardless of its context. Therefore, these non-contextual embeddings fail to model polysemous words. The second issue is the out-of-vocabulary problem. To tackle this problem, character-level word representations or sub-word representations are widely used in many NLP tasks, such as CharCNN [87], FastText [14] and Byte-Pair Encoding (BPE) [154].

Contextual Embeddings  To address the issue of polysemy and the context-dependent nature of words, we need to distinguish the semantics of words in different contexts. Given a text $x_1, x_2, \cdots, x_T$ where each token $x_t \in \mathcal{V}$ is a word or sub-word, the contextual representation of $x_t$ depends on the whole text:

$[\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_T] = f_{\mathrm{enc}}(x_1, x_2, \cdots, x_T)$,  (1)

where $f_{\mathrm{enc}}(\cdot)$ is a neural encoder, which is described in Section 2.2, and $\mathbf{h}_t$ is called the contextual embedding or dynamical embedding of token $x_t$ because of the contextual information included in it.

2.2 Neural Contextual Encoders

Most of the neural contextual encoders can be classified into two categories: sequence models and graph-based models. Figure 2 illustrates the architecture of these models.

Figure 2: Neural contextual encoders: (a) Convolutional Model; (b) Recurrent Model; (c) Fully-Connected Self-Attention Model.

2.2.1 Sequence Models

Sequence models usually capture the local context of a word in sequential order.

Convolutional Models  Convolutional models take the embeddings of words in the input sentence and capture the meaning of a word by aggregating the local information from its neighbors with convolution operations [86].

Recurrent Models  Recurrent models capture the contextual representations of words with short memory, such as LSTMs [64] and GRUs [23]. In practice, bi-directional LSTMs or GRUs are used to collect information from both sides of a word, but their performance is often affected by the long-term dependency problem.

2.2.2 Non-Sequence Models

Non-sequence models learn the contextual representation with a pre-defined tree or graph structure between words, such as the syntactic structure or semantic relations. Some popular non-sequence models include Recursive NN [159], TreeLSTM [174, 222], and GCN [88].

Although the linguistic-aware graph structure can provide useful inductive bias, how to build a good graph structure is also a challenging problem. Besides, the structure depends heavily on expert knowledge or external NLP tools, such as the dependency parser.

Fully-Connected Self-Attention Model  In practice, a more straightforward way is to use a fully-connected graph to model the relation of every two words and let the model learn the structure by itself. Usually, the connection weights are dynamically computed by the self-attention mechanism, which implicitly indicates the connection between words. A successful instance of the fully-connected self-attention model is the Transformer [184], which also needs other supplementary modules, such as positional embeddings, layer normalization, residual connections and position-wise feed-forward network (FFN) layers.

2.2.3 Analysis

Sequence models learn the contextual representation of a word with locality bias and have difficulty capturing long-range interactions between words. Nevertheless, sequence models are usually easy to train and get good results for various NLP tasks.

In contrast, as an instantiated fully-connected self-attention model, the Transformer can directly model the dependency between every two words in a sequence, which is more powerful and suitable for modeling the long-range dependencies of language. However, due to its heavy structure and less model bias, the Transformer usually requires a large training corpus and is easy to overfit on small or modestly-sized datasets [142, 53].

Currently, the Transformer has become the mainstream architecture of PTMs due to its powerful capacity.

2.3 Why Pre-training?

With the development of deep learning, the number of model parameters has increased rapidly. A much larger dataset is needed to fully train model parameters and prevent overfitting. However, building large-scale labeled datasets is a great challenge for most NLP tasks due to the extremely expensive annotation costs, especially for syntax- and semantics-related tasks.

In contrast, large-scale unlabeled corpora are relatively easy to construct. To leverage the huge unlabeled text data, we can first learn a good representation from them and then use these representations for other tasks. Recent studies have demonstrated significant performance gains on many NLP tasks with the help of representations extracted from PTMs on large unannotated corpora.

The advantages of pre-training can be summarized as follows:

1. Pre-training on the huge text corpus can learn universal
language representations and help with the downstream tasks.

2. Pre-training provides a better model initialization, which usually leads to better generalization performance and speeds up convergence on the target task.

3. Pre-training can be regarded as a kind of regularization to avoid overfitting on small data [43].

2.4 A Brief History of PTMs for NLP

Pre-training has always been an effective strategy to learn the parameters of deep neural networks, which are then fine-tuned on downstream tasks. As early as 2006, the breakthrough of deep learning came with greedy layer-wise unsupervised pre-training followed by supervised fine-tuning [62]. In CV, it has been common practice to pre-train models on the huge ImageNet corpus, and then fine-tune further on smaller data for different tasks. This is much better than a random initialization because the model learns general image features, which can then be used in various vision tasks.

In NLP, PTMs trained on large corpora have also been proved to be beneficial for downstream NLP tasks, from shallow word embeddings to deep neural models.

2.4.1 First-Generation PTMs: Pre-trained Word Embeddings

Representing words as dense vectors has a long history [60]. The "modern" word embedding was introduced in the pioneering work on the neural network language model (NNLM) [12]. Collobert et al. [26] showed that a word embedding pre-trained on unlabelled data could significantly improve many NLP tasks. To address the computational complexity, they learned word embeddings with a pairwise ranking task instead of language modeling. Their work is the first attempt to obtain generic word embeddings useful for other tasks from unlabeled data. Mikolov et al. [129] showed that there is no need for deep neural networks to build good word embeddings. They proposed two shallow architectures: the Continuous Bag-of-Words (CBOW) and Skip-Gram (SG) models. Despite their simplicity, they can still learn high-quality word embeddings that capture the latent syntactic and semantic similarities among words. Word2vec is one of the most popular implementations of these models and makes pre-trained word embeddings accessible for different tasks in NLP. Besides, GloVe [133] is also a widely-used model for obtaining pre-trained word embeddings, which are computed from global word-word co-occurrence statistics over a large corpus.

Although pre-trained word embeddings have been shown effective in NLP tasks, they are context-independent and mostly trained by shallow models. When used on a downstream task, the rest of the whole model still needs to be learned from scratch.

During the same time period, many researchers also tried to learn embeddings of paragraphs, sentences or documents, such as paragraph vector [96], Skip-thought vectors [89], and Context2Vec [127]. Different from their modern successors, these sentence embedding models try to encode input sentences into a fixed-dimensional vector representation, rather than a contextual representation for each token.

2.4.2 Second-Generation PTMs: Pre-trained Contextual Encoders

Since most NLP tasks are beyond word-level, it is natural to pre-train the neural encoders at sentence-level or higher. The output vectors of neural encoders are also called contextual word embeddings since they represent the word semantics depending on its context.

Dai and Le [30] proposed the first successful instance of a PTM for NLP. They initialized LSTMs with a language model (LM) or a sequence autoencoder, and found that pre-training can improve the training and generalization of LSTMs in many text classification tasks. Liu et al. [109] pre-trained a shared LSTM encoder with an LM and fine-tuned it under the multi-task learning (MTL) framework. They found that pre-training and fine-tuning can further improve the performance of MTL for several text classification tasks. Ramachandran et al. [146] found that Seq2Seq models can be significantly improved by unsupervised pre-training. The weights of both encoder and decoder are initialized with pre-trained weights of two language models and then fine-tuned with labeled data. Besides pre-training the contextual encoder with an LM, McCann et al. [126] pre-trained a deep LSTM encoder from an attentional sequence-to-sequence model with machine translation (MT). The context vectors (CoVe) output by the pre-trained encoder can improve the performance of a wide variety of common NLP tasks.

Since these precursor PTMs, modern PTMs have usually been trained with larger-scale corpora, more powerful or deeper architectures (e.g., Transformer), and new pre-training tasks.

Peters et al. [135] pre-trained a 2-layer LSTM encoder with a bidirectional language model (BiLM), consisting of a forward LM and a backward LM. The contextual representations output by the pre-trained BiLM, ELMo (Embeddings from Language Models), are shown to bring large improvements on a broad range of NLP tasks. Akbik et al. [1] captured word meaning with contextual string embeddings pre-trained with a character-level LM. However, these two PTMs are usually used as a feature extractor to produce contextual word embeddings, which are fed into the main model for downstream tasks. Their parameters are fixed, and the rest
parameters of the main model are still trained from scratch. ULMFiT (Universal Language Model Fine-tuning) [67] attempted to fine-tune a pre-trained LM for text classification (TC) and achieved state-of-the-art results on six widely-used TC datasets. ULMFiT consists of 3 phases: 1) pre-training the LM on general-domain data; 2) fine-tuning the LM on target data; 3) fine-tuning on the target task. ULMFiT also investigates some effective fine-tuning strategies, including discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing.

More recently, very deep PTMs have shown their powerful ability in learning universal language representations: e.g., OpenAI GPT (Generative Pre-training) [142] and BERT (Bidirectional Encoder Representation from Transformer) [36]. Besides LM, an increasing number of self-supervised tasks (see Section 3.1) have been proposed to make PTMs capture more knowledge from large-scale text corpora.

Since ULMFiT and BERT, fine-tuning has become the mainstream approach to adapt PTMs for downstream tasks.

3 Overview of PTMs

The major differences between PTMs are the usages of contextual encoders, pre-training tasks, and purposes. We have briefly introduced the architectures of contextual encoders in Section 2.2. In this section, we focus on the description of pre-training tasks and give a taxonomy of PTMs.

3.1 Pre-training Tasks

The pre-training tasks are crucial for learning the universal representation of language. Usually, these pre-training tasks should be challenging and have substantial training data. In this section, we summarize the pre-training tasks into three categories: supervised learning, unsupervised learning, and self-supervised learning.

1. Supervised learning (SL) is to learn a function that maps an input to an output based on training data consisting of input-output pairs.

2. Unsupervised learning (UL) is to find some intrinsic knowledge from unlabeled data, such as clusters, densities, and latent representations.

3. Self-Supervised learning (SSL) is a blend of supervised learning and unsupervised learning1). The learning paradigm of SSL is entirely the same as supervised learning, but the labels of the training data are generated automatically. The key idea of SSL is to predict any part of the input from other parts in some form. For example, the masked language model (MLM) is a self-supervised task that attempts to predict the masked words in a sentence given the rest of the words.

In CV, many PTMs are trained on large supervised training sets like ImageNet. However, in NLP, the datasets of most supervised tasks are not large enough to train a good PTM. The only exception is machine translation (MT). A large-scale MT dataset, WMT 2017, consists of more than 7 million sentence pairs. Besides, MT is one of the most challenging tasks in NLP, and an encoder pre-trained on MT can benefit a variety of downstream NLP tasks. As a successful PTM, CoVe [126] is an encoder pre-trained on the MT task that improves a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD).

In this section, we introduce some widely-used pre-training tasks in existing PTMs. We can regard these tasks as self-supervised learning. Table 1 also summarizes their loss functions.

3.1.1 Language Modeling (LM)

The most common unsupervised task in NLP is probabilistic language modeling (LM), which is a classic probabilistic density estimation problem. Although LM is a general concept, in practice, LM often refers in particular to auto-regressive LM or unidirectional LM.

Given a text sequence $x_{1:T} = [x_1, x_2, \cdots, x_T]$, its joint probability $p(x_{1:T})$ can be decomposed as

$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{0:t-1})$,  (2)

where $x_0$ is a special token indicating the beginning of the sequence. The conditional probability $p(x_t \mid x_{0:t-1})$ can be modeled by a probability distribution over the vocabulary given the linguistic context $x_{0:t-1}$. The context $x_{0:t-1}$ is modeled by a neural encoder $f_{\mathrm{enc}}(\cdot)$, and the conditional probability is

$p(x_t \mid x_{0:t-1}) = g_{\mathrm{LM}}\big(f_{\mathrm{enc}}(x_{0:t-1})\big)$,  (3)

where $g_{\mathrm{LM}}(\cdot)$ is the prediction layer.

Given a huge corpus, we can train the entire network with maximum likelihood estimation (MLE).

A drawback of unidirectional LM is that the representation of each token encodes only the leftward context tokens and itself. However, better contextual representations of text should encode contextual information from both directions.
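To make Eqs. (2)-(3) concrete, the following is a minimal sketch of a unidirectional LM trained with MLE, assuming PyTorch, an LSTM as the encoder f_enc and a linear layer as the prediction layer g_LM; the class name and hyper-parameters are illustrative and not taken from any particular PTM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUnidirectionalLM(nn.Module):
    """Sketch of g_LM(f_enc(x_{0:t-1})) from Eqs. (2)-(3)."""
    def __init__(self, vocab_size, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.f_enc = nn.LSTM(d_model, d_model, batch_first=True)  # left-to-right encoder
        self.g_lm = nn.Linear(d_model, vocab_size)                # prediction layer over the vocabulary

    def forward(self, x):
        # x: (batch, T) token ids; x[:, 0] plays the role of the special start token x_0
        h, _ = self.f_enc(self.embed(x))
        return self.g_lm(h)               # logits for p(x_t | x_{0:t-1}) at every position

def mle_loss(model, x):
    logits = model(x[:, :-1])             # condition on x_{0:t-1}
    targets = x[:, 1:]                    # predict x_t
    # negative log-likelihood of the factorized joint probability in Eq. (2)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```

Minimizing this loss over a large corpus is exactly the MLE training mentioned above.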
1) Indeed, it is hard to clearly distinguish unsupervised learning from self-supervised learning. For clarification, we use "unsupervised learning" to refer to learning without human-annotated supervised labels. The purpose of "self-supervised learning" is to learn general knowledge from data, rather than to optimize standard unsupervised objectives such as density estimation.
An improved solution is the bidirectional LM (BiLM), which consists of two unidirectional LMs: a forward left-to-right LM and a backward right-to-left LM. For BiLM, Baevski et al. [6] proposed a two-tower model in which the forward tower operates the left-to-right LM and the backward tower operates the right-to-left LM.

3.1.2 Masked Language Modeling (MLM)

Masked language modeling (MLM) was first proposed by Taylor [178] in the literature, who referred to it as a Cloze task. Devlin et al. [36] adapted this task as a novel pre-training task to overcome the drawback of the standard unidirectional LM. Loosely speaking, MLM first masks out some tokens from the input sentences and then trains the model to predict the masked tokens from the rest of the tokens. However, this pre-training method creates a mismatch between the pre-training phase and the fine-tuning phase, because the mask token does not appear during the fine-tuning phase. Empirically, to deal with this issue, Devlin et al. [36] used a special [MASK] token 80% of the time, a random token 10% of the time and the original token 10% of the time to perform masking.

Sequence-to-Sequence MLM (Seq2Seq MLM)  MLM is usually solved as a classification problem. We feed the masked sequences to a neural encoder whose output vectors are further fed into a softmax classifier to predict the masked token. Alternatively, we can use an encoder-decoder (aka. sequence-to-sequence) architecture for MLM, in which the encoder is fed a masked sequence and the decoder sequentially produces the masked tokens in an auto-regressive fashion. We refer to this kind of MLM as sequence-to-sequence MLM (Seq2Seq MLM), which is used in MASS [160] and T5 [144]. Seq2Seq MLM can benefit Seq2Seq-style downstream tasks, such as question answering, summarization, and machine translation.

Enhanced Masked Language Modeling (E-MLM)  Concurrently, there are multiple studies proposing different enhanced versions of MLM to further improve on BERT. Instead of static masking, RoBERTa [117] improves BERT by dynamic masking.

UniLM [39, 8] extends the task of mask prediction to three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. XLM [27] performs MLM on a concatenation of parallel bilingual sentence pairs, called Translation Language Modeling (TLM). SpanBERT [77] replaces MLM with Random Contiguous Words Masking and a Span Boundary Objective (SBO) to integrate structure information into pre-training, which requires the system to predict masked spans based on span boundaries. Besides, StructBERT [193] introduces the Span Order Recovery task to further incorporate language structures.

Another way to enrich MLM is to incorporate external knowledge (see Section 4.1).

3.1.3 Permuted Language Modeling (PLM)

Despite the wide use of the MLM task in pre-training, Yang et al. [209] claimed that some special tokens used in the pre-training of MLM, like [MASK], are absent when the model is applied to downstream tasks, leading to a gap between pre-training and fine-tuning. To overcome this issue, Permuted Language Modeling (PLM) [209] is a pre-training objective
to replace MLM. In short, PLM is a language modeling task on a random permutation of the input sequence. A permutation is randomly sampled from all possible permutations. Then some of the tokens in the permuted sequence are chosen as the target, and the model is trained to predict these targets, depending on the rest of the tokens and the natural positions of the targets. Note that this permutation does not affect the natural positions of the sequence and only defines the order of token predictions. In practice, only the last few tokens in the permuted sequences are predicted, due to slow convergence. In addition, a special two-stream self-attention is introduced for target-aware representations.

3.1.4 Denoising Autoencoder (DAE)

A denoising autoencoder (DAE) takes a partially corrupted input and aims to recover the original undistorted input. Specific to language, a sequence-to-sequence model, such as the standard Transformer, is used to reconstruct the original text. There are several ways to corrupt text [100]:

(1) Token Masking: Randomly sampling tokens from the input and replacing them with [MASK] elements.

(2) Token Deletion: Randomly deleting tokens from the input. Different from token masking, the model needs to decide the positions of the missing inputs.

(3) Text Infilling: Like SpanBERT, a number of text spans are sampled and replaced with a single [MASK] token. Each span length is drawn from a Poisson distribution (λ = 3). The model needs to predict how many tokens are missing from a span.

(4) Sentence Permutation: Dividing a document into sentences based on full stops and shuffling these sentences in random order.

(5) Document Rotation: Selecting a token uniformly at random and rotating the document so that it begins with that token. The model needs to identify the real start position of the document.

3.1.5 Contrastive Learning (CTL)

Contrastive learning [153] assumes that some observed pairs of text are more semantically similar than randomly sampled text. A score function $s(x, y)$ for a text pair $(x, y)$ is learned to minimize the objective function

$\mathcal{L}_{\mathrm{CTL}} = \mathbb{E}_{x, y^+, y^-}\Big[-\log \frac{\exp\big(s(x, y^+)\big)}{\exp\big(s(x, y^+)\big) + \exp\big(s(x, y^-)\big)}\Big]$,  (4)

where $(x, y^+)$ is a similar pair and $y^-$ is presumably dissimilar to $x$. $y^+$ and $y^-$ are typically called the positive and negative sample. The score function $s(x, y)$ is often computed by a learnable neural encoder in two ways: $s(x, y) = f_{\mathrm{enc}}(x)^\top f_{\mathrm{enc}}(y)$ or $s(x, y) = f_{\mathrm{enc}}(x \oplus y)$.

The idea behind CTL is "learning by comparison". Compared to LM, CTL usually has less computational complexity and is therefore a desirable alternative training criterion for PTMs.

Collobert et al. [26] proposed a pairwise ranking task to distinguish real and fake phrases. The model needs to predict a higher score for a legal phrase than for an incorrect phrase obtained by replacing its central word with a random word. Mnih and Kavukcuoglu [131] trained word embeddings efficiently with Noise-Contrastive Estimation (NCE) [55], which trains a binary classifier to distinguish real and fake samples. The idea of NCE is also used in the well-known word2vec embedding [129].

We briefly describe some recently proposed CTL tasks in the following paragraphs.

Deep InfoMax (DIM)  Deep InfoMax (DIM) [63] was originally proposed for images; it improves the quality of the representation by maximizing the mutual information between an image representation and local regions of the image.

Kong et al. [90] applied DIM to language representation learning. The global representation of a sequence $x$ is defined to be the hidden state of the first token (assumed to be a special start-of-sentence symbol) output by the contextual encoder $f_{\mathrm{enc}}(x)$. The objective of DIM is to assign a higher score to $f_{\mathrm{enc}}(x_{i:j})^\top f_{\mathrm{enc}}(\hat{x}_{i:j})$ than to $f_{\mathrm{enc}}(\tilde{x}_{i:j})^\top f_{\mathrm{enc}}(\hat{x}_{i:j})$, where $x_{i:j}$ denotes an n-gram2) span from $i$ to $j$ in $x$, $\hat{x}_{i:j}$ denotes a sentence masked at positions $i$ to $j$, and $\tilde{x}_{i:j}$ denotes a randomly-sampled negative n-gram from the corpus.

Replaced Token Detection (RTD)  Replaced Token Detection (RTD) is the same as NCE but predicts whether a token has been replaced given its surrounding context.

CBOW with negative sampling (CBOW-NS) [129] can be viewed as a simple version of RTD, in which the negative samples are randomly drawn from the vocabulary with a simple proposal distribution.

ELECTRA [24] improves RTD by utilizing a generator to replace some tokens of a sequence. A generator G and a discriminator D are trained following a two-stage procedure: (1) train only the generator with the MLM task for n1 steps; (2) initialize the weights of the discriminator with the weights of the generator, then train the discriminator with a discriminative task for n2 steps, keeping G frozen. Here the discriminative task is to judge whether the input token has been replaced by G or not. The generator is discarded after pre-training, and only the discriminator is fine-tuned on downstream tasks.
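As a small illustration of Eq. (4), the sketch below computes the contrastive loss for a batch of encoder outputs, assuming the dot-product score s(x, y) = f_enc(x)^T f_enc(y) and a single negative per example; the function name and shapes are placeholders rather than the objective of any specific PTM.

```python
import torch
import torch.nn.functional as F

def ctl_loss(f_x, f_y_pos, f_y_neg):
    # f_x, f_y_pos, f_y_neg: encoder outputs of shape (batch, dim)
    s_pos = (f_x * f_y_pos).sum(dim=-1)   # s(x, y+)
    s_neg = (f_x * f_y_neg).sum(dim=-1)   # s(x, y-)
    # -log( exp(s_pos) / (exp(s_pos) + exp(s_neg)) ) rewritten as softplus(s_neg - s_pos)
    # for numerical stability
    return F.softplus(s_neg - s_pos).mean()
```

With more negatives, the denominator simply sums over all negative scores, giving the familiar InfoNCE-style softmax used by DIM-like objectives.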
2) n is drawn from a Gaussian distribution N(5, 1) clipped at 1 (minimum length) and 10 (maximum length).
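Relatedly, the 80%/10%/10% masking recipe described for MLM in Section 3.1.2 can be written down directly. The sketch below is a generic dynamic-masking routine in the spirit of BERT and RoBERTa, not the code of either model; the 15% masking rate and token ids are assumptions.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """Return (corrupted_input, labels); labels are -100 at positions that are not predicted."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    # choose ~15% of positions as prediction targets
    targets = torch.bernoulli(torch.full(labels.shape, mlm_prob)).bool()
    labels[~targets] = -100                                   # ignored by cross-entropy
    # 80% of the targets become [MASK]
    to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & targets
    input_ids[to_mask] = mask_token_id
    # 10% of the targets become a random token (half of the remaining 20%)
    to_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & targets & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, labels.shape)[to_random]
    # the final 10% keep the original token
    return input_ids, labels
```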
RTD is also an alternative solution to the mismatch problem: the network sees [MASK] during pre-training but not when being fine-tuned on downstream tasks.

Similarly, WKLM [202] replaces words at the entity level instead of the token level. Concretely, WKLM replaces entity mentions with the names of other entities of the same type and trains the model to distinguish whether the entity has been replaced.

Next Sentence Prediction (NSP)  Punctuation marks are the natural separators of text data, so it is reasonable to construct pre-training methods by utilizing them. Next Sentence Prediction (NSP) [36] is just such an example. As its name suggests, NSP trains the model to distinguish whether two input sentences are continuous segments from the training corpus. Specifically, when choosing the sentence pair for each pre-training example, 50% of the time the second sentence is the actual next sentence of the first one, and 50% of the time it is a random sentence from the corpus. This teaches the model to understand the relationship between two input sentences and thus benefits downstream tasks that are sensitive to this information, such as Question Answering and Natural Language Inference.

However, the necessity of the NSP task has been questioned by subsequent work [77, 209, 117, 93]. Yang et al. [209] found the impact of the NSP task unreliable, while Joshi et al. [77] found that single-sentence training without the NSP loss is superior to sentence-pair training with the NSP loss. Moreover, Liu et al. [117] conducted a further analysis of the NSP task, which shows that when training with blocks of text from a single document, removing the NSP loss matches or slightly improves performance on downstream tasks.

Sentence Order Prediction (SOP)  To better model inter-sentence coherence, ALBERT [93] replaces the NSP loss with a sentence order prediction (SOP) loss. As conjectured in Lan et al. [93], NSP conflates topic prediction and coherence prediction in a single task. Thus, the model is allowed to make predictions merely relying on the easier task, topic prediction. Different from NSP, SOP uses two consecutive segments from the same document as positive examples, and the same two consecutive segments but with their order swapped as negative examples. As a result, ALBERT consistently outperforms BERT on various downstream tasks.

StructBERT [193] and BERTje [33] also take SOP as their self-supervised learning task.

3.1.6 Others

Apart from the above tasks, there are many other auxiliary pre-training tasks designed to incorporate factual knowledge (see Section 4.1), improve cross-lingual tasks (see Section 4.2), multi-modal applications (see Section 4.3), or other specific tasks (see Section 4.4).

3.2 Taxonomy of PTMs

To clarify the relations of existing PTMs for NLP, we build a taxonomy of PTMs, which categorizes existing PTMs from four different perspectives:

1. Representation Type: According to the representation used for downstream tasks, we can divide PTMs into non-contextual and contextual models.

2. Architectures: The backbone network used by PTMs, including LSTM, Transformer encoder, Transformer decoder, and the full Transformer architecture. "Transformer" means the standard encoder-decoder architecture. "Transformer encoder" and "Transformer decoder" mean the encoder and decoder part of the standard Transformer architecture, respectively. Their difference is that the decoder part uses masked self-attention with a triangular matrix to prevent tokens from attending to their future (right) positions.

3. Pre-Training Task Types: The type of pre-training tasks used by PTMs. We have discussed them in Section 3.1.

4. Extensions: PTMs designed for various scenarios, including knowledge-enriched PTMs, multilingual or language-specific PTMs, multi-modal PTMs, domain-specific PTMs and compressed PTMs. We will particularly introduce these extensions in Section 4.

Figure 3 shows the taxonomy as well as some corresponding representative PTMs. Besides, Table 2 distinguishes some representative PTMs in more detail.

Figure 3 (excerpt): Taxonomy of PTMs with representative examples, e.g., Architectures: Transformer Enc. (BERT [36], SpanBERT [117], XLNet [209], RoBERTa [117]); Transformer Dec. (GPT [142], GPT-2 [143]); Multilingual: XLU (mBERT [36], Unicoder [68], XLM [27], XLM-R [28], MultiFit [42]); XLG (MASS [160], mBART [118], XNLG [19]).

3.3 Model Analysis

Due to the great success of PTMs, it is important to understand what kinds of knowledge are captured by them, and how to induce knowledge from them. There is a wide range of literature analyzing linguistic knowledge and world knowledge stored in pre-trained non-contextual and contextual embeddings.

3.3.1 Non-Contextual Embeddings

Static word embeddings were first probed for various kinds of knowledge. Mikolov et al. [130] found that word representations learned by neural network language models are able to capture linguistic regularities in language, and that the relationship between words can be characterized by a relation-specific vector offset. Further analogy experiments [129] demonstrated that word vectors produced by the skip-gram model can capture both syntactic and semantic word relationships, such as vec("China") − vec("Beijing") ≈ vec("Japan") − vec("Tokyo"). Besides, they found a compositionality property of word vectors; for example, vec("Germany") + vec("capital") is close to vec("Berlin"). Inspired by these works, Rubinstein et al. [151] found that distributional word representations are good at predicting taxonomic properties (e.g., dog is an animal) but fail to learn attributive properties (e.g., swan is white). Similarly, Gupta et al. [54] showed that word2vec embeddings implicitly encode referential attributes of entities. The distributed word vectors, along with a simple supervised model, can learn to predict numeric and binary attributes of entities with a reasonable degree of accuracy.

3.3.2 Contextual Embeddings

simple syntactic tasks. Besides, Tenney et al. [179] analyzed the roles of BERT's layers in different tasks and found that BERT solves tasks in a similar order to that in NLP pipelines. Furthermore, knowledge of subject-verb agreement [50] and semantic roles [44] has also been confirmed to exist in BERT. Besides, Hewitt and Manning [59], Jawahar et al. [72], and Kim et al. [85] proposed several methods to extract dependency trees and constituency trees from BERT, which proved BERT's ability to encode syntactic structure. Reif et al. [148] explored the geometry of internal representations in BERT and found some evidence that: 1) linguistic features seem to be represented in separate semantic and syntactic subspaces; 2) attention matrices contain grammatical representations; 3) BERT distinguishes word senses at a very fine level.
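Layer-wise analyses of the kind cited above typically start from the per-layer hidden states of a pre-trained encoder. The sketch below, assuming the Hugging Face transformers library and an illustrative bert-base-uncased checkpoint, shows how such representations can be extracted before a probing classifier is trained on them; it is not the setup of any particular cited study.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

inputs = tokenizer("The keys to the cabinet are on the table.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: (embedding layer, layer 1, ..., layer 12), each of shape (1, seq_len, 768)
for layer_idx, layer in enumerate(outputs.hidden_states):
    print(layer_idx, tuple(layer.shape))
# A probing classifier (e.g., for POS tags or subject-verb agreement) would then be
# trained on the frozen representations of one chosen layer.
```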
generation procedure in LAMA, Jiang et al. [74] argued that LAMA just measures a lower bound of what language models know and proposed more advanced methods to generate more efficient queries. Despite the surprising findings of LAMA, it has also been questioned by subsequent work [141, 82]. Similarly, several studies induce relational knowledge [15] and commonsense knowledge [32] from BERT for downstream tasks.

4 Extensions of PTMs

4.1 Knowledge-Enriched PTMs

PTMs usually learn universal language representations from general-purpose large-scale text corpora but lack domain-specific knowledge. Incorporating domain knowledge from external knowledge bases into PTMs has been shown to be effective. The external knowledge ranges from linguistic [94, 83, 136, 191], semantic [99], commonsense [52], and factual [214, 136, 111, 202, 195] to domain-specific knowledge [58, 111].

On the one hand, external knowledge can be injected during pre-training. Early studies [197, 217, 201, 205] focused on learning knowledge graph embeddings and word embeddings jointly. Since BERT, some auxiliary pre-training tasks have been designed to incorporate external knowledge into deep PTMs. LIBERT [94] (linguistically-informed BERT) incorporates linguistic knowledge via an additional linguistic constraint task. Ke et al. [83] integrated the sentiment polarity of each word to extend the MLM to Label-Aware MLM (LA-MLM). As a result, their proposed model, SentiLR, achieves state-of-the-art performance on several sentence- and aspect-level sentiment classification tasks. Levine et al. [99] proposed SenseBERT, which is pre-trained to predict not only the masked tokens but also their supersenses in WordNet. ERNIE (THU) [214] integrates entity embeddings pre-trained on a knowledge graph with the corresponding entity mentions in the text to enhance the text representation. Similarly, KnowBERT [136] trains BERT jointly with an entity linking model to incorporate entity representations in an end-to-end fashion. Wang et al. [195] proposed KEPLER, which jointly optimizes knowledge embedding and language modeling objectives. These works inject the structure information of a knowledge graph via entity embeddings. In contrast, K-BERT [111] explicitly injects related triples extracted from a KG into the sentence to obtain an extended tree-form input for BERT. Moreover, Xiong et al. [202] adopted entity replacement identification to encourage the model to be more aware of factual knowledge. However, most of these methods update the parameters of the PTM when injecting knowledge, which may suffer from catastrophic forgetting when injecting multiple kinds of knowledge. To address this, K-Adapter [191] injects multiple kinds of knowledge by training different adapters independently for different pre-training tasks, which allows continual knowledge infusion.

On the other hand, one can incorporate external knowledge into pre-trained models without retraining them from scratch. As an example, K-BERT [111] allows injecting factual knowledge during fine-tuning on downstream tasks. Guan et al. [52] employed the commonsense knowledge bases ConceptNet and ATOMIC to enhance GPT-2 for story generation. Yang et al. [207] proposed a knowledge-text fusion model to acquire related linguistic and factual knowledge for machine reading comprehension.

Besides, Logan IV et al. [119] and Hayashi et al. [57] extended the language model to the knowledge graph language model (KGLM) and the latent relation language model (LRLM) respectively, both of which allow prediction conditioned on a knowledge graph. These novel KG-conditioned language models show potential for pre-training.

4.2 Multilingual and Language-Specific PTMs

4.2.1 Multilingual PTMs

Learning multilingual text representations shared across languages plays an important role in many cross-lingual NLP tasks.

Cross-Lingual Language Understanding (XLU)  Most of the early works focus on learning multilingual word embeddings [45, 123, 158], which represent text from multiple languages in a single semantic space. However, these methods usually need (weak) alignment between languages.

Multilingual BERT3) (mBERT) is pre-trained by MLM with a shared vocabulary and weights on Wikipedia text from the top 104 languages. Each training sample is a monolingual document, and there are no specifically designed cross-lingual objectives nor any cross-lingual data. Even so, mBERT performs cross-lingual generalization surprisingly well [140]. K et al. [79] showed that the lexical overlap between languages plays a negligible role in cross-lingual success.

XLM [27] improves mBERT by incorporating a cross-lingual task, translation language modeling (TLM), which performs MLM on a concatenation of parallel bilingual sentence pairs. Unicoder [68] further proposes three new cross-lingual pre-training tasks, including cross-lingual word recovery, cross-lingual paraphrase classification and the cross-lingual masked language model (XMLM).
3) https://github.com/google-research/bert/blob/master/multilingual.md
XLM-RoBERTa (XLM-R) [28] is a scaled multilingual encoder pre-trained on a significantly increased amount of training data, 2.5TB of clean CommonCrawl data in 100 different languages. The pre-training task of XLM-RoBERTa is monolingual MLM only. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks, including XNLI, MLQA, and NER.

Cross-Lingual Language Generation (XLG)  Multilingual generation is a kind of task that generates text in a language different from the input language, such as machine translation and cross-lingual abstractive summarization.

Different from the PTMs for multilingual classification, the PTMs for multilingual generation usually need to pre-train both the encoder and the decoder jointly, rather than only focusing on the encoder.

MASS [160] pre-trains a Seq2Seq model with monolingual Seq2Seq MLM on multiple languages and achieves significant improvement for unsupervised NMT. XNLG [19] performs two-stage pre-training for cross-lingual natural language generation. The first stage pre-trains the encoder with monolingual MLM and Cross-Lingual MLM (XMLM) tasks. The second stage pre-trains the decoder by using monolingual DAE and Cross-Lingual Auto-Encoding (XAE) tasks while keeping the encoder fixed. Experiments show the benefit of XNLG on cross-lingual question generation and cross-lingual abstractive summarization. mBART [118], a multilingual extension of BART [100], pre-trains the encoder and decoder jointly with the Seq2Seq denoising auto-encoder (DAE) task on large-scale monolingual corpora across 25 languages. Experiments demonstrate that mBART produces significant performance gains across a wide variety of machine translation (MT) tasks.

4.2.2 Language-Specific PTMs

Although multilingual PTMs perform well on many languages, recent work has shown that PTMs trained on a single language significantly outperform the multilingual results [125, 95, 186].

For Chinese, which does not have explicit word boundaries, modeling larger-granularity [29, 37, 198] and multi-granularity [170, 171] word representations has shown great success. Kuratov and Arkhipov [92] used transfer learning techniques to adapt a multilingual PTM to a monolingual PTM for the Russian language. In addition, some monolingual PTMs have been released for different languages, such as CamemBERT [125] and FlauBERT [95] for French, FinBERT [186] for Finnish, BERTje [33] and RobBERT [35] for Dutch, and AraBERT [4] for Arabic.

4.3 Multi-Modal PTMs

Observing the success of PTMs across many NLP tasks, some research has focused on obtaining cross-modal versions of PTMs. A great majority of these models are designed for general visual and linguistic feature encoding. These models are pre-trained on huge corpora of cross-modal data, such as videos with spoken words or images with captions, incorporating extended pre-training tasks to fully utilize the multi-modal features. Typically, tasks like visual-based MLM, masked visual-feature modeling and visual-linguistic matching are widely used in multi-modal pre-training, such as in VideoBERT [165], VisualBERT [103], and ViLBERT [120].

4.3.1 Video-Text PTMs

VideoBERT [165] and CBT [164] are joint video and text models. To obtain the sequences of visual and linguistic tokens used for pre-training, the videos are pre-processed by CNN-based encoders and off-the-shelf speech recognition techniques, respectively. A single Transformer encoder is then trained on the processed data to learn the vision-language representations for downstream tasks like video captioning. Furthermore, UniViLM [122] proposes to bring in generation tasks to further pre-train the decoder used in downstream tasks.

4.3.2 Image-Text PTMs

Besides methods for video-language pre-training, several works introduce PTMs on image-text pairs, aiming at downstream tasks like visual question answering (VQA) and visual commonsense reasoning (VCR). Several proposed models adopt two separate encoders for image and text representations independently, such as ViLBERT [120] and LXMERT [175], while other methods like VisualBERT [103], B2T2 [2], VL-BERT [163], Unicoder-VL [101] and UNITER [17] propose a single-stream unified Transformer. Though these model architectures are different, similar pre-training tasks, such as MLM and image-text matching, are introduced in these approaches. To better exploit visual elements, images are converted into sequences of regions by applying RoI or bounding box retrieval techniques before being encoded by pre-trained Transformers.

4.3.3 Audio-Text PTMs

Moreover, several methods have explored the possibility of PTMs on audio-text pairs, such as SpeechBERT [22]. This work tries to build an end-to-end Speech Question Answering (SQA) model by encoding audio and text with a single Transformer encoder, which is pre-trained with MLM on speech and text corpora and fine-tuned on Question Answering.
4.4 Domain-Specific and Task-Specific PTMs

Most publicly available PTMs are trained on general-domain corpora such as Wikipedia, which limits their applications to specific domains or tasks. Recently, some studies have proposed PTMs trained on specialty corpora, such as BioBERT [98] for biomedical text, SciBERT [11] for scientific text, and ClinicalBERT [69, 3] for clinical text.

In addition to pre-training a domain-specific PTM, some work attempts to adapt available pre-trained models to target applications, such as biomedical entity normalization [73], patent classification [97], and progress notes classification and keyword extraction [176].

Some task-oriented pre-training tasks have also been proposed, such as sentiment Label-Aware MLM in SentiLR [83] for sentiment analysis, Gap Sentence Generation (GSG) [212] for text summarization, and Noisy Words Detection for disfluency detection [192].

4.5.2 Quantization

Quantization refers to the compression of higher-precision parameters into lower precision. Works from Shen et al. [156] and Zafrir et al. [211] solely focus on this area. Note that quantization often requires compatible hardware.

4.5.3 Parameter Sharing

Another well-known approach to reduce the number of parameters is parameter sharing, which is widely used in CNNs, RNNs, and the Transformer [34]. ALBERT [93] uses cross-layer parameter sharing and factorized embedding parameterization to reduce the parameters of PTMs. Although the number of parameters is greatly reduced, the training and inference time of ALBERT are even longer than those of the standard BERT. Generally, parameter sharing does not improve the computational efficiency at the inference phase.

4.5.4 Knowledge Distillation
Method                | Type             | #Layer | Loss Function*                            | Speed Up     | Params      | Source PTM | GLUE‡
BERT_BASE [36]        | Baseline         | 12     | L_MLM + L_NSP                             | -            | 110M        | -          | 79.6
BERT_LARGE [36]       | Baseline         | 24     | L_MLM + L_NSP                             | -            | 340M        | -          | 81.9
Q-BERT [156]          | Quantization     | 12     | HAWQ + GWQ                                | -            | -           | BERT_BASE  | ≈ 99% BERT
Q8BERT [211]          | Quantization     | 12     | DQ + QAT                                  | -            | -           | BERT_BASE  | ≈ 99% BERT
ALBERT§ [93]          | Param. Sharing   | 12     | L_MLM + L_SOP                             | ×5.6 ∼ 0.3   | 12 ∼ 235M   | -          | 89.4 (ensemble)
DistilBERT [152]      | Distillation     | 6      | L_KD-CE + Cos_KD + L_MLM                  | ×1.63        | 66M         | BERT_BASE  | 77.0 (dev)
TinyBERT§† [75]       | Distillation     | 4      | MSE_embed + MSE_attn + MSE_hidn + L_KD-CE | ×9.4         | 14.5M       | BERT_BASE  | 76.5
BERT-PKD [169]        | Distillation     | 3 ∼ 6  | L_KD-CE + PT_KD + L_Task                  | ×3.73 ∼ 1.64 | 45.7 ∼ 67M  | BERT_BASE  | 76.0 ∼ 80.6]
PD [183]              | Distillation     | 6      | L_KD-CE + L_Task + L_MLM                  | ×2.0         | 67.5M       | BERT_BASE  | 81.2]
MobileBERT§ [172]     | Distillation     | 24     | FMT + AT + PKT + L_KD-CE + L_MLM          | ×4.0         | 25.3M       | BERT_LARGE | 79.7
MiniLM [194]          | Distillation     | 6      | AT + AR                                   | ×1.99        | 66M         | BERT_BASE  | 81.0[
DualTrain§† [216]     | Distillation     | 12     | Dual Projection + L_MLM                   | -            | 1.8 ∼ 19.2M | BERT_BASE  | 75.8 ∼ 81.9\
BERT-of-Theseus [203] | Module Replacing | 6      | L_Task                                    | ×1.94        | 66M         | BERT_BASE  | 78.6

1 The design of this table is borrowed from [203, 150].
‡ The averaged score on 8 tasks (without WNLI) of the GLUE benchmark (see Section 7.1). Here MNLI-m and MNLI-mm are regarded as two different tasks. 'dev' indicates the result is on the dev set. 'ensemble' indicates the result is from the ensemble model.
* 'L_MLM', 'L_NSP', and 'L_SOP' indicate pre-training objectives (see Section 3.1 and Table 1). 'L_Task' means task-specific loss. 'HAWQ', 'GWQ', 'DQ', and 'QAT' indicate Hessian AWare Quantization, Group-wise Quantization, Dynamically Quantized, and Quantization-Aware Training, respectively. 'KD' means knowledge distillation. 'FMT', 'AT', and 'PKT' mean Feature Map Transfer, Attention Transfer, and Progressive Knowledge Transfer, respectively. 'AR' means self-attention value relation.
§ The dimensionality of the hidden or embedding layers is reduced.
† Uses a smaller vocabulary.
[ Generally, the F1 score is usually used as the main metric of the QQP task, but MiniLM reports accuracy, which is incomparable to other works.
Result on MNLI and SST-2 only.
] Result on the other tasks except for STS-B and CoLA.
\ Result on MRPC, MNLI, and SST-2 only.
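The L_KD-CE terms in the table above generally denote a cross-entropy/KL loss between the student's and the teacher's softened output distributions. The sketch below is one common, generic instantiation (Hinton-style soft targets), not the exact loss of any listed model; the temperature is illustrative, and in practice this term is combined with a task loss on gold labels and, for several of the listed models, with layer-wise losses.

```python
import torch.nn.functional as F

def soft_target_kd_loss(student_logits, teacher_logits, temperature=2.0):
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # the T^2 factor keeps the gradient scale comparable to the hard-label task loss
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```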
the teacher model and distilling more knowledge can bring improvement to the student model.

TinyBERT [75] performs layer-to-layer distillation with embedding outputs, hidden states, and self-attention distributions. MobileBERT [172] also performs layer-to-layer distillation, with soft target probabilities, hidden states, and self-attention distributions. MiniLM [194] distills self-attention distributions and the self-attention value relation from the teacher model.

Besides, other models distill knowledge through many other approaches. Sun et al. [169] introduced a "patient" teacher-student mechanism, and Liu et al. [113] exploited KD to improve a pre-trained multi-task deep neural network.

(3) Distillation to other structures. Generally, the structure of the student model is the same as that of the teacher model, except for a smaller layer size and a smaller hidden size. However, not only decreasing parameters but also simplifying model structures from Transformer to RNN [177] or CNN [20] can reduce the computational complexity.

4.5.5 Module Replacing

Module replacing is an interesting and simple way to reduce the model size, which replaces the large modules of the original PTM with more compact substitutes. Xu et al. [203] proposed Theseus Compression, motivated by the famous thought experiment called "Ship of Theseus", which progressively substitutes modules of the source model with modules of fewer parameters. Different from KD, Theseus Compression only requires one task-specific loss function. The compressed model, BERT-of-Theseus, is 1.94× faster while retaining more than 98% of the performance of the source model.

4.5.6 Others

In addition to reducing model sizes, there are other ways to improve the computational efficiency of PTMs in practical scenarios with limited resources. Liu et al. [112] proposed a practical speed-tunable BERT, namely FastBERT, which can dynamically reduce computational steps with a sample-wise adaptive mechanism.

5 Adapting PTMs to Downstream Tasks

Although PTMs capture general language knowledge from a large corpus, how to effectively adapt their knowledge to the downstream task is still a key problem.

5.1 Transfer Learning

Transfer learning [132] is to adapt the knowledge from a source task (or domain) to a target task (or domain). Figure 4 gives an illustration of transfer learning.

There are many types of transfer learning in NLP, such as domain adaptation, cross-lingual learning, and multi-task learning. Adapting PTMs to downstream tasks is a sequential transfer learning task, in which tasks are learned sequentially and the target task has labeled data.
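A minimal sketch of such sequential transfer, assuming the Hugging Face transformers library; the checkpoint, toy data, and hyper-parameters are placeholders rather than a recommended recipe. The PTM supplies the initialization, and the labeled target task drives fine-tuning.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts, labels = ["a great movie", "a boring movie"], [1, 0]   # stand-in for a real labeled dataset
batch = tokenizer(texts, padding=True, return_tensors="pt")
batch["labels"] = torch.tensor(labels)

model.train()
for _ in range(3):                       # a few passes of fine-tuning
    outputs = model(**batch)             # returns the task cross-entropy loss when labels are given
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Freezing the encoder parameters instead (setting requires_grad to False) turns the same setup into the feature-extraction style discussed in Section 5.2.3 below.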
NLP problems [143]. However, different pre-training tasks have their own biases and give different effects on different tasks. For example, the NSP task [36] makes a PTM understand the relationship between two sentences. Thus, the PTM can benefit downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI).

(2) The architecture of the PTM is also important for the downstream task. For example, although BERT helps with most natural language understanding tasks, it is hard to generate language with it.

(3) The data distribution of the downstream task should be approximate to that of the PTM. Currently, there are a large number of off-the-shelf PTMs, which can just as conveniently be used for various domain-specific or language-specific downstream tasks.

Therefore, given a target task, it is always a good solution to choose the PTM trained with an appropriate pre-training task, architecture, and corpus.

5.2.2 Choosing appropriate layers

Given a pre-trained deep model, different layers should capture different kinds of information, such as POS tagging, parsing, long-term dependencies, semantic roles, and coreference. For RNN-based models, Belinkov et al. [10] and Melamud et al. [127] showed that representations learned from different layers in a multi-layer LSTM encoder benefit different tasks (e.g., predicting POS tags and understanding word sense). For transformer-based PTMs, Tenney et al. [179] found that BERT represents the steps of the traditional NLP pipeline: basic

where α_l is the softmax-normalized weight for layer l and γ is a scalar to scale the vectors output by the pre-trained model. The mixed representation is fed into the task-specific model g(r_t).

5.2.3 To tune or not to tune?

Currently, there are two common ways of model transfer: feature extraction (where the pre-trained parameters are frozen), and fine-tuning (where the pre-trained parameters are unfrozen and fine-tuned).

In the feature extraction way, the pre-trained models are regarded as off-the-shelf feature extractors. Moreover, it is important to expose the internal layers as they typically encode the most transferable representations [137].

Although both of these ways can significantly benefit most NLP tasks, the feature extraction way requires a more complex task-specific architecture. Therefore, the fine-tuning way is usually more general and convenient for many different downstream tasks than the feature extraction way.

Table 4 gives some common combinations of adapting PTMs.

Table 4: Some common combinations of adapting PTMs.

Where          | FT/FE?† | PTMs
Embedding Only | FT/FE   | Word2vec [129], GloVe [133]
Top Layer      | FT      | BERT [36], RoBERTa [117]
Top Layer      | FE      | BERT§ [218, 221]
All Layers     | FE      | ELMo [135]

† FT and FE mean Fine-tuning and Feature Extraction respectively.
§ BERT used as a feature extractor.
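Returning to the layer choice of Section 5.2.2: the mixing described there, commonly written in ELMo style as r_t = γ Σ_l α_l h_t^(l) with softmax-normalized α_l, can be sketched as a small module. This is a generic sketch under that assumption, not the implementation of any cited model.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Softmax-weighted mixture of layer outputs, scaled by a learned gamma."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))   # one scalar per layer
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs):
        # layer_outputs: list of L tensors, each (batch, seq_len, hidden_size)
        alphas = torch.softmax(self.weights, dim=0)
        mixed = sum(a * h for a, h in zip(alphas, layer_outputs))
        return self.gamma * mixed          # r_t, fed into the task-specific model g(r_t)
```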
However, motivated by the fact that the progress in recent years has dramatically eroded headroom on the GLUE benchmark, a new benchmark called SuperGLUE [189] was presented. Compared to GLUE, SuperGLUE has more challenging tasks and more diverse task formats (e.g., coreference resolution and question answering).

State-of-the-art PTMs are listed in the corresponding leaderboards4) 5).

7.2 Question Answering

Question answering (QA), or the narrower concept of machine reading comprehension (MRC), is an important application in the NLP community. From easy to hard, there are three types of QA tasks: single-round extractive QA (SQuAD) [145], multi-round generative QA (CoQA) [147], and multi-hop QA (HotpotQA) [208].

BERT creatively transforms the extractive QA task into a span prediction task that predicts the starting as well as the ending position of the answer span [36]. After that, a PTM used as an encoder for predicting spans has become a competitive baseline. For extractive QA, Zhang et al. [215] proposed a retrospective reader architecture and initialized the encoder with a PTM (e.g., ALBERT). For multi-round generative QA, Ju et al. [78] proposed a "PTM+Adversarial Training+Rationale Tagging+Knowledge Distillation" model. For multi-hop QA, Tu et al. [182] proposed an interpretable "Select, Answer, and Explain" (SAE) system in which the PTM acts as the encoder in the selection module.

Generally, encoder parameters in the proposed QA models are initialized through a PTM, and the other parameters are randomly initialized.
4) https://gluebenchmark.com/
5) https://super.gluebenchmark.com/
6) https://rajpurkar.github.io/SQuAD-explorer/
7) https://stanfordnlp.github.io/coqa/
8) https://hotpotqa.github.io/
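To illustrate the span-prediction formulation described in Section 7.2, here is a minimal sketch that places a start/end pointer head on top of a pre-trained encoder; the checkpoint name, the toy question–passage pair, and the single linear head are illustrative assumptions rather than the architecture of any specific cited system.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SpanQAModel(nn.Module):
    """Predict the start and end positions of the answer span, BERT-style."""
    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)   # encoder initialized from a PTM
        hidden = self.encoder.config.hidden_size
        self.span_head = nn.Linear(hidden, 2)                    # randomly initialized, trained on QA data

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        logits = self.span_head(out.last_hidden_state)            # [batch, seq_len, 2]
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SpanQAModel()

# Encode "[CLS] question [SEP] passage [SEP]" and pick the most likely span endpoints.
enc = tokenizer("Who proposed the SAE system?",
                "Tu et al. proposed the Select, Answer and Explain system.",
                return_tensors="pt")
start_logits, end_logits = model(**enc)
start = start_logits.argmax(dim=-1)
end = end_logits.argmax(dim=-1)
```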
input tokens by a pre-trained BERT and use the output of the last layer as extra memory. Then, the NMT model can access the memory via an extra attention module in each layer of both the encoder and the decoder. They show a noticeable improvement in supervised, semi-supervised, and unsupervised MT.

Instead of only pre-training the encoder, MASS (Masked Sequence-to-Sequence Pre-Training) [160] utilizes the Seq2Seq MLM to pre-train the encoder and decoder jointly. In experiments, this approach surpasses the BERT-style pre-training proposed by Conneau and Lample [27] on both unsupervised MT and English-Romanian supervised MT. Different from MASS, mBART [118], a multilingual extension of BART [100], pre-trains the encoder and decoder jointly with the Seq2Seq denoising auto-encoder (DAE) task on large-scale monolingual corpora across 25 languages. Experiments demonstrated that mBART significantly improves both supervised and unsupervised machine translation at both the sentence level and the document level.
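As a toy illustration of the Seq2Seq MLM idea behind the encoder–decoder pre-training methods above, the helper below builds one training pair by masking a contiguous span on the encoder side and using that span as the decoder target; the mask token, the single-span choice, and the 50% masking ratio are illustrative assumptions, not the exact recipe of MASS or mBART.

```python
import random

def make_seq2seq_mlm_pair(tokens, mask_token="[MASK]", span_ratio=0.5):
    """Mask one contiguous span on the encoder side; the decoder learns to generate it."""
    span_len = max(1, int(len(tokens) * span_ratio))
    start = random.randrange(0, len(tokens) - span_len + 1)
    target = tokens[start:start + span_len]                    # decoder target (the masked fragment)
    encoder_input = (tokens[:start]
                     + [mask_token] * span_len
                     + tokens[start + span_len:])              # corrupted source sequence
    return encoder_input, target

src = "machine translation benefits from joint pre-training".split()
enc_in, dec_out = make_seq2seq_mlm_pair(src)
```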
7.6 Summarization

Summarization, which aims at producing a shorter text that preserves most of the meaning of a longer text, has attracted the attention of the NLP community in recent years. The task has improved significantly since the widespread use of PTMs. Zhong et al. [218] introduced transferable knowledge (e.g., BERT) for summarization and surpassed previous models. Zhang et al. [213] pre-trained a document-level model that predicts sentences instead of words, and then applied it to downstream tasks such as summarization. More elaborately, Zhang et al. [212] designed a Gap Sentence Generation (GSG) task for pre-training, whose objective involves generating summary-like text from the input. Furthermore, Liu and Lapata [116] proposed BERTSUM, which includes a novel document-level encoder and a general framework for both extractive and abstractive summarization. In the encoder frame, BERTSUM extends BERT by inserting multiple [CLS] tokens to learn the sentence representations. For extractive summarization, BERTSUM stacks several inter-sentence Transformer layers. For abstractive summarization, BERTSUM adopts a two-staged fine-tuning approach with a new fine-tuning schedule. Zhong et al. [219] proposed a novel summary-level framework, MATCHSUM, and conceptualized extractive summarization as a semantic text matching problem. They proposed a Siamese-BERT architecture to compute the similarity between the source document and the candidate summary and achieved a state-of-the-art result on CNN/DailyMail (44.41 ROUGE-1) using only the base version of BERT.
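In the spirit of the [CLS]-per-sentence encoding described for extractive summarization above, the sketch below prepends a [CLS] token to every sentence, encodes the concatenation with a pre-trained BERT, and scores each sentence from its [CLS] vector; the checkpoint, the single linear scorer, and the simplified segment handling are assumptions for illustration, not the exact BERTSUM architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
scorer = nn.Linear(encoder.config.hidden_size, 1)    # per-sentence extraction score

sentences = ["PTMs have improved summarization.",
             "A classification token is prepended to every sentence.",
             "Extractive models then score each sentence."]

# One [CLS] (and [SEP]) per sentence, so every sentence gets its own summary vector.
text = "".join(f"[CLS] {s} [SEP] " for s in sentences)
enc = tokenizer(text, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():                                 # encoder used as a feature extractor here
    hidden = encoder(**enc).last_hidden_state         # [1, seq_len, hidden]

cls_positions = (enc["input_ids"][0] == tokenizer.cls_token_id).nonzero(as_tuple=True)[0]
sentence_vecs = hidden[0, cls_positions]              # one vector per sentence
scores = scorer(sentence_vecs).squeeze(-1)            # higher = more summary-worthy
```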
7.7 Adversarial Attacks and Defenses

Deep neural models are vulnerable to adversarial examples that can mislead a model into producing a specific wrong prediction with imperceptible perturbations of the original input. In CV, adversarial attacks and defenses have been widely studied. However, this is still challenging for text due to the discrete nature of language. Adversarial samples for text need to satisfy two qualities: (1) imperceptible to human judges yet misleading to neural models; (2) fluent in grammar and semantically consistent with the original inputs. Jin et al. [76] successfully attacked fine-tuned BERT on text classification and textual entailment with adversarial examples. Wallace et al. [188] defined universal adversarial triggers that can induce a model to produce a specific-purpose prediction when concatenated to any input. Some triggers can even cause the GPT-2 model to generate racist text. Sun et al. [168] showed that BERT is not robust to misspellings.

PTMs also have great potential for generating adversarial samples. Li et al. [102] proposed BERT-Attack, a BERT-based high-quality and effective attacker. They turned BERT against another BERT fine-tuned on downstream tasks and successfully misguided the target model into incorrect predictions, outperforming state-of-the-art attack strategies in both success rate and perturbation percentage, while the generated adversarial samples remain fluent and semantically preserved.

Besides, adversarial defenses for PTMs are also promising: they improve the robustness of PTMs and make them immune to adversarial attacks. Adversarial training aims to improve generalization by minimizing the maximal risk under label-preserving perturbations in the embedding space. Recent work [220, 115] showed that adversarial pre-training or fine-tuning can improve both the generalization and the robustness of PTMs for NLP.
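A minimal sketch of adversarial fine-tuning in embedding space, in the spirit of the label-preserving perturbations discussed above: perturb the shared word-embedding matrix along the normalized gradient of the task loss, add the adversarial loss, and restore the embeddings. The FGM-style single step, the step size, and the assumption of a classification model exposing a .logits output are illustrative, not the exact procedures of [220, 115].

```python
import torch

def adversarial_step(model, loss_fn, batch, labels, epsilon=1e-2):
    """One FGM-style, label-preserving perturbation in embedding space (sketch).

    Assumes the caller zeroes gradients before this call and runs optimizer.step() afterwards.
    """
    emb = model.get_input_embeddings().weight       # shared word-embedding matrix

    # 1) Clean forward/backward pass to obtain gradients on the embeddings.
    clean_loss = loss_fn(model(**batch).logits, labels)
    clean_loss.backward()

    # 2) Perturb the embeddings along the normalized gradient direction.
    grad = emb.grad.detach()
    delta = epsilon * grad / (grad.norm() + 1e-12)
    emb.data.add_(delta)

    # 3) Adversarial forward/backward pass, then restore the embeddings.
    adv_loss = loss_fn(model(**batch).logits, labels)
    adv_loss.backward()
    emb.data.sub_(delta)

    return clean_loss.item(), adv_loss.item()
```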
8 Future Directions

Though PTMs have proven their power for various NLP tasks, challenges still exist due to the complexity of language. In this section, we suggest five future directions of PTMs.

(1) Upper Bound of PTMs Currently, PTMs have not yet reached their upper bound. Most current PTMs can be further improved by more training steps and larger corpora. The state of the art in NLP can be further advanced by increasing the depth of models, such as Megatron-LM [157] (8.3 billion parameters, 72 Transformer layers with a hidden size of 3072 and 32 attention heads) and Turing-NLG 9) (17 billion parameters, 78 Transformer layers with a hidden size of 4256 and 28 attention heads).
9) https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/
General-purpose PTMs are always our pursuit for learning the intrinsic universal knowledge of languages (and even world knowledge). However, such PTMs usually need a deeper architecture, a larger corpus, and challenging pre-training tasks, which further result in higher training costs. Moreover, training huge models is itself a challenging problem, which needs more sophisticated and efficient training techniques such as distributed training, mixed precision, gradient accumulation, etc. Therefore, a more practical direction is to design more efficient model architectures, self-supervised pre-training tasks, optimizers, and training skills using existing hardware and software. ELECTRA [24] is a good solution in this direction.

(2) Architecture of PTMs The Transformer has been proved to be an effective architecture for pre-training. However, the main limitation of the Transformer is its computational complexity, which is quadratic in the input length. Limited by the memory of GPUs, most current PTMs cannot deal with sequences longer than 512 tokens. Breaking this limit requires improving the architecture of the Transformer, such as Transformer-XL [31]. Therefore, searching for more efficient model architectures for PTMs is important for capturing longer-range contextual information. The design of deep architectures is challenging, and we may seek help from automatic methods such as neural architecture search (NAS) [223].

(3) Task-oriented Pre-training and Model Compression In practice, different downstream tasks require different abilities of PTMs. The discrepancy between PTMs and downstream tasks usually lies in two aspects: model architecture and data distribution. A larger discrepancy may make the benefit of PTMs insignificant. For example, text generation usually needs a specific task to pre-train both the encoder and decoder, while text matching needs pre-training tasks designed for sentence pairs. Besides, although larger PTMs usually lead to better performance, a practical problem is how to leverage these huge PTMs in special scenarios, such as low-capacity devices and low-latency applications. Therefore, we can carefully design specific model architectures and pre-training tasks for downstream tasks, or extract partial task-specific knowledge from existing PTMs. Instead of training task-oriented PTMs from scratch, we can teach them with existing general-purpose PTMs by using techniques such as model compression (see Section 4.5). Although model compression is widely studied for CNNs in CV [18], compression of PTMs for NLP is just beginning. The fully-connected structure of the Transformer also makes model compression more challenging.

(4) Knowledge Transfer Beyond Fine-tuning Currently, fine-tuning is the dominant method for transferring PTMs' knowledge to downstream tasks, but one deficiency is its parameter inefficiency: every downstream task has its own fine-tuned parameters. An improved solution is to fix the original parameters of PTMs and add small fine-tunable adaptation modules for each specific task [162, 66], as sketched below. Thus, a shared PTM can serve multiple downstream tasks. Indeed, mining knowledge from PTMs can be more flexible, for example through feature extraction, knowledge distillation [210], data augmentation [199, 91], and using PTMs as external knowledge [138]. More efficient methods are expected.
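To illustrate direction (4), here is a minimal bottleneck adapter in the style of the adaptation modules cited above [162, 66]: the PTM's own weights stay frozen and only the small adapter (and the task head) are trained for each downstream task, so one shared PTM can serve many tasks. The bottleneck width, the residual placement, and the choice to attach the adapter to the encoder output (rather than inside every Transformer layer) are simplifying assumptions.

```python
import torch.nn as nn
from transformers import BertModel

class Adapter(nn.Module):
    """Small bottleneck module added on top of a frozen PTM (illustrative sketch)."""
    def __init__(self, hidden_size, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection: at initialization the adapter barely changes the PTM output.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

class AdapterClassifier(nn.Module):
    def __init__(self, num_labels, bottleneck=64):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        for p in self.encoder.parameters():            # the shared PTM stays frozen
            p.requires_grad = False
        hidden = self.encoder.config.hidden_size
        self.adapter = Adapter(hidden, bottleneck)      # task-specific, trainable
        self.head = nn.Linear(hidden, num_labels)       # task-specific, trainable

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(self.adapter(h)[:, 0])         # classify from the [CLS] position
```

Because each task stores only its own adapter and head parameters, the per-task storage cost is a small fraction of a fully fine-tuned copy of the PTM.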
(5) Interpretability and Reliability of PTMs Although PTMs reach impressive performance, their deep non-linear architecture makes the procedure of decision-making highly non-transparent. Recently, explainable artificial intelligence (XAI) [5] has become a hotspot in the general AI community. Unlike CNNs for images, interpreting PTMs is harder due to the complexities of both the Transformer-like architecture and language. Extensive efforts (see Section 3.3) have been made to analyze the linguistic and world knowledge included in PTMs, which helps us understand these PTMs with some degree of transparency. However, much work on model analysis depends on the attention mechanism, and the effectiveness of attention for interpretability is still controversial [71, 155].

Besides, PTMs are also vulnerable to adversarial attacks (see Section 7.7). The reliability of PTMs is becoming an issue of great concern with their extensive use in production systems. Studies of adversarial attacks against PTMs help us understand their capabilities by fully exposing their vulnerabilities. Adversarial defenses for PTMs are also promising, as they improve the robustness of PTMs and make them immune to adversarial attacks.

Overall, as key components of many NLP applications, the interpretability and reliability of PTMs remain to be explored further in many respects, which helps us understand how PTMs work and provides a guide for better usage and further improvement.

9 Conclusion

In this survey, we conduct a comprehensive overview of PTMs for NLP, including background knowledge, model architecture, pre-training tasks, various extensions, adaption approaches, related resources, and applications. Based on current PTMs, we propose a new taxonomy of PTMs from four different perspectives. We also suggest several possible future research directions for PTMs.
[28] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116, 2019.
[29] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. Pre-training with whole word masking for Chinese BERT. arXiv preprint arXiv:1906.08101, 2019.
[30] Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In NeurIPS, pages 3079–3087, 2015.
[31] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL, pages 2978–2988, 2019.
[32] Joe Davison, Joshua Feldman, and Alexander M. Rush. Commonsense knowledge mining from pretrained models. In EMNLP-IJCNLP, pages 1173–1178, 2019.
[33] Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582, 2019.
[34] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. In ICLR, 2019.
[35] Pieter Delobelle, Thomas Winters, and Bettina Berendt. RobBERT: a Dutch RoBERTa-based language model. arXiv preprint arXiv:2001.06286, 2020.
[36] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[37] Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, and Yonggang Wang. ZEN: pre-training Chinese text encoder enhanced by n-gram representations. arXiv preprint arXiv:1911.00720, 2019.
[38] Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Hannaneh Hajishirzi, and Noah Smith. Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305, 2020.
[39] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. In NeurIPS, pages 13042–13054, 2019.
[40] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. HAWQ: Hessian aware quantization of neural networks with mixed-precision. In ICCV, pages 293–302, 2019.
[41] Sergey Edunov, Alexei Baevski, and Michael Auli. Pre-trained language model representations for language generation. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, NAACL-HLT, pages 4052–4059, 2019.
[42] Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, Marcin Kadras, Sylvain Gugger, and Jeremy Howard. MultiFiT: Efficient multi-lingual language model fine-tuning. In EMNLP-IJCNLP, pages 5701–5706, 2019.
[43] Dumitru Erhan, Yoshua Bengio, Aaron C. Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? J. Mach. Learn. Res., 11:625–660, 2010.
[44] Allyson Ettinger. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. TACL, 8:34–48, 2020.
[45] Manaal Faruqui and Chris Dyer. Improving vector space word representations using multilingual correlation. In EACL, pages 462–471, 2014.
[46] Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad, and Preslav Nakov. Compressing large-scale transformer-based models: A case study on BERT. arXiv preprint arXiv:2002.11985, 2020.
[47] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. 2017.
[48] Siddhant Garg, Thuy Vu, and Alessandro Moschitti. TANDA: Transfer and adapt pre-trained transformer models for answer sentence selection. In AAAI, 2019.
[49] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. In ICML, pages 1243–1252, 2017.
[50] Yoav Goldberg. Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287, 2019.
[51] Mitchell A Gordon, Kevin Duh, and Nicholas Andrews. Compressing BERT: Studying the effects of weight pruning on transfer learning. arXiv preprint arXiv:2002.08307, 2020.
[52] Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. A knowledge-enhanced pretraining model for commonsense story generation. arXiv preprint arXiv:2001.05139, 2020.
[53] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xiangyang Xue, and Zheng Zhang. Star-Transformer. In NAACL-HLT, pages 1315–1325, 2019.
[54] Abhijeet Gupta, Gemma Boleda, Marco Baroni, and Sebastian Padó. Distributional vectors encode referential attributes. In EMNLP, pages 12–21, 2015.
[55] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In AISTATS, pages 297–304, 2010.
[56] Kai Hakala and Sampo Pyysalo. Biomedical named entity recognition with multilingual BERT. In BioNLP Open Shared Tasks@EMNLP, pages 56–61, 2019.
[57] Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham Neubig. Latent relation language models. In AAAI, 2019.
[58] Bin He, Di Zhou, Jinghui Xiao, Xin Jiang, Qun Liu, Nicholas Jing Yuan, and Tong Xu. Integrating graph contextualized knowledge into pre-trained language models. arXiv preprint arXiv:1912.00147, 2019.
[59] John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In NAACL-HLT, pages 4129–4138, 2019.
[60] GE Hinton, JL McClelland, and DE Rumelhart. Distributed representations. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations, pages 77–109. 1986.
[61] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[62] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[63] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2019.
[64] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
[65] Benjamin Hoover, Hendrik Strobelt, and Sebastian Gehrmann. exBERT: A visual analysis tool to explore learned representations in transformers models. arXiv preprint arXiv:1910.05276, 2019.
[66] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In ICML, pages 2790–2799, 2019.
[67] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In ACL, pages 328–339, 2018.
[68] Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, and Ming Zhou. Unicoder: A universal language encoder by pre-training with multiple cross-lingual tasks. In EMNLP-IJCNLP, pages 2485–2494, 2019.
[69] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342, 2019.
[70] Kenji Imamura and Eiichiro Sumita. Recycling a pre-trained BERT encoder for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, November 2019.
[71] Sarthak Jain and Byron C Wallace. Attention is not explanation. In NAACL-HLT, pages 3543–3556, 2019.
[72] Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. What does BERT learn about the structure of language? In ACL, pages 3651–3657, 2019.
[73] Zongcheng Ji, Qiang Wei, and Hua Xu. BERT-based ranking for biomedical entity normalization. arXiv preprint arXiv:1908.03548, 2019.
[74] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. How can we know what language models know? arXiv preprint arXiv:1911.12543, 2019.
[75] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351, 2019.
[76] Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is BERT really robust? Natural language attack on text classification and entailment. In AAAI, 2019.
[77] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77, 2019.
[78] Ying Ju, Fubang Zhao, Shijie Chen, Bowen Zheng, Xuefeng Yang, and Yunfeng Liu. Technical report on conversational question answering. arXiv preprint arXiv:1909.10772, 2019.
[79] Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth. Cross-lingual ability of multilingual BERT: An empirical study. In ICLR, 2020.
[80] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In ACL, 2014.
[81] Akbar Karimi, Leonardo Rossi, Andrea Prati, and Katharina Full. Adversarial training for aspect-based sentiment analysis with BERT. arXiv preprint arXiv:2001.11316, 2020.
[82] Nora Kassner and Hinrich Schütze. Negated LAMA: birds cannot fly. arXiv preprint arXiv:1911.03343, 2019.
[83] Pei Ke, Haozhe Ji, Siyang Liu, Xiaoyan Zhu, and Minlie Huang. SentiLR: Linguistic knowledge enhanced language representation for sentiment analysis. arXiv preprint arXiv:1911.02493, 2019.
[84] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.
[85] Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang goo Lee. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In ICLR, 2020.
[86] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, pages 1746–1751, 2014.
[87] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In AAAI, 2016.
[88] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[89] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In NeurIPS, pages 3294–3302, 2015.
[90] Lingpeng Kong, Cyprien de Masson d'Autume, Lei Yu, Wang Ling, Zihang Dai, and Dani Yogatama. A mutual information maximization perspective of language representation learning. In ICLR, 2019.
[91] Varun Kumar, Ashutosh Choudhary, and Eunah Cho. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245, 2020.
[92] Yuri Kuratov and Mikhail Arkhipov. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv preprint arXiv:1905.07213, 2019.
[93] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR, 2020.
[94] Anne Lauscher, Ivan Vulic, Edoardo Maria Ponti, Anna Korhonen, and Goran Glavas. Informing unsupervised pre-training with external linguistic knowledge. arXiv preprint arXiv:1909.02339, 2019.
[95] Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. FlauBERT: Unsupervised language model pre-training for French. arXiv preprint arXiv:1912.05372, 2019.
[96] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.
[97] Jieh-Sheng Lee and Jieh Hsiang. PatentBERT: Patent classification with fine-tuning a pre-trained BERT model. arXiv preprint arXiv:1906.02124, 2019.
[98] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2019.
[99] Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. SenseBERT: Driving some sense into BERT. arXiv preprint arXiv:1908.05646, 2019.
[100] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
[101] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.
[102] Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT-ATTACK: Adversarial attack against BERT using BERT. arXiv preprint arXiv:2004.09984, 2020.
[103] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[104] Xiang Lisa Li and Jason Eisner. Specializing word embeddings (for parsing) by information bottleneck. In EMNLP-IJCNLP, pages 2744–2754, 2019.
[105] Xin Li, Lidong Bing, Wenxuan Zhang, and Wai Lam. Exploiting BERT for end-to-end aspect-based sentiment analysis. In W-NUT@EMNLP, 2019.
[106] Zhongyang Li, Xiao Ding, and Ting Liu. Story ending prediction by transferable BERT. In IJCAI, pages 1800–1806, 2019.
[107] Liyuan Liu, Xiang Ren, Jingbo Shang, Xiaotao Gu, Jian Peng, and Jiawei Han. Efficient contextualized representation: Language model pruning for sequence labeling. In EMNLP, pages 1215–1225, 2018.
[108] Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In NAACL-HLT, pages 1073–1094, 2019.
[109] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. In IJCAI, 2016.
[110] Qi Liu, Matt J Kusner, and Phil Blunsom. A survey on contextual embeddings. arXiv preprint arXiv:2003.07278, 2020.
[111] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. K-BERT: Enabling language representation with knowledge graph. In AAAI, 2019.
[112] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. In ACL, 2020.
[113] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482, 2019.
[114] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In ACL, 2019.
[115] Xiulei Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020.
[116] Yang Liu and Mirella Lapata. Text summarization with pre-trained encoders. In EMNLP-IJCNLP, 2019.
[117] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[118] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210, 2020.
[119] Robert L. Logan IV, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. Barack's wife Hillary: Using knowledge graphs for fact-aware language modeling. In ACL, 2019.
[120] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pages 13–23, 2019.
[121] Wenhao Lu, Jian Jiao, and Ruofei Zhang. TwinBERT: Distilling knowledge to twin-structured BERT models for efficient retrieval. arXiv preprint arXiv:2002.06275, 2020.
[122] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Xilin Chen, and Ming Zhou. UniViLM: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
[123] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159, 2015.
[124] Diego Marcheggiani, Joost Bastings, and Ivan Titov. Exploiting semantics in neural machine translation with graph convolutional networks. In NAACL-HLT, pages 486–492, 2018.
[125] Louis Martin, Benjamin Müller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894, 2019.
[126] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In NeurIPS, 2017.
[127] Oren Melamud, Jacob Goldberger, and Ido Dagan. Context2Vec: Learning generic context embedding with bidirectional LSTM. In CoNLL, pages 51–61, 2016.
[128] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In NeurIPS, pages 14014–14024, 2019.
[129] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS, 2013.
[130] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.
[131] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In NeurIPS, pages 2265–2273, 2013.
[132] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.
[133] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[134] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. In ACL, pages 1756–1765, 2017.
[135] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
[136] Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. Knowledge enhanced contextual word representations. In EMNLP-IJCNLP, 2019.
[137] Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. To tune or not to tune? Adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP, RepL4NLP@ACL 2019, Florence, Italy, August 2, 2019, pages 7–14, 2019.
[138] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. Language models as knowledge bases? In EMNLP-IJCNLP, pages 2463–2473, 2019.
[139] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088, 2018.
[140] Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual BERT? In ACL, pages 4996–5001, 2019.
[141] Nina Pörner, Ulli Waltinger, and Hinrich Schütze. BERT is not a knowledge base (yet): Factual knowledge vs. name-based reasoning in unsupervised QA. CoRR, abs/1911.03681, 2019.
[142] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/languageunderstandingpaper.pdf.
[143] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
[144] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[145] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh, editors, EMNLP, pages 2383–2392, 2016.
[146] Prajit Ramachandran, Peter J Liu, and Quoc Le. Unsupervised pretraining for sequence to sequence learning. In EMNLP, pages 383–391, 2017.
[147] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. TACL, 7:249–266, 2019.
[148] Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B Viegas, Andy Coenen, Adam Pearce, and Been Kim. Visualizing and measuring the geometry of BERT. In NeurIPS, pages 8592–8600, 2019.
[149] Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. arXiv preprint arXiv:1908.11860, 2019.
[150] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327, 2020.
[151] Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari Rappoport. How well do distributional models capture different types of semantic knowledge? In ACL, pages 726–730, 2015.
[152] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[153] Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In ICML, pages 5628–5637, 2019.
[154] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
[155] Sofia Serrano and Noah A Smith. Is attention interpretable? In ACL, pages 2931–2951, 2019.
[156] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI, 2020.
[157] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053, 2019.
[158] Karan Singla, Doğan Can, and Shrikanth Narayanan. A multi-task approach to learning multilingual representations. In ACL, pages 214–220, 2018.
[159] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642. ACL, 2013.
[160] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: masked sequence to sequence pre-training for language generation. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936, 2019.
[161] Youwei Song, Jiahai Wang, Zhiwei Liang, Zhiyue Liu, and Tao Jiang. Utilizing BERT intermediate layers for aspect based sentiment analysis and natural language inference. arXiv preprint arXiv:2002.04815, 2020.
[162] Asa Cooper Stickland and Iain Murray. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. In ICML, pages 5986–5995, 2019.
[163] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In ICLR, 2020.
[164] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019.
[165] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, pages 7463–7472. IEEE, 2019.
[166] Chi Sun, Luyao Huang, and Xipeng Qiu. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In NAACL-HLT, 2019.
[167] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206, 2019.
[168] Lichao Sun, Kazuma Hashimoto, Wenpeng Yin, Akari Asai, Jia Li, Philip Yu, and Caiming Xiong. Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT. arXiv preprint arXiv:2003.04985, 2020.
[169] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. In EMNLP-IJCNLP, pages 4323–4332, 2019.
[170] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
[171] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE 2.0: A continual pre-training framework for language understanding. In AAAI, 2019.
[172] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020.
[173] Ilya Sutskever, Oriol Vinyals, and Quoc VV Le. Sequence to sequence learning with neural networks. In NeurIPS, pages 3104–3112, 2014.
[174] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In ACL, pages 1556–1566, 2015.
[175] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP, pages 5099–5110, 2019.
[176] Matthew Tang, Priyanka Gandhi, Md Ahsanul Kabir, Christopher Zou, Jordyn Blakey, and Xiao Luo. Progress notes classification and keyword extraction using attention-based deep learning models with BERT. arXiv preprint arXiv:1910.05786, 2019.
[177] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136, 2019.
[178] Wilson L. Taylor. "Cloze procedure": A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.
[179] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, ACL, pages 4593–4601, 2019.
[180] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. What do you learn from context? Probing for sentence structure in contextualized word representations. In ICLR, 2019.
[181] Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. Small and practical BERT models for sequence labeling. In EMNLP-IJCNLP, pages 3632–3636, 2019.
[182] Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bowen Zhou. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In AAAI, 2020.
[183] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962, 2019.
[184] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[185] Jesse Vig. A multiscale visualization of attention in the transformer model. In ACL, 2019.
[186] Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076, 2019.
[187] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, pages 5797–5808, 2019.
[188] Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP-IJCNLP, pages 2153–2162, 2019.
[189] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, pages 3261–3275, 2019.
[190] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019.
[191] Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. K-Adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808, 2020.
[192] Shaolei Wang, Wanxiang Che, Qi Liu, Pengda Qin, Ting Liu, and William Yang Wang. Multi-task self-supervised learning for disfluency detection. In AAAI, 2019.
[193] Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Liwei Peng, and Luo Si. StructBERT: Incorporating language structures into pre-training for deep language understanding. In ICLR, 2020.
[194] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957, 2020.
[195] Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and Jian Tang. KEPLER: A unified model for knowledge embedding and pre-trained language representation. arXiv preprint arXiv:1911.06136, 2019.
[196] Yuxuan Wang, Yutai Hou, Wanxiang Che, and Ting Liu. From static to dynamic word representations: a survey. International Journal of Machine Learning and Cybernetics, pages 1–20, 2020.
[197] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph and text jointly embedding. In EMNLP, pages 1591–1601, 2014.
[198] Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, and Qun Liu. NEZHA: Neural contextualized representation for Chinese language understanding. arXiv preprint arXiv:1909.00204, 2019.
[199] Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. Conditional BERT contextual augmentation. In International Conference on Computational Science, pages 84–95, 2019.
[200] Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. "Mask and infill": Applying masked language model to sentiment transfer. In IJCAI, 2019.
[201] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. Representation learning of knowledge graphs with entity descriptions. In IJCAI, 2016.
[202] Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In ICLR, 2020.
[203] Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. BERT-of-Theseus: Compressing BERT by progressive module replacing. arXiv preprint arXiv:2002.02925, 2020.