
Invited Review

Pre-trained Models for Natural Language Processing: A Survey


Xipeng Qiu* , Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai & Xuanjing Huang
School of Computer Science, Fudan University, Shanghai 200433, China;
Shanghai Key Laboratory of Intelligent Information Processing, Shanghai 200433, China

arXiv:2003.08271v3 [cs.CL] 24 Apr 2020

Recently, the emergence of pre-trained models (PTMs)∗ has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy from four different perspectives. Next, we describe how to adapt the knowledge of PTMs to downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is intended to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.

Keywords: Deep Learning, Neural Network, Natural Language Processing, Pre-trained Model, Distributed Representation, Word Embedding, Self-Supervised Learning, Language Modelling

1 Introduction

With the development of deep learning, various neural networks have been widely used to solve Natural Language Processing (NLP) tasks, such as convolutional neural networks (CNNs) [80, 86, 49], recurrent neural networks (RNNs) [173, 109], graph-based neural networks (GNNs) [159, 174, 124] and attention mechanisms [7, 184]. One of the advantages of these neural models is their ability to alleviate the feature engineering problem. Non-neural NLP methods usually rely heavily on discrete handcrafted features, while neural methods usually use low-dimensional and dense vectors (aka. distributed representations) to implicitly represent the syntactic or semantic features of the language. These representations are learned in specific NLP tasks. Therefore, neural methods make it easy for people to develop various NLP systems.

Despite the success of neural models for NLP tasks, the performance improvement may be less significant than in the Computer Vision (CV) field. The main reason is that current datasets for most supervised NLP tasks are rather small (except machine translation). Deep neural networks usually have a large number of parameters, which makes them overfit on these small training data and fail to generalize well in practice. Therefore, the early neural models for many NLP tasks were relatively shallow and usually consisted of only 1~3 neural layers.

Recently, substantial work has shown that pre-trained models (PTMs) on a large corpus can learn universal language representations, which are beneficial for downstream NLP tasks and can avoid training a new model from scratch. With the development of computational power, the emergence of deep models (i.e., the Transformer [184]), and the constant enhancement of training skills, the architecture of PTMs has advanced from shallow to deep. The first-generation PTMs aim to learn good word embeddings. Since these models themselves are no longer needed by downstream tasks, they are usually very shallow for computational efficiency, such as Skip-Gram [129] and GloVe [133].

* Corresponding author (email: [email protected])
∗ PTMs are also known as pre-trained language models (PLMs). In this survey, we use PTMs for NLP instead of PLMs to avoid confusion with the narrow concept of statistical (or probabilistic) language models.
Although these pre-trained embeddings can capture semantic meanings of words, they are context-free and fail to capture higher-level concepts in context, such as polysemous disambiguation, syntactic structures, semantic roles, and anaphora. The second-generation PTMs focus on learning contextual word embeddings, such as CoVe [126], ELMo [135], OpenAI GPT [142] and BERT [36]. These learned encoders are still needed by downstream tasks to represent words in context. Besides, various pre-training tasks are also proposed to learn PTMs for different purposes.

The contributions of this survey can be summarized as follows:

1. Comprehensive review. We provide a comprehensive review of PTMs for NLP, including background knowledge, model architecture, pre-training tasks, various extensions, adaption approaches, and applications.

2. New taxonomy. We propose a taxonomy of PTMs for NLP, which categorizes existing PTMs from four different perspectives: 1) representation type, 2) model architecture, 3) type of pre-training task, and 4) extensions for specific types of scenarios.

3. Abundant resources. We collect abundant resources on PTMs, including open-source implementations of PTMs, visualization tools, corpora, and paper lists.

4. Future directions. We discuss and analyze the limitations of existing PTMs. Also, we suggest possible future research directions.

The rest of the survey is organized as follows. Section 2 outlines the background concepts and commonly used notations of PTMs. Section 3 gives a brief overview of PTMs and clarifies the categorization of PTMs. Section 4 provides extensions of PTMs. Section 5 discusses how to transfer the knowledge of PTMs to downstream tasks. Section 6 gives the related resources on PTMs. Section 7 presents a collection of applications across various NLP tasks. Section 8 discusses the current challenges and suggests future directions. Section 9 summarizes the paper.

2 Background

2.1 Language Representation Learning

As suggested by Bengio et al. [13], a good representation should express general-purpose priors that are not task-specific but would be likely to be useful for a learning machine to solve AI-tasks. When it comes to language, a good representation should capture the implicit linguistic rules and common sense knowledge hiding in text data, such as lexical meanings, syntactic structures, semantic roles, and even pragmatics.

The core idea of distributed representation is to describe the meaning of a piece of text by low-dimensional real-valued vectors. Each dimension of the vector has no corresponding sense, while the whole represents a concrete concept. Figure 1 illustrates the generic neural architecture for NLP. There are two kinds of word embeddings: non-contextual and contextual embeddings. The difference between them is whether the embedding for a word dynamically changes according to the context it appears in.

Figure 1: Generic Neural Architecture for NLP. (A task-specific model is built on contextual embeddings h_1, ..., h_T, which a contextual encoder produces from the non-contextual embeddings e_{x_1}, ..., e_{x_T}.)

Non-contextual Embeddings
The first step of representing language is to map discrete language symbols into a distributed embedding space. Formally, for each word (or sub-word) x in a vocabulary V, we map it to a vector e_x \in R^{D_e} with a lookup table E \in R^{D_e \times |V|}, where D_e is a hyper-parameter indicating the dimension of token embeddings. These embeddings are trained on task data along with other model parameters.

There are two main limitations to this kind of embedding. The first issue is that the embeddings are static. The embedding for a word is always the same regardless of its context. Therefore, these non-contextual embeddings fail to model polysemous words. The second issue is the out-of-vocabulary problem. To tackle this problem, character-level word representations or sub-word representations are widely used in many NLP tasks, such as CharCNN [87], FastText [14] and Byte-Pair Encoding (BPE) [154].

Contextual Embeddings
To address the issue of polysemy and the context-dependent nature of words, we need to distinguish the semantics of words in different contexts. Given a text x_1, x_2, ..., x_T where each token x_t \in V is a word or sub-word, the contextual representation of x_t depends on the whole text:

[h_1, h_2, \dots, h_T] = f_{\text{enc}}(x_1, x_2, \dots, x_T),    (1)

where f_enc(·) is a neural encoder, which is described in Section 2.2; h_t is called the contextual embedding or dynamical embedding of token x_t because of the contextual information included in it.
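To make this distinction concrete, the following is a minimal PyTorch-style sketch of a lookup table E and a contextual encoder f_enc. The module choices and sizes are illustrative assumptions, not the architecture of any particular PTM.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
vocab_size, d_e, seq_len = 30000, 128, 7

# Non-contextual embeddings: a static lookup table E of shape (|V|, D_e).
lookup = nn.Embedding(vocab_size, d_e)

# Contextual encoder f_enc: a small Transformer encoder as one possible choice.
layer = nn.TransformerEncoderLayer(d_model=d_e, nhead=8, batch_first=True)
f_enc = nn.TransformerEncoder(layer, num_layers=2)

x = torch.randint(0, vocab_size, (1, seq_len))   # token ids x_1..x_T
e_x = lookup(x)        # same vector for a word regardless of its context
h = f_enc(e_x)         # h_t depends on the whole input text
```

Here the lookup output is identical for every occurrence of a token, whereas each h_t in the encoder output changes with the surrounding tokens, which is exactly the non-contextual/contextual split of Figure 1.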
2.2 Neural Contextual Encoders

Most of the neural contextual encoders can be classified into two categories: sequence models and graph-based models. Figure 2 illustrates the architecture of these models.

Figure 2: Neural Contextual Encoders. (a) Convolutional model, (b) recurrent model, (c) fully-connected self-attention model.

2.2.1 Sequence Models

Sequence models usually capture the local context of a word in sequential order.

Convolutional Models
Convolutional models take the embeddings of words in the input sentence and capture the meaning of a word by aggregating the local information from its neighbors with convolution operations [86].

Recurrent Models
Recurrent models capture the contextual representations of words with short memory, such as LSTMs [64] and GRUs [23]. In practice, bi-directional LSTMs or GRUs are used to collect information from both sides of a word, but their performance is often affected by the long-term dependency problem.

2.2.2 Non-Sequence Models

Non-sequence models learn the contextual representation with a pre-defined tree or graph structure between words, such as the syntactic structure or semantic relations. Some popular non-sequence models include the Recursive NN [159], TreeLSTM [174, 222], and GCN [88].

Although a linguistically-aware graph structure can provide useful inductive bias, how to build a good graph structure is also a challenging problem. Besides, the structure depends heavily on expert knowledge or external NLP tools, such as the dependency parser.

Fully-Connected Self-Attention Model
In practice, a more straightforward way is to use a fully-connected graph to model the relation of every two words and let the model learn the structure by itself. Usually, the connection weights are dynamically computed by the self-attention mechanism, which implicitly indicates the connection between words. A successful instance of the fully-connected self-attention model is the Transformer [184], which also needs other supplementary modules, such as positional embeddings, layer normalization, residual connections and position-wise feed-forward network (FFN) layers.

2.2.3 Analysis

Sequence models learn the contextual representation of a word with locality bias and have difficulty capturing the long-range interactions between words. Nevertheless, sequence models are usually easy to train and get good results on various NLP tasks.

In contrast, as an instantiated fully-connected self-attention model, the Transformer can directly model the dependency between every two words in a sequence, which is more powerful and better suited to modeling the long-range dependencies of language. However, due to its heavy structure and weaker model bias, the Transformer usually requires a large training corpus and easily overfits on small or modestly-sized datasets [142, 53].

Currently, the Transformer has become the mainstream architecture of PTMs due to its powerful capacity.

2.3 Why Pre-training?

With the development of deep learning, the number of model parameters has increased rapidly. Much larger datasets are needed to fully train model parameters and prevent overfitting. However, building large-scale labeled datasets is a great challenge for most NLP tasks due to the extremely expensive annotation costs, especially for syntax- and semantics-related tasks.

In contrast, large-scale unlabeled corpora are relatively easy to construct. To leverage the huge amount of unlabeled text data, we can first learn a good representation from it and then use this representation for other tasks. Recent studies have demonstrated significant performance gains on many NLP tasks with the help of representations extracted from PTMs on large unannotated corpora.

The advantages of pre-training can be summarized as follows:
1. Pre-training on a huge text corpus can learn universal language representations and help with downstream tasks.

2. Pre-training provides a better model initialization, which usually leads to better generalization performance and speeds up convergence on the target task.

3. Pre-training can be regarded as a kind of regularization to avoid overfitting on small data [43].

2.4 A Brief History of PTMs for NLP

Pre-training has always been an effective strategy to learn the parameters of deep neural networks, which are then fine-tuned on downstream tasks. As early as 2006, the breakthrough of deep learning came with greedy layer-wise unsupervised pre-training followed by supervised fine-tuning [62]. In CV, it has been common practice to pre-train models on the huge ImageNet corpus and then fine-tune them further on smaller data for different tasks. This is much better than a random initialization because the model learns general image features, which can then be used in various vision tasks.

In NLP, PTMs on large corpora have also proved to be beneficial for downstream NLP tasks, from shallow word embeddings to deep neural models.

2.4.1 First-Generation PTMs: Pre-trained Word Embeddings

Representing words as dense vectors has a long history [60]. The "modern" word embedding was introduced in the pioneering work on the neural network language model (NNLM) [12]. Collobert et al. [26] showed that a word embedding pre-trained on unlabelled data could significantly improve many NLP tasks. To address the computational complexity, they learned word embeddings with a pairwise ranking task instead of language modeling. Their work is the first attempt to obtain generic word embeddings useful for other tasks from unlabeled data. Mikolov et al. [129] showed that there is no need for deep neural networks to build good word embeddings. They proposed two shallow architectures: the Continuous Bag-of-Words (CBOW) and Skip-Gram (SG) models. Despite their simplicity, they can still learn high-quality word embeddings that capture the latent syntactic and semantic similarities among words. Word2vec is one of the most popular implementations of these models and makes pre-trained word embeddings accessible for different tasks in NLP. Besides, GloVe [133] is also a widely-used model for obtaining pre-trained word embeddings, which are computed from global word-word co-occurrence statistics over a large corpus.

Although pre-trained word embeddings have been shown to be effective in NLP tasks, they are context-independent and mostly trained by shallow models. When used on a downstream task, the rest of the whole model still needs to be learned from scratch.

During the same time period, many researchers also tried to learn embeddings of paragraphs, sentences or documents, such as paragraph vector [96], Skip-thought vectors [89], and Context2Vec [127]. Different from their modern successors, these sentence embedding models try to encode input sentences into a fixed-dimensional vector representation, rather than a contextual representation for each token.

2.4.2 Second-Generation PTMs: Pre-trained Contextual Encoders

Since most NLP tasks are beyond the word level, it is natural to pre-train the neural encoders at the sentence level or higher. The output vectors of neural encoders are also called contextual word embeddings since they represent the word semantics depending on context.

Dai and Le [30] proposed the first successful instance of a PTM for NLP. They initialized LSTMs with a language model (LM) or a sequence autoencoder, and found that the pre-training can improve the training and generalization of LSTMs in many text classification tasks. Liu et al. [109] pre-trained a shared LSTM encoder with an LM objective and fine-tuned it under the multi-task learning (MTL) framework. They found that pre-training and fine-tuning can further improve the performance of MTL for several text classification tasks. Ramachandran et al. [146] found that Seq2Seq models can be significantly improved by unsupervised pre-training. The weights of both encoder and decoder are initialized with the pre-trained weights of two language models and then fine-tuned with labeled data. Besides pre-training the contextual encoder with an LM, McCann et al. [126] pre-trained a deep LSTM encoder from an attentional sequence-to-sequence model with machine translation (MT). The context vectors (CoVe) output by the pre-trained encoder can improve the performance of a wide variety of common NLP tasks.

Since these precursor PTMs, modern PTMs are usually trained with larger-scale corpora, more powerful or deeper architectures (e.g., the Transformer), and new pre-training tasks.

Peters et al. [135] pre-trained a 2-layer LSTM encoder with a bidirectional language model (BiLM), consisting of a forward LM and a backward LM. The contextual representations output by the pre-trained BiLM, ELMo (Embeddings from Language Models), are shown to bring large improvements on a broad range of NLP tasks. Akbik et al. [1] captured word meaning with contextual string embeddings pre-trained with a character-level LM. However, these two PTMs are usually used as a feature extractor to produce contextual word embeddings, which are fed into the main model for downstream tasks. Their parameters are fixed, and the rest of the parameters of the main model are still trained from scratch.
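As a concrete picture of this feature-extraction setup (frozen PTM, task model trained from scratch), the following is a minimal PyTorch-style sketch; the encoder, sizes and task head are placeholders, not ELMo's actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical pre-trained contextual encoder (stands in for an ELMo/Flair-style PTM).
d_model = 128
pretrained_encoder = nn.LSTM(d_model, d_model, batch_first=True, bidirectional=True)

# Feature-based usage: freeze the PTM, train only the task model on top of it.
for p in pretrained_encoder.parameters():
    p.requires_grad = False

task_model = nn.Linear(2 * d_model, 5)      # e.g., a 5-class classifier head

emb = torch.randn(1, 10, d_model)           # embedded input tokens (illustrative)
with torch.no_grad():
    h, _ = pretrained_encoder(emb)          # contextual word embeddings, parameters fixed
logits = task_model(h)                      # only these parameters are trained from scratch
```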
ULMFiT (Universal Language Model Fine-tuning) [67] attempted to fine-tune a pre-trained LM for text classification (TC) and achieved state-of-the-art results on six widely-used TC datasets. ULMFiT consists of 3 phases: 1) pre-training the LM on general-domain data; 2) fine-tuning the LM on target data; 3) fine-tuning on the target task. ULMFiT also investigates some effective fine-tuning strategies, including discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing.

More recently, very deep PTMs have shown their powerful ability in learning universal language representations: e.g., OpenAI GPT (Generative Pre-training) [142] and BERT (Bidirectional Encoder Representation from Transformer) [36]. Besides LM, an increasing number of self-supervised tasks (see Section 3.1) have been proposed to make PTMs capture more knowledge from large-scale text corpora.

Since ULMFiT and BERT, fine-tuning has become the mainstream approach to adapt PTMs to downstream tasks.

3 Overview of PTMs

The major differences between PTMs are the usages of contextual encoders, pre-training tasks, and purposes. We have briefly introduced the architectures of contextual encoders in Section 2.2. In this section, we focus on the description of pre-training tasks and give a taxonomy of PTMs.

3.1 Pre-training Tasks

The pre-training tasks are crucial for learning the universal representation of language. Usually, these pre-training tasks should be challenging and have substantial training data. In this section, we summarize the pre-training tasks into three categories: supervised learning, unsupervised learning, and self-supervised learning.

1. Supervised learning (SL) learns a function that maps an input to an output based on training data consisting of input-output pairs.

2. Unsupervised learning (UL) finds some intrinsic knowledge from unlabeled data, such as clusters, densities, and latent representations.

3. Self-supervised learning (SSL) is a blend of supervised learning and unsupervised learning1). The learning paradigm of SSL is entirely the same as supervised learning, but the labels of the training data are generated automatically. The key idea of SSL is to predict any part of the input from other parts in some form. For example, the masked language model (MLM) is a self-supervised task that attempts to predict the masked words in a sentence given the rest of the words.

In CV, many PTMs are trained on large supervised training sets like ImageNet. However, in NLP, the datasets of most supervised tasks are not large enough to train a good PTM. The only exception is machine translation (MT). A large-scale MT dataset, WMT 2017, consists of more than 7 million sentence pairs. Besides, MT is one of the most challenging tasks in NLP, and an encoder pre-trained on MT can benefit a variety of downstream NLP tasks. As a successful PTM, CoVe [126] is an encoder pre-trained on an MT task that improves a wide variety of common NLP tasks: sentiment analysis (SST, IMDb), question classification (TREC), entailment (SNLI), and question answering (SQuAD).

In this section, we introduce some widely-used pre-training tasks in existing PTMs. We can regard these tasks as self-supervised learning. Table 1 also summarizes their loss functions.

3.1.1 Language Modeling (LM)

The most common unsupervised task in NLP is probabilistic language modeling (LM), which is a classic probabilistic density estimation problem. Although LM is a general concept, in practice, LM often refers in particular to auto-regressive LM or unidirectional LM.

Given a text sequence x_{1:T} = [x_1, x_2, \dots, x_T], its joint probability p(x_{1:T}) can be decomposed as

p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{0:t-1}),    (2)

where x_0 is a special token indicating the beginning of the sequence.

The conditional probability p(x_t | x_{0:t-1}) can be modeled by a probability distribution over the vocabulary given the linguistic context x_{0:t-1}. The context x_{0:t-1} is modeled by a neural encoder f_enc(·), and the conditional probability is

p(x_t \mid x_{0:t-1}) = g_{\text{LM}}\big(f_{\text{enc}}(x_{0:t-1})\big),    (3)

where g_LM(·) is a prediction layer.

Given a huge corpus, we can train the entire network with maximum likelihood estimation (MLE).

A drawback of unidirectional LM is that the representation of each token encodes only the leftward context tokens and itself. However, better contextual representations of text should encode contextual information from both directions.

1) Indeed, it is hard to clearly distinguish unsupervised learning and self-supervised learning. For clarification, we refer to "unsupervised learning" as learning without human-annotated supervised labels. The purpose of "self-supervised learning" is to learn general knowledge from data rather than standard unsupervised objectives, such as density estimation.
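A minimal sketch of training with the unidirectional LM objective of Eqs. (2)-(3) under MLE: a toy LSTM stands in for f_enc, a linear layer for g_LM, and all sizes are illustrative assumptions, not those of any particular PTM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sizes, for illustration only.
vocab_size, d_model, T = 1000, 64, 12

embed = nn.Embedding(vocab_size, d_model)
f_enc = nn.LSTM(d_model, d_model, batch_first=True)   # any left-to-right encoder works
g_lm = nn.Linear(d_model, vocab_size)                  # prediction layer g_LM

x = torch.randint(0, vocab_size, (1, T))               # token ids x_1..x_T
h, _ = f_enc(embed(x))                                 # h_t summarizes x_1..x_t

# Predict x_{t+1} from the context x_1..x_t and sum the negative log-likelihoods,
# i.e. L_LM = - sum_t log p(x_t | x_<t), which is exactly MLE on the corpus.
logits = g_lm(h[:, :-1])                               # predictions for positions 2..T
loss = F.cross_entropy(logits.reshape(-1, vocab_size), x[:, 1:].reshape(-1))
```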
Table 1: Loss Functions of Pre-training Tasks
(x = [x_1, x_2, \dots, x_T] denotes a sequence.)

LM: \mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log p(x_t \mid x_{<t}), where x_{<t} = x_1, x_2, \dots, x_{t-1}.

MLM: \mathcal{L}_{\text{MLM}} = -\sum_{\hat{x} \in m(x)} \log p(\hat{x} \mid x_{\setminus m(x)}), where m(x) and x_{\setminus m(x)} denote the masked words from x and the rest of the words respectively.

Seq2Seq MLM: \mathcal{L}_{\text{S2SMLM}} = -\sum_{t=i}^{j} \log p(x_t \mid x_{\setminus x_{i:j}}, x_{i:t-1}), where x_{i:j} denotes a masked n-gram span from i to j in x.

PLM: \mathcal{L}_{\text{PLM}} = -\sum_{t=1}^{T} \log p(z_t \mid z_{<t}), where z = perm(x) is a permutation of x with random order.

DAE: \mathcal{L}_{\text{DAE}} = -\sum_{t=1}^{T} \log p(x_t \mid \hat{x}, x_{<t}), where \hat{x} is randomly perturbed text from x.

DIM: \mathcal{L}_{\text{DIM}} = s(\hat{x}_{i:j}, x_{i:j}) - \log \sum_{\tilde{x}_{i:j} \in \mathcal{N}} s(\hat{x}_{i:j}, \tilde{x}_{i:j}), where x_{i:j} denotes an n-gram span from i to j in x, \hat{x}_{i:j} denotes a sentence masked at positions i to j, and \tilde{x}_{i:j} denotes a randomly-sampled negative n-gram from the corpus.

NSP/SOP: \mathcal{L}_{\text{NSP/SOP}} = -\log p(t \mid x, y), where t = 1 if x and y are continuous segments from the corpus.

RTD: \mathcal{L}_{\text{RTD}} = -\sum_{t=1}^{T} \log p(y_t \mid \hat{x}), where y_t = \mathbb{1}(\hat{x}_t = x_t) and \hat{x} is corrupted from x.

An improved solution is the bidirectional LM (BiLM), which consists of two unidirectional LMs: a forward left-to-right LM and a backward right-to-left LM. For BiLM, Baevski et al. [6] proposed a two-tower model in which the forward tower operates the left-to-right LM and the backward tower operates the right-to-left LM.

3.1.2 Masked Language Modeling (MLM)

Masked language modeling (MLM) was first proposed by Taylor [178] in the literature, who referred to it as a Cloze task. Devlin et al. [36] adapted this task as a novel pre-training task to overcome the drawback of the standard unidirectional LM. Loosely speaking, MLM first masks out some tokens from the input sentences and then trains the model to predict the masked tokens from the rest of the tokens. However, this pre-training method creates a mismatch between the pre-training phase and the fine-tuning phase, because the mask token does not appear during the fine-tuning phase. Empirically, to deal with this issue, Devlin et al. [36] used a special [MASK] token 80% of the time, a random token 10% of the time and the original token 10% of the time to perform masking.

Sequence-to-Sequence MLM (Seq2Seq MLM)
MLM is usually solved as a classification problem. We feed the masked sequences to a neural encoder whose output vectors are further fed into a softmax classifier to predict the masked tokens. Alternatively, we can use an encoder-decoder (aka. sequence-to-sequence) architecture for MLM, in which the encoder is fed a masked sequence, and the decoder sequentially produces the masked tokens in an auto-regressive fashion. We refer to this kind of MLM as sequence-to-sequence MLM (Seq2Seq MLM), which is used in MASS [160] and T5 [144]. Seq2Seq MLM can benefit Seq2Seq-style downstream tasks, such as question answering, summarization, and machine translation.

Enhanced Masked Language Modeling (E-MLM)
Concurrently, multiple lines of research propose different enhanced versions of MLM to further improve on BERT. Instead of static masking, RoBERTa [117] improves BERT with dynamic masking.

UniLM [39, 8] extends the task of mask prediction to three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. XLM [27] performs MLM on a concatenation of parallel bilingual sentence pairs, called Translation Language Modeling (TLM). SpanBERT [77] replaces MLM with Random Contiguous Words Masking and a Span Boundary Objective (SBO) to integrate structure information into pre-training, which requires the system to predict masked spans based on span boundaries. Besides, StructBERT [193] introduces the Span Order Recovery task to further incorporate language structures.

Another way to enrich MLM is to incorporate external knowledge (see Section 4.1).
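The following is a small, self-contained sketch of the 80%/10%/10% corruption rule described above; the vocabulary and masking probability are toy values, not BERT's actual preprocessing code.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """BERT-style corruption: select ~15% of positions; of those, 80% become
    [MASK], 10% a random token, 10% stay unchanged. Targets are the originals."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                          # position the model must predict
            r = random.random()
            if r < 0.8:
                corrupted[i] = MASK                   # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.choice(VOCAB)   # 10%: random token
            # else 10%: keep the original token
    return corrupted, targets

corrupted, targets = mask_tokens("the cat sat on the mat".split())
```

Keeping some selected positions unchanged or randomly replaced is what softens the pre-training/fine-tuning mismatch caused by the [MASK] symbol.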
3.1.3 Permuted Language Modeling (PLM)

Despite the wide use of the MLM task in pre-training, Yang et al. [209] claimed that some special tokens used in the pre-training of MLM, like [MASK], are absent when the model is applied to downstream tasks, leading to a gap between pre-training and fine-tuning. To overcome this issue, Permuted Language Modeling (PLM) [209] is a pre-training objective to replace MLM. In short, PLM is a language modeling task on a random permutation of the input sequence. A permutation is randomly sampled from all possible permutations. Then some of the tokens in the permuted sequence are chosen as the target, and the model is trained to predict these targets, depending on the rest of the tokens and the natural positions of the targets. Note that this permutation does not affect the natural positions of the sequence and only defines the order of token predictions. In practice, only the last few tokens in the permuted sequence are predicted, due to slow convergence. And a special two-stream self-attention is introduced for target-aware representations.

3.1.4 Denoising Autoencoder (DAE)

A denoising autoencoder (DAE) takes a partially corrupted input and aims to recover the original undistorted input. Specific to language, a sequence-to-sequence model, such as the standard Transformer, is used to reconstruct the original text. There are several ways to corrupt text [100]:

(1) Token Masking: Randomly sampling tokens from the input and replacing them with [MASK] elements.

(2) Token Deletion: Randomly deleting tokens from the input. Different from token masking, the model needs to decide the positions of missing inputs.

(3) Text Infilling: Like SpanBERT, a number of text spans are sampled and replaced with a single [MASK] token. Each span length is drawn from a Poisson distribution (λ = 3). The model needs to predict how many tokens are missing from a span.

(4) Sentence Permutation: Dividing a document into sentences based on full stops and shuffling these sentences in random order.

(5) Document Rotation: Selecting a token uniformly at random and rotating the document so that it begins with that token. The model needs to identify the real start position of the document.
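For illustration, the sketch below applies two of the corruption operations listed above (token deletion and document rotation) to a token sequence; it is a toy example under assumed settings, not BART's preprocessing pipeline.

```python
import random

def corrupt(tokens, rotate=True, drop_prob=0.15):
    """Toy DAE-style corruption: (2) token deletion, then (5) document rotation."""
    out = [t for t in tokens if random.random() > drop_prob]   # (2) token deletion
    if rotate and out:
        start = random.randrange(len(out))                     # (5) document rotation
        out = out[start:] + out[:start]
    return out

print(corrupt("the quick brown fox jumps over the lazy dog .".split()))
```

A sequence-to-sequence model would then be trained to map such corrupted sequences back to the original text.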
3.1.5 Contrastive Learning (CTL)

Contrastive learning [153] assumes that some observed pairs of text are more semantically similar than randomly sampled text. A score function s(x, y) for a text pair (x, y) is learned to minimize the objective function:

\mathcal{L}_{\text{CTL}} = \mathbb{E}_{x, y^{+}, y^{-}} \left[ -\log \frac{\exp\big(s(x, y^{+})\big)}{\exp\big(s(x, y^{+})\big) + \exp\big(s(x, y^{-})\big)} \right],    (4)

where (x, y^+) is a similar pair and y^- is presumably dissimilar to x. y^+ and y^- are typically called positive and negative samples. The score function s(x, y) is often computed by a learnable neural encoder in two ways: s(x, y) = f_enc(x)^T f_enc(y) or s(x, y) = f_enc(x ⊕ y).

The idea behind CTL is "learning by comparison". Compared to LM, CTL usually has lower computational complexity and is therefore a desirable alternative training criterion for PTMs.

Collobert et al. [26] proposed a pairwise ranking task to distinguish real and fake phrases. The model needs to predict a higher score for a legal phrase than for an incorrect phrase obtained by replacing its central word with a random word. Mnih and Kavukcuoglu [131] trained word embeddings efficiently with Noise-Contrastive Estimation (NCE) [55], which trains a binary classifier to distinguish real and fake samples. The idea of NCE is also used in the well-known word2vec embedding [129].

We briefly describe some recently proposed CTL tasks in the following paragraphs.

Deep InfoMax (DIM)
Deep InfoMax (DIM) [63] was originally proposed for images; it improves the quality of the representation by maximizing the mutual information between an image representation and local regions of the image.

Kong et al. [90] applied DIM to language representation learning. The global representation of a sequence x is defined to be the hidden state of the first token (assumed to be a special start-of-sentence symbol) output by the contextual encoder f_enc(x). The objective of DIM is to assign a higher score to f_enc(x_{i:j})^T f_enc(\hat{x}_{i:j}) than to f_enc(\tilde{x}_{i:j})^T f_enc(\hat{x}_{i:j}), where x_{i:j} denotes an n-gram2) span from i to j in x, \hat{x}_{i:j} denotes a sentence masked at positions i to j, and \tilde{x}_{i:j} denotes a randomly-sampled negative n-gram from the corpus.

Replaced Token Detection (RTD)
Replaced Token Detection (RTD) is the same as NCE but predicts whether a token has been replaced given its surrounding context.

CBOW with negative sampling (CBOW-NS) [129] can be viewed as a simple version of RTD, in which the negative samples are randomly sampled from the vocabulary with a simple proposal distribution.

ELECTRA [24] improves RTD by utilizing a generator to replace some tokens of a sequence. A generator G and a discriminator D are trained following a two-stage procedure: (1) train only the generator with the MLM task for n1 steps; (2) initialize the weights of the discriminator with the weights of the generator, then train the discriminator with a discriminative task for n2 steps, keeping G frozen. Here the discriminative task means judging whether the input token has been replaced by G or not. The generator is thrown away after pre-training, and only the discriminator is fine-tuned on downstream tasks.

2) n is drawn from a Gaussian distribution N(5, 1) clipped at 1 (minimum length) and 10 (maximum length).
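A minimal sketch of the contrastive objective of Eq. (4) with the dot-product scoring function s(x, y) = f_enc(x)^T f_enc(y); the encoder outputs here are random placeholders, not any PTM's real representations.

```python
import torch
import torch.nn.functional as F

def ctl_loss(s_pos, s_neg):
    """Eq. (4): -log( exp(s(x,y+)) / (exp(s(x,y+)) + exp(s(x,y-))) ), batch-averaged."""
    scores = torch.stack([s_pos, s_neg], dim=-1)              # [batch, 2]
    targets = torch.zeros(scores.size(0), dtype=torch.long)   # the positive is index 0
    return F.cross_entropy(scores, targets)

# Dot-product scoring on toy "encoder outputs".
hx, hy_pos, hy_neg = torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 64)
loss = ctl_loss((hx * hy_pos).sum(-1), (hx * hy_neg).sum(-1))
```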
RTD is also an alternative solution to the mismatch problem: the network sees [MASK] during pre-training but not when being fine-tuned on downstream tasks.

Similarly, WKLM [202] replaces words at the entity level instead of the token level. Concretely, WKLM replaces entity mentions with names of other entities of the same type and trains the model to distinguish whether the entity has been replaced.

Next Sentence Prediction (NSP)
Punctuation marks are the natural separators of text data, so it is reasonable to construct pre-training methods that utilize them. Next Sentence Prediction (NSP) [36] is just a great example of this. As its name suggests, NSP trains the model to distinguish whether two input sentences are continuous segments from the training corpus. Specifically, when choosing the sentence pair for each pre-training example, 50% of the time the second sentence is the actual next sentence of the first one, and 50% of the time it is a random sentence from the corpus. By doing so, NSP teaches the model to understand the relationship between two input sentences and thus benefits downstream tasks that are sensitive to this information, such as Question Answering and Natural Language Inference.

However, the necessity of the NSP task has been questioned by subsequent work [77, 209, 117, 93]. Yang et al. [209] found the impact of the NSP task unreliable, while Joshi et al. [77] found that single-sentence training without the NSP loss is superior to sentence-pair training with the NSP loss. Moreover, Liu et al. [117] conducted a further analysis of the NSP task, which shows that when training with blocks of text from a single document, removing the NSP loss matches or slightly improves performance on downstream tasks.

Sentence Order Prediction (SOP)
To better model inter-sentence coherence, ALBERT [93] replaces the NSP loss with a sentence order prediction (SOP) loss. As conjectured in Lan et al. [93], NSP conflates topic prediction and coherence prediction in a single task. Thus, the model is allowed to make predictions merely relying on the easier task, topic prediction. Different from NSP, SOP uses two consecutive segments from the same document as positive examples, and the same two consecutive segments but with their order swapped as negative examples. As a result, ALBERT consistently outperforms BERT on various downstream tasks.

StructBERT [193] and BERTje [33] also take SOP as their self-supervised learning task.

3.1.6 Others

Apart from the above tasks, there are many other auxiliary pre-training tasks designed to incorporate factual knowledge (see Section 4.1), improve cross-lingual tasks (see Section 4.2), multi-modal applications (see Section 4.3), or other specific tasks (see Section 4.4).

3.2 Taxonomy of PTMs

To clarify the relations of existing PTMs for NLP, we build a taxonomy of PTMs, which categorizes existing PTMs from four different perspectives:

1. Representation Type: According to the representation used for downstream tasks, we can divide PTMs into non-contextual and contextual models.

2. Architectures: The backbone network used by PTMs, including LSTM, Transformer encoder, Transformer decoder, and the full Transformer architecture. "Transformer" means the standard encoder-decoder architecture. "Transformer encoder" and "Transformer decoder" mean the encoder and decoder part of the standard Transformer architecture, respectively. Their difference is that the decoder part uses masked self-attention with a triangular matrix to prevent tokens from attending to their future (right) positions.

3. Pre-Training Task Types: The type of pre-training tasks used by PTMs. We have discussed them in Section 3.1.

4. Extensions: PTMs designed for various scenarios, including knowledge-enriched PTMs, multilingual or language-specific PTMs, multi-modal PTMs, domain-specific PTMs and compressed PTMs. We will particularly introduce these extensions in Section 4.

Figure 3 shows the taxonomy as well as some corresponding representative PTMs. Besides, Table 2 distinguishes some representative PTMs in more detail.
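To make the encoder/decoder distinction in the Architectures category concrete, the following sketch builds the triangular (causal) self-attention mask that a Transformer decoder applies; an encoder-style PTM simply omits it. Sizes are illustrative.

```python
import torch

T = 5  # sequence length, illustrative

# Triangular mask: position t may attend to positions <= t only.
# True marks the future (right) positions that are blocked.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

scores = torch.randn(T, T)                          # raw attention scores
scores = scores.masked_fill(causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)                # decoder-style masked self-attention

# An encoder-style PTM (e.g., BERT) skips the masking step above, so every token
# attends to both its left and right context.
```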
Figure 3: Taxonomy of PTMs with Representative Examples

- Contextual?
  - Non-Contextual: CBOW, Skip-Gram [129]; GloVe [133]
  - Contextual: ELMo [135], GPT [142], BERT [36]
- Architectures
  - LSTM: LM-LSTM [30], Shared LSTM [109], ELMo [135], CoVe [126]
  - Transformer Enc.: BERT [36], SpanBERT [117], XLNet [209], RoBERTa [117]
  - Transformer Dec.: GPT [142], GPT-2 [143]
  - Transformer: MASS [160], BART [100], XNLG [19], mBART [118]
- Task Types
  - Supervised: MT: CoVe [126]
  - Unsupervised/Self-Supervised:
    - LM: ELMo [135], GPT [142], GPT-2 [143], UniLM [39]
    - MLM: BERT [36], SpanBERT [117], RoBERTa [117], XLM-R [28]
      - TLM: XLM [27]
      - Seq2Seq MLM: MASS [160], T5 [144]
    - PLM: XLNet [209]
    - DAE: BART [100]
    - CTL:
      - RTD: CBOW-NS [129], ELECTRA [24]
      - NSP: BERT [36], UniLM [39]
      - SOP: ALBERT [93], StructBERT [193]
- Extensions
  - Knowledge-Enriched: ERNIE(THU) [214], KnowBERT [136], K-BERT [111], SentiLR [83], KEPLER [195], WKLM [202]
  - Multilingual:
    - XLU: mBERT [36], Unicoder [68], XLM [27], XLM-R [28], MultiFit [42]
    - XLG: MASS [160], mBART [118], XNLG [19]
  - Language-Specific: ERNIE(Baidu) [170], BERT-wwm-Chinese [29], NEZHA [198], ZEN [37], BERTje [33], CamemBERT [125], FlauBERT [95], RobBERT [35]
  - Multi-Modal:
    - Image: ViLBERT [120], LXMERT [175], VisualBERT [103], B2T2 [2], VL-BERT [163]
    - Video: VideoBERT [165], CBT [164]
    - Speech: SpeechBERT [22]
  - Domain-Specific: SentiLR [83], BioBERT [98], SciBERT [11], PatentBERT [97]
  - Model Compression:
    - Model Pruning: CompressingBERT [51]
    - Quantization: Q-BERT [156], Q8BERT [211]
    - Parameter Sharing: ALBERT [93]
    - Distillation: DistilBERT [152], TinyBERT [75], MiniLM [194]
    - Module Replacing: BERT-of-Theseus [203]


Table 2: List of Representative PTMs

PTMs | Architecture† | Input | Pre-Training Task | Corpus | Params | GLUE‡ | FT?
ELMo [135] | LSTM | Text | BiLM | WikiText-103 | – | – | No
GPT [142] | Transformer Dec. | Text | LM | BookCorpus | 117M | 72.8 | Yes
GPT-2 [143] | Transformer Dec. | Text | LM | WebText | 117M ~ 1542M | – | No
BERT [36] | Transformer Enc. | Text | MLM & NSP | WikiEn+BookCorpus | 110M ~ 340M | 81.9∗ | Yes
InfoWord [90] | Transformer Enc. | Text | DIM+MLM | WikiEn+BookCorpus | =BERT | 81.1∗ | Yes
RoBERTa [117] | Transformer Enc. | Text | MLM | BookCorpus+CC-News+OpenWebText+STORIES | 355M | 88.5 | Yes
XLNet [209] | Two-Stream Transformer Enc. | Text | PLM | WikiEn+BookCorpus+Giga5+ClueWeb+Common Crawl | ≈BERT | 90.5§ | Yes
ELECTRA [24] | Transformer Enc. | Text | RTD+MLM | same as XLNet | 335M | 88.6 | Yes
UniLM [39] | Transformer Enc. | Text | MLM♦+NSP | WikiEn+BookCorpus | 340M | 80.8 | Yes
MASS [160] | Transformer | Text | Seq2Seq MLM | *Task-dependent | – | – | Yes
BART [100] | Transformer | Text | DAE | same as RoBERTa | 110% of BERT | 88.4∗ | Yes
T5 [144] | Transformer | Text | Seq2Seq MLM | Colossal Clean Crawled Corpus (C4) | 220M ~ 11B | 89.7∗ | Yes
ERNIE(THU) [214] | Transformer Enc. | Text+Entities | MLM+NSP+dEA | WikiEn+Wikidata | 114M | 79.6 | Yes
KnowBERT [136] | Transformer Enc. | Text | MLM+NSP+EL | WikiEn+WordNet/Wiki | 253M ~ 523M | – | Yes
K-BERT [111] | Transformer Enc. | Text+Triples | MLM+NSP | WikiZh+WebtextZh+CN-DBpedia+HowNet+MedicalKG | =BERT | – | Yes
KEPLER [195] | Transformer Enc. | Text | MLM+KE | WikiEn+Wikidata/WordNet | – | – | Yes
WKLM [202] | Transformer Enc. | Text | MLM+ERD | WikiEn+Wikidata | =BERT | – | Yes

† "Transformer Enc." and "Transformer Dec." mean the encoder and decoder part of the standard Transformer architecture respectively. Their difference is that the decoder part uses masked self-attention with a triangular matrix to prevent tokens from attending to their future (right) positions. "Transformer" means the standard encoder-decoder architecture.
‡ the averaged score on 9 tasks of the GLUE benchmark (see Section 7.1).
∗ without the WNLI task.
§ indicates an ensemble result.
FT? means whether the model is usually used in a fine-tuning fashion.
♦ The MLM of UniLM is built on three versions of LMs: Unidirectional LM, Bidirectional LM, and Sequence-to-Sequence LM.

3.3 Model Analysis

Due to the great success of PTMs, it is important to understand what kinds of knowledge are captured by them, and how to induce knowledge from them. There is a wide range of literature analyzing linguistic knowledge and world knowledge stored in pre-trained non-contextual and contextual embeddings.

3.3.1 Non-Contextual Embeddings

Static word embeddings were first probed for various kinds of knowledge. Mikolov et al. [130] found that word representations learned by neural network language models are able to capture linguistic regularities in language, and the relationship between words can be characterized by a relation-specific vector offset. Further analogy experiments [129] demonstrated that word vectors produced by the skip-gram model can capture both syntactic and semantic word relationships, such as vec("China") − vec("Beijing") ≈ vec("Japan") − vec("Tokyo"). Besides, they found a compositionality property of word vectors; for example, vec("Germany") + vec("capital") is close to vec("Berlin"). Inspired by these works, Rubinstein et al. [151] found that distributional word representations are good at predicting taxonomic properties (e.g., dog is an animal) but fail to learn attributive properties (e.g., swan is white). Similarly, Gupta et al. [54] showed that word2vec embeddings implicitly encode referential attributes of entities. The distributed word vectors, along with a simple supervised model, can learn to predict numeric and binary attributes of entities with a reasonable degree of accuracy.

3.3.2 Contextual Embeddings

A large number of studies have probed and induced different types of knowledge in contextual embeddings. In general, there are two types of knowledge: linguistic knowledge and world knowledge.

Linguistic Knowledge
A wide range of probing tasks are designed to investigate the linguistic knowledge in PTMs. Tenney et al. [180] and Liu et al. [108] found that BERT performs well on many syntactic tasks such as part-of-speech tagging and constituent labeling. However, BERT is not good enough at semantic and fine-grained syntactic tasks, compared with simple syntactic tasks.

Besides, Tenney et al. [179] analyzed the roles of BERT's layers in different tasks and found that BERT solves tasks in a similar order to that in NLP pipelines. Furthermore, knowledge of subject-verb agreement [50] and semantic roles [44] is also confirmed to exist in BERT. In addition, Hewitt and Manning [59], Jawahar et al. [72], and Kim et al. [85] proposed several methods to extract dependency trees and constituency trees from BERT, which proved BERT's ability to encode syntactic structure. Reif et al. [148] explored the geometry of internal representations in BERT and found some evidence: 1) linguistic features seem to be represented in separate semantic and syntactic subspaces; 2) attention matrices contain grammatical representations; 3) BERT distinguishes word senses at a very fine level.

World Knowledge
Besides linguistic knowledge, PTMs may also store world knowledge presented in the training data. A straightforward method of probing world knowledge is to query BERT with "fill-in-the-blank" cloze statements, for example, "Dante was born in [MASK]". Petroni et al. [138] constructed the LAMA (Language Model Analysis) task by manually creating single-token cloze statements (queries) from several knowledge sources. Their experiments show that BERT contains world knowledge competitive with traditional information extraction methods.
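Such a LAMA-style cloze probe can be run in a few lines, assuming the HuggingFace Transformers library and its fill-mask pipeline are available; the model name and printed fields are illustrative and may differ across library versions.

```python
# Minimal "fill-in-the-blank" probe of a masked language model.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("Dante was born in [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))   # top predictions with scores
```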
Given the simplicity of the query generation procedure in LAMA, Jiang et al. [74] argued that LAMA just measures a lower bound for what language models know and proposed more advanced methods to generate more effective queries. Despite the surprising findings of LAMA, it has also been questioned by subsequent work [141, 82]. Similarly, several studies induce relational knowledge [15] and commonsense knowledge [32] from BERT for downstream tasks.

4 Extensions of PTMs

4.1 Knowledge-Enriched PTMs

PTMs usually learn universal language representations from general-purpose large-scale text corpora but lack domain-specific knowledge. Incorporating domain knowledge from external knowledge bases into PTMs has been shown to be effective. The external knowledge ranges from linguistic [94, 83, 136, 191], semantic [99], commonsense [52], and factual [214, 136, 111, 202, 195] to domain-specific knowledge [58, 111].

On the one hand, external knowledge can be injected during pre-training. Early studies [197, 217, 201, 205] focused on learning knowledge graph embeddings and word embeddings jointly. Since BERT, some auxiliary pre-training tasks have been designed to incorporate external knowledge into deep PTMs. LIBERT [94] (linguistically-informed BERT) incorporates linguistic knowledge via an additional linguistic constraint task. Ke et al. [83] integrated the sentiment polarity of each word to extend the MLM to Label-Aware MLM (LA-MLM). As a result, their proposed model, SentiLR, achieves state-of-the-art performance on several sentence- and aspect-level sentiment classification tasks. Levine et al. [99] proposed SenseBERT, which is pre-trained to predict not only the masked tokens but also their supersenses in WordNet. ERNIE(THU) [214] integrates entity embeddings pre-trained on a knowledge graph with corresponding entity mentions in the text to enhance the text representation. Similarly, KnowBERT [136] trains BERT jointly with an entity linking model to incorporate entity representation in an end-to-end fashion. Wang et al. [195] proposed KEPLER, which jointly optimizes knowledge embedding and language modeling objectives. These works inject the structure information of a knowledge graph via entity embeddings. In contrast, K-BERT [111] explicitly injects related triples extracted from a KG into the sentence to obtain an extended tree-form input for BERT. Moreover, Xiong et al. [202] adopted entity replacement identification to encourage the model to be more aware of factual knowledge. However, most of these methods update the parameters of PTMs when injecting knowledge, which may suffer from catastrophic forgetting when injecting multiple kinds of knowledge. To address this, K-Adapter [191] injects multiple kinds of knowledge by training different adapters independently for different pre-training tasks, which allows continual knowledge infusion.

On the other hand, one can incorporate external knowledge into pre-trained models without retraining them from scratch. As an example, K-BERT [111] allows injecting factual knowledge during fine-tuning on downstream tasks. Guan et al. [52] employed commonsense knowledge bases, ConceptNet and ATOMIC, to enhance GPT-2 for story generation. Yang et al. [207] proposed a knowledge-text fusion model to acquire related linguistic and factual knowledge for machine reading comprehension.

Besides, Logan IV et al. [119] and Hayashi et al. [57] extended the language model to the knowledge graph language model (KGLM) and the latent relation language model (LRLM) respectively, both of which allow prediction conditioned on a knowledge graph. These novel KG-conditioned language models show potential for pre-training.

4.2 Multilingual and Language-Specific PTMs

4.2.1 Multilingual PTMs

Learning multilingual text representations shared across languages plays an important role in many cross-lingual NLP tasks.

Cross-Lingual Language Understanding (XLU)
Most of the early works focus on learning multilingual word embeddings [45, 123, 158], which represent text from multiple languages in a single semantic space. However, these methods usually need (weak) alignment between languages.

Multilingual BERT3) (mBERT) is pre-trained by MLM with a shared vocabulary and weights on Wikipedia text from the top 104 languages. Each training sample is a monolingual document, and there are no cross-lingual objectives specifically designed nor any cross-lingual data. Even so, mBERT performs cross-lingual generalization surprisingly well [140]. K et al. [79] showed that the lexical overlap between languages plays a negligible role in cross-lingual success.

3) https://github.com/google-research/bert/blob/master/multilingual.md

XLM [27] improves mBERT by incorporating a cross-lingual task, translation language modeling (TLM), which performs MLM on a concatenation of parallel bilingual sentence pairs. Unicoder [68] further proposes three new cross-lingual pre-training tasks, including cross-lingual word recovery, cross-lingual paraphrase classification and the cross-lingual masked language model (XMLM).

XLM-RoBERTa (XLM-R) [28] is a scaled multilingual encoder pre-trained on a significantly increased amount of
training data, 2.5TB of clean CommonCrawl data in 100 different languages. The pre-training task of XLM-RoBERTa is monolingual MLM only. XLM-R achieves state-of-the-art results on multiple cross-lingual benchmarks, including XNLI, MLQA, and NER.

Cross-Lingual Language Generation (XLG)
Multilingual generation is a kind of task that generates text in a language different from the input language, such as machine translation and cross-lingual abstractive summarization.

Different from the PTMs for multilingual classification, the PTMs for multilingual generation usually need to pre-train both the encoder and decoder jointly, rather than only focusing on the encoder.

MASS [160] pre-trains a Seq2Seq model with monolingual Seq2Seq MLM on multiple languages and achieves significant improvements for unsupervised NMT. XNLG [19] performs two-stage pre-training for cross-lingual natural language generation. The first stage pre-trains the encoder with monolingual MLM and Cross-Lingual MLM (XMLM) tasks. The second stage pre-trains the decoder by using monolingual DAE and Cross-Lingual Auto-Encoding (XAE) tasks while keeping the encoder fixed. Experiments show the benefit of XNLG on cross-lingual question generation and cross-lingual abstractive summarization. mBART [118], a multilingual extension of BART [100], pre-trains the encoder and decoder jointly with a Seq2Seq denoising auto-encoder (DAE) task on large-scale monolingual corpora across 25 languages. Experiments demonstrate that mBART produces significant performance gains across a wide variety of machine translation (MT) tasks.

4.2.2 Language-Specific PTMs

Although multilingual PTMs perform well on many languages, recent work showed that PTMs trained on a single language significantly outperform the multilingual results [125, 95, 186].

For Chinese, which does not have explicit word boundaries, modeling larger granularity [29, 37, 198] and multi-granularity [170, 171] word representations has shown great success. Kuratov and Arkhipov [92] used transfer learning techniques to adapt a multilingual PTM to a monolingual PTM for the Russian language. In addition, some monolingual PTMs have been released for different languages, such as CamemBERT [125] and FlauBERT [95] for French, FinBERT [186] for Finnish, BERTje [33] and RobBERT [35] for Dutch, and AraBERT [4] for Arabic.

4.3 Multi-Modal PTMs

Observing the success of PTMs across many NLP tasks, some research has focused on obtaining cross-modal versions of PTMs. A great majority of these models are designed for general visual and linguistic feature encoding. These models are pre-trained on huge corpora of cross-modal data, such as videos with spoken words or images with captions, incorporating extended pre-training tasks to fully utilize the multi-modal features. Typically, tasks like visual-based MLM, masked visual-feature modeling and visual-linguistic matching are widely used in multi-modal pre-training, such as VideoBERT [165], VisualBERT [103], and ViLBERT [120].

4.3.1 Video-Text PTMs

VideoBERT [165] and CBT [164] are joint video and text models. To obtain the sequences of visual and linguistic tokens used for pre-training, the videos are pre-processed by CNN-based encoders and off-the-shelf speech recognition techniques, respectively. A single Transformer encoder is then trained on the processed data to learn the vision-language representations for downstream tasks like video captioning. Furthermore, UniViLM [122] proposes to bring in generation tasks to further pre-train the decoder used in downstream tasks.

4.3.2 Image-Text PTMs

Besides methods for video-language pre-training, several works introduce PTMs on image-text pairs, aiming to fit downstream tasks like visual question answering (VQA) and visual commonsense reasoning (VCR). Several proposed models adopt two separate encoders for image and text representation independently, such as ViLBERT [120] and LXMERT [175]. Other methods like VisualBERT [103], B2T2 [2], VL-BERT [163], Unicoder-VL [101] and UNITER [17] propose single-stream unified Transformers. Though these model architectures are different, similar pre-training tasks, such as MLM and image-text matching, are introduced in these approaches. And to better exploit visual elements, images are converted into sequences of regions by applying RoI or bounding box retrieval techniques before being encoded by pre-trained Transformers.

4.3.3 Audio-Text PTMs

Moreover, several methods have explored the use of PTMs on audio-text pairs, such as SpeechBERT [22]. This work tries to build an end-to-end Speech Question Answering (SQA) model by encoding audio and text with a single Transformer encoder, which is pre-trained with MLM on speech and text corpora and fine-tuned on Question Answering.
QIU XP, et al. Pre-trained Models for Natural Language Processing: A Survey March (2020) 13

4.4 Domain-Specific and Task-Specific PTMs 4.5.2 Quantization

Most publicly available PTMs are trained on general do- Quantization refers to the compression of higher precision
main corpora such as Wikipedia, which limits their appli- parameters to lower precision. Works from Shen et al. [156]
cations to specific domains or tasks. Recently, some studies and Zafrir et al. [211] solely focus on this area. Note that
have proposed PTMs trained on specialty corpora, such as quantization often requires compatible hardware.
BioBERT [98] for biomedical text, SciBERT [11] for scien-
tific text, ClinicalBERT [69, 3] for clinical text. 4.5.3 Parameter Sharing
In addition to pre-training a domain-specific PTM, some Another well-known approach to reduce the number of pa-
work attempts to adapt available pre-trained models to target rameters is parameter sharing, which is widely used in CNNs,
applications, such as biomedical entity normalization [73], RNNs, and Transformer [34]. ALBERT [93] uses cross-layer
patent classification [97], progress notes classification and parameter sharing and factorized embedding parameteriza-
keyword extraction [176]. tion to reduce the parameters of PTMs. Although the number
Some task-oriented pre-training tasks were also proposed, of parameters is greatly reduced, the training and inference
such as sentiment Label-Aware MLM in SentiLR [83] for sen- time of ALBERT are even longer than the standard BERT.
timent analysis, Gap Sentence Generation (GSG) [212] for Generally, parameter sharing does not improve the compu-
text summarization, and Noisy Words Detection for disfluency tational efficiency at inference phase.
detection [192].
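To make the idea of cross-layer parameter sharing concrete, the following is a minimal PyTorch sketch; it is not ALBERT's actual implementation (which additionally factorizes the embedding parameters), and the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """ALBERT-style cross-layer parameter sharing (a sketch): a single
    Transformer layer's parameters are reused for every pass, so depth
    grows without adding parameters."""
    def __init__(self, hidden=768, heads=12, num_layers=12):
        super().__init__()
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):   # the same weights are applied at every depth
            x = self.shared_layer(x)
        return x

encoder = SharedLayerEncoder()
out = encoder(torch.randn(2, 16, 768))     # (batch, seq_len, hidden)
```

Because the same weights are applied at every depth, the parameter count stays that of a single layer, although, as noted above, inference still requires all passes.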
4.5.4 Knowledge Distillation

4.5 Model Compression Knowledge distillation (KD) [61] is a compression technique


in which a small model called student model is trained to re-
Since PTMs usually consist of at least hundreds of millions produce the behaviors of a large model called teacher model.
of parameters, they are difficult to be deployed on the on-line Here the teacher model can be an ensemble of many models
service in real-life applications and on resource-restricted de- and usually well pre-trained. Different to model compres-
vices. Model compression [16] is a potential approach to sion, distillation techniques learn a small student model from
reduce the model size and increase computation efficiency. a fixed teacher model through some optimization objectives,
There are five ways to compress PTMs [46]: (1) model while compression techniques aiming at searching a sparser
pruning, which removes less important parameters, (2) weight architecture.
quantization [40], which uses fewer bits to represent the pa- Generally, distillation mechanisms can be divided into three
rameters, (3) parameter sharing across similar model units, types: (1) distillation from soft target probabilities, (2) dis-
(4) knowledge distillation [61], which trains a smaller student tillation from other knowledge, and (3) distillation to other
model that learns from intermediate outputs from the original structures:
model and (5) module replacing, which replaces the modules (1) Distillation from soft target probabilities. Bucilua et al.
of original PTMs with more compact substitutes. [16] showed that making the student approximate the teacher
Table 3 gives a comparison of some representative com- model can transfer knowledge from teacher to student. A com-
pressed PTMs. mon method is approximating the logits of the teacher model.
DistilBERT [152] trained the student model with a distillation
loss over the soft target probabilities of the teacher as:
$\mathcal{L}_{\text{KD-CE}} = \sum_{i} t_i \cdot \log(s_i)$,    (5)

4.5.1 Model Pruning
Model pruning refers to removing part of neural network (e.g.,
weights, neurons, layers, channels, attention heads), thereby where ti and si are the probabilities estimated by the teacher
achieving the effects of reducing the model size and speeding model and the student, respectively.
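As a concrete reading of Eq. (5), the following PyTorch sketch computes the soft-target objective; the temperature smoothing and the toy batch are illustrative additions rather than part of any specific cited recipe:

```python
import torch
import torch.nn.functional as F

def soft_target_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets t_i and the student's
    predictions s_i (cf. Eq. 5), with temperature smoothing."""
    t = F.softmax(teacher_logits / temperature, dim=-1)           # t_i: teacher probabilities
    log_s = F.log_softmax(student_logits / temperature, dim=-1)   # log s_i: student log-probabilities
    # Negative sum over classes of t_i * log(s_i), averaged over the batch.
    return -(t * log_s).sum(dim=-1).mean()

# Toy usage: a batch of 4 examples with 3 classes.
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)
loss = soft_target_distillation_loss(student_logits, teacher_logits)
loss.backward()
```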
up inference time. Distillation from soft target probabilities can also be used
Gordon et al. [51] explored the timing of pruning (e.g., prun- in task-specific models, such as information retrieval [121],
ing during pre-training, after downstream fine-tuning) and the and sequence labeling [181].
pruning regimes. Michel et al. [128] and Voita et al. [187] (2) Distillation from other knowledge. Distillation from
tried to prune the entire self-attention heads in the transformer soft target probabilities regards the teacher model as a black
block. box and only focus on its outputs. Moreover, decomposing

Table 3: Comparison of Compressed PTMs

Method | Type | #Layer | Loss Function∗ | Speed Up | Params | Source PTM | GLUE‡
BERT_BASE [36] | Baseline | 12 | L_MLM + L_NSP | – | 110M | – | 79.6
BERT_LARGE [36] | Baseline | 24 | L_MLM + L_NSP | – | 340M | – | 81.9
Q-BERT [156] | Quantization | 12 | HAWQ + GWQ | – | – | BERT_BASE | ≈ 99% BERT
Q8BERT [211] | Quantization | 12 | DQ + QAT | – | – | BERT_BASE | ≈ 99% BERT
ALBERT§ [93] | Param. Sharing | 12 | L_MLM + L_SOP | ×5.6 ∼ 0.3 | 12 ∼ 235M | – | 89.4 (ensemble)
DistilBERT [152] | Distillation | 6 | L_KD-CE + Cos_KD + L_MLM | ×1.63 | 66M | BERT_BASE | 77.0 (dev)
TinyBERT§† [75] | Distillation | 4 | MSE_embed + MSE_attn + MSE_hidn + L_KD-CE | ×9.4 | 14.5M | BERT_BASE | 76.5
BERT-PKD [169] | Distillation | 3 ∼ 6 | L_KD-CE + PT_KD + L_Task | ×3.73 ∼ 1.64 | 45.7 ∼ 67M | BERT_BASE | 76.0 ∼ 80.6]
PD [183] | Distillation | 6 | L_KD-CE + L_Task + L_MLM | ×2.0 | 67.5M | BERT_BASE | 81.2]
MobileBERT§ [172] | Distillation | 24 | FMT + AT + PKT + L_KD-CE + L_MLM | ×4.0 | 25.3M | BERT_LARGE | 79.7
MiniLM [194] | Distillation | 6 | AT + AR | ×1.99 | 66M | BERT_BASE | 81.0[
DualTrain§† [216] | Distillation | 12 | Dual Projection + L_MLM | – | 1.8 ∼ 19.2M | BERT_BASE | 75.8 ∼ 81.9\
BERT-of-Theseus [203] | Module Replacing | 6 | L_Task | ×1.94 | 66M | BERT_BASE | 78.6
1
The design of this table is borrowed from [203, 150].
‡ The averaged score on 8 tasks (without WNLI) of the GLUE benchmark (see Section 7.1). Here MNLI-m and MNLI-mm are regarded as two different tasks. ‘dev’ indicates the result is on the dev set; ‘ensemble’ indicates the result is from an ensemble model.
∗ ‘L_MLM’, ‘L_NSP’, and ‘L_SOP’ indicate pre-training objectives (see Section 3.1 and Table 1). ‘L_Task’ means a task-specific loss. ‘HAWQ’, ‘GWQ’, ‘DQ’, and ‘QAT’ indicate Hessian AWare Quantization, Group-wise Quantization, Dynamically Quantized, and Quantization-Aware Training, respectively. ‘KD’ means knowledge distillation. ‘FMT’, ‘AT’, and ‘PKT’ mean Feature Map Transfer, Attention Transfer, and Progressive Knowledge Transfer, respectively. ‘AR’ means self-attention value relation.
§ The dimensionality of the hidden or embedding layers is reduced.
† A smaller vocabulary is used.
[ Generally, the F1 score is usually used as the main metric of the QQP task, but MiniLM reports accuracy, which is not comparable to the other works.
Result on MNLI and SST-2 only.
] Result on the other tasks except for STS-B and CoLA.
\ Result on MRPC, MNLI, and SST-2 only.

the teacher model and distilling more knowledge can bring only requires one task-specific loss function. The compressed
improvement to the student model. model, BERT-of-Theseus, is 1.94× faster while retaining more
TinyBERT [75] performs layer-to-layer distillation with em- than 98% performance of the source model.
bedding outputs, hidden states, and self-attention distributions.
MobileBERT [172] also perform layer-to-layer distillation 4.5.6 Others
with soft target probabilities, hidden states, and self-attention
In addition to reducing model sizes, there are other ways to
distributions. MiniLM [194] distills self-attention distributions
improve the computational efficiency of PTMs in practical
and self-attention value relations from the teacher model.
scenarios with limited resources. Liu et al. [112] proposed a
Besides, other models distill knowledge through many ap-
practical speed-tunable BERT, namely FastBERT, which can
proaches. Sun et al. [169] introduced a “patient” teacher-
dynamically reduce computational steps with sample-wise
student mechanism, and Liu et al. [113] exploited KD to improve
adaptive mechanism.
a pre-trained multi-task deep neural network.
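The following sketch illustrates distillation from other knowledge with a layer-to-layer hidden-state loss in the spirit of TinyBERT and MobileBERT; the layer mapping, the hidden sizes, and the single linear projection are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def hidden_state_distillation_loss(student_hiddens, teacher_hiddens, proj):
    """Layer-to-layer distillation (simplified): each selected student layer is
    pushed, via MSE, towards a corresponding teacher layer. `proj` maps the
    student's smaller hidden size to the teacher's."""
    loss = 0.0
    for h_s, h_t in zip(student_hiddens, teacher_hiddens):
        loss = loss + F.mse_loss(proj(h_s), h_t)
    return loss

# Toy example: a 4-layer student (hidden 312) mimics 4 chosen layers of a
# 12-layer teacher (hidden 768).
proj = torch.nn.Linear(312, 768)
student_hiddens = [torch.randn(2, 16, 312) for _ in range(4)]
teacher_hiddens = [torch.randn(2, 16, 768) for _ in range(4)]
loss = hidden_state_distillation_loss(student_hiddens, teacher_hiddens, proj)
```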
(3) Distillation to other structures. Generally, the structure
of the student model is the same as the teacher model, except
5 Adapting PTMs to Downstream Tasks
for a smaller layer size and a smaller hidden size. However, Although PTMs capture the general language knowledge from
not only decreasing parameters but also simplifying model a large corpus, how effectively adapting their knowledge to
structures from Transformer to RNN [177] or CNN [20] can the downstream task is still a key problem.
reduce the computational complexity.
5.1 Transfer Learning
4.5.5 Module Replacing
Transfer learning [132] is to adapt the knowledge from a
Module replacing is an interesting and simple way to reduce source task (or domain) to a target task (or domain). Fig-
the model size, which replaces the large modules of original ure 4 gives an illustration of transfer learning.
PTMs with more compact substitutes. Xu et al. [203] pro- There are many types of transfer learning in NLP, such as
posed Theseus Compression motivated by a famous thought domain adaptation, cross-lingual learning, multi-task learning.
experiment called “Ship of Theseus”, which progressively Adapting PTMs to downstream tasks is sequential transfer
substitutes modules from the source model with modules of learning task, in which tasks are learned sequentially and the
fewer parameters. Different from KD, Theseus Compression target task has labeled data.

Figure 4: Transfer Learning (the figure shows knowledge being transferred from a source model trained on a source dataset to a target model for a target dataset)
syntactic information appears earlier in the network, while
high-level semantic information appears at higher layers.
Let H(l) (1 ≤ l ≤ L) denote the l-th layer representation
of the pre-trained model with L layers, and g(·) denote the
task-specific model for the target task.
There are three ways to select the representation:
a) Embedding Only. One approach is to choose only the
pre-trained static embeddings, while the rest of the model still
needs to be trained from scratch for a new target task.
5.2 How to Transfer? They fail to capture higher-level information that might
be even more useful. Word embeddings are only useful in
To transfer the knowledge of a PTM to the downstream NLP
capturing semantic meanings of words, but we also need to
tasks, we need to consider the following issues:
understand higher-level concepts like word sense.
b) Top Layer. The most simple and effective way is to feed
5.2.1 Choosing appropriate pre-training task, model archi-
the representation at the top layer into the task-specific model
tecture and corpus
g(H(L) ).
Different PTMs usually have different effects on the same c) All Layers. A more flexible way is to automatic choose
downstream task, since these PTMs are trained with various the best layer in a soft version, like ELMo [135]:
pre-training tasks, model architecture, and corpora.
(1) Currently, the language model is the most popular pre-
training task and can more efficiently solve a wide range of

$r_t = \gamma \sum_{l=1}^{L} \alpha_l \mathbf{h}_t^{(l)}$,    (6)

NLP problems [143]. However, different pre-training tasks where αl is the softmax-normalized weight for layer l and γ is
have their own bias and give different effects for different a scalar to scale the vectors output by pre-trained model. The
tasks. For example, the NSP task [36] makes PTM understand mixup representation is fed into the task-specific model g(rt ).
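A minimal implementation of the soft layer selection in Eq. (6) might look as follows (a sketch; the hidden size and the number of layers are placeholders):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Soft layer selection as in Eq. (6): r_t = gamma * sum_l alpha_l * h_t^(l),
    where the alpha_l are softmax-normalized learned weights."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # unnormalized alpha_l
        self.gamma = nn.Parameter(torch.ones(1))               # global scale gamma

    def forward(self, layer_outputs):
        # layer_outputs: list of L tensors, each of shape (batch, seq_len, hidden)
        alpha = torch.softmax(self.weights, dim=0)
        mixed = sum(a * h for a, h in zip(alpha, layer_outputs))
        return self.gamma * mixed

# Toy usage with L = 12 random "layer representations".
layers = [torch.randn(2, 5, 768) for _ in range(12)]
mix = ScalarMix(num_layers=12)
r = mix(layers)   # shape (2, 5, 768); fed into the task-specific model g(r)
```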
the relationship between two sentences. Thus, the PTM can
benefit downstream tasks such as Question Answering (QA) 5.2.3 To tune or not to tune?
and Natural Language Inference (NLI).
Currently, there are two common ways of model transfer: fea-
(2) The architecture of PTM is also important for the down-
ture extraction (where the pre-trained parameters are frozen),
stream task. For example, although BERT helps with most
and fine-tuning (where the pre-trained parameters are unfrozen
natural language understanding tasks, it is hard to generate
and fine-tuned).
language.
In feature extraction way, the pre-trained models are re-
(3) The data distribution of the downstream task should be
garded as off-the-shelf feature extractors. Moreover, it is im-
close to that of the PTM's pre-training data. Currently, there are a large number of
portant to expose the internal layers as they typically encode
off-the-shelf PTMs, which can just as conveniently be used
the most transferable representations [137].
for various domain-specific or language-specific downstream
Although both these two ways can significantly benefit
tasks.
most of NLP tasks, feature extraction way requires more com-
Therefore, given a target task, it is always a good solution
plex task-specific architecture. Therefore, the fine-tuning way
to choose the PTMs trained with appropriate pre-training task,
is usually more general and convenient for many different
architecture, and corpus.
downstream tasks than feature extraction way.
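The two transfer modes can be contrasted in a short sketch using the Transformers library; the checkpoint name and the linear task head below are only examples:

```python
import torch
from transformers import AutoModel

# Example encoder; any PTM checkpoint name can be substituted.
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)

FEATURE_EXTRACTION = True
if FEATURE_EXTRACTION:
    # Feature extraction: freeze the pre-trained parameters; only the
    # task-specific classifier is trained.
    for p in encoder.parameters():
        p.requires_grad = False
    params = classifier.parameters()
else:
    # Fine-tuning: the whole model is updated, usually with a small learning rate.
    params = list(encoder.parameters()) + list(classifier.parameters())

optimizer = torch.optim.AdamW(params, lr=2e-5)
```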
Table 4 gives some common combinations of adapting
5.2.2 Choosing appropriate layers
PTMs.
Given a pre-trained deep model, different layers should cap-
Table 4: Some common combinations of adapting PTMs.
ture different kinds of information, such as POS tagging, pars-
ing, long-term dependencies, semantic roles, coreference. For Where FT/FE?† PTMs
RNN-based models, Belinkov et al. [10] and Melamud et al. Embedding Only FT/FE Word2vec [129],GloVe [133]
[127] showed that representations learned from different lay- Top Layer FT BERT [36],RoBERTa [117]
ers in a multi-layer LSTM encoder benefit different tasks Top Layer FE BERT§ [218, 221]
(e.g., predicting POS tags and understanding word sense). For All Layers FE ELMo [135]
transformer-based PTMs, Tenney et al. [179] found BERT † FT and FE mean Fine-tuning and Feature Extraction respectively.
represents the steps of the traditional NLP pipeline: basic § BERT used as feature extractor.

5.3 Fine-Tuning Strategies Others Motivated by the success of widely-used ensemble


models, Xu et al. [206] improved the fine-tuning of BERT with
With the increase of the depth of PTMs, the representation cap- two effective mechanisms: self-ensemble and self-distillation,
tured by them makes the downstream task easier. Therefore, which can improve the performance of BERT on downstream
the task-specific layer of the whole model is simple. Since tasks without leveraging external resource or significantly de-
ULMFit and BERT, fine-tuning has become the main adaption creasing the training efficiency. They integrated ensemble and
method of PTMs. However, the process of fine-tuning is often distillation within a single training process. The teacher model
brittle: even with the same hyper-parameter values, distinct is an ensemble model by parameter-averaging several student
random seeds can lead to substantially different results [38]. models in previous time steps.
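A sketch of the self-ensemble teacher is given below; the exponential moving average is one illustrative way to parameter-average recent student steps and is not necessarily the exact schedule of [206]:

```python
import copy
import torch

def update_averaged_teacher(teacher, student, decay=0.999):
    """Keep the teacher as a running parameter average of the student,
    so the teacher ensembles the student's recent time steps."""
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(decay).add_(p_s, alpha=1 - decay)

student = torch.nn.Linear(10, 2)          # stand-in for the fine-tuned PTM
teacher = copy.deepcopy(student)
# ... after each optimizer step on the student:
update_averaged_teacher(teacher, student)
```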
Besides standard fine-tuning, there are also some useful Instead of fine-tuning all the layers simultaneously, grad-
fine-tuning strategies. ual unfreezing [67] is also an effective method that gradu-
ally unfreezes layers of PTMs starting from the top layer.
Two-stage fine-tuning An alternative solution is two-stage Chronopoulou et al. [21] proposed a simpler unfreezing
transfer, which introduces an intermediate stage between pre- method, sequential unfreezing, which first fine-tunes only the
training and fine-tuning. In the first stage, the PTM is trans- randomly-initialized task-specific layers, and then unfreezes
ferred into a model fine-tuned by an intermediate task or cor- the hidden layers of PTM, and finally unfreezes the embedding
pus. In the second stage, the transferred model is fine-tuned to layer.
the target task. Sun et al. [167] showed that the “further pre- Li and Eisner [104] compressed ELMo embeddings us-
training” on the related-domain corpus can further improve ing variational information bottleneck while keeping only the
the ability of BERT and achieved state-of-the-art performance information that helps the target task.
on eight widely-studied text classification datasets. Phang Generally, the above works show that the utility of PTMs
et al. [139] and Garg et al. [48] introduced the intermediate can be further stimulated by better fine-tuning strategies.
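The unfreezing schedules above can be sketched as follows; the attribute names (bert, classifier, encoder, embeddings) are specific to the BERT classes of the Transformers library, and the three stages follow sequential unfreezing in spirit only:

```python
from transformers import AutoModelForSequenceClassification

# Example checkpoint; a BERT-style classifier exposes the attributes used below.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: train only the randomly-initialized classification head.
set_trainable(model.bert, False)
set_trainable(model.classifier, True)

# Stage 2: additionally unfreeze the Transformer layers.
set_trainable(model.bert.encoder, True)

# Stage 3: finally unfreeze the embedding layer as well.
set_trainable(model.bert.embeddings, True)
```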
supervised task related to the target task, which brings a large
improvement for BERT, GPT, and ELMo. Li et al. [106]
also used a two-stage transfer for the story ending prediction.
6 Resources of PTMs
The proposed TransBERT (transferable BERT) can transfer There are many related resources for PTMs available online.
not only general language knowledge from large-scale unla- Table 5 provides some popular repositories, including third-
beled data but also specific kinds of knowledge from various party implementations, paper lists, visualization tools, and
semantically related supervised tasks. other related resources of PTMs.
Besides, there are some other good survey papers on PTMs
Multi-task fine-tuning Liu et al. [114] fine-tuned BERT
for NLP [196, 110, 150].
under the multi-task learning framework, which demonstrates
that multi-task learning and pre-training are complementary
technologies. 7 Applications
In this section, we summarize some applications of PTMs in
Fine-tuning with extra adaptation modules The main
several classic NLP tasks.
drawback of fine-tuning is its parameter inefficiency: every
downstream task has its own fine-tuned parameters. There-
7.1 General Evaluation Benchmark
fore, a better solution is to inject some fine-tunable adaptation
modules into PTMs while the original parameters are fixed. There is an essential issue for the NLP community that how
Stickland and Murray [162] equipped a single share BERT can we evaluate PTMs in a comparable metric. Thus, large-
model with small additional task-specific adaptation modules, scale-benchmark is necessary.
projected attention layers (PALs). The shared BERT with The General Language Understanding Evaluation (GLUE)
the PALs matches separately fine-tuned models on the GLUE benchmark [190] is a collection of nine natural language under-
benchmark with roughly 7 times fewer parameters. Similarly, standing tasks, including single-sentence classification tasks
Houlsby et al. [66] modified the architecture of pre-trained (CoLA and SST-2), pairwise text classification tasks (MNLI,
BERT by adding adapter modules. Adapter modules yield a RTE, WNLI, QQP, and MRPC), text similarity task (STS-
compact and extensible model; they add only a few trainable B), and relevant ranking task (QNLI). GLUE benchmark is
parameters per task, and new tasks can be added without re- well-designed for evaluating the robustness as well as general-
visiting previous ones. The parameters of the original network ization of models. GLUE does not provide the labels for the
remain fixed, yielding a high degree of parameter sharing. test set but set up an evaluation server.
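A bottleneck adapter of the kind used by Houlsby et al. [66] can be sketched as follows (the hidden and bottleneck sizes are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A bottleneck adapter (a sketch): a small down-/up-projection with a
    residual connection, inserted into a frozen PTM so that only the adapter
    parameters are trained for each task."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)
        self.act = nn.GELU()

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))   # residual keeps the PTM features

adapter = Adapter()
h = torch.randn(2, 16, 768)   # hidden states from a frozen PTM layer
out = adapter(h)              # only roughly 2 * 768 * 64 parameters are task-specific
```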

Table 5: Resources of PTMs

Resource Description URL


Open-Source Implementations §
word2vec CBOW,Skip-Gram https://github.com/tmikolov/word2vec
GloVe Pre-trained word vectors https://nlp.stanford.edu/projects/glove
FastText Pre-trained word vectors https://github.com/facebookresearch/fastText
Transformers Framework: PyTorch&TF, PTMs: BERT, GPT-2, RoBERTa, XLNet, etc. https://github.com/huggingface/transformers
Fairseq Framework: PyTorch, PTMs:English LM, German LM, RoBERTa, etc. https://github.com/pytorch/fairseq
Flair Framework: PyTorch, PTMs:BERT, ELMo, GPT, RoBERTa, XLNet, etc. https://github.com/flairNLP/flair
AllenNLP [47] Framework: PyTorch, PTMs: ELMo, BERT, GPT-2, etc. https://github.com/allenai/allennlp
fastNLP Framework: PyTorch, PTMs: RoBERTa, GPT, etc. https://github.com/fastnlp/fastNLP
UniLMs Framework: PyTorch, PTMs: UniLM v1&v2, MiniLM, LayoutLM, etc. https://github.com/microsoft/unilm
Chinese-BERT [29] Framework: PyTorch&TF, PTMs: BERT, RoBERTa, etc. (for Chinese) https://github.com/ymcui/Chinese-BERT-wwm
BERT [36] Framework: TF, PTMs: BERT, BERT-wwm https://github.com/google-research/bert
RoBERTa [117] Framework: PyTorch https://github.com/pytorch/fairseq/tree/master/examples/roberta
XLNet [209] Framework: TF https://github.com/zihangdai/xlnet/
ALBERT [93] Framework: TF https://github.com/google-research/ALBERT
T5 [144] Framework: TF https://github.com/google-research/text-to-text-transfer-transformer
ERNIE(Baidu) [170, 171] Framework: PaddlePaddle https://github.com/PaddlePaddle/ERNIE
CTRL [84] Conditional Transformer Language Model for Controllable Generation. https://github.com/salesforce/ctrl
BertViz [185] Visualization Tool https://github.com/jessevig/bertviz
exBERT [65] Visualization Tool https://github.com/bhoov/exbert
TextBrewer [210] PyTorch-based toolkit for distillation of NLP models. https://github.com/airaria/TextBrewer
DeepPavlov Conversational AI Library. PTMs for the Russian, Polish, Bulgarian, https://github.com/deepmipt/DeepPavlov
Czech, and informal English.
Corpora
OpenWebText Open clone of OpenAI’s unreleased WebText dataset. https://github.com/jcpeterson/openwebtext
Common Crawl A very large collection of text. http://commoncrawl.org/
WikiEn English Wikipedia dumps. https://dumps.wikimedia.org/enwiki/
Other Resources
Paper List https://github.com/thunlp/PLMpapers
Paper List https://github.com/tomohideshibata/BERT-related-papers
Paper List https://github.com/cedrickchee/awesome-bert-nlp
Bert Lang Street A collection of BERT models with reported performances on different https://bertlang.unibocconi.it/
datasets, tasks and languages.
§ Most papers on PTMs release links to their official implementations; here we list some popular third-party and official implementations.

However, motivated by the fact that the progress in recent (HotpotQA) [208].
years has eroded headroom on the GLUE benchmark dra-
BERT creatively transforms the extractive QA task to the
matically, a new benchmark called SuperGLUE [189] was
spans prediction task that predicts the starting span as well
presented. Compared to GLUE, SuperGLUE has more chal-
as the ending span of the answer [36]. After that, PTM as
lenging tasks and more diverse task formats (e.g., coreference
an encoder for predicting spans has become a competitive
resolution and question answering).
baseline. For extractive QA, Zhang et al. [215] proposed a ret-
State-of-the-art PTMs are listed in the corresponding leader-
rospective reader architecture and initialize the encoder with
board4) 5) .
PTM (e.g., ALBERT). For multi-round generative QA, Ju
et al. [78] proposed a “PTM+Adversarial Training+Rationale
7.2 Question Answering Tagging+Knowledge Distillation” model. For multi-hop QA,
Tu et al. [182] proposed an interpretable “Select, Answer, and
Question answering (QA), or a narrower concept machine
Explain” (SAE) system that PTM acts as the encoder in the
reading comprehension (MRC), is an important application in
selection module.
the NLP community. From easy to hard, there are three types
of QA tasks: single-round extractive QA (SQuAD) [145], Generally, encoder parameters in the proposed QA model
multi-round generative QA (CoQA) [147], and multi-hop QA are initialized through a PTM, and other parameters are ran-
4) https://gluebenchmark.com/
5) https://super.gluebenchmark.com/
6) https://rajpurkar.github.io/SQuAD-explorer/
7) https://stanfordnlp.github.io/coqa/
8) https://hotpotqa.github.io/
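The span-prediction formulation used for extractive QA can be sketched as follows; the head below is a simplified stand-in for the BERT-style recipe [36], with an illustrative hidden size:

```python
import torch
import torch.nn as nn

class SpanPredictionHead(nn.Module):
    """Extractive QA as span prediction (a sketch): two linear scores per token
    give start and end logits over the encoded passage."""
    def __init__(self, hidden=768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden, 2)  # one score for start, one for end

    def forward(self, sequence_output):
        # sequence_output: (batch, seq_len, hidden) from a pre-trained encoder
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = logits.split(1, dim=-1)
        return start_logits.squeeze(-1), end_logits.squeeze(-1)

head = SpanPredictionHead()
start, end = head(torch.randn(2, 128, 768))
answer_start = start.argmax(dim=-1)   # predicted start position per example
answer_end = end.argmax(dim=-1)       # predicted end position per example
```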

domly initialized. State-of-the-art models are listed in the for NER.


corresponding leaderboard. 6) 7) 8) Akbik et al. [1] used a pre-trained character-level language
model to produce word-level embedding for NER. TagLM
7.3 Sentiment Analysis [134] and ELMo [135] use a pre-trained language model’s last
layer output and weighted-sum of each layer output as a part
BERT outperforms previous state-of-the-art models by simply of word embedding. Liu et al. [107] used layer-wise pruning
fine-tuning on SST-2, which is a widely used dataset for senti- and dense connection to speed up ELMo’s inference on NER.
ment analysis (SA) [36]. Bataa and Wu [9] utilized BERT with Devlin et al. [36] used the first BPE’s BERT representation
transfer learning techniques and achieve new state-of-the-art to predict each word’s label without CRF. Pires et al. [140]
in Japanese SA. realized zero-shot NER through multilingual BERT. Tsai et al.
Despite their success in simple sentiment classification, [181] leveraged knowledge distillation to run a small BERT
directly applying BERT to aspect-based sentiment analysis for NER on a single CPU. Besides, BERT is also used on
(ABSA), which is a fine-grained SA task, shows less signif- domain-specific NER, such as biomedicine [56, 98], etc.
icant improvement [166]. To better leverage the powerful
representation of BERT, Sun et al. [166] constructed an auxil- 7.5 Machine Translation
iary sentence by transforming ABSA from a single sentence
classification task to a sentence pair classification task. Xu Machine Translation (MT) is an important task in the NLP
et al. [204] proposed post-training to adapt BERT from its community, which has attracted many researchers. Almost
source domain and tasks to the ABSA domain and tasks. Fur- all of Neural Machine Translation (NMT) models share the
thermore, Rietzler et al. [149] extended the work of [204] encoder-decoder framework, which first encodes input tokens
by analyzing the behavior of cross-domain post-training with to hidden representations by the encoder and then decodes
ABSA performance. Karimi et al. [81] showed that the per- output tokens in the target language from the decoder. Ra-
formance of post-trained BERT could be further improved machandran et al. [146] found the encoder-decoder models
via adversarial training. Song et al. [161] added an additional can be significantly improved by initializing both encoder and
pooling module, which can be implemented as either LSTM decoder with pre-trained weights of two language models.
or attention mechanism, to leverage BERT intermediate lay- Edunov et al. [41] used ELMo to set the word embedding
ers for ABSA. In addition, Li et al. [105] jointly learned as- layer in the NMT model. This work shows performance im-
pect detection and sentiment classification towards end-to-end provements on English-Turkish and English-German NMT
ABSA. SentiLR [83] acquires part-of-speech tag and prior sen- model by using a pre-trained language model for source word
timent polarity from SentiWordNet and adopts Label-Aware embedding initialization.
MLM to utilize the introduced linguistic knowledge to capture Given the superb performance of BERT on other NLP
the relationship between sentence-level sentiment labels and tasks, it is natural to investigate how to incorporate BERT into
word-level sentiment shifts. SentiLR achieves state-of-the-art NMT models. Conneau and Lample [27] tried to initialize
performance on several sentence- and aspect-level sentiment the entire encoder and decoder by a multilingual pre-trained
classification tasks. BERT model and showed a significant improvement could be
For sentiment transfer, Wu et al. [200] proposed “Mask achieved on unsupervised MT and English-Romanian super-
and Infill” based on BERT. In the mask step, the model disen- vised MT. Similarly, Clinchant et al. [25] devised a series of
tangles sentiment from content by masking sentiment tokens. different experiments for examining the best strategy to utilize
In the infill step, it uses BERT along with a target sentiment BERT on the encoder part of NMT models. They achieved
embedding to infill the masked positions. some improvement by using BERT as an initialization of the
encoder. Also, they found that these models can get better per-
formance on the out-of-domain dataset. Imamura and Sumita
7.4 Named Entity Recognition
[70] proposed a two stages BERT fine-tuning method for NMT.
Named Entity Recognition (NER) in information extraction At the first stage, the encoder is initialized by a pre-trained
and plays an important role in many NLP downstream tasks. BERT model, and they only train the decoder on the training
In deep learning, most of NER methods are in the sequence- set. At the second stage, the whole NMT model is jointly
labeling framework. The entity information in a sentence fine-tuned on the training set. By experiment, they show this
will be transformed into the sequence of labels, and one label approach can surpass the one stage fine-tuning method, which
corresponds to one word. The model is used to predict the directly fine-tunes the whole model. Apart from that, Zhu et al.
label of each word. Since ELMo and BERT have shown their [221] suggested using pre-trained BERT as an extra memory
power in NLP, there is much work about pre-trained models to facilitate NMT models. Concretely, they first encode the

input tokens by a pre-trained BERT and use the output of the 7.7 Adversarial Attacks and Defenses
last layer as extra memory. Then, the NMT model can access
The deep neural models are vulnerable to adversarial exam-
the memory via an extra attention module in each layer of
ples that can mislead a model to produce a specific wrong
both encoder and decoder. And they show a noticeable im-
prediction with imperceptible perturbations from the origi-
provement in supervised, semi-supervised, and unsupervised
nal input. In CV, adversarial attacks and defenses have been
MT.
widely studied. However, it is still challenging for text due
Instead of only pre-training the encoder, MASS (Masked
to the discrete nature of languages. Generating of adversarial
Sequence-to-Sequence Pre-Training) [160] utilizes Seq2Seq
samples for text needs to possess such qualities: (1) imper-
MLM to pre-train the encoder and decoder jointly. In the
ceptible to human judges yet misleading to neural models; (2)
experiment, this approach can surpass the BERT-style pre-
fluent in grammar and semantically consistent with original in-
training proposed by Conneau and Lample [27] both on un-
puts. Jin et al. [76] successfully attacked the fine-tuned BERT
supervised MT and English-Romanian supervised MT. Dif-
on text classification and textual entailment with adversarial
ferent from MASS, mBART [118], a multilingual extension
examples. Wallace et al. [188] defined universal adversarial
of BART [100], pre-trains the encoder and decoder jointly
triggers that can induce a model to produce a specific-purpose
with Seq2Seq denoising auto-encoder (DAE) task on large-
prediction when concatenated to any input. Some triggers can
scale monolingual corpora across 25 languages. Experiments
even cause the GPT-2 model to generate racist text. Sun et al.
demonstrated that mBART could significantly improve both
[168] showed BERT is not robust on misspellings.
supervised and unsupervised machine translation at both the
PTMs also have great potential to generate adversarial sam-
sentence level and document level.
ples. Li et al. [102] proposed BERT-Attack, a BERT-based
high-quality and effective attacker. They turned BERT against
7.6 Summarization
another fine-tuned BERT on downstream tasks and success-
Summarization, aiming at producing a shorter text which pre- fully misguided the target model to predict incorrectly, out-
serves the most meaning of a longer text, has attracted the performing state-of-the-art attack strategies in both success
attention of the NLP community in recent years. The task rate and perturb percentage, while the generated adversarial
has been improved significantly since the widespread use of samples are fluent and semantically preserved.
PTM. Zhong et al. [218] introduced transferable knowledge Besides, adversarial defenses for PTMs are also promis-
(e.g., BERT) for summarization and surpassed previous mod- ing, which improve the robustness of PTMs and make them
els. Zhang et al. [213] tries to pre-trained a document-level immune against adversarial attack.
model that predicts sentences instead of words, and then apply Adversarial training aims to improve the generalization
it on downstream tasks such as summarization. More elabo- by minimizes the maximal risk for label-preserving perturba-
rately, Zhang et al. [212] designed a Gap Sentence Generation tions in embedding space. Recent work [220, 115] showed
(GSG) task for pre-training, whose objective involves generat- that adversarial pre-training or fine-tuning can improve both
ing summary-like text from the input. Furthermore, Liu and generalization and robustness of PTMs for NLP.
Lapata [116] proposed BERTSUM. BERTSUM included a
novel document-level encoder, and a general framework for
both extractive summarization and abstractive summarization.
8 Future Directions
In the encoder frame, BERTSUM extends BERT by inserting
Though PTMs have proven their power for various NLP tasks,
multiple [CLS] tokens to learn the sentence representations.
challenges still exist due to the complexity of language. In
For extractive summarization, BERTSUM stacks several inter-
this section, we suggest five future directions of PTMs.
sentence Transformer layers. For abstractive summarization,
BERTSUM proposes a two-staged fine-tuning approach using (1) Upper Bound of PTMs Currently, PTMs have not yet
a new fine-tuning schedule. Zhong et al. [219] proposed a reached its upper bound. Most of the current PTMs can be
novel summary-level framework MATCHSUM and conceptu- further improved by more training steps and larger corpora.
alized extractive summarization as a semantic text matching The state of the art in NLP can be further advanced by
problem. They proposed a Siamese-BERT architecture to increasing the depth of models, such as Megatron-LM [157]
compute the similarity between the source document and the (8.3 billion parameters, 72 Transformer layers with a hidden
candidate summary and achieved a state-of-the-art result on size of 3072 and 32 attention heads) and Turing-NLG9) (17
CNN/DailyMail (44.41 in ROUGE-1) by only using the base billion parameters, 78 Transformer layers with a hidden size
version of BERT. of 4256 and 28 attention heads).
9) https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

The general-purpose PTMs are always our pursuits for (4) Knowledge Transfer Beyond Fine-tuning Currently,
learning the intrinsic universal knowledge of languages (even fine-tuning is the dominant method to transfer PTMs’ knowl-
world knowledge). However, such PTMs usually need deeper edge to downstream tasks, but one deficiency is its parameter
architecture, larger corpus, and challenging pre-training tasks, inefficiency: every downstream task has its own fine-tuned
which further result in higher training costs. However, train- parameters. An improved solution is to fix the original pa-
ing huge models is also a challenging problem, which needs rameters of PTMs and by adding small fine-tunable adap-
more sophisticated and efficient training techniques such as tion modules for specific task [162, 66]. Thus, we can use
distributed training, mixed precision, gradient accumulation, a shared PTM to serve multiple downstream tasks. Indeed,
etc. Therefore, a more practical direction is to design more mining knowledge from PTMs can be more flexible, such as
efficient model architecture, self-supervised pre-training tasks, feature extraction, knowledge distillation [210], data augmen-
optimizers, and training skills using existing hardware and tation [199, 91], using PTMs as external knowledge [138].
software. ELECTRA [24] is a good solution towards this More efficient methods are expected.
direction.
(5) Interpretability and Reliability of PTMs Although
(2) Architecture of PTMs The transformer has been proved PTMs reach impressive performance, their deep non-linear
to be an effective architecture for pre-training. However, the architecture makes the procedure of decision-making highly
main limitation of the Transformer is its computation com- non-transparent.
plexity, which is quadratic to the input length. Limited by the Recently, explainable artificial intelligence (XAI) [5] has
memory of GPUs, most of current PTMs cannot deal with become a hotspot in the general AI community. Unlike CNNs
the sequence longer than 512 tokens. Breaking this limit for images, interpreting PTMs is harder due to the complex-
needs to improve the architecture of the Transformer, such ities of both the Transformer-like architecture and language.
as Transformer-XL [31]. Therefore, searching for more ef- Extensive efforts (see Section 3.3) have been made to analyze
ficient model architecture for PTMs is important to capture the linguistic and world knowledge included in PTMs, which
longer-range contextual information. help us understand these PTMs with some degree of trans-
The design of deep architecture is challenging, and we parency. However, much work on model analysis depends on
may seek help from some automatic methods, such as neural the attention mechanism, and the effectiveness of attention for
architecture search (NAS) [223]. interpretability is still controversial [71, 155].
Besides, PTMs are also vulnerable to adversarial attacks
(3) Task-oriented Pre-training and Model Compression
(see Section 7.7). The reliability of PTMs is also becoming
In practice, different downstream tasks require the different
an issue of great concern with the extensive use of PTMs in
abilities of PTMs. The discrepancy between PTMs and down-
production systems. The studies of adversarial attacks against
stream tasks usually lies in two aspects: model architecture
PTMs help us understand their capabilities by fully exposing
and data distribution. A larger discrepancy may result in that
their vulnerabilities. Adversarial defenses for PTMs are also
the benefit of PTMs may be insignificant. For example, text
promising, which improve the robustness of PTMs and make
generation usually needs a specific task to pre-train both the
them immune against adversarial attack.
encoder and decoder, while text matching needs pre-training
Overall, as key components in many NLP applications,
tasks designed for sentence pairs.
the interpretability and reliability of PTMs remain to be ex-
Besides, although larger PTMs can usually lead to better
plored further in many respects, which helps us understand
performance, a practical problem is how to leverage these
how PTMs work and provides a guide for better usage and
huge PTMs on special scenarios, such as low-capacity devices
further improvement.
and low-latency applications. Therefore, we can carefully de-
sign the specific model architecture and pre-training tasks for
downstream tasks or extract partial task-specific knowledge 9 Conclusion
from existing PTMs.
Instead of training task-oriented PTMs from scratch, we In this survey, we conduct a comprehensive overview of
can teach them with existing general-purpose PTMs by us- PTMs for NLP, including background knowledge, model ar-
ing techniques such as model compression (see Section 4.5). chitecture, pre-training tasks, various extensions, adaption
Although model compression is widely studied for CNNs in approaches, related resources, and applications. Based on
CV [18], compression for PTMs for NLP is just beginning. current PTMs, we propose a new taxonomy of PTMs from
The fully-connected structure of the Transformer also makes four different perspectives. We also suggest several possible
model compression more challenging. future research directions for PTMs.

Acknowledgements tian Jauvin. A neural probabilistic language model. Journal


of machine learning research, 3(Feb):1137–1155, 2003.
We thank Zhiyuan Liu, Wanxiang Che, Minlie Huang, Dan- [13] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Rep-
qing Wang and Luyao Huang for their valuable feedback on resentation learning: A review and new perspectives. IEEE
this manuscript. This work was supported by the National transactions on pattern analysis and machine intelligence, 35
Natural Science Foundation of China (No. 61751201 and (8):1798–1828, 2013.
61672162), Shanghai Municipal Science and Technology Ma- [14] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas
jor Project (No. 2018SHZDZX01) and ZJLab. Mikolov. Enriching word vectors with subword information.
TACL, 5:135–146, 2017.
[15] Zied Bouraoui, José Camacho-Collados, and Steven Schock-
References
aert. Inducing relational knowledge from BERT. In AAAI,
2019.
[1] Alan Akbik, Duncan Blythe, and Roland Vollgraf. Contextual
string embeddings for sequence labeling. In COLING, pages [16] Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-
1638–1649, 2018. Mizil. Model compression. In KDD, pages 535–541, 2006.
[2] Chris Alberti, Jeffrey Ling, Michael Collins, and David Re- [17] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy,
itter. Fusion of detected objects in text for visual question Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu.
answering. In EMNLP-IJCNLP, pages 2131–2140, 2019. UNITER: learning universal image-text representations. arXiv
preprint arXiv:1909.11740, 2019.
[3] Emily Alsentzer, John R. Murphy, Willie Boag, Wei-Hung
Weng, Di Jin, Tristan Naumann, and Matthew B. A. McDer- [18] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of
mott. Publicly available clinical BERT embeddings. arXiv model compression and acceleration for deep neural networks.
preprint arXiv:1904.03323, 2019. arXiv preprint arXiv:1710.09282, 2017.
[4] Wissam Antoun, Fady Baly, and Hazem Hajj. AraBERT: [19] Zewen Chi, Li Dong, Furu Wei, Wenhui Wang, Xian-Ling
Transformer-based model for Arabic language understanding. Mao, and Heyan Huang. Cross-lingual natural language gen-
arXiv preprint arXiv:2003.00104, 2020. eration via pre-training. In AAAI, 2019.
[5] Alejandro Barredo Arrieta, Natalia Dı́az-Rodrı́guez, Javier [20] Yew Ken Chia, Sam Witteveen, and Martin Andrews. Trans-
Del Ser, Adrien Bennetot, Siham Tabik, Alberto Barbado, former to CNN: Label-scarce distillation for efficient text
Salvador Garcı́a, Sergio Gil-López, Daniel Molina, Richard classification. arXiv preprint arXiv:1909.03508, 2019.
Benjamins, et al. Explainable artificial intelligence (xai): [21] Alexandra Chronopoulou, Christos Baziotis, and Alexandros
Concepts, taxonomies, opportunities and challenges toward Potamianos. An embarrassingly simple approach for transfer
responsible ai. Information Fusion, 58:82–115, 2020. learning from pretrained language models. In NAACL-HLT,
[6] Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettle- pages 2089–2095, 2019.
moyer, and Michael Auli. Cloze-driven pretraining of self- [22] Yung-Sung Chuang, Chi-Liang Liu, and Hung-yi Lee.
attention networks. In Kentaro Inui, Jing Jiang, Vincent Ng, SpeechBERT: Cross-modal pre-trained language model for
and Xiaojun Wan, editors, EMNLP-IJCNLP, pages 5359– end-to-end spoken question answering. arXiv preprint
5368, 2019. arXiv:1910.11559, 2019.
[7] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. [23] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and
Neural machine translation by jointly learning to align and Yoshua Bengio. Empirical evaluation of gated recurrent
translate. In ICLR, 2014. neural networks on sequence modeling. arXiv preprint
[8] Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, arXiv:1412.3555, 2014.
Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming [24] Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christo-
Zhou, et al. UniLMv2: Pseudo-masked language models pher D. Manning. ELECTRA: Pre-training text encoders as
for unified language model pre-training. arXiv preprint discriminators rather than generators. In ICLR, 2020.
arXiv:2002.12804, 2020. [25] Stephane Clinchant, Kweon Woo Jung, and Vassilina
[9] Enkhbold Bataa and Joshua Wu. An investigation of transfer Nikoulina. On the use of BERT for neural machine translation.
learning-based sentiment analysis in japanese. In ACL, 2019. In Proceedings of the 3rd Workshop on Neural Generation
[10] Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and Translation, Hong Kong, 2019.
and James Glass. What do neural machine translation models [26] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen,
learn about morphology? In ACL, pages 861–872, 2017. Koray Kavukcuoglu, and Pavel P. Kuksa. Natural language
[11] Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pre- processing (almost) from scratch. J. Mach. Learn. Res., 2011.
trained language model for scientific text. In EMNLP-IJCNLP, [27] Alexis Conneau and Guillaume Lample. Cross-lingual lan-
pages 3613–3618, 2019. guage model pretraining. In NeurIPS, pages 7057–7067, 2019.
[12] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Chris-

[28] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav [42] Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, Marcin
Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Kadras, Sylvain Gugger, and Jeremy Howard. MultiFiT: Effi-
Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. cient multi-lingual language model fine-tuning. In EMNLP-
Unsupervised cross-lingual representation learning at scale. IJCNLP, pages 5701–5706, 2019.
arXiv preprint arXiv:1911.02116, 2019. [43] Dumitru Erhan, Yoshua Bengio, Aaron C. Courville, Pierre-
[29] Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why
Shijin Wang, and Guoping Hu. Pre-training with whole word does unsupervised pre-training help deep learning? J. Mach.
masking for chinese BERT. arXiv preprint arXiv:1906.08101, Learn. Res., 11:625–660, 2010.
2019. [44] Allyson Ettinger. What BERT is not: Lessons from a new suite
[30] Andrew M Dai and Quoc V Le. Semi-supervised sequence of psycholinguistic diagnostics for language models. TACL,
learning. In NeurIPS, pages 3079–3087, 2015. 8:34–48, 2020.
[31] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, [45] Manaal Faruqui and Chris Dyer. Improving vector space word
Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Atten- representations using multilingual correlation. In EACL, pages
tive language models beyond a fixed-length context. In ACL, 462–471, 2014.
pages 2978–2988, 2019. [46] Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan,
[32] Joe Davison, Joshua Feldman, and Alexander M. Rush. Com- Yin Yang, Deming Chen, Marianne Winslett, Hassan Sajjad,
monsense knowledge mining from pretrained models. In and Preslav Nakov. Compressing large-scale transformer-
EMNLP-IJCNLP, pages 1173–1178, 2019. based models: A case study on BERT. arXiv preprint
[33] Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, arXiv:2002.11985, 2020.
Tommaso Caselli, Gertjan van Noord, and Malvina Nis- [47] Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord,
sim. BERTje: A Dutch BERT model. arXiv preprint Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael
arXiv:1912.09582, 2019. Schmitz, and Luke S. Zettlemoyer. Allennlp: A deep semantic
[34] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob natural language processing platform. 2017.
Uszkoreit, and Lukasz Kaiser. Universal transformers. In [48] Siddhant Garg, Thuy Vu, and Alessandro Moschitti. Tanda:
ICLR, 2019. Transfer and adapt pre-trained transformer models for answer
[35] Pieter Delobelle, Thomas Winters, and Bettina Berendt. Rob- sentence selection. In AAAI, 2019.
BERT: a Dutch RoBERTa-based language model. arXiv [49] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats,
preprint arXiv:2001.06286, 2020. and Yann N Dauphin. Convolutional sequence to sequence
[36] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina learning. In ICML, pages 1243–1252, 2017.
Toutanova. BERT: pre-training of deep bidirectional trans- [50] Yoav Goldberg. Assessing BERT’s syntactic abilities. arXiv
formers for language understanding. In NAACL-HLT, 2019. preprint arXiv:1901.05287, 2019.
[37] Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, and Yong- [51] Mitchell A Gordon, Kevin Duh, and Nicholas Andrews. Com-
gang Wang. ZEN: pre-training chinese text encoder enhanced pressing BERT: Studying the effects of weight pruning on
by n-gram representations. arXiv preprint arXiv:1911.00720, transfer learning. arXiv preprint arXiv:2002.08307, 2020.
2019. [52] Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie
[38] Jesse Dodge, Gabriel Ilharco, Roy Schwartz, Ali Farhadi, Huang. A knowledge-enhanced pretraining model for com-
Hannaneh Hajishirzi, and Noah Smith. Fine-tuning pretrained monsense story generation. arXiv preprint arXiv:2001.05139,
language models: Weight initializations, data orders, and early 2020.
stopping. arXiv preprint arXiv:2002.06305, 2020. [53] Qipeng Guo, Xipeng Qiu, Pengfei Liu, Yunfan Shao, Xi-
[39] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, angyang Xue, and Zheng Zhang. Star-transformer. In NAACL-
Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. HLT, pages 1315–1325, 2019.
Unified language model pre-training for natural language un- [54] Abhijeet Gupta, Gemma Boleda, Marco Baroni, and Sebastian
derstanding and generation. In NeurIPS, pages 13042–13054, Padó. Distributional vectors encode referential attributes. In
2019. EMNLP, pages 12–21, 2015.
[40] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Ma- [55] Michael Gutmann and Aapo Hyvärinen. Noise-contrastive
honey, and Kurt Keutzer. Hawq: Hessian aware quantization estimation: A new estimation principle for unnormalized sta-
of neural networks with mixed-precision. In ICCV, pages tistical models. In AISTATS, pages 297–304, 2010.
293–302, 2019.
[56] Kai Hakala and Sampo Pyysalo. Biomedical named entity
[41] Sergey Edunov, Alexei Baevski, and Michael Auli. Pre-trained recognition with multilingual BERT. In BioNLP Open Shared
language model representations for language generation. In Tasks@EMNLP, pages 56–61, 2019.
Jill Burstein, Christy Doran, and Thamar Solorio, editors,
[57] Hiroaki Hayashi, Zecong Hu, Chenyan Xiong, and Graham
NAACL-HLT, pages 4052–4059, 2019.
Neubig. Latent relation language models. In AAAI, 2019.

[58] Bin He, Di Zhou, Jinghui Xiao, Xin Jiang, Qun Liu, arXiv:1908.03548, 2019.
Nicholas Jing Yuan, and Tong Xu. Integrating graph contex- [74] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neu-
tualized knowledge into pre-trained language models. arXiv big. How can we know what language models know? arXiv
preprint arXiv:1912.00147, 2019. preprint arXiv:1911.12543, 2019.
[59] John Hewitt and Christopher D. Manning. A structural probe [75] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen,
for finding syntax in word representations. In NAACL-HLT, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling
pages 4129–4138, 2019. BERT for natural language understanding. arXiv preprint
[60] GE Hinton, JL McClelland, and DE Rumelhart. Distributed arXiv:1909.10351, 2019.
representations. In Parallel distributed processing: explo- [76] Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. Is
rations in the microstructure of cognition, vol. 1: foundations, BERT really robust? natural language attack on text classifi-
pages 77–109. 1986. cation and entailment. In AAAI, 2019.
[61] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- [77] Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke
ing the knowledge in a neural network. arXiv preprint Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-
arXiv:1503.02531, 2015. training by representing and predicting spans. Transactions
[62] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing of the Association for Computational Linguistics, 8:64–77,
the dimensionality of data with neural networks. Science, 313 2019.
(5786):504–507, 2006. [78] Ying Ju, Fubang Zhao, Shijie Chen, Bowen Zheng, Xuefeng
[63] R. Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Yang, and Yunfeng Liu. Technical report on conversational
Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua question answering. arXiv preprint arXiv:1909.10772, 2019.
Bengio. Learning deep representations by mutual information [79] Karthikeyan K, Zihan Wang, Stephen Mayhew, and Dan Roth.
estimation and maximization. In ICLR, 2019. Cross-lingual ability of multilingual BERT: An empirical
[64] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term study. In ICLR, 2020.
memory. Neural Computation, 1997. [80] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom.
[65] Benjamin Hoover, Hendrik Strobelt, and Sebastian Gehrmann. A convolutional neural network for modelling sentences. In
exbert: A visual analysis tool to explore learned rep- ACL, 2014.
resentations in transformers models. arXiv preprint [81] Akbar Karimi, Leonardo Rossi, Andrea Prati, and Katharina
arXiv:1910.05276, 2019. Full. Adversarial training for aspect-based sentiment analysis
[66] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna with BERT. arXiv preprint arXiv:2001.11316, 2020.
Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona [82] Nora Kassner and Hinrich Schütze. Negated LAMA: birds
Attariyan, and Sylvain Gelly. Parameter-efficient transfer cannot fly. arXiv preprint arXiv:1911.03343, 2019.
learning for NLP. In ICML, pages 2790–2799, 2019.
[83] Pei Ke, Haozhe Ji, Siyang Liu, Xiaoyan Zhu, and Min-
[67] Jeremy Howard and Sebastian Ruder. Universal language lie Huang. SentiLR: Linguistic knowledge enhanced lan-
model fine-tuning for text classification. In ACL, pages 328– guage representation for sentiment analysis. arXiv preprint
339, 2018. arXiv:1911.02493, 2019.
[68] Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun [84] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caim-
Shou, Daxin Jiang, and Ming Zhou. Unicoder: A universal ing Xiong, and Richard Socher. CTRL: A conditional trans-
language encoder by pre-training with multiple cross-lingual former language model for controllable generation. arXiv
tasks. In EMNLP-IJCNLP, pages 2485–2494, 2019. preprint arXiv:1909.05858, 2019.
[69] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clin- [85] Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang goo Lee.
icalBERT: Modeling clinical notes and predicting hospital Are pre-trained language models aware of phrases? simple
readmission. arXiv preprint arXiv:1904.05342, 2019. but strong baselines for grammar induction. In ICLR, 2020.
[70] Kenji Imamura and Eiichiro Sumita. Recycling a pre-trained [86] Yoon Kim. Convolutional neural networks for sentence classi-
BERT encoder for neural machine translation. In Proceedings fication. In EMNLP, pages 1746–1751, 2014.
of the 3rd Workshop on Neural Generation and Translation,
[87] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M
Hong Kong, November 2019.
Rush. Character-aware neural language models. In AAAI,
[71] Sarthak Jain and Byron C Wallace. Attention is not explana- 2016.
tion. In NAACL-HLT, pages 3543–3556, 2019.
[88] Thomas N Kipf and Max Welling. Semi-supervised classifica-
[72] Ganesh Jawahar, Benoı̂t Sagot, and Djamé Seddah. What does tion with graph convolutional networks. In ICLR, 2017.
BERT learn about the structure of language? In ACL, pages
[89] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard
3651–3657, 2019.
Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler.
[73] Zongcheng Ji, Qiang Wei, and Hua Xu. BERT-based rank- Skip-thought vectors. In NeurIPS, pages 3294–3302, 2015.
ing for biomedical entity normalization. arXiv preprint

[90] Lingpeng Kong, Cyprien de Masson d’Autume, Lei Yu, Wang Ling, Zihang Dai, and Dani Yogatama. A mutual information maximization perspective of language representation learning. In ICLR, 2019.
[91] Varun Kumar, Ashutosh Choudhary, and Eunah Cho. Data augmentation using pre-trained transformer models. arXiv preprint arXiv:2003.02245, 2020.
[92] Yuri Kuratov and Mikhail Arkhipov. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv preprint arXiv:1905.07213, 2019.
[93] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In ICLR, 2020.
[94] Anne Lauscher, Ivan Vulic, Edoardo Maria Ponti, Anna Korhonen, and Goran Glavas. Informing unsupervised pre-training with external linguistic knowledge. arXiv preprint arXiv:1909.02339, 2019.
[95] Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Maximin Coavoux, Benjamin Lecouteux, Alexandre Allauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. FlauBERT: Unsupervised language model pre-training for French. arXiv preprint arXiv:1912.05372, 2019.
[96] Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In ICML, pages 1188–1196, 2014.
[97] Jieh-Sheng Lee and Jieh Hsiang. PatentBERT: Patent classification with fine-tuning a pre-trained BERT model. arXiv preprint arXiv:1906.02124, 2019.
[98] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 2019.
[99] Yoav Levine, Barak Lenz, Or Dagan, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, and Yoav Shoham. SenseBERT: Driving some sense into BERT. arXiv preprint arXiv:1908.05646, 2019.
[100] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
[101] Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In AAAI, 2020.
[102] Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. BERT-ATTACK: Adversarial attack against BERT using BERT. arXiv preprint arXiv:2004.09984, 2020.
[103] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
[104] Xiang Lisa Li and Jason Eisner. Specializing word embeddings (for parsing) by information bottleneck. In EMNLP-IJCNLP, pages 2744–2754, 2019.
[105] Xin Li, Lidong Bing, Wenxuan Zhang, and Wai Lam. Exploiting BERT for end-to-end aspect-based sentiment analysis. In W-NUT@EMNLP, 2019.
[106] Zhongyang Li, Xiao Ding, and Ting Liu. Story ending prediction by transferable BERT. In IJCAI, pages 1800–1806, 2019.
[107] Liyuan Liu, Xiang Ren, Jingbo Shang, Xiaotao Gu, Jian Peng, and Jiawei Han. Efficient contextualized representation: Language model pruning for sequence labeling. In EMNLP, pages 1215–1225, 2018.
[108] Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In NAACL-HLT, pages 1073–1094, 2019.
[109] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Recurrent neural network for text classification with multi-task learning. In IJCAI, 2016.
[110] Qi Liu, Matt J Kusner, and Phil Blunsom. A survey on contextual embeddings. arXiv preprint arXiv:2003.07278, 2020.
[111] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Qi Ju, Haotang Deng, and Ping Wang. K-BERT: Enabling language representation with knowledge graph. In AAAI, 2019.
[112] Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. In ACL, 2020.
[113] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482, 2019.
[114] Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In ACL, 2019.
[115] Xiulei Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020.
[116] Yang Liu and Mirella Lapata. Text summarization with pre-trained encoders. In EMNLP-IJCNLP, 2019.
[117] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[118] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210, 2020.
[119] Robert L. Logan IV, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. Barack’s wife Hillary: Using knowledge graphs for fact-aware language modeling. In ACL, 2019.
[120] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pages 13–23, 2019.
[121] Wenhao Lu, Jian Jiao, and Ruofei Zhang. TwinBERT: Distilling knowledge to twin-structured BERT models for efficient retrieval. arXiv preprint arXiv:2002.06275, 2020.
[122] Huaishao Luo, Lei Ji, Botian Shi, Haoyang Huang, Nan Duan, Tianrui Li, Xilin Chen, and Ming Zhou. UniViLM: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353, 2020.
[123] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159, 2015.
[124] Diego Marcheggiani, Joost Bastings, and Ivan Titov. Exploiting semantics in neural machine translation with graph convolutional networks. In NAACL-HLT, pages 486–492, 2018.
[125] Louis Martin, Benjamin Müller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. CamemBERT: a tasty French language model. arXiv preprint arXiv:1911.03894, 2019.
[126] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In NeurIPS, 2017.
[127] Oren Melamud, Jacob Goldberger, and Ido Dagan. Context2Vec: Learning generic context embedding with bidirectional LSTM. In CoNLL, pages 51–61, 2016.
[128] Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? In NeurIPS, pages 14014–14024, 2019.
[129] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS, 2013.
[130] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.
[131] Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In NeurIPS, pages 2265–2273, 2013.
[132] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.
[133] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[134] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. Semi-supervised sequence tagging with bidirectional language models. In ACL, pages 1756–1765, 2017.
[135] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL-HLT, 2018.
[136] Matthew E. Peters, Mark Neumann, Robert L. Logan IV, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A. Smith. Knowledge enhanced contextual word representations. In EMNLP-IJCNLP, 2019.
[137] Matthew E. Peters, Sebastian Ruder, and Noah A. Smith. To tune or not to tune? adapting pretrained representations to diverse tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP, RepL4NLP@ACL 2019, Florence, Italy, August 2, 2019, pages 7–14, 2019.
[138] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. Language models as knowledge bases? In EMNLP-IJCNLP, pages 2463–2473, 2019.
[139] Jason Phang, Thibault Févry, and Samuel R Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088, 2018.
[140] Telmo Pires, Eva Schlinger, and Dan Garrette. How multilingual is multilingual BERT? In ACL, pages 4996–5001, 2019.
[141] Nina Pörner, Ulli Waltinger, and Hinrich Schütze. BERT is not a knowledge base (yet): Factual knowledge vs. name-based reasoning in unsupervised QA. CoRR, abs/1911.03681, 2019.
[142] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018. URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/languageunderstandingpaper.pdf.
[143] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
[144] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
[145] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh, editors, EMNLP, pages 2383–2392, 2016.
[146] Prajit Ramachandran, Peter J Liu, and Quoc Le. Unsupervised pretraining for sequence to sequence learning. In EMNLP, pages 383–391, 2017.
[147] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. TACL, 7:249–266, 2019.
[148] Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B Viegas, Andy Coenen, Adam Pearce, and Been Kim. Visualizing and measuring the geometry of BERT. In NeurIPS, pages 8592–8600, 2019.
[149] Alexander Rietzler, Sebastian Stabinger, Paul Opitz, and Stefan Engl. Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification. arXiv preprint arXiv:1908.11860, 2019.
[150] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology: What we know about how BERT works. arXiv preprint arXiv:2002.12327, 2020.
[151] Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari Rappoport. How well do distributional models capture different types of semantic knowledge? In ACL, pages 726–730, 2015.
[152] Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[153] Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In ICML, pages 5628–5637, 2019.
[154] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In ACL, 2016.
[155] Sofia Serrano and Noah A Smith. Is attention interpretable? In ACL, pages 2931–2951, 2019.
[156] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. In AAAI, 2020.
[157] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053, 2019.
[158] Karan Singla, Doğan Can, and Shrikanth Narayanan. A multi-task approach to learning multilingual representations. In ACL, pages 214–220, 2018.
[159] Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pages 1631–1642. ACL, 2013.
[160] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: masked sequence to sequence pre-training for language generation. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 5926–5936, 2019.
[161] Youwei Song, Jiahai Wang, Zhiwei Liang, Zhiyue Liu, and Tao Jiang. Utilizing BERT intermediate layers for aspect based sentiment analysis and natural language inference. arXiv preprint arXiv:2002.04815, 2020.
[162] Asa Cooper Stickland and Iain Murray. BERT and PALs: Projected attention layers for efficient adaptation in multi-task learning. In ICML, pages 5986–5995, 2019.
[163] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. VL-BERT: Pre-training of generic visual-linguistic representations. In ICLR, 2020.
[164] Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743, 2019.
[165] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. VideoBERT: A joint model for video and language representation learning. In ICCV, pages 7463–7472. IEEE, 2019.
[166] Chi Sun, Luyao Huang, and Xipeng Qiu. Utilizing BERT for aspect-based sentiment analysis via constructing auxiliary sentence. In NAACL-HLT, 2019.
[167] Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206, 2019.
[168] Lichao Sun, Kazuma Hashimoto, Wenpeng Yin, Akari Asai, Jia Li, Philip Yu, and Caiming Xiong. Adv-BERT: BERT is not robust on misspellings! generating nature adversarial samples on BERT. arXiv preprint arXiv:2003.04985, 2020.
[169] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. In EMNLP-IJCNLP, pages 4323–4332, 2019.
[170] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. ERNIE: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223, 2019.
[171] Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. ERNIE 2.0: A continual pre-training framework for language understanding. In AAAI, 2019.
[172] Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020.
[173] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In NeurIPS, pages 3104–3112, 2014.
[174] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In ACL, pages 1556–1566, 2015.
[175] Hao Tan and Mohit Bansal. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP-IJCNLP, pages 5099–5110, 2019.
[176] Matthew Tang, Priyanka Gandhi, Md Ahsanul Kabir, Christopher Zou, Jordyn Blakey, and Xiao Luo. Progress notes classification and keyword extraction using attention-based deep learning models with BERT. arXiv preprint arXiv:1910.05786, 2019.
[177] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from BERT into simple neural networks. arXiv preprint arXiv:1903.12136, 2019.
[178] Wilson L. Taylor. “cloze procedure”: A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953.
[179] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, ACL, pages 4593–4601, 2019.
[180] Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R. Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R. Bowman, Dipanjan Das, and Ellie Pavlick. What do you learn from context? probing for sentence structure in contextualized word representations. In ICLR, 2019.
[181] Henry Tsai, Jason Riesa, Melvin Johnson, Naveen Arivazhagan, Xin Li, and Amelia Archer. Small and practical BERT models for sequence labeling. In EMNLP-IJCNLP, pages 3632–3636, 2019.
[182] Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bowen Zhou. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. In AAAI, 2020.
[183] Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Well-read students learn better: The impact of student initialization on knowledge distillation. arXiv preprint arXiv:1908.08962, 2019.
[184] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[185] Jesse Vig. A multiscale visualization of attention in the transformer model. In ACL, 2019.
[186] Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076, 2019.
[187] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL, pages 5797–5808, 2019.
[188] Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. Universal adversarial triggers for attacking and analyzing NLP. In EMNLP-IJCNLP, pages 2153–2162, 2019.
[189] Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In NeurIPS, pages 3261–3275, 2019.
[190] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019.
[191] Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Xuanjing Huang, Jianshu Ji, Guihong Cao, Daxin Jiang, and Ming Zhou. K-adapter: Infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808, 2020.
[192] Shaolei Wang, Wanxiang Che, Qi Liu, Pengda Qin, Ting Liu, and William Yang Wang. Multi-task self-supervised learning for disfluency detection. In AAAI, 2019.
[193] Wei Wang, Bin Bi, Ming Yan, Chen Wu, Zuyi Bao, Liwei Peng, and Luo Si. StructBERT: Incorporating language structures into pre-training for deep language understanding. In ICLR, 2020.
[194] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. arXiv preprint arXiv:2002.10957, 2020.
[195] Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and Jian Tang. KEPLER: A unified model for knowledge embedding and pre-trained language representation. arXiv preprint arXiv:1911.06136, 2019.
[196] Yuxuan Wang, Yutai Hou, Wanxiang Che, and Ting Liu. From static to dynamic word representations: a survey. International Journal of Machine Learning and Cybernetics, pages 1–20, 2020.
[197] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph and text jointly embedding. In EMNLP, pages 1591–1601, 2014.
[198] Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, and Qun Liu. NEZHA: Neural contextualized representation for chinese language understanding. arXiv preprint arXiv:1909.00204, 2019.
[199] Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. Conditional BERT contextual augmentation. In International Conference on Computational Science, pages 84–95, 2019.
[200] Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. “mask and infill”: Applying masked language model to sentiment transfer. In IJCAI, 2019.
[201] Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. Representation learning of knowledge graphs with entity descriptions. In IJCAI, 2016.
[202] Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In ICLR, 2020.
[203] Canwen Xu, Wangchunshu Zhou, Tao Ge, Furu Wei, and Ming Zhou. BERT-of-Theseus: Compressing BERT by progressive module replacing. arXiv preprint arXiv:2002.02925, 2020.
[204] Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. BERT post-training for review reading comprehension and aspect-based sentiment analysis. In NAACL-HLT, 2019.
[205] Jiacheng Xu, Xipeng Qiu, Kan Chen, and Xuanjing Huang. Knowledge graph representation with jointly structural and textual encoding. In IJCAI, pages 1318–1324, 2017.
[206] Yige Xu, Xipeng Qiu, Ligao Zhou, and Xuanjing Huang. Improving BERT fine-tuning via self-ensemble and self-distillation. arXiv preprint arXiv:2002.10345, 2020.
[207] An Yang, Quan Wang, Jing Liu, Kai Liu, Yajuan Lyu, Hua Wu, Qiaoqiao She, and Sujian Li. Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In ACL, pages 2346–2357, 2019.
[208] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, pages 2369–2380, 2018.
[209] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, pages 5754–5764, 2019.
[210] Ziqing Yang, Yiming Cui, Zhipeng Chen, Wanxiang Che, Ting Liu, Shijin Wang, and Guoping Hu. TextBrewer: An open-source knowledge distillation toolkit for natural language processing. arXiv preprint arXiv:2002.12620, 2020.
[211] Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8BERT: Quantized 8bit BERT. arXiv preprint arXiv:1910.06188, 2019.
[212] Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. arXiv preprint arXiv:1912.08777, 2019.
[213] Xingxing Zhang, Furu Wei, and Ming Zhou. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In ACL, pages 5059–5069, 2019.
[214] Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, and Qun Liu. ERNIE: enhanced language representation with informative entities. In ACL, 2019.
[215] Zhuosheng Zhang, Junjie Yang, and Hai Zhao. Retrospective reader for machine reading comprehension. arXiv preprint arXiv:2001.09694, 2020.
[216] Sanqiang Zhao, Raghav Gupta, Yang Song, and Denny Zhou. Extreme language model compression with optimal subwords and shared projections. arXiv preprint arXiv:1909.11687, 2019.
[217] Huaping Zhong, Jianwen Zhang, Zhen Wang, Hai Wan, and Zheng Chen. Aligning knowledge and text embeddings by entity descriptions. In EMNLP, pages 267–272, 2015.
[218] Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. Searching for effective neural extractive summarization: What works and what’s next. In ACL, pages 1049–1058, 2019.
[219] Ming Zhong, Pengfei Liu, Yiran Chen, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. Extractive summarization as text matching. In ACL, 2020.
[220] Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. FreeLB: Enhanced adversarial training for natural language understanding. In ICLR, 2020.
[221] Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tieyan Liu. Incorporating BERT into neural machine translation. In ICLR, 2020.
[222] Xiaodan Zhu, Parinaz Sobihani, and Hongyu Guo. Long short-term memory over recursive structures. In ICML, pages 1604–1612, 2015.
[223] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. In ICLR, 2017.