Alternating Language Modeling For Cross-Lingual Pre-Training

Jian Yang,1∗ Shuming Ma,2 Dongdong Zhang,2 ShuangZhi Wu,3∗ Zhoujun Li,1† Ming Zhou2
1 State Key Lab of Software Development Environment, Beihang University
2 Microsoft Research Asia
3 SPPD of Tencent Inc.
{jiaya, lizj}@buaa.edu.cn; {shumma, dozhang, mingzhou}@microsoft.com; [email protected]

∗ Contribution during internship at Microsoft Research Asia.
† Corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Language model pre-training has achieved success in many natural language processing tasks. Existing methods for cross-lingual pre-training adopt the Translation Language Model to predict masked words with the concatenation of the source sentence and its target equivalent. In this work, we introduce a novel cross-lingual pre-training method, called Alternating Language Modeling (ALM). It code-switches sentences of different languages rather than simply concatenating them, hoping to capture the rich cross-lingual context of words and phrases. More specifically, we randomly substitute source phrases with their target translations to create code-switched sentences. Then, we use these code-switched data to train the ALM model to learn to predict words of different languages. We evaluate our pre-trained ALM on the downstream tasks of machine translation and cross-lingual classification. Experiments show that ALM can outperform the previous pre-training methods on three benchmarks. (Code can be found at https://github.com/zddfunseeker/ALM.)

Figure 1: Example of Translation Language Model and Alternating Language Model. (Panel (a) XLM shows the masked concatenation of the English sentence "calls for fresh industrial action" and its Chinese translation; panel (b) ALM shows a single code-switched sequence in which part of the sentence appears in Chinese; the legend distinguishes Chinese tokens from English tokens.)

Introduction

Recently, language model pre-training methods, including ELMo (Peters et al. 2018), GPT (Radford et al. 2018), GPT-2 (Radford et al. 2019), BERT (Devlin et al. 2019), and UniLM (Dong et al. 2019), have achieved impressive results on various natural language processing tasks such as question answering (Min, Seo, and Hajishirzi 2017; Yang et al. 2019a), machine reading comprehension (Salant and Berant 2018; Yu et al. 2018), and natural language inference (Tay, Luu, and Hui 2018). More recently, XLM (Lample and Conneau 2019) has extended this approach to cross-lingual pre-training and proven successful in applying language model pre-training in the cross-lingual setting.

Existing methods for supervised cross-lingual pre-training adopt a cross-lingual language model objective, called Translation Language Model (TLM). It makes use of parallel data by predicting the masked words in the concatenation of a sentence and its translation. In this way, the cross-lingual pre-training model can learn the relationship between languages.

In this work, we propose a novel cross-lingual language model, which alternately predicts words of different languages. Figure 1 shows an example of the proposed Alternating Language Model (ALM). Different from XLM, the input sequence of ALM is a mixture of different languages, so it can capture the rich cross-lingual context of words and phrases. Moreover, it forces the language model to predict one language conditioned on the context of the other language. Therefore, it can narrow the gap between the embeddings of the source language and the target language, which is beneficial for the cross-lingual setting.

Based on the Alternating Language Model, we introduce a new cross-lingual pre-training method. More specifically, we take the Transformer model (Vaswani et al. 2017) as the backbone model. Then, we construct the training examples for pre-training by replacing phrases with their translations in the other language. Finally, we pre-train the Transformer model on the constructed examples using the masked language model objective. The pre-trained model can then be further fine-tuned on the downstream cross-lingual tasks.

To verify the effectiveness of the proposed method, we evaluate our pre-training method on machine translation and cross-lingual classification. Experiments show that ALM can outperform the previous pre-training methods on three benchmark datasets. The contributions of this work are the ALM pre-training objective itself, the construction of code-switched training data from parallel corpora, and the empirical gains over previous pre-training methods on machine translation and cross-lingual classification benchmarks.
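To make the contrast in Figure 1 concrete, the short Python sketch below builds a TLM-style input (the concatenation used by XLM) and an ALM-style input (a single code-switched sequence) for a toy sentence pair; the tokenization, the Chinese words, and the phrase alignment are our own illustration rather than the authors' data or code.

# Toy sentence pair based on Figure 1 ("calls for fresh industrial action");
# the Chinese tokens and the alignment below are our own illustration.
src = ["calls", "for", "fresh", "industrial", "action"]   # English side
tgt = ["呼吁", "新的", "罢工", "行动"]                        # illustrative Chinese side

# TLM (used by XLM): concatenate the two sentences and mask words on both sides.
tlm_input = src + ["</s>"] + tgt

# ALM: substitute an aligned source phrase with its target translation, giving
# one mixed-language sequence; masked words are then predicted from this
# cross-lingual context.
aligned = {(3, 5): ["罢工", "行动"]}             # assumed alignment for "industrial action"
alm_input = src[:3] + aligned[(3, 5)] + src[5:]

print(tlm_input)  # ['calls', 'for', 'fresh', 'industrial', 'action', '</s>', '呼吁', '新的', '罢工', '行动']
print(alm_input)  # ['calls', 'for', 'fresh', '罢工', '行动']

Both inputs would then be masked and fed to a Transformer encoder; the difference lies only in how the bilingual context is presented to the model.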
Figure 2: Overview of our ALM cross-lingual pre-training method. Given a pair of bilingual sentences, we yield a set of cross-lingual sentences. These sentences are used to pre-train the Transformer encoder, which predicts an English masked word or a Chinese one. (The figure pairs a Chinese sentence with its English translation "global monitor and warning satellite system" and lists samples 2 through n, each replacing different phrases with their counterparts in the other language; below, the source tokens x1, ..., x7 and target tokens y1, ..., y8 are shown together with one masked code-switched sequence mixing them.)
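To make the sample-construction step in Figure 2 concrete, here is a minimal Python sketch that turns one aligned sentence pair into a code-switched training sample and then applies BERT-style masking (15% of tokens; 80% [MASK], 10% random, 10% unchanged), as described later in the paper. The toy phrase table, the vocabulary, and all function names are our own illustration, not the released ALM code.

import random

# Toy aligned phrase table for one sentence pair (alignments are assumed).
src = ["global", "monitor", "and", "warning", "satellite", "system"]
phrase_table = {                       # source span -> illustrative Chinese phrase
    (0, 1): ["全球"],
    (1, 2): ["监测"],
    (3, 4): ["报警"],
    (4, 6): ["卫星", "系统"],
}
vocab = src + ["全球", "监测", "报警", "卫星", "系统"]

def code_switch(src, phrase_table, max_phrases=2):
    """Replace up to `max_phrases` aligned source phrases with their translations."""
    spans = random.sample(list(phrase_table), k=min(max_phrases, len(phrase_table)))
    out, i = [], 0
    while i < len(src):
        span = next((s for s in spans if s[0] == i), None)
        if span:
            out.extend(phrase_table[span])   # emit the target-language phrase
            i = span[1]
        else:
            out.append(src[i])               # keep the source-language word
            i += 1
    return out

def mask_tokens(tokens, mask_rate=0.15):
    """BERT-style masking: 80% [MASK], 10% random token, 10% unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for pos in random.sample(range(len(tokens)), k=max(1, int(mask_rate * len(tokens)))):
        labels[pos] = tokens[pos]
        r = random.random()
        if r < 0.8:
            inputs[pos] = "[MASK]"
        elif r < 0.9:
            inputs[pos] = random.choice(vocab)
        # else: keep the original token unchanged
    return inputs, labels

sample = code_switch(src, phrase_table)   # e.g. ['global', '监测', 'and', '报警', 'satellite', 'system']
print(mask_tokens(sample))

In the actual method, the phrase table comes from GIZA word alignments and statistical phrase extraction, substitutions are limited to short phrases covering a minority of the sentence, and each parallel pair yields many such samples by choosing different spans.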
Unsupervised Language Modeling

CLM recurrently predicts the next word given the previous context, which is the typical objective of language modeling. GPT (Radford et al. 2018) is the first pre-training model to adopt CLM, and GPT-2 (Radford et al. 2019) further proves the success of CLM for pre-training.

CLM only makes use of the uni-directional context. Different from CLM, MLM uses bidirectional contextual information. It randomly masks some tokens during training and predicts the identity of the masked words. BERT (Devlin et al. 2019) is the first to propose this model and use it for pre-training. Different from BERT, XLM (Lample and Conneau 2019) uses an arbitrary number of sentences (truncated at 256 tokens) instead of pairs of sentences, and it samples the masked tokens according to a multinomial distribution whose weights are proportional to the square root of their inverse frequencies.

Code-Switched Sequence

Given a bilingual sentence pair (X, Y) with the source sentence X = {x1, x2, ..., xN} and the target translation Y = {y1, y2, ..., yM}, where N and M are the lengths of the source and target sentences, we create the code-switched sequence U = {u1, u2, ..., uL} of length L by composing phrases of X and Y.

In detail, each phrase U[i,j] comes from either a source phrase X[a,b] or a target phrase Y[c,d], under the constraint that the two phrases are translation counterparts in the parallel sentence pair (X, Y), with 1 ≤ a ≤ b ≤ N and 1 ≤ c ≤ d ≤ M. We denote by α the proportion of target-language words in the alternating language sequence U, so that α = 0 corresponds to the pure source sentence and α = 1 to the pure target sentence.

Specifically, the constituents of U fall into four categories:
• Monolingual source language: U is entirely the source sentence, i.e., α = 0.
• Monolingual target language: U is entirely the target sentence, i.e., α = 1.
• Major source language: most of U is derived from X, with some source phrases X[a,b] substituted by their target counterpart phrases Y[c,d] (0 < α < 0.5).
• Major target language: most of U is derived from Y, with some target phrases Y[c,d] substituted by their source counterpart phrases X[a,b] (0.5 ≤ α < 1).

Figure 3: The model architecture of ALM when α = 0 and α = 1. (Each case is a masked Transformer encoder: with α = 0 the masked input consists only of source tokens x1, ..., x8 and the model predicts a masked source word such as x2; with α = 1 the masked input consists only of target tokens y1, ..., y8 and the model predicts a masked target word such as y2.)

Constructing Training Samples

Since there are few natural code-switched sentences, we construct them from bilingual sentence pairs. First, we perform word alignment between the parallel sentences X and Y with the GIZA toolkit (Och and Ney 2003), and extract a bilingual phrase table using statistical machine translation techniques (Koehn, Och, and Marcu 2003). Then, for each sentence pair in the training corpus, we create major-source-language samples by substituting some phrases in the source sentence with the corresponding target phrases that have the highest probabilities in the phrase table. A similar method creates major-target-language samples by substituting some phrases in the target sentence with the corresponding source phrases.

The details of the construction for a sentence pair are:
• Each phrase is limited to fewer than 5 words for both the source language and the target language.
• The substituted words are less than 30% of the total words in the sentence. Therefore, the source words dominate the sentence in the major-source-language samples, while the target words dominate the sentence in the major-target-language samples.
• Each bilingual sentence pair is used to create multiple alternating language sentences by randomly choosing the substituted phrases.

Figure 2 shows an example of constructing code-switched sentences. Given the Chinese sentence and its translation, multiple training samples can be derived from one sentence pair by choosing different phrases to substitute.

Model Architecture and Pre-Training

Figure 2 also shows the overall architecture of our proposed model. Given a parallel sentence pair, we combine the two sentences from different languages into a single code-switched sequence as described above. Then we mask out a certain percentage of words in the sequence. We feed the masked sentences into the Transformer model, which learns to predict the words being masked out.

In detail, we randomly sample 15% of the tokens, replace them with a [MASK] token 80% of the time, with a random token 10% of the time, and keep them unchanged 10% of the time.

Figure 3 shows two special cases of ALM. When α = 0, the input sequence is purely from the source language, so ALM becomes the masked language model for the source language. When α = 1, the input sequence is purely from the target language, so it becomes the masked language model for the target language. In these cases, the model becomes unsupervised because it only relies on monolingual data.

In practice, we have 10% of training samples with α = 0, 10% of samples with α = 1, and the rest with 0 < α < 1. We manually choose a proper value of α which ensures that some phrases are replaced with their counterparts by alignment, instead of sweeping all values of α (0 ≤ α ≤ 1). In order to keep the value of α in a reasonable range, we set a maximum length and a maximum number of phrases for substitution.

Applying to Downstream Tasks

After pre-training, we further fine-tune ALM in order to adapt the parameters to the downstream tasks, which are machine translation and cross-lingual classification.

Machine Translation

After pre-training, we use ALM as the encoder for machine translation and construct a Transformer-based decoder conditioned on ALM. We fine-tune the parameters of the whole encoder-decoder model on the parallel training dataset of machine translation.

Cross-Lingual Classification

XNLI (Conneau et al. 2018) is a significant dataset which is similar to the English MultiNLI but covers several languages. Taking the task of NLI as an example, we concatenate the premise and hypothesis as input and feed them into ALM. On top of ALM, we add a linear classifier and a dropout layer on the first hidden state of the last layer. Then, we fine-tune the parameters of ALM on the training dataset of cross-lingual classification.

Experiments

We evaluate our proposed method on machine translation and cross-lingual text classification. In this section, we provide the details, results, and analysis of the experiments.

Datasets

Following previous work (Lample and Conneau 2019), we use Wikipedia data (extracted with WikiExtractor) and WMT data as monolingual data. For bilingual data, French, Spanish,
Russian, Arabic, and Chinese data are from MultiUN (Ziemski, Junczys-Dowmunt, and Pouliquen 2016). Hindi data is from the IIT Bombay corpus (Kunchukuttan, Mehta, and Bhattacharyya 2018). German and Greek data are from the EUbookshop corpus. Turkish, Vietnamese, and Thai data are from OpenSubtitles 2018. Urdu and Swahili data are from Tanzil; Swahili data is also from GlobalVoices. For most languages, we use the tokenizer provided by Moses (Koehn et al. 2007).

Pre-Training Details

We use byte pair encoding (BPE) (Sennrich, Haddow, and Birch 2016). The vocabulary contains 95K byte pair encoding tokens. We pre-train our model with 1024-dimensional embeddings and hidden units, 8 heads, a dropout rate of 0.1, and learned positional embeddings. We use the Adam optimizer with parameters β1 = 0.9 and β2 = 0.98. We use the inverse square root learning rate schedule with a linear warmup, where the number of warmup steps is 4000 and the peak learning rate is 0.0005.

For pre-training data, we use source-language monolingual data (α = 0) and target-language monolingual data (α = 1). Besides, we also split the parallel data to expand the monolingual data. We regard source-language monolingual data as the α = 0 case and target-language monolingual data as the α = 1 case, which can be viewed as special situations of ALM. To construct the monolingual dataset, we use Wikipedia data extracted with WikiExtractor. Our pre-training samples include both monolingual data and parallel data: from the original parallel data we generate 20 times as many code-switched sentences as original parallel sentence pairs. More specifically, we separately obtain the alternating language sentences of the source language and the target language, which together are 40 times the size of the original parallel data. Considering that there are some bad cases among the alternating language sentences, we filter out low-quality code-switched sentences whose length is too long or too short, and randomly drop some sentences. In the end, nearly 1.5 billion code-switched sentences are used for pre-training.

Fine-Tuning on Machine Translation

We fine-tune the pre-trained ALM on two datasets: WMT14 English-German machine translation and IWSLT14 German-English machine translation. The WMT14 English-German dataset has 4.5 million sentence pairs for training; newsdev2014 is used as the validation set, while newstest2014 is the testing set. The IWSLT14 German-English dataset contains 160 thousand sentence pairs collected from TED talks; we use the iwslt14 dev set as the validation set and the iwslt14 test set as the testing set.

We build a Transformer decoder conditioned on the ALM encoder. We feed the source language into ALM and generate the target language with the decoder. From the pre-trained model, we reload the word embedding and encoder parameters into our in-house NMT code, and the word embeddings are also used to initialize the decoder. We evaluate the performance of the translated sentences; the evaluation metric is BLEU (Papineni et al. 2002).

Baselines

We compare our method with state-of-the-art supervised methods and pre-training methods, which are described as follows:
• Transformer (Vaswani et al. 2017): We implement the Transformer model with our in-house TensorFlow code, and the experimental settings are the same as in Vaswani et al. (2017).
• ConvS2S (Gehring et al. 2017): We report the results from the paper of the convolutional sequence-to-sequence model (ConvS2S).
• Weighted Transformer (Ahmed, Keskar, and Socher 2017): It uses self-attention branches in place of multi-head attention. The branches replace the multiple heads in the attention mechanism of the original Transformer network.
• Layer-wise Transformer (He et al. 2018): It explicitly coordinates the learning of hidden representations of the encoder and decoder, gradually from low level to high level.
• RNMT+ (Chen et al. 2018): It combines the advantages of both the recurrent structure and the Transformer architecture.
• LightConv and DynamicConv (Wu et al. 2019): LightConv uses a lightweight convolution which performs competitively with the best reported self-attention results. Furthermore, the authors introduce dynamic convolutions (DynamicConv), which are simpler and more efficient than self-attention.
• Multilingual BERT (Devlin et al. 2019): Multilingual BERT (mBERT) extends the BERT model to different languages. We download the pre-trained model provided by the authors and fine-tune it on the machine translation datasets.
• XLM (Lample and Conneau 2019): We use the released code (https://github.com/facebookresearch/XLM) and the pre-trained data provided by XLM, and further fine-tune the pre-trained model on the corresponding data.
• MASS (Song et al. 2019): We conduct experiments with the code provided by the authors. We set the masked fragment length k to 50% of the total number of tokens in the sentence.

Details

We fine-tune our ALM with the Adam optimizer (Kingma and Ba 2015) and a linear warmup (Vaswani et al. 2017). We tune the learning rates based on the performance on the validation set; the learning rates are 5 × 10^−4 for IWSLT14 German-English and 10^−3 for WMT14 English-German. We use the averaged perplexity over all languages as the criterion for early stopping. The batch size is set to 8192 tokens for all experiments. During decoding, we set the beam size to 8.

Results

To prove the effectiveness of ALM, we perform experiments on the English-German and German-English
translation tasks. Table 1 and Table 2 show that our ALM achieves significant improvements over baselines both without pre-training and with pre-training.

En → De BLEU(%)
Transformer (Vaswani et al. 2017) 28.40
ConvS2S (Gehring et al. 2017) 25.16
Weighted Transformer (Ahmed, Keskar, and Socher 2017) 28.90
Layer-wise Transformer (He et al. 2018) 29.01
RNMT+ (Chen et al. 2018) 28.50
mBERT (Devlin et al. 2019) 28.64
MASS (Song et al. 2019) 28.92
XLM (Lample and Conneau 2019) 28.88
ALM (this work) 29.22

Table 1: Results on WMT14 English-German machine translation task.

De → En BLEU(%)
Transformer (Vaswani et al. 2017) 34.49
LightConv (Wu et al. 2019) 34.80
DynamicConv (Wu et al. 2019) 35.20
Advsoft (Wang, Gong, and Liu 2019) 35.18
Layer-wise Transformer (He et al. 2018) 35.07
mBERT (Devlin et al. 2019) 34.82
MASS (Song et al. 2019) 35.14
XLM (Lample and Conneau 2019) 35.22
ALM (this work) 35.53

Table 2: Results on IWSLT14 German-English machine translation task.

In Table 1, we report the performance of ALM and the baseline models on the WMT14 English-German machine translation dataset. Transformer is an important baseline, and it obtains a 28.40 BLEU score. We also compare ALM with the convolutional baseline ConvS2S, which achieves 25.16. Weighted Transformer and Layer-wise Transformer are two methods that improve the Transformer model, and they get 28.90 and 29.01 BLEU, respectively. RNMT+ combines the recurrent structure and the multi-head attention components, which yields 28.50 BLEU. Our ALM significantly outperforms these baseline models. We also compare our model with three state-of-the-art pre-training models. mBERT and MASS are unsupervised pre-training models; they achieve 28.64 and 28.92 BLEU, respectively. XLM is a mixture of unsupervised and supervised pre-training, achieving 28.88 BLEU. Our ALM reaches a 29.22 BLEU score, yielding improvements of +0.58, +0.30, and +0.34 BLEU over mBERT, MASS, and XLM, respectively.

In Table 2, we report the performance of ALM and the baseline models on the IWSLT14 German-English machine translation dataset. We first compare our ALM with the supervised models without pre-training. Transformer and its variant Layer-wise Transformer achieve 34.49 and 35.07 BLEU. The convolution-based models, LightConv and DynamicConv, achieve 34.80 and 35.20, respectively. Advsoft gets a BLEU score of 35.18. ALM outperforms these baselines, achieving 35.53 BLEU. We also compare ALM with three pre-training baselines. Our ALM obtains the best performance and reaches a 35.53 BLEU score on this task, outperforming the previous baselines mBERT, MASS, and XLM by +0.71, +0.39, and +0.31 BLEU, respectively.

In general, our ALM achieves significant improvements over all baseline models on the two translation tasks. As our method pre-trains the encoder on a large-scale cross-lingual corpus, the word representations and the encoder can acquire sufficient cross-lingual information. For example, a target phrase can see both its source and target context. This cross-lingual context is helpful for target word generation and for understanding the source sentence in a cross-lingual way.

Fine-Tuning on Cross-Lingual Classification

We fine-tune the pre-trained ALM model on the XNLI dataset to evaluate the effectiveness of our model. We build a linear classifier on top of the pre-trained ALM to project the first hidden state of the ALM output into the probabilities of each class. We concatenate the premise and hypothesis and feed them into ALM. We evaluate the performance of the fine-tuned model on the 15 XNLI languages. Following previous work (Lample and Conneau 2019), we evaluate the model in three different settings: "TRANSLATE-TRAIN", "TRANSLATE-TEST", and "CROSS-LINGUAL TEST". The evaluation metric is the accuracy of the predicted NLI class.

Baselines

We compare our method with three strong baselines, including a supervised method without pre-training and two pre-training methods:
• Conneau: Conneau et al. (2018) propose a BiLSTM model to set up a baseline for XNLI. We report the scores directly from their paper.
• Multilingual BERT (Devlin et al. 2019): Multilingual BERT (mBERT) extends the BERT model to different languages and is also a strong baseline.
• XLM (Lample and Conneau 2019): XLM is the state-of-the-art model for cross-lingual pre-training. We report the results of XLM directly from their paper.

Details

We fine-tune our ALM with the Adam optimizer (Kingma and Ba 2015) with β1 = 0.9 and β2 = 0.997. We tune the learning rate based on the performance on the validation set, and the learning rate is set to 5 × 10^−6. We set the batch size to 24, and we limit sentences to 256 tokens. We set a dropout rate of 0.15 on the last layer. We evaluate our model every 1000 sentences.

Results

Table 3 shows the experimental results of our proposed ALM and the baseline models. Following the work of XNLI (Conneau et al. 2018), we evaluate these models in three different settings: "TRANSLATE-TRAIN", "TRANSLATE-TEST", and "CROSS-LINGUAL TEST". In the "TRANSLATE-TRAIN" setting, the training set of the English MultiNLI dataset is translated into each XNLI language except English.
en fr es de el bg ru tr ar vi th zh hi sw ur avg.
Machine translation baselines (TRANSLATE-TRAIN)
Conneau (Conneau et al. 2018) 73.7 68.3 68.8 66.5 66.4 67.4 66.5 64.5 65.8 66.0 62.8 67.0 62.1 58.2 56.6 65.4
mBERT (Devlin et al. 2019) 81.9 - 77.8 75.9 - - - - 70.7 - - 76.6 - - 61.6 -
XLM (Lample and Conneau 2019) 85.0 80.2 80.8 80.3 78.1 79.3 78.1 74.7 76.5 76.6 75.5 78.6 72.3 70.9 63.2 76.7
ALM (this work) 85.2 81.1 82.0 82.3 78.3 79.8 78.4 74.9 76.7 76.8 75.6 78.7 72.5 71.5 63.4 77.2
Machine translation baselines (TRANSLATE-TEST)
Conneau (Conneau et al. 2018) 73.7 70.4 70.7 68.7 69.1 70.4 67.8 66.3 66.8 66.5 64.4 68.3 64.2 61.8 59.3 67.2
mBERT (Devlin et al. 2019) 81.4 - 74.9 74.4 - - - - 70.4 - - 70.1 - - 62.1 -
XLM (Lample and Conneau 2019) 85.0 79.0 79.5 78.1 77.8 77.6 75.5 73.7 73.7 70.8 70.4 73.6 69.0 64.7 65.1 74.2
ALM (this work) 85.2 79.1 80.0 78.4 78.0 77.8 77.1 73.9 74.2 71.2 70.5 73.8 69.2 64.8 65.3 74.6
Evaluation of cross-lingual sentence encoders (CROSS-LINGUAL TEST)
Conneau (Conneau et al. 2018) 73.7 67.7 68.7 67.7 68.9 67.9 65.4 64.2 64.8 66.4 64.1 65.8 64.1 55.7 58.4 65.6
mBERT (Devlin et al. 2019) 81.4 - 74.3 70.5 - - - - 62.1 - - 63.8 - - 58.3 -
XLM (Lample and Conneau 2019) 85.0 78.7 78.9 77.8 76.6 77.4 75.3 72.5 73.1 76.1 73.2 76.5 69.6 68.4 67.3 75.1
ALM (this work) 85.2 79.3 79.2 78.0 76.7 78.1 76.5 73.0 73.2 76.4 73.5 78.6 69.8 69.0 66.8 75.6
Table 3: Cross-lingual natural language inference (XNLI) test accuracy for the 15 languages.
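As a rough sketch of the XNLI fine-tuning head described above (a linear classifier with dropout applied to the first hidden state of the last encoder layer), the following PyTorch snippet assumes a generic pre-trained encoder module; the class name, the wrapper structure, and the default arguments are our own, not the released implementation.

import torch
from torch import nn

class ALMClassificationHead(nn.Module):
    """Linear classifier with dropout on the first hidden state of the last layer.

    `encoder` is any module mapping token ids to hidden states of shape
    (batch, seq_len, hidden); it stands in for the pre-trained ALM encoder.
    """
    def __init__(self, encoder, hidden_size=1024, num_classes=3, dropout=0.15):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)      # (batch, seq_len, hidden)
        first_state = hidden[:, 0, :]         # first hidden state of the last layer
        return self.classifier(self.dropout(first_state))   # class logits

# Usage sketch: `pair_ids` are the BPE ids of the concatenated premise and
# hypothesis; fine-tuning would minimize cross-entropy on the NLI label.
# model = ALMClassificationHead(pretrained_alm_encoder)
# loss = nn.CrossEntropyLoss()(model(pair_ids), labels)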
We then fine-tune the models on the translated training sets. In the "TRANSLATE-TEST" setting, we translate the testing set of each XNLI language into English and evaluate the performance of the models on each translated testing set. In the "CROSS-LINGUAL TEST" setting, we fine-tune the models on the English XNLI training set and evaluate the performance directly on each testing set. We compare our model with Conneau's baseline model, mBERT, and XLM in these three settings.

In the "CROSS-LINGUAL TEST" setting, our ALM significantly outperforms the baseline models. More precisely, ALM obtains 75.6% accuracy on average, while Conneau's baseline achieves 65.6% accuracy and XLM gets 75.1%. On Russian and Turkish, we outperform the baselines by 1.2% and 0.5%, respectively. ALM gets 85.2% accuracy on the English testing set, outperforming Conneau's baseline model by 11.5%, mBERT by 3.8%, and XLM by 0.2% in terms of accuracy.

In the "TRANSLATE-TRAIN" setting, our ALM reaches 77.2% accuracy on average across the different languages, which indicates that ALM can be fine-tuned for any language to achieve good performance. On German and French, we outperform the baselines by 2.0% and 1.9%, respectively. Besides, our ALM achieves higher accuracy than XLM in all 15 languages.

In the "TRANSLATE-TEST" setting, our ALM obtains 74.6% average accuracy, while Conneau's baseline achieves 67.2% accuracy and XLM gets 74.2%. In general, our ALM outperforms these three baselines across the different experimental settings.

Discussions and Analysis

We further analyze the advantages of our pre-trained model. We visualize the distribution of our model's word embeddings and compare it with that of the Transformer baseline model. We also evaluate the performance of our ALM given different amounts of parallel data, in order to analyze the benefits of pre-training in the low-resource setting.

Word Embedding Distribution

Figure 4 shows the word embedding distributions of Transformer (without pre-training) and ALM (with pre-training). We project the learned word embeddings from the high-dimensional space to 2 dimensions with PCA. We plot both the Chinese word embeddings and the English word embeddings in the same space. The hollow circles denote Chinese words, while the solid circles denote English words.

For the Transformer baseline, the distribution of the Chinese word embeddings is very different from that of the English word embeddings. We draw a dashed line to illustrate the separation of the Chinese word embeddings from the English word embeddings.

For the pre-trained ALM, the distribution of the Chinese word embeddings is similar to that of the English word embeddings. The reason is that we mix Chinese words and English words during training, so the embeddings of both the source language and the target language can distribute in the same space.

Figure 4 also indicates that source words and their translated target words are closer to each other than in the Transformer baseline model. There are some cases which are very close to each other in ALM's embedding space but far from each other in the Transformer's embedding space. We conclude that the ALM pre-training method can narrow the gap between the embeddings of the source language and the target language, which is beneficial for the cross-lingual setting.

Low Resource Setting

We would like to further analyze the performance of our pre-trained ALM given different sizes of parallel data. Therefore, we randomly shuffle the full parallel training set of the IWSLT14 German-to-English translation task. Then, we extract a random K% of the samples as the fine-tuning parallel data, with K = {10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%}, and compare our ALM with the Transformer baseline model. Figure 5 shows the BLEU scores of our models and the baseline. When the parallel data size is small,
ALM can outperform the Transformer model by a large margin. As the amount of parallel data increases, the margin narrows because of the upper bound of the model capacity. We conclude that ALM pre-training can benefit the performance of the Transformer model, especially when the training samples are not sufficient.

[Figure 4: 2-dimensional PCA projections of the word embeddings; panel (a) shows the Transformer baseline and the other panel shows ALM; the legend distinguishes Chinese tokens from English tokens.]

Figure 5: Results of ALM vs Transformer fine-tuning on low-resource data. (Axes: BLEU against the ratio of parallel data, 10% to 100%, for ALM and the Transformer baseline.)

Related Work

Pre-training and transfer learning are widely used in many tasks of natural language processing. ELMo (Peters et al. 2018) is proposed as a kind of deep contextualized word representation that is pre-trained on a large-scale corpus and can be transferred to other tasks. Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder 2018) is an effective transfer learning method that can be applied to any task in NLP, and includes techniques that are key for fine-tuning a language model. BERT (Devlin et al. 2019) achieves state-of-the-art performance among various pre-training approaches to monolingual NLP tasks. Furthermore, XLM and MASS (Song et al. 2019) achieve further success in language understanding by pre-training. Unlike BERT, which pre-trains only the encoder or the decoder, MASS is carefully designed to pre-train the encoder and decoder jointly: the input tokens of a sentence fragment are masked on the encoder side, and the decoder predicts the masked tokens of that fragment.

Conclusions

In this work, we propose a novel cross-lingual pre-training method, called Alternating Language Modeling (ALM). First, we randomly substitute source phrases with their target equivalents to create code-switched sentences. Then, we use these code-switched data to train the ALM model to learn to predict words of different languages. We evaluate our pre-trained ALM on the downstream tasks of machine translation and cross-lingual classification. Experiments show that ALM can outperform the previous pre-training methods on three benchmark datasets. In future work, we will explore the effect of using code-switched sentences for a MASS-like pre-training method.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant Nos. U1636211, 61672081, 61370126), the Beijing Advanced Innovation Center for Imaging Technology (No. BAICIT-2016001), and the Fund of the State Key Laboratory of Software Development Environment (No. SKLSDE-2019ZX-17).

References

Ahmed, K.; Keskar, N. S.; and Socher, R. 2017. Weighted transformer network for machine translation. CoRR abs/1711.02132.
Chen, M. X.; Firat, O.; Bapna, A.; Johnson, M.; Macherey, W.; Foster, G.; Jones, L.; Schuster, M.; Shazeer, N.; Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Chen, Z.; Wu, Y.; and Hughes, M. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL 2018, 76–86.
Conneau, A.; Rinott, R.; Lample, G.; Williams, A.; Bowman, S. R.; Schwenk, H.; and Stoyanov, V. 2018. XNLI: Evaluating cross-lingual sentence representations. In EMNLP 2018, 2475–2485.
Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019, 4171–4186.
Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H. 2019. Unified language model pre-training for natural language understanding and generation. CoRR abs/1905.03197.
Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. In ICML 2017, 1243–1252.
He, T.; Tan, X.; Xia, Y.; He, D.; Qin, T.; Chen, Z.; and Liu, T. 2018. Layer-wise coordination between encoder and decoder for neural machine translation. In NeurIPS 2018, 7955–7965.
Howard, J., and Ruder, S. 2018. Universal language model fine-tuning for text classification. In ACL 2018, 328–339.
Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR 2015.
Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; Dyer, C.; Bojar, O.; Constantin, A.; and Herbst, E. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007, 177–180.
Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In NAACL 2003, 48–54.
Kunchukuttan, A.; Mehta, P.; and Bhattacharyya, P. 2018. The IIT Bombay English-Hindi parallel corpus. In LREC 2018, 3473–3476.
Lample, G., and Conneau, A. 2019. Cross-lingual language model pretraining. CoRR abs/1901.07291.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
Min, S.; Seo, M. J.; and Hajishirzi, H. 2017. Question answering through transfer learning from large fine-grained supervision data. In ACL 2017, 510–517.
Och, F. J., and Ney, H. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL 2002, 311–318.
Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL 2018, 2227–2237.
Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
Salant, S., and Berant, J. 2018. Contextualized word representations for reading comprehension. In NAACL 2018, 554–559.
Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In ACL 2016, 1715–1725.
Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T. 2019. MASS: Masked sequence to sequence pre-training for language generation. In ICML 2019, 5926–5936.
Tay, Y.; Luu, A. T.; and Hui, S. C. 2018. Compare, compress and propagate: Enhancing neural architectures with alignment factorization for natural language inference. In EMNLP 2018, 1565–1575.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In NIPS 2017, 5998–6008.
Wang, D.; Gong, C.; and Liu, Q. 2019. Improving neural language modeling via adversarial training. In ICML 2019, 6555–6565.
Wu, F.; Fan, A.; Baevski, A.; Dauphin, Y. N.; and Auli, M. 2019. Pay less attention with lightweight and dynamic convolutions. In ICLR 2019.
Yang, W.; Xie, Y.; Lin, A.; Li, X.; Tan, L.; Xiong, K.; Li, M.; and Lin, J. 2019a. End-to-end open-domain question answering with BERTserini. In NAACL 2019, 72–77.
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J. G.; Salakhutdinov, R.; and Le, Q. V. 2019b. XLNet: Generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.
Yu, S.; Indurthi, S. R.; Back, S.; and Lee, H. 2018. A multi-stage memory augmented neural network for machine reading comprehension. In ACL 2018, 21–30.
Ziemski, M.; Junczys-Dowmunt, M.; and Pouliquen, B. 2016. The United Nations parallel corpus v1.0. In LREC 2016, 3530–3534.