Alternating Language Modeling For Cross-Lingual Pre-Training

Jian Yang,1∗ Shuming Ma,2 Dongdong Zhang,2 ShuangZhi Wu,3∗ Zhoujun Li,1† Ming Zhou2
1 State Key Lab of Software Development Environment, Beihang University
2 Microsoft Research Asia
3 SPPD of Tencent Inc.
{jiaya, lizj}@buaa.edu.cn; {shumma, dozhang, mingzhou}@microsoft.com; [email protected]

∗ Contribution during internship at Microsoft Research Asia.
† Corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Language model pre-training has achieved success in many natural language processing tasks. Existing methods for cross-lingual pre-training adopt the Translation Language Model to predict masked words with the concatenation of the source sentence and its target equivalent. In this work, we introduce a novel cross-lingual pre-training method, called Alternating Language Modeling (ALM). It code-switches sentences of different languages rather than simply concatenating them, hoping to capture the rich cross-lingual context of words and phrases. More specifically, we randomly substitute source phrases with their target translations to create code-switched sentences. Then, we use these code-switched data to train the ALM model to learn to predict words of different languages. We evaluate our pre-trained ALM on the downstream tasks of machine translation and cross-lingual classification. Experiments show that ALM can outperform the previous pre-training methods on three benchmarks. (Code can be found at https://github.com/zddfunseeker/ALM.)

Figure 1: Example of Translation Language Model and Alternating Language Model. (Panel (a) XLM shows the masked concatenation of the English sentence "calls for fresh industrial action" and its Chinese translation; panel (b) ALM shows a single code-switched sequence in which part of the sentence appears in Chinese; the legend distinguishes Chinese tokens from English tokens.)

Introduction

Recently, language model pre-training methods, including ELMo (Peters et al. 2018), GPT (Radford et al. 2018), GPT-2 (Radford et al. 2019), BERT (Devlin et al. 2019), and UniLM (Dong et al. 2019), have achieved impressive results on various natural language processing tasks such as question answering (Min, Seo, and Hajishirzi 2017; Yang et al. 2019a), machine reading comprehension (Salant and Berant 2018; Yu et al. 2018), and natural language inference (Tay, Luu, and Hui 2018). More recently, XLM (Lample and Conneau 2019) has extended this approach to cross-lingual pre-training and proven successful in applying language model pre-training in the cross-lingual setting.

Existing methods for supervised cross-lingual pre-training adopt a cross-lingual language model objective, called Translation Language Model (TLM). It makes use of parallel data by predicting the masked words in the concatenation of a sentence and its translation. In this way, the cross-lingual pre-training model can learn the relationship between languages.

In this work, we propose a novel cross-lingual language model, which alternately predicts words of different languages. Figure 1 shows an example of the proposed Alternating Language Model (ALM). Different from XLM, the input sequence of ALM is a mixture of different languages, so it can capture the rich cross-lingual context of words and phrases. Moreover, it forces the language model to predict one language conditioned on the context of the other language. Therefore, it can narrow the gap between the embeddings of the source language and the target language, which is beneficial for the cross-lingual setting.

Based on the Alternating Language Model, we introduce a new cross-lingual pre-training method. More specifically, we take the Transformer model (Vaswani et al. 2017) as the backbone model. Then, we construct the training examples for pre-training by replacing phrases with their translations in the other language. Finally, we pre-train the Transformer model on the constructed examples using the masked language model objective. The pre-trained model can then be further fine-tuned on the downstream cross-lingual tasks.

To verify the effectiveness of the proposed method, we evaluate our pre-training method on machine translation and cross-lingual classification. Experiments show that ALM can outperform the previous pre-training methods on three benchmark datasets. The contributions of this work are the ALM pre-training objective itself, the construction of code-switched training data from parallel corpora, and the empirical gains over previous pre-training methods on machine translation and cross-lingual classification benchmarks.
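To make the contrast in Figure 1 concrete, the short Python sketch below builds a TLM-style input (the concatenation used by XLM) and an ALM-style input (a single code-switched sequence) for a toy sentence pair; the tokenization, the Chinese words, and the phrase alignment are our own illustration rather than the authors' data or code.

# Toy sentence pair based on Figure 1 ("calls for fresh industrial action");
# the Chinese tokens and the alignment below are our own illustration.
src = ["calls", "for", "fresh", "industrial", "action"]   # English side
tgt = ["呼吁", "新的", "罢工", "行动"]                        # illustrative Chinese side

# TLM (used by XLM): concatenate the two sentences and mask words on both sides.
tlm_input = src + ["</s>"] + tgt

# ALM: substitute an aligned source phrase with its target translation, giving
# one mixed-language sequence; masked words are then predicted from this
# cross-lingual context.
aligned = {(3, 5): ["罢工", "行动"]}             # assumed alignment for "industrial action"
alm_input = src[:3] + aligned[(3, 5)] + src[5:]

print(tlm_input)  # ['calls', 'for', 'fresh', 'industrial', 'action', '</s>', '呼吁', '新的', '罢工', '行动']
print(alm_input)  # ['calls', 'for', 'fresh', '罢工', '行动']

Both inputs would then be masked and fed to a Transformer encoder; the difference lies only in how the bilingual context is presented to the model.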
Figure 2: Overview of our ALM cross-lingual pre-training method. Given a pair of bilingual sentences, we yield a set of cross-lingual sentences. These sentences are used to pre-train the Transformer encoder, which predicts an English masked word or a Chinese one. (The figure pairs a Chinese sentence with its English translation "global monitor and warning satellite system" and lists samples 2 through n, each replacing different phrases with their counterparts in the other language; below, the source tokens x1, ..., x7 and target tokens y1, ..., y8 are shown together with one masked code-switched sequence mixing them.)
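To make the sample-construction step in Figure 2 concrete, here is a minimal Python sketch that turns one aligned sentence pair into a code-switched training sample and then applies BERT-style masking (15% of tokens; 80% [MASK], 10% random, 10% unchanged), as described later in the paper. The toy phrase table, the vocabulary, and all function names are our own illustration, not the released ALM code.

import random

# Toy aligned phrase table for one sentence pair (alignments are assumed).
src = ["global", "monitor", "and", "warning", "satellite", "system"]
phrase_table = {                       # source span -> illustrative Chinese phrase
    (0, 1): ["全球"],
    (1, 2): ["监测"],
    (3, 4): ["报警"],
    (4, 6): ["卫星", "系统"],
}
vocab = src + ["全球", "监测", "报警", "卫星", "系统"]

def code_switch(src, phrase_table, max_phrases=2):
    """Replace up to `max_phrases` aligned source phrases with their translations."""
    spans = random.sample(list(phrase_table), k=min(max_phrases, len(phrase_table)))
    out, i = [], 0
    while i < len(src):
        span = next((s for s in spans if s[0] == i), None)
        if span:
            out.extend(phrase_table[span])   # emit the target-language phrase
            i = span[1]
        else:
            out.append(src[i])               # keep the source-language word
            i += 1
    return out

def mask_tokens(tokens, mask_rate=0.15):
    """BERT-style masking: 80% [MASK], 10% random token, 10% unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for pos in random.sample(range(len(tokens)), k=max(1, int(mask_rate * len(tokens)))):
        labels[pos] = tokens[pos]
        r = random.random()
        if r < 0.8:
            inputs[pos] = "[MASK]"
        elif r < 0.9:
            inputs[pos] = random.choice(vocab)
        # else: keep the original token unchanged
    return inputs, labels

sample = code_switch(src, phrase_table)   # e.g. ['global', '监测', 'and', '报警', 'satellite', 'system']
print(mask_tokens(sample))

In the actual method, the phrase table comes from GIZA word alignments and statistical phrase extraction, substitutions are limited to short phrases covering a minority of the sentence, and each parallel pair yields many such samples by choosing different spans.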
Unsupervised Language Modeling

CLM recurrently predicts the next word given the previous context, which is the typical objective of language modeling. GPT (Radford et al. 2018) is the first pre-training model to adopt CLM, and GPT-2 (Radford et al. 2019) further proves the success of CLM for pre-training.

CLM only makes use of the uni-directional context. Different from CLM, MLM uses bidirectional contextual information. It randomly masks some tokens during training and predicts the identity of the masked words. BERT (Devlin et al. 2019) is the first to propose this model and use it for pre-training. Different from BERT, XLM (Lample and Conneau 2019) uses an arbitrary number of sentences (truncated at 256 tokens) instead of pairs of sentences, and it samples the masked tokens according to a multinomial distribution whose weights are proportional to the square root of their inverse frequencies.

Code-Switched Sequence

Given a bilingual sentence pair (X, Y) with the source sentence X = {x1, x2, ..., xN} and the target translation Y = {y1, y2, ..., yM}, where N and M are the lengths of the source and target sentences, we create the code-switched sequence U = {u1, u2, ..., uL} of length L by composing phrases of X and Y.

In detail, each phrase U[i,j] comes from either a source phrase X[a,b] or a target phrase Y[c,d], under the constraint that the two phrases are translation counterparts in the parallel sentence pair (X, Y), with 1 ≤ a ≤ b ≤ N and 1 ≤ c ≤ d ≤ M. We denote by α the proportion of target-language words in the alternating language sequence U, so that α = 0 corresponds to the pure source sentence and α = 1 to the pure target sentence.

Specifically, the constituents of U fall into four categories:
• Monolingual source language: U is entirely the source sentence, i.e., α = 0.
• Monolingual target language: U is entirely the target sentence, i.e., α = 1.
• Major source language: most of U is derived from X, with some source phrases X[a,b] substituted by their target counterpart phrases Y[c,d] (0 < α < 0.5).
• Major target language: most of U is derived from Y, with some target phrases Y[c,d] substituted by their source counterpart phrases X[a,b] (0.5 ≤ α < 1).

Figure 3: The model architecture of ALM when α = 0 and α = 1. (Each case is a masked Transformer encoder: with α = 0 the masked input consists only of source tokens x1, ..., x8 and the model predicts a masked source word such as x2; with α = 1 the masked input consists only of target tokens y1, ..., y8 and the model predicts a masked target word such as y2.)

Constructing Training Samples

Since there are few natural code-switched sentences, we construct them from bilingual sentence pairs. First, we perform word alignment between the parallel sentences X and Y with the GIZA toolkit (Och and Ney 2003), and extract a bilingual phrase table using statistical machine translation techniques (Koehn, Och, and Marcu 2003). Then, for each sentence pair in the training corpus, we create major-source-language samples by substituting some phrases in the source sentence with the corresponding target phrases that have the highest probabilities in the phrase table. A similar method creates major-target-language samples by substituting some phrases in the target sentence with the corresponding source phrases.

The details of the construction for a sentence pair are:
• Each phrase is limited to fewer than 5 words for both the source language and the target language.
• The substituted words are less than 30% of the total words in the sentence. Therefore, the source words dominate the sentence in the major-source-language samples, while the target words dominate the sentence in the major-target-language samples.
• Each bilingual sentence pair is used to create multiple alternating language sentences by randomly choosing the substituted phrases.

Figure 2 shows an example of constructing code-switched sentences. Given the Chinese sentence and its translation, multiple training samples can be derived from one sentence pair by choosing different phrases to substitute.

Model Architecture and Pre-Training

Figure 2 also shows the overall architecture of our proposed model. Given a parallel sentence pair, we combine the two sentences from different languages into a single code-switched sequence as described above. Then we mask out a certain percentage of words in the sequence. We feed the masked sentences into the Transformer model, which learns to predict the words being masked out.

In detail, we randomly sample 15% of the tokens, replace them with a [MASK] token 80% of the time, with a random token 10% of the time, and keep them unchanged 10% of the time.

Figure 3 shows two special cases of ALM. When α = 0, the input sequence is purely from the source language, so ALM becomes the masked language model for the source language. When α = 1, the input sequence is purely from the target language, so it becomes the masked language model for the target language. In these cases, the model becomes unsupervised because it only relies on monolingual data.

In practice, we have 10% of training samples with α = 0, 10% of samples with α = 1, and the rest with 0 < α < 1. We manually choose a proper value of α which ensures that some phrases are replaced with their counterparts by alignment, instead of sweeping all values of α (0 ≤ α ≤ 1). In order to keep the value of α in a reasonable range, we set a maximum length and a maximum number of phrases for substitution.

Applying to Downstream Tasks

After pre-training, we further fine-tune ALM in order to adapt the parameters to the downstream tasks, which are machine translation and cross-lingual classification.

Machine Translation

After pre-training, we use ALM as the encoder for machine translation and construct a Transformer-based decoder conditioned on ALM. We fine-tune the parameters of the whole encoder-decoder model on the parallel training dataset of machine translation.

Cross-Lingual Classification

XNLI (Conneau et al. 2018) is a significant dataset which is similar to the English MultiNLI but covers several languages. Taking the task of NLI as an example, we concatenate the premise and hypothesis as input and feed them into ALM. On top of ALM, we add a linear classifier and a dropout layer on the first hidden state of the last layer. Then, we fine-tune the parameters of ALM on the training dataset of cross-lingual classification.

Experiments

We evaluate our proposed method on machine translation and cross-lingual text classification. In this section, we provide the details, results, and analysis of the experiments.

Datasets

Following previous work (Lample and Conneau 2019), we use Wikipedia data (extracted with WikiExtractor) and WMT data as monolingual data. For bilingual data, French, Spanish,
Russian, Arabic, and Chinese data are from MultiUN (Ziemski, Junczys-Dowmunt, and Pouliquen 2016). Hindi data is from the IIT Bombay corpus (Kunchukuttan, Mehta, and Bhattacharyya 2018). German and Greek data are from the EUbookshop corpus. Turkish, Vietnamese, and Thai data are from OpenSubtitles 2018. Urdu and Swahili data are from Tanzil; Swahili data is also from GlobalVoices. For most languages, we use the tokenizer provided by Moses (Koehn et al. 2007).

Pre-Training Details

We use byte pair encoding (BPE) (Sennrich, Haddow, and Birch 2016). The vocabulary contains 95K byte pair encoding tokens. We pre-train our model with 1024-dimensional embeddings and hidden units, 8 heads, a dropout rate of 0.1, and learned positional embeddings. We use the Adam optimizer with parameters β1 = 0.9 and β2 = 0.98. We use the inverse square root learning rate schedule with a linear warmup, where the number of warmup steps is 4000 and the peak learning rate is 0.0005.

For pre-training data, we use source-language monolingual data (α = 0) and target-language monolingual data (α = 1). Besides, we also split the parallel data to expand the monolingual data. We regard source-language monolingual data as the α = 0 case and target-language monolingual data as the α = 1 case, which can be viewed as special situations of ALM. To construct the monolingual dataset, we use Wikipedia data extracted with WikiExtractor. Our pre-training samples include both monolingual data and parallel data: from the original parallel data we generate 20 times as many code-switched sentences as original parallel sentence pairs. More specifically, we separately obtain the alternating language sentences of the source language and the target language, which together are 40 times the size of the original parallel data. Considering that there are some bad cases among the alternating language sentences, we filter out low-quality code-switched sentences whose length is too long or too short, and randomly drop some sentences. In the end, nearly 1.5 billion code-switched sentences are used for pre-training.

Fine-Tuning on Machine Translation

We fine-tune the pre-trained ALM on two datasets: WMT14 English-German machine translation and IWSLT14 German-English machine translation. The WMT14 English-German dataset has 4.5 million sentence pairs for training; newsdev2014 is used as the validation set, while newstest2014 is the testing set. The IWSLT14 German-English dataset contains 160 thousand sentence pairs collected from TED talks; we use the iwslt14 dev set as the validation set and the iwslt14 test set as the testing set.

We build a Transformer decoder conditioned on the ALM encoder. We feed the source language into ALM and generate the target language with the decoder. From the pre-trained model, we reload the word embedding and encoder parameters into our in-house NMT code, and the word embeddings are also used to initialize the decoder. We evaluate the performance of the translated sentences; the evaluation metric is BLEU (Papineni et al. 2002).

Baselines

We compare our method with state-of-the-art supervised methods and pre-training methods, which are described as follows:
• Transformer (Vaswani et al. 2017): We implement the Transformer model with our in-house TensorFlow code, and the experimental settings are the same as in Vaswani et al. (2017).
• ConvS2S (Gehring et al. 2017): We report the results from the paper of the convolutional sequence-to-sequence model (ConvS2S).
• Weighted Transformer (Ahmed, Keskar, and Socher 2017): It uses self-attention branches in place of multi-head attention. The branches replace the multiple heads in the attention mechanism of the original Transformer network.
• Layer-wise Transformer (He et al. 2018): It explicitly coordinates the learning of hidden representations of the encoder and decoder, gradually from low level to high level.
• RNMT+ (Chen et al. 2018): It combines the advantages of both the recurrent structure and the Transformer architecture.
• LightConv and DynamicConv (Wu et al. 2019): LightConv uses a lightweight convolution which performs competitively with the best reported self-attention results. Furthermore, the authors introduce dynamic convolutions (DynamicConv), which are simpler and more efficient than self-attention.
• Multilingual BERT (Devlin et al. 2019): Multilingual BERT (mBERT) extends the BERT model to different languages. We download the pre-trained model provided by the authors and fine-tune it on the machine translation datasets.
• XLM (Lample and Conneau 2019): We use the released code (https://github.com/facebookresearch/XLM) and the pre-trained data provided by XLM, and further fine-tune the pre-trained model on the corresponding data.
• MASS (Song et al. 2019): We conduct experiments with the code provided by the authors. We set the masked fragment length k to 50% of the total number of tokens in the sentence.

Details

We fine-tune our ALM with the Adam optimizer (Kingma and Ba 2015) and a linear warmup (Vaswani et al. 2017). We tune the learning rates based on the performance on the validation set; the learning rates are 5 × 10^−4 for IWSLT14 German-English and 10^−3 for WMT14 English-German. We use the averaged perplexity over all languages as the criterion for early stopping. The batch size is set to 8192 tokens for all experiments. During decoding, we set the beam size to 8.

Results

To prove the effectiveness of ALM, we perform experiments on the English-German and German-English
translation tasks. Table 1 and Table 2 show that our ALM achieves significant improvements over baselines both without pre-training and with pre-training.

En → De BLEU(%)
Transformer (Vaswani et al. 2017) 28.40
ConvS2S (Gehring et al. 2017) 25.16
Weighted Transformer (Ahmed, Keskar, and Socher 2017) 28.90
Layer-wise Transformer (He et al. 2018) 29.01
RNMT+ (Chen et al. 2018) 28.50
mBERT (Devlin et al. 2019) 28.64
MASS (Song et al. 2019) 28.92
XLM (Lample and Conneau 2019) 28.88
ALM (this work) 29.22

Table 1: Results on WMT14 English-German machine translation task.

De → En BLEU(%)
Transformer (Vaswani et al. 2017) 34.49
LightConv (Wu et al. 2019) 34.80
DynamicConv (Wu et al. 2019) 35.20
Advsoft (Wang, Gong, and Liu 2019) 35.18
Layer-wise Transformer (He et al. 2018) 35.07
mBERT (Devlin et al. 2019) 34.82
MASS (Song et al. 2019) 35.14
XLM (Lample and Conneau 2019) 35.22
ALM (this work) 35.53

Table 2: Results on IWSLT14 German-English machine translation task.

In Table 1, we report the performance of ALM and the baseline models on the WMT14 English-German machine translation dataset. Transformer is an important baseline, and it obtains a 28.40 BLEU score. We also compare ALM with the convolutional baseline ConvS2S, which achieves 25.16. Weighted Transformer and Layer-wise Transformer are two methods that improve the Transformer model, and they get 28.90 and 29.01 BLEU, respectively. RNMT+ combines the recurrent structure and the multi-head attention components, which yields 28.50 BLEU. Our ALM significantly outperforms these baseline models. We also compare our model with three state-of-the-art pre-training models. mBERT and MASS are unsupervised pre-training models; they achieve 28.64 and 28.92 BLEU, respectively. XLM is a mixture of unsupervised and supervised pre-training, achieving 28.88 BLEU. Our ALM reaches a 29.22 BLEU score, yielding improvements of +0.58, +0.30, and +0.34 BLEU over mBERT, MASS, and XLM, respectively.

In Table 2, we report the performance of ALM and the baseline models on the IWSLT14 German-English machine translation dataset. We first compare our ALM with the supervised models without pre-training. Transformer and its variant Layer-wise Transformer achieve 34.49 and 35.07 BLEU. The convolution-based models, LightConv and DynamicConv, achieve 34.80 and 35.20, respectively. Advsoft gets a BLEU score of 35.18. ALM outperforms these baselines, achieving 35.53 BLEU. We also compare ALM with three pre-training baselines. Our ALM obtains the best performance and reaches a 35.53 BLEU score on this task, outperforming the previous baselines mBERT, MASS, and XLM by +0.71, +0.39, and +0.31 BLEU, respectively.

In general, our ALM achieves significant improvements over all baseline models on the two translation tasks. As our method pre-trains the encoder on a large-scale cross-lingual corpus, the word representations and the encoder can acquire sufficient cross-lingual information. For example, a target phrase can see both its source and target context. This cross-lingual context is helpful for target word generation and for understanding the source sentence in a cross-lingual way.

Fine-Tuning on Cross-Lingual Classification

We fine-tune the pre-trained ALM model on the XNLI dataset to evaluate the effectiveness of our model. We build a linear classifier on top of the pre-trained ALM to project the first hidden state of the ALM output into the probabilities of each class. We concatenate the premise and hypothesis and feed them into ALM. We evaluate the performance of the fine-tuned model on the 15 XNLI languages. Following previous work (Lample and Conneau 2019), we evaluate the model in three different settings: "TRANSLATE-TRAIN", "TRANSLATE-TEST", and "CROSS-LINGUAL TEST". The evaluation metric is the accuracy of the predicted NLI class.

Baselines

We compare our method with three strong baselines, including a supervised method without pre-training and two pre-training methods:
• Conneau: Conneau et al. (2018) propose a BiLSTM model to set up a baseline for XNLI. We report the scores directly from their paper.
• Multilingual BERT (Devlin et al. 2019): Multilingual BERT (mBERT) extends the BERT model to different languages and is also a strong baseline.
• XLM (Lample and Conneau 2019): XLM is the state-of-the-art model for cross-lingual pre-training. We report the results of XLM directly from their paper.

Details

We fine-tune our ALM with the Adam optimizer (Kingma and Ba 2015) with β1 = 0.9 and β2 = 0.997. We tune the learning rate based on the performance on the validation set, and the learning rate is set to 5 × 10^−6. We set the batch size to 24, and we limit sentences to 256 tokens. We set a dropout rate of 0.15 on the last layer. We evaluate our model every 1000 sentences.

Results

Table 3 shows the experimental results of our proposed ALM and the baseline models. Following the work of XNLI (Conneau et al. 2018), we evaluate these models in three different settings: "TRANSLATE-TRAIN", "TRANSLATE-TEST", and "CROSS-LINGUAL TEST". In the "TRANSLATE-TRAIN" setting, the training set of the English MultiNLI dataset is translated into each XNLI language except English.
en fr es de el bg ru tr ar vi th zh hi sw ur avg.
Machine translation baselines (TRANSLATE-TRAIN)
Conneau (Conneau et al. 2018) 73.7 68.3 68.8 66.5 66.4 67.4 66.5 64.5 65.8 66.0 62.8 67.0 62.1 58.2 56.6 65.4
mBERT (Devlin et al. 2019) 81.9 - 77.8 75.9 - - - - 70.7 - - 76.6 - - 61.6 -
XLM (Lample and Conneau 2019) 85.0 80.2 80.8 80.3 78.1 79.3 78.1 74.7 76.5 76.6 75.5 78.6 72.3 70.9 63.2 76.7
ALM (this work) 85.2 81.1 82.0 82.3 78.3 79.8 78.4 74.9 76.7 76.8 75.6 78.7 72.5 71.5 63.4 77.2
Machine translation baselines (TRANSLATE-TEST)
Conneau (Conneau et al. 2018) 73.7 70.4 70.7 68.7 69.1 70.4 67.8 66.3 66.8 66.5 64.4 68.3 64.2 61.8 59.3 67.2
mBERT (Devlin et al. 2019) 81.4 - 74.9 74.4 - - - - 70.4 - - 70.1 - - 62.1 -
XLM (Lample and Conneau 2019) 85.0 79.0 79.5 78.1 77.8 77.6 75.5 73.7 73.7 70.8 70.4 73.6 69.0 64.7 65.1 74.2
ALM (this work) 85.2 79.1 80.0 78.4 78.0 77.8 77.1 73.9 74.2 71.2 70.5 73.8 69.2 64.8 65.3 74.6
Evaluation of cross-lingual sentence encoders (CROSS-LINGUAL TEST)
Conneau (Conneau et al. 2018) 73.7 67.7 68.7 67.7 68.9 67.9 65.4 64.2 64.8 66.4 64.1 65.8 64.1 55.7 58.4 65.6
mBERT (Devlin et al. 2019) 81.4 - 74.3 70.5 - - - - 62.1 - - 63.8 - - 58.3 -
XLM (Lample and Conneau 2019) 85.0 78.7 78.9 77.8 76.6 77.4 75.3 72.5 73.1 76.1 73.2 76.5 69.6 68.4 67.3 75.1
ALM (this work) 85.2 79.3 79.2 78.0 76.7 78.1 76.5 73.0 73.2 76.4 73.5 78.6 69.8 69.0 66.8 75.6
Table 3: Cross-lingual natural language inference (XNLI) test accuracy for the 15 languages.
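As a rough sketch of the XNLI fine-tuning head described above (a linear classifier with dropout applied to the first hidden state of the last encoder layer), the following PyTorch snippet assumes a generic pre-trained encoder module; the class name, the wrapper structure, and the default arguments are our own, not the released implementation.

import torch
from torch import nn

class ALMClassificationHead(nn.Module):
    """Linear classifier with dropout on the first hidden state of the last layer.

    `encoder` is any module mapping token ids to hidden states of shape
    (batch, seq_len, hidden); it stands in for the pre-trained ALM encoder.
    """
    def __init__(self, encoder, hidden_size=1024, num_classes=3, dropout=0.15):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids):
        hidden = self.encoder(input_ids)      # (batch, seq_len, hidden)
        first_state = hidden[:, 0, :]         # first hidden state of the last layer
        return self.classifier(self.dropout(first_state))   # class logits

# Usage sketch: `pair_ids` are the BPE ids of the concatenated premise and
# hypothesis; fine-tuning would minimize cross-entropy on the NLI label.
# model = ALMClassificationHead(pretrained_alm_encoder)
# loss = nn.CrossEntropyLoss()(model(pair_ids), labels)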
We then fine-tune the models on the translated training sets. In the "TRANSLATE-TEST" setting, we translate the testing set of each XNLI language into English and evaluate the performance of the models on each translated testing set. In the "CROSS-LINGUAL TEST" setting, we fine-tune the models on the English XNLI training set and evaluate the performance directly on each testing set. We compare our model with Conneau's baseline model, mBERT, and XLM in these three settings.

In the "CROSS-LINGUAL TEST" setting, our ALM significantly outperforms the baseline models. More precisely, ALM obtains 75.6% accuracy on average, while Conneau's baseline achieves 65.6% accuracy and XLM gets 75.1%. On Russian and Turkish, we outperform the baselines by 1.2% and 0.5%, respectively. ALM gets 85.2% accuracy on the English testing set, outperforming Conneau's baseline model by 11.5%, mBERT by 3.8%, and XLM by 0.2% in terms of accuracy.

In the "TRANSLATE-TRAIN" setting, our ALM reaches 77.2% accuracy on average across the different languages, which indicates that ALM can be fine-tuned for any language to achieve good performance. On German and French, we outperform the baselines by 2.0% and 1.9%, respectively. Besides, our ALM achieves higher accuracy than XLM in all 15 languages.

In the "TRANSLATE-TEST" setting, our ALM obtains 74.6% average accuracy, while Conneau's baseline achieves 67.2% accuracy and XLM gets 74.2%. In general, our ALM outperforms these three baselines across the different experimental settings.

Discussions and Analysis

We further analyze the advantages of our pre-trained model. We visualize the distribution of our model's word embeddings and compare it with that of the Transformer baseline model. We also evaluate the performance of our ALM given different amounts of parallel data, in order to analyze the benefits of pre-training in the low-resource setting.

Word Embedding Distribution

Figure 4 shows the word embedding distributions of Transformer (without pre-training) and ALM (with pre-training). We project the learned word embeddings from the high-dimensional space to 2 dimensions with PCA. We plot both the Chinese word embeddings and the English word embeddings in the same space. The hollow circles denote Chinese words, while the solid circles denote English words.

For the Transformer baseline, the distribution of the Chinese word embeddings is very different from that of the English word embeddings. We draw a dashed line to illustrate the separation of the Chinese word embeddings from the English word embeddings.

For the pre-trained ALM, the distribution of the Chinese word embeddings is similar to that of the English word embeddings. The reason is that we mix Chinese words and English words during training, so the embeddings of both the source language and the target language can distribute in the same space.

Figure 4 also indicates that source words and their translated target words are closer to each other than in the Transformer baseline model. There are some cases which are very close to each other in ALM's embedding space but far from each other in the Transformer's embedding space. We conclude that the ALM pre-training method can narrow the gap between the embeddings of the source language and the target language, which is beneficial for the cross-lingual setting.

Low Resource Setting

We would like to further analyze the performance of our pre-trained ALM given different sizes of parallel data. Therefore, we randomly shuffle the full parallel training set of the IWSLT14 German-to-English translation task. Then, we extract a random K% of the samples as the fine-tuning parallel data, with K = {10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%}, and compare our ALM with the Transformer baseline model. Figure 5 shows the BLEU scores of our models and the baseline. When the parallel data size is small,
ALM can outperform the Transformer model by a large margin. As the amount of parallel data increases, the margin narrows because of the upper bound of the model capacity. We conclude that ALM pre-training can benefit the performance of the Transformer model, especially when the training samples are not sufficient.

[Figure 4: 2-dimensional PCA projections of the word embeddings; panel (a) shows the Transformer baseline and the other panel shows ALM; the legend distinguishes Chinese tokens from English tokens.]

Figure 5: Results of ALM vs Transformer fine-tuning on low-resource data. (Axes: BLEU against the ratio of parallel data, 10% to 100%, for ALM and the Transformer baseline.)

Related Work

Pre-training and transfer learning are widely used in many tasks of natural language processing. ELMo (Peters et al. 2018) is proposed as a kind of deep contextualized word representation that is pre-trained on a large-scale corpus and can be transferred to other tasks. Universal Language Model Fine-tuning (ULMFiT) (Howard and Ruder 2018) is an effective transfer learning method that can be applied to any task in NLP, and includes techniques that are key for fine-tuning a language model. BERT (Devlin et al. 2019) achieves state-of-the-art performance among various pre-training approaches to monolingual NLP tasks. Furthermore, XLM and MASS (Song et al. 2019) achieve further success in language understanding by pre-training. Unlike BERT, which pre-trains only the encoder or the decoder, MASS is carefully designed to pre-train the encoder and decoder jointly: the input tokens of a sentence fragment are masked on the encoder side, and the decoder predicts the masked tokens of that fragment.

Conclusions

In this work, we propose a novel cross-lingual pre-training method, called Alternating Language Modeling (ALM). First, we randomly substitute source phrases with their target equivalents to create code-switched sentences. Then, we use these code-switched data to train the ALM model to learn to predict words of different languages. We evaluate our pre-trained ALM on the downstream tasks of machine translation and cross-lingual classification. Experiments show that ALM can outperform the previous pre-training methods on three benchmark datasets. In future work, we will explore the effect of using code-switched sentences for a MASS-like pre-training method.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Grant Nos. U1636211, 61672081, 61370126), the Beijing Advanced Innovation Center for Imaging Technology (No. BAICIT-2016001), and the Fund of the State Key Laboratory of Software Development Environment (No. SKLSDE-2019ZX-17).

References

Ahmed, K.; Keskar, N. S.; and Socher, R. 2017. Weighted transformer network for machine translation. CoRR abs/1711.02132.
Chen, M. X.; Firat, O.; Bapna, A.; Johnson, M.; Macherey, W.; Foster, G.; Jones, L.; Schuster, M.; Shazeer, N.; Parmar, N.; Vaswani, A.; Uszkoreit, J.; Kaiser, L.; Chen, Z.; Wu, Y.; and Hughes, M. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL 2018, 76–86.
Conneau, A.; Rinott, R.; Lample, G.; Williams, A.; Bowman, S. R.; Schwenk, H.; and Stoyanov, V. 2018. XNLI: Evaluating cross-lingual sentence representations. In EMNLP 2018, 2475–2485.
Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL 2019, 4171–4186.
Dong, L.; Yang, N.; Wang, W.; Wei, F.; Liu, X.; Wang, Y.; Gao, J.; Zhou, M.; and Hon, H. 2019. Unified language model pre-training for natural language understanding and generation. CoRR abs/1905.03197.
Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. In ICML 2017, 1243–1252.
He, T.; Tan, X.; Xia, Y.; He, D.; Qin, T.; Chen, Z.; and Liu, T. 2018. Layer-wise coordination between encoder and decoder for neural machine translation. In NeurIPS 2018, 7955–7965.
Howard, J., and Ruder, S. 2018. Universal language model fine-tuning for text classification. In ACL 2018, 328–339.
Kingma, D. P., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR 2015.
Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; Dyer, C.; Bojar, O.; Constantin, A.; and Herbst, E. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007, 177–180.
Koehn, P.; Och, F. J.; and Marcu, D. 2003. Statistical phrase-based translation. In NAACL 2003, 48–54.
Kunchukuttan, A.; Mehta, P.; and Bhattacharyya, P. 2018. The IIT Bombay English-Hindi parallel corpus. In LREC 2018, 3473–3476.
Lample, G., and Conneau, A. 2019. Cross-lingual language model pretraining. CoRR abs/1901.07291.
Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
Min, S.; Seo, M. J.; and Hajishirzi, H. 2017. Question answering through transfer learning from large fine-grained supervision data. In ACL 2017, 510–517.
Och, F. J., and Ney, H. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.
Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL 2002, 311–318.
Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL 2018, 2227–2237.
Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
Salant, S., and Berant, J. 2018. Contextualized word representations for reading comprehension. In NAACL 2018, 554–559.
Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In ACL 2016, 1715–1725.
Song, K.; Tan, X.; Qin, T.; Lu, J.; and Liu, T. 2019. MASS: Masked sequence to sequence pre-training for language generation. In ICML 2019, 5926–5936.
Tay, Y.; Luu, A. T.; and Hui, S. C. 2018. Compare, compress and propagate: Enhancing neural architectures with alignment factorization for natural language inference. In EMNLP 2018, 1565–1575.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In NIPS 2017, 5998–6008.
Wang, D.; Gong, C.; and Liu, Q. 2019. Improving neural language modeling via adversarial training. In ICML 2019, 6555–6565.
Wu, F.; Fan, A.; Baevski, A.; Dauphin, Y. N.; and Auli, M. 2019. Pay less attention with lightweight and dynamic convolutions. In ICLR 2019.
Yang, W.; Xie, Y.; Lin, A.; Li, X.; Tan, L.; Xiong, K.; Li, M.; and Lin, J. 2019a. End-to-end open-domain question answering with BERTserini. In NAACL 2019, 72–77.
Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J. G.; Salakhutdinov, R.; and Le, Q. V. 2019b. XLNet: Generalized autoregressive pretraining for language understanding. CoRR abs/1906.08237.
Yu, S.; Indurthi, S. R.; Back, S.; and Lee, H. 2018. A multi-stage memory augmented neural network for machine reading comprehension. In ACL 2018, 21–30.
Ziemski, M.; Junczys-Dowmunt, M.; and Pouliquen, B. 2016. The United Nations parallel corpus v1.0. In LREC 2016, 3530–3534.