Data-Efficient Multilingual Learning
Figure 1: Top: Today, improving the performance of a model on multilingual target data is a three-step process. First, one would identify the target languages. Next, one would collect data to label either in these languages or in closely related transfer languages, based on annotator availability. Finally, one would fine-tune a model on the labelled data. However, Step 1 excludes 98% of the world's languages, Step 2 is constrained by annotator availability or linguistic feature information, and Step 3, factoring in the specific model to be fine-tuned, remains largely unaccounted for. Bottom: With our end-to-end framework DeMuX, we prescribe the exact data to label from a vast pool of multilingual source data that provides the best transfer to the target, for a given model.
of unlabelled source data. Through iterations of model training, data acquisition and human annotation, the goal is to achieve satisfactory performance on a target test set while labelling only a small fraction of the data. Past works (Chaudhary et al., 2019; Kumar et al., 2022; Moniz et al., 2022) have leveraged AL in the special case where the same language(s) constitute the source and target set (Step 2 (upper branch): Figure 1). However, none so far have considered the case of source and target languages having unknown degrees of overlap; a far more pervasive problem for real-world applications that commonly build classifiers on multi-domain data (Dredze and Crammer, 2008). From the AL lens, this is particularly challenging since conventional strategies of choosing the most uncertain samples (Settles, 2009) could pick distracting examples from very dissimilar language distributions (Longpre et al., 2022). Our strategies are designed to deal with this distribution shift by leveraging small amounts of unlabelled data in the target languages.

In the rest of the paper, we first describe three AL strategies based on the principles of a) semantic similarity with the target; b) uncertainty; and c) a combination of the two, which picks uncertain points in the target points' local neighborhood (§3). We experiment with tasks of varying complexity, categorized based on their label structure: token-level (NER and POS), sequence-level (NLI), and question answering (QA). We test our strategies in a zero-shot setting across three MultiLMs and five target language configurations, for a budget of 10,000 examples acquired in five AL rounds (§4).

We find that our strategies outperform previous baselines in most cases, including those with multilingual target sets. The extent varies based on the budget, the task, the languages and the models (§5). Overall, we observe that the hybrid strategy performs best for token-level tasks, while picking globally uncertain points gains precedence for NLI and QA. To test the applicability of DeMuX in resource-constrained settings, we experiment with lower budgets ranging from 5-1000 examples, acquired in a single AL round. In this setting, we observe gains of up to 8-11 F1 points for token-level tasks, and 2-5 F1 for complex tasks like NLI and QA. For NLI, our strategies surpass the gold standard of fine-tuning in the target languages, while being zero-shot.
2 Notation

Assume that we have a set of source languages, $L_s = \{l_{s_1}, \ldots, l_{s_n}\}$, and a set of target languages, $L_t = \{l_{t_1}, \ldots, l_{t_m}\}$. $L_s$ and $L_t$ are assumed to have unknown degrees of overlap. Further, let us denote the corpus of unlabelled source data as $X_s = \{x_s^1, \ldots, x_s^N\}$ and the unlabelled target data as $X_t = \{x_t^1, \ldots, x_t^M\}$.

Our objective is to label a total budget of $B$ data points over $K$ AL rounds from the source data. The number of points to select in each round is then $b = B/K$. Thus, considering the super-set $S^b = \{X \subset X_s : |X| = b\}$ of all $b$-sized subsets of $X_s$, our objective is to select some $X^* \in S^b$ according to an appropriate criterion.
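To make this setup concrete, the sketch below shows the round-based loop the notation implies. It is a minimal illustration rather than the authors' released implementation, and `select_batch`, `annotate`, and `finetune` are hypothetical placeholders for the selection criterion, the human labelling step, and model training.

```python
# Minimal sketch of the K-round selection loop implied by the notation above.
# B is the total annotation budget; b = B / K points are labelled per round.
# `select_batch`, `annotate`, and `finetune` are illustrative placeholders.

def active_learning_loop(model, X_s, X_t, select_batch, annotate, finetune,
                         B=10_000, K=5):
    b = B // K                          # points labelled in each AL round
    labelled, pool = [], list(X_s)      # growing labelled set, shrinking source pool
    for _ in range(K):
        # Score the remaining pool with the current model and pick a b-sized
        # subset X* according to the chosen criterion (Section 3).
        chosen = select_batch(model, pool, X_t, b)
        labelled.extend(annotate(chosen))               # human annotation step
        pool = [x for x in pool if x not in chosen]     # remove newly labelled points
        model = finetune(model, labelled)               # continue fine-tuning
    return model
```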
3 Annotation Strategies

Based on the broad categorization of AL methods by Zhang et al. (2022), we design three annotation strategies that are either representation-based, information-based, or hybrid. The first picks instances that capture the diversity of the dataset; the second picks the most uncertain points, which are informative for learning a robust decision boundary; and the third focuses on optimally combining both criteria. In contrast to the standard AL setup, there are two added complexities in our framework: a) source-target domain mismatch; and b) multiple distributions, one for each of our target languages. We therefore design our measures to select samples that are semantically similar (from the perspective of the MultiLM) to the target domain (Longpre et al., 2022).

All strategies build upon reliable distance and uncertainty measures, whose implementation varies with the type of task, i.e., whether the task is token-level, sequence-level, or question answering. A detailed visualization of how these are calculated can be found in §A.1. Below, we formally describe the three strategies, also detailing the motivation behind our choices.

3.1 AVERAGE-DIST

AVERAGE-DIST constructs the set $X^*$ such that it minimizes the average distance of the selected points from $X_t$ under an embedding function $f : X \rightarrow \mathbb{R}^d$ defined by the MultiLM. This is a representation-based strategy that picks points lying close to the unlabelled target pool (McCallum et al., 1998; Settles and Craven, 2008). The source points chosen are informative since they are prototypical of the target data in the representation space (Figure 2a). Especially for low degrees of overlap between the source and target data distributions, this criterion can ignore uninformative source points. Formally,

$$X^* = \underset{X \in S^b}{\operatorname{argmin}} \sum_{x_s \in X} d_t(x_s), \quad \text{where} \quad d_t(x) = \frac{1}{|X_t|} \sum_{x_t^j \in X_t} \lVert f(x) - f(x_t^j) \rVert$$

For all task types, we use the embeddings of tokens fed into the final classifier to represent the whole sequence. For NLI and QA, this is the [CLS] token embedding. For token-level tasks, we compute the mean of the initial sub-word token embeddings for each word, as this is the input provided to the classifier to determine the word-level tag.
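Assuming precomputed sequence embeddings (the [CLS] vector for NLI/QA, or per-word means of first-subword embeddings for token-level tasks, as described above), the AVERAGE-DIST selection can be sketched as follows. This is an illustrative NumPy version, not the paper's code.

```python
import numpy as np

def average_dist_select(src_emb: np.ndarray, tgt_emb: np.ndarray, b: int) -> np.ndarray:
    """Return indices of the b source points with the smallest average
    Euclidean distance to the unlabelled target embeddings (AVERAGE-DIST).

    src_emb: (N, d) array of f(x_s) for the unlabelled source pool
    tgt_emb: (M, d) array of f(x_t) for the unlabelled target pool
    """
    # Pairwise distances; for large pools this would be chunked or approximated.
    diffs = src_emb[:, None, :] - tgt_emb[None, :, :]   # (N, M, d)
    dists = np.linalg.norm(diffs, axis=-1)              # (N, M)
    d_t = dists.mean(axis=1)                            # average distance per source point
    return np.argsort(d_t)[:b]                          # b smallest averages minimise the sum
```

Because the objective decomposes over individual points, the subset minimising the summed criterion is simply the b points with the smallest per-point average distances.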
(a) AVERAGE-DIST (b) UNCERTAINTY (c) KNN-UNCERTAINTY
Figure 2: Visualization of datapoints selected using strategies detailed in Section 3, for a three-class sequence
classification task (XNLI). AVERAGE-DIST selects points (dark blue) at a minimum average distance from the
target (pink); UNCERTAINTY selects most uncertain points lying at the decision boundary of two classes, and
KNN-UNCERTAINTY selects uncertain points in the target neighborhood.
3.2 UNCERTAINTY

Uncertainty sampling (Lewis, 1995) improves annotation efficiency by choosing points that the model would potentially misclassify in the current AL iteration. The uncertainty measures for each task type are given below.

Sequence-level: We use margin sampling (Scheffer et al., 2001; Schein and Ungar, 2007), which selects points having the least difference between the model's probabilities for the top-two classes. We compute the output probability distribution for all unlabeled samples in $X_s$ and select the samples with the smallest margin. Formally,

$$X^* = \underset{X \in S^b}{\operatorname{argmin}} \sum_{x_s \in X} P_\Delta(x_s), \quad \text{where} \quad P_\Delta(x) = p_{c_1}(x) - p_{c_2}(x)$$

Here, $p_{c_1}(x)$ and $p_{c_2}(x)$ are the predicted probabilities of the top-two classes for an unlabeled sample $x$.

Token-level: For token-level tasks, we first compute the margin (as described above) for each token in the sequence. Then, we assign the minimum margin across all tokens as the sequence margin score and construct $X^*$ with the sequences having the least score. Formally,

$$X^* = \underset{X \in S^b}{\operatorname{argmin}} \sum_{x_s \in X} \text{MARGIN-MIN}(x_s), \quad \text{where} \quad \text{MARGIN-MIN}(x) = \min_{i=1}^{|x|} \left( p^i_{c_1}(x) - p^i_{c_2}(x) \right)$$

Question Answering: The QA task we investigate involves extracting the answer span from a relevant context for a given question. This is achieved by selecting the tokens with the highest start and end probabilities as the boundaries, and predicting the tokens within this range as the answer. Hence, samples having the lowest start and end probabilities qualify as most uncertain. Formally,

$$X^* = \underset{X \in S^b}{\operatorname{argmin}} \sum_{x_s \in X} \text{SUM-PROB}(x_s), \quad \text{where} \quad \text{SUM-PROB}(x) = \max_{i=1}^{|x|} \log p^i_s(x) + \max_{i=1}^{|x|} \log p^i_e(x)$$

Above, $|x|$ denotes the sequence length of the unlabeled sample $x$, and $p^i_s(x)$ and $p^i_e(x)$ represent the predicted probabilities of token $i$ being the start and end index, respectively.
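The three task-specific scores reduce to a few lines each. The snippet below is a hedged sketch, assuming softmax class probabilities for classification tasks and per-token start/end probabilities for QA; smaller scores mean higher uncertainty, so the b lowest-scoring examples would be selected.

```python
import numpy as np

def sequence_margin(class_probs: np.ndarray) -> float:
    """P_Delta: gap between the top-two class probabilities of one example."""
    top2 = np.sort(class_probs)[-2:]
    return float(top2[1] - top2[0])

def token_min_margin(token_class_probs: np.ndarray) -> float:
    """MARGIN-MIN: smallest per-token margin over a (seq_len, n_classes) array."""
    sorted_probs = np.sort(token_class_probs, axis=-1)
    margins = sorted_probs[:, -1] - sorted_probs[:, -2]
    return float(margins.min())

def qa_sum_prob(start_probs: np.ndarray, end_probs: np.ndarray) -> float:
    """SUM-PROB: max log start-probability plus max log end-probability."""
    return float(np.log(start_probs).max() + np.log(end_probs).max())
```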
3.3 KNN-UNCERTAINTY

As standalone measures, both the distance- and uncertainty-based criteria have shortcomings. When there is little overlap between source and target, choosing source points based on UNCERTAINTY alone leads to selecting data that are uninformative for the target. When there is a high degree of overlap between source and target, the AVERAGE-DIST metric tends to produce a highly concentrated set of points (Figure 2a) – even if the model is accurate in that region of representation space – resulting in minimal coverage of the target set.

To design a strategy that combines the strengths of both distance and uncertainty, we first measure how well a target point's uncertainty correlates with that of its neighborhood. We calculate the Pearson correlation coefficient (ρ) (Pearson, 1903) between the uncertainty of a target point in $X_t$ and the average uncertainty of its top-$k$ neighbors in $X_s$. We observe a statistically significant ρ value > 0.7 for all tasks. A natural conclusion is that decreasing the uncertainty of a target point's neighborhood would decrease the uncertainty of the target point itself. Hence, we first select the top-$k$ neighbors for each $x_t \in X_t$. Next, we choose the most uncertain points from these neighbors until we reach $b$ data points. Formally, until $|X^*| = b$:

$$X^* = \underset{\{X \subset N^k_t \,:\, |X| = b\}}{\operatorname{argmax}} \sum_{x_s \in X} U(x_s), \quad \text{where} \quad N^k_t = \bigcup_{j=1}^{|X_t|} k\text{-NEARESTNEIGHBORS}(x_t^j, X_s)$$

Above, $U(x_s)$ represents the uncertainty of the source point, calculated as in §3.2.
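Under the same assumptions as the earlier snippets (precomputed embeddings and a per-example uncertainty score, e.g. a negated margin so that larger means more uncertain), a brute-force sketch of this hybrid selection could look as follows; an approximate nearest-neighbour index would replace the full distance matrix at scale.

```python
import numpy as np

def knn_uncertainty_select(src_emb, tgt_emb, src_uncertainty, b, k=10):
    """KNN-UNCERTAINTY: take the union of every target point's k nearest source
    neighbours (N_t^k), then keep the b most uncertain points from that union."""
    # Distances from each target point to every source point: shape (M, N).
    dists = np.linalg.norm(tgt_emb[:, None, :] - src_emb[None, :, :], axis=-1)
    neighbour_ids = np.argsort(dists, axis=1)[:, :k]   # k nearest sources per target
    candidates = np.unique(neighbour_ids)               # N_t^k, the candidate set
    # Rank candidates by uncertainty (larger = more uncertain) and keep the top b.
    ranked = candidates[np.argsort(-src_uncertainty[candidates])]
    return ranked[:b]
```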
4 Experimental Setup

Our setup design aims to address the following:
Q1) Does DeMuX benefit tasks with varying complexity? Which strategies work well across different task types? (§4.1)
Q2) How well does DeMuX perform across a varied set of target languages? Can it benefit multilingual target pools as well? (§4.2)
Task Type | Task | Dataset | Languages (two-letter ISO code)
Token-level | Part-of-Speech Tagging (POS) | Universal Dependencies v2.5 (Nivre et al., 2020) | tl, af, ru, nl, it, de, es, bg, pt, fr, te, et, el, fi, hu, mr, kk, hi, tr, eu, id, fa, ur, he, ar, ta, vi, ko, th, zh, yo, ja
Token-level | Named Entity Recognition (NER) | WikiAnn (Rahimi et al., 2019) | nl, pt, bg, it, fr, hu, es, el, vi, fi, et, af, bn, de, tr, tl, hi, ka, sw, ru, mr, ml, jv, fa, eu, ko, ta, ms, he, ur, kk, te, my, ar, id, yo, zh, ja, th
Sequence-level | Natural Language Inference (NLI) | XNLI (Conneau et al., 2018) | es, bg, de, fr, el, vi, ru, zh, tr, th, ar, hi, ur, sw
Question Answering | QA | TyDiQA (Clark et al., 2020) | id, fi, te, ar, ru, sw, bn, ko

Table 1: Tasks and Datasets: DeMuX is applied across tasks of varying complexity, as elucidated in Q1: §4.
Table 2: Target language configurations. We run five experiments for each model and task, with the language sets
above as targets (details in §4.2). All languages mentioned in Table 1 make up the source set, except the chosen
target languages for a particular configuration.
Q3) How do the benefits of DeMuX vary across different MultiLMs? (§4.3)

4.1 Task and Dataset Selection

We have three distinct task types, based on the label format. We remove duplicates from each dataset to prevent selecting multiple copies of the same instance. Dataset details can be found in Table 1.

4.2 Source and Target Language Selection

We experiment with the zero-shot case of disjoint source and target languages, i.e., the unlabelled source pool contains no data from the target languages. The train and validation splits constitute the unlabelled source and target data, respectively. Evaluation is done on the test split of each target language. With Q2) in mind, we experiment with five target settings (Table 2):

Single-target: We partition languages into three equal tiers based on zero-shot performance after fine-tuning on English: high-performing (HP), mid-performing (MP) and low-performing (LP), and choose one language from each, guided by two factors. First, we select languages that are common across multiple datasets, to study how data selection for the same language varies across tasks. From these, we choose languages that have similarities with the source set across different linguistic dimensions (obtained using lang2vec (Littell et al., 2017)), to study the role of typological similarity for different tasks.

Multi-target: Here, we envision two scenarios:
Geo: Mid-to-low performing languages in geographical proximity are chosen. From an application perspective, this would allow one to improve a MultiLM for an entire geographical area.
LPP: All low-performing languages are pooled, to test whether we can collectively enhance the MultiLM's performance across all of them.

4.3 Model Selection

We test DeMuX across multiple MultiLMs: XLM-R (Conneau et al., 2019), InfoXLM (Chi et al., 2020), and RemBERT (Chung et al., 2020). All models have a similar number of parameters (∼550M-600M) and support 100+ languages. XLM-R is trained on monolingual corpora from CC-100 (Conneau et al., 2019), InfoXLM is trained to maximize mutual information between multilingual texts, and RemBERT is a deeper model that reallocates input embedding parameters to the Transformer layers.
4.4 Baselines

We include a number of baselines to compare our strategies against:
1) RANDOM: In each round, a random subset of b data points from $X_s$ is selected.
2) EGALITARIAN: An equal number of randomly selected data points from the unlabeled pool of each language, i.e., $b/|L_s|$ points per language, is chosen (a short sketch follows this list). Debnath et al. (2021) demonstrate that this outperforms a diverse set of alternatives.
3) LITMUS: LITMUS (Srinivasan et al., 2022) is a tool that makes performance projections for a fine-tuned model, but it can also be used to generate data labeling plans based on the predictor's projections. We only run this for XLM-R, since the tool requires past fine-tuning performance profiles and XLM-R is supported by default.
4) GOLD: This involves training on data from the target languages themselves. Given that all other strategies are zero-shot, we expect GOLD to outperform them and help determine an upper bound on performance.
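For reference, the per-language allocation of the EGALITARIAN baseline is straightforward to sketch; this is illustrative only, and the pool format (language, example) is an assumption.

```python
import random
from collections import defaultdict

def egalitarian_sample(pool, b, seed=42):
    """Pick roughly b / |L_s| random examples per source language.
    `pool` is assumed to be a list of (language, example) pairs."""
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for lang, example in pool:
        by_lang[lang].append(example)
    per_lang = b // len(by_lang)          # equal share of the round budget per language
    chosen = []
    for examples in by_lang.values():
        chosen.extend(rng.sample(examples, min(per_lang, len(examples))))
    return chosen
```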
4.5 Fine-tuning Details

We first fine-tune all MultiLMs on English (EN-FT) and then continue fine-tuning on the data selected using DeMuX, similar to Lauscher et al. (2020); Kumar et al. (2022). We experiment with a budget of 10,000 examples acquired in five AL rounds, except for TyDiQA, where our budget is 5,000 examples.

Method | HP | MP | LP | Geo | LPP
XLM-R EN-FT | 80.0 | 79.5 | 65.6 | 61.0 | 45.8
XLM-R GOLD | 90.1 | 92.8 | 94.5 | 81.2 | 73.7
XLM-R BASE_egal | 85.4 | 87.6 | 84.0 | 80.6 | 62.8
XLM-R DeMuX_knn | 87.8 | 89.2 | 85.8 | 82.4 | 62.3
XLM-R ∆base | 2.4 | 1.6 | 1.8 | 1.8 | -0.5
InfoXLM EN-FT | 80.5 | 82.8 | 65.4 | 64.2 | 44.8
InfoXLM GOLD | 90.0 | 92.8 | 94.6 | 83.5 | 74.9

Method | HP | MP | LP | Geo | LPP
XLM-R GOLD | 95.6 | 81.2 | 93.2 | 91.8 | 88.2
XLM-R BASE_egal | 87.1 | 79.6 | 88.4 | 85.7 | 68.9
XLM-R DeMuX_knn | 87.5 | 80.1 | 90.1 | 86.1 | 70.9
XLM-R ∆base | 0.4 | 0.5 | 1.7 | 0.4 | 2.0
InfoXLM EN-FT | 79.6 | 74.0 | 59.0 | 73.6 | 58.2
InfoXLM GOLD | 95.7 | 81.4 | 93.3 | 92.0 | 88.7
InfoXLM BASE_egal | 88.0 | 79.4 | 88.8 | 86.3 | 67.8
InfoXLM DeMuX_knn | 87.8 | 79.5 | 90.4 | 86.0 | 66.8
InfoXLM ∆base | -0.3 | 0.1 | 1.6 | -0.3 | -1.0
RemBERT EN-FT | 72.9 | 71.1 | 50.6 | 66.1 | 55.7
RemBERT DeMuX_knn | 87.4 | 77.7 | 88.2 | 84.2 | 68.0
RemBERT ∆base | 0.5 | -0.3 | 2.4 | 0.4 | 0.2

Table 4: UDPOS Results (F1): We observe modest gains for a 10k budget, but higher gains for lower budgets (§6).

Method | HP | MP | LP | Geo | LPP
XLM-R GOLD | 81.6 | 79.5 | 70.3 | 81.6 | 76.0
XLM-R BASE_egal | 81.6 | 78.8 | 73.0 | 80.9 | 75.6
XLM-R DeMuX_avg | 83.7 | 79.9 | 75.3 | 82.2 | 77.1
XLM-R ∆base | 2.1 | 1.1 | 2.3 | 1.3 | 1.5
XLM-R ∆gold | 2.1 | 0.4 | 5.0 | 0.6 | 1.1
InfoXLM EN-FT | 81.9 | 77.3 | 68.8 | 79.8 | 71.5
InfoXLM GOLD | 83.6 | 80.6 | 73.7 | 82.4 | 77.7
InfoXLM BASE_egal | 83.7 | 79.8 | 74.6 | 81.5 | 77.3
InfoXLM DeMuX_avg | 84.8 | 80.8 | 75.9 | 83.1 | 77.8
InfoXLM ∆base | 1.1 | 1.0 | 1.3 | 1.6 | 0.5
RemBERT GOLD | 81.1 | 73.3 | 63.1 | 76.0 | 67.5
RemBERT BASE_egal | 80.0 | 75.3 | 63.9 | 76.4 | 67.9
RemBERT DeMuX_avg | 81.7 | 76.1 | 67.6 | 78.6 | 70.9
RemBERT ∆base | 1.7 | 0.8 | 3.7 | 2.2 | 3.0
RemBERT ∆gold | 0.6 | 2.8 | 4.5 | 2.6 | 3.4

Table 5: XNLI Results (F1): Here we even surpass the gold standard of in-language fine-tuning. Details in §5.

[...] 1% absolute delta from the best-performing baseline.

How does DeMuX fare on multilingual target pools? We observe consistent gains for multilingual target pools as well (Geo and LPP). We believe this is enabled by the language-independent design of our strategies, which makes annotation decisions at a per-instance level. This has important consequences, since it would enable researchers, like those at Company Y, to build better models for all the languages that they care about.

Does the model select data from the same languages across tasks? No! We find that the selected data distributions vary across tasks for the same target languages. For example, when the target language is Urdu, DeMuX chooses 70-80% of samples from Hindi for NLI and POS, but prioritizes Farsi and Arabic (35-45%) for NER. Despite Hindi and Urdu's syntactic, genetic, and phonological similarities as per lang2vec, their differing scripts underscore the significance of script similarity for NER transfer. This also shows that analysing the data selected by DeMuX can offer linguistic insights into the learned task-specific representations.
[Figure 4 panels: performance (F1) versus annotation budget (5 to 1,000 examples), including XNLI: Urdu and TyDiQA: Arabic, for the gold, egalitarian, knn, average, and uncertain strategies.]
Figure 4: Multiple budgets, one AL round: We experiment with low budgets acquired using the EN-FT model. We observe gains of up to 8-11 F1 over baselines for 5-100 examples, with a trend of diminishing gains for larger budgets. All runs are averaged across three seeds (2, 22, 42).
Method | HP | MP | LP | Geo | LPP
XLM-R GOLD | 81.2 | 83.8 | 83.7 | 84.7 | 81.0
XLM-R BASE_egal | 79.9 | 81.7 | 79.6 | 81.1 | 78.7
XLM-R DeMuX_unc | 80.8 | 82.9 | 80.3 | 81.0 | 77.8
XLM-R ∆base | 0.9 | 1.2 | 0.7 | -0.1 | -0.9
InfoXLM EN-FT | 77.6 | 75.4 | 82.2 | 81.9 | 78.8
InfoXLM GOLD | 78.4 | 80.1 | 86.7 | 84.4 | 80.5
InfoXLM BASE_egal | 81.3 | 78.9 | 82.8 | 76.5 | 75.3
InfoXLM DeMuX_unc | 82.7 | 80.2 | 80.6 | 78.0 | 76.1
InfoXLM ∆base | 1.4 | 1.3 | -2.2 | 1.5 | 0.8

Table 6: TyDiQA Results (F1): UNCERTAINTY works best here. Despite TyDiQA being composed of typologically diverse languages and being extremely small (35-40k samples), we observe modest gains across multiple configs.

Do the selected datapoints matter, or does following the language distribution suffice? DeMuX not only identifies transfer languages but also selects specific data for labeling. To evaluate its importance, we establish the language distribution of the data selected using DeMuX and randomly select datapoints following this distribution. Despite [...] selection in identified transfer languages is vital.

7 Related Work

Multilingual Fine-tuning: Traditionally, models [...] two research directions. The first emphasizes the significance of using few-shot target language data (Lauscher et al., 2020) and the development of strategies for optimal few-shot selection (Kumar et al., 2022; Moniz et al., 2022). The second focuses on choosing the best source languages for a target, based on linguistic features (Lin et al., 2019) or past model performances (Srinivasan et al., 2022). Discerning a globally optimal transfer language, however, has been largely ambiguous (Pelloni et al., 2022), and the language providing the highest empirical transfer is at times inexplicable by known linguistic relatedness criteria (Pelloni et al., 2022; Turc et al., 2021). By making decisions at a data-instance level rather than a language level, DeMuX removes the reliance on linguistic features and sidesteps ambiguous consensus on how MultiLMs learn cross-lingual
relations, while prescribing domain-relevant instances to label.

Active Learning for NLP: AL has seen wide adoption in NLP, being applied to tasks like text classification (Karlos et al., 2012; Li et al., 2013), named entity recognition (Shen et al., 2017; Wei et al., 2019; Erdmann et al., 2019), and machine translation (Miura et al., 2016; Zhao et al., 2020), among others. In the multilingual context, past works (Moniz et al., 2022; Kumar et al., 2022; Chaudhary et al., 2019) have applied AL to selectively label data in target languages. However, they do not consider cases with unknown overlap between source and target languages. This situation, similar to a multi-domain AL setting, is challenging, as data selected from the source languages may not prove beneficial for the target (Longpre et al., 2022).

8 Conclusion

In this work, we introduce DeMuX, an end-to-end framework that selects data to label from vast pools of unlabelled multilingual data, under an annotation budget. DeMuX's design is language-agnostic, making it viable for cases where source and target data do not overlap. We design three strategies drawing from AL principles that encompass semantic similarity with the target, uncertainty, and a hybrid combination of the two. Our strategies outperform strong baselines for 84% of target language configurations (including multilingual target sets) in the zero-shot case of disjoint source and target languages, across three models and four tasks: NER, UDPOS, NLI and QA. We find that semantic similarity with the target mostly benefits token-level tasks, while picking uncertain points gains precedence for complex tasks like NLI and QA. We further analyse DeMuX's applicability in low-budget settings and observe gains of up to 8-11 F1 points for some tasks, with a trend of diminishing gains for larger budgets. We hope that our work helps improve the capabilities of MultiLMs for desired languages in a cost-efficient way.

9 Limitations

With DeMuX's wider applicability across languages come a few limitations, as we detail below:

[...] inference on all of the source data, which can be time-consuming. However, one can run parallel CPU inference, which greatly reduces latency.

A Priori Model Selection: We require knowing the model a priori, which might mean a different labeling scheme for different models. This is a trade-off we choose in pursuit of better performance for the chosen model, but it may not be the most feasible solution for all users.

Refinement of the hybrid approach: Our hybrid strategy picks the most uncertain points in the neighborhood of the target. However, its current design prioritizes semantic similarity with the target over global uncertainty, since we first pick the top-k neighbors and then prune this set based on uncertainty. It would be interesting to experiment with choosing globally uncertain points first and then pruning the set based on target similarity. For NLI and QA, we observe that globally uncertain points help for higher budgets, while choosing nearest neighbors helps most for lower budgets. Therefore, this alternative may work better for these tasks, and is something we look to explore in future work.

10 Acknowledgements

We thank members of the Neulab and COMEDY for their invaluable feedback on a draft of this paper. This work was supported in part by grants from the National Science Foundation (No. 2040926), Google, Two Sigma, Defence Science and Technology Agency (DSTA) Singapore, and DSO National Laboratories Singapore.

References

Aditi Chaudhary, Jiateng Xie, Zaid Sheikh, Graham Neubig, and Jaime G. Carbonell. 2019. A little annotation does a lot of good: A study in bootstrapping low-resource named entity recognizers. arXiv preprint arXiv:1908.08983.

Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Heyan Huang, and Ming Zhou. 2020. InfoXLM: An information-theoretic framework for cross-lingual language model pre-training. arXiv preprint arXiv:2007.07834.
Figure 5: An overview of how distance and uncertainty are measured in our setup. A1 , A2 , A3 denote three words
in Sentence A that are tokenized into 2, 3, and 1 subword, respectively.
Table 7: Number of unlabelled target examples used in each configuration. This is the size of the validation set.
Table 9: LITMUS prescribed annotation budget: LITMUS prescribes how many samples to select from each
language. We select a random sample of data following the prescribed annotation.
Dataset | Strategy | Round 1 | Round 2 | Round 3 | Round 4 | Round 5
PAN-X | SR | 81.1 | 81.9 | 84.1 | 85.1 | 84.0
PAN-X | DEMUX | 83.2 | 83.1 | 84.1 | 85.8 | 85.2
PAN-X | ∆ | 2.0 | 1.2 | 0.0 | 0.7 | 1.2
UDPOS | SR | 89.9 | 89.3 | 89.5 | 90.0 | 89.8
UDPOS | DEMUX | 89.7 | 89.8 | 89.9 | 89.5 | 90.1
UDPOS | ∆ | -0.2 | 0.5 | 0.5 | -0.4 | 0.3
XNLI | SR | 73.3 | 73.8 | 73.8 | 73.8 | 73.9
XNLI | DEMUX | 74.5 | 74.7 | 75.5 | 75.3 | 75.3
XNLI | ∆ | 1.2 | 0.9 | 1.6 | 1.5 | 1.4
TyDiQA | SR | 80.6 | 81.5 | 81.5 | 82.0 | 81.7
TyDiQA | DEMUX | 82.8 | 83.2 | 83.2 | 83.5 | 83.8
TyDiQA | ∆ | 2.2 | 1.7 | 1.8 | 1.5 | 2.1