Multilinguality
1. Collection
2. Initial Cleaning
● Goal: Perform basic preprocessing and filtering.
● Steps:
○ Language Identification: Detect and isolate data in the desired language(s).
○ Threshold-Based Filtering: Remove low-quality content based on predefined
thresholds (e.g., document length or language probability).
○ Multi-Language Documents: Handle or segregate documents that mix languages (see the sketch after this list).
3. Deduplication
4. Filtering
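A minimal sketch of the initial-cleaning step (step 2 above), assuming a pluggable language-identification function; any real detector (for example, a fastText LID model) can be passed in, and the length and probability thresholds are illustrative defaults rather than tuned values.

```python
def clean_documents(docs, wanted_langs, detect_language,
                    min_chars=200, min_lang_prob=0.80):
    """Basic preprocessing: drop very short documents, keep only confident
    predictions for the desired languages, and set aside likely
    multi-language documents for separate handling.

    `detect_language(text)` must return a (language_code, probability) pair;
    it is a stand-in for a real language-ID model.
    """
    kept = {lang: [] for lang in wanted_langs}
    mixed_or_rejected = []
    for doc in docs:
        if len(doc) < min_chars:            # threshold-based filtering: too short
            continue
        lang, prob = detect_language(doc)
        if lang in wanted_langs and prob >= min_lang_prob:
            kept[lang].append(doc)          # confident, desired language
        else:
            # low-confidence predictions often indicate mixed-language documents
            mixed_or_rejected.append(doc)
    return kept, mixed_or_rejected
```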
Tokenization
● BPE (Byte-Pair Encoding)
● SentencePiece
● WordPiece
● ByT5 (byte-level)
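To make the subword idea concrete, below is a self-contained sketch of the core BPE merge loop on a toy corpus; production tokenizers (SentencePiece, WordPiece, byte-level schemes) add normalization, byte fallback, and efficient training, all of which are omitted here.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the vocabulary, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(vocab, pair):
    """Rewrite every word, replacing occurrences of `pair` with the merged symbol."""
    a, b = pair
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word frequencies, each word split into characters plus an end-of-word marker.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = {tuple(word) + ("</w>",): freq for word, freq in corpus.items()}

merges = []
for _ in range(10):                       # learn 10 merge rules
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)      # most frequent adjacent pair
    merges.append(best)
    vocab = merge_pair(vocab, best)

print(merges)                             # learned merges, in order of application
```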
Monolingual Corpora
1. Temperature Sampling
● Formula: q_i = p_i^α / Σ_j p_j^α, where p_i = n_i / Σ_k n_k is language i's share of the training data (n_i = number of tokens or documents in language i) and α ∈ [0, 1] controls the smoothing.
● How It Works:
○ Adjusts the sampling probabilities to balance high-resource and low-resource
languages.
○ Upsamples low-resource languages (increases their probability).
○ Downsamples high-resource languages (reduces their probability).
● Impact:
○ By modifying α:
■ α = 0 → uniform sampling across languages
■ α = 1 → sampling proportional to each language's data size
■ In practice, a value between 0 and 1 is chosen to trade off the two.
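A small sketch of the formula above in code; the token counts are made up purely for illustration.

```python
def temperature_sample_probs(token_counts, alpha):
    """q_i = p_i**alpha / sum_j p_j**alpha, where p_i is language i's data share."""
    total = sum(token_counts.values())
    p = {lang: n / total for lang, n in token_counts.items()}
    z = sum(pi ** alpha for pi in p.values())
    return {lang: (pi ** alpha) / z for lang, pi in p.items()}

# Illustrative (made-up) corpus sizes in tokens.
counts = {"en": 1_000_000_000, "fr": 100_000_000, "sw": 1_000_000}

print(temperature_sample_probs(counts, alpha=1.0))  # proportional to data size
print(temperature_sample_probs(counts, alpha=0.3))  # low-resource languages upsampled
print(temperature_sample_probs(counts, alpha=0.0))  # uniform over languages
```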
2. UniMax Sampling
● How It Works:
○ Allocates the sampling budget as uniformly as possible.
○ Starts with the lowest-resource languages, ensuring their inclusion, and then
gradually allocates remaining budget to higher-resource languages.
● Advantages:
○ Tends to outperform temperature sampling because even the lowest-resource
languages receive a sufficient share of the sampling budget.
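A sketch in the spirit of UniMax, assuming a fixed total token budget and a cap on how many passes (epochs) any single language's data may contribute; the exact procedure in the UniMax paper differs in its details.

```python
def unimax_allocation(token_counts, total_budget, max_epochs=4):
    """Allocate the training budget as uniformly as possible across languages,
    starting from the lowest-resource language and capping each language at
    `max_epochs` passes over its own data. Returns sampling probabilities."""
    langs = sorted(token_counts, key=token_counts.get)   # smallest corpus first
    remaining_budget = total_budget
    remaining_langs = len(langs)
    budget = {}
    for lang in langs:
        uniform_share = remaining_budget / remaining_langs
        # a language cannot contribute more than max_epochs passes over its data
        budget[lang] = min(uniform_share, max_epochs * token_counts[lang])
        remaining_budget -= budget[lang]
        remaining_langs -= 1
    total = sum(budget.values())
    return {lang: b / total for lang, b in budget.items()}

counts = {"en": 1_000_000_000, "fr": 100_000_000, "sw": 1_000_000}   # made-up sizes
print(unimax_allocation(counts, total_budget=500_000_000))
```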
Bitext Corpora
Bitext corpora are datasets containing parallel sentences in two languages (e.g., English-French,
English-Chinese). The techniques described here balance sampling across language pairs and define cross-lingual training objectives over them.
● Focuses on non-English languages when English is the central pivot language in bitext
corpora.
● Adjusts sampling probabilities to ensure fair representation of these non-English languages
relative to English.
● Formula: q_(en,l) = p_(en,l)^α / Σ_l' p_(en,l')^α, where p_(en,l) is the fraction of parallel sentences whose non-English side is language l; this mirrors the monolingual temperature-sampling formula, applied over language pairs.
● How It Works:
○ Balances sampling probabilities for language pairs (e.g., English-French,
English-Spanish).
○ Prevents overrepresentation of high-resource language pairs and
underrepresentation of low-resource pairs.
Cross-Lingual Masked Word Prediction
● Objective:
○ Train the model to align word-level representations across different languages.
○ The task forces the model to recover a word in one language given its aligned
counterpart in another language.
● How It Works:
○ A word from a sentence in the source language (e.g., English) is masked.
○ The corresponding sentence in the target language (e.g., Spanish) is provided as
input alongside the source sentence.
○ The model predicts the masked word in the source language using contextual
information from both languages.
● Purpose:
○ Encourages the model to build a shared embedding space for words with similar
meanings across languages.
● Example:
○ Input: Source: "The cat is [MASK]" (English), Target: "El gato está en la alfombra"
(Spanish).
○ Output: Predict the masked word: "on."
Cross-Lingual Paraphrase Detection
● Objective:
○ Determine whether two sentences in different languages are paraphrases (i.e.,
convey the same meaning).
● How It Works:
○ Sentence pairs in different languages are passed to the model.
○ The model encodes both sentences and predicts whether they are paraphrases
(similar meaning) or not.
● Purpose:
○ Improves the model's ability to align sentence-level semantics across languages.
● Example:
○ Input:
■ Sentence 1: "The dog is barking loudly" (English).
■ Sentence 2: "El perro está ladrando en voz alta" (Spanish).
○ Output: Yes, they are paraphrases.
● Training Data:
○ Uses parallel datasets or translation pairs that provide semantically equivalent
sentences in different languages.
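One common way to construct training examples for this task from a parallel corpus is sketched below (an illustrative construction, not taken from a specific paper): aligned sentence pairs become positives, and randomly re-paired sentences become negatives.

```python
import random

def make_paraphrase_pairs(bitext, seed=0):
    """Build (sentence_1, sentence_2, label) examples from a parallel corpus:
    aligned pairs are positives (label 1), randomly re-paired sentences are
    negatives (label 0)."""
    rng = random.Random(seed)
    examples = []
    targets = [tgt for _, tgt in bitext]
    for src, tgt in bitext:
        examples.append((src, tgt, 1))               # aligned -> paraphrase
        negative = rng.choice(targets)
        if negative != tgt:
            examples.append((src, negative, 0))      # mismatched -> not a paraphrase
    rng.shuffle(examples)
    return examples

bitext = [
    ("The dog is barking loudly", "El perro está ladrando en voz alta"),
    ("The cat is on the rug", "El gato está en la alfombra"),
]
print(make_paraphrase_pairs(bitext))
```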
Cross-Lingual Masked Language Modeling
● Objective:
○ Extend the traditional Masked Language Modeling (MLM) task to leverage
cross-lingual contexts for filling in masked tokens.
● How It Works:
○ Mask random tokens in a sentence from the source language.
○ Provide additional context from a parallel sentence in the target language.
○ The model predicts the masked tokens in the source language.
● Purpose:
○ Strengthens the model's ability to use cross-lingual context for word prediction.
○ Encourages shared contextual understanding across languages.
● Example:
○ Input:
■ Sentence in English: "The [MASK] is barking loudly."
■ Parallel sentence in Spanish: "El perro está ladrando en voz alta."
○ Output: Predict the masked word: "dog."
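A minimal sketch of how such a masked cross-lingual example could be built from a parallel sentence pair; the whitespace tokenization, the [MASK]/[SEP] symbols, and the masking rate are simplifying assumptions.

```python
import random

MASK = "[MASK]"
SEP = "[SEP]"

def make_cross_lingual_mlm_example(src_sentence, tgt_sentence,
                                   mask_prob=0.15, seed=0):
    """Mask random tokens in the source sentence and append the (unmasked)
    parallel target sentence, so the model can use both languages as context.
    Returns (input_tokens, labels), where labels hold the original masked tokens."""
    rng = random.Random(seed)
    src_tokens = src_sentence.split()      # toy whitespace tokenization
    tgt_tokens = tgt_sentence.split()
    inputs, labels = [], []
    for tok in src_tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)             # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)            # not a prediction target
    inputs = inputs + [SEP] + tgt_tokens   # parallel sentence as extra context
    labels = labels + [None] * (1 + len(tgt_tokens))
    return inputs, labels

print(make_cross_lingual_mlm_example("The dog is barking loudly",
                                     "El perro está ladrando en voz alta",
                                     mask_prob=0.3))
```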
Difficulties in Fully Multilingual Learning
Multilingual models aim to learn and support multiple languages within a single model. However,
this approach comes with significant challenges, particularly when dealing with a mix of
high-resource and low-resource languages. Let’s break this down:
1. Curse of Multilinguality
● Problem:
○ For a fixed-sized model, the per-language capacity decreases as the number of
supported languages increases.
○ This means:
■ The model must distribute its limited parameters across all languages.
■ As more languages are added, the model's ability to represent and learn
each language diminishes.
● Example:
○ A model trained on 10 languages has more capacity per language compared to one
trained on 100 languages.
○ Result: The model's performance on individual languages, especially low-resource
ones, can degrade.
2. Interference with High-Resource Languages
● Problem:
○ Adding more low-resource languages to the training set can negatively impact the
performance of high-resource languages.
○ The model reallocates capacity and focuses on balancing all languages, which can:
■ Hurt high-resource language translation quality.
■ Spread the model too thin across both high-resource and low-resource
languages.
● Example:
○ Adding languages like Quechua or Tigrinya (low-resource) to a model that already
supports English, French, and German can reduce the quality of English-to-French or
English-to-German translations.
Adaptive Data Balancing
This approach to training multilingual neural machine translation (NMT) models dynamically adjusts
the sampling distribution of the training data for each language to optimize model performance
across all languages. Here's a breakdown of its components:
1. Training Data:
○ D^train_1, D^train_2, …, D^train_n are the training datasets for each of the n languages.
○ These datasets differ in size, with high-resource languages having significantly more
data than low-resource ones.
2. Development Data:
○ D^dev_1, D^dev_2, …, D^dev_n are the development (validation) sets for each language.
○ Used to evaluate model performance during training.
3. Scorer (ψ_t):
○ A scoring mechanism that determines the importance or weight of each language
dataset at time t.
○ Generates the sampling distribution which specifies how frequently data from each
language should be sampled.
4. Model:
○ The multilingual NMT model being trained.
How It Works
Core Principles
1. Dynamic Sampling:
○ Instead of using fixed sampling rates, the model learns to prioritize languages
dynamically based on their contribution to improving dev set performance.
2. Gradient Alignment:
○ Languages whose training gradients align with the dev set gradients are likely to
improve generalization and are given higher weights.
3. Fair Representation:
○ Low-resource languages, which tend to have less training data, can be upweighted to
ensure they are adequately represented during training.
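A simplified sketch of the gradient-alignment idea behind the scorer ψ_t, in the spirit of differentiable data selection methods (the update rule in the original work differs): a language's score increases when its training gradient points in the same direction as the dev-set gradient, and the scores are renormalized into the next sampling distribution.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def update_scorer(psi, train_grads, dev_grad, lr=0.1):
    """One scorer update step.

    psi         : unnormalized per-language scores (shape: n_languages)
    train_grads : list of flattened model gradients, one per language's batch
    dev_grad    : flattened gradient on the combined dev sets
    Returns (new_psi, sampling_distribution).
    """
    # Reward each language by the cosine similarity between its training
    # gradient and the dev gradient: alignment suggests better generalization.
    rewards = np.array([
        float(g @ dev_grad) / (np.linalg.norm(g) * np.linalg.norm(dev_grad) + 1e-8)
        for g in train_grads
    ])
    new_psi = psi + lr * rewards
    return new_psi, softmax(new_psi)

# Toy example: 3 languages, 4-dimensional "gradients".
rng = np.random.default_rng(0)
psi = np.zeros(3)
train_grads = [rng.normal(size=4) for _ in range(3)]
dev_grad = rng.normal(size=4)
psi, sampling_probs = update_scorer(psi, train_grads, dev_grad)
print(sampling_probs)   # languages aligned with the dev gradient are upweighted
```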