Multilinguality

Multilingual language models

● Encoder-only models - XLM, XLM-R
● Decoder-only models - GPT-3.5, GPT-4
● Encoder-decoder models - mBART, mT5

Challenges with data


● Quantity
○ Large gaps in data quantity across languages.
○ 57 languages each account for less than 0.001% of available data.
○ Similar gaps exist across domains.
● Quality
○ Incorrect language identification (poor-quality text and closely related languages are often confused)
○ Machine-generated data
○ Limited tools available for identifying toxic/adult content
● Sourcing and governance
○ Initiatives by government agencies:
■ Establishing frameworks to support data sharing and governance.
■ Identifying key roles such as data custodians, rights-holders, and other stakeholders involved in data governance.
■ Ensuring governance accounts for the privacy of data subjects, intellectual property rights, and user rights related to data and algorithm usage.
■ Prioritizing local knowledge and the expression of guiding values in governance frameworks.

Summary of Data Preprocessing Workflow

1. Collection

● Goal: Gather raw data for processing.


● Steps:
○ Downloading: Acquire data from online or offline sources.
○ Text Extraction: Extract textual content from collected data (e.g., web pages,
documents).
○ Simple Deduplication: Remove duplicate data using URL-based filtering.

2. Initial Cleaning
● Goal: Perform basic preprocessing and filtering.
● Steps:
○ Language Identification: Detect and isolate data in the desired language(s) (see the sketch after this list).
○ Threshold-Based Filtering: Remove low-quality content based on predefined
thresholds (e.g., document length or language probability).
○ Multi-Language Documents: Handle or segregate multi-language data.
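A minimal sketch of the language-identification and threshold-based filtering steps above, assuming the fastText lid.176.bin language-ID model has been downloaded; the length and probability thresholds are illustrative placeholders, not values from these notes.

```python
# Minimal sketch of initial cleaning: language ID + threshold-based filtering.
# Thresholds below are illustrative, not prescribed by the notes.
import fasttext

LID_MODEL = fasttext.load_model("lid.176.bin")  # assumes the model file is present
MIN_CHARS = 200        # drop very short documents
MIN_LANG_PROB = 0.65   # drop documents with uncertain language ID

def keep_document(text: str, wanted_lang: str = "sw") -> bool:
    """Return True if the document passes basic length and language filters."""
    text = text.strip()
    if len(text) < MIN_CHARS:
        return False
    # fastText predicts one line at a time, so flatten newlines first.
    labels, probs = LID_MODEL.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return lang == wanted_lang and probs[0] >= MIN_LANG_PROB

docs = ["...raw extracted document one...", "...raw extracted document two..."]
cleaned = [d for d in docs if keep_document(d, wanted_lang="sw")]
```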

3. Deduplication

● Goal: Eliminate redundant data.


● Methods:
○ Exact Substring-Based Deduplication: Match exact substrings (e.g., mC4, OSCAR v*,
CC100).
○ Fuzzy MinHash-Based Deduplication: Identify near-duplicate content using MinHash
(e.g., used in GPT-3, The PILE).
○ Combined Approaches: Use both exact and fuzzy matching (e.g., Refined Web).
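A minimal sketch of the fuzzy MinHash-based deduplication described above, using the datasketch library; the character-shingle size and the 0.8 Jaccard threshold are illustrative assumptions, not the settings used by GPT-3 or The Pile.

```python
# Minimal sketch of fuzzy near-duplicate removal with MinHash + LSH.
# Shingle size and similarity threshold are illustrative assumptions.
from datasketch import MinHash, MinHashLSH

def minhash_of(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    """Build a MinHash signature over character 5-gram shingles."""
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        m.update(text[i:i + shingle].encode("utf8"))
    return m

corpus = ["the cat sat on the mat", "the cat sat on the mat!", "something unrelated"]
lsh = MinHashLSH(threshold=0.8, num_perm=128)  # ~0.8 Jaccard similarity cutoff
kept = []
for doc_id, text in enumerate(corpus):
    sig = minhash_of(text)
    if lsh.query(sig):            # a near-duplicate is already in the index
        continue
    lsh.insert(str(doc_id), sig)  # otherwise index it and keep the document
    kept.append(text)
```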

4. Filtering

● Goal: Ensure high-quality and safe datasets.


● Steps:
○ Heuristic-Based Filtering: Remove low-quality data using rule-based methods (e.g.,
Refined Web).
○ Model-Based Filtering: Use machine learning models for quality assessment (e.g.,
CC-Net, CC100).
○ Content-Based Filtering:
■ Exclude sensitive content using filters for NSFW URLs or personally
identifiable information (PII).
■ Perform line-based or document-based filtering for granular control.
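A minimal sketch of the content-based filtering described above: a hypothetical NSFW-domain blocklist plus simple regexes for emails and phone numbers, applied line by line. The blocklist and patterns are illustrative only; production filters are far broader.

```python
# Minimal sketch of content-based filtering: URL blocklist + simple PII scrubbing.
# Blocklist and regexes are illustrative assumptions, not a production filter.
import re
from urllib.parse import urlparse

NSFW_DOMAINS = {"example-adult-site.com"}          # hypothetical blocklist
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def passes_url_filter(url: str) -> bool:
    """Document-level filter: drop pages from blocked domains."""
    return urlparse(url).netloc.lower() not in NSFW_DOMAINS

def scrub_pii_lines(document: str) -> str:
    """Line-based filter: drop lines containing likely emails or phone numbers."""
    kept_lines = [
        line for line in document.splitlines()
        if not EMAIL_RE.search(line) and not PHONE_RE.search(line)
    ]
    return "\n".join(kept_lines)
```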

Tokenization

● BPE
● SentencePiece
● WordPiece
● ByT5 (byte-level tokenization)
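As an illustration of the tokenizers listed above, a minimal SentencePiece sketch; the file names, vocabulary size, and BPE model type are assumptions made for the example.

```python
# Minimal sketch: train and apply a SentencePiece BPE tokenizer.
# File names, vocab size, and model_type are illustrative assumptions.
import sentencepiece as spm

# Train on a (multilingual) plain-text corpus, one sentence per line.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="multilingual_bpe",
    vocab_size=32000,
    model_type="bpe",           # SentencePiece's default is "unigram"
    character_coverage=0.9995,  # high coverage helps scripts with large alphabets
)

sp = spm.SentencePieceProcessor(model_file="multilingual_bpe.model")
print(sp.encode("El gato está en la alfombra", out_type=str))  # subword pieces
print(sp.encode("The cat is on the mat", out_type=int))        # token ids
```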

Data sources for training


● Monolingual corpora - mBERT, mT5
● Bitext corpora
○ English-centric - XLM, mBART
○ X-Y directions - XY-LENT

Sampling Techniques

Monolingual Corpora

1. Temperature Sampling

● Formula: q_i = p_i^α / Σ_j p_j^α, where p_i = n_i / Σ_k n_k is the fraction of the training corpus in language i

● How It Works:
○ Adjusts the sampling probabilities to balance high-resource and low-resource
languages.
○ Upsamples low-resource languages (increases their probability).
○ Downsamples high-resource languages (reduces their probability).
● Impact:
○ The exponent α controls the trade-off:
■ α = 0 -> uniform sampling across languages
■ α = 1 -> sampling proportional to corpus size
■ In practice, a value between the two (commonly around 0.3-0.7) is used so that low-resource languages are upsampled without starving high-resource ones (see the sketch below).
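A minimal sketch of temperature sampling over per-language token counts; the counts are made up for illustration, and α = 0.3 is just one commonly used setting.

```python
# Minimal sketch of temperature sampling: q_i proportional to p_i**alpha.
# Token counts are made up; alpha = 0.3 is one commonly used setting.
def temperature_sampling(counts: dict[str, float], alpha: float = 0.3) -> dict[str, float]:
    total = sum(counts.values())
    p = {lang: n / total for lang, n in counts.items()}       # corpus proportions
    unnorm = {lang: pi ** alpha for lang, pi in p.items()}    # temperature-scaled
    z = sum(unnorm.values())
    return {lang: q / z for lang, q in unnorm.items()}        # sampling distribution

counts = {"en": 3_000_000_000, "hi": 50_000_000, "sw": 2_000_000}
for a in (0.0, 0.3, 1.0):
    print(a, temperature_sampling(counts, alpha=a))
# alpha = 0 gives the uniform distribution; alpha = 1 reproduces corpus proportions.
```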

2. Unimax Sampling

● How It Works:
○ Allocates the sampling budget as uniformly as possible.
○ Starts with the lowest-resource languages, ensuring their inclusion, and then
gradually allocates remaining budget to higher-resource languages.
● Advantages:
○ Provides better performance compared to temperature sampling by ensuring that
even the smallest language resources get sampled sufficiently.
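A minimal sketch of UniMax-style budget allocation as described above, assuming a fixed total token budget and a cap on how many epochs a small corpus may be repeated; the numbers are illustrative and the published procedure differs in details.

```python
# Minimal sketch of UniMax-style allocation: spread a token budget as uniformly as
# possible, capping how often small corpora are repeated. Numbers are illustrative.
def unimax_allocation(counts: dict[str, int], budget: int, max_epochs: float = 4.0) -> dict[str, int]:
    alloc = {}
    remaining_budget = budget
    # Visit languages from the smallest corpus to the largest.
    for i, (lang, n) in enumerate(sorted(counts.items(), key=lambda kv: kv[1])):
        languages_left = len(counts) - i
        uniform_share = remaining_budget / languages_left
        alloc[lang] = int(min(uniform_share, n * max_epochs))  # cap repeats of small corpora
        remaining_budget -= alloc[lang]                        # leftover goes to larger languages
    return alloc

counts = {"en": 3_000_000_000, "hi": 50_000_000, "sw": 2_000_000}
print(unimax_allocation(counts, budget=300_000_000))
```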

Bitext Corpora

Bitext corpora are datasets containing parallel sentences in two languages (e.g., English-French,
English-Chinese). The techniques described here aim to balance the sampling for language pairs.

1. Temperature Sampling (English-Centric)

● Focuses on non-English languages when English is the central pivot language in bitext
corpora.
● Adjusts sampling probabilities to ensure fair representation of these non-English languages
relative to English.

2. Temperature Sampling (X-Y Directions)

● Formula: q_(x,y) = p_(x,y)^α / Σ_(x',y') p_(x',y')^α, where p_(x,y) is the fraction of parallel sentence pairs in translation direction x→y

● How It Works:
○ Balances sampling probabilities for language pairs (e.g., English-French,
English-Spanish).
○ Prevents overrepresentation of high-resource language pairs and
underrepresentation of low-resource pairs.
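The same temperature formula applies at the pair level by keying counts on translation directions rather than single languages; a short self-contained sketch (pair counts and α are illustrative):

```python
# Pair-level temperature sampling: q_(x,y) proportional to p_(x,y)**alpha.
pair_counts = {"en-fr": 40_000_000, "en-zh": 25_000_000, "sw-fr": 150_000}
alpha, total = 0.3, sum(pair_counts.values())
weights = {d: (n / total) ** alpha for d, n in pair_counts.items()}
z = sum(weights.values())
print({d: w / z for d, w in weights.items()})  # sampling distribution over directions
```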

Simple Multilingual Modeling

● It is possible to learn a single model that handles several languages


● Multilingual Input: The same network can process inputs in different languages directly.
● Multilingual Output: Add a tag or prompt indicating the target language for generation (see the sketch below).
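As a concrete illustration of tagging the target language for generation, a sketch using the mBART-50 checkpoint from Hugging Face Transformers; the model name and language codes are one common choice, not something prescribed by these notes, and the same idea applies to any tagged multilingual generator.

```python
# Sketch: multilingual generation by tagging source and target languages (mBART-50 style).
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

tokenizer.src_lang = "en_XX"                          # tag the source language
inputs = tokenizer("The cat is on the mat", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["es_XX"],  # tag the target language
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```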

Understanding Cross-Lingual Tasks in Unicoder
The Unicoder model is a universal language encoder trained using various cross-lingual tasks to learn
shared representations for multiple languages. Let’s break down the specific tasks mentioned:

1. Cross-Lingual Word Recovery

● Objective:
○ Train the model to align word-level representations across different languages.
○ The task forces the model to recover a word in one language given its aligned
counterpart in another language.
● How It Works:
○ A word from a sentence in the source language (e.g., English) is masked.
○ The corresponding sentence in the target language (e.g., Spanish) is provided as
input alongside the source sentence.
○ The model predicts the masked word in the source language using contextual
information from both languages.
● Purpose:
○ Encourages the model to build a shared embedding space for words with similar
meanings across languages.
● Example:
○ Input: Source: "The cat is [MASK] the mat" (English), Target: "El gato está en la
alfombra" (Spanish).
○ Output: Predict the masked word: "on."

2. Cross-Lingual Paraphrase Classification

● Objective:
○ Determine whether two sentences in different languages are paraphrases (i.e.,
convey the same meaning).
● How It Works:
○ Sentence pairs in different languages are passed to the model.
○ The model encodes both sentences and predicts whether they are paraphrases
(similar meaning) or not.
● Purpose:
○ Improves the model's ability to align sentence-level semantics across languages.
● Example:
○ Input:
■ Sentence 1: "The dog is barking loudly" (English).
■ Sentence 2: "El perro está ladrando en voz alta" (Spanish).
○ Output: Yes, they are paraphrases.
● Training Data:
○ Uses parallel datasets or translation pairs that provide semantically equivalent
sentences in different languages.

3. Cross-Lingual Masked Language Modeling (MLM)

● Objective:
○ Extend the traditional Masked Language Modeling (MLM) task to leverage
cross-lingual contexts for filling in masked tokens.
● How It Works:
○ Mask random tokens in a sentence from the source language.
○ Provide additional context from a parallel sentence in the target language.
○ The model predicts the masked tokens in the source language.
● Purpose:
○ Strengthens the model's ability to use cross-lingual context for word prediction.
○ Encourages shared contextual understanding across languages.
● Example:
○ Input:
■ Sentence in English: "The [MASK] is barking loudly."
■ Parallel sentence in Spanish: "El perro está ladrando en voz alta."
○ Output: Predict the masked word: "dog."
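A minimal sketch of how a cross-lingual MLM training example can be assembled from a parallel sentence pair: mask tokens on the source side and append the target-language sentence as extra context. The special tokens and the 15% masking rate are illustrative; Unicoder's exact preprocessing differs in detail.

```python
# Minimal sketch: build a cross-lingual MLM example from a parallel sentence pair.
# Special tokens and masking rate are illustrative, not Unicoder's exact recipe.
import random

def make_xlm_mlm_example(src: str, tgt: str, mask_rate: float = 0.15, seed: int = 0):
    rng = random.Random(seed)
    src_tokens = src.split()
    labels = {}
    for i, tok in enumerate(src_tokens):
        if rng.random() < mask_rate:
            labels[i] = tok              # remember the original token to predict
            src_tokens[i] = "[MASK]"
    if not labels:                       # ensure at least one masked position
        labels[1] = src_tokens[1]
        src_tokens[1] = "[MASK]"
    # Concatenate masked source and the parallel target sentence as one input sequence.
    input_seq = ["[CLS]"] + src_tokens + ["[SEP]"] + tgt.split() + ["[SEP]"]
    return input_seq, labels

inputs, labels = make_xlm_mlm_example(
    "The dog is barking loudly",
    "El perro está ladrando en voz alta",
)
print(inputs)   # masked source tokens followed by the target-language context
print(labels)   # {position in source: original token to predict}
```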

Difficulties in Fully Multilingual Learning

Multilingual models aim to learn and support multiple languages within a single model. However,
this approach comes with significant challenges, particularly when dealing with a mix of
high-resource and low-resource languages. Let’s break this down:
1. Curse of Multilinguality

● Problem:
○ For a fixed-sized model, the per-language capacity decreases as the number of
supported languages increases.
○ This means:
■ The model must distribute its limited parameters across all languages.
■ As more languages are added, the model's ability to represent and learn
each language diminishes.
● Example:
○ A model trained on 10 languages has more capacity per language compared to one
trained on 100 languages.
○ Result: The model's performance on individual languages, especially low-resource
ones, can degrade.

2. Impact of Adding Low-Resource Languages

● Problem:
○ Adding more low-resource languages to the training set can negatively impact the
performance of high-resource languages.
○ The model reallocates capacity and focuses on balancing all languages, which can:
■ Hurt high-resource language translation quality.
■ Spread the model too thin across both high-resource and low-resource
languages.
● Example:
○ Adding languages like Quechua or Tigrinya (low-resource) to a model that already
supports English, French, and German can reduce the quality of English-to-French or
English-to-German translations.

Explanation of "Learning to Balance Data"

The diagram illustrates an adaptive data balancing approach for training multilingual neural machine
translation (NMT) models. The method dynamically adjusts the sampling distribution of training data
for each language to optimize model performance across all languages. Here's a detailed breakdown
of the components:

Key Components in the Diagram

1. Training Data:
○ D_train_1, D_train_2, …, D_train_n are the training datasets for each of the n languages.
○ These datasets differ in size, with high-resource languages having significantly more
data than low-resource ones.
2. Development Data:
○ D_dev_1, D_dev_2, …, D_dev_n are the development (validation) sets for each language.
○ Used to evaluate model performance during training.
3. Scorer (ψ_t):
○ A scoring mechanism that determines the importance or weight of each language
dataset at time t.
○ Generates the sampling distribution which specifies how frequently data from each
language should be sampled.
4. Model:
○ The multilingual NMT model being trained.

How It Works

1. Gradient Calculation on Training Data:


○ The model computes the training gradients for each language dataset i.

2. Evaluate Gradients on Development Data:


○ After applying training gradients, the model evaluates its performance on the
multilingual development set D_dev.
○ This results in gradients reflecting how well the model generalizes to the dev set.
3. Scorer Adjusts Sampling Distribution:
○ The scorer compares the training gradients of each language with the development gradients.
○ Languages whose training gradients align well with the dev set gradients are
upweighted, meaning they are sampled more frequently.
○ The sampling probabilities are adjusted accordingly.
4. Optimize Data Sampling:
○ The updated sampling distribution ensures that low-resource languages are not
overshadowed by high-resource languages, improving the overall multilingual
performance.
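A minimal sketch of the gradient-alignment idea in steps 1-3, written with PyTorch: per-language training gradients are compared to the dev-set gradient by cosine similarity, and sampling weights are nudged toward languages whose gradients align. This is a simplification of the learned scorer ψ_t; the loss function, batch construction, and multiplicative update rule are illustrative assumptions.

```python
# Sketch of gradient-alignment-based data balancing (simplified: the learned scorer
# psi_t is replaced by a direct multiplicative weight update on sampling weights).
import torch
import torch.nn.functional as F

def flat_grad(loss, model):
    """Flatten gradients of `loss` w.r.t. all trainable parameters into one vector."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.cat([g.reshape(-1) for g in grads if g is not None])

def update_sampling_weights(model, loss_fn, train_batches, dev_batch, weights, lr=0.1):
    """train_batches: {lang: batch}; weights: {lang: float} current sampling weights."""
    dev_grad = flat_grad(loss_fn(model, dev_batch), model)     # gradient on the dev set
    new_weights = {}
    for lang, batch in train_batches.items():
        train_grad = flat_grad(loss_fn(model, batch), model)   # per-language training gradient
        align = F.cosine_similarity(train_grad, dev_grad, dim=0)
        # Upweight languages whose training gradient points the same way as the dev gradient.
        new_weights[lang] = weights[lang] * torch.exp(lr * align).item()
    total = sum(new_weights.values())
    return {lang: w / total for lang, w in new_weights.items()}  # renormalized distribution
```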

Core Principles

1. Dynamic Sampling:
○ Instead of using fixed sampling rates, the model learns to prioritize languages
dynamically based on their contribution to improving dev set performance.
2. Gradient Alignment:
○ Languages whose training gradients align with the dev set gradients are likely to
improve generalization and are given higher weights.
3. Fair Representation:
○ Low-resource languages, which tend to have less training data, can be upweighted to
ensure they are adequately represented during training.

Benefits of This Approach

1. Better Multilingual Performance:


○ Optimizes performance across all languages by ensuring fair treatment of both
high-resource and low-resource languages.
2. Adaptive Balancing:
○ Automatically adjusts to the needs of each language based on model performance
during training.
3. Efficient Use of Data:
○ Ensures that even small datasets for low-resource languages contribute effectively to
the model’s training.
