
SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations

Installation

Speech-to-Speech Mined Data

SpeechMatrix provides large-scale parallel speech mined from VoxPopuli in 17 languages: Czech (cs), German (de), English (en), Spanish (es), Estonian (et), Finnish (fi), French (fr), Croatian (hr), Hungarian (hu), Italian (it), Lithuanian (lt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk) and Slovenian (sl).

The table below summarizes the mined speech-to-speech data. Each cell reports the duration (in hours) of source speech for one of the 272 language directions; rows are source languages and columns are target languages.

Src/Tgt	cs	de	en	es	et	fi	fr	hr	hu	it	lt	nl	pl	pt	ro	sk	sl
cs	-	2381	3208	2290	952	1312	2476	726	1396	2410	84	2377	2516	1867	1190	2146	452
de	2386	-	4734	3113	901	1477	3536	498	1871	3476	41	3384	2632	2250	1281	1646	361
en	3172	4676	-	4715	1585	2169	5178	824	2266	4897	82	4422	3583	3572	2258	2306	586
es	2240	3041	4708	-	862	1373	4446	528	1599	4418	47	3067	2646	3484	1857	1603	308
et	943	892	1593	877	-	1201	934	265	1119	1019	39	1055	949	721	419	780	196
fi	1296	1463	2180	1393	1197	-	1449	306	1473	1599	47	1654	1350	1128	621	977	260
fr	2424	3457	5171	4455	923	1435	-	560	1711	4618	50	3273	2822	3384	1991	1657	326
hr	736	507	854	553	273	317	588	-	328	615	24	546	660	433	277	586	136
hu	1417	1897	2346	1672	1140	1507	1787	328	-	1855	68	1839	1566	1315	808	1064	311
it	2404	3460	4948	4500	1028	1614	4700	607	1823	-	103	3414	2848	3421	1995	1656	474
lt	78	38	79	46	37	44	48	21	61	95	-	77	80	35	18	64	6
nl	2322	3305	4396	3066	1040	1633	3269	521	1768	3355	80	-	2459	2399	1352	1646	458
pl	2530	2646	3662	2735	967	1378	2913	656	1554	2883	88	2540	-	2121	1301	1892	431
pt	1849	2224	3606	3525	722	1131	3421	421	1279	3403	37	2436	2087	-	1579	1358	247
ro	1187	1275	2290	1894	423	627	2024	271	789	1996	19	1384	1288	1592	-	870	125
sk	2127	1628	2329	1631	781	982	1685	574	1038	1650	69	1676	1869	1361	867	-	370
sl	436	350	579	307	192	254	324	128	295	461	6	454	413	241	121	359	-

Speech-to-Speech Alignments

The released speech alignments are mined at a score threshold of 1.06. Any overlap between the train set and the VoxPopuli valid and test data has been removed from the train set.

# SAVE_ROOT: the directory to save mined data
python mined_train_sets/download_mined_data.py \
    --save-root ${SAVE_ROOT}

Audios are saved to ${SAVE_ROOT}/audios/. For example, English audios are compressed into en_aud.zip.

Speech alignments are saved to ${SAVE_ROOT}/aligned_speech/. For example, each line of en-fr.tsv.gz contains a pair of aligned audio paths, in English and French respectively, together with their alignment score.
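As a minimal sketch of how such a file might be consumed, the Python snippet below reads a gzipped alignment TSV and keeps only pairs whose score reaches a chosen threshold. The column order (source path, target path, score) and the absence of a header row are assumptions, not something this README specifies; inspect a few lines of the downloaded file and adjust `src_col`, `tgt_col`, `score_col` and `has_header` accordingly. `load_alignments` is a hypothetical helper, not part of the release.

```python
import csv
import gzip
from typing import Iterator, NamedTuple


class Alignment(NamedTuple):
    src_audio: str
    tgt_audio: str
    score: float


def load_alignments(
    path: str,
    min_score: float = 1.06,
    src_col: int = 0,          # ASSUMPTION: column index of the source audio path
    tgt_col: int = 1,          # ASSUMPTION: column index of the target audio path
    score_col: int = 2,        # ASSUMPTION: column index of the alignment score
    has_header: bool = False,  # ASSUMPTION: set to True if the file has a header row
) -> Iterator[Alignment]:
    """Yield aligned pairs from a gzipped TSV whose score is >= min_score."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # skip the header row
        for row in reader:
            if not row:
                continue  # ignore blank lines
            score = float(row[score_col])
            if score >= min_score:
                yield Alignment(row[src_col], row[tgt_col], score)
```

Calling `list(load_alignments(path, min_score=1.09))` on a downloaded file would keep a stricter subset than the released 1.06 threshold; the value 1.09 is only an illustration.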

Speech-to-Unit Data

Speech-to-unit manifests, with target units extracted by HuBERT models, are provided to facilitate S2U training. VoxPopuli test and valid manifests are also released.

# SAVE_ROOT: the directory to save mined data
python mined_train_sets/download_speech_to_unit.py \
    --save-root ${SAVE_ROOT}

The speech-to-unit data for a language pair is saved to ${SAVE_ROOT}/s2u_manifests/${lang_pair}.

Reproduce Bilingual Train Data

In our bilingual speech-to-unit experiments, we set different thresholds to select a subset of mined data for the purpose of training efficiency. You can reproduce the training data and configurations as below:

# SAVE_ROOT: the directory to save SpeechMatrix mined data
python3 speech_to_speech/prep_bilingual_textless_manifest.py --save-root ${SAVE_ROOT}

Speech-to-Speech Valid and Test Data

For speech-to-speech training, we use mined data as the train set, and prepare valid and test sets using VoxPopuli, EuroParl-ST and FLEURS. Check out here for valid and test data preparations.

Speech-to-Speech Training

Textless Model

Check out here for Textless S2U model training.

Bilingual Textless Model Release

Src/Tgt	de	en	es	fr	it	nl	pl	pt	ro
cs	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt
de	-	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt
en	ckpt	-	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt
es	ckpt	ckpt	-	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt
et	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt
fi	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt
fr	ckpt	ckpt	ckpt	-	ckpt	ckpt	ckpt	ckpt	ckpt
hr	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt
hu	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt
it	ckpt	ckpt	ckpt	ckpt	-	ckpt	ckpt	ckpt	ckpt
nl	ckpt	ckpt	ckpt	ckpt	ckpt	-	ckpt	ckpt	ckpt
pl	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	-	ckpt	ckpt
pt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	-	ckpt
ro	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	-
sk	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt
sl	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt	ckpt

Multilingual Textless Model

Download the 260M Slavic-to-English model.

XM Transformer

Check out here for XM Transformer model training.

Example training command:

python3 $FAIRSEQ_ROOT/train.py \
  --distributed-world-size 32 --distributed-port 12314 \
  --save-dir $SAVE_DIR $DATA_ROOT \
  --ddp-backend legacy_ddp --num-workers 0 \
  --task speech_to_text --criterion label_smoothed_cross_entropy \
  --no-epoch-checkpoints --report-accuracy --clip-norm 10.0 \
  --log-format simple --log-interval 500 --seed 121 \
  --max-update 160000 --share-decoder-input-output-embed \
  --validate-interval 1 --save-interval 1 --save-interval-updates 500 \
  --skip-invalid-size-inputs-valid-test --keep-best-checkpoints 10 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --lr 0.0001 \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --lr-scheduler inverse_sqrt --warmup-updates 5000 \
  --arch xm_transformer --normalize --adaptor-n-layers 1 \
  --decoder-attention-heads 16 --decoder-normalize-before \
  --load-pretrained-decoder-from ${UNIT_MBART_PATH} \
  --w2v-path ${XLSR_PATH} --config-yaml config.yaml \
  --mask-prob 0.3 --mask-channel-prob 0.25 --mask-channel-length 10 \
  --layerdrop 0.1 --finetune-decoder-params all \
  --label-smoothing 0.2 --patience 30 \
  --max-tokens 9000 --max-tokens-valid 9000 \
  --max-target-positions 9000 --max-source-positions 9000 --max-positions 9000 \
  --update-freq 2 \
  --train-subset $TRAIN_SUBSET --valid-subset $VALID_SUBSET \
  --checkpoint-activations --encoder-proj

Architecture	direction	#params	Link
Dense XM	Slavic-to-English	1.3B	ckpt
Dense XM	All-to-English	1.3B	ckpt

Sparse XM Transformer (gshard)

The following arguments can be added to XM Transformer training to enable gshard, matching the setting in the paper (decoder-only MoE):

--moe-freq 2 --encoder-moe-freq 0 --decoder-moe-freq 2 --moe-expert-count $NUM_EXPERTS

Architecture	direction	#params	Link
gshard XM	Slavic-to-English	4.3B	ckpts
gshard XM	All-to-English	4.3B	ckpts

Inference

Download the vocoders below to synthesize target speech from predicted unit sequences.

Unit-based HiFi-GAN vocoder

Unit config	Unit size	Vocoder language	Dataset	Model
mHuBERT, layer 11	1000	de	CSS10	ckpt, config
mHuBERT, layer 11	1000	nl	CSS10	ckpt, config
mHuBERT, layer 11	1000	fi	CSS10	ckpt, config
mHuBERT, layer 11	1000	hu	CSS10	ckpt, config
mHuBERT, layer 11	1000	et	Common Voice	ckpt, config
mHuBERT, layer 11	800	it	VoxPopuli	ckpt, config
mHuBERT, layer 11	1000	pt	Common Voice	ckpt, config
mHuBERT, layer 11	1000	ro	VoxPopuli	ckpt, config
mHuBERT, layer 11	1000	cs	VoxPopuli	ckpt, config
mHuBERT, layer 11	1000	pl	VoxPopuli	ckpt, config
mHuBERT, layer 11	1000	hr	VoxPopuli	ckpt, config
mHuBERT, layer 11	1000	lt	VoxPopuli	ckpt, config
mHuBERT, layer 11	1000	sk	VoxPopuli	ckpt, config
mHuBERT, layer 11	1000	sl	VoxPopuli	ckpt, config

For English (en), Spanish (es) and French (fr), we reuse the HuBERT and k-means models released with the textless model.

Textless Model Inference

Check out here for Textless model inference.

XM Transformer Model Inference

Check out the inference in fairseq-S2T to generate unit sequences (${RESULTS_PATH}/generate-${GEN_SUBSET}.txt).

fairseq-generate $DATA_ROOT \
  --config-yaml config.yaml \
  --task speech_to_text  \
  --path $MODEL_DIR/checkpoint_best.pt  --gen-subset $GEN_SUBSET \
  --max-tokens 18000 \
  --beam 10 --max-len-a 0.003125 --max-len-b 200 \
  --results-path ${RESULTS_PATH}
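To make the length options concrete: fairseq's sequence generator caps the decoded length at roughly `max_len_a * src_len + max_len_b`. The sketch below only evaluates that formula with the values from the command above. Treating `src_len` as a count of 16 kHz waveform samples is an assumption about the input, and `max_decode_len` is a hypothetical helper, not a fairseq function.

```python
def max_decode_len(src_len: int, max_len_a: float = 0.003125, max_len_b: int = 200) -> int:
    """Upper bound on the number of generated units for a source of length src_len."""
    return int(max_len_a * src_len + max_len_b)


SAMPLE_RATE = 16_000  # ASSUMPTION: source audio is a 16 kHz raw waveform

# One second of audio: 0.003125 * 16000 = 50 units, plus the constant 200.
print(max_decode_len(1 * SAMPLE_RATE))
```

Under that assumption, the slope 0.003125 allows 50 units per second of source audio, and `--max-len-b 200` adds a fixed margin on top.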

For MoE inference, add the options below and make sure that:

(1) #num_experts % #num_gpus == 0

(2) inference does not run out of memory (OOM)

--is-moe --distributed-world-size $NUM_GPUS
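Condition (1) can be checked before launching a job. The following is a small sketch with a hypothetical helper, `valid_gpu_counts`, that lists the GPU counts dividing the expert count evenly; the expert count of 64 in the usage note is purely illustrative, and the helper says nothing about condition (2), memory.

```python
from typing import List


def valid_gpu_counts(num_experts: int, max_gpus: int) -> List[int]:
    """GPU counts in [1, max_gpus] for which num_experts % num_gpus == 0."""
    if num_experts <= 0 or max_gpus <= 0:
        raise ValueError("num_experts and max_gpus must be positive")
    return [n for n in range(1, max_gpus + 1) if num_experts % n == 0]
```

For example, `valid_gpu_counts(64, 8)` returns `[1, 2, 4, 8]`, so a 6-GPU run would violate condition (1) for a 64-expert model.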

Convert unit sequences to waveform.

grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
  sed 's/^D-//ig' | sort -nk1 | cut -f3 \
  > ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit

python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
  --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
  --results-path ${RESULTS_PATH} --dur-prediction
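For readers who prefer Python to the `grep`/`sed`/`sort`/`cut` pipeline above, the sketch below performs the same extraction. It relies on the tab-separated `D-<id>`, score, hypothesis layout that `fairseq-generate` writes for detokenized hypotheses; `extract_unit_sequences` is a hypothetical helper name.

```python
from typing import Iterable, List


def extract_unit_sequences(lines: Iterable[str]) -> List[str]:
    """Return the unit sequences from D- lines, ordered by numeric sample id."""
    hyps = []
    for line in lines:
        if not line.startswith("D-"):
            continue  # skip S-, T-, H-, P- and log lines
        fields = line.rstrip("\n").split("\t")
        sample_id = int(fields[0][2:])  # strip the "D-" prefix
        hyps.append((sample_id, fields[2]))  # third field holds the units
    hyps.sort(key=lambda item: item[0])
    return [units for _, units in hyps]
```

Passing an open handle on `generate-${GEN_SUBSET}.txt` to the function and writing one returned string per line produces the same `.unit` file as the shell version.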

ASR-BLEU evaluation

Use the following command to compute ASR-BLEU on the inference results for the test sets, where:

TGT_LANG: the target language

AUDIO_FOLDER: a folder containing all inference results (audio files)

REFERENCE_PATH: a text file in which each line is a reference translation in plain text

cd ${FAIRSEQ_ROOT}/examples/speech_to_speech/asr_bleu
python3 compute_asr_bleu.py --lang ${TGT_LANG} \
--audio_dirpath ${AUDIO_FOLDER} \
--reference_path ${REFERENCE_PATH} \
--reference_format txt
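Before running the evaluation, it can help to confirm that the audio folder and the reference file line up. The sketch below assumes one audio file per reference line and a `.wav` extension; both are assumptions for illustration, and `check_asr_bleu_inputs` is a hypothetical helper rather than part of the `asr_bleu` tooling.

```python
from pathlib import Path
from typing import Tuple


def check_asr_bleu_inputs(
    audio_folder: str,
    reference_path: str,
    audio_ext: str = ".wav",  # ASSUMPTION: extension of the synthesized audio files
) -> Tuple[int, int]:
    """Return (num_audio_files, num_reference_lines); raise if they differ."""
    n_audio = sum(1 for p in Path(audio_folder).iterdir() if p.suffix == audio_ext)
    with open(reference_path, encoding="utf-8") as f:
        n_refs = sum(1 for _ in f)
    if n_audio != n_refs:
        raise ValueError(f"{n_audio} audio files but {n_refs} reference lines")
    return n_audio, n_refs
```

A mismatch usually means that some utterances were skipped during generation or vocoding, which would misalign hypotheses and references.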

Mining with Speech Encoder

We also release our speech encoders for those who want to try them out and mine parallel speech themselves. Please check out the speech encoding instructions for more details.

Huggingface Demo for SpeechMatrix models

The demo on Hugging Face hosts the all-en multilingual model and bilingual models with target languages {en,fr,es}, all trained on SpeechMatrix data: https://huggingface.co/spaces/facebook/speech_matrix

Citation

@inproceedings{speech-matrix,
    title = "{S}peech{M}atrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations",
    author = "Paul-Ambroise Duquenne and
      Hongyu Gong and
      Ning Dong and
      Jingfei Du and
      Ann Lee and
      Vedanuj Goswami and
      Changhan Wang and
      Juan Pino and
      Beno{\^i}t Sagot and
      Holger Schwenk",
}

License

The released models and dataset are under CC-BY-NC 4.0.
