SpeechMatrix provides large-scale parallel speech mined from VoxPopuli in 17 languages: Czech (cs), German (de), English (en), Spanish (es), Estonian (et), Finnish (fi), French (fr), Croatian (hr), Hungarian (hu), Italian (it), Lithuanian (lt), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Slovak (sk) and Slovenian (sl).
Below is a summary of the mined speech-to-speech data: the table reports the duration (in hours) of source speech for each of the 272 language directions.
| Src/Tgt | cs | de | en | es | et | fi | fr | hr | hu | it | lt | nl | pl | pt | ro | sk | sl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cs | - | 2381 | 3208 | 2290 | 952 | 1312 | 2476 | 726 | 1396 | 2410 | 84 | 2377 | 2516 | 1867 | 1190 | 2146 | 452 |
| de | 2386 | - | 4734 | 3113 | 901 | 1477 | 3536 | 498 | 1871 | 3476 | 41 | 3384 | 2632 | 2250 | 1281 | 1646 | 361 |
| en | 3172 | 4676 | - | 4715 | 1585 | 2169 | 5178 | 824 | 2266 | 4897 | 82 | 4422 | 3583 | 3572 | 2258 | 2306 | 586 |
| es | 2240 | 3041 | 4708 | - | 862 | 1373 | 4446 | 528 | 1599 | 4418 | 47 | 3067 | 2646 | 3484 | 1857 | 1603 | 308 |
| et | 943 | 892 | 1593 | 877 | - | 1201 | 934 | 265 | 1119 | 1019 | 39 | 1055 | 949 | 721 | 419 | 780 | 196 |
| fi | 1296 | 1463 | 2180 | 1393 | 1197 | - | 1449 | 306 | 1473 | 1599 | 47 | 1654 | 1350 | 1128 | 621 | 977 | 260 |
| fr | 2424 | 3457 | 5171 | 4455 | 923 | 1435 | - | 560 | 1711 | 4618 | 50 | 3273 | 2822 | 3384 | 1991 | 1657 | 326 |
| hr | 736 | 507 | 854 | 553 | 273 | 317 | 588 | - | 328 | 615 | 24 | 546 | 660 | 433 | 277 | 586 | 136 |
| hu | 1417 | 1897 | 2346 | 1672 | 1140 | 1507 | 1787 | 328 | - | 1855 | 68 | 1839 | 1566 | 1315 | 808 | 1064 | 311 |
| it | 2404 | 3460 | 4948 | 4500 | 1028 | 1614 | 4700 | 607 | 1823 | - | 103 | 3414 | 2848 | 3421 | 1995 | 1656 | 474 |
| lt | 78 | 38 | 79 | 46 | 37 | 44 | 48 | 21 | 61 | 95 | - | 77 | 80 | 35 | 18 | 64 | 6 |
| nl | 2322 | 3305 | 4396 | 3066 | 1040 | 1633 | 3269 | 521 | 1768 | 3355 | 80 | - | 2459 | 2399 | 1352 | 1646 | 458 |
| pl | 2530 | 2646 | 3662 | 2735 | 967 | 1378 | 2913 | 656 | 1554 | 2883 | 88 | 2540 | - | 2121 | 1301 | 1892 | 431 |
| pt | 1849 | 2224 | 3606 | 3525 | 722 | 1131 | 3421 | 421 | 1279 | 3403 | 37 | 2436 | 2087 | - | 1579 | 1358 | 247 |
| ro | 1187 | 1275 | 2290 | 1894 | 423 | 627 | 2024 | 271 | 789 | 1996 | 19 | 1384 | 1288 | 1592 | - | 870 | 125 |
| sk | 2127 | 1628 | 2329 | 1631 | 781 | 982 | 1685 | 574 | 1038 | 1650 | 69 | 1676 | 1869 | 1361 | 867 | - | 370 |
| sl | 436 | 350 | 579 | 307 | 192 | 254 | 324 | 128 | 295 | 461 | 6 | 454 | 413 | 241 | 121 | 359 | - |
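As an illustration, the matrix above can be queried programmatically. This sketch hard-codes a small excerpt of the table (values are hours of mined source speech); the full matrix has 272 directions.

```python
# Hours of mined source speech for a few (source, target) directions,
# excerpted from the table above.
hours = {
    ("en", "fr"): 5178, ("en", "de"): 4676, ("en", "es"): 4715,
    ("fr", "en"): 5171, ("lt", "sl"): 6,    ("sl", "lt"): 6,
}

# Largest direction by mined source-speech duration.
top_pair = max(hours, key=hours.get)
print(top_pair, hours[top_pair])  # ('en', 'fr') 5178

# Directions with fewer than 50 hours are comparatively low-resource.
low_resource = sorted(p for p, h in hours.items() if h < 50)
print(low_resource)  # [('lt', 'sl'), ('sl', 'lt')]
```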
The mined data contains speech alignments with scores at or above a threshold of 1.06. Overlap with the VoxPopuli valid and test sets has been removed from the train set.
```bash
# SAVE_ROOT: the directory to save mined data
python mined_train_sets/download_mined_data.py \
  --save-root ${SAVE_ROOT}
```

Audios are saved to `${SAVE_ROOT}/audios/`. For example, English audios are compressed into `en_aud.zip`.
Speech alignments are saved to `${SAVE_ROOT}/aligned_speech/`. For example, each line of `en-fr.tsv.gz` contains a pair of aligned audio paths in English and French respectively, together with their alignment score.
Speech-to-unit (S2U) manifests are provided to facilitate S2U training, with units extracted from HuBERT models. VoxPopuli test and valid manifests are also released.
```bash
# SAVE_ROOT: the directory to save mined data
python mined_train_sets/download_speech_to_unit.py \
  --save-root ${SAVE_ROOT}
```

The speech-to-unit data for a language pair is saved to `${SAVE_ROOT}/s2u_manifests/${lang_pair}`.
In our bilingual speech-to-unit experiments, we set different thresholds to select subsets of mined data for training efficiency. You can reproduce the training data and configurations as follows:
```bash
# SAVE_ROOT: the directory to save SpeechMatrix mined data
python3 speech_to_speech/prep_bilingual_textless_manifest.py \
  --save-root ${SAVE_ROOT}
```

For speech-to-speech training, we use the mined data as the train set, and prepare valid and test sets from VoxPopuli, EuroParl-ST and FLEURS. Check out here for valid and test data preparation.
Check out here for Textless S2U model training.
| Src/Tgt | de | en | es | fr | it | nl | pl | pt | ro |
|---|---|---|---|---|---|---|---|---|---|
| cs | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt |
| de | - | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt |
| en | ckpt | - | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt |
| es | ckpt | ckpt | - | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt |
| et | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt |
| fi | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt |
| fr | ckpt | ckpt | ckpt | - | ckpt | ckpt | ckpt | ckpt | ckpt |
| hr | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt |
| hu | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt |
| it | ckpt | ckpt | ckpt | ckpt | - | ckpt | ckpt | ckpt | ckpt |
| nl | ckpt | ckpt | ckpt | ckpt | ckpt | - | ckpt | ckpt | ckpt |
| pl | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | - | ckpt | ckpt |
| pt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | - | ckpt |
| ro | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | - |
| sk | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt |
| sl | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt | ckpt |
Download the 260M Slavic-to-English model.
Check out here for XM Transformer model training.
Example training command:
```bash
python3 $FAIRSEQ_ROOT/train.py \
  --distributed-world-size 32 --distributed-port 12314 \
  --save-dir $SAVE_DIR $DATA_ROOT \
  --ddp-backend legacy_ddp --num-workers 0 \
  --task speech_to_text --criterion label_smoothed_cross_entropy \
  --no-epoch-checkpoints --report-accuracy --clip-norm 10.0 \
  --log-format simple --log-interval 500 --seed 121 \
  --max-update 160000 --share-decoder-input-output-embed \
  --validate-interval 1 --save-interval 1 --save-interval-updates 500 \
  --skip-invalid-size-inputs-valid-test --keep-best-checkpoints 10 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --lr 0.0001 \
  --dropout 0.1 --attention-dropout 0.1 --relu-dropout 0.1 \
  --lr-scheduler inverse_sqrt --warmup-updates 5000 \
  --arch xm_transformer --normalize --adaptor-n-layers 1 \
  --decoder-attention-heads 16 --decoder-normalize-before \
  --load-pretrained-decoder-from ${UNIT_MBART_PATH} \
  --w2v-path ${XLSR_PATH} --config-yaml config.yaml \
  --mask-prob 0.3 --mask-channel-prob 0.25 --mask-channel-length 10 \
  --layerdrop 0.1 --finetune-decoder-params all --label-smoothing 0.2 \
  --patience 30 --max-tokens 9000 --max-tokens-valid 9000 \
  --max-target-positions 9000 --max-source-positions 9000 \
  --max-positions 9000 --update-freq 2 \
  --train-subset $TRAIN_SUBSET --valid-subset $VALID_SUBSET \
  --checkpoint-activations --encoder-proj
```
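The command above uses `--lr-scheduler inverse_sqrt` with `--lr 0.0001` and `--warmup-updates 5000`. The shape of that schedule can be sketched as follows (a simplification that ignores fairseq's `--warmup-init-lr` detail):

```python
import math

def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_updates=5000):
    """Approximate inverse_sqrt schedule: linear warmup to peak_lr,
    then decay proportional to 1/sqrt(step)."""
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    return peak_lr * math.sqrt(warmup_updates / step)

# The learning rate peaks exactly at the end of warmup,
# then halves every 4x increase in step count.
print(inverse_sqrt_lr(5000))   # 0.0001
print(inverse_sqrt_lr(20000))  # 5e-05
```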
| Architecture | direction | #params | Link |
|---|---|---|---|
| Dense XM | Slavic-to-English | 1.3B | ckpt |
| Dense XM | All-to-English | 1.3B | ckpt |
The following arguments can be added to enable GShard MoE in XM Transformer training, following the settings in the paper (decoder-only MoE):

```bash
--moe-freq 2 --encoder-moe-freq 0 --decoder-moe-freq 2 --moe-expert-count $NUM_EXPERTS
```
| Architecture | direction | #params | Link |
|---|---|---|---|
| gshard XM | Slavic-to-English | 4.3B | ckpts |
| gshard XM | All-to-English | 4.3B | ckpts |
Download the vocoders below to synthesize target speech from predicted unit sequences.
| Unit config | Unit size | Vocoder language | Dataset | Model |
|---|---|---|---|---|
| mHuBERT, layer 11 | 1000 | de | CSS10 | ckpt, config |
| mHuBERT, layer 11 | 1000 | nl | CSS10 | ckpt, config |
| mHuBERT, layer 11 | 1000 | fi | CSS10 | ckpt, config |
| mHuBERT, layer 11 | 1000 | hu | CSS10 | ckpt, config |
| mHuBERT, layer 11 | 1000 | et | Common Voice | ckpt, config |
| mHuBERT, layer 11 | 800 | it | VoxPopuli | ckpt, config |
| mHuBERT, layer 11 | 1000 | pt | Common Voice | ckpt, config |
| mHuBERT, layer 11 | 1000 | ro | VoxPopuli | ckpt, config |
| mHuBERT, layer 11 | 1000 | cs | VoxPopuli | ckpt, config |
| mHuBERT, layer 11 | 1000 | pl | VoxPopuli | ckpt, config |
| mHuBERT, layer 11 | 1000 | hr | VoxPopuli | ckpt, config |
| mHuBERT, layer 11 | 1000 | lt | VoxPopuli | ckpt, config |
| mHuBERT, layer 11 | 1000 | sk | VoxPopuli | ckpt, config |
| mHuBERT, layer 11 | 1000 | sl | VoxPopuli | ckpt, config |
For English (en), Spanish (es) and French (fr), we reuse the HuBERT and k-means models released by the textless speech-to-speech translation work.
Check out here for Textless model inference.
- Check out the inference in fairseq-S2T to generate unit sequences (`${RESULTS_PATH}/generate-${GEN_SUBSET}.txt`).

```bash
fairseq-generate $DATA_ROOT \
  --config-yaml config.yaml \
  --task speech_to_text \
  --path $MODEL_DIR/checkpoint_best.pt --gen-subset $GEN_SUBSET \
  --max-tokens 18000 \
  --beam 10 --max-len-a 0.003125 --max-len-b 200 \
  --results-path ${RESULTS_PATH}
```
For MoE inference, add the following options and make sure that (1) the number of experts is divisible by the number of GPUs (`num_experts % num_gpus == 0`) and (2) there are no OOM issues:

```bash
--is-moe --distributed-world-size $NUM_GPUS
```
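The divisibility constraint in (1) can be checked before launching a job; a small sketch:

```python
def experts_per_gpu(num_experts, num_gpus):
    """MoE inference shards experts across GPUs, so the expert count
    must divide evenly by the GPU count."""
    if num_experts % num_gpus != 0:
        raise ValueError(
            f"{num_experts} experts cannot be evenly sharded over "
            f"{num_gpus} GPUs")
    return num_experts // num_gpus

print(experts_per_gpu(32, 8))  # 4
```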
- Convert unit sequences to waveforms.

```bash
grep "^D\-" ${RESULTS_PATH}/generate-${GEN_SUBSET}.txt | \
  sed 's/^D-//ig' | sort -nk1 | cut -f3 \
  > ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit
```
```bash
python examples/speech_to_speech/generate_waveform_from_code.py \
  --in-code-file ${RESULTS_PATH}/generate-${GEN_SUBSET}.unit \
  --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG \
  --results-path ${RESULTS_PATH} --dur-prediction
```
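The `grep`/`sed`/`sort`/`cut` step above can equivalently be done in Python, which avoids shell quirks. This sketch assumes fairseq's standard generate-output format, where hypothesis lines look like `D-<index>\t<score>\t<units>`:

```python
def extract_hypotheses(lines):
    """Collect D-* hypothesis lines from a fairseq generate log,
    sorted by sample index, keeping only the unit sequence (3rd field)."""
    hyps = []
    for line in lines:
        if line.startswith("D-"):
            head, _score, units = line.rstrip("\n").split("\t", 2)
            hyps.append((int(head[2:]), units))
    return [units for _idx, units in sorted(hyps)]

# Typical use: read generate-${GEN_SUBSET}.txt and write one unit
# sequence per line to generate-${GEN_SUBSET}.unit.
```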
Use the following command to compute ASR-BLEU from inference results on test sets, where:
- `TGT_LANG`: target language
- `AUDIO_FOLDER`: a folder containing all inference results (audio files)
- `REFERENCE_PATH`: a txt file with one plain-text reference translation per line

```bash
cd ${FAIRSEQ_ROOT}/examples/speech_to_speech/asr_bleu
python3 compute_asr_bleu.py --lang ${TGT_LANG} \
  --audio_dirpath ${AUDIO_FOLDER} \
  --reference_path ${REFERENCE_PATH} \
  --reference_format txt
```
For those interested in trying out our speech encoders and mining parallel speech themselves, we also release the speech encoders. Please check out the speech encoding instructions for more details.
This demo on Hugging Face includes the all-to-English multilingual model and bilingual models with target languages {en, fr, es}, trained on SpeechMatrix data: https://huggingface.co/spaces/facebook/speech_matrix
```bibtex
@inproceedings{speech-matrix,
    title = "{S}peech{M}atrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations",
    author = "Paul-Ambroise Duquenne and
      Hongyu Gong and
      Ning Dong and
      Jingfei Du and
      Ann Lee and
      Vedanuj Goswami and
      Changhan Wang and
      Juan Pino and
      Beno{\^\i}t Sagot and
      Holger Schwenk",
}
```
The released models and dataset are licensed under CC-BY-NC 4.0.