Whisper Speaker Identification (WSI) is a speaker identification model designed for multilingual scenarios. It adapts the encoder from OpenAI's Whisper and fine-tunes it together with a projection head, using a hybrid loss (Online Triplet + Multi-View Self-Supervised). This training setup improves the model's ability to produce discriminative, language-agnostic speaker embeddings. WSI demonstrates state-of-the-art performance on multilingual datasets, achieving lower Equal Error Rates (EER) and higher F1 scores.
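To make the "Online Triplet" half of the hybrid loss more concrete, here is a minimal sketch of a triplet loss with triplets mined inside each batch. It is an illustration only, not the WSI training code: the function name `batch_hard_triplet_loss`, the batch-hard mining strategy, the Euclidean distance on L2-normalized embeddings, and the default margin are all assumptions, and the multi-view self-supervised term is not shown.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(embeddings, speaker_ids, margin=1.0):
    """Hypothetical online triplet loss with batch-hard mining.

    For every anchor in the batch, pick the farthest embedding from the
    same speaker (hardest positive) and the closest embedding from a
    different speaker (hardest negative), then apply a hinge with `margin`.
    Anchors lacking a positive or a negative in the batch are skipped.
    """
    embs = [l2_normalize(e) for e in embeddings]
    losses = []
    for i, anchor in enumerate(embs):
        pos = [euclidean(anchor, embs[j]) for j in range(len(embs))
               if j != i and speaker_ids[j] == speaker_ids[i]]
        neg = [euclidean(anchor, embs[j]) for j in range(len(embs))
               if speaker_ids[j] != speaker_ids[i]]
        if not pos or not neg:
            continue  # no valid triplet for this anchor
        losses.append(max(0.0, max(pos) - min(neg) + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two speakers whose embeddings are already well separated.
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
separated = batch_hard_triplet_loss(toy_embeddings, ["a", "a", "b", "b"])
# Same embeddings, but labels that contradict the geometry.
confused = batch_hard_triplet_loss(toy_embeddings, ["a", "b", "a", "b"])
print(separated, confused)
```

In the toy batch, the loss is zero when same-speaker embeddings coincide and different speakers are far apart, and it becomes positive when the labels contradict the geometry. That gradient signal is what pushes an embedding model toward speaker-discriminative representations.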
Coming Soon!
If you use this work, please cite:
Jakaria Islam Emon, Md Abu Salek, Kazi Tamanna Alam
"Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings"
arXiv preprint, 2025.
https://doi.org/10.48550/arXiv.2503.10446
@article{emon2025whisper,
author = {Jakaria Islam Emon and Md Abu Salek and Kazi Tamanna Alam},
title = {Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings},
journal = {arXiv preprint},
year = {2025},
eprint = {2503.10446},
archivePrefix = {arXiv},
primaryClass = {cs.SD},
doi = {10.48550/arXiv.2503.10446}
}

This project is licensed under the CC BY-NC-SA 4.0 License.