WeSpeaker

WeSpeaker focuses on speaker embedding learning, with speaker verification as its main application. It supports both online feature extraction and loading pre-extracted features in Kaldi format.

Installation

Install python package

pip install git+https://github.com/wenet-e2e/wespeaker.git

Command-line usage (use -h for parameters):

$ wespeaker --task embedding --audio_file audio.wav --output_file embedding.txt
$ wespeaker --task embedding_kaldi --wav_scp wav.scp --output_file /path/to/embedding
$ wespeaker --task similarity --audio_file audio.wav --audio_file2 audio2.wav
$ wespeaker --task diarization --audio_file audio.wav
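
The `embedding_kaldi` task above reads a `wav.scp` file. In the Kaldi convention, each line of such a file holds an utterance id followed by the path to the audio. The sketch below shows one way to generate that file from a directory of recordings. The helper name `build_wav_scp` is hypothetical and not part of wespeaker, and using the file stem as the utterance id is an assumption that only holds when stems are unique.

```python
from pathlib import Path


def build_wav_scp(wav_dir, scp_path):
    """Write one '<utt_id> <absolute wav path>' line per .wav file found."""
    lines = []
    for wav in sorted(Path(wav_dir).rglob("*.wav")):
        utt_id = wav.stem  # assumption: file stems are unique utterance ids
        lines.append(f"{utt_id} {wav.resolve()}")
    # One entry per line, each terminated by a newline.
    Path(scp_path).write_text(
        "".join(line + "\n" for line in lines), encoding="utf-8"
    )
    return lines
```

The resulting file can then be passed to `--wav_scp`, as in the command shown above.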

Python programming usage:

import wespeaker

model = wespeaker.load_model('chinese')
embedding = model.extract_embedding('audio.wav')
utt_names, embeddings = model.extract_embedding_list('wav.scp')
similarity = model.compute_similarity('audio1.wav', 'audio2.wav')
diar_result = model.diarize('audio.wav')

See python usage for more command-line and Python programming examples.
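
Speaker verification typically compares two embeddings with cosine similarity. The following is a minimal pure-Python sketch of that comparison, using toy vectors in place of real embeddings. The function `cosine_similarity` here is illustrative and is not the wespeaker API; the score returned by `model.compute_similarity` may be scaled or normalized differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors stand in for real speaker embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction score close to 1, while orthogonal vectors score 0; a verification decision is then made by comparing the score with a threshold.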

Install for development & deployment

Clone this repo

git clone https://github.com/wenet-e2e/wespeaker.git

Create a conda environment (PyTorch >= 1.12.1 is recommended):

conda create -n wespeaker python=3.9
conda activate wespeaker
conda install pytorch=1.12.1 torchaudio=0.12.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install -r requirements.txt
pre-commit install  # for clean and tidy code

🔥 News

2024.09.03: Support SimAM_ResNet and the model pretrained on VoxBlink2. See Pretrained Models for the pretrained model, the VoxCeleb Recipe for its performance, and python usage for command-line usage.
2024.08.30: Support a whisper_encoder-based frontend and propose the Whisper-PMFA framework, see #356.
2024.08.20: Update the diarization recipe for the VoxConverse dataset by leveraging UMAP dimensionality reduction and HDBSCAN clustering, see #347 and #352.
2024.08.18: Support using SSL pre-trained models as the frontend. A WavLM recipe is also provided, see #344.
2024.05.15: Add support for quality-aware score calibration, see #320.
2024.04.25: Add support for the gemini-dfresnet model, see #291.
2024.04.23: Support MNN inference engine in runtime, see #310.
2024.04.02: Release the WeSpeaker documentation, with detailed model-training tutorials, an introduction to the various runtime platforms, etc.
2024.03.04: Support the eres2net-cn-common-200k and campplus-cn-common-200k models from DAMO, see #281; check python usage for details.
2024.02.05: Support the ERes2Net #272 and Res2Net #273 models.
2023.11.13: Support CLI usage of wespeaker, check python usage for details.
2023.07.18: Support the kaldi-compatible PLDA and unsupervised adaptation, see #186.
2023.07.14: Support the NIST SRE16 recipe, see #177.

Recipes

VoxCeleb: Speaker Verification recipe on the VoxCeleb dataset
- 🔥 UPDATE 2024.05.15: We support score calibration for VoxCeleb and achieve better performance!
- 🔥 UPDATE 2023.07.10: We support a self-supervised learning recipe on VoxCeleb, achieving 2.627% EER (ECAPA_TDNN_GLOB_c1024) on the vox1-O-clean test set without any labels.
- 🔥 UPDATE 2022.10.31: We support deep r-vector up to the 293-layer version, achieving 0.447%/0.043 EER/minDCF on the vox1-O-clean test set.
- 🔥 UPDATE 2022.07.19: We apply the same setup as the CNCeleb recipe and obtain SOTA performance among open-source systems.
  - EER/minDCF on the vox1-O-clean test set are 0.723%/0.069 (ResNet34) and 0.728%/0.099 (ECAPA_TDNN_GLOB_c1024), after large-margin (LM) fine-tuning and AS-Norm
CNCeleb: Speaker Verification recipe on the CNCeleb dataset
- 🔥 UPDATE 2024.05.16: We support score calibration for CNCeleb and achieve better EER.
- 🔥 UPDATE 2022.10.31: 221-layer ResNet achieves 5.655%/0.330 EER/minDCF
- 🔥 UPDATE 2022.07.12: We migrate the winning system of CNSRC 2022 (report, slides)
  - EER/minDCF reduction from 8.426%/0.487 to 6.492%/0.354 after large margin fine-tuning and AS-Norm
NIST SRE16: Speaker Verification recipe for the 2016 NIST Speaker Recognition Evaluation Plan. A similar recipe can be found in Kaldi.
- 🔥 UPDATE 2023.07.14: We support the NIST SRE16 recipe. After PLDA adaptation, we achieve 6.608%, 10.01%, and 2.974% EER on the Pooled, Tagalog, and Cantonese trials, respectively.
VoxConverse: Diarization recipe on the VoxConverse dataset
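
The verification results above are reported as EER (equal error rate), the operating point at which the false acceptance rate equals the false rejection rate. As a rough illustration of the metric, and not the scoring code used by the recipes, here is a small pure-Python sketch. The function name `compute_eer` and the toy trial list are assumptions made for this example; the sketch sweeps every observed score as a threshold, which is simple but slow for large trial lists.

```python
def compute_eer(scores, labels):
    """Return an approximate EER in [0, 1].

    scores: higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = None, None
    for threshold in sorted(set(scores)):
        # A trial is accepted when its score is >= threshold.
        false_accepts = sum(
            1 for s, l in zip(scores, labels) if s >= threshold and l == 0
        )
        false_rejects = sum(
            1 for s, l in zip(scores, labels) if s < threshold and l == 1
        )
        far = false_accepts / n_nontarget
        frr = false_rejects / n_target
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy trials: four target and four non-target scores.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
toy_eer = compute_eer(toy_scores, toy_labels)
```

For the toy trials, the threshold 0.6 wrongly accepts one of four non-target trials and wrongly rejects one of four target trials, so both error rates are 25% and the EER is 0.25.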

Discussion

For Chinese users, you can scan the QR code on the left to follow the official account of the WeNet Community. We have also created a WeChat group for discussion and quicker responses. Please scan the QR code on the right to join the chat group.

Citations

If you find wespeaker useful, please cite it as

@article{wang2024advancing,
  title={Advancing speaker embedding learning: Wespeaker toolkit for research and production},
  author={Wang, Shuai and Chen, Zhengyang and Han, Bing and Wang, Hongji and Liang, Chengdong and Zhang, Binbin and Xiang, Xu and Ding, Wen and Rohdin, Johan and Silnova, Anna and others},
  journal={Speech Communication},
  volume={162},
  pages={103104},
  year={2024},
  publisher={Elsevier}
}

@inproceedings{wang2023wespeaker,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Looking for contributors

If you are interested in contributing, feel free to contact @wsstriving or @robin1001.
