The results reported in our paper were obtained on a Windows system. We recently found that running the same repo and dataset on Linux with the pretrained models yields different results:
- Res-TSSDNet ASVspoof2019 eval EER: 1.6590%;
- Inc-TSSDNet ASVspoof2019 eval EER: 4.0384%.
We have identified an issue with the soundfile package on Windows when writing and reading FLAC files; the same package does not exhibit this problem on Linux. A similar problem has been pointed out here.
We present two lightweight neural network models, termed time-domain synthetic speech detection networks (TSSDNets), with classic ResNet- and Inception-Net-style structures (Res-TSSDNet and Inc-TSSDNet), for end-to-end synthetic speech detection. They achieve state-of-the-art performance in terms of equal error rate (EER) on the ASVspoof 2019 challenge and also show promising generalization capability when tested on ASVspoof 2015.
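For intuition, a ResNet-style model on raw waveforms is built from 1D residual blocks like the sketch below (an illustration only, not the paper's exact architecture; see the repo's model code for the real layer configuration):

```python
# A minimal 1D residual block on raw waveforms, in the style of
# Res-TSSDNet (sketch only; not the authors' exact architecture).
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm1d(channels)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        y = self.act(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.act(x + y)  # identity shortcut keeps the time length

x = torch.randn(2, 16, 16000)       # (batch, channels, samples)
out = ResBlock1D(16)(x)
print(out.shape)                    # torch.Size([2, 16, 16000])
```

The Inception-style variant instead runs parallel convolution branches of different kernel sizes and concatenates them; both operate directly on the time-domain signal.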
- ASVspoof 2019 train set is used for training;
- ASVspoof 2019 dev set is used for model selection;
- ASVspoof 2019 eval set is used for testing;
- ASVspoof 2015 eval set is used for cross-dataset testing.
The two models, with 1.64% and 4.04% eval EER respectively (below), and their training logs are provided in the folder `pretrained`.
With all hyperparameters fixed, the distribution of the lowest dev (and corresponding eval) EERs over 100 epochs, trained from scratch, is shown below:
`ASVspoof15&19_LA_Data_Preparation.py`
It generates
- equal-duration time-domain raw waveforms
- 2D log-power constant-Q transform (CQT) features

from the ASVspoof 2019 and ASVspoof 2015 official datasets. The CQT calculation is adopted from Li et al., ICASSP 2021.
`train.py`
It supports training using
- standard cross-entropy vs weighted cross-entropy
- standard train loader vs mixup regularization
- 1D raw waveforms vs 2D CQT features
- ASVspoof 2019 training set vs ASVspoof 2015 training set
A training log will be generated, and the trained model for each epoch will be saved.
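The mixup option above can be sketched as follows, combined with (optionally weighted) cross-entropy; the alpha value and class weights are illustrative, not necessarily the values used by train.py:

```python
# Mixup regularization with (optionally weighted) cross-entropy, sketched
# in PyTorch; alpha and the class weights below are illustrative values.
import torch
import torch.nn.functional as F

def mixup_batch(x, y, alpha=0.5):
    """Mix each sample with a random partner; lam ~ Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], y, y[perm], lam

def mixup_loss(logits, y_a, y_b, lam, class_weights=None):
    # Weighted cross-entropy can counter the bonafide/spoof class imbalance.
    ce = lambda t: F.cross_entropy(logits, t, weight=class_weights)
    return lam * ce(y_a) + (1 - lam) * ce(y_b)

x = torch.randn(4, 1, 16000)            # batch of raw waveforms
y = torch.tensor([0, 1, 1, 0])          # label convention assumed for illustration
x_mix, y_a, y_b, lam = mixup_batch(x, y)
logits = torch.randn(4, 2)              # stand-in for model(x_mix)
loss = mixup_loss(logits, y_a, y_b, lam, torch.tensor([1.0, 9.0]))
print(float(loss))
```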
`test.py`
It computes the softmax accuracy, ROC curve, and EER.
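The EER is the operating point where the false-acceptance and false-rejection rates cross. A minimal sketch of one common way to compute it from per-utterance scores (test.py's exact implementation may differ):

```python
# Minimal EER computation from detection scores: sweep the threshold and
# return the rate at the point where FAR and FRR cross.
import numpy as np

def compute_eer(scores, labels):
    """scores: higher = more bonafide; labels: 1 = bonafide, 0 = spoof."""
    order = np.argsort(scores)
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # As the threshold sweeps upward, FRR rises while FAR falls.
    frr = np.cumsum(labels) / n_pos           # positives rejected so far
    far = 1 - np.cumsum(1 - labels) / n_neg   # negatives still accepted
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2

scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.1])
labels = np.array([1, 1, 0, 1, 0, 0])
print(compute_eer(scores, labels))  # ≈ 0.333 for this toy batch
```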
G. Hua, A. B. J. Teoh, and H. Zhang, “Towards end-to-end synthetic speech detection,” IEEE Signal Processing Letters, vol. 28, pp. 1265–1269, 2021. arXiv | IEEE Xplore