a) Download and extract the LJ Speech dataset, then create a symbolic link to the dataset folder: ln -s /xxx/LJSpeech-1.1/ data/raw/
b) Download the ground-truth durations extracted by MFA (Montreal Forced Aligner) and unpack them: tar -xvf mfa_outputs.tar; mv mfa_outputs data/processed/ljspeech/
c) Run the following commands to binarize the dataset for training and inference.
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config configs/tts/lj/fs2.yaml
# `data/binary/ljspeech` will be generated.
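Before training, it can help to confirm that steps a) to c) left files where the later commands expect them. The sketch below is a hypothetical helper, not part of the repository. It assumes the commands are run from the repository root, that step b) leaves the durations in `data/processed/ljspeech/mfa_outputs`, and that the linked dataset follows the standard LJ Speech layout (`metadata.csv` plus a `wavs/` folder).

```python
from pathlib import Path

# Paths implied by steps a) to c); metadata.csv and wavs/ follow the standard
# LJ Speech release layout. The mfa_outputs location is assumed from step b).
EXPECTED_PATHS = {
    "raw dataset (step a)": Path("data/raw/LJSpeech-1.1/metadata.csv"),
    "raw audio (step a)": Path("data/raw/LJSpeech-1.1/wavs"),
    "MFA durations (step b)": Path("data/processed/ljspeech/mfa_outputs"),
    "binarized data (step c)": Path("data/binary/ljspeech"),
}


def missing_paths(root="."):
    """Return the labels of expected paths that do not exist under `root`.

    Path.exists() follows symbolic links, so the link created in step a) counts
    only if its target is reachable.
    """
    root = Path(root)
    return [label for label, rel in EXPECTED_PATHS.items() if not (root / rel).exists()]
```

Calling `missing_paths()` from the repository root returns an empty list once all three steps have completed; any label it returns points at the step to revisit.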
We provide the pre-trained model of the HifiGAN vocoder. Please unzip it into the checkpoints directory before training your acoustic model.
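The unzip step can also be scripted with Python's standard `zipfile` module. This is a minimal sketch: `unpack_checkpoint` is a hypothetical helper, and it assumes the downloaded vocoder file is a regular .zip archive whose top-level entry is the vocoder's folder.

```python
import zipfile
from pathlib import Path


def unpack_checkpoint(archive, dest="checkpoints"):
    """Extract a checkpoint .zip into `dest` and return the top-level folders it contained."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        # Keep only the first path component of each member (the folder name).
        return sorted({name.split("/")[0] for name in zf.namelist()})
```

For example, `unpack_checkpoint("path/to/downloaded_vocoder.zip")` (a placeholder file name) extracts the archive and returns the folder names now under `checkpoints/`, which is a quick way to see what the vocoder directory is called.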
First, you need a pre-trained FastSpeech2 checkpoint. You can use our pre-trained model or train FastSpeech2 from scratch by running:
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config configs/tts/lj/fs2.yaml --exp_name fs2_lj_1 --reset
Then, to train DiffSpeech, run:
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset
Remember to set the "fs2_ckpt" parameter in usr/configs/lj_ds_beta6.yaml to the path of your FastSpeech2 checkpoint.
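Updating "fs2_ckpt" by hand is fine; the sketch below shows one way to automate it. Both helpers are hypothetical. They assume that training saves checkpoints as `model_ckpt_steps_<step>.ckpt` inside the experiment folder, and that "fs2_ckpt" is a top-level, single-line key in the YAML file. Plain text substitution is used instead of a YAML parser so that the rest of the file, including comments, is left untouched.

```python
import re
from pathlib import Path

# Assumed checkpoint naming pattern; adjust it if your files are named differently.
CKPT_PATTERN = re.compile(r"^model_ckpt_steps_(\d+)\.ckpt$")


def latest_checkpoint(exp_dir):
    """Return the checkpoint in `exp_dir` with the highest step number, or None."""
    exp_dir = Path(exp_dir)
    if not exp_dir.is_dir():
        return None
    best_step, best_path = -1, None
    for path in exp_dir.glob("*.ckpt"):
        match = CKPT_PATTERN.match(path.name)
        # Compare step numbers as integers so that 2000 beats 300.
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def set_fs2_ckpt(config_path, ckpt_path):
    """Rewrite the top-level `fs2_ckpt:` line of a YAML config in place."""
    config_path = Path(config_path)
    text = config_path.read_text()
    # A callable replacement stops re from treating backslashes in the path as escapes.
    new_text, count = re.subn(
        r"^fs2_ckpt:.*$", lambda _: f"fs2_ckpt: {ckpt_path}", text, flags=re.MULTILINE
    )
    if count == 0:
        raise ValueError(f"no top-level fs2_ckpt entry in {config_path}")
    config_path.write_text(new_text)
```

Assuming the FastSpeech2 run above writes to `checkpoints/fs2_lj_1`, usage would be `set_fs2_ckpt("usr/configs/lj_ds_beta6.yaml", latest_checkpoint("checkpoints/fs2_lj_1"))`. If the key is only inherited from a base config, the function raises instead of silently doing nothing.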
To run inference with the trained DiffSpeech model, run:
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/lj_ds_beta6.yaml --exp_name lj_ds_beta6_1213 --reset --infer
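The training and inference commands differ only in their config, experiment name, and the trailing `--infer` flag. If you prefer to launch them from Python, the sketch below rebuilds the same command lines with `subprocess`. `build_command` and `run_task` are hypothetical helper names, not part of the repository, and the sketch assumes it is run from the repository root.

```python
import os
import subprocess
import sys


def build_command(config, exp_name, gpu="0", infer=False):
    """Return the argv and environment overrides matching the tasks/run.py commands."""
    argv = [sys.executable, "tasks/run.py", "--config", config,
            "--exp_name", exp_name, "--reset"]
    if infer:
        argv.append("--infer")  # same config and experiment, inference mode
    # CUDA_VISIBLE_DEVICES selects the GPU; PYTHONPATH=. mirrors `export PYTHONPATH=.`.
    env_overrides = {"CUDA_VISIBLE_DEVICES": gpu, "PYTHONPATH": "."}
    return argv, env_overrides


def run_task(config, exp_name, gpu="0", infer=False):
    """Run tasks/run.py and raise CalledProcessError if it exits with an error."""
    argv, env_overrides = build_command(config, exp_name, gpu, infer)
    subprocess.run(argv, env={**os.environ, **env_overrides}, check=True)
```

For example, `run_task("usr/configs/lj_ds_beta6.yaml", "lj_ds_beta6_1213", infer=True)` corresponds to the inference command. Using `sys.executable` keeps the child process on the same Python interpreter (and virtual environment) as the caller.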
We also provide:
- the pre-trained model of DiffSpeech;
- the standalone pre-trained model of FastSpeech 2 used by the shallow diffusion mechanism in DiffSpeech.
Remember to put the pre-trained models in the checkpoints directory.
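After placing the downloaded models, a quick check that each expected folder under `checkpoints` actually holds a checkpoint file can save a failed run. In this sketch the folder names are hypothetical. "fs2_lj_1" and "lj_ds_beta6_1213" reuse the `--exp_name` values from the commands above, "hifigan_vocoder" is a stand-in for the vocoder folder, and the sketch assumes checkpoints use the `.ckpt` extension.

```python
from pathlib import Path

# Hypothetical folder names: replace them with the folders your archives unpack to.
EXPECTED_EXPERIMENTS = ("hifigan_vocoder", "fs2_lj_1", "lj_ds_beta6_1213")


def report_checkpoints(root="checkpoints", experiments=EXPECTED_EXPERIMENTS):
    """Map each expected experiment folder under `root` to the .ckpt files it contains."""
    root = Path(root)
    report = {}
    for name in experiments:
        folder = root / name
        report[name] = sorted(p.name for p in folder.glob("*.ckpt")) if folder.is_dir() else []
    return report


def missing_experiments(report):
    """Return the experiment folders that have no checkpoint file."""
    return [name for name, ckpts in report.items() if not ckpts]
```

`missing_experiments(report_checkpoints())` returns an empty list when every expected folder contains at least one checkpoint.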
Figure: DiffSpeech vs. FastSpeech 2. Along the vertical axis, DiffSpeech: [0-80]; FastSpeech 2: [80-160].
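The [0-80] / [80-160] split is what you get by stacking two 80-bin mel-spectrograms on top of each other. This is a minimal sketch assuming both outputs are NumPy arrays of shape [T, n_mels] with the same number of frames; `stack_for_comparison` is a hypothetical helper, not the code used to draw the figure.

```python
import numpy as np


def stack_for_comparison(mel_diffspeech, mel_fs2):
    """Stack two [T, n_mels] mel-spectrograms into one [2 * n_mels, T] image.

    Rows 0..n_mels-1 hold DiffSpeech and rows n_mels..2*n_mels-1 hold FastSpeech 2,
    which gives the [0-80] / [80-160] split when n_mels = 80.
    """
    if mel_diffspeech.shape != mel_fs2.shape:
        raise ValueError("both mel-spectrograms must have the same [T, n_mels] shape")
    # Concatenate along the mel axis, then transpose so mel bins run vertically.
    return np.concatenate([mel_diffspeech, mel_fs2], axis=1).T
```

When the result is displayed with `matplotlib.pyplot.imshow`, the `origin` argument decides whether row 0 (DiffSpeech) appears at the bottom (`origin="lower"`) or at the top (the default).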