TensorFlow implementation of DeepMind's Tacotron-2, a deep neural network architecture described in this paper: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
For an overview of our progress on this project, please refer to this discussion
Since the two parts of the global model are trained separately, we can start by training the feature prediction model and use its predictions later during the WaveNet training.
First, you need to have Python 3 installed along with TensorFlow v1.6.
Next, you can install the requirements using:
pip install -r requirements.txt
We tested the code on the LJSpeech dataset, which has almost 24 hours of labeled recordings of a single female speaker. (Further information on the dataset is available in the README file that comes with the download.)
After downloading the dataset, extract the compressed file, and place the folder inside the cloned repository.
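Once the dataset is extracted, it can help to read a few transcript rows before preprocessing. The sketch below is not part of this repository. It assumes the LJSpeech `metadata.csv` layout of pipe-separated `id|transcript|normalized transcript` rows, and `parse_metadata` is a hypothetical helper name used only for illustration.

```python
def parse_metadata(lines):
    """Parse LJSpeech-style rows of the form 'id|transcript|normalized transcript'.

    Returns a list of (clip_id, text) tuples. The normalized transcript is
    preferred when present; otherwise the raw transcript is used.
    """
    entries = []
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        clip_id = fields[0]
        text = fields[2] if len(fields) > 2 and fields[2] else fields[1]
        entries.append((clip_id, text))
    return entries


# Illustrative rows only (not copied from the real dataset).
sample_rows = [
    "LJ000-0001|Dr. Smith arrived.|Doctor Smith arrived.\n",
    "LJ000-0002|Hello there.\n",
]
for clip_id, text in parse_metadata(sample_rows):
    print(clip_id, text)
```

To run it on the real file, pass an open file handle instead of `sample_rows`, for example `open(path, encoding="utf-8")`, where `path` points to the `metadata.csv` inside the extracted dataset folder.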
From this point on, you need to be inside the "tacotron" folder:
cd tacotron
Preprocessing can then be started using:
python preprocess.py
This should take a few minutes.
The feature prediction model can be trained using:
python train.py
Checkpoints are saved every 100 steps and stored under the logs-<model_name> folder.
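To see how far training has progressed, you can look for the highest step number among the saved checkpoint files. The sketch below assumes TensorFlow 1.x style file names such as `model.ckpt-100.index`. The `model.ckpt` prefix is an assumption and may differ from what `train.py` actually writes, and `latest_checkpoint_step` is a hypothetical helper.

```python
import re


def latest_checkpoint_step(filenames, prefix="model.ckpt"):
    """Return the highest step found in names like '<prefix>-<step>.index'.

    Returns None when no matching checkpoint file is present.
    """
    pattern = re.compile(re.escape(prefix) + r"-(\d+)\.index$")
    steps = []
    for name in filenames:
        match = pattern.search(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


# Example listing; in practice this would come from os.listdir() on the logs folder.
files = ["checkpoint", "model.ckpt-100.index", "model.ckpt-200.index", "model.ckpt-200.meta"]
print(latest_checkpoint_step(files))
```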
There are three types of mel spectrogram synthesis with this model:
- Evaluation (synthesis on custom sentences). This is what we'll usually use once we have a full end-to-end model.
python synthesize.py --mode='eval'
- Natural synthesis (the model makes predictions on its own, feeding the last decoder output to the next time step).
python synthesize.py --GTA=False
- Ground Truth Aligned synthesis (DEFAULT: the model is assisted by the true labels in a teacher-forcing manner). This synthesis method is used to predict the mel spectrograms that train the WaveNet vocoder, which yields better results, as stated in the paper.
python synthesize.py
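The difference between natural and Ground Truth Aligned synthesis can be illustrated with a toy autoregressive loop. This is a conceptual sketch only, not the decoder used in this repository: `decode` and `toy_step` are hypothetical, and frames are single numbers instead of mel spectrogram frames.

```python
def decode(step_fn, first_input, num_steps, targets=None):
    """Run a toy autoregressive decoder for num_steps steps.

    With targets (GTA / teacher forcing), the next step receives the
    ground-truth frame. Without targets (natural synthesis), the next
    step receives the model's own previous output.
    """
    outputs = []
    prev = first_input
    for t in range(num_steps):
        out = step_fn(prev)
        outputs.append(out)
        prev = targets[t] if targets is not None else out
    return outputs


def toy_step(prev_frame):
    # Hypothetical imperfect model: the true sequence grows by 1.0 per step,
    # but this "model" overshoots by 0.5 each time.
    return prev_frame + 1.5


targets = [1.0, 2.0, 3.0, 4.0]
natural = decode(toy_step, 0.0, 4)              # errors accumulate
gta = decode(toy_step, 0.0, 4, targets=targets)  # error stays bounded
print(natural)  # [1.5, 3.0, 4.5, 6.0]
print(gta)      # [1.5, 2.5, 3.5, 4.5]
```

In the free-running case each error is fed back into the next step, so the output drifts away from the targets. With teacher forcing the outputs stay aligned with the ground truth, which is the property that matters when the predicted spectrograms are paired with real audio to train the vocoder.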
Due to some constraints, we won't be able to provide a pretrained feature prediction model at the moment. The WaveNet vocoder model, however, is still in development. In the meantime, if someone can train a feature prediction model, we will gladly add it to the repository.
Due to the large size of the audio samples in the dataset, we advise you to reduce the batch size to 32 or lower (depending on your GPU load). Please keep in mind that this will slow down the training process.
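As a rough illustration of the trade-off, a smaller batch size means more optimizer steps are needed to pass over the dataset once. The numbers below are assumptions: 13,100 is the clip count of the raw LJSpeech dataset (preprocessing may keep fewer), and the starting batch size of 64 is hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of training steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)


num_clips = 13100  # assumed: raw LJSpeech clip count
for batch_size in (64, 32, 16):
    print(batch_size, steps_per_epoch(num_clips, batch_size))
# 64 -> 205 steps, 32 -> 410 steps, 16 -> 819 steps
```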
- Original TensorFlow Tacotron implementation
- Original Tacotron paper
- Attention-Based Models for Speech Recognition
- Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
- r9y9/wavenet_vocoder
** Work in progress; further information will be added **
** This work is independent of DeepMind **