A repository with comprehensive instructions for using the Festvox toolkit for generating emotional speech from text. This was done as a part of a course project for Speech Recognition and Understanding (ECE557/CSE5SRU) at IIIT Delhi during Winter 2020.
The Festvox project is part of the work of Carnegie Mellon University's speech group, aimed at advancing the state of the art in speech synthesis.
We will use Festvox to train our HMM models and build voices.
- Docker
- Audio Files: The audio files to be used for training.
- File with utterances: A file listing the base name of each audio file together with its transcript. The schema is described below.
A pre-configured Docker image was created by mjansche for the Text-to-Speech tutorial at SLTU 2016. We will train our HMM models using this Docker image.
The Docker image can be pulled with
docker pull mjansche/tts-tutorial-sltu2016
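Running the pulled image with your data mounted can be sketched as below; the image name comes from the pull step, while DATA_DIR and the /data mount point are assumptions you should adjust to your own layout (the RUN_DOCKER gate simply keeps the snippet from launching a container unless you ask it to):

```shell
# Image from the pull step above; the host data folder is an assumed path.
IMAGE=mjansche/tts-tutorial-sltu2016
DATA_DIR="$PWD/data"   # folder holding your wav files and txt.done.data

# Set RUN_DOCKER=1 to actually start an interactive container with the
# data folder mounted at /data inside the container.
if [ "${RUN_DOCKER:-0}" = 1 ]; then
  docker run --rm -it -v "$DATA_DIR":/data "$IMAGE" /bin/bash
fi
```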
After pulling the Docker image, we need to set up flite, an open-source, small, fast run-time text-to-speech engine.
To set up flite, run the Docker image, change to the directory /usr/local/src, and run the following commands
git clone https://github.com/festvox/flite.git
cd flite
./configure
make
Training requires PCM-encoded, 16-bit, mono WAV audio files with a sampling rate of 16 kHz. Use ffmpeg to convert the recorded audio files to the correct format by running
ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav
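If you have many recordings, the conversion above can be wrapped in a loop; the raw/ and wav16k/ directory names here are assumptions for illustration, not part of the toolkit:

```shell
# Convert every .mp3 under raw/ to a 16 kHz, 16-bit, mono WAV in wav16k/.
mkdir -p wav16k
for f in raw/*.mp3; do
  [ -e "$f" ] || continue                     # raw/ may be empty or absent
  base=$(basename "$f" .mp3)
  ffmpeg -i "$f" -acodec pcm_s16le -ac 1 -ar 16000 "wav16k/${base}.wav"
done
```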
For training, you need to create a file named txt.done.data containing the base filename and the text of each utterance, e.g.
( audio_0001 "a whole joy was reaping." )
( audio_0002 "but they've gone south." )
( audio_0003 "you should fetch azure mike." )
Caution: There is a space after the opening brace, before the closing brace, and between the file name and the utterance. The utterance must be enclosed in double quotes.
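If your transcripts live in a tab-separated file (base name, then text — an assumed layout, not something the toolkit requires), a small helper can emit the format above:

```shell
# make_done_data: read "basename<TAB>transcript" lines on stdin and
# print Festvox-style ( id "text" ) lines on stdout.
make_done_data() {
  while IFS="$(printf '\t')" read -r base text; do
    printf '( %s "%s" )\n' "$base" "$text"
  done
}

# e.g.: make_done_data < transcript.tsv > etc/txt.done.data
```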
The first step in training the HMM is to prepare the voice directory. After running the Docker image,
cd /usr/local/src/festvox/src/clustergen
mkdir cmu_us_ss
cd cmu_us_ss
$FESTVOXDIR/src/clustergen/setup_cg cmu us ss
Instead of "cmu" and "ss" you can pick any names you want, but keep "us" so that Festival knows to use the US English pronunciation dictionary. For Indic voices, use "indic" instead of "us".
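The directory-preparation commands above can be parameterized; INST, VLANG, and VOX mirror the three arguments to setup_cg, and the guard only runs setup_cg when FESTVOXDIR is defined (as it is inside the tutorial Docker image):

```shell
# The three setup_cg arguments: institution, language, voice name.
INST=cmu; VLANG=us; VOX=ss
VOICE_DIR="/usr/local/src/festvox/src/clustergen/${INST}_${VLANG}_${VOX}"

# Only attempt the setup when FESTVOXDIR is defined (it is inside the
# tutorial Docker image).
if [ -n "${FESTVOXDIR:-}" ]; then
  mkdir -p "$VOICE_DIR"
  cd "$VOICE_DIR"
  "$FESTVOXDIR/src/clustergen/setup_cg" "$INST" "$VLANG" "$VOX"
fi
```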
Assuming you have already prepared the audio files and the list of utterances, copy them into the voice directory:
cp -p WHATEVER/txt.done.data etc/
cp -p WHATEVER/wav/*.wav recording/
Since the recordings might not be as good as they could be, you can power-normalize them:
./bin/get_wavs recording/*.wav
Synthesis builds (especially labeling) also work best when there is only a limited amount of leading and trailing silence. Trim it with
./bin/prune_silence wav/*.wav
Note: If you do not require these preprocessing stages, you can put your wave files directly into wav/
To build the voice automatically, you can use a script that performs feature extraction, builds the models, and generates some example utterances:
./bin/build_cg_rfs_voice
Alternatively, you can run the steps manually. First, build the prompts and label the data:
./bin/do_build build_prompts etc/txt.done.data
./bin/do_build label etc/txt.done.data
./bin/do_clustergen parallel build_utts etc/txt.done.data
./bin/do_clustergen generate_statename
./bin/do_clustergen generate_filters
Then run feature extraction:
./bin/do_clustergen parallel f0_v_sptk
./bin/do_clustergen parallel mcep_sptk
./bin/do_clustergen parallel combine_coeffs_v
Build the models
./bin/traintest etc/txt.done.data
./bin/do_clustergen parallel cluster etc/txt.done.data.train
./bin/do_clustergen dur etc/txt.done.data.train
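The traintest step above partitions etc/txt.done.data into .train and .test files, which the subsequent clustering commands consume. Purely as an illustration of that idea (not the actual script's logic), a split that holds out every tenth utterance looks like:

```shell
# Illustrative only: write every 10th line of a prompt file to FILE.test
# and the rest to FILE.train (./bin/traintest has its own scheme; this
# just shows the shape of the outputs it produces).
split_done_data() {
  awk -v out="$1" '
    NR % 10 == 0 { print > (out ".test"); next }
                 { print > (out ".train") }' "$1"
}
```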
We will use flite to generate audio from the trained model.
rm -rf flite
$FLITEDIR/tools/setup_flite
./bin/build_flite cg
cd flite
make
flite requires a .flitevox voice object to synthesize with the built voice. Create the .flitevox object with
./flite_cmu_us_${NAME} -voicedump output.flitevox
Audio can then be generated for any utterance with
./flite_cmu_us_${NAME} "<sentence to utter>" output.wav
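To synthesize a whole list of sentences, the flite call can be looped. The function below is a sketch in which the binary path, sentences file, and output directory are all arguments you supply (none of these names come from the toolkit), and synthesis is skipped quietly if the binary is not present:

```shell
# synth_all FLITE_BIN SENTS_FILE OUT_DIR
# Reads one sentence per line from SENTS_FILE and writes OUT_DIR/utt_NNNN.wav
# for each, using the given flite voice binary; prints how many lines it saw.
synth_all() {
  flite_bin="$1"; sents="$2"; outdir="$3"
  mkdir -p "$outdir"
  n=0
  while IFS= read -r line; do
    n=$((n + 1))
    if command -v "$flite_bin" >/dev/null 2>&1; then
      "$flite_bin" "$line" "$outdir/$(printf 'utt_%04d.wav' "$n")"
    fi
  done < "$sents"
  echo "$n"
}

# e.g.: synth_all ./flite_cmu_us_ss sentences.txt out_wavs
```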
We also make our system demonstration publicly available in the hmm_wrapper directory. Further details are provided in the directory's README.
We also make the trained models for the different emotions available here. These models can be used for further fine-tuning or for running the system provided in the hmm_wrapper directory.
- Festvox: The Festvox project developed by Carnegie Mellon University.
- Docker: The Festvox-configured Docker image.
- Building Data: The format of the utterance file.
- Training: Steps to train the HMM model.
- Automated Script: Description of the automated script.
For any errors or help in running the project, please open an issue or write to any of the project members:
- Pranav Jain (pranav16255 [at] iiitd [dot] ac [dot] in)
- Srija Anand (srija17199 [at] iiitd [dot] ac [dot] in)
- Eshita (eshita17149 [at] iiitd [dot] ac [dot] in)
- Shruti Singh (shruti17211 [at] iiitd [dot] ac [dot] in)
- Pulkit Madaan (pulkit16257 [at] iiitd [dot] ac [dot] in)
- Aditya Chetan (aditya16217 [at] iiitd [dot] ac [dot] in)
- Brihi Joshi (brihi16142 [at] iiitd [dot] ac [dot] in)