
HMM-based Emotional Text-to-speech

A repository with comprehensive instructions for using the Festvox toolkit for generating emotional speech from text. This was done as a part of a course project for Speech Recognition and Understanding (ECE557/CSE5SRU) at IIIT Delhi during Winter 2020.

demo



Training your own HMM models

The Festvox project is part of the work at Carnegie Mellon University's speech group aimed at advancing the state of speech synthesis.

We will be using Festvox to train our HMM models and build voices.

Requirements

  • Docker
  • Audio Files: The audio files to be used for training.
  • File with utterances: A file listing each audio file and its transcript. The schema is described below.

Setup

Docker Image

A preconfigured Docker image was created by mjansche for the Text-to-Speech tutorial at SLTU 2016. We will train our HMM models using this image.

The Docker image can be pulled with

docker pull mjansche/tts-tutorial-sltu2016

After pulling the Docker image, we need to set up flite, an open-source, small, fast run-time text-to-speech engine. To set up flite, run the Docker image and, once in the directory /usr/local/src, run the following commands

git clone https://github.com/festvox/flite.git
cd flite
./configure
make

Audio Files

The training requires PCM-encoded, 16-bit, mono WAV audio files with a sampling rate of 16 kHz. Use ffmpeg to convert the recorded audio files to the correct format by running the following

ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav
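If you have a directory of recordings, the same conversion can be applied in a loop. A minimal sketch, assuming the sources are .mp3 files in the current directory (the filenames here are hypothetical):

```shell
# Convert every .mp3 in the current directory to a PCM 16-bit, mono,
# 16 kHz WAV file; output names mirror input names with a .wav extension.
for f in *.mp3; do
    ffmpeg -i "$f" -acodec pcm_s16le -ac 1 -ar 16000 "${f%.mp3}.wav"
done
```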

File with Utterances

For training, you need to create a file named txt.done.data containing the base filename and text of each utterance, e.g.

( audio_0001 "a whole joy was reaping." )
( audio_0002 "but they've gone south." )
( audio_0003 "you should fetch azure mike." )

Caution: There is a space after the opening parenthesis, before the closing parenthesis, and between the file name and the utterance. The utterance must be in double quotes.
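If you already have a plain transcript file, txt.done.data can be generated rather than written by hand. A minimal sketch, assuming a hypothetical tab-separated file transcript.tsv with the base filename in the first column and the utterance text in the second:

```shell
# Emit the Festvox txt.done.data format, including the required spaces
# around the parentheses and the double quotes around the utterance.
awk -F '\t' '{ printf "( %s \"%s\" )\n", $1, $2 }' transcript.tsv > txt.done.data
```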

Training

Preparing the Directory

The first step in training the HMMs is to prepare the directory. After running the Docker image,

cd /usr/local/src/festvox/src/clustergen
mkdir cmu_us_ss
cd cmu_us_ss
$FESTVOXDIR/src/clustergen/setup_cg cmu us ss

Instead of "cmu" and "ss" you can pick any names you want, but keep "us" so that Festival knows to use the US English pronunciation dictionary. For Indic voices, use "indic" instead of "us".

Synthesis of Audio Files

Assuming that you have already prepared the audio files and the list of utterances,

cp -p WHATEVER/txt.done.data etc/
cp -p WHATEVER/wav/*.wav recording/

Since the recordings might not be as good as they could be, you can power-normalize them.

./bin/get_wavs recording/*.wav

Synthesis builds (especially labeling) also work best when there is only a limited amount of leading and trailing silence. We can trim it with

./bin/prune_silence wav/*.wav

Note: If you do not require these three stages, you can put your WAV files directly into wav/

Building Voices

For building voices, you can use an automated script that performs feature extraction, builds the models, and generates some example utterances.

./bin/build_cg_rfs_voice

Manual build

First, build the prompts and label the data.

./bin/do_build build_prompts etc/txt.done.data
./bin/do_build label etc/txt.done.data
./bin/do_clustergen parallel build_utts etc/txt.done.data
./bin/do_clustergen generate_statename
./bin/do_clustergen generate_filters

Then run feature extraction

./bin/do_clustergen parallel f0_v_sptk
./bin/do_clustergen parallel mcep_sptk
./bin/do_clustergen parallel combine_coeffs_v

Build the models

./bin/traintest etc/txt.done.data
./bin/do_clustergen parallel cluster etc/txt.done.data.train
./bin/do_clustergen dur etc/txt.done.data.train

Generating Voices

We will use flite to generate audio from the trained model.

rm -rf flite
$FLITEDIR/tools/setup_flite
./bin/build_flite cg
cd flite
make

flite requires a .flitevox object to build the voices. Create the .flitevox object with

./flite_cmu_us_${NAME} -voicedump output.flitevox

Then audio can be easily generated for any utterance by

./flite_cmu_us_${NAME} "<sentence to utter>" output.wav
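To synthesize a batch of sentences, the same command can be looped over a text file. A sketch, assuming a hypothetical sentences.txt with one sentence per line and ${NAME} set to the voice name chosen earlier:

```shell
# Synthesize each line of sentences.txt into a numbered WAV file.
i=0
while IFS= read -r line; do
    i=$((i + 1))
    ./flite_cmu_us_${NAME} "$line" "out_$(printf '%04d' "$i").wav"
done < sentences.txt
```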

Demonstration

We also make our system demonstration publicly available in the hmm_wrapper directory. Further details are provided in that directory's README.

Trained Models

We also make the trained models for the different emotions available here.

These models can be used for further fine-tuning or for running the system provided in the hmm_wrapper directory.

References

  • Festvox: the Festvox project developed by Carnegie Mellon University.
  • Docker: the Festvox-configured Docker image.
  • Building Data: the format for the utterance file.
  • Training: steps to train the HMM model.
  • Automated Script: description of the automated build script.


Contact

For any errors or help in running the project, please open an issue or write to any of the project members:

  • Pranav Jain (pranav16255 [at] iiitd [dot] ac [dot] in)
  • Srija Anand (srija17199 [at] iiitd [dot] ac [dot] in)
  • Eshita (eshita17149 [at] iiitd [dot] ac [dot] in)
  • Shruti Singh (shruti17211 [at] iiitd [dot] ac [dot] in)
  • Pulkit Madaan (pulkit16257 [at] iiitd [dot] ac [dot] in)
  • Aditya Chetan (aditya16217 [at] iiitd [dot] ac [dot] in)
  • Brihi Joshi (brihi16142 [at] iiitd [dot] ac [dot] in)
