GitHub Link: https://github.com/TonnyTran/ISCAP_Height_Estimation
- Install Kaldi
git clone -b 5.4 https://github.com/kaldi-asr/kaldi.git kaldi
cd kaldi/tools/;
# Run this next line to check for dependencies, and then install them
extras/check_dependencies.sh
make; cd ../src; ./configure; make depend; make
- Install EspNet
git clone -b v.0.9.7 https://github.com/espnet/espnet.git
cd espnet/tools/ # change to tools folder
ln -s {kaldi_root} # Create link to Kaldi. e.g. ln -s /home/theanhtran/kaldi/
- Set up Conda environment
./setup_anaconda.sh anaconda espnet 3.7.9 # Create an anaconda environment named espnet with Python 3.7.9
make TH_VERSION=1.8.0 CUDA_VERSION=10.2 # Install Pytorch and CUDA
. ./activate_python.sh; python3 check_install.py # Check the installation
conda install torchvision==0.9.0 torchaudio==0.8.0 -c pytorch
- Install Pytorch Lightning
conda install pytorch-lightning -c conda-forge
- Install ffmpeg and openpyxl
conda install ffmpeg
conda install openpyxl
- Clone the project from GitHub into your workspace
git clone https://github.com/TonnyTran/ISCAP_Height_Estimation.git
cd ISCAP_Height_Estimation
ln -s {kaldi_root}/egs/wsj/s5/utils # e.g. ln -s /home/theanhtran/kaldi/egs/wsj/s5/utils
ln -s {kaldi_root}/egs/wsj/s5/steps # e.g. ln -s /home/theanhtran/kaldi/egs/wsj/s5/steps
- Point to your espnet
Open the ISCAP_Height_Estimation/path.sh file and change MAIN_ROOT to your espnet directory, e.g. MAIN_ROOT=/home/theanhtran/espnet
- Data preparation step
bash prepare_TIMIT_data.sh # prepare wideband data (16kHz)
bash prepare_data_narrowband.sh # prepare narrowband data (8kHz)
This step downloads the TIMIT dataset .zip file, extracts it, and then generates features in Kaldi format.
- Run the program: train and test
bash run_height_estimation.sh $program $running $band $gender_input
# $program in {1, 2} indicates which program you want to run
# $running in {TRAINING, TESTING}
# $band in {wideband, narrowband} indicates which data set is used (wideband - 16kHz data; narrowband - 8kHz data)
# $gender_input in {withgender, nogender} -> with or without gender as an input
e.g. bash run_height_estimation.sh 1 TRAINING wideband nogender # train the model on wideband data without gender input
e.g. bash run_height_estimation.sh 1 TESTING narrowband withgender # test the pretrained model on narrowband data with gender as an input
- program=1 => Model 1: LSTM + Cross_Attention + MAE_Loss | FBank Features | MultiTask Estimation (both age & height)
- program=2 => Model 2: LSTM + Cross_Attention + Triplet & MSE_Loss | FBank Features | Height Estimation
We can monitor the training and testing progress.
- You may get a "USER: Unbound variable" error when running in Google Colab or Docker; please see the instructions in the prepare_TIMIT_data.sh file, lines 13 to 20.
- You may change hyper-parameters such as batch_size, max_epochs, early_stopping_patience, learning_rate, num_layers, loss_criterion, etc. in the run.py file of any model.
- Please note that if you are not using a GPU for processing, change the gpu hyper-parameter of the trainer function (in the run.py files) to 0 (see the sketch below).
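As a rough illustration of where these hyper-parameters live, a PyTorch Lightning Trainer configured this way might look like the sketch below; apart from gpus=0 for CPU-only runs and the patience of 10 used by the models, the argument values (and the "val_loss" metric name) are assumptions for illustration, and the actual run.py may be organized differently.

# Minimal PyTorch Lightning sketch (hypothetical values; the real run.py may differ)
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

trainer = pl.Trainer(
    gpus=0,        # 0 = CPU-only; set to 1 to train on a single GPU
    max_epochs=50, # illustrative value
    callbacks=[EarlyStopping(monitor="val_loss", patience=10)],  # metric name is an assumption
)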
This document compiles a summary of all the models for height estimation using the TIMIT dataset.
We predominantly use the following feature extraction for these models:
- Filter Bank: 80 FBank + 3 Pitch + 1 Binary_Gender (Features_Dimension: 83)
Moreover, we use 3 data augmentations for our data (a small sketch follows this list):
- CMVN: Cepstral mean and variance normalization for FBank features
- Speed Perturbation: Triple the training data using 0.9x and 1.1x speed perturbed data.
- Spectral Augmentation: SpecAugment to randomly mask 15%-25% for better generalization and robustness.
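As a rough, self-contained sketch of these steps (the real pipeline uses Kaldi/ESPnet scripts; the shapes, random data, and frame-level gender flag below are purely illustrative, and speed perturbation is not shown), the 83-dimensional features, CMVN, and SpecAugment-style time masking can be pictured as:

# Illustrative sketch only; not the actual Kaldi/ESPnet feature pipeline
import numpy as np

rng = np.random.default_rng(0)

# Dummy features: T frames of 80 FBank + 3 pitch dims, plus a binary gender flag
T = 200
fbank = rng.random((T, 80))
pitch = rng.random((T, 3))
gender = np.ones((T, 1))                                # 1.0 or 0.0 per speaker
feats = np.concatenate([fbank, pitch, gender], axis=1)  # shape (T, 83)

# CMVN: per-dimension mean and variance normalization
feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

# SpecAugment-style time masking: zero out roughly 15%-25% of the frames
mask_frac = rng.uniform(0.15, 0.25)
n_mask = int(mask_frac * T)
start = int(rng.integers(0, T - n_mask))
feats[start:start + n_mask, :] = 0.0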
- Model: LSTM + Cross_Attention + MSE_Loss | FBank Features | MultiTask Estimation (both age & height)
- Model Description: The model uses FBank features and a standard LSTM + Cross_Attention + Dense Layer, and is trained with a Mean Squared Error (MSE) loss and the Adam optimizer for both age and height estimation, with height_loss given twice the weight compared to age_loss (a minimal loss sketch follows this model's description). We use a patience of 10 epochs before early stopping the model based on validation loss. Finally, MSE and MAE metrics are used to gauge the performance of the model on the test_set for height estimation. The batch_size used is 32 samples.
- Model Architecture:
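A minimal sketch of the multitask loss described above, assuming a plain 2:1 weighting between the height and age terms (the tensor names and the helper function are hypothetical, not taken from run.py):

import torch
import torch.nn as nn

mse = nn.MSELoss()

def multitask_loss(height_pred, height_true, age_pred, age_true):
    # height_loss is given twice the weight of age_loss, as described above
    height_loss = mse(height_pred, height_true)
    age_loss = mse(age_pred, age_true)
    return 2.0 * height_loss + age_loss

# Example usage with dummy tensors (batch of 32)
h_pred, h_true = torch.randn(32), torch.randn(32)
a_pred, a_true = torch.randn(32), torch.randn(32)
loss = multitask_loss(h_pred, h_true, a_pred, a_true)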
- Model: LSTM + Cross_Attention + Triplet & MSE_Loss | FBank Features | Height Estimation
- Model Description: The model uses FBank features for height estimation only, using a standard LSTM + Cross_Attention + Dense Layer, and is trained with a Mean Squared Error (MSE) loss combined with a Triplet Loss that is used to train the embeddings obtained right after the cross_attention layer. Triplet loss is given one-third of the weight in the total loss while MSE is given two-thirds; Adam is used as the optimizer. The height labels are quantized into groups of 5 cm for the Triplet Loss (i.e. height labels from 140-145 cm in class_0, 145-150 cm in class_1 and so on, giving a total of 13 classes). We use a patience of 10 epochs before early stopping the model based on validation loss. Finally, MSE and MAE metrics are used to gauge the performance of the model on the test_set for height estimation. The batch_size used is 32 samples (a minimal sketch of the combined loss and height quantization follows below).
- Model Architecture:
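A minimal sketch of the height quantization and the combined Triplet + MSE loss described above; how anchor/positive/negative triplets are mined from the 5 cm classes, the margin value, the embedding dimension, and all tensor names are assumptions for illustration:

import torch
import torch.nn as nn

mse = nn.MSELoss()
triplet = nn.TripletMarginLoss(margin=1.0)  # margin value is an assumption

def quantize_height(height_cm):
    # 5 cm bins starting at 140 cm: 140-145 -> class 0, 145-150 -> class 1, ... (13 classes)
    return int((height_cm - 140.0) // 5)

def combined_loss(height_pred, height_true, anchor, positive, negative):
    # MSE gets two-thirds of the total weight, Triplet loss one-third
    return (2.0 / 3.0) * mse(height_pred, height_true) + (1.0 / 3.0) * triplet(anchor, positive, negative)

# Example usage with dummy predictions and 128-dim embeddings (dimension is illustrative)
emb = lambda: torch.randn(32, 128)
loss = combined_loss(torch.randn(32), torch.randn(32), emb(), emb(), emb())
print(quantize_height(172.0))  # -> class 6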