ISCAP - Identifying Speaker Characteristics through Audio Profiling - HEIGHT ESTIMATION

GitHub Link:


Setting up environment

  1. Install Kaldi
git clone -b 5.4 kaldi
cd kaldi/tools/; 
# Run this next line to check for dependencies, and then install them
make; cd ../src; ./configure; make depend; make
  1. Install EspNet
git clone -b v.0.9.7
cd espnet/tools/        # change to tools folder
ln -s {kaldi_root}      # Create link to Kaldi. e.g. ln -s /home/theanhtran/kaldi/
  1. Set up Conda environment
./ anaconda espnet 3.7.9   # Create an Anaconda environment named espnet with Python 3.7.9
make TH_VERSION=1.8.0 CUDA_VERSION=10.2     # Install PyTorch and CUDA
. ./; python3  # Check the installation
conda install torchvision==0.9.0 torchaudio==0.8.0 -c pytorch
  1. Install PyTorch Lightning
conda install pytorch-lightning -c conda-forge
  1. Install ffmpeg and openpyxl
conda install ffmpeg
conda install openpyxl
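
Once the steps above have finished, a quick check that the packages are visible from the espnet environment can save a failed run later. The following is a minimal sketch, not part of the project: the function name check_environment is hypothetical, and it only tests whether each module can be located, not which version is installed. It avoids importlib.metadata because that module is not available in Python 3.7.

```python
import importlib.util
import shutil

# Import names of the packages installed in the steps above.
REQUIRED_MODULES = ["torch", "torchvision", "torchaudio", "pytorch_lightning", "openpyxl"]


def check_environment(modules=REQUIRED_MODULES):
    """Return {name: bool} saying whether each module or tool can be found."""
    status = {name: importlib.util.find_spec(name) is not None for name in modules}
    # ffmpeg is a command-line tool, so look for it on PATH instead.
    status["ffmpeg"] = shutil.which("ffmpeg") is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:20s} {'found' if found else 'MISSING'}")
```

Run it with the espnet environment activated; any line marked MISSING points to an install step that needs to be repeated.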

Download the project

  1. Clone the project from GitHub into your workspace
git clone
cd ISCAP_Height_Estimation
ln -s {kaldi_root}/egs/wsj/s5/utils     # e.g. ln -s /home/theanhtran/kaldi/egs/wsj/s5/utils
ln -s {kaldi_root}/egs/wsj/s5/steps     # e.g. ln -s /home/theanhtran/kaldi/egs/wsj/s5/steps 
  1. Point to your espnet

Open the ISCAP_Height_Estimation/ file and change MAIN_ROOT to your espnet directory, e.g. MAIN_ROOT=/home/theanhtran/espnet

How to run Height Estimation systems

  1. Data preparation step
bash      # prepare wideband data (16kHz)
bash      # prepare narrowband data (8kHz)

This step downloads the TIMIT dataset as a .zip file, extracts it, and then generates features in Kaldi format.

  1. Run the program: train and test
bash program running band gender_input
# $program in {1, 2} indicates which program you want to run
# $running in {TRAINING, TESTING}
# $band in {wideband, narrowband} indicates which data set is used (wideband - 16kHz data; narrowband - 8kHz data)
# $gender_input in {withgender, nogender} -> with or without gender as an input
e.g. bash 1 TRAINING wideband nogender # train the model on wideband data and without gender input
e.g. bash 1 TESTING narrowband withgender # test the pretrained model on narrowband data and with gender as an input
  • program=1 => Model 1: LSTM + Cross_Attention + MSE_Loss | FBank Features | MultiTask Estimation (both age & height)
  • program=2 => Model 2: LSTM + Cross_Attention + Triplet & MSE_Loss | FBank Features | Height Estimation
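
When launching several runs from a script, it can help to validate the four arguments before calling bash. The sketch below is an illustration only: build_command is a hypothetical helper, and the run script path is passed in by the caller because its name is not assumed here. The allowed values are the ones listed above.

```python
VALID_PROGRAMS = {"1", "2"}
VALID_RUNNING = {"TRAINING", "TESTING"}
VALID_BANDS = {"wideband", "narrowband"}
VALID_GENDER = {"withgender", "nogender"}


def build_command(script, program, running, band, gender_input):
    """Validate the four arguments and return the argv list for bash.

    `script` is the path to the run script, supplied by the caller.
    """
    program = str(program)
    checks = [
        ("program", program, VALID_PROGRAMS),
        ("running", running, VALID_RUNNING),
        ("band", band, VALID_BANDS),
        ("gender_input", gender_input, VALID_GENDER),
    ]
    for name, value, allowed in checks:
        if value not in allowed:
            raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")
    return ["bash", script, program, running, band, gender_input]
```

The returned list can be handed to subprocess.run, so a misspelled band or mode is caught before any training starts.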

We can monitor training and testing

Other instructions:

  • You may get a "USER: Unbound variable issue" error when running in Google Colab or Docker; please see the instructions in file, lines 13 to 20.
  • You may change hyper-parameters such as batch_size, max_epochs, early_stopping_patience, learning_rate, num_layers, loss_criterion, etc. in the file of any model.
  • If you are not using a GPU for processing, change the gpu hyper-parameter in the trainer function (in the files) to 0.
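
To make the early_stopping_patience hyper-parameter concrete, here is a minimal sketch of patience-based early stopping on the validation loss. It illustrates the logic only and is not the project's code; the class name EarlyStopper and the min_delta parameter are hypothetical. PyTorch Lightning ships its own EarlyStopping callback, which would normally be used in place of a hand-written class like this.

```python
class EarlyStopper:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # hypothetical: smallest drop counted as an improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Calling step once per epoch with the validation loss is enough; a larger patience tolerates longer plateaus at the cost of extra training time.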

Models & Results:

This document summarizes all the models for height estimation on the TIMIT dataset.
We predominantly use the following feature extraction for these models:

  • Filter Bank: 80 FBank + 3 Pitch (Features_Dimension: 83), plus 1 Binary_Gender when gender is used as an input (84 in total)
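
Counting the dimensions listed above, 80 + 3 gives 83 values per frame without the gender flag and 84 with it. The sketch below shows one way a frame could be assembled. It rests on assumptions: that the gender flag is appended as the last value of each frame, and that it is encoded as 0 or 1. The function name build_frame is hypothetical.

```python
FBANK_DIM = 80
PITCH_DIM = 3


def build_frame(fbank, pitch, gender=None):
    """Concatenate one frame's features; gender is 0/1, or None for nogender."""
    if len(fbank) != FBANK_DIM or len(pitch) != PITCH_DIM:
        raise ValueError("expected 80 FBank and 3 pitch values")
    frame = list(fbank) + list(pitch)
    if gender is not None:
        # Assumption: the binary gender flag is appended as the final value.
        frame.append(float(gender))
    return frame
```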

Moreover, we apply three normalization and augmentation steps to our data:

  • CMVN: Cepstral mean and variance normalization for FBank features
  • Speed Perturbation: Triple the training data using 0.9x and 1.1x speed perturbed data.
  • Spectral Augmentation: SpecAugment to randomly mask 15%-25% for better generalization and robustness.
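
The three steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Kaldi or ESPnet implementation: CMVN is computed over a single utterance, the mask is one contiguous block along the time axis covering 15%-25% of the frames, and the "sp0.9-" style utterance prefixes are an illustrative naming choice. All function names are hypothetical.

```python
import random


def cmvn(frames):
    """Normalize each feature dimension to zero mean and unit variance."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        stds.append(max(var ** 0.5, 1e-8))  # guard against constant dimensions
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


def speed_perturbed_ids(utt_id, factors=(0.9, 1.0, 1.1)):
    """List the three copies of an utterance that triple the training set."""
    return [f"sp{factor}-{utt_id}" for factor in factors]


def time_mask(frames, min_ratio=0.15, max_ratio=0.25, rng=random):
    """Zero out one random contiguous block covering 15%-25% of the frames."""
    n = len(frames)
    width = int(n * rng.uniform(min_ratio, max_ratio))
    start = rng.randint(0, n - width)
    return [
        [0.0] * len(f) if start <= t < start + width else list(f)
        for t, f in enumerate(frames)
    ]
```

Masking is typically applied only during training, so the test features stay untouched.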


  • Model: LSTM + Cross_Attention + MSE_Loss | FBank Features | MultiTask Estimation (both age & height)
  • Model Description: The model uses FBank features with a standard LSTM + Cross_Attention + Dense Layer and is trained using a Mean Squared Error (MSE) loss and the Adam optimizer for both age and height estimation, with height_loss given twice the weight of age_loss. We use a patience of 10 epochs before early stopping the model based on Validation Loss. Finally, MSE and MAE metrics are used to gauge the performance of the model on the test_set for height estimation. The batch_size used is 32 samples.
  • Model Architecture:
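
The loss weighting described for this multitask model can be sketched as below. It is an assumption that the two terms are combined as an unnormalized weighted sum with weights 2 and 1; the description only states that height_loss counts twice as much as age_loss, so weights of 2/3 and 1/3 would be equally consistent with it. The function names are hypothetical.

```python
def mse(predictions, targets):
    """Mean squared error over a batch of scalar predictions."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def multitask_loss(height_pred, height_true, age_pred, age_true,
                   height_weight=2.0, age_weight=1.0):
    """Weighted sum of the two MSE terms; height counts twice as much as age."""
    height_loss = mse(height_pred, height_true)
    age_loss = mse(age_pred, age_true)
    return height_weight * height_loss + age_weight * age_loss
```

For example, a height error of 2 cm and an age error of 3 years on a single sample give 2 * 4 + 1 * 9 = 17 under these weights.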


  • Model: LSTM + Cross_Attention + Triplet & MSE_Loss | FBank Features | Height Estimation

  • Model Description: The model uses FBank features for height estimation only, using a standard LSTM + Cross_Attention + Dense Layer, and is trained using a Mean Squared Error (MSE) loss combined with a Triplet Loss, which is used to train the embeddings obtained right after the cross_attention layer. Triplet loss is given one-third of the weight in the total loss, while MSE is given two-thirds. Adam is used as the optimizer. The height labels are quantized and classified into groups of 5 cm for Triplet Loss (i.e. height labels from 140-145cm in class_0, 145-150cm in class_1 and so on, giving us a total of 13 classes). We use a patience of 10 epochs before early stopping the model based on Validation Loss. Finally, MSE and MAE metrics are used to gauge the performance of the model on the test_set for height estimation. The batch_size used is 32 samples.

  • Model Architecture:
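
The height quantization and the combined loss described for this model can be sketched as follows. Several details are assumptions made for illustration: a boundary value such as 145 cm is placed in the upper class, out-of-range heights are clamped to the first or last class, the triplet loss uses Euclidean distance, and the margin of 1.0 is arbitrary. The function names are hypothetical.

```python
MIN_HEIGHT_CM = 140
BIN_WIDTH_CM = 5
NUM_CLASSES = 13  # 13 bins of 5 cm cover 140 cm to 205 cm


def height_class(height_cm):
    """Map a height in cm to its 5 cm class index (140-145 -> 0, 145-150 -> 1, ...)."""
    index = int((height_cm - MIN_HEIGHT_CM) // BIN_WIDTH_CM)
    # Clamp so out-of-range labels fall into the first or last class.
    return min(max(index, 0), NUM_CLASSES - 1)


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: pull same-class embeddings together, push others apart."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def total_loss(mse_value, triplet_value):
    """Combine the two terms with two-thirds weight on MSE and one-third on triplet."""
    return (2.0 / 3.0) * mse_value + (1.0 / 3.0) * triplet_value
```

Here the positive sample would share the anchor's height_class and the negative would come from a different class, so the class labels are only used to pick triplets while the MSE term still regresses the exact height.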

