GitHub Link: https://github.com/TonnyTran/ISCAP_Height_Estimation
- Install Kaldi
git clone -b 5.4 https://github.com/kaldi-asr/kaldi.git kaldi
cd kaldi/tools/;
# Run this next line to check for dependencies, and then install them
extras/check_dependencies.sh
make; cd ../src; ./configure; make depend; make
- Install EspNet
git clone -b v.0.9.7 https://github.com/espnet/espnet.git
cd espnet/tools/ # change to tools folder
ln -s {kaldi_root} # Create link to Kaldi. e.g. ln -s /home/theanhtran/kaldi/
- Set up Conda environment
./setup_anaconda.sh anaconda espnet 3.7.9 # Create an anaconda environment named espnet with Python 3.7.9
make TH_VERSION=1.8.0 CUDA_VERSION=10.2 # Install Pytorch and CUDA
. ./activate_python.sh; python3 check_install.py # Check the installation
conda install torchvision==0.9.0 torchaudio==0.8.0 -c pytorch
- Install Pytorch Lightning
conda install pytorch-lightning -c conda-forge
- Install ffmpeg and openpyxl
conda install ffmpeg
conda install openpyxl
- Clone the project from GitHub into your workspace
git clone https://github.com/TonnyTran/ISCAP_Height_Estimation.git
cd ISCAP_Height_Estimation
ln -s {kaldi_root}/egs/wsj/s5/utils # e.g. ln -s /home/theanhtran/kaldi/egs/wsj/s5/utils
ln -s {kaldi_root}/egs/wsj/s5/steps # e.g. ln -s /home/theanhtran/kaldi/egs/wsj/s5/steps
- Point to your espnet
Open the ISCAP_Height_Estimation/path.sh file and change MAIN_ROOT to your espnet directory, e.g. MAIN_ROOT=/home/theanhtran/espnet
- Data preparation step
bash prepare_TIMIT_data.sh # prepare wideband data (16kHz)
bash prepare_data_narrowband.sh # prepare narrowband data (8kHz)
This step downloads the TIMIT dataset .zip file, extracts it, and then generates features in Kaldi format.
- Run the program: train and test
bash run_height_estimation.sh $program $running $band $gender_input
# $program in {1, 2} indicates which program you want to run
# $running in {TRAINING, TESTING}
# $band in {wideband, narrowband} indicates which data set is used (wideband - 16kHz data; narrowband - 8kHz data)
# $gender_input in {withgender, nogender} -> with or without gender as an input
e.g. bash run_height_estimation.sh 1 TRAINING wideband nogender # train the model on wideband data without gender input
e.g. bash run_height_estimation.sh 1 TESTING narrowband withgender # test the pretrained model on narrowband data with gender as an input
- program=1 => Model 1: LSTM + Cross_Attention + MAE_Loss | FBank Features | MultiTask Estimation (both age & height)
- program=2 => Model 2: LSTM + Cross_Attention + Triplet & MSE_Loss | FBank Features | Height Estimation
We can monitor the training and testing progress.
- You may get a "USER: Unbound variable" error when running in Google Colab or Docker; please see the instructions in the prepare_TIMIT_data.sh file, lines 13 to 20.
- You may change hyper-parameters such as batch_size, max_epochs, early_stopping_patience, learning_rate, num_layers, loss_criterion, etc. in the run.py file of any model.
- Please note that if you are not using a GPU for processing, change the gpu hyper-parameter of the trainer function (in the run.py files) to 0 (see the sketch below).
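As a rough illustration of where these hyper-parameters live, a PyTorch Lightning Trainer configured this way might look like the sketch below; apart from gpus=0 for CPU-only runs and the patience of 10 used by the models, the argument values (and the "val_loss" metric name) are assumptions for illustration, and the actual run.py may be organized differently.

# Minimal PyTorch Lightning sketch (hypothetical values; the real run.py may differ)
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

trainer = pl.Trainer(
    gpus=0,        # 0 = CPU-only; set to 1 to train on a single GPU
    max_epochs=50, # illustrative value
    callbacks=[EarlyStopping(monitor="val_loss", patience=10)],  # metric name is an assumption
)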
This document compiles a summary of all the models for height estimation using the TIMIT dataset.
We predominantly use the following feature extraction for these models:
- Filter Bank: 80 FBank + 3 Pitch + 1 Binary_Gender (Features_Dimension: 83)
Moreover, we use 3 data augmentations for our data (a small sketch follows this list):
- CMVN: Cepstral mean and variance normalization for FBank features
- Speed Perturbation: Triple the training data using 0.9x and 1.1x speed perturbed data.
- Spectral Augmentation: SpecAugment to randomly mask 15%-25% for better generalization and robustness.
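As a rough, self-contained sketch of these steps (the real pipeline uses Kaldi/ESPnet scripts; the shapes, random data, and frame-level gender flag below are purely illustrative, and speed perturbation is not shown), the 83-dimensional features, CMVN, and SpecAugment-style time masking can be pictured as:

# Illustrative sketch only; not the actual Kaldi/ESPnet feature pipeline
import numpy as np

rng = np.random.default_rng(0)

# Dummy features: T frames of 80 FBank + 3 pitch dims, plus a binary gender flag
T = 200
fbank = rng.random((T, 80))
pitch = rng.random((T, 3))
gender = np.ones((T, 1))                                # 1.0 or 0.0 per speaker
feats = np.concatenate([fbank, pitch, gender], axis=1)  # shape (T, 83)

# CMVN: per-dimension mean and variance normalization
feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

# SpecAugment-style time masking: zero out roughly 15%-25% of the frames
mask_frac = rng.uniform(0.15, 0.25)
n_mask = int(mask_frac * T)
start = int(rng.integers(0, T - n_mask))
feats[start:start + n_mask, :] = 0.0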
- Model: LSTM + Cross_Attention + MSE_Loss | FBank Features | MultiTask Estimation (both age & height)
- Model Description: The model uses FBank features and a standard LSTM + Cross_Attention + Dense Layer, and is trained with a Mean Squared Error (MSE) loss and the Adam optimizer for both age and height estimation, with height_loss given twice the weight compared to age_loss (a minimal loss sketch follows this model's description). We use a patience of 10 epochs before early stopping the model based on validation loss. Finally, MSE and MAE metrics are used to gauge the performance of the model on the test_set for height estimation. The batch_size used is 32 samples.
- Model Architecture:
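A minimal sketch of the multitask loss described above, assuming a plain 2:1 weighting between the height and age terms (the tensor names and the helper function are hypothetical, not taken from run.py):

import torch
import torch.nn as nn

mse = nn.MSELoss()

def multitask_loss(height_pred, height_true, age_pred, age_true):
    # height_loss is given twice the weight of age_loss, as described above
    height_loss = mse(height_pred, height_true)
    age_loss = mse(age_pred, age_true)
    return 2.0 * height_loss + age_loss

# Example usage with dummy tensors (batch of 32)
h_pred, h_true = torch.randn(32), torch.randn(32)
a_pred, a_true = torch.randn(32), torch.randn(32)
loss = multitask_loss(h_pred, h_true, a_pred, a_true)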
- Model: LSTM + Cross_Attention + Triplet & MSE_Loss | FBank Features | Height Estimation
- Model Description: The model uses FBank features for height estimation only, using a standard LSTM + Cross_Attention + Dense Layer, and is trained with a Mean Squared Error (MSE) loss combined with a Triplet Loss that is used to train the embeddings obtained right after the cross_attention layer. Triplet loss is given one-third of the weight in the total loss while MSE is given two-thirds; Adam is used as the optimizer. The height labels are quantized into groups of 5 cm for the Triplet Loss (i.e. height labels from 140-145 cm in class_0, 145-150 cm in class_1 and so on, giving a total of 13 classes). We use a patience of 10 epochs before early stopping the model based on validation loss. Finally, MSE and MAE metrics are used to gauge the performance of the model on the test_set for height estimation. The batch_size used is 32 samples (a minimal sketch of the combined loss and height quantization follows below).
- Model Architecture:
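A minimal sketch of the height quantization and the combined Triplet + MSE loss described above; how anchor/positive/negative triplets are mined from the 5 cm classes, the margin value, the embedding dimension, and all tensor names are assumptions for illustration:

import torch
import torch.nn as nn

mse = nn.MSELoss()
triplet = nn.TripletMarginLoss(margin=1.0)  # margin value is an assumption

def quantize_height(height_cm):
    # 5 cm bins starting at 140 cm: 140-145 -> class 0, 145-150 -> class 1, ... (13 classes)
    return int((height_cm - 140.0) // 5)

def combined_loss(height_pred, height_true, anchor, positive, negative):
    # MSE gets two-thirds of the total weight, Triplet loss one-third
    return (2.0 / 3.0) * mse(height_pred, height_true) + (1.0 / 3.0) * triplet(anchor, positive, negative)

# Example usage with dummy predictions and 128-dim embeddings (dimension is illustrative)
emb = lambda: torch.randn(32, 128)
loss = combined_loss(torch.randn(32), torch.randn(32), emb(), emb(), emb())
print(quantize_height(172.0))  # -> class 6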