This recipe trains a Ukrainian GMM-HMM on Common Voice 9.0 to use for segmentation of long audio files into short utterances using its full transcript.
It avoids performing full large vocabulary speech recognition by limiting its search options to word sequences from the input transcript.
git clone https://github.com/kaldi-asr/kaldi
cd kaldi
cat INSTALL
# install kaldi to $HOME/kaldi
Speed run through Kaldi installation instructions:
cd tools
./extras/check_dependencies.sh
# Pay attention to dependency errors.
# If you're on macOS you don't need OpenBLAS.
# You won't need python2.7 and subversion.
# Make sure python command runs some python:
ln -sf $(which python3) $HOME/.local/bin/python
# On Ubuntu you can do this instead:
apt-get install python-is-python3
# Ignore all subsequent dependency checks.
echo > extras/check_dependencies.sh
# Build all tools (primarily openfst and pocolm)
make -j8
./extras/install_pocolm.sh
# Build kaldi itself
cd ../src
./configure --shared
make -j clean depend
make -j8
pip3 install --editable .
Prerequisites:
- Request access at https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0
- Get a Hugging Face token at https://huggingface.co/settings/tokens
- run
huggingface-cli login
# bring parts of kaldi into $PATH
source path.sh
# prepare dataset for training
# FWIW downloading takes longer than training :)
python3 -m uk.prepare_dataset
# progressively train mono, tri, tri2b, tri3b models
python3 -m uk.train_gmm exp/tri3b
python3 -m uk.prepare_dataset --dataset darkproger/librispeech_asr --subset train.clean.100 --split full --alphabet latin
# bring parts of kaldi into $PATH
source path.sh
# make a subset of librispeech
utils/subset_data_dir.sh --per-spk data/darkproger/librispeech_asr/train.clean.100/full 30 data/librispeech_mini
# progressively train mono, tri, tri2b, tri3b models
python3 -m uk.train_gmm -d data/librispeech_mini --lexicon data/local/dict/english_mfa_reference.dict exp/english
Some Ukrainian unvoiced sounds are considered noise by VADs trained on English. It's useful to perform segmentation using an acoustic model instead.
Uses the Lada dataset.
data/dataset_lada
comes from dataset_lada_ogg.zip
file.
Assumes a Ukrainian model to be present in exp/tri3b
, see above.
# Prepare data directory. We need text, utt2spk and wav.scp
mkdir -p data/lada
< data/dataset_lada/accept/metadata.jsonl jq -rc '[.file,.orig_text] | @tsv' | python3 -m uk.clean_text | sed 's,.ogg,,' | sort > data/lada/text
find data/dataset_lada/accept/ -name '*.ogg' | sort | awk -F/ '{s=$4;sub(".ogg","",s); print s, "lada"}' > data/lada/utt2spk
find data/dataset_lada/accept/ -name '*.ogg' | sort | awk -F/ '{s=$4;sub(".ogg","",s); print s, "ffmpeg -nostdin -i data/"$0" -ac 1 -acodec pcm_s16le -f wav - |"}' > data/lada/wav.scp
utils/fix_data_dir.sh data/lada
# Prepare lang. Makes every word pronunciation known by running G2P.
python3 -m uk.prepare_dict -o exp/dict_base data/local/dict/lexicon_common_voice_uk.txt
python3 -m uk.prepare_lang -d exp/dict_base -o data/lada --text data/lada/text
# Compute and export alignments to exp/lada
steps/make_mfcc.sh data/lada
python3 -m uk.align_utterances -a exp/ali -l data/lada/lang -m exp/tri3b data/lada
# Write a segments file for utterances without leading and trailing silence
python3 -m uk.trim_silence -l data/lada/lang/lang exp/ali/ > data/lada/segments
# Extract separate wavs
python3 -m uk.extract_segments -o data/lada_seg -i data/lada/wav.scp data/lada/segments
# get example data to align
git clone https://github.com/lang-uk/semesyuk-to-text
# note: text can contain extra words
cat semesyuk-to-text/texts/tokenized/semesyuk_farshrutka/01_prologue.txt | python3 -m uk.nlp_uk_tokens \
> data/local/semesyuk_farshrutka/01_prologue.txt
# run segmentation using tri3b model
# in this example it outputs a kaldi-style data directory to data/semesyuk_farshrutka_prologue
python3 -m uk.segment_long_utterances -w exp/segment1 -o data/semesyuk_farshrutka_prologue \
data/local/semesyuk_farshrutka/01_prologue.txt \
semesyuk-to-text/audio/raw/semesyuk_farshrutka/01_prologue.mp3
# upload result to wandb
python3 -m uk.share data/semesyuk_farshrutka_prologue
- Common Voice Dataset https://commonvoice.mozilla.org/
- G2P model https://github.com/kosti4ka/ukro_g2p
- Baseline https://github.com/lang-uk/semesyuk-to-text
- Alignment method (paper inside) https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/cleanup/segment_long_utterances.sh