Get Started
We use Hydra to manage all training configurations. If you are not familiar with Hydra, we recommend visiting the Hydra website. In brief, Hydra is an open-source framework that simplifies the development of research applications by letting you compose a hierarchical configuration dynamically. To see how we use Hydra, we recommend reading here.
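To give a feel for what "hierarchical configuration" means in the commands used on this page, here is a minimal sketch in plain Python. It is not Hydra's parser and does not use Hydra's API. It only illustrates the idea that a bare override such as `dataset=librispeech` picks an option for a config group, while a dotted override such as `dataset.dataset_path=...` sets a field inside that group. The function name `apply_overrides`, the `_choice` key, and the example path are assumptions made for illustration.

```python
from typing import Any, Dict, List


def apply_overrides(overrides: List[str]) -> Dict[str, Any]:
    """Fold `key=value` strings into a nested dict (illustration only, not Hydra's parser)."""
    config: Dict[str, Any] = {}
    for item in overrides:
        key, _, value = item.partition("=")
        parts = key.split(".")
        if len(parts) == 1:
            # A bare name such as `dataset=librispeech` selects an option for a group.
            # `_choice` is a hypothetical key used only in this sketch.
            config.setdefault(key, {})["_choice"] = value
        else:
            # A dotted name such as `dataset.dataset_path=...` sets a field inside a group.
            node = config
            for part in parts[:-1]:
                node = node.setdefault(part, {})
            node[parts[-1]] = value
    return config


example = apply_overrides([
    "dataset=librispeech",
    "dataset.dataset_path=/data/LibriSpeech",  # hypothetical path
    "model=conformer_lstm",
    "trainer=gpu",
])
print(example)
```

In real use, Hydra also resolves each selected option to a config file or dataclass and type-checks the values; the sketch leaves all of that out.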
We support LibriSpeech, KsponSpeech, and AISHELL-1.
LibriSpeech is a corpus of approximately 1,000 hours of 16kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data was derived from reading audiobooks from the LibriVox project, and has been carefully segmented and aligned.
AISHELL-1 is an open-source Mandarin Chinese speech corpus published by Beijing Shell Shell Technology Co., Ltd. 400 people from different accent areas in China were invited to participate in the recording, which was conducted in a quiet indoor environment using high-fidelity microphones and downsampled to 16kHz.
KsponSpeech is a large-scale spontaneous speech corpus of Korean. This corpus contains 969 hours of general open-domain dialog utterances, spoken by about 2,000 native Korean speakers in a clean environment. All data were constructed by recording the dialogue of two people freely conversing on a variety of topics and manually transcribing the utterances. To start training, the KsponSpeech dataset must be prepared in advance. To download KsponSpeech, you need permission from AI Hub.
Dataset | Unit | Manifest | Vocab | SP-Model |
---|---|---|---|---|
LibriSpeech | character | [Link] | [Link] | - |
LibriSpeech | subword | [Link] | [Link] | [Link] |
AISHELL-1 | character | [Link] | [Link] | - |
KsponSpeech | character | [Link] | [Link] | - |
KsponSpeech | subword | [Link] | [Link] | [Link] |
KsponSpeech | grapheme | [Link] | [Link] | - |
KsponSpeech needs permission from AI Hub.
Please send an e-mail that includes a screenshot of the approval to [email protected].
- Acoustic model manifest file format:
LibriSpeech/test-other/8188/269288/8188-269288-0052.flac ▁ANNIE ' S ▁MANNER ▁WAS ▁VERY ▁MYSTERIOUS 4039 20 5 531 17 84 2352
LibriSpeech/test-other/8188/269288/8188-269288-0053.flac ▁ANNIE ▁DID ▁NOT ▁MEAN ▁TO ▁CONFIDE ▁IN ▁ANYONE ▁THAT ▁NIGHT ▁AND ▁THE ▁KIND EST ▁THING ▁WAS ▁TO ▁LEAVE ▁HER ▁A LONE 4039 99 35 251 9 4758 11 2454 16 199 6 4 323 200 255 17 9 370 30 10 492
LibriSpeech/test-other/8188/269288/8188-269288-0054.flac ▁TIRED ▁OUT ▁LESLIE ▁HER SELF ▁DROPP ED ▁A SLEEP 1493 70 4708 30 115 1231 7 10 1706
LibriSpeech/test-other/8188/269288/8188-269288-0055.flac ▁ANNIE ▁IS ▁THAT ▁YOU ▁SHE ▁CALL ED ▁OUT 4039 34 16 25 37 208 7 70
LibriSpeech/test-other/8188/269288/8188-269288-0056.flac ▁THERE ▁WAS ▁NO ▁REPLY ▁BUT ▁THE ▁SOUND ▁OF ▁HURRY ING ▁STEPS ▁CAME ▁QUICK ER ▁AND ▁QUICK ER ▁NOW ▁AND ▁THEN ▁THEY ▁WERE ▁INTERRUPTED ▁BY ▁A ▁GROAN 57 17 56 1368 33 4 489 8 1783 14 1381 133 571 49 6 571 49 82 6 76 45 54 2351 44 10 3154
LibriSpeech/test-other/8188/269288/8188-269288-0057.flac ▁OH ▁THIS ▁WILL ▁KILL ▁ME ▁MY ▁HEART ▁WILL ▁BREAK ▁THIS ▁WILL ▁KILL ▁ME 299 46 71 669 50 41 235 71 977 46 71 669 50
...
...
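Each manifest line above carries three fields: the audio path, the tokenized transcript, and the matching token IDs. The following is a minimal parsing sketch, assuming the three fields are separated by tab characters (the separators are not visible in the listing above, so this is an assumption). `ManifestEntry` and `parse_manifest_line` are hypothetical names and not part of openspeech.

```python
from typing import List, NamedTuple


class ManifestEntry(NamedTuple):
    audio_path: str
    tokens: List[str]
    token_ids: List[int]


def parse_manifest_line(line: str) -> ManifestEntry:
    """Split one manifest line into its path, tokens, and integer token IDs."""
    # Assumption: fields are tab-separated; tokens and IDs are space-separated.
    audio_path, transcript, ids = line.rstrip("\n").split("\t")
    return ManifestEntry(audio_path, transcript.split(), [int(i) for i in ids.split()])


sample = (
    "LibriSpeech/test-other/8188/269288/8188-269288-0055.flac\t"
    "▁ANNIE ▁IS ▁THAT ▁YOU ▁SHE ▁CALL ED ▁OUT\t"
    "4039 34 16 25 37 208 7 70"
)
entry = parse_manifest_line(sample)

# Every token should have exactly one ID.
assert len(entry.tokens) == len(entry.token_ids)
print(entry.audio_path, len(entry.tokens))
```

A check like the final assertion can be a quick sanity test when you generate your own manifest files.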
You can train on the LibriSpeech dataset as follows:
- Example1: Train the `conformer-lstm` model with `filter-bank` features on GPU:
$ python ./openspeech_cli/hydra_train.py \
dataset=librispeech \
dataset.dataset_download=True \
dataset.dataset_path=$DATASET_PATH \
dataset.manifest_file_path=$MANIFEST_FILE_PATH \
vocab=libri_subword \
model=conformer_lstm \
audio=fbank \
lr_scheduler=warmup_reduce_lr_on_plateau \
trainer=gpu \
criterion=joint_ctc_cross_entropy
You can train on the KsponSpeech dataset as follows:
- Example2: Train the `listen-attend-spell` model with `mel-spectrogram` features on TPU:
$ python ./openspeech_cli/hydra_train.py \
dataset=ksponspeech \
dataset.dataset_path=$DATASET_PATH \
dataset.manifest_file_path=$MANIFEST_FILE_PATH \
dataset.test_dataset_path=$TEST_DATASET_PATH \
dataset.test_manifest_dir=$TEST_MANIFEST_DIR \
vocab=kspon_character \
model=listen_attend_spell \
audio=melspectrogram \
lr_scheduler=warmup_reduce_lr_on_plateau \
trainer=tpu \
criterion=joint_ctc_cross_entropy
You can train on the AISHELL-1 dataset as follows:
- Example3: Train the `quartznet` model with `mfcc` features on GPU with FP16:
$ python ./openspeech_cli/hydra_train.py \
dataset=aishell \
dataset.dataset_path=$DATASET_PATH \
dataset.dataset_download=True \
dataset.manifest_file_path=$MANIFEST_FILE_PATH \
vocab=aishell_character \
model=quartznet15x5 \
audio=mfcc \
lr_scheduler=warmup_reduce_lr_on_plateau \
trainer=gpu-fp16 \
criterion=ctc
- Example1: Evaluate the `listen_attend_spell` model:
$ python ./openspeech_cli/hydra_eval.py \
audio=melspectrogram \
eval.model_name=listen_attend_spell \
eval.dataset_path=$DATASET_PATH \
eval.checkpoint_path=$CHECKPOINT_PATH \
eval.manifest_file_path=$MANIFEST_FILE_PATH
- Example2: Evaluate the `listen_attend_spell` and `conformer_lstm` models as an ensemble:
$ python ./openspeech_cli/hydra_eval.py \
audio=melspectrogram \
eval.model_names=(listen_attend_spell, conformer_lstm) \
eval.dataset_path=$DATASET_PATH \
eval.checkpoint_paths=($CHECKPOINT_PATH1, $CHECKPOINT_PATH2) \
eval.ensemble_weights=(0.3, 0.7) \
eval.ensemble_method=weighted \
eval.manifest_file_path=$MANIFEST_FILE_PATH
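To illustrate what a weighted ensemble with weights such as `(0.3, 0.7)` can mean, here is a minimal sketch that forms a weighted sum of per-model probability distributions over a shared vocabulary. This is one common way to combine models and is an assumption here; the way openspeech implements `eval.ensemble_method=weighted` may differ. The function name and the toy probabilities are hypothetical.

```python
from typing import List, Sequence


def weighted_ensemble(
    distributions: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Combine per-model probability distributions with one weight per model."""
    if len(distributions) != len(weights):
        raise ValueError("need exactly one weight per model")
    combined = [0.0] * len(distributions[0])
    for dist, weight in zip(distributions, weights):
        for i, prob in enumerate(dist):
            combined[i] += weight * prob
    return combined


# Hypothetical next-token probabilities from two models over a 3-symbol vocabulary.
las_probs = [0.6, 0.3, 0.1]
conformer_probs = [0.2, 0.7, 0.1]

combined = weighted_ensemble([las_probs, conformer_probs], weights=[0.3, 0.7])
# Index of the most probable symbol after combining the two models.
best = max(range(len(combined)), key=combined.__getitem__)
print(combined, best)
```

When the weights sum to 1 and each input is a valid distribution, the combined scores also sum to 1, so they can be used directly in decoding.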
Language model training requires only data to be prepared in the following format:
openspeech is a framework for making end-to-end speech recognizers.
end to end automatic speech recognition is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits.
because of these advantages, many end-to-end speech recognition related open sources have emerged.
...
...
Note that you need to use the same vocabulary as the acoustic model.
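Because the language model has to share the acoustic model's vocabulary, it can help to check the text data for symbols the vocabulary does not cover before training. Below is a minimal sketch for a character-level vocabulary. The toy vocabulary, the sample lines, and the function name are assumptions for illustration; loading the real vocabulary file is left out because its format is not described here.

```python
from typing import Iterable, Set


def out_of_vocabulary(lines: Iterable[str], vocab: Set[str]) -> Set[str]:
    """Return the characters that occur in the text but not in the vocabulary."""
    missing: Set[str] = set()
    for line in lines:
        missing.update(ch for ch in line.rstrip("\n") if ch not in vocab)
    return missing


# Hypothetical character vocabulary; in practice, build it from the acoustic model's vocab file.
vocab = set("abcdefghijklmnopqrstuvwxyz -.")
corpus = [
    "openspeech is a framework for making end-to-end speech recognizers.",
    "hello, world!",  # contains characters outside the toy vocabulary
]

missing = out_of_vocabulary(corpus, vocab)
print(sorted(missing))
```

An empty result means every character in the text is covered; otherwise the text should be normalized or the offending lines dropped.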
- Example: Train the `lstm_lm` model:
$ python ./openspeech_cli/hydra_lm_train.py \
dataset=lm \
dataset.dataset_path=../../../lm.txt \
vocab=kspon_character \
vocab.vocab_path=../../../labels.csv \
model=lstm_lm \
lr_scheduler=tri_stage \
trainer=gpu \
criterion=perplexity