Description of parameters available in config files for training and inference.

Each training config should include `data`/`hparas`/`model`, see the example on LibriSpeech.
Options under this category are all data-related.

- **Corpus**
For each corpus, a corresponding source python file `<corpus_name>.py` should be placed at `corpus/`; check out `librispeech.py` for an example.

| Parameter | Description | Note |
|---|---|---|
| name | `str` name of corpus (used in `data.py` to import the dataset defined in `<corpus_name>.py`) | Available: `Librispeech` |
| path | `str` path to the specified corpus; parsing the file structure should be handled in `<corpus_name>.py` | |
| train_split | `list` of corpus subsets used for training; accepted partition names should be defined in `<corpus_name>.py` | |
| dev_split | `list` of corpus subsets used for validation; accepted partition names should be defined in `<corpus_name>.py` | |
| bucketing | `bool` to enable bucketing, i.e. similar lengths within each batch; should be implemented in `<corpus_name>.py` | More efficient training but biased sampling |
| batch_size | `int` batch size for training/validation; will be sent to the Torch DataLoader | |
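Putting these together, a `corpus` block might look like the sketch below; the nesting under `data` and the lower-cased keys are assumed from the section layout above, and the path, split names, and batch size are purely illustrative:

```yaml
data:
  corpus:
    name: 'Librispeech'               # dataset defined in corpus/librispeech.py
    path: '/path/to/LibriSpeech'      # file structure is parsed by librispeech.py
    train_split: ['train-clean-100']  # partition names defined in librispeech.py
    dev_split: ['dev-clean']
    bucketing: True                   # similar-length batches: faster training, biased sampling
    batch_size: 16
```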
- **Audio**

Hyperparameters of feature extraction performed on-the-fly, mostly done by torchaudio; check out `audio.py` for the implementation.
| Parameter | Description | Note |
|---|---|---|
| feat_type | `str` name of the audio feature to be used; note that MFCC requires the latest torchaudio | Available: `fbank`/`mfcc` |
| feat_dim | `int` dimensionality of the audio feature; if you are not familiar with audio features, `40` for `fbank` and `13` for `mfcc` generally work | |
| frame_length | `int` size of the window (milliseconds) for feature extraction | |
| frame_shift | `int` hop size of the window (milliseconds) for feature extraction | |
| dither | `float` dither when extracting features | See doc |
| apply_cmvn | `bool` to activate feature normalization | Using our own implementation |
| delta_order | `int` to apply delta on features; `0`: do nothing, `1`: add delta, `2`: also add acceleration | Using our own implementation |
| delta_window_size | `int` to specify the window size for delta calculation | |
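A sketch of an `audio` block follows; the 25 ms window with a 10 ms hop is a common default rather than a documented requirement, and the remaining values are illustrative:

```yaml
data:
  audio:
    feat_type: 'fbank'      # 'mfcc' requires the latest torchaudio
    feat_dim: 40            # 40 for fbank / 13 for mfcc generally works
    frame_length: 25        # window size in milliseconds
    frame_shift: 10         # hop size in milliseconds
    dither: 0               # see torchaudio doc
    apply_cmvn: True        # feature normalization (own implementation)
    delta_order: 2          # 0: nothing, 1: add delta, 2: also add acceleration
    delta_window_size: 2
```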
- **Text**

Options to specify how text should be encoded; subword models use sentencepiece.
| Parameter | Description | Note |
|---|---|---|
| mode | `str` text unit for encoding sentences | Available: `character`/`subword`/`word` |
| vocab_file | `str` path to the file containing the vocabulary set | Please use `generate_vocab_file.py` to generate it |
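For example, a `text` block for subword units might look like this; the vocabulary file path is a placeholder and should point to the output of `generate_vocab_file.py`:

```yaml
data:
  text:
    mode: 'subword'                       # character / subword / word
    vocab_file: '/path/to/subword.model'  # generated with generate_vocab_file.py
```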
Options under this category are all training-related.
| Parameter | Description | Note |
|---|---|---|
| valid_step | `int` interval, number of training steps between validations | |
| max_step | `int` total number of training steps | |
| tf_start | `float` initial teacher forcing probability in scheduled sampling | |
| tf_end | `float` final teacher forcing probability in scheduled sampling | |
| tf_step | `int` number of steps to linearly decrease the teacher forcing probability | |
| optimizer | `str` name of the PyTorch optimizer for training | Tested: `Adam`/`Adadelta` |
| lr | `float` learning rate for the optimizer | |
| eps | `float` epsilon for the optimizer | |
| lr_scheduler | `str` learning rate scheduler | Available: `fixed`/`warmup` |
| curriculum | `int` number of epochs to perform curriculum learning (short utterances first) | |
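As an illustration, a plausible `hparas` block is sketched below; all numbers are placeholders, not recommended settings:

```yaml
hparas:
  valid_step: 5000        # validate every 5000 training steps
  max_step: 500000
  tf_start: 1.0           # teacher forcing probability decays linearly
  tf_end: 0.5             # ... from tf_start to tf_end
  tf_step: 250000         # ... over tf_step steps
  optimizer: 'Adadelta'   # Adam / Adadelta were tested
  lr: 1.0
  eps: 0.00000001
  lr_scheduler: 'fixed'   # fixed / warmup
  curriculum: 1           # epochs of short-utterance-first training
```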
- `ctc_weight`: weight of CTC in the hybrid CTC-Attention model (between `0~1`; `0` = disabled, `1` is under development)
- **Encoder**

| Parameter | Description | Note |
|---|---|---|
| prenet | `str` to employ a VGG/CNN based encoder before the RNN | Available: `vgg`/`cnn` |
| module | `str` name of the recurrent unit for encoder RNN layers | Only `LSTM` was tested |
| bidirection | `bool` to enable bidirectional RNN over the input sequence | |
| dim | `list` of the number of cells for each RNN layer (per direction) | |
| dropout | `list` of dropout probabilities for each RNN layer | Length must match `dim` |
| layer_norm | `list` of `bool` to enable LayerNorm for each RNN layer | Not recommended |
| proj | `list` of `bool` to enable a linear projection after each RNN layer | Length must match `dim` |
| sample_rate | `list` of sample rates for each RNN layer; for each layer, the length of the output on the time dimension will be input/sample_rate | Length must match `dim` |
| sample_style | `str` the downsampling mechanism; `concat` concatenates multiple time steps into one vector according to the sample rate, `drop` drops the unsampled time steps | Available: `concat`/`drop` |
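Assuming the top-level key is `model` with lower-cased sub-blocks, a 3-layer encoder might be sketched as below; all dimensions and rates are illustrative, and the list-valued options must each provide one entry per layer:

```yaml
model:
  ctc_weight: 0.0                      # 0 disables CTC (pure attention model)
  encoder:
    prenet: 'vgg'                      # vgg / cnn, applied before the RNN
    module: 'LSTM'                     # only LSTM was tested
    bidirection: True
    dim: [512, 512, 512]               # cells per direction, one entry per layer
    dropout: [0, 0, 0]                 # length must match dim
    layer_norm: [False, False, False]  # length must match dim
    proj: [True, True, True]           # length must match dim
    sample_rate: [2, 2, 1]             # halves the time axis twice: 4x total downsampling
    sample_style: 'drop'               # concat / drop
```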
- **Attention**

| Parameter | Description | Note |
|---|---|---|
| mode | `str` attention mechanism; `dot` is the vanilla attention, `loc` is the location-based attention | Available: `dot`/`loc` |
| dim | `int` dimension of all networks in attention | |
| num_head | `int` number of heads in multi-head attention; `1`: normal attention | Performance untested |
| v_proj | `bool` to apply an additional linear transform to encoder features before the weighted sum | |
| temperature | `float` temperature to control the sharpness of the softmax function in attention | |
| loc_kernel_size | `int` kernel size for convolution in location-aware attention | For `loc` only |
| loc_kernel_num | `int` number of kernels for convolution in location-aware attention | For `loc` only |
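A sketch of the `attention` block using the location-aware variant; values are illustrative, and the `loc_*` options matter only when `mode` is `loc`:

```yaml
model:
  attention:
    mode: 'loc'            # dot / loc
    dim: 300
    num_head: 1            # 1 = normal single-head attention
    v_proj: False          # extra linear transform before the weighted sum
    temperature: 0.5       # softmax sharpness
    loc_kernel_size: 100   # used only when mode == 'loc'
    loc_kernel_num: 10     # used only when mode == 'loc'
```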
- **Decoder**

| Parameter | Description | Note |
|---|---|---|
| module | `str` name of the recurrent unit for decoder RNN layers | Only `LSTM` was tested |
| dim | `int` number of cells in the decoder | |
| layer | `int` number of layers in the decoder | |
| dropout | `float` dropout probability | |
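And a matching `decoder` block, with illustrative values:

```yaml
model:
  decoder:
    module: 'LSTM'   # only LSTM was tested
    dim: 512
    layer: 1
    dropout: 0
```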
The following mechanisms are our proposed methods; they can be activated by inserting these parameters into the config file.
- **Emb**

| Parameter | Description | Note |
|---|---|---|
| enable | `bool` to enable word embedding regularization or fused decoding on ASR | |
| src | `str` path to the pre-trained embedding table or BERT model | The `bert-base-uncased` model fine-tuned on LibriSpeech text data is available here |
| distance | `str` measurement of the distance between word embeddings and model output | Available: `CosEmb`/`MSE` (untested) |
| weight | `float` $\lambda$ in the paper | |
| fuse | `float` $\lambda_f$ in the paper | |
| fuse_normalize | `bool` to normalize output before Cosine-Softmax in the paper; should be on when `distance==CosEmb` | |
| bert | `str` name of the BERT model if using BERT as the target embedding, e.g. `bert-base-uncased` | Mutually exclusive with `fuse>0` |
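A sketch of an `emb` block for embedding regularization with cosine distance; the nesting under `model`, the source path, and the weight are assumptions for illustration, and remember that `bert` cannot be combined with `fuse>0`:

```yaml
model:
  emb:
    enable: True
    src: '/path/to/embedding_or_bert'  # pre-trained table or BERT model
    distance: 'CosEmb'                 # CosEmb / MSE (untested)
    weight: 0.05                       # lambda in the paper
    fuse: 0                            # lambda_f in the paper
    fuse_normalize: True               # required when distance == CosEmb
    bert: 'bert-base-uncased'          # allowed only because fuse == 0 here
```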
Each inference config should include `src`/`decode`/`data`, see the example on LibriSpeech.

Note that most of the options (audio feature, model structure, etc.) will be imported from the training config specified in `src`.
Specify the ASR model to use in the decoding process.
| Parameter | Description | Note |
|---|---|---|
| ckpt | `str` path to the ASR checkpoint to be loaded | |
| config | `str` path to the ASR training config that belongs to the checkpoint | |
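For instance, with placeholder paths:

```yaml
src:
  ckpt: '/path/to/asr_checkpoint.pth'   # ASR checkpoint to load
  config: '/path/to/asr_train.yaml'     # training config belonging to that checkpoint
```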
- **Corpus**

| Parameter | Description | Note |
|---|---|---|
| name | See `corpus` section in the training config | |
| dev_split | See `corpus` section in the training config | |
| test_split | Like the dev set, ASR will perform exactly the same decoding process on this set; it should also be defined by the user like the train/dev sets | |
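An illustrative `data` block for inference; the split names are placeholders and must be defined in the corpus source file:

```yaml
data:
  corpus:
    name: 'Librispeech'
    dev_split: ['dev-clean']
    test_split: ['test-clean']   # decoded with exactly the same process as the dev set
```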
Options for decoding that will dramatically change the decoding result.
| Parameter | Description | Note |
|---|---|---|
| beam_size | `int` beam size for the beam search algorithm; be careful that a larger beam increases memory usage | |
| min_len_ratio | `float` the minimum length of any hypothesis will be `min_len_ratio` × input length | |
| max_len_ratio | `float` the maximum decoding time step will be `max_len_ratio` × input length; a hypothesis ends if `<eos>` is predicted or the maximum decoding step is reached | |
| lm_path | `str` path to the pre-trained LM for joint decoding (this is not language model rescoring) | paper |
| lm_config | `str` path to the config of the pre-trained LM for joint decoding | paper |
| lm_weight | `float` weight for the RNNLM in joint decoding | paper, slower inference |
| ctc_weight | `float` weight for the CTC network in joint decoding; this is only available if `ctc_weight` was non-zero in the training config | paper, slower inference |
| vocab_candidate | `int` number of vocabulary candidates considered in CTC beam decoding; the smaller the value, the faster the decoding, but it must be greater than `beam_size` | paper |
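To tie the options together, here is a sketch of a `decode` block with joint LM and CTC decoding; all paths and weights are illustrative, not tuned settings:

```yaml
decode:
  beam_size: 20          # larger beams increase memory usage
  min_len_ratio: 0.1     # hypotheses are at least 0.1 x input length
  max_len_ratio: 0.3     # stop at 0.3 x input length or on <eos>
  lm_path: '/path/to/lm_checkpoint.pth'   # joint decoding, not rescoring
  lm_config: '/path/to/lm_train.yaml'
  lm_weight: 0.5
  ctc_weight: 0.3        # usable only if ctc_weight > 0 during training
  vocab_candidate: 40    # must be greater than beam_size
```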