Setting length_penalty to a negative value helps CTC models, which are often biased towards shorter paths through the WFST graph, since shorter paths generally have smaller costs.
However, a side effect of using length_penalty this way is that a phrase like "no one cares" can come out as "no one caress", because "caress" has a longer WFST path than "cares".
Applying the penalty only when olabel != 0 (epsilon) can help work around this issue, while still preserving some of the benefits from length_penalty.
Note that this word_length_penalty is applied in both the emitting and non-emitting ExpandArcs, while length_penalty is applied only in the emitting ExpandArcs. I believe this is the proper way to do things.
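To make the difference between the two penalties concrete, here is a minimal Python sketch of the rule described above. It is a toy model, not the decoder's actual code: the `Arc` class, the `expanded_arc_cost` function, and the convention that penalties are added to a cost (lower is better) are assumptions made for illustration. It also assumes that a non-emitting arc is one whose input label is epsilon.

```python
from dataclasses import dataclass

EPSILON = 0  # label 0 is epsilon, as in the description above


@dataclass
class Arc:
    ilabel: int    # input label (CTC token); EPSILON means non-emitting (assumed)
    olabel: int    # output label (word); EPSILON means no word is output
    weight: float  # graph cost carried by the arc


def expanded_arc_cost(arc, acoustic_cost=0.0,
                      length_penalty=0.0, word_length_penalty=0.0):
    """Hypothetical cost of taking one arc; lower cost is better."""
    cost = arc.weight
    if arc.ilabel != EPSILON:
        # Emitting arc: consumes a frame, so length_penalty applies here only.
        cost += acoustic_cost + length_penalty
    if arc.olabel != EPSILON:
        # Word-producing arc: word_length_penalty applies whether or not
        # the arc is emitting.
        cost += word_length_penalty
    return cost


# An emitting arc that outputs no word only sees length_penalty.
frame_arc = Arc(ilabel=3, olabel=EPSILON, weight=1.0)
# A non-emitting arc that outputs a word only sees word_length_penalty.
word_arc = Arc(ilabel=EPSILON, olabel=7, weight=0.5)

print(expanded_arc_cost(frame_arc, acoustic_cost=2.0, length_penalty=-5.0))
print(expanded_arc_cost(word_arc, word_length_penalty=-10.0))
```

In this toy model, a negative word_length_penalty rewards a path once per output word rather than once per frame, so two words with different path lengths receive the same bonus.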
Here are some experiments from running the test test_sub_ins_del:
model is stt_en_conformer_ctc_small
dataset is test-clean
For vanilla "ctc" topology, the best length_penalty was -5.0. The WER was: wer=0.04530584297017651, ins=369, sub=1650, del=363
For vanilla "ctc" topology, the best word_length_penalty was -10.0. The WER was: wer=0.045058581862446746, ins=375, sub=1608, del=386
For compact "ctc" topology, the best length_penalty was -9.5. The WER was: wer=0.045058581862446746, ins=375, sub=1608, del=386
For compact "ctc" topology, the best word_length_penalty was -10.0. The WER was: wer=0.04309951308581862, ins=302, sub=1572, del=392
The best result comes from the compact CTC topology with word_length_penalty=-10.0.
It makes sense that a more negative length penalty is required to minimize WER for the compact CTC topology; it has fewer self-loops.
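As a sanity check on the numbers above, the reported WER values agree with the usual definition, (ins + sub + del) divided by the number of reference words. The reference word count is not stated in this description; the value 52576 used below is an assumption inferred from the reported figures, and the `wer` helper is hypothetical.

```python
def wer(ins, sub, dele, num_ref_words):
    """Word error rate from insertion, substitution, and deletion counts."""
    return (ins + sub + dele) / num_ref_words


# Assumed reference word count for test-clean (inferred, not stated above).
NUM_REF_WORDS = 52576

# (ins, sub, del) counts copied from the results above.
results = {
    "vanilla, length_penalty=-5.0": (369, 1650, 363),
    "vanilla, word_length_penalty=-10.0": (375, 1608, 386),
    "compact, length_penalty=-9.5": (375, 1608, 386),
    "compact, word_length_penalty=-10.0": (302, 1572, 392),
}

for name, (ins, sub, dele) in results.items():
    print(f"{name}: wer={wer(ins, sub, dele, NUM_REF_WORDS):.6f}")
```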
Insertion, Substitution, and Deletion statistics were obtained by applying this diff: