Add word_length_penalty option. #40

galv · 2024-06-12T21:49:47Z

Setting length_penalty to a negative value is helpful for CTC models, since they are often biased towards taking shorter paths through the WFST graph (shorter paths have smaller costs, in general).

However, a side effect of using length_penalty this way is that a phrase like "no one cares" can come out as "no one caress" instead, because "caress" has a longer WFST path than "cares".

Applying the penalty only when olabel != 0 (epsilon) can help work around this issue, while still preserving some of the benefits from length_penalty.

Note that this word_length_penalty is applied in both the emitting and non-emitting ExpandArcs, while length_penalty is applied only in the emitting ExpandArcs. I believe this is the proper way to do things.
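
To make the difference between the two penalties concrete, here is a minimal Python sketch. It is not the decoder's actual code: the Arc class, the path_cost function, the label ids, and the toy arc sequences are all hypothetical, and it assumes that both penalties are simply added to the path cost. It only illustrates that a per-emitting-arc penalty scales with path length, while a penalty gated on olabel != 0 is paid once per output word.

```python
from dataclasses import dataclass
from typing import List

EPSILON = 0  # conventional WFST label id meaning "no output symbol"


@dataclass
class Arc:
    olabel: int        # output (word) label; EPSILON if the arc emits no word
    emitting: bool     # True if the arc consumes an acoustic frame
    cost: float = 0.0  # graph + acoustic cost (zero in this toy example)


def path_cost(arcs: List[Arc],
              length_penalty: float = 0.0,
              word_length_penalty: float = 0.0) -> float:
    """Sum arc costs plus penalties (assumption: penalties are added to the cost)."""
    total = 0.0
    for arc in arcs:
        total += arc.cost
        if arc.emitting:
            # Applied on every emitting arc, so it grows with path length.
            total += length_penalty
        if arc.olabel != EPSILON:
            # Applied on emitting and non-emitting arcs alike, once per word.
            total += word_length_penalty
    return total


# Hypothetical word ids and made-up paths: "caress" is one emitting arc longer.
CARES, CARESS = 1, 2
cares_path = [Arc(CARES, True)] + [Arc(EPSILON, True) for _ in range(4)]
caress_path = [Arc(CARESS, True)] + [Arc(EPSILON, True) for _ in range(5)]

for name, path in [("cares", cares_path), ("caress", caress_path)]:
    print(name,
          "length_penalty only:", path_cost(path, length_penalty=-5.0),
          "word_length_penalty only:", path_cost(path, word_length_penalty=-10.0))
```

In this toy setup a negative length_penalty makes the longer "caress" path cheaper than "cares", whereas word_length_penalty contributes the same amount to both, so the choice between them is left to the other costs.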

Here are some experiments from running the test test_sub_ins_del:

model is stt_en_conformer_ctc_small
dataset is test-clean

For vanilla "ctc" topology, the best length_penalty was -5.0. The WER was: wer=0.04530584297017651, ins=369, sub=1650, del=363

For vanilla "ctc" topology, the best word_length_penalty was -10.0. The WER was: wer=0.045058581862446746, ins=375, sub=1608, del=386

For compact "ctc" topology, the best length_penalty was -9.5. The WER was: wer=0.045058581862446746, ins=375, sub=1608, del=386

For compact "ctc" topology, the best word_length_penalty was -10.0. The WER was: wer=0.04309951308581862, ins=302, sub=1572, del=392

The best result comes from using the compact CTC topology with word_length_penalty=-10.0.
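
As a sanity check on the numbers above, WER is (ins + sub + del) divided by the number of reference words. The sketch below recomputes it from the reported counts. The reference word count of 52576 is not stated in this PR; it is inferred from the reported counts and ratios, so treat it as an assumption. The function and variable names are also just for illustration.

```python
# Assumption: total reference words, inferred from the reported counts/ratios.
REF_WORDS = 52576


def wer_from_counts(ins: int, sub: int, dele: int, ref_words: int = REF_WORDS) -> float:
    """Word error rate as (insertions + substitutions + deletions) / reference words."""
    return (ins + sub + dele) / ref_words


# (ins, sub, del) counts copied from the results above.
results = {
    "vanilla, length_penalty=-5.0": (369, 1650, 363),
    "vanilla, word_length_penalty=-10.0": (375, 1608, 386),
    "compact, length_penalty=-9.5": (375, 1608, 386),
    "compact, word_length_penalty=-10.0": (302, 1572, 392),
}

for name, (ins, sub, dele) in results.items():
    print(f"{name}: errors={ins + sub + dele}, wer={wer_from_counts(ins, sub, dele):.6f}")

# Relative WER reduction of the best configuration over the vanilla length_penalty run.
baseline = wer_from_counts(*results["vanilla, length_penalty=-5.0"])
best = wer_from_counts(*results["compact, word_length_penalty=-10.0"])
relative_reduction = (baseline - best) / baseline
print(f"relative WER reduction: {relative_reduction:.2%}")
```

Most of the gain in the best configuration comes from fewer insertions and substitutions; deletions go up slightly.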

It makes sense that a more negative length penalty is required to minimize WER for the compact CTC topology; it has fewer self-loops.

Insertion, Substitution, and Deletion statistics were obtained by applying this diff:

modified   src/riva/asrlib/decoder/test_graph_construction.py
@@ -963,6 +963,8 @@ class TestGraphConstruction:
         references = [s.lower() for s in references]
         # Might want to try a different WER implementation, for sanity.
         my_wer = wer(references, predictions)
+        wer_ratio, ins, sub, deletions = my_wer
+        print(f"GALVEZ:wer={wer_ratio}, ins={ins}, sub={sub}, del={deletions}")
         other_wer = word_error_rate(references, predictions)
         print("beam search WER:", my_wer)
         print("other beam search WER:", other_wer)

