NoiseModifier should use a tokenizer to generate correct alignments #55

gregtatum · 2024-03-01T15:51:42Z

There is no guarantee that the alignments produced by the `NoiseModifier` are correct. It generates random tokens through `get_random_unicode_words`, but a tokenizer could treat these as combined words. For instance, if it chooses basic Latin, it could generate `cat`, which would be a single token rather than `c a t` as generated.

The fix here would be to use a configured tokenizer.
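
As a rough illustration of the idea (not the project's actual code), the sketch below derives alignments from whatever the tokenizer returns instead of assuming one token per generated word. The names `identity_alignments`, `whitespace_tokenize`, and `char_tokenize` are hypothetical, and it assumes the noise text is identical on the source and target sides, so an identity alignment in `i-i` pairs is what we want.

```python
from typing import Callable, List

# A tokenizer is modelled here as any callable from text to a list of tokens.
# This type alias is an assumption for the sketch, not an existing API.
Tokenizer = Callable[[str], List[str]]


def whitespace_tokenize(text: str) -> List[str]:
    """Split on whitespace: 'cat' stays one token, 'c a t' becomes three."""
    return text.split()


def char_tokenize(text: str) -> List[str]:
    """Treat every non-whitespace character as its own token."""
    return [ch for ch in text if not ch.isspace()]


def identity_alignments(text: str, tokenize: Tokenizer) -> str:
    """Build 'i-i' alignment pairs from the tokenizer's own token count.

    Assumes the same noise text is used for source and target, so token i
    on one side aligns with token i on the other.
    """
    tokens = tokenize(text)
    return " ".join(f"{i}-{i}" for i in range(len(tokens)))
```

With this shape, the same noise string yields different alignments depending on the configured tokenizer: `identity_alignments("cat", whitespace_tokenize)` gives one pair, while `identity_alignments("cat", char_tokenize)` gives three.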
