There are no guarantees that the alignments are correct in the `NoiseModifer`. It generates random tokens through `get_random_unicode_words`, but a tokenizer could combine these into a single word. For instance, if it chooses basic Latin, it could generate `cat`, which would be tokenized as a single token rather than as the three tokens `c`, `a`, `t` that were generated.
The fix here would be to use a configured tokenizer.
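
A minimal sketch of the mismatch and of the proposed fix, for illustration only. It does not use the library's real API: `whitespace_tokenizer` and `retokenize` are hypothetical names, and the sketch assumes the generated pieces are joined without a separator before tokenization.

```python
from typing import Callable, List


def whitespace_tokenizer(text: str) -> List[str]:
    """Stand-in for a configured tokenizer (hypothetical): split on whitespace."""
    return text.split()


def retokenize(
    generated: List[str],
    tokenizer: Callable[[str], List[str]],
    joiner: str = "",
) -> List[str]:
    """Join the generated noise pieces and run the tokenizer over the result,
    so downstream alignments follow the tokens the tokenizer really emits."""
    return tokenizer(joiner.join(generated))


# Pieces as the noise step might generate them (assumed example from this issue).
generated = ["c", "a", "t"]

# Naive alignment: one label per generated piece.
naive_labels = ["noise"] * len(generated)

# Alignment based on the tokenizer's own output.
tokens = retokenize(generated, whitespace_tokenizer)
labels = ["noise"] * len(tokens)

print(len(naive_labels), len(labels), tokens)
```

Here the naive alignment has three labels, while the tokenizer sees one token, so the labels and tokens no longer line up. Building the labels from the tokenizer's output avoids this.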