Apply Unicode Normalization #108
Marcono1234
started this conversation in
Ideas
Replies: 1 comment
-
@pemistahl, what is your opinion on this? I assume it could improve accuracy when text uses separate combining diacritic characters instead of precomposed letters.
-
What do you think about applying Unicode Normalization (German Wikipedia article) to both language model creation and language detection?
This would allow accurate language detection even if the input text or training data is not normalized. Java provides the java.text.Normalizer class for this; see also the Java Tutorial. For Lingua, one of the composition forms (NFC or NFKC) should probably be used, because the decomposition forms (NFD or NFKD) split letters with diacritics into a base letter plus combining marks, so Lingua's checks for unique letters would no longer work. NFKC might be best since it also removes some redundancies by folding compatibility variants such as ligatures into their plain equivalents, though in rare corner cases it could introduce inaccuracies.
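To illustrate the idea, here is a minimal sketch of what such a normalization step could look like. The class and the helper names `normalize` and `normalizeCompat` are made up for this example and are not part of Lingua; only `java.text.Normalizer` is the real JDK API.

```java
import java.text.Normalizer;

class NormalizationSketch {
    // Hypothetical helper: canonical composition (NFC).
    // Assumption: this would be called on both training data and input text.
    static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // Hypothetical helper: compatibility composition (NFKC),
    // which additionally folds compatibility variants such as ligatures.
    static String normalizeCompat(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308ber"; // 'u' followed by COMBINING DIAERESIS
        String composed = "\u00FCber";    // precomposed LATIN SMALL LETTER U WITH DIAERESIS

        // The two strings render the same but differ code point by code point.
        System.out.println(decomposed.equals(composed));            // false
        System.out.println(normalize(decomposed).equals(composed)); // true

        String ligature = "\uFB01n"; // LATIN SMALL LIGATURE FI followed by 'n'
        System.out.println(normalize(ligature).equals(ligature));   // true, NFC keeps the ligature
        System.out.println(normalizeCompat(ligature));              // fin
    }
}
```

With NFC, a decomposed "ü" becomes the single precomposed letter, so a check for language-specific letters would see the same character in both cases. NFKC goes further and replaces the "fi" ligature with the two plain letters, which is the kind of redundancy it would remove.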