Apply Unicode Normalization #108

Marcono1234 · 2021-06-20T17:11:39Z

What do you think about applying Unicode Normalization (German Wikipedia article) to both language model creation and language detection?
This would allow accurate language detection even when the input text or the training data is not normalized. Java provides the Normalizer class for this; see also the Java Tutorial. For Lingua, one of the composition forms (NFC or NFKC) should probably be used, because with the decomposition forms (NFD or NFKD) Lingua's checks for unique letters would no longer work. NFKC might be the best choice since it also maps compatibility variants such as ligatures to their plain equivalents and thereby avoids some redundancies, though in rare corner cases it could also introduce inaccuracies.
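To illustrate the difference between the two composition forms, here is a minimal sketch using `java.text.Normalizer`. The class `NormalizationSketch` and its helper methods are hypothetical names made up for this example; they are not part of Lingua.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of Lingua's API.
final class NormalizationSketch {

    // Canonical composition: combines base letters with combining marks.
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    // Compatibility composition: additionally replaces compatibility
    // characters (for example ligatures) with their plain equivalents.
    static String toNfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308ber"; // 'u' followed by COMBINING DIAERESIS
        String composed = "\u00FCber";    // precomposed 'ü'

        System.out.println(decomposed.equals(composed));        // false
        System.out.println(toNfc(decomposed).equals(composed)); // true

        String ligature = "\uFB01n"; // LATIN SMALL LIGATURE FI followed by 'n'
        System.out.println(toNfc(ligature).equals("fin"));  // false, NFC keeps the ligature
        System.out.println(toNfkc(ligature).equals("fin")); // true
    }
}
```

The ligature case is the kind of redundancy NFKC would remove. Whether that is desirable for every language model would still need to be checked.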

Marcono1234 · 2022-06-17T21:50:35Z

@pemistahl, what is your opinion on this? I assume it could help increase the accuracy when the text uses separate combining characters for diacritics instead of precomposed letters.
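As a rough illustration of why this might matter, the sketch below compares character bigrams of a decomposed and a precomposed string. The class `NgramSketch` and its `bigrams` method are hypothetical and deliberately simplified (plain UTF-16 substrings); Lingua's actual n-gram extraction may well work differently.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified illustration; not Lingua's actual n-gram logic.
final class NgramSketch {

    // Collects all substrings of length 2 (UTF-16 code units).
    static List<String> bigrams(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            result.add(text.substring(i, i + 2));
        }
        return result;
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301te\u0301"; // "été" written with COMBINING ACUTE ACCENT
        String composed = "\u00E9t\u00E9";     // "été" written with precomposed letters

        // Without normalization the two spellings yield different bigrams.
        System.out.println(bigrams(decomposed).equals(bigrams(composed))); // false

        // After NFC normalization the bigrams are identical.
        String normalized = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(bigrams(normalized).equals(bigrams(composed))); // true
    }
}
```

If the models were trained on precomposed text, the bigrams of the decomposed input would presumably not be found in them, which is the accuracy loss I have in mind.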
