Bad results with Java version #181

aamirbutt · 2023-08-10T18:58:35Z

I noticed that for a particular string (used in the code below), I get correct language detection with lingua-py, but the Java version of lingua gives me bad results.

Here is the Python version:

from lingua import LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
detector.compute_language_confidence_values("Fast shipment. easy order. excellent costumer care!AAA+++")

[ConfidenceValue(language=Language.ENGLISH, value=0.4054392221734394),
 ConfidenceValue(language=Language.TAGALOG, value=0.17739618771502366),
 ConfidenceValue(language=Language.FRENCH, value=0.05126428674609979),
 ConfidenceValue(language=Language.DANISH, value=0.045731114924862044),
 ConfidenceValue(language=Language.LATIN, value=0.02444471663598676),
 ConfidenceValue(language=Language.DUTCH, value=0.02432267933976313),
 ConfidenceValue(language=Language.ITALIAN, value=0.020561633912509245)

You can see that the detected language is English, with Tagalog a distant second.

But running the same with Java always gives me TAGALOG as first.
Here is the code:

// The same input string as in the Python example above.
String text = "Fast shipment. easy order. excellent costumer care!AAA+++";

// Build a detector over every supported language (Sets is Guava's com.google.common.collect.Sets).
Set<Language> languages = Sets.newHashSet();
languages.addAll(Language.all());
com.github.pemistahl.lingua.api.LanguageDetector LANGUAGE_DETECTOR =
    LanguageDetectorBuilder.fromLanguages(languages.toArray(new Language[0]))
        .build();

System.out.printf("Lingua detected language: %s \n", LANGUAGE_DETECTOR.detectLanguageOf(text));
SortedMap<Language, Double> confidenceValues = LANGUAGE_DETECTOR.computeLanguageConfidenceValues(text);
System.out.println("Confidence Values: " + confidenceValues);

Output:

Lingua detected language: TAGALOG
Confidence Values: {TAGALOG=1.0, ENGLISH=0.9775208830833435, DUTCH=0.91419917345047, DANISH=0.9110179543495178, AFRIKAANS=0.894393265247345, LATIN=0.8940380811691284, FRENCH=0.8823490738868713, YORUBA=0.8764988780021667, MAORI=0.8754469156265259, ITALIAN=0.8731745481491089, NYNORSK=0.8615179061889648, XHOSA=0.8594332337379456, SWEDISH=0.8565084338188171, FINNISH=0.8524059057235718, INDONESIAN=0.8513398766517639, TURKISH=0.8496785759925842, BOKMAL=0.8471575379371643, ESPERANTO=0.8467212915420532, WELSH=0.8453010320663452, GERMAN=0.8415506482124329, SOTHO=0.8362245559692383, SWAHILI=0.8316057324409485, PORTUGUESE=0.830149233341217, MALAY=0.8224523663520813, ICELANDIC=0.8180419206619263, ROMANIAN=0.8172013163566589, SPANISH=0.8142555356025696, BASQUE=0.8130630850791931, ALBANIAN=0.8069103360176086, TSWANA=0.7953811883926392, ZULU=0.79490727186203, ESTONIAN=0.7938821315765381, SLOVAK=0.7922097444534302, GANDA=0.785712480545044, TSONGA=0.7855940461158752, CZECH=0.783637285232544, POLISH=0.7821556329727173, SLOVENE=0.7810076475143433, HUNGARIAN=0.7796627283096313, LITHUANIAN=0.770524799823761, IRISH=0.765510082244873, SHONA=0.76349276304245, AZERBAIJANI=0.7579408884048462, CROATIAN=0.7558900117874146, CATALAN=0.7528392672538757, BOSNIAN=0.7519121766090393, VIETNAMESE=0.750431478023529, SOMALI=0.7487541437149048, LATVIAN=0.7310714721679688}
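
One rough way to put the two outputs side by side is to look at how close the runner-up is to the top score in each. The sketch below is only an illustration: it copies the first three values from each output posted above, and `top_two_ratio` is a hypothetical helper, not part of either library. The two releases may also compute confidence on different scales, so the ratios are indicative at best.

```python
# First three entries copied from the Python (lingua-py 1.3.2) output above.
python_values = {
    "ENGLISH": 0.4054392221734394,
    "TAGALOG": 0.17739618771502366,
    "FRENCH": 0.05126428674609979,
}

# First three entries copied from the Java (lingua 1.2.2) output above.
java_values = {
    "TAGALOG": 1.0,
    "ENGLISH": 0.9775208830833435,
    "DUTCH": 0.91419917345047,
}


def top_two_ratio(values):
    """Return (best, runner_up, runner_up_score / best_score).

    Hypothetical helper for illustration only; a ratio near 1.0 means
    the two leading languages are almost tied.
    """
    ranked = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (second, second_score) = ranked[:2]
    return best, second, second_score / best_score


for name, values in (("lingua-py 1.3.2", python_values),
                     ("lingua 1.2.2 (Java)", java_values)):
    best, second, ratio = top_two_ratio(values)
    print(f"{name}: {best} first, {second} second, ratio {ratio:.3f}")
```

With the posted values, the Java output has Tagalog and English nearly tied, while the Python output separates them clearly in favour of English.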

I also noticed that lingua-py's version is 1.3.2, whereas the latest version available for Java is 1.2.2.

This probably means that the Java version needs to be updated to pick up the new language models. Any plans on doing so?

pemistahl (Maintainer) · 2023-08-10T21:07:55Z

Hi, thanks for your request. Yes, I will update the Java / Kotlin port of my library so that all the improvements of the Rust, Go and Python ports are included here as well. Currently, I don't have much free time, so it will take a while longer.

aamirbutt (Author) · 2024-05-06T15:36:05Z

Hi, did you get a chance to update the Java library?

1 reply

pemistahl May 10, 2024
Maintainer

Not yet, unfortunately. Due to my family life and my daily job, I'm very busy. As soon as there is a new release, you will see it in this repository.
