Replies: 2 comments 1 reply
-
Hi, thanks for your request. Yes, I will update the Java / Kotlin port of my library so that all the improvements of the Rust, Go and Python ports are included here as well. Currently, I don't have much free time, so it will take a while.
-
Hi, did you get a chance to update the Java lib?
-
I noticed that for a particular string (used in the code below), I get correct language detection when I use lingua-py, but lingua gives me bad results.
Here is the Python version:
    from lingua import LanguageDetectorBuilder

    # Build a detector for all languages with models loaded eagerly
    detector = LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
    detector.compute_language_confidence_values("Fast shipment. easy order. excellent costumer care!AAA+++")

    [ConfidenceValue(language=Language.ENGLISH, value=0.4054392221734394),
     ConfidenceValue(language=Language.TAGALOG, value=0.17739618771502366),
     ConfidenceValue(language=Language.FRENCH, value=0.05126428674609979),
     ConfidenceValue(language=Language.DANISH, value=0.045731114924862044),
     ConfidenceValue(language=Language.LATIN, value=0.02444471663598676),
     ConfidenceValue(language=Language.DUTCH, value=0.02432267933976313),
     ConfidenceValue(language=Language.ITALIAN, value=0.020561633912509245)
You can see that the detected language is English, with Tagalog a distant second.
But running the same input with Java always gives me Tagalog first.
Here is the code:
    Set<Language> languages = Sets.newHashSet();
    languages.addAll(Language.all());
    com.github.pemistahl.lingua.api.LanguageDetector LANGUAGE_DETECTOR =
        LanguageDetectorBuilder.fromLanguages(languages.toArray(new Language[0]))
            .build();
    System.out.printf("Lingua detected language: %s \n", LANGUAGE_DETECTOR.detectLanguageOf(text));
    SortedMap<Language, Double> confidenceValues =
        LANGUAGE_DETECTOR.computeLanguageConfidenceValues("Fast shipment. easy order. excellent costumer care!AAA+++");
    System.out.println("Confidence Values: " + confidenceValues);
Output:
    Lingua detected language: TAGALOG
    Confidence Values: {TAGALOG=1.0, ENGLISH=0.9775208830833435, DUTCH=0.91419917345047, DANISH=0.9110179543495178, AFRIKAANS=0.894393265247345, LATIN=0.8940380811691284, FRENCH=0.8823490738868713, YORUBA=0.8764988780021667, MAORI=0.8754469156265259, ITALIAN=0.8731745481491089, NYNORSK=0.8615179061889648, XHOSA=0.8594332337379456, SWEDISH=0.8565084338188171, FINNISH=0.8524059057235718, INDONESIAN=0.8513398766517639, TURKISH=0.8496785759925842, BOKMAL=0.8471575379371643, ESPERANTO=0.8467212915420532, WELSH=0.8453010320663452, GERMAN=0.8415506482124329, SOTHO=0.8362245559692383, SWAHILI=0.8316057324409485, PORTUGUESE=0.830149233341217, MALAY=0.8224523663520813, ICELANDIC=0.8180419206619263, ROMANIAN=0.8172013163566589, SPANISH=0.8142555356025696, BASQUE=0.8130630850791931, ALBANIAN=0.8069103360176086, TSWANA=0.7953811883926392, ZULU=0.79490727186203, ESTONIAN=0.7938821315765381, SLOVAK=0.7922097444534302, GANDA=0.785712480545044, TSONGA=0.7855940461158752, CZECH=0.783637285232544, POLISH=0.7821556329727173, SLOVENE=0.7810076475143433, HUNGARIAN=0.7796627283096313, LITHUANIAN=0.770524799823761, IRISH=0.765510082244873, SHONA=0.76349276304245, AZERBAIJANI=0.7579408884048462, CROATIAN=0.7558900117874146, CATALAN=0.7528392672538757, BOSNIAN=0.7519121766090393, VIETNAMESE=0.750431478023529, SOMALI=0.7487541437149048, LATVIAN=0.7310714721679688}
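To put a number on how close the top two candidates are in each output, here is a minimal sketch that uses only java.util. The class `ConfidenceMargin` and the helper `topTwoMargin` are hypothetical names made up for illustration and are not part of Lingua. The maps hold the top three values copied from the two outputs in this post. The two ports may scale their confidence values differently (the Java values look relative to the top candidate, which is exactly 1.0), so the two margins are probably not directly comparable.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ConfidenceMargin {

    // Hypothetical helper (not a Lingua API): difference between the highest
    // and the second-highest confidence value in the given map.
    static double topTwoMargin(Map<String, Double> confidenceValues) {
        List<Double> sorted = confidenceValues.values().stream()
                .sorted(Comparator.reverseOrder())
                .collect(Collectors.toList());
        if (sorted.size() < 2) {
            throw new IllegalArgumentException("need at least two candidates");
        }
        return sorted.get(0) - sorted.get(1);
    }

    public static void main(String[] args) {
        // Top three values copied from the Java output in this post.
        Map<String, Double> javaValues = Map.of(
                "TAGALOG", 1.0,
                "ENGLISH", 0.9775208830833435,
                "DUTCH", 0.91419917345047);

        // Top three values copied from the Python output in this post.
        Map<String, Double> pythonValues = Map.of(
                "ENGLISH", 0.4054392221734394,
                "TAGALOG", 0.17739618771502366,
                "FRENCH", 0.05126428674609979);

        System.out.println("Java margin:   " + topTwoMargin(javaValues));
        System.out.println("Python margin: " + topTwoMargin(pythonValues));
    }
}
```

With these numbers the Java margin between Tagalog and English is about 0.02, while the Python margin between English and Tagalog is about 0.23. The Java result is therefore a near tie rather than a confident Tagalog detection.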
I also noticed that lingua-py's version is 1.3.2, whereas the latest version available for Java is 1.2.2.
This probably means that the Java version needs to be updated to pick up the new language models. Any plans on doing so?