
Doesn't seem so accurate #11

Open
gidzr opened this issue Jan 6, 2024 · 1 comment

Comments


gidzr commented Jan 6, 2024

  require 'vendor/autoload.php';   // composer require landrok/language-detector

  use LanguageDetector\LanguageDetector;

  $text = "Magnums & Movies - A Fish Called Wanda";
  $detector = new LanguageDetector();
  $langCode = (string) $detector->evaluate($text)->getLanguage();

= af, Afrikaans

Is A Fish Called Wanda in English or Afrikaans?


jdwx commented Jan 17, 2025

Noticed something similar when I saw "New Order" reported as da, Danish. Given the approach being used, some trouble with very short texts like that is expected, but it got me looking into this a bit more.

I dug into the scores exposed by LanguageDetector::getScores(). They're all between 0 and 1. Based on a quick examination of the code, it looks like it's doing a frequency analysis on the n-grams in the provided text and comparing that to stored samples for each language.
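To make the idea concrete, here is a minimal sketch of that kind of n-gram frequency comparison, in Python for illustration (the library itself is PHP; the n-gram size, the overlap-based scoring formula, and the tiny sample "profiles" below are my assumptions, not the library's actual tables or math):

```python
from collections import Counter

def ngram_freqs(text: str, n: int = 3) -> Counter:
    """Relative frequencies of character n-grams in text."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams) or 1
    return Counter({g: c / total for g, c in Counter(grams).items()})

def similarity(a: Counter, b: Counter) -> float:
    """Overlap score in [0, 1]: frequency mass shared by the two profiles."""
    return sum(min(a[g], b[g]) for g in a.keys() & b.keys())

# Toy "stored profiles" built from tiny samples; real detectors use large corpora.
profiles = {
    "en": ngram_freqs("the quick brown fox jumps over the lazy dog"),
    "af": ngram_freqs("die vinnige bruin jakkals spring oor die lui hond"),
}

text = "A Fish Called Wanda"
scores = {lang: similarity(ngram_freqs(text), p) for lang, p in profiles.items()}
best = max(scores, key=scores.get)
```

With samples this small the scores are noisy and close together, which is exactly the failure mode on short titles: the "winning" language may beat the runner-up by almost nothing.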

For a larger sample of text (e.g., the first paragraph of the Wikipedia article on kurtosis), they tend to look like this:

en => double(0.70911276841421)
it => double(0.6746941863833)
es => double(0.67450761486222)
fr => double(0.6542429903409)
pt => double(0.65338632087529)
af => double(0.65263502621527)
nl => double(0.64689347669993)
tl => double(0.64540757969025)
da => double(0.64057588200622)
no => double(0.63495307916678)

So the scores are saying "this text is 70.91% similar to English, 67.47% similar to Italian," and so on. If you just take the ->getLanguage() result to get the highest-scoring language, you miss the information that the top 10+ results are all within a few percentage points of each other. Given that, it's hard to have confidence in the result.
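One way a caller could surface that ambiguity today is to require a minimum margin between the top two scores before trusting the answer. A sketch, in Python for illustration (the real library's getScores() is PHP and returns a language-to-score map; the 0.03 threshold here is an arbitrary assumption):

```python
def classify_with_margin(scores: dict[str, float], min_margin: float = 0.03):
    """Return the top language only if it beats the runner-up by min_margin,
    otherwise None to signal an unreliable result."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_lang, top_score), (_, second_score) = ranked[0], ranked[1]
    return top_lang if top_score - second_score >= min_margin else None

# Scores from the kurtosis-paragraph example above (abbreviated)
scores = {"en": 0.70911, "it": 0.67469, "es": 0.67451, "da": 0.64058}
classify_with_margin(scores)  # "en": its margin over "it" is ~0.034
```

On short strings like "New Order" the top scores cluster so tightly that this check would return None rather than a confident-looking wrong answer.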

The code is well-written, and the API is very usable, so I like this library a lot. It would be great if the reliability could be improved. Unfortunately, because only the processed frequency tables are included, the ability for third parties to explore modifications is extremely limited.

Perhaps at some point in the future, the individual language samples used to build the tables could be exposed as a separate repository?
