
Doesn't seem so accurate #11

Open
gidzr opened this issue Jan 6, 2024 · 1 comment

Comments


gidzr commented Jan 6, 2024

  require 'vendor/autoload.php';   // composer require landrok/language-detector

  use LanguageDetector\LanguageDetector;

  $text = "Magnums & Movies - A Fish Called Wanda";
  $detector = new LanguageDetector();
  $langCode = (string) $detector->evaluate($text)->getLanguage();

= af, Afrikaans

Is A Fish Called Wanda in English or Afrikaans?


jdwx commented Jan 17, 2025

Noticed something similar when I saw "New Order" reported as da, Danish. Given the approach being used, some trouble with very short texts like that is expected, but it got me looking into this a bit more.

I dug into the scores exposed by LanguageDetector::getScores(). They're all between 0 and 1. Based on a quick examination of the code, it looks like it's doing a frequency analysis on the n-grams in the provided text and comparing that to stored samples for each language.
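To make the idea concrete, here is a minimal sketch of that kind of n-gram frequency comparison, in Python for illustration (the library itself is PHP; the n-gram size, the overlap-based scoring formula, and the tiny sample "profiles" below are my assumptions, not the library's actual tables or math):

```python
from collections import Counter

def ngram_freqs(text: str, n: int = 3) -> Counter:
    """Relative frequencies of character n-grams in text."""
    text = text.lower()
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams) or 1
    return Counter({g: c / total for g, c in Counter(grams).items()})

def similarity(a: Counter, b: Counter) -> float:
    """Overlap score in [0, 1]: frequency mass shared by the two profiles."""
    return sum(min(a[g], b[g]) for g in a.keys() & b.keys())

# Toy "stored profiles" built from tiny samples; real detectors use large corpora.
profiles = {
    "en": ngram_freqs("the quick brown fox jumps over the lazy dog"),
    "af": ngram_freqs("die vinnige bruin jakkals spring oor die lui hond"),
}

text = "A Fish Called Wanda"
scores = {lang: similarity(ngram_freqs(text), p) for lang, p in profiles.items()}
best = max(scores, key=scores.get)
```

With samples this small the scores are noisy and close together, which is exactly the failure mode on short titles: the "winning" language may beat the runner-up by almost nothing.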

For a larger sample of text (e.g., the first paragraph of the Wikipedia article on kurtosis), they tend to look like this:

en => double(0.70911276841421)
it => double(0.6746941863833)
es => double(0.67450761486222)
fr => double(0.6542429903409)
pt => double(0.65338632087529)
af => double(0.65263502621527)
nl => double(0.64689347669993)
tl => double(0.64540757969025)
da => double(0.64057588200622)
no => double(0.63495307916678)

So the scores are saying "this text is 70.91% similar to English, 67.47% similar to Italian," and so on. If you just take the ->getLanguage() result to get the highest-scoring language, you miss the information that the top 10+ results are all within a few percentage points of each other. Given that, it's hard to have confidence in the result.
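One way a caller could surface that ambiguity today is to require a minimum margin between the top two scores before trusting the answer. A sketch, in Python for illustration (the real library's getScores() is PHP and returns a language-to-score map; the 0.03 threshold here is an arbitrary assumption):

```python
def classify_with_margin(scores: dict[str, float], min_margin: float = 0.03):
    """Return the top language only if it beats the runner-up by min_margin,
    otherwise None to signal an unreliable result."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top_lang, top_score), (_, second_score) = ranked[0], ranked[1]
    return top_lang if top_score - second_score >= min_margin else None

# Scores from the kurtosis-paragraph example above (abbreviated)
scores = {"en": 0.70911, "it": 0.67469, "es": 0.67451, "da": 0.64058}
classify_with_margin(scores)  # "en": its margin over "it" is ~0.034
```

On short strings like "New Order" the top scores cluster so tightly that this check would return None rather than a confident-looking wrong answer.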

The code is well-written, and the API is very usable, so I like this library a lot. It would be great if the reliability could be improved. Unfortunately, because only the processed frequency tables are included, the ability for third parties to explore modifications is extremely limited.

Perhaps at some point in the future, the individual language samples used to build the tables could be exposed as a separate repository?
