use LanguageDetector\LanguageDetector; // composer require landrok/language-detector

$text = "Magnums & Movies - A Fish Called Wanda";
$detector = new LanguageDetector();
$langCode = (string)$detector->evaluate($text)->getLanguage();
= af, Afrikaans
Is A Fish Called Wanda in English or Afrikaans?
I noticed something similar when I saw that "New Order" is reported as da (Danish). Given the approach being used, some trouble with very short texts like that is expected, but it got me looking into this a bit more.
I dug into the scores exposed by LanguageDetector::getScores(). They're all between 0 and 1. Based on a quick examination of the code, it looks like it's doing a frequency analysis on the n-grams in the provided text and comparing that to stored samples for each language.
For a larger sample of text (e.g., the first paragraph of the Wikipedia article on kurtosis), they tend to look like this:
en => double(0.70911276841421)
it => double(0.6746941863833)
es => double(0.67450761486222)
fr => double(0.6542429903409)
pt => double(0.65338632087529)
af => double(0.65263502621527)
nl => double(0.64689347669993)
tl => double(0.64540757969025)
da => double(0.64057588200622)
no => double(0.63495307916678)
So the scores are saying "this text is 70.91% similar to English, 67.47% similar to Italian," etc. If you only take the ->getLanguage() result to get the highest-scoring language, you miss the fact that the top ten results all sit within roughly 7 percentage points of each other, and that the gap between first and second place is only about 3.4 points. Based on that, it's hard to have confidence in the result.
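To make that concrete, here is a rough sketch of how a caller could check the gap between the top two scores before trusting the winner. The `summarizeScores()` helper and the 0.05 threshold are my own assumptions for illustration, not part of the library; the input is a plain array shaped like the dump above (language code => score).

```php
<?php
// Hypothetical helper, not part of landrok/language-detector.
// Takes a [languageCode => score] array and reports the top two
// languages plus the margin between their scores.
function summarizeScores(array $scores): array
{
    if (count($scores) < 2) {
        throw new InvalidArgumentException('Need at least two language scores.');
    }

    arsort($scores); // sort by score, highest first, keeping the language keys

    $codes = array_keys($scores);
    $values = array_values($scores);

    return [
        'best' => $codes[0],
        'runnerUp' => $codes[1],
        'margin' => $values[0] - $values[1],
    ];
}

// Top three scores copied from the kurtosis example above, deliberately unsorted.
$scores = [
    'it' => 0.6746941863833,
    'en' => 0.70911276841421,
    'es' => 0.67450761486222,
];

$summary = summarizeScores($scores);

// Assumed cut-off for illustration only; a sensible value would need tuning.
$confident = $summary['margin'] >= 0.05;

printf(
    "%s vs %s, margin %.4f, confident: %s\n",
    $summary['best'],
    $summary['runnerUp'],
    $summary['margin'],
    $confident ? 'yes' : 'no'
);
```

With the real library, the array would presumably come from `$detector->evaluate($text)->getScores()`, assuming it returns the same code => score shape shown in the dump. A margin check like this doesn't make the detection any more accurate; it only lets the caller decline to guess when the top candidates are nearly tied.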
The code is well-written, and the API is very usable, so I like this library a lot. It would be great if the reliability could be improved. Unfortunately, because only the processed frequency tables are included, the ability for third parties to explore modifications is extremely limited.
Perhaps at some point in the future, the individual language samples used to build the tables could be exposed as a separate repository?