Repetition of words causes detection error #55

Open
joewong826 opened this issue Jun 30, 2016 · 2 comments

Comments

@joewong826

joewong826 commented Jun 30, 2016

When I pass in a string like 'hello world hello world hello world', langid fails to identify it as English text:
>>> import langid
>>> langid.classify('hello world hello world hello world')
('af', 0.683057652874482)

@saffsd
Owner

saffsd commented Jul 5, 2016

Thanks for getting in touch! This is an interesting one!

>>> hello world
(array([1426, 1428, 2273, 3948]),)
[1 1 1 1]
('en', -23.719746112823486)
>>> hello world hello world
(array([1339, 1426, 1428, 2273, 3948]),)
[1 2 2 2 2]
('en', -62.565943241119385)
>>> hello world hello world hello world
(array([1339, 1426, 1428, 2273, 3948]),)
[2 3 3 3 3]
('af', -100.6344223022461)
>>> ld 
(array([1339]),)
[1]
('en', 2.9972290992736816)

The issue is that in the training data, the pattern "ld " must be more strongly associated with Afrikaans than with English, especially when considered together with the other patterns in "hello world".
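
As a rough illustration of how repetition can flip the result, here is a toy linear scorer in which each language's score is a prior plus the sum of feature count times feature weight. All weights and priors below are made up for the sketch and are not taken from langid's model; the feature names `ld_space` and `other` are hypothetical stand-ins for the "ld " pattern and the remaining patterns. The counts mirror the output above: repeating the phrase n times gives n occurrences of the ordinary patterns but only n - 1 of "ld ", since the last "world" has no trailing space.

```python
# Toy illustration only: these weights and priors are invented for the
# example and do NOT come from langid's trained model.
WEIGHTS = {
    "en": {"ld_space": -3.0, "other": -2.0},
    "af": {"ld_space": -1.0, "other": -2.5},
}
PRIOR = {"en": -1.0, "af": -2.5}


def score(counts, lang):
    """Linear score: prior plus the sum of count * weight per feature."""
    return PRIOR[lang] + sum(n * WEIGHTS[lang][feat] for feat, n in counts.items())


def classify(counts):
    """Return the language with the highest score."""
    return max(WEIGHTS, key=lambda lang: score(counts, lang))


def counts_for(repeats):
    """Feature counts for 'hello world' repeated `repeats` times."""
    return {"ld_space": repeats - 1, "other": repeats}


results = {n: classify(counts_for(n)) for n in (1, 2, 3)}
print(results)
```

With these invented numbers, the prior and the ordinary patterns favour English, but every extra repetition adds another occurrence of the Afrikaans-leaning feature, so the winner changes at the third repetition.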

Unfortunately, there's no easy fix for this. Is this a problem in a real use case for you?

@joewong826
Author

Not yet. But my code using langid might process millions of texts, and I cannot guarantee that no extreme cases like this one will turn up.
That said, I have to admit such cases may never actually occur. If there's no easy fix, then not fixing it is fine. Thank you for your patience!
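
One possible workaround on the caller's side, sketched here as an assumption rather than anything langid provides: collapse immediately repeated word sequences before classifying. The helper name `collapse_repeats` is hypothetical. According to the output earlier in this thread, the unrepeated phrase 'hello world' was classified as 'en', so this kind of preprocessing would sidestep this particular input, though it does not address the underlying model behaviour.

```python
import re

# Matches a run of whole words that is immediately repeated one or more
# times, separated by whitespace, e.g. "hello world hello world".
_REPEAT = re.compile(r"\b(\w+(?:\s+\w+)*?)(?:\s+\1\b)+")


def collapse_repeats(text):
    """Collapse back-to-back repeated word sequences into one copy.

    Hypothetical preprocessing helper, not part of langid. Backtracking
    makes this slow on very long inputs, so treat it as a sketch.
    """
    return _REPEAT.sub(r"\1", text)


collapsed = collapse_repeats("hello world hello world hello world")
print(collapsed)
```

The collapsed string can then be passed to `langid.classify` as usual. Note that this also shortens legitimate repetition such as "very very good", which may or may not matter for a given application.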
