Investigate other language detection solutions #656
Comments
I have been tinkering around with Lingua recently, and with all the languages originally covered by #104, the memory usage for running Lingua on its own is ~100MB with the models for all these languages preloaded (see the sketch after this comment). That is, memory usage only increases temporarily during detection. When used with Arisa it would probably be higher, since the other modules of Arisa also take up some memory. Is that acceptable for use with Arisa? No worries if that is still too high and Lingua is still not an option; it was definitely not wasted time for me. These Lingua changes can be found here (the JitPack build should hopefully succeed, but I have not tested it). However, these changes are not in a state in which they could be submitted upstream (tests won't compile, the Git history is messy), and I am not sure whether they would even be accepted, given the extensive refactoring involved. Accuracy seems to be roughly the same as upstream. I have been testing it on a few reports and the results seem to be fairly good. However, there are a few things to consider:
Additionally, the notes from #60 and #104 are likely still relevant (some of this is covered by the points above).
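For reference, this is roughly how preloading works with the upstream Lingua Kotlin API; a minimal sketch, assuming a fixed language set (the languages shown are illustrative, and the fork's API may differ):

```kotlin
import com.github.pemistahl.lingua.api.Language
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder

fun main() {
    // Build a detector for a fixed set of languages and load their models up
    // front, so memory usage stays roughly constant during detection.
    val detector = LanguageDetectorBuilder
        .fromLanguages(Language.ENGLISH, Language.GERMAN, Language.JAPANESE)
        .withPreloadedLanguageModels()
        .build()

    // Relative confidence values per language, useful for spotting ambiguous
    // (possibly mixed-language) reports instead of trusting a single answer.
    val confidences = detector.computeLanguageConfidenceValues("Der Block wird nicht platziert")
    println(confidences)
}
```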
100MB should be fine. Our main problem was that it was using more than 2GB, which meant GitHub Actions couldn't run it.
Hmm. I'm not sure whether it's really worth it to maintain a separate fork of lingua just for our purposes. In regard to accuracy, I've tested the few examples from your comment with whatlang too, and it appeared to get those right. The additional advantage of whatlang is that we don't need to interpret the results ourselves -- whatlang will straight up tell us whether its results can be considered reliable or not. I haven't tinkered with whatlang too much yet, so I can't really say how much memory it would need compared to lingua, but to me it seems like a more straightforward approach that would also require less maintenance. Nevertheless, getting lingua from gigabytes of memory usage down to only 100MB is really impressive, good job!
The issue with this is that we don't get a chance to find out whether a report consists of multiple languages, so we have to trust Whatlang to pick the correct language. Additionally, it might not work well for languages with logographic or otherwise dense scripts (Chinese, Japanese, Korean) when the text also contains a few sentences in another language, preferring that other language. For example, MC-228001 is detected as English with a confidence of 100%. However, that might not actually be a problem, because such reports are likely rare and it would only result in a false negative. On the other hand, in some cases it seems to be more accurate than Lingua. For example, MC-212097 (including the summary!) is not reliably detected by Whatlang, while Lingua (at least with my changes) is rather certain that it is Italian. Here is a query for some potentially interesting reports (it also contains some short texts which are currently ignored by Arisa).
Yeah, that's true, whatlang takes a fairly naïve approach when it comes to mixed languages. For example, MC-228001 gets correctly detected as Japanese once there are as many Japanese characters as there are Latin characters. Very interesting that it flips from 100% confidence English to 100% confidence Japanese when you do this. I think this is actually an advantage: in the case of mixed English/Chinese (for example MC-227856), whatlang will decide that it's English simply because English uses more letters. This avoids false positives, which is very good.

I've looked through your filter and, from what I can tell, whatlang generally appears to make the right call. I don't have lingua set up to compare against, though. Mixed languages actually happen relatively frequently on the bug tracker: people who aren't sure of their English skills will simply add the same text in their native tongue as well.

IMO the main advantage of changing language detection solutions is that we'll be able to expand the module to work on ticket updates as well, instead of just ticket creations. Then we can also make Arisa reopen bug reports once the reporter has translated them (currently we just tell them to file a new ticket). From that perspective, mixed languages in tickets will become more relevant and important to consider, since it's likely that users will just append a translation to the description of the bug report.

I'd propose that we try implementing both libraries in the background, while still using Dandelion in the meantime, just to collect some data on where the three approaches differ in practice. Perhaps we could even combine both, e.g. if lingua doesn't give crystal-clear results, let whatlang make the final call, or vice versa (see the sketch after this comment).

Edit: For things like MC-227773, I wonder if it would make sense to exclude some phrases (e.g. the template in that report) from the text that gets sent over to lingua/whatlang/dandelion? That way we could get rid of some false negatives.

Edit 2: Another interesting example: MC-227132 -- whatlang is very hesitant to detect this as French; you basically need to delete every English word for it to do so. Also keep in mind that what I said above about Japanese/Chinese does not hold for languages that don't use such "condensed" characters, like Turkish: MC-227029 doesn't get detected correctly. I'm wondering whether something like a minimum length would still be required?
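A minimal sketch of what such a combined approach could look like, assuming a common wrapper interface around the backends; all names here (DetectionResult, LanguageDetectionBackend, detectCombined, the threshold) are illustrative, not existing Arisa code:

```kotlin
// Hypothetical common result and backend abstraction for Lingua, whatlang
// (via bindings), or Dandelion.
data class DetectionResult(val language: String, val confidence: Double, val reliable: Boolean)

interface LanguageDetectionBackend {
    fun detect(text: String): DetectionResult?
}

fun detectCombined(
    text: String,
    primary: LanguageDetectionBackend,
    fallback: LanguageDetectionBackend,
    templatePhrases: List<String>,
    confidenceThreshold: Double = 0.9
): DetectionResult? {
    // Strip known template phrases (e.g. unmodified report templates) before
    // detection to reduce false negatives on otherwise short foreign-language text.
    val cleaned = templatePhrases.fold(text) { acc, phrase -> acc.replace(phrase, "") }.trim()

    val first = primary.detect(cleaned)
    // Only trust the primary backend when it is both reliable and confident;
    // otherwise let the second backend make the final call.
    return if (first != null && first.reliable && first.confidence >= confidenceThreshold) {
        first
    } else {
        fallback.detect(cleaned) ?: first
    }
}
```

Running both backends behind such an interface while Dandelion stays authoritative would make it easy to log where the three approaches disagree before committing to either one.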
Btw you should PR your changes to Lingua :D
The Problem
Currently, we're using dandelion.eu for language detection. This has the major disadvantage that we need to send all public bug reports to another server to have them analyzed there. Additionally, there's a limit on how many requests we can send to the dandelion API.
Both of these things are suboptimal, and it would be best not to rely on a third-party service for language detection at all. If we got rid of that dependency, we would also be able to detect the language of private tickets, which would be very helpful.
We initially tried to do language detection directly in Arisa (see #60 and #104), but quickly noticed that the library we used (lingua) needed way too much memory.
We only have limited resources on our server, so we need to be careful about that or get a better server.
Possible Solutions
At the moment, I see the following possibilities:
It seems to be fairly trivial to use Rust crates together with Kotlin, even though it introduces some complications in the build process.
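As a rough illustration of what calling the whatlang crate from Kotlin could look like, here is a hedged JNI sketch; the native library name, the exported function, and its behaviour are all assumptions, since no such binding exists in the project yet:

```kotlin
// Hypothetical JNI bridge to the whatlang Rust crate. The library name
// ("whatlang_jni") and the exported function are assumptions; the Rust side
// would be compiled as a cdylib exposing a matching JNI symbol.
object WhatlangBridge {
    init {
        System.loadLibrary("whatlang_jni")
    }

    // Returns an ISO 639-3 code such as "eng", or null when whatlang does not
    // consider its own result reliable.
    external fun detectReliableLanguage(text: String): String?
}

fun main() {
    println(WhatlangBridge.detectReliableLanguage("Das Spiel stürzt beim Start ab"))
}
```

The build-process complications mentioned above would mostly consist of compiling the Rust side per target platform and bundling the resulting native library with the JVM artifact.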