Not so good by estimating high confidence for random text/gibberish #113

GabrielKesler · 2021-07-28T17:37:46Z

Hi,

I have this code snippet:

LanguageDetectorBuilder
    .fromIsoCodes639_1(IsoCode639_1.EN, IsoCode639_1.NB)
    .withMinimumRelativeDistance(0)
    .build()
    .computeLanguageConfidenceValues(" lsdkaslkldkaskloda dsajpioqwulj sdlkjflksdj wqiupoieq sakjljkas ds;lk;klda qwopipoidsa ;lkfds;lk woowieqp[ fdslkl;asd w[epo[offd lk';sdal'lda pppasoda jkk");

Resulting in this:

{Language@20525} BOKMAL -> {Double@20750} 1.0
{Language@20508} ENGLISH -> {Double@20751} 0.9383189360356875

It seems that this library gives very high confidence values for gibberish/random words, which is unacceptable.
Any suggestions?

pemistahl · 2021-07-28T17:54:49Z

Maintainer

Hi @GabrielKesler, thanks for your question.

Have you read the documentation about the confidence metric? It is a relative metric, i.e. the most likely language always gets the value 1.0, even though the most likely language might actually be unlikely for the given input text. I will think about improving the confidence metric calculation but this is not a trivial task.
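To illustrate why a relative metric always yields 1.0 for the top candidate, here is a minimal sketch. It assumes each language has a summed, strictly negative log-probability and that the relative value is the best score divided by each score. The class name, the helper and the numbers are made up for illustration; this is not confirmed to be Lingua's actual formula.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RelativeConfidence {

    // Hypothetical helper: turns summed log-probabilities (assumed strictly
    // negative) into relative values by dividing the best score by each score.
    public static Map<String, Double> relative(Map<String, Double> logProbs) {
        // The best score is the one closest to zero, i.e. the maximum.
        double best = Double.NEGATIVE_INFINITY;
        for (double value : logProbs.values()) {
            best = Math.max(best, value);
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> entry : logProbs.entrySet()) {
            // The best language divides by itself and therefore gets 1.0.
            result.put(entry.getKey(), best / entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up scores: both are poor fits, yet the better one still gets 1.0.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("BOKMAL", -900.0);
        scores.put("ENGLISH", -960.0);
        System.out.println(relative(scores)); // {BOKMAL=1.0, ENGLISH=0.9375}
    }
}
```

Under this assumption the output says nothing about how well either language fits in absolute terms; it only ranks the candidates against each other.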

What is the point of feeding the language detector with gibberish text anyway? This is a very contrived example. I don't think that the texts you want to classify are of this sort.

2 replies

GabrielKesler Jul 28, 2021
Author

Hi,

Thanks for getting back to me.
I understand, but it still should not result in a 1, considering there is no known word.

I will probably return to your library in the future to check on your progress. Until then, I much appreciate your contributions to the open source community.

Ah, and as for why in the world someone would feed your library with gibberish, here is the answer: when you try to OCR crazy documents or scanned images, you get all sorts of gibberish :)

Cheers,
Gabe!

pemistahl Jul 29, 2021
Maintainer

Perhaps you slightly misunderstand how Lingua works. It doesn't know anything about words; it only knows about ngrams, i.e. sequences of up to five characters.
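As a rough illustration of what "sequences of characters" means here, the sketch below extracts all ngrams of a fixed length from a string. The class and method names are hypothetical and this is not Lingua's own code; it only shows that a made-up "word" still breaks down into ngrams that a character-based model can score.

```java
import java.util.ArrayList;
import java.util.List;

public class NgramSketch {

    // Returns every contiguous character sequence of length n in the text.
    public static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            result.add(text.substring(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        // A gibberish token from the example above still yields trigrams.
        System.out.println(ngrams("sakjljkas", 3));
        // [sak, akj, kjl, jlj, ljk, jka, kas]
    }
}
```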

In your example, you are asking Lingua to give you the most likely language for your text, deciding only between Bokmal and English. It tells you that Bokmal is more likely than English by assigning it the value 1.0, based on the probabilities for the observed ngrams.

What you want is an absolute confidence metric telling you that both Bokmal and English are rather unlikely for the given text. This is difficult but surely not impossible to implement. Until a solution for this has been implemented, Lingua probably is not the right tool for you to decide whether a scanned text is gibberish or not.
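One conceivable shape for such an absolute check, sketched under stated assumptions: take the summed log-probability of the best language, average it over the number of ngrams, and compare it with a threshold tuned on real text. Everything here is hypothetical, including the names, the threshold and the numbers; the inputs are not something the calls shown in this thread return.

```java
public class AbsoluteCheck {

    // Hypothetical: average log-probability per ngram; closer to zero means a better fit.
    public static double averageLogProb(double summedLogProb, int ngramCount) {
        if (ngramCount <= 0) {
            throw new IllegalArgumentException("ngramCount must be positive");
        }
        return summedLogProb / ngramCount;
    }

    // Hypothetical threshold test; the threshold would need tuning on real text.
    public static boolean looksPlausible(double summedLogProb, int ngramCount, double threshold) {
        return averageLogProb(summedLogProb, ngramCount) >= threshold;
    }

    public static void main(String[] args) {
        // Made-up numbers: an average of -9.0 falls below the -6.0 threshold.
        System.out.println(looksPlausible(-900.0, 100, -6.0)); // false
        // An average of -4.5 is above the threshold.
        System.out.println(looksPlausible(-450.0, 100, -6.0)); // true
    }
}
```

Averaging per ngram keeps long and short texts comparable, which a raw sum would not.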
