Hi!
We're using your wonderful language detection package :)
We're consuming version 1.0.0. However, it seems that Turkish is causing a lag:
Took 0.982967443 seconds to detect language of DANISH
Took 0.013375939 seconds to detect language of GERMAN
Took 0.00895221 seconds to detect language of ENGLISH
Took 0.004931715 seconds to detect language of SPANISH
Took 0.007953544 seconds to detect language of FRENCH
Took 0.004886938 seconds to detect language of ITALIAN
Took 0.003902518 seconds to detect language of JAPANESE
Took 0.002207307 seconds to detect language of KOREAN
Took 0.01076291 seconds to detect language of MALAY
Took 0.004513402 seconds to detect language of DUTCH
Took 0.005380963 seconds to detect language of NORWEGIAN
Took 0.009575043 seconds to detect language of POLISH
Took 0.004399465 seconds to detect language of PORTUGUESE
Took 0.00420948 seconds to detect language of SWEDISH
Took 0.001934336 seconds to detect language of THAI
Took 6.708378177 seconds to detect language of TURKISH
Took 8.58397E-4 seconds to detect language of CHINESE_SIMPLIFIED
Took 7.28538E-4 seconds to detect language of CHINESE_TRADITIONAL
Any idea what is causing this issue?
Thank you!
Elisheva
-
Hi Elisheva, thanks for using my library.
No. How could I? You haven't shown me a single line of your code, so I'm not able to help you. But if you do, there will be a chance. :)
-
Hi! We're building the detector like this: [code block lost in the page export] and calling it using: [code block lost in the page export] and that's it :) Thanks
-
Alright, but this is still too little information. Please show me the code of your benchmark, too. And the content of your text files, if possible.
-
Hi, we're running this: [code block lost in the page export]
Lingua is struggling to detect a particular Turkish string: "Bu yerli bir metin dizesinde bir dil bulmak yeteneği bizim doğrulamak gereken bir testtir. Ben iyi çalışıyor umuyoruz." (roughly: "This is a test to verify our ability to find a language in a native text string. We hope it works well.") The other sentences are detected in less than a second. Thanks!
-
Ah, now I think I know what's going on. It takes longer the first time because the library loads the language models lazily into memory, i.e. only on demand. If you load all language models beforehand, these performance differences will go away. This behavior is covered in the documentation. Build your language detector like so:
```java
LanguageDetector detector = LanguageDetectorBuilder
    .fromAllLanguages()
    .withPreloadedLanguageModels() // this method loads all models eagerly
    .build();
```
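To illustrate the effect, here is a minimal, self-contained sketch; `detectLanguageOf` is part of lingua's public API, while the class name, sample sentence, and timing code are only illustrative:
```java
import com.github.pemistahl.lingua.api.Language;
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

public class PreloadDemo {
    public static void main(String[] args) {
        // With eager preloading, the model loading cost is paid once at
        // build time, so even the first detection call is fast.
        LanguageDetector detector = LanguageDetectorBuilder
                .fromAllLanguages()
                .withPreloadedLanguageModels()
                .build();

        String text = "Bu yerli bir metin dizesidir."; // placeholder sentence
        long start = System.nanoTime();
        Language language = detector.detectLanguageOf(text);
        System.out.printf("detected %s in %.3f s%n",
                language, (System.nanoTime() - start) / 1e9);
    }
}
```
With the default lazy loading, the first call for a not-yet-loaded model additionally pays the loading cost, which is the explanation given above for the slow outliers in the benchmark.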
-
ok, thanks!
-
Yes. That's why you should think about using only a subset of the supported languages for your task. It's very likely that you don't need all 75 languages for detecting the languages of your data. Alternatively, simply leave the current lazy loading as it is.
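For example, a detector restricted to a handful of expected languages could be built like this; the particular language set is only an illustration:
```java
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;
import static com.github.pemistahl.lingua.api.Language.*;

// Only the models for these four languages are ever loaded,
// which keeps both startup time and memory usage down.
LanguageDetector detector = LanguageDetectorBuilder
        .fromLanguages(ENGLISH, GERMAN, SPANISH, TURKISH)
        .build();
```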
-
We use lingua in an SDK that is in use by many services in my team.
-
No, the detection won't be slower. If you load all language models at once, they will just consume more memory. For version 1.2.0, I'm currently working on improving performance and reducing the memory footprint, so this will become even better in the future.
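If you want to verify the memory trade-off yourself, a rough JVM-level sketch follows; the numbers are only indicative, since they measure the whole heap rather than the models alone, and the class name is a placeholder:
```java
import com.github.pemistahl.lingua.api.LanguageDetector;
import com.github.pemistahl.lingua.api.LanguageDetectorBuilder;

public class MemoryFootprintDemo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // best-effort hint to reduce noise in the measurement
        long before = rt.totalMemory() - rt.freeMemory();

        // Eagerly load every language model so the difference is visible.
        LanguageDetector detector = LanguageDetectorBuilder
                .fromAllLanguages()
                .withPreloadedLanguageModels()
                .build();

        rt.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.printf("approx. heap growth after preloading: %d MiB%n",
                (after - before) / (1024 * 1024));

        // Use the detector so the loaded models stay reachable.
        System.out.println(detector.detectLanguageOf("languages are awesome"));
    }
}
```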