Uzbek cyrillic being thrown away #14

Open
ZJaume opened this issue Nov 8, 2023 · 2 comments

ZJaume commented Nov 8, 2023

Noticed that for most HPLT documents that CLD2 identifies as Uzbek and that are written in Cyrillic, fastText labels the sentences as other Cyrillic-script languages such as ru, kk, tt, ug, az. The list of possible cases is large, so this language may need a special mode where we simply check the Cyrillic and Latin Uzbek dictionaries and, if the error is less than 30%, keep it as uz.

There is one dictionary for both scripts here: https://github.com/u2b3k/uz-hunspell
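A minimal sketch of the dictionary check proposed above, not an implementation from this repository. It assumes the Cyrillic and Latin word lists are available as plain Hunspell `.dic` files (the paths are left to the caller, since the file names in the linked repository are not stated here), that "error" means the fraction of tokens not found in the word list, and that a text passes if either script's list gives an error below 30%. The function names `load_dic`, `error_rate` and `keep_as_uzbek` are hypothetical. Affix rules are ignored, so inflected forms will count as errors more often than with a real Hunspell lookup.

```python
import re

# A token is a run of letters, optionally joined by apostrophe-like marks
# (Uzbek Latin uses them in words such as o'zbek).
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['ʻʼ’][^\W\d_]+)*")


def load_dic(path):
    """Load stems from a Hunspell .dic file, dropping affix flags after '/'."""
    words = set()
    with open(path, encoding="utf-8") as f:
        next(f, None)  # the first line of a .dic file is the entry count
        for line in f:
            stem = line.strip().split("/", 1)[0]
            if stem:
                words.add(stem.lower())
    return words


def error_rate(text, vocab):
    """Fraction of tokens in `text` that are missing from `vocab`."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return 1.0  # nothing to check, so treat as a full miss
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)


def keep_as_uzbek(text, cyr_vocab, lat_vocab, max_error=0.30):
    """Keep the text as 'uz' if either dictionary gives an error below the threshold."""
    best = min(error_rate(text, cyr_vocab), error_rate(text, lat_vocab))
    return best < max_error
```

Taking the minimum over both scripts means a text only needs to match one of the two dictionaries, which fits the idea of a single special mode covering Cyrillic and Latin Uzbek.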

mbanon self-assigned this Nov 9, 2023

mbanon commented Nov 14, 2024

@ZJaume Hello, one year after the issue was opened xD Do we have any samples of Cyrillic Uzbek? The current HPLT v2 only has Uzbek in Latin script (uzn_Latn).


ZJaume commented Nov 14, 2024

For example

zstdcat uz_1.jsonl.zst | grep '"tt"'

from v1.2 will give many of those.
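If the plain grep is too noisy, a small filter can pull out only the Cyrillic paragraphs tagged with a given language. This is a sketch under assumptions: it supposes each JSON line carries a list of per-paragraph language tags under `langs`, aligned with the newline-separated paragraphs in `text`. Both key names are guesses about the v1.2 layout and are exposed as parameters; `is_mostly_cyrillic` and `cyrillic_samples` are hypothetical names.

```python
import json


def is_mostly_cyrillic(text, min_ratio=0.5):
    """True if at least `min_ratio` of the letters are in the basic Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    cyr = sum(1 for c in letters if "\u0400" <= c <= "\u04FF")
    return cyr / len(letters) >= min_ratio


def cyrillic_samples(lines, lang="tt", langs_key="langs", text_key="text"):
    """Yield paragraphs tagged `lang` that are mostly Cyrillic.

    Assumes (not confirmed here) that doc[langs_key] is a list of tags
    aligned with the newline-separated paragraphs of doc[text_key].
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        langs = doc.get(langs_key) or []
        paragraphs = (doc.get(text_key) or "").split("\n")
        for tag, para in zip(langs, paragraphs):
            if tag == lang and is_mostly_cyrillic(para):
                yield para
```

In a script fed by `zstdcat uz_1.jsonl.zst`, passing `sys.stdin` as `lines` would print candidates one paragraph at a time.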

I'm not really sure CLD2 is correct here, but according to Wikipedia, although Latin has been the official script since '92, Cyrillic is still widespread.
