add source #3

Open
kargaranamir opened this issue Mar 14, 2024 · 6 comments
Assignees
Labels
improvement New feature or request

Comments

@kargaranamir
Member

kargaranamir commented Mar 14, 2024

Group A: Please add here any ideas for obtaining cleaner sources and evaluation data.

Group B: Please add any possible new sources here, especially those covering languages not yet included.

@kargaranamir
Member Author

kargaranamir commented Mar 14, 2024

Group A:

@kargaranamir kargaranamir self-assigned this Mar 14, 2024
@kargaranamir kargaranamir added the improvement New feature or request label Mar 14, 2024
@kargaranamir
Member Author

kargaranamir commented Apr 8, 2024

Group B:

@kargaranamir kargaranamir changed the title Source inspection add source. Apr 18, 2024
@kargaranamir kargaranamir changed the title add source. add source Apr 18, 2024
@MedAymenF

Group B:

* add domain and multiple langs from [Pontoon-Translations](https://huggingface.co/datasets/ayymen/Pontoon-Translations): cleaning is a bit challenging

Are you talking about cleaning the data itself or the metadata (lang codes)?
I intend to release new versions of both Pontoon Translations and Weblate Translations (which has more languages, BTW, but is probably lower quality for LID), but I'm not really sure how I'm going to fix the lang codes.

@kargaranamir
Member Author

Group B:

* add domain and multiple langs from [Pontoon-Translations](https://huggingface.co/datasets/ayymen/Pontoon-Translations): cleaning is a bit challenging

Are you talking about cleaning the data itself or the metadata (lang codes)? I intend to release new versions of both Pontoon Translations and Weblate Translations (which has more languages, BTW, but is probably lower quality for LID), but I'm not really sure how I'm going to fix the lang codes.

About the cleaning: I meant the tags and placeholders, such as <playIcon> or {$goal}. For LID these should be removed; otherwise the model learns bad features. It's not too difficult, but it should be done. I will check your HF page every once in a while to see if you publish anything new.
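
A minimal sketch of that cleaning step, assuming the markup is limited to XML-like tags (e.g. `<playIcon>`) and Fluent-style placeholders (e.g. `{$goal}`). The function name `strip_markup` and the regular expressions are illustrative assumptions, not something taken from the dataset or from GlotLID; real Pontoon strings may need more patterns.

```python
import re

# Assumed patterns (not from the thread):
# - XML-like opening/closing tags such as <playIcon> or </playIcon>
# - Fluent-style placeholders such as {$goal} or { $goal }
TAG_RE = re.compile(r"</?[A-Za-z][\w.-]*[^<>]*>")
PLACEHOLDER_RE = re.compile(r"\{\s*\$[\w-]+\s*\}")
WHITESPACE_RE = re.compile(r"\s+")


def strip_markup(text: str) -> str:
    """Remove tags and placeholders, then collapse leftover whitespace."""
    text = TAG_RE.sub(" ", text)
    text = PLACEHOLDER_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "Click <playIcon></playIcon> to reach {$goal} points"
    print(strip_markup(sample))
```

Replacing each match with a space (rather than deleting it) avoids gluing neighbouring words together; the final whitespace pass then cleans up the extra spaces. A bare `<` that is not followed by a letter, as in `a < b`, is left untouched.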

@laubonghaudoi

Can you clarify why facebookresearch/flores#61 is solved? I don't see any update in their data.

@kargaranamir
Member Author

@laubonghaudoi For my project (GlotLID), the issue is resolved because I removed yue from my Flores benchmark. GlotLID is a project that trains a better language identification system, and Flores-200 is one of the benchmarks I used.

But to answer your question in general: this issue is not resolved at its root in Flores-200. A separate project was created to maintain Flores, https://github.com/openlanguagedata/flores, but it does not address this issue either! Someone may need to raise the issue again in the new project.

3 participants