Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Group semantics regression on FLORES 200 #295

Closed
mkuchnik opened this issue Nov 3, 2023 · 3 comments · Fixed by #309
Closed

Group semantics regression on FLORES 200 #295

mkuchnik opened this issue Nov 3, 2023 · 3 comments · Fixed by #309
Labels
bug Something isn't working

Comments

@mkuchnik
Copy link
Contributor

mkuchnik commented Nov 3, 2023

FLORES 200 was working on 0c66893 but it is different on main (487ec5c).

Repro from recipes directory:

import mlcroissant as mlc
dataset = mlc.Dataset(file="../../../datasets/flores-200/metadata.json", debug=False)
print(dataset)
records = dataset.records(record_set="language_translations_train_data_with_metadata")
print(sum(1 for r in records))

Before the test set was added (causing the record_set to be renamed), the call would be:

import mlcroissant as mlc
dataset = mlc.Dataset(file="../../../datasets/flores-200/metadata.json", debug=False)
print(dataset)
records = dataset.records(record_set="language_translations_with_metadata")
print(sum(1 for r in records))

Note: this can block for a long time in the correct case, so it may be better to print iteratively to see if count goes over 997.

The length should be 204 * 997, but it is only 997. This is because the data from each of the 204 .dev files have been collapsed at some point. Instead of each of the files yielding 997 rows, the effect is 997 samples and some of the columns are None.

@mkuchnik mkuchnik added the bug Something isn't working label Nov 3, 2023
@marcenacp
Copy link
Contributor

@mkuchnik Thanks for opening the bug. I expected https://github.com/mlcommons/croissant/blob/main/datasets/recipes/output/translations_from_zip.jsonl to cover this test case (extract the filename for each line), but apparently there is a difference. I can come up with a better implementation for tomorrow.

mkuchnik added a commit that referenced this issue Nov 6, 2023
Tests that FLORES 200 is loading records correctly by comparing against
a ground truth of 10 records. While the test is not exhaustive, checking
the first 10 records can prevent regressions like #295 in many cases.
marcenacp added a commit that referenced this issue Nov 7, 2023
This allows to have all fields computed at the same place. For example,
this avoids repeating computation to compute separately lines and
lineNumbers. Now, all fields are in the same ReadFields.__call__
function.

Fixes: #295
marcenacp added a commit that referenced this issue Nov 7, 2023
This allows to have all fields computed at the same place. For example,
this avoids repeating computation to compute separately lines and
lineNumbers. Now, all fields are in the same ReadFields.__call__
function.

Fixes: #295
marcenacp added a commit that referenced this issue Nov 7, 2023
This allows to have all fields computed at the same place. For example,
this avoids repeating computation to compute separately lines and
lineNumbers. Now, all fields are in the same ReadFields.__call__
function.

Fixes: #295
marcenacp added a commit that referenced this issue Nov 7, 2023
This allows to have all fields computed at the same place. For example,
this avoids repeating computation to compute separately lines and
lineNumbers. Now, all fields are in the same ReadFields.__call__
function.

Fixes: #295
@marcenacp
Copy link
Contributor

marcenacp commented Nov 7, 2023

@mkuchnik Do you confirm that PR #309 fixes this issue?

translation and language used to be computed separately, so it was hard to reconstruct the matching. Hence the na values. Now, all fields are computed together within the same pd.DataFrame. The operation is called ReadFields. It simplifies the code (1 operation instead of 3 operations) and it should fix the bug.

@mkuchnik
Copy link
Contributor Author

mkuchnik commented Nov 7, 2023

@marcenacp Looks good to me! Seems a bit faster, too.

marcenacp added a commit that referenced this issue Nov 7, 2023
…elds. (#309)

This allows to have all fields computed at the same place. For example,
this avoids repeating computation to compute separately lines and
lineNumbers. Now, all fields are in the same `ReadFields.__call__`
function.

Fixes: #295

Note for reviewers: the main commit is the first commit of the chain.
Then we remove code, rename, comment and test.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants