Group semantics regression on FLORES 200 #295

mkuchnik · 2023-11-03T16:47:14Z

FLORES 200 loaded correctly at 0c66893, but it behaves differently on main (487ec5c).

Repro from recipes directory:

```python
import mlcroissant as mlc

dataset = mlc.Dataset(file="../../../datasets/flores-200/metadata.json", debug=False)
print(dataset)
records = dataset.records(record_set="language_translations_train_data_with_metadata")
# Count the records by exhausting the iterator.
print(sum(1 for r in records))
```

Before the test set was added (causing the record_set to be renamed), the call would be:

```python
import mlcroissant as mlc

dataset = mlc.Dataset(file="../../../datasets/flores-200/metadata.json", debug=False)
print(dataset)
# Same call, using the record_set name from before the rename.
records = dataset.records(record_set="language_translations_with_metadata")
print(sum(1 for r in records))
```

Note: in the correct case this can take a long time to finish, so it may be better to count incrementally and check whether the count goes over 997.
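One way to do that incremental check is to cap the iteration just past the threshold. This is a minimal sketch: `count_up_to` is a hypothetical helper, not part of mlcroissant, and a plain generator stands in for `dataset.records(...)` so the snippet runs without the dataset.

```python
from itertools import islice


def count_up_to(records, limit):
    """Count items in `records`, stopping once `limit` items have been seen."""
    return sum(1 for _ in islice(records, limit))


# Stand-in for dataset.records(...): a lazy iterable that is slow to exhaust.
fake_records = ({"line_number": i} for i in range(204 * 997))

# Ask for one record more than a single file holds (997).
seen = count_up_to(fake_records, 997 + 1)
print("more than 997 records" if seen > 997 else f"only {seen} records")
```

With the real dataset, passing `records` instead of `fake_records` would stop after 998 records rather than reading every file.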

The length should be 204 * 997, but it is only 997. This is because the data from the 204 .dev files has been collapsed at some point. Instead of each file yielding its own 997 rows, the result is 997 samples in total, and some of the columns are None.
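For reference, the expected count can be worked out directly. The sketch below takes the figures from the report (204 `.dev` files of 997 lines each); `expected_record_count` is a hypothetical helper used only for illustration.

```python
def expected_record_count(num_files, lines_per_file):
    """Each file should contribute all of its lines as separate records."""
    return num_files * lines_per_file


expected = expected_record_count(204, 997)
observed = 997  # count reported on main (487ec5c)

print(expected)              # 203388
print(expected // observed)  # 204: the observed count is short by a factor of 204
```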


marcenacp · 2023-11-06T15:59:17Z

@mkuchnik Thanks for opening the bug. I expected https://github.com/mlcommons/croissant/blob/main/datasets/recipes/output/translations_from_zip.jsonl to cover this test case (extracting the filename for each line), but apparently there is a difference. I can come up with a better implementation by tomorrow.

Tests that FLORES 200 is loading records correctly by comparing against a ground truth of 10 records. While the test is not exhaustive, checking the first 10 records can prevent regressions like #295 in many cases.

This allows all fields to be computed in the same place. For example, it avoids repeating work to compute lines and lineNumbers separately. Now, all fields are computed in the same ReadFields.__call__ function. Fixes: #295

marcenacp · 2023-11-07T10:19:46Z

@mkuchnik Can you confirm that PR #309 fixes this issue?

translation and language used to be computed separately, so it was hard to reconstruct which translation matched which language, hence the NA values. Now all fields are computed together within the same pd.DataFrame, in a single operation called ReadFields. This simplifies the code (1 operation instead of 3) and should fix the bug.
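The difference between the two approaches can be illustrated with a toy example. This is only a sketch under assumptions: it uses plain Python dicts rather than a pd.DataFrame, the file names and the functions `read_fields_separately` and `read_fields_together` are hypothetical, and it is not the code from PR #309.

```python
# Toy input: two files with two lines each (file names are made up).
files = {
    "aaa.dev": ["hello", "world"],
    "bbb.dev": ["bonjour", "monde"],
}


def read_fields_separately(files):
    """Compute each field on its own; the link between them is lost."""
    languages = [name.split(".")[0] for name in files]
    translations = [line for lines in files.values() for line in lines]
    return languages, translations


def read_fields_together(files):
    """Compute all fields for a line in one pass, so they stay matched."""
    rows = []
    for name, lines in files.items():
        language = name.split(".")[0]
        for line_number, line in enumerate(lines):
            rows.append(
                {"language": language, "line_number": line_number, "translation": line}
            )
    return rows


languages, translations = read_fields_separately(files)
# 2 languages vs 4 translations: there is no row-wise pairing to rely on.
print(len(languages), len(translations))

rows = read_fields_together(files)
# One row per (file, line): 2 files * 2 lines = 4 rows, each fully populated.
print(len(rows), rows[2])
```

Building each row in one place means the row count scales with files times lines, and no field has to be re-matched to its file afterwards.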

mkuchnik · 2023-11-07T15:30:29Z

@marcenacp Looks good to me! Seems a bit faster, too.

…elds. (#309) This allows all fields to be computed in the same place. For example, it avoids repeating work to compute lines and lineNumbers separately. Now, all fields are computed in the same `ReadFields.__call__` function. Fixes: #295 Note for reviewers: the main commit is the first commit of the chain. The later commits remove code, rename, comment and test.

mkuchnik added the bug label Nov 3, 2023

mkuchnik mentioned this issue Nov 3, 2023

Add FLORES 200 non-hermetic test #296

Merged

marcenacp mentioned this issue Nov 7, 2023

[Issue # 295] Replace GroupRecordSetStart/GroupRecordSetEnd by ReadFields. #309

Merged

marcenacp closed this as completed in #309 Nov 7, 2023
