Commit

Merge branch 'master' into exp5_2
jordimas committed Aug 25, 2024
2 parents 1742bb5 + dcc2645 commit b257a8f
Showing 24 changed files with 2,036 additions and 5 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -15,8 +15,8 @@ Language pair | SC model BLEU | SC Flores200 BLEU | Google BLEU | Meta NLLB200 B
|Catalan-German | 28.5 |25.4 |32.9 |29.1|15.8| 3142257 | [cat-deu-2022-11-16.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-deu-2022-11-16.zip)
|English-Catalan | 46.9 |43.8 |46.0 |41.7|29.8| 7856208 | [eng-cat-2023-10-30.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/eng-cat-2023-10-30.zip)
|Catalan-English | 47.4 |43.5 |47.0 |48.0|29.6| 7856208 | [cat-eng-2023-10-29.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-eng-2023-10-29.zip)
-|Basque-Catalan | 38.8 |24.9 |29.6 |N/A|N/A| 9546180 | [eus-cat-2024-08-09.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/eus-cat-2024-08-09.zip)
-|Catalan-Basque | 27.3 |17.1 |18.0 |N/A|N/A| 9546180 | [cat-eus-2024-08-12.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-eus-2024-08-12.zip)
+|Basque-Catalan | 38.8 |24.9 |29.6 |25.7|N/A| 9546180 | [eus-cat-2024-08-09.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/eus-cat-2024-08-09.zip)
+|Catalan-Basque | 27.3 |17.1 |18.0 |10.5|N/A| 9546180 | [cat-eus-2024-08-12.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-eus-2024-08-12.zip)
|French-Catalan | 41.3 |31.6 |37.3 |33.3|27.2| 2566302 | [fra-cat-2022-11-09.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/fra-cat-2022-11-09.zip)
|Catalan-French | 41.4 |35.4 |41.7 |39.6|27.9| 2566302 | [cat-fra-2022-11-14.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/cat-fra-2022-11-14.zip)
|Galician-Catalan | 74.1 |31.4 |36.5 |33.2|N/A| 2710149 | [glg-cat-2022-11-17.zip](https://www.softcatala.org/pub/softcatala/opennmt/models/2022-11-22/glg-cat-2022-11-17.zip)
6 changes: 4 additions & 2 deletions data-processing-tools/join-single-file.py
@@ -150,6 +150,9 @@ def _is_sentence_len_good(src, trg):
return True

# How to split test-val sets based on https://en.wikipedia.org/wiki/Per_mille
+# Two goals:
+# - The split is predictable instead of random
+# - The split has a good distribution over the different corpus
def _get_val_test_split_steps(lines, per_mille_val, per_mille_test):
lines_val = round(lines * per_mille_val / 1000)
steps_val = round(lines / lines_val)
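
The two goals named in the comment above (a predictable split that is spread evenly over the corpora) can be illustrated with a minimal sketch. It assumes that every `steps`-th line is diverted to the validation or test set using counters like the `cnt_steps_val` and `cnt_steps_test` variables touched by this commit. The function names `get_split_steps` and `assign_splits`, the `max(1, ...)` guard, and the exact selection rule are assumptions made for illustration, not the code in `join-single-file.py`.

```python
def get_split_steps(lines, per_mille_val, per_mille_test):
    """Return the distance (in lines) between validation and test samples."""
    # max(1, ...) avoids a division by zero on very small corpora.
    # This guard is an addition of the sketch and is not shown in the diff.
    lines_val = max(1, round(lines * per_mille_val / 1000))
    lines_test = max(1, round(lines * per_mille_test / 1000))
    return round(lines / lines_val), round(lines / lines_test)


def assign_splits(total_lines, steps_val, steps_test):
    """Label every line as 'val', 'test' or 'train' using step counters.

    Hypothetical selection rule: a line goes to validation when the
    validation counter reaches its step, otherwise to test when the test
    counter reaches its step, otherwise to train.
    """
    labels = []
    cnt_steps_val = cnt_steps_test = 0
    for _ in range(total_lines):
        cnt_steps_val += 1
        cnt_steps_test += 1
        if cnt_steps_val >= steps_val:
            labels.append("val")
            cnt_steps_val = 0
        elif cnt_steps_test >= steps_test:
            labels.append("test")
            cnt_steps_test = 0
        else:
            labels.append("train")
    return labels


if __name__ == "__main__":
    steps_val, steps_test = get_split_steps(10000, 1, 1)
    labels = assign_splits(10000, steps_val, steps_test)
    print(steps_val, steps_test, labels.count("val"), labels.count("test"))
```

Because the assignment depends only on the line position, running it twice over the same input yields the same split, and the samples are taken at regular intervals across the whole concatenated file rather than from one end of it. When both counters reach their step on the same line, this sketch gives the line to validation and takes the test sample from the next line.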
@@ -179,7 +182,7 @@ def split_in_six_files(src_filename, tgt_filename, directory, source_lang, targe
SAMPLE_PER_MILLE_VAL = 1
SAMPLE_PER_MILLE_TEST = 1
steps_val, steps_test = _get_val_test_split_steps(total_lines, SAMPLE_PER_MILLE_VAL, SAMPLE_PER_MILLE_TEST)
-clean_src = clean_trg = 0
+cnt_steps_val = cnt_steps_test = clean_src = clean_trg = 0
equal = 0
bad_length = 0
dots = 0
@@ -209,7 +212,6 @@ def split_in_six_files(src_filename, tgt_filename, directory, source_lang, targe

print("total_lines {0}".format(total_lines))

-cnt_steps_val = cnt_steps_test = clean_src = clean_trg = 0
while True:

src = read_source.readline()
4 changes: 3 additions & 1 deletion evaluate/meta-bleu.json
@@ -16,5 +16,7 @@
"glg-cat": "33.2",
"cat-glg": "31.7",
"oci-cat": "36.2",
-"cat-oci": "27.8"
+"cat-oci": "27.8",
+"eus-cat": "25.7",
+"cat-eus": "10.5"
}
1,012 changes: 1,012 additions & 0 deletions evaluate/meta-nllb-200/flores200-nllb-200-3.3B-cat-eus.eus

Large diffs are not rendered by default.

1,012 changes: 1,012 additions & 0 deletions evaluate/meta-nllb-200/flores200-nllb-200-3.3B-eus-cat.cat

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions evaluate/meta-nllb200-translate.py
@@ -79,6 +79,9 @@ def main():

"oc-ca" : ["oci", "cat"],
"ca-oc" : ["cat", "oci"],

+"eu-ca" : ["eus", "cat"],
+"ca-eu" : ["cat", "eus"],
}

blue_scores = {}