Release v0.3.5
thammegowda committed Mar 11, 2022
1 parent 9ffba89 commit 5a9c034
Showing 16 changed files with 133,937 additions and 64 deletions.
9 changes: 6 additions & 3 deletions CHANGELOG.md
@@ -1,17 +1,20 @@
# Change Log

-## v0.3.5 - WIP
+## v0.3.5 - 20220310

- Parallel download support `-j/--n-jobs` argument (with default `4`)
- Add histogram to web search interface (Thanks, @sgowdaks)
- Update OPUS index. Use OPUS API to download all datasets
- A lot of new datasets are added.
-- WARNING: Some of OPUS IDs are not backward compatible (version number mismatch)
+- WARNING: Some OPUS IDs are not backward compatible (version number mismatch)
- Fix: JESC dataset language IDs were wrong
- New datasets:
- jpn-eng: add paracrawl v3, and wmt19 TED
- backtranslation datasets for en2ru ru2en en2ru
-- Option to set `MTDATA_RECIPES` dir (default is $PWD). All files matching the glob `${MTDATA_RECIPES}/mtdata.recipes*.yml` are loaded
+- Option to set `MTDATA_RECIPES` dir (default is $PWD). All files matching the glob `${MTDATA_RECIPES}/mtdata.recipes*.yml` are loaded
+- WMT22 recipes added
+- JW300 is disabled [#77](https://github.com/thammegowda/mtdata/issues/77)
+- Automatically create references.bib file based on datasets selected
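
Reviewer note: the `MTDATA_RECIPES` entry above describes a directory lookup with a `$PWD` fallback and a glob. A minimal Python sketch of that lookup might look like the following. The function name `find_recipe_files` is hypothetical and this is not mtdata's actual implementation; only the environment variable name, the default directory, and the glob pattern come from the changelog.

```python
import os
from pathlib import Path


def find_recipe_files(recipes_dir=None):
    """Return recipe files matching mtdata.recipes*.yml, sorted by path.

    Hypothetical helper for illustration; not part of mtdata's API.
    """
    # Assumption: an explicit argument wins, then $MTDATA_RECIPES,
    # then the current working directory (the changelog's "$PWD" default).
    base = Path(recipes_dir or os.environ.get("MTDATA_RECIPES") or Path.cwd())
    # Same pattern as in the changelog: ${MTDATA_RECIPES}/mtdata.recipes*.yml
    return sorted(base.glob("mtdata.recipes*.yml"))


if __name__ == "__main__":
    for path in find_recipe_files():
        print(path)
```

With this pattern, both `mtdata.recipes.yml` and a file such as `mtdata.recipes.wmt22.yml` would match, while unrelated `.yml` files in the same directory would be ignored.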

## v0.3.4 - 20220206

54 changes: 27 additions & 27 deletions README.md
@@ -45,35 +45,35 @@ pip install --editable .
We have added some commonly used datasets - you are welcome to add more!
These are the summary of datasets from various sources (Updated: Feb 2022).


-| Source | Dataset Count |
-|------------------------:|--------------:|
-| OPUS<sup>$1</sup> | 80,830 |
-| OPUS_JW300<sup>$2</sup> | 91,248 |
-| Neulab | 4,455 |
-| Facebook | 1,617 |
-| ELRC | 1,341 |
-| EU | 1,178 |
-| Statmt | 699 |
-| Tilde | 519 |
-| LinguaTools | 253 |
-| Anuvaad | 196 |
-| AI4Bharath | 192 |
-| ParaCrawl | 126 |
-| Lindat | 56 |
-| UN<sup>$3</sup> | 30 |
-| JoshuaDec | 29 |
-| Phontron | 4 |
-| NRC_CA | 4 |
-| IITB | 3 |
-| WAT | 3 |
-| StanfordNLP | 3 |
-| KECL | 1 |
-| *Total* | *182.8K* |
+| Source | Dataset Count |
+|-------------------------------:|--------------:|
+| OPUS<sup>$1</sup> | 80,830 |
+| Neulab | 4,455 |
+| Facebook | 1,617 |
+| ELRC | 1,394 |
+| EU | 1,178 |
+| Statmt | 750 |
+| Tilde | 519 |
+| LinguaTools | 253 |
+| Anuvaad | 196 |
+| AI4Bharath | 192 |
+| ParaCrawl | 126 |
+| Lindat | 56 |
+| UN<sup>$3</sup> | 30 |
+| JoshuaDec | 29 |
+| StanfordNLP | 15 |
+| ParIce | 8 |
+| Phontron | 4 |
+| NRC_CA | 4 |
+| IITB | 3 |
+| WAT | 3 |
+| KECL | 2 |
+| Masakhane | 2 |
+| *Total* | *131,301* |


- <sup>$1</sup> - OPUS contains duplicate entries from other listed sources, but they are often older releases of corpus.
-- <sup>$2</sup> - ~~JW300 is also retrieved from OPUS, however handled differently due to the difference in the scale and internal format. It has two versions: `v1` (tokenized) and `v1c` (raw)~~ This dataset has been taken down at source
-- <sup>$3</sup> - Only test sets are included
+- <sup>$2</sup> - Only test sets are included

# CLI Usage
- After pip installation, the CLI can be called using `mtdata` command or `python -m mtdata`
2 changes: 1 addition & 1 deletion docs/asciidoctor.css
2 changes: 1 addition & 1 deletion docs/dids.txt
5 changes: 3 additions & 2 deletions docs/how-to-release.md
@@ -6,8 +6,9 @@ Using twine : https://twine.readthedocs.io/en/latest/
Clear `rm -r build dist *.egg-info` if those dir exist.
2. Build :: `$ python setup.py sdist bdist_wheel`
where `sdist` is source code; `bdist_wheel` is universal ie. for all platforms
-3. Upload to **testpypi** :: `$ twine upload -r testpypi dist/*`
-4. Upload to **pypi** :: `$ twine upload -r pypi dist/*`
+3. Make docs: `docs/make-docs.sh`
+4. Upload to **testpypi** :: `$ twine upload -r testpypi dist/*`
+5. Upload to **pypi** :: `$ twine upload -r pypi dist/*`


### The `.pypirc` file
45 changes: 23 additions & 22 deletions docs/index.adoc
@@ -68,28 +68,29 @@ Here is a summary of datasets from various sources (Updated: Feb 2022).

|===
| Source | Dataset Count
-| OPUS<sup>$1</sup> | 80,830
-| OPUS_JW300<sup>$2</sup> | 91,248
-| Neulab | 4,455
-| Facebook | 1,617
-| ELRC | 1,341
-| EU | 1,178
-| Statmt | 699
-| Tilde | 519
-| LinguaTools | 253
-| Anuvaad | 196
-| AI4Bharath | 192
-| ParaCrawl | 126
-| Lindat | 56
-| UN<sup>$3</sup> | 30
-| JoshuaDec | 29
-| Phontron | 4
-| NRC_CA | 4
-| IITB | 3
-| WAT | 3
-| StanfordNLP | 3
-| KECL | 1
-| *Total* | *182.8K*
+| OPUS | 120,465
+| Neulab | 4,455
+| Facebook | 1,617
+| ELRC | 1,394
+| EU | 1,178
+| Statmt | 750
+| Tilde | 519
+| LinguaTools | 253
+| Anuvaad | 196
+| AI4Bharath | 192
+| ParaCrawl | 126
+| Lindat | 56
+| UN | 30
+| JoshuaDec | 29
+| StanfordNLP | 15
+| ParIce | 8
+| Phontron | 4
+| NRC_CA | 4
+| IITB | 3
+| WAT | 3
+| KECL | 2
+| Masakhane | 2
+| *Total* | *131,301*
|===

- <sup>$1</sup> - OPUS contains duplicate entries from other listed sources, but they are often older releases of corpus.
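
Reviewer note: as a sanity check on the updated table in `docs/index.adoc`, the per-source counts can be summed and compared with the stated total. The sketch below copies the numbers from the new rows of that table; the variable names are illustrative only.

```python
# Per-source dataset counts, copied from the updated docs/index.adoc table.
counts = {
    "OPUS": 120_465,
    "Neulab": 4_455,
    "Facebook": 1_617,
    "ELRC": 1_394,
    "EU": 1_178,
    "Statmt": 750,
    "Tilde": 519,
    "LinguaTools": 253,
    "Anuvaad": 196,
    "AI4Bharath": 192,
    "ParaCrawl": 126,
    "Lindat": 56,
    "UN": 30,
    "JoshuaDec": 29,
    "StanfordNLP": 15,
    "ParIce": 8,
    "Phontron": 4,
    "NRC_CA": 4,
    "IITB": 3,
    "WAT": 3,
    "KECL": 2,
    "Masakhane": 2,
}

total = sum(counts.values())
print(f"{total:,}")  # 131,301, matching the *Total* row
```

The same check against the updated README table, which still lists OPUS at 80,830, would not reach 131,301, so the two tables appear to disagree on the OPUS row.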
2 changes: 1 addition & 1 deletion docs/index.html
2 changes: 1 addition & 1 deletion docs/rouge-github.css
7 changes: 2 additions & 5 deletions docs/task-wmt22.adoc
@@ -12,13 +12,10 @@ This document helps download datasets for WMT22 General MT task using `mtdata`.
== Setup

-NOTE: mtdata v0.3.5 is required which is currently under testing. So install it from develop. We will release it to PyPi once testing is complete.
-
[source,bash]
----
-# pip install mtdata==0.3.5
-# Install from develop branch
-pip install https://github.com/thammegowda/mtdata/archive/develop.zip
+pip install mtdata==0.3.5
+# pip install https://github.com/thammegowda/mtdata/archive/develop.zip # Install from develop branch
----

== Get Recipes File