Releases: thammegowda/mtdata
v0.4.1
- Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
- `mtdata cache` added; improves concurrency by supporting multiple recipes
- Added WMT general test 2022 and 2023
- Added news commentary 18.1 and news crawl 2023
- `mtdata-bcp47`: `-p/--pipe` to map codes from stdin -> stdout (example below)
- `mtdata-bcp47`: `--script {suppress-default,suppress-all,express}`
- Uses `pigz` to read and write gzip files by default when pigz is in PATH; `export USE_PIGZ=0` to disable
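A quick sketch of the new pipe mode and the pigz toggle; the flags and env var come from the notes above, while the sample input codes are illustrative:

```bash
# Map language codes to normalized tags, one per line, stdin -> stdout
printf 'en\nde-DE\n' | mtdata-bcp47 -p

# Control how script subtags are rendered (mode names from the release note)
printf 'hin\n' | mtdata-bcp47 -p --script express

# pigz is picked up automatically when in PATH; opt out via the env var
export USE_PIGZ=0
```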
v0.4.0 - monolingual data, mtdata echo, add new datasets at runtime
- Fix: allenai_nllb.json is now included in MANIFEST.in #137. Also fixed CI: Travis -> GitHub Actions
- Update ELRC datasets #138. Thanks @AlexUmnov
- Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
- Add Flores200 dev and devtests #145. Thanks @ZenBel
- Add support for `mtdata echo <ID>`
- dataset entries only store BibTeX keys and not full citation text
- creates index cache as a JSON Lines file (WIP towards dataset statistics)
- Simplified index loading
- simplified compression format handlers. Added support for opening .bz2 files without creating temp files.
- all resources are moved to the `mtdata/resource` dir and any new additions to that dir are automatically included in the python package (fail-proof for future issues like #137)
New and exciting features:
- Support for adding new datasets at runtime (`mtdata*.py` files from the run dir). Note: you have to reindex by calling `mtdata -ri list` (see the example after this list)
- Monolingual datasets support in progress (currently testing)
- Dataset IDs are now `Group-name-version-lang1-lang2` for bitext and `Group-name-version-lang` for monolingual
- `mtdata list` is updated: `mtdata list -l eng-deu` for bitext and `mtdata list -l eng` for monolingual
- Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
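The commands below sketch the new features under the revised ID scheme; the flags are taken from these notes, and the `mtdata echo` dataset ID is illustrative:

```bash
# List bitext vs. monolingual datasets
mtdata list -l eng-deu    # bitext: Group-name-version-lang1-lang2
mtdata list -l eng        # monolingual: Group-name-version-lang

# Print a dataset's contents to stdout (dataset ID is illustrative)
mtdata echo Statmt-newscrawl-2022-eng

# After dropping a custom mtdata*.py into the run dir, rebuild the index
mtdata -ri list
```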
Skipped 0.3.9 because the changes are too significant; wanted to bump from 0.3.x -> 0.4.x
0.3.8 - log level, progress bar, refresh OPUS and ELRC; stats
- CLI arg `--log-level` with default set to `WARNING`
- progressbar can be disabled from CLI with `--no-pbar`; default is enabled (`--pbar`)
- `mtdata stats --quick` does HTTP HEAD and shows content length; e.g. `mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu`
- `python -m mtdata.scripts.recipe_stats` to read stats from output directory
- Security fix with tar extract | Thanks @TrellixVulnTeam
- Added NLLB datasets prepared by AllenAI | Thanks @AlexUmnov
- Opus and ELRC datasets update | Thanks @ZenBel
- default for `fail_on_error` is set to true; returns non-zero exit code on error. Set the `--no-fail` flag to ignore errors in the `mtdata get` command (combined example below)
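An illustrative combination of these flags, assuming `--train` and `-o/--out` from `mtdata get`'s usual interface; the dataset ID is reused from the stats example above:

```bash
# Quieter, non-interactive run that tolerates failed downloads
mtdata --log-level ERROR get -l fra-deu \
    --train Statmt-commoncrawl-wmt19-fra-deu \
    --no-pbar --no-fail -o data-dir
```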
0.3.7
v0.3.6: fixes and additions for WMT22
- Fixed KECL-JParaCrawl
- added Paracrawl bonus for ukr-eng
- added Yandex rus-eng corpus
- added Yakut sah-eng
- update recipe for wmt22 constrained eval
Disable JW300; add WMT22 recipes; auto-generate references.bib
- Parallel download support: `-j/--n-jobs` argument (default `4`)
- Automatically create a references.bib file based on the datasets selected
- Add histogram to web search interface (Thanks, @sgowdaks)
- ELRC index updates; (Thanks @kpu)
- Update OPUS index. Use OPUS API to download all datasets
- A lot of new datasets added.
- WARNING: Some OPUS IDs are not backward compatible (version number mismatch)
- Fix: JESC dataset language IDs were wrong
- New datasets:
  - jpn-eng: add paracrawl v3, and wmt19 TED
  - backtranslation datasets for en2ru and ru2en
- Option to set `MTDATA_RECIPES` dir (default is $PWD). All files matching the glob `${MTDATA_RECIPES}/mtdata.recipes*.yml` are loaded (see the sketch after this list)
- WMT22 recipes added
- JW300 is disabled #77
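A minimal recipe sketch: the glob and env var are from the note above, but the YAML field names (`id`, `langs`, `train`, `dev`, `test`), the dataset IDs, and the `get-recipe` flags are assumptions to be checked against the shipped recipes file:

```bash
# Write an example recipe file matching the ${MTDATA_RECIPES}/mtdata.recipes*.yml glob
cat > mtdata.recipes.example.yml <<'EOF'
- id: example-deu-eng          # field names are assumptions, not verified
  langs: deu-eng
  train:
    - Statmt-europarl-9-deu-eng
  dev: Statmt-newstest_deen-2019-deu-eng
  test:
    - Statmt-newstest_deen-2020-deu-eng
EOF

export MTDATA_RECIPES=$PWD                 # default is $PWD anyway
mtdata list-recipe                         # should now show example-deu-eng
mtdata get-recipe -ri example-deu-eng -o runs/example -j 8   # -ri/-o assumed
```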
v0.3.3
- bug fix: XML reading inside tar; ElementTree complains about TarPath
- `mtdata list` has `-g/--groups` and `-ng/--not-groups` as include/exclude filters on group name | closes #91 (example after this list)
- `mtdata list` has `-id/--id` flag to print only dataset IDs | closes #91
- add WMT21 tests | closes #90
- add ccaligned datasets wmt21 | closes #89
- add ParIce datasets | closes #88
- add wmt21 en-ha | closes #87
- add wmt21 wikititles v3 | closes #86
- Add train and test sets from StanfordNLP NMT page (large: en-cs, medium: en-de, small: en-vi) | closes #84
- Add support for two URLs for a single dataset (i.e. without zip/tar files)
- Fixed a language match bug #92 / #93
- Fix: language compatibility checks; Closes #94
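The group filters and the ID-only flag can be combined; the group names below, and whether the filters accept multiple values, are illustrative:

```bash
mtdata list -l eng-deu -g Statmt          # only datasets from the Statmt group
mtdata list -l eng-deu -ng ELRC           # everything except this group
mtdata list -l eng-deu -g Statmt -id      # print dataset IDs only
```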
v0.3.2 - 20211205
- Fix: recipes.yml is missing in the pip installed package
- Add Project Anuvaad: 196 datasets belonging to Indian languages
- add CLI: `mtdata get` has `--fail / --no-fail` arguments to tell whether or not to crash upon errors
Faster tar reading; recipes, stats; multilingual source or target support
- `mtdata [list|get]-recipe`: add support for recipes; list-recipe and get-recipe subcommands added
- `mtdata stats`: add support for viewing stats of a dataset: words, chars, segs
- FIX: url for UN dev and test sets (source was updated so we updated too)
- Multilingual experiment support: ISO 639-3 code `mul` implies multilingual; e.g. mul-eng or eng-mul (example after this list)
- `--dev` accepts multiple datasets and merges them (useful for multilingual experiments)
- tar files are extracted before read (performance improvements)
- setup.py: version and descriptions accessed via regex
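A sketch of a multilingual setup using the `mul` code and multi-dataset `--dev`; the `<did*>` tokens are placeholders and `-o/--out` is assumed:

```bash
# Datasets with English on one side, any language on the other
mtdata list -l mul-eng

# Merge several dev sets for a multilingual experiment
mtdata get -l mul-eng --dev <did1> <did2> --test <did3> -o runs/mul-eng
```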
v0.3.0 - BCP47, new dataset-id, dataset compression; JW300 v1c
Big Changes: BCP-47, data compression
- BCP-47: (Language, Script, Region)
  - Our implementation is not strictly BCP-47; we differ on the following:
    - We use ISO 639-3 codes (i.e. three letters) for all languages, whereas BCP-47 uses two letters for some (e.g. `en`) and three letters for many
    - We use `_` (underscore) to join language, script, and region, whereas BCP-47 uses `-` (hyphen)
- Dataset IDs (aka `did` in short) are standardized: `<group>-<name>-<version>-<lang1>-<lang2>`; `<group>` can have mixed case, `<name>` has to be lowercase
- CLI interface now accepts `did`s
s. -
mtdata get --dev <did>
now accepts a single dataset ID; createsdev.{xxx,yyy}
links at the root of out dir -
mtdata get --test <did1> ... <did3>
createstest{1..4}.{xxx,yyy}
links at the root of out dir -
--compress
option to store compressed datasets under output dir -
zip
andtar
files are no longer extracted. we read directly from compressed files without extracting them -
._lock
files are removed after download job is done -
- Add JESC, jpn paracrawl, news commentary 15 and 16
- Force unicode encoding; make it work on Windows (Issue #71)
- JW300 -> JW300_v1 (tokenized); added JW300_v1c (raw) (Issue #70)
- Add all Wikititle datasets from lingual tool (Issue #63)
- progressbar: `enlighten` is used
- `wget` is replaced with `requests`; a User-Agent header along with the mtdata version is sent in HTTP request headers
- Paracrawl v9 added
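Putting the v0.3.0 changes together, a hedged end-to-end example; the dataset IDs and `-o/--out` are illustrative assumptions, while `--dev`, `--test`, and `--compress` are from the notes above:

```bash
mtdata get -l deu-eng \
    --train Statmt-europarl-9-deu-eng \
    --dev Statmt-newstest_deen-2019-deu-eng \
    --test Statmt-newstest_deen-2020-deu-eng \
    --compress -o runs/deu-eng
```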