Releases: thammegowda/mtdata


26 Apr 05:08
  • Better parallelization: parallel and mono data are scheduled at once (previously it was one after the other)
  • mtdata cache added. Improves concurrency by supporting multiple recipes
  • Added WMT general test 2022 and 2023
  • Added news commentary 18.1 and news crawl 2023
  • mtdata-bcp47 : -p/--pipe to map codes from stdin -> stdout
  • mtdata-bcp47 : --script {suppress-default,suppress-all,express}
  • Uses pigz to read and write gzip files by default when pigz is in PATH. export USE_PIGZ=0 to disable
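
The pigz opt-out described above can be pictured with a small sketch. This is not mtdata's actual code: the helper name `should_use_pigz` and its parameters are hypothetical, and only the `USE_PIGZ=0` variable and the "pigz in PATH" condition come from the release note.

```python
import os
import shutil


def should_use_pigz(env=None, which=shutil.which):
    """Hypothetical helper: True when pigz is on PATH and USE_PIGZ is not "0"."""
    env = os.environ if env is None else env
    # Assumption: any value other than "0" (including unset) leaves pigz enabled.
    if env.get("USE_PIGZ", "1") == "0":
        return False
    # shutil.which returns None when the executable is not found on PATH.
    return which("pigz") is not None


if __name__ == "__main__":
    print("pigz enabled:", should_use_pigz())
```

Passing `env` and `which` as parameters keeps the check easy to exercise without touching the real environment.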

v0.4.0 - monolingual data, mtdata echo, add new datasets at runtime

27 Mar 04:09
  • Fix: allenai_nllb.json is now included in #137. Also fixed CI: Travis -> github actions
  • Update ELRC datasets #138. Thanks @AlexUmnov
  • Add Jparacrawl Chinese-Japanese subset #143. Thanks @BrightXiaoHan
  • Add Flores200 dev and devtests #145. Thanks @ZenBel
  • Add support for mtdata echo <ID>
  • dataset entries only store BibTeX keys and not the full citation text
    • creates index cache as JSONLine file. (WIP towards dataset statistics)
  • Simplified index loading
  • simplified compression format handlers. Added support for opening .bz2 files without creating temp files.
  • all resources are moved to mtdata/resource dir and any new additions to that dir are automatically included in python package (Fail proof for future issues like #137 )

New and exciting features:

  • Support for adding new datasets at runtime (mtdata*.py from run dir). Note: you have to reindex by calling mtdata -ri list
  • Monolingual datasets support in progress (currently testing)
    • Dataset IDs are now Group-name-version-lang1-lang2 for bitext and Group-name-version-lang for monolingual
    • mtdata list is updated. mtdata list -l eng-deu for bitext and mtdata list -l eng for monolingual
    • Added: Statmt Newscrawl, news-discussions, Leipzig corpus, ...
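
The two ID shapes above (five fields for bitext, four for monolingual) can be told apart by counting hyphen-separated fields. The sketch below is illustrative only: `DatasetID` and `parse_did` are hypothetical names, not mtdata's API, and it assumes that no individual field contains a hyphen. The monolingual ID in the example is made up for illustration.

```python
from typing import NamedTuple, Optional


class DatasetID(NamedTuple):
    group: str
    name: str
    version: str
    lang1: str
    lang2: Optional[str]  # None for monolingual datasets


def parse_did(did: str) -> DatasetID:
    """Split a dataset ID into fields, assuming fields contain no hyphens."""
    parts = did.split("-")
    if len(parts) == 5:  # Group-name-version-lang1-lang2
        return DatasetID(*parts)
    if len(parts) == 4:  # Group-name-version-lang
        return DatasetID(*parts, None)
    raise ValueError(f"expected 4 or 5 hyphen-separated fields, got {len(parts)}: {did!r}")


if __name__ == "__main__":
    print(parse_did("Statmt-commoncrawl-wmt19-fra-deu"))
    print(parse_did("Statmt-newscrawl-2019-eng"))  # hypothetical monolingual ID
```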

Skipped 0.3.9 because the changes were significant enough to warrant a bump from 0.3.x to 0.4.x.

0.3.8 - log level, progress bar, refresh OPUS and ELRC; stats

25 Nov 03:31
  • CLI arg --log-level with default set to WARNING
  • progressbar can be disabled from CLI --no-pbar; default is enabled --pbar
  • mtdata stats --quick does HTTP HEAD and shows content length; e.g. mtdata stats --quick Statmt-commoncrawl-wmt19-fra-deu
  • python -m mtdata.scripts.recipe_stats to read stats from output directory
  • Security fix with tar extract | Thanks @TrellixVulnTeam
  • Added NLLB datasets prepared by AllenAI | Thanks @AlexUmnov
  • Opus and ELRC datasets update | Thanks @ZenBel
  • Default for fail_on_error is now true, so mtdata get returns a non-zero exit code on error. Pass --no-fail to ignore errors during mtdata get
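
For readers unfamiliar with the HTTP HEAD approach mentioned for `mtdata stats --quick`, here is a minimal sketch of reading a remote file's size without downloading it. It uses only the Python standard library and is not mtdata's implementation; `head_request` and `content_length` are hypothetical helper names.

```python
from typing import Optional
from urllib.request import Request, urlopen


def head_request(url: str) -> Request:
    """Build a request that asks for headers only (no response body)."""
    return Request(url, method="HEAD")


def content_length(url: str, timeout: float = 10.0) -> Optional[int]:
    """Return the Content-Length reported by the server, or None if absent."""
    with urlopen(head_request(url), timeout=timeout) as resp:
        value = resp.headers.get("Content-Length")
    return int(value) if value is not None else None
```

Servers are not required to send Content-Length, which is why the sketch returns None rather than failing in that case.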


11 Jul 20:43

Update ELRC data, including EU acts, which are used for wmt22 (thanks @kpu)

v0.3.6 : fixes and additions for wmt22

08 Jul 22:37
  • Fixed KECL-JParaCrawl
  • added Paracrawl bonus for ukr-eng
  • added Yandex rus-eng corpus
  • added Yakut sah-eng
  • update recipe for wmt22 constrained eval

disable JW300; add WMT22 recipes; auto generate references.bib

11 Mar 03:20
  • Parallel download support -j/--n-jobs argument (with default 4)
  • Automatically create references.bib file based on datasets selected
  • Add histogram to web search interface (Thanks, @sgowdaks)
  • ELRC index updates; (Thanks @kpu)
  • Update OPUS index. Use OPUS API to download all datasets
    • A lot of new datasets added.
    • WARNING: Some OPUS IDs are not backward compatible (version number mismatch)
  • Fix: JESC dataset language IDs were wrong
  • New datasets:
    • jpn-eng: add paracrawl v3, and wmt19 TED
    • backtranslation datasets for en2ru ru2en en2ru
  • Option to set MTDATA_RECIPES dir (default is $PWD). All files matching the glob ${MTDATA_RECIPES}/*.yml are loaded
  • WMT22 recipes added
  • JW300 is disabled #77
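
The recipe lookup rule above (use `MTDATA_RECIPES` if set, else the current directory, and load every `*.yml` file in it) can be sketched as follows. The function name `find_recipe_files` is hypothetical and the sketch only locates files; how mtdata parses them is not shown here.

```python
import os
from pathlib import Path


def find_recipe_files(env=None):
    """List *.yml files in $MTDATA_RECIPES, falling back to the current directory."""
    env = os.environ if env is None else env
    # Assumption: an empty MTDATA_RECIPES is treated the same as an unset one.
    base = Path(env.get("MTDATA_RECIPES") or Path.cwd())
    # Non-recursive glob, mirroring ${MTDATA_RECIPES}/*.yml
    return sorted(base.glob("*.yml"))


if __name__ == "__main__":
    for path in find_recipe_files():
        print(path)
```

Note that the pattern matches `.yml` only, so files ending in `.yaml` would not be picked up by this glob.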


28 Jan 06:58
  • Bug fix: XML reading inside tar files; ElementTree complained about TarPath
  • mtdata list has -g/--groups and -ng/--not-groups as include exclude filters on group name | closes #91
  • mtdata list has -id/--id flag to print only dataset IDs | closes #91
  • add WMT21 tests | closes #90
  • add ccaligned datasets wmt21 | closes #89
  • add ParIce datasets | closes #88
  • add wmt21 en-ha | closes #87
  • add wmt21 wikititles v3 | closes #86
  • Add train and test sets from StanfordNLP NMT page (large: en-cs, medium: en-de, small: en-vi) | closes #84
    • Add support for two URLs for a single dataset (i.e. without zip/tar files)
  • Fixed a language match bug #92 / #93
  • Fix: language compatibility checks; Closes #94

v0.3.2 - 20211205

06 Dec 17:41
  • Fix: recipes.yml is missing in the pip installed package
  • Add Project Anuvaad: 196 datasets belonging to Indian languages
  • mtdata get now has --fail / --no-fail arguments to control whether it crashes on errors

faster tar reading; recipes, stats; multilingual source or target support

29 Oct 01:58
  • mtdata [list|get]-recipe :: Add support for recipes; list-recipe get-recipe subcommands added
  • mtdata stats:: add support for viewing stats of dataset; words, chars, segs
  • FIX url for UN dev and test sets (source was updated so we updated too)
  • Multilingual experiment support; ISO 639-3 code mul implies multilingual; e.g. mul-eng or eng-mul
  • --dev accepts multiple datasets and merges them (useful for multilingual experiments)
  • tar files are extracted before read (performance improvements)
  • version and descriptions accessed via regex

v0.3.0 - BCP47, new dataset-id, dataset compression; JW300 v1c

21 Oct 22:39

Big Changes: BCP-47, data compression

  • BCP47: (Language, Script, Region)

    • Our implementation is not strictly BCP-47. We differ in the following ways:
      • We use ISO 639-3 codes (i.e. three letters) for all languages, whereas BCP-47 uses two letters for some (e.g. en) and three letters for many.
      • We use _ (underscore) to join language, script, region whereas BCP-47 uses - (hyphen)
  • Dataset IDs (aka did in short) are standardized <group>-<name>-<version>-<lang1>-<lang2>

    • <group> can have mixed case, <name> has to be lowercase
  • The CLI now accepts dids.

  • mtdata get --dev <did> now accepts a single dataset ID; creates dev.{xxx,yyy} links at the root of out dir

  • mtdata get --test <did1> ... <did3> creates test{1..4}.{xxx,yyy} links at the root of out dir

  • --compress option to store compressed datasets under output dir

  • zip and tar files are no longer extracted; we read directly from the compressed files

  • ._lock files are removed after download job is done

  • Add JESC, jpn paracrawl, news commentary 15 and 16

  • Force unicode encoding; make it work on windows (Issue #71)

  • JW300 -> JW300_v1 (tokenized); Added JW300_v1c (raw) (Issue #70)

  • Add all Wikititle datasets from lingual tool (Issue #63)

  • progressbar: enlighten is used

  • wget is replaced with requests. User-Agent header along with mtdata version is sent in HTTP request headers

  • Paracrawl v9 added
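
To make the two stated differences from BCP-47 concrete (three-letter ISO 639-3 language codes, and underscore instead of hyphen as the joiner), here is a rough sketch. It is not the mtdata-bcp47 implementation: `to_underscore_tag` is a hypothetical name, the `ISO_639_1_TO_3` table is a tiny illustrative subset, and the sketch keeps script and region subtags unchanged instead of normalizing them.

```python
# Tiny illustrative subset; a real mapping would cover every ISO 639-1 code.
ISO_639_1_TO_3 = {"en": "eng", "de": "deu", "fr": "fra"}


def to_underscore_tag(bcp47_tag: str) -> str:
    """Rewrite a hyphenated tag as <lang>_<script>_<region> with a 3-letter language."""
    parts = bcp47_tag.split("-")
    lang = parts[0].lower()
    lang = ISO_639_1_TO_3.get(lang, lang)  # 3-letter codes pass through unchanged
    if len(lang) != 3:
        raise ValueError(f"no 3-letter code known for {parts[0]!r} in this sketch")
    # Remaining subtags (script, region) are kept as given and joined with "_".
    return "_".join([lang] + parts[1:])


if __name__ == "__main__":
    print(to_underscore_tag("en-Latn-US"))  # eng_Latn_US (output of this sketch only)
    print(to_underscore_tag("fr-CA"))       # fra_CA
```

The actual tool may produce different output for the same input, for example by omitting a language's default script.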