
AmuNMT for Automatic Post Editing


The Winning System of the WMT 2016 APE Shared Task

This page provides the data and model files for our winning APE system, described in Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing. If you use any of the data, systems, or ideas, please cite:

@InProceedings{junczysdowmunt-grundkiewicz:2016:WMT,
   author    = {Junczys-Dowmunt, Marcin  and  Grundkiewicz, Roman},
   title     = {Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing},
   booktitle = {Proceedings of the First Conference on Machine Translation},
   month     = {August},
   year      = {2016},
   address   = {Berlin, Germany},
   publisher = {Association for Computational Linguistics},
   pages     = {751--758},
   url       = {http://www.aclweb.org/anthology/W16-2378}
}

Artificially created data

Download the training data (514M)

This file contains the artificially generated post-editing triplets described in Table 1 of the paper. "4M" is the larger set, denoted "round-trip.n10" in that table; "500K" is the smaller set, denoted "round-trip.n1". The original shared-task training data, which we oversampled 20 times, is not included, but it can be obtained from the original shared task page.

data
├── 4M
│   ├── 4M.mt
│   ├── 4M.pe
│   └── 4M.src
└── 500K
    ├── 500K.mt
    ├── 500K.pe
    └── 500K.src
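
The three files in each set are line-aligned: line i of 4M.src, 4M.mt, and 4M.pe together forms one (source, MT output, post-edit) triplet. A quick way to inspect the data after downloading (a shell sketch; paths as in the listing above):

# print the first triplet, one field per file
head -n 1 data/4M/4M.src data/4M/4M.mt data/4M/4M.pe

# view triplets side by side, tab-separated
paste data/4M/4M.src data/4M/4M.mt data/4M/4M.pe | head -n 3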

Models and config files

Download the systems (2.7G)

We also provide the complete primary system and the two contrastive variants. To recreate the submitted output, edit the Makefile and set the path to the main directory of your compiled AmuNMT checkout (latest master, see the AmuNMT Readme) in the following line:

AMUNMT=/home/marcinj/Badania/amunmt 

Next, run make. The included files provide all the input files, model files, and scripts needed to reproduce our exact submission. At the end you should see the three submission files:

AMU_ensemble8-mt+src_PRIMARY
AMU_ensemble4-mt_CONTRASTIVE
AMU_ensemble4-src_CONTRASTIVE
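
For example (a sketch; consult the AmuNMT Readme for the exact build steps and prerequisites, and note that the AMUNMT variable can also be overridden on the make command line instead of editing the Makefile):

# fetch and compile AmuNMT (see its Readme for prerequisites)
git clone https://github.com/emjotde/amunmt
mkdir -p amunmt/build && cd amunmt/build && cmake .. && make -j4 && cd ../..

# produce the three submission files
cd system
make AMUNMT=$(realpath ../amunmt)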

In the future we will provide more hints on how to train a similar system. Currently we supply the following files:

system
├── data
│   ├── de.bpe
│   ├── en.bpe
│   ├── true.de
│   └── true.en
├── Makefile
├── models
│   ├── configs
│   │   ├── mt-pe.ensemble4.tuned.yml
│   │   ├── mtsrc-pe.ensemble.ape.tuned.yml
│   │   └── src-pe.ensemble4.yml
│   ├── mt-pe
│   │   ├── model.iter260000.npz
│   │   ├── model.iter270000.npz
│   │   ├── model.iter280000.npz
│   │   ├── model.iter290000.npz
│   │   ├── vocab.mt.json
│   │   └── vocab.pe.json
│   └── src-pe
│       ├── model.iter340000.npz
│       ├── model.iter350000.npz
│       ├── model.iter360000.npz
│       ├── model.iter370000.npz
│       ├── vocab.pe.json
│       └── vocab.src.json
├── scripts
│   ├── apply_bpe.py
│   ├── deescape-special-chars.perl
│   ├── detruecase.perl
│   ├── escape-special-chars.perl
│   ├── prepare_submission.py
│   └── truecase.perl
└── test
    ├── test.mt
    └── test.src

where data contains the truecasing models and BPE codes. models/configs provides the configuration files for amun that load the model ensembles located in mt-pe (monolingual models, trained on MT output and post-editing data) and src-pe (bilingual models, trained on source and post-editing data).
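
As a rough end-to-end sketch of how these pieces fit together (the preprocessing order and the amun invocation are assumptions reconstructed from the included scripts and configs; the Makefile remains the authoritative recipe):

# preprocess the German MT output: truecase, escape, BPE-segment (order assumed)
scripts/truecase.perl --model data/true.de < test/test.mt \
  | scripts/escape-special-chars.perl \
  | python scripts/apply_bpe.py -c data/de.bpe > test.mt.prep

# decode with the tuned 4-model mt-pe ensemble (binary location may differ)
$AMUNMT/build/bin/amun -c models/configs/mt-pe.ensemble4.tuned.yml < test.mt.prep > out.bpe

# postprocess: undo BPE, unescape, detruecase
sed 's/@@ //g' < out.bpe \
  | scripts/deescape-special-chars.perl \
  | scripts/detruecase.perl > out.pe

The primary mt+src system additionally consumes the preprocessed English source (using true.en and en.bpe); see the mtsrc-pe config and the Makefile for how the two input streams are combined.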
