OpusDistillery is an end-to-end pipeline to perform systematic multilingual distillation of MT models. It is built on top of the Firefox Translations Training pipeline, originally developed within the Bergamot project, for training efficient NMT models that can run locally in a web browser.
Read our docs.
On top of that pipeline, we have added support for pre-trained OPUS-MT models, GPU utilisation tracking, and multilinguality.
- OPUS-MT models: we have added the option to simply provide the URL of an existing OPUS-MT model. Our tool is also able to select the best available OPUS-MT model per language pair.
- GPU utilisation: with the hope of moving towards greener NLP and NMT, we have added GPU utilisation tracking so that we can report the number of hours and the energy consumed by the pipeline.
- Multilinguality support: the pipeline supports training multilingual models. This covers two aspects: support for using any combination of multilingual and bilingual teachers, as well as support for multilingual student training.
This branch is based on the main branch and allows for distilling multilingual students from OPUS-MT models. The distillation scenarios that we envision and cover are the following (o2m: one2many, m2o: many2one, m2m: many2many):
ID | Configuration | Teacher | Student | Example config | Tested? |
---|---|---|---|---|---|
1 | bilingual - bilingual | en-et | en-et | Config file | y |
2 | o2m - bilingual | eng-fiu | en-et | Config file | y |
3 | o2m - o2m | eng-fiu | eng-fiu | Config file | y |
4 | m2o - bilingual | fiu-eng | et-en | Config file | y |
5 | m2o - m2o | fiu-eng | fiu-eng | Config file | n |
6 | m2m - bilingual | fiu-gmw | et-en | Config file | y |
7 | m2m - o2m | gmw-fiu | eng-fiu | Config file | y |
8 | m2m - m2o | fiu-gmw | fiu-eng | Config file | n |
9 | m2m - m2m | gmw-fiu | gmw-fiu | Config file | n |
Some things have changed in the configuration file:
- Languages: you can specify the languages you want to train with `src` and `trg` if the model is bilingual. If the model is multilingual of any kind, you need to specify `langpairs` instead; you can see how in this example.
- Multilingual configuration: you now need to specify whether the teacher, the backward model or the student is a one2many model, so that we can handle language tags appropriately. We created the `one2many-teacher`, `one2many-backward` and `one2many-student` options to handle this. You can see how in this example.
- `max-parallel-sents`: this allows you to define the maximum number of parallel sentences to download per language pair in the case of multilingual models.
- `dirname`: usually the directory structure relies on the source and target languages; in the case of a multilingual model of any kind, you can specify the name of the directory you want to use. You can see how in this example. A combined sketch of these options follows this list.
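For orientation, below is a minimal, hypothetical configuration fragment combining these options. The option names follow the descriptions above, but the placement under `experiment`, the language codes and the value formats are assumptions, so check the linked example configs for the exact syntax.

```yaml
experiment:
  name: opusmt-distill-test
  dirname: fiu-eng                # custom directory name for a multilingual setup
  langpairs:                      # used instead of src/trg for multilingual models
    - et-en
    - fi-en
  one2many-teacher: False         # teacher is many2one (fiu-eng)
  one2many-backward: True         # backward model translates eng -> fiu
  one2many-student: False         # student is many2one as well
  max-parallel-sents: 10000000    # cap on downloaded parallel sentences per language pair
```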
TO DO:
- Download different datasets per language pair: right now the same datasets are downloaded for all language pairs, and if a dataset doesn't exist for a given language pair, dummy files are created.
- Downloading monolingual datasets: the use of monolingual data is not implemented; currently only bilingual data is supported.
Not implemented:
- Multiple teachers or backward models: currently only multilingual models can be used, not several individual models.
- Multilingual teacher training: at the moment only OPUS-MT models can be used as teachers.
- Monolingual source and target data (mono src and trg) are not working.
- At the moment, if you specify an OPUS-MT model as a teacher, it will be downloaded once for every language pair you have.
We have added support for using OpusFilter, a tool for filtering and combining parallel corpora. For data filtering, instead of the default cleaning or using bicleaner, you can choose to use opusfilter with a default configuration or with a specific configuration you provide.
In the configuration file, if you want to use the default OpusFilter configuration, you can see how in this example. Otherwise, you can specify the path to a file with a specific OpusFilter configuration, such as this one. A sketch of such a configuration is shown below.
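For reference, an OpusFilter configuration might look roughly like the sketch below. The filter choices, thresholds and file names are illustrative placeholders; see the OpusFilter documentation for the full set of filters and options.

```yaml
steps:
  - type: filter
    parameters:
      inputs: [corpus.src.gz, corpus.trg.gz]        # placeholder input files
      outputs: [filtered.src.gz, filtered.trg.gz]   # placeholder output files
      filters:
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 100
        - LengthRatioFilter:
            unit: word
            threshold: 3
        - LongWordFilter:
            threshold: 40
```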
We have also added support for using OpusTrainer, a tool for curriculum training and data augmentation.
In the configuration file, you can specify a path to the OpusTrainer configuration as shown here. However, this assumes that you already know the final paths of the data as specified here.
At the moment, this is only implemented for student training. In future work, we would like to implement it for teacher and backward training as well. A sketch of an OpusTrainer configuration is shown below.
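As a rough illustration (not taken from this repository), an OpusTrainer configuration pairs named datasets with training stages and optional data-augmentation modifiers. The paths, weights and modifier probabilities below are placeholders; see the OpusTrainer documentation for the exact schema.

```yaml
datasets:
  clean: data/clean.tsv.gz        # placeholder paths to the final training data
  noisy: data/noisy.tsv.gz

stages:
  - start
  - main

start:
  - clean 0.9
  - noisy 0.1
  - until clean 1                 # train on this mix for one epoch of "clean"

main:
  - clean 0.6
  - noisy 0.4
  - until noisy 2

modifiers:
  - UpperCase: 0.05               # augmentation applied to a fraction of lines
  - TitleCase: 0.05

seed: 1111
```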
This fork makes it possible to use OPUS-MT models as teacher and backward models in the firefox-translations-training pipeline (FTT). Other additions are profiles for running jobs on CSC supercomputers (puhti, lumi and mahti) and code for monitoring the power usage of jobs.
- Added download rule for Tatoeba-Challenge data.
- Added download rule for OPUS-MT models (tested with Tatoeba-Challenge models, old models might need some changes)
- Added config parameters for specifying OPUS-MT models as teacher and/or backward model.
- Added subword segmentation and desegmentation rules.
The biggest incompatibility between OPUS-MT models and FTT is in subword segmentation: by default, FTT trains models that use the built-in sentencepiece support in Marian, while OPUS-MT models expect data to be pre-segmented. To make it possible to use both the default FTT training and pre-built OPUS-MT models, segmentation and desegmentation steps have been added around marian-specific rules. This causes some clutter, but it's probably the best solution (instead of e.g. doing the segmentation/desegmentation inside the marian scripts), since it also makes it possible to easily implement other subword segmentation methods in the workflow.
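As a sketch of what these added steps amount to (assuming the standard SentencePiece command-line tools and placeholder file names), pre-segmentation and desegmentation look roughly like this:

```bash
# Segment raw text with the OPUS-MT model's SentencePiece model before Marian sees it
spm_encode --model=source.spm < corpus.src > corpus.src.sp

# ... marian-specific rules run on the pre-segmented data ...

# Desegment Marian's output back into plain text
spm_decode --model=target.spm < output.trg.sp > output.trg
```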
FTT is based on Snakemake, which has many benefits in terms of reproducibility and existing support. Among other things, Snakemake supports HPC environments and SLURM out of the box, which should make it ideal for CSC machines. However, Snakemake also makes heavy use of conda, which has been deprecated on CSC machines due to its unsuitability for HPC file systems (https://docs.csc.fi/computing/usage-policy/#conda-installations), and FTT specifically relies on several conda environments. Fortunately, Snakemake has functionality for containerizing conda environments, so all the conda environments needed by FTT can be provided in an Apptainer container (Ftt.sif).
Containerization does not entirely solve the conda problem, since the Snakemake program itself requires conda to run. CSC provides a snakemake module, but problematically these modules are container-based, and since containers cannot be nested on CSC machines, it is not possible to use containerized conda environments with the CSC snakemake modules. This can be solved by installing Snakemake with pip (this is discouraged in the Snakemake documentation, but I have seen no problems so far).
FTT uses software that is not included in the containerized conda environments, including several marian installations and other NLP tools. These are automatically built as part of the pipeline. The Ftt.sif container includes the prerequisites for the software components. It's also possible to provide paths to separately built software installations.
- Clone the repository.
- Download the Ftt.sif container to the repository root.
- Create a virtual Python environment for Snakemake (e.g. in the parent dir of the repository):
  - The environment needs to be created with a non-containerized python, as otherwise Apptainer integration will not work. On puhti and mahti, the python executables in /usr/bin/ should work: `/usr/bin/python3.9 -m venv snakemake_env`.
  - Activate the virtual environment: `source ./snakemake_env/bin/activate`.
  - Install snakemake: `pip install snakemake`.
- Install micromamba (e.g. in the parent dir of the repository):
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
- Return to the repository directory and update Git submodules:
make git-modules
- Create a data directory (e.g. in the parent dir of the repository) and create a tmp dir in it.
- If the data directory is not located in the parent directory of the repository, edit profiles/slurm-puhti/config.yaml or profiles/slurm-mahti/config.yaml and change the bindings in the singularity-args section to point to your data directory, and also enter the data directory path as the root value of the config section.
- Edit profiles/slurm-puhti/config.cluster.yaml to change the CSC account to one you have access to.
- Load cuda modules: module load gcc/9.4.0 cuda cudnn
- Run pipeline: `make run-hpc PROFILE="slurm-puhti"` or `make run PROFILE="slurm-mahti"`
- Clone the repository.
- Download the Ftt.sif container to the repository root.
- Create a virtual Python environment for Snakemake (e.g. in the parent dir of the repository):
  - The environment needs to be created with a non-containerized python, as otherwise Apptainer integration will not work. On lumi, use the cray-python module (it is not containerized): `module load cray-python; python -m venv snakemake_env`.
  - Activate the virtual environment: `source ./snakemake_env/bin/activate`.
  - Install snakemake: `pip install snakemake`.
- Install micromamba (e.g. in the parent dir of the repository):
curl -Ls https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba
- Return to the repository directory and update Git submodules:
make git-modules
- Create a data directory (e.g. in the parent dir of the repository) and create a tmp dir in it.
- If the data directory is not located in the parent directory of the repository, edit profiles/slurm-lumi/config.yaml and change the bindings in the singularity-args section to point to your data directory, and also enter the data directory path as the root value of the config section.
- Edit profiles/slurm-lumi/config.cluster.yaml to change the CSC account to one you have access to.
- Load rocm module: module load rocm.
- Copy the marian executables to 3rd_party/lumi-marian/build (compiling lumi-marian is currently hacky, so this workaround makes things easier).
- Enter export SINGULARITYENV_LD_LIBRARY_PATH=$LD_LIBRARY_PATH to make sure Marian can find all the libraries when it runs containerized.
- Run pipeline:
make run-hpc PROFILE="slurm-lumi"
Since running the whole pipeline for a high-resource language pair will take a long time, there is a test config available for checking that everything works as it should. The test config is used by default; you can switch to the full config by modifying the Makefile and changing config.opusmt-test.yml to config.opusmt.yml. You can also provide the config on the command line as the CONFIG parameter with make. Note that even the test config will take a long time if the training corpus is large (since translating the training data takes time), so for a quick functionality check, pick a language pair with as little data as possible in Tatoeba-Challenge (while still having trained forward and backward models). The default epo-afr is good for a quick check (although note that the bicleaner step will be skipped, as there are no bicleaner packs for those languages).
You can test the pipeline without running it by using make dry-run. If you want to build a specific file or rule, you can use the TARGET parameter with make.
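For example (the config path below assumes the configs live in the `configs/` directory of your checkout):

```bash
# Check what the workflow would do without running anything
make dry-run

# Run with the full config instead of the default test config
make run CONFIG="configs/config.opusmt.yml"

# Build only a specific rule or file
make run TARGET=merge_corpus
```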
Training pipelines for Firefox Translations machine translation models. The trained models are hosted in firefox-translations-models, are compatible with bergamot-translator and can be used by the firefox-translations web extension. This work is part of the Bergamot project, which focuses on improving client-side machine translation in a web browser.
The pipeline is capable of training a translation model for a language pair end to end. Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters. Some settings, especially for low-resource languages, might require extra tuning.
It uses the fast Marian translation engine and the Snakemake framework for workflow management and parallelization.
High level overview post on Mozilla Hacks.
Check out the Model training guide in the wiki for practical advice on how to train models using the pipeline.
- Ubuntu 18.04 (it can work on other Linux distributions, but might require fixes to the `setup` scripts; see more details in marian installation instructions).
- One or several Nvidia GPUs with CUDA drivers installed and at least 8 GB of memory.
- CUDNN installed
- At least 16 CPU cores (some steps of the pipeline utilize multiple cores pretty well, so the more the better).
- 64 GB RAM (128 GB+ might be required for bigger datasets)
- 200+ GB of disk space (mostly for datasets and transformations). It depends on the chosen datasets and can be significantly higher.
It was tested on:
- Ubuntu 18.04
- 56 core Xeon server
- 128 GB of RAM
- x8 NVIDIA RTX 2080 GPUs with 12 GB of memory
- CUDA 11.2
- 100 GB of local disk space
- Many terabytes of NFS mounted storage
- Slurm cluster with CPU and Nvidia GPU nodes
- CUDA 11.2 ( it was also tested on 11.5)
- CUDNN library installed
- Singularity module if running with containerization (recommended)
- If running without containerization, there is no procedure to configure the environment automatically.
All the required modules (for example `parallel`) should be preinstalled and loaded in ~/.bashrc
It was tested on Mozilla Slurm cluster using Singularity containers. The pipeline can also be launched on CSD3 HPC but it was not fully tested.
Snakemake workflows can work on Kubernetes, Google Cloud Life Sciences and other cloud platforms. The pipeline was not tested in this mode and might require modification.
Please refer to Cloud execution section of Snakemake documentation.
It is also possible to deploy Slurm cluster in the cloud. For example, using Slurm on Google Cloud Platform.
- Clone the repo:
git clone https://github.com/mozilla/firefox-translations-training.git
cd firefox-translations-training
- Choose a Snakemake profile from `profiles/` or create a new one
- Adjust paths in the `Makefile` if needed and set the `PROFILE` variable to the name of your profile
- Adjust Snakemake and workflow settings in `profiles/<profile>/config.yaml`, see the Snakemake CLI reference for details
- Configure experiment and datasets in `configs/config.prod.yml` (or `configs/config.test.yml` for a test run)
- Change source code if needed for the experiment
- (Cluster mode) Adjust cluster settings in the cluster profile.
  For `slurm-moz`: `profiles/slurm-moz/config.cluster.yml`.
  You can also modify `profiles/slurm-moz/submit.sh` or create a new Snakemake profile.
- (Cluster mode) It might require further tuning of requested resources in the `Snakemake` file (see the sketch after this list):
  - Use `threads` for a rule to adjust parallelism
  - Use `resources: mem_mb=<memory>` to adjust total memory requirements per task (the default is set in `profiles/slurm-moz/config.yaml`)
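As an illustration of those two settings, a rule can be tuned like the hypothetical sketch below (the rule name, file names and command are placeholders, not taken from the actual workflow):

```python
# Hypothetical Snakemake rule showing where threads and memory are adjusted
rule train_teacher:
    input: "data/corpus.src.gz", "data/corpus.trg.gz"
    output: "models/teacher/model.npz"
    threads: 16                  # parallelism available to this rule
    resources: mem_mb=64000      # total memory requested per task
    shell: "train.sh {input} {output} {threads}"   # placeholder command
```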
See also Snakemake installation
- Install Mamba - fast Conda package manager
make conda
- Install Snakemake
make snakemake
- Update git submodules
make git-modules
- (Optional) Install Singularity if running with containerization
Local mode: See Singularity installation, requires root
Cluster mode:
For example,
module load singularity
but the way to load Singularity depends on cluster installation
- (Optional) Prepare a container image if using Singularity
Either pull the prebuilt image:
make pull
Or build it (requires root):
make build
Dry run first to check that everything was installed correctly:
make dry-run
To run the pipeline:
make run
To test the whole pipeline end to end (it is supposed to run relatively quickly and does not train anything useful):
make test
You can also run a specific profile or config by overriding variables from the Makefile:
make run PROFILE=slurm-moz CONFIG=configs/config.test.yml
By default, all Snakemake rules are executed. To run the pipeline up to a specific rule use:
make run TARGET=<non-wildcard-rule-or-path>
For example, collect corpus first:
make run TARGET=merge_corpus
You can also use the full file path, for example:
make run TARGET=/models/ru-en/bicleaner/teacher-base0/model.npz.best-ce-mean-words.npz
If you want to rerun a specific step or steps, you can delete the result files that are expected in the Snakemake rule output.
Snakemake might complain about a missing file and suggest running it with the --clean-metadata flag. In this case run:
make clean-meta TARGET=<missing-file-name>
and then as usual:
make run
To create a Snakemake html report, run:
make report
See Directory Structure section.
The main directories inside `SHARED_ROOT` are:
- `data/<lang_pair>/<experiment>` - data produced by the pipeline jobs
- `logs/<lang_pair>/<experiment>` - logs of the jobs for troubleshooting
- `experiments/<lang_pair>/<experiment>` - saved experiment settings for future reference
- `models/<lang_pair>/<experiment>` - all models produced by the pipeline. The final compressed models are in the `exported` folder:
/models/ru-en/test/exported/model.ruen.intgemm.alphas.bin.gz
/models/ru-en/test/exported/lex.50.50.ruen.s2t.bin.gz
/models/ru-en/test/exported/vocab.ruen.spm.gz
The steps are based on the train-student recipe.
Step | Description | Bottleneck | Comments |
---|---|---|---|
Installation | Installing dependencies and compiling | CPU | Takes ~1 hour |
Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size, sampling of huge mono datasets (100M+ sentences) is the most intensive operation. |
Data cleaning | Basic preprocessing, dataset specific, language specific, rule based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient add it to clean_parallel.py. |
Bicleaner | Filters noisy sentence pairs in a parallel corpus using bicleaner or bicleaner-ai depending on available language packs. | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are none for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset, see the Dataset cleaning section. |
Merge and dedupe | Merges clean datasets and applies deduplication | CPU, Disk | |
Training vocabulary | Trains a SentencePiece vocabulary/tokenizer model on the parallel corpus. | CPU | |
Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a marian example. |
Augmentation with back-translations | Translates a mono corpus combined from monolingual datasets in the target language using the shallow s2s model. | GPU | It is more useful for low-resource languages and can be skipped for others. |
Training teacher | Trains an ensemble of big transformer models on the augmented dataset | GPU | You might want to adjust early stopping or `after-epochs` parameters depending on dataset size. |
Fine-tuning teacher | Continues training the ensemble of teachers on parallel data only | GPU | You might want to adjust early stopping parameters depending on dataset size. |
Translation by teacher | Translates a corpus and monolingual data combined from the configurable `dataset.mono-src` using the ensemble of teacher models | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by using multiple nodes in cluster mode. |
Cross-entropy filtering | Scores the translated corpus with the backward s2s model and removes the part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point we work with huge datasets. Very disk intensive. |
Training alignments and shortlist | Trains alignments using fast_align and extracts a lexical shortlist using the extract_lex tool | CPU, Disk | Some tools require uncompressed datasets on disk and they are huge at this point. Good CPU parallelization. |
Training student | Trains a small transformer student model on filtered data and using alignments. Shuffling in RAM might fail if the dataset is huge and there's not enough RAM on the machine, so it's recommended to remove it and use the `shuffle: batches` marian setting (see issue). | GPU | |
Fine-tuning student | Fine-tunes the student model by emulating 8-bit GEMM during training | GPU | Converges very quickly and then degrades. It's quick but you might want to reduce the early stopping threshold. |
Quantization | Applies 8-bit quantization to the fine-tuned student model and runs evaluation on CPU | CPU | CPU threads must be set to 1 for this step. |
Evaluation | Calculates metrics for all models (BLEU, chrf) using SacreBLEU | GPU | Uses the `datasets.test` configuration section. |
Export | Exports the trained model and shortlist to bergamot-translator format | | |
Dataset importers can be used in `datasets` sections of the config.
Example:
train:
- opus_ada83/v1
- mtdata_newstest2014_ruen
Data source | Prefix | Name examples | Type | Comments |
---|---|---|---|---|
MTData | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run mtdata list -l ru-en to see datasets for a specific language pair. |
OPUS | opus | ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, check links under Moses column to see what names and version is used in a link. |
SacreBLEU | sacrebleu | wmt20 | corpus | Official evaluation datasets available in SacreBLEU tool. Recommended to use in datasets:test config section. Look up supported datasets and language pairs in sacrebleu.dataset python module. |
Flores | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages. |
Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz" |
Paracrawl | paracrawl-mono | paracrawl8 | mono | Datasets that are crawled from the web. Only mono datasets are used in this importer. Parallel corpus is available using opus importer. |
News crawl | news-crawl | news.2019 | mono | Some news monolingual datasets from WMT21 |
Common crawl | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on WMT21 |
Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz" |
You can also use find-corpus tool to find all datasets for an importer and get them formatted to use in config.
conda env create -f envs/corpus.yml
conda activate corpus
python utils/find-corpus.py en ru opus
python utils/find-corpus.py en ru mtdata
python utils/find-corpus.py en ru sacrebleu
Make sure to check licenses of the datasets before using them.
Just add a shell script to the `corpus` or `mono` folder, named `<prefix>.sh`, that accepts the same parameters as the other scripts in the same folder; a hypothetical skeleton is sketched below.
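The sketch below is only a guess at what such a script looks like: the argument order, paths and download command are placeholders, so copy the interface of an existing importer from the same folder instead of relying on this.

```bash
#!/bin/bash
# Hypothetical importer skeleton; mirror an existing importer for the real interface.
set -x
set -euo pipefail

src=$1             # source language code (placeholder argument order)
trg=$2             # target language code
output_prefix=$3   # importers are expected to write <output_prefix>.<lang>.gz files
dataset=$4         # dataset name from the config (the part after the prefix)

# Placeholder download and conversion; a real importer fetches its own source format
wget -qO corpus.tsv.gz "https://example.com/${dataset}.${src}-${trg}.tsv.gz"
zcat corpus.tsv.gz | cut -f1 | gzip > "${output_prefix}.${src}.gz"
zcat corpus.tsv.gz | cut -f2 | gzip > "${output_prefix}.${trg}.gz"
```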
Some datasets require fixes like detokenization. Dataset and language specific fixes are implemented in pipeline/clean/fixes. Naming convention:
- `<dataset_name>.sh` for parallel dataset cleaning
- `<dataset_name>.<lang>.sh` for language specific cleaning of a parallel or monolingual dataset
- `/` in a dataset name should be replaced with `_`
Some parallel datasets require more aggressive filtering.
Dataset specific Bicleaner thresholds can be set in the config. A threshold of `0` means skipping filtering entirely (useful for Paracrawl).
Example:
experiment:
...
bicleaner:
default-threshold: 0.5
dataset-thresholds:
opus_ParaCrawl/v8: 0
mtdata_neulab_tedtalksv1_train: 0.6
To see training graphs run tensorboard:
make install-tensorboard
make tensorboard
Then port forward 6006.
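For example, from your local machine (the user and host names are placeholders):

```bash
# Forward the remote TensorBoard port 6006 to localhost:6006
ssh -L 6006:localhost:6006 user@remote-host
```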
├ data
│ └ ru-en
│ └ test
│ ├ original
│ │ ├ corpus
│ │ │ ├ mtdata_JW300.en.gz
│ │ │ └ mtdata_JW300.ru.gz
│ │ ├ devset
│ │ │ ├ flores_dev.en.gz
│ │ │ └ flores_dev.ru.gz
│ │ ├ eval
│ │ │ ├ sacrebleu_wmt20.en.gz
│ │ │ └ sacrebleu_wmt20.ru.gz
│ │ ├ mono
│ │ │ ├ news-crawl_news.2020.ru.gz
│ │ │ └ news-crawl_news.2020.en.gz
│ │ ├ devset.ru.gz
│ │ └ devset.en.gz
│ ├ clean
│ │ ├ corpus
│ │ │ ├ mtdata_JW300.en.gz
│ │ │ └ mtdata_JW300.ru.gz
│ │ ├ mono
│ │ │ ├ news-crawl_news.2020.ru.gz
│ │ │ └ news-crawl_news.2020.en.gz
│ │ ├ mono.ru.gz
│ │ └ mono.en.gz
│ ├ biclean
│ │ ├ corpus
│ │ │ ├ mtdata_JW300.en.gz
│ │ │ └ mtdata_JW300.ru.gz
│ │ ├ corpus.ru.gz
│ │ ├ corpus.en.gz
│ ├ translated
│ │ ├ mono.ru.gz
│ │ └ mono.en.gz
│ ├ augmented
│ │ ├ corpus.ru.gz
│ │ └ corpus.en.gz
│ ├ alignment
│ │ ├ corpus.aln.gz
│ │ └ lex.s2t.pruned.gz
│ ├ merged
│ │ ├ corpus.ru.gz
│ │ └ corpus.en.gz
│ └ filtered
│ ├ corpus.ru.gz
│ └ corpus.en.gz
├ models
│ └ ru-en
│ └ test
│ ├ backward
│ ├ teacher-base0
│ ├ teacher-base1
│ ├ teacher-finetuned0
│ ├ teacher-finetuned1
│ ├ student
│ ├ student-finetuned
│ ├ speed
│ ├ evaluation
│ │ ├ backward
│ │ ├ teacher-base0
│ │ ├ teacher-base1
│ │ ├ teacher-finetuned0
│ │ ├ teacher-finetuned1
│ │ ├ teacher-ensemble
│ │ ├ student
│ │ ├ student-finetuned
│ │ └ speed
│ └ exported
│
├ experiments
│ └ ru-en
│ └ test
│ └ config.sh
├ logs
│ └ ru-en
│ └ test
│ └ clean_corpus.log
All steps are independent and contain scripts that accept arguments, read input files from disk and output the results to disk. This allows writing the steps in any language (currently it's historically mostly bash and Python) and representing the pipeline as a directed acyclic graph (DAG).
Snakemake workflow manager infers the DAG implicitly from the specified inputs and outputs of the steps. The workflow manager checks which files are missing and runs the corresponding jobs either locally or on a cluster depending on the configuration.
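For illustration, a minimal pair of hypothetical rules shows how the DAG falls out of matching inputs and outputs (file names and commands are placeholders):

```python
# Snakemake infers that "clean" must run before "train" because
# train's input matches clean's output.
rule clean:
    input: "data/original/corpus.gz"
    output: "data/clean/corpus.gz"
    shell: "clean.sh {input} {output}"   # placeholder command

rule train:
    input: "data/clean/corpus.gz"
    output: "models/model.npz"
    shell: "train.sh {input} {output}"   # placeholder command
```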
Snakemake parallelizes steps that can be executed simultaneously. It is especially useful for teacher ensemble training and translation.
The main Snakemake process (scheduler) should be launched interactively. It runs job processes on the worker nodes in cluster mode or on a local machine in local mode.
- Scripts inside the `pipeline` directory are independent and operate only using input arguments, input files and global envs.
- All scripts test expected environment variables early.
- If a script step fails, it can be safely retried.
- Ideally, every script should start from the last unfinished step, checking presence of intermediate results of previous steps.
- A script fails as early as possible.
- Maximum bash verbosity is set for easy debugging.
- Input data is always read only.
- Output data is placed in a new folder for script results.
- It is expected that the specified output folder might not exist and should be created by the script.
- A script creates a folder for intermediate files and cleans it in the end unless intermediate files are useful for retries.
- Global variables are upper case, local variables are lower case.
- Scripts should utilize resources provided by Snakemake (number of threads, memory). A minimal sketch of these conventions follows this list.
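The header below is only an illustrative sketch of these conventions; variable names and paths are placeholders, and real pipeline scripts differ in detail.

```bash
#!/bin/bash
set -x                 # maximum bash verbosity for easy debugging
set -euo pipefail      # fail as early as possible

# Test expected environment variables early
test -v SRC
test -v TRG

corpus=$1              # read-only input
output_dir=$2          # output folder; may not exist yet
mkdir -p "${output_dir}"

tmp_dir="${output_dir}/tmp"   # folder for intermediate files
mkdir -p "${tmp_dir}"

# ... do the work, reading ${corpus} and writing results to ${output_dir} ...

rm -rf "${tmp_dir}"    # clean intermediate files unless they are useful for retries
```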
Here is a list of selected publications on which the training pipeline is based. You can find more relevant publications on the Bergamot project website.
- V. M. Sánchez-Cartagena, M. Bañón, S. Ortiz-Rojas and G. Ramírez-Sánchez, "Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task", in Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers. Brussels, Belgium: Association for Computational Linguistics, October 2018
- Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón and Sergio Ortiz Rojas, "Bifixer and Bicleaner: two open-source tools to clean your parallel data", in Proceedings of the 22nd Annual Conference of the European Association for Machine Translation. Lisboa, Portugal: European Association for Machine Translation, November 2020
- Mölder F, Jablonski KP, Letcher B, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33. Published 2021 Jan 18. doi:10.12688/f1000research.29032.2
- Edinburgh's Submissions to the 2020 Machine Translation Efficiency Task (Bogoychev et al., NGT 2020)
- From Research to Production and Back: Ludicrously Fast Neural Machine Translation (Kim et al., EMNLP 2019)
- The University of Edinburgh's Submissions to the WMT19 News Translation Task (Bawden et al., 2019)
- Jörg Tiedemann, 2012, Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)
- The University of Edinburgh's Neural MT Systems for WMT17, Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. In Proceedings of the EMNLP 2017 Second Conference on Machine Translation (WMT17), 2017
- Marian: Fast Neural Machine Translation in C++, Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch
- Improving Neural Machine Translation Models with Monolingual Data, Rico Sennrich, Barry Haddow, Alexandra Birch, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016
- A Call for Clarity in Reporting BLEU Scores (Post, 2018)
- The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation, Facebook
- Many-to-English Machine Translation Tools, Data, and Pretrained Models (Gowda et al., ACL 2021)
- Chris Dyer, Victor Chahuneau, and Noah A. Smith. (2013). A Simple, Fast, and Effective Reparameterization of IBM Model 2. In Proc. of NAACL
- Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., ACL 2016)
- Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates (Taku Kudo, 2018)