Training pipelines for Firefox Translations machine translation models. The trained models are hosted in the firefox-translations-models repository, are compatible with bergamot-translator and can be used by the firefox-translations web extension. This work is a part of the Bergamot project, which focuses on improving client-side machine translation in a web browser.
The pipeline is capable of training a translation model for a language pair end to end. Translation quality depends on the chosen datasets, data cleaning procedures and hyperparameters. Some settings, especially for low-resource languages, might require extra tuning.
It uses the fast Marian translation engine and the Snakemake framework for workflow management and parallelization.
- Ubuntu 18.04 (it can work on other Linux distributions, but might require fixes to the setup scripts; see the Marian installation instructions for more details)
- One or several Nvidia GPUs with CUDA drivers installed and at least 8 GB of memory
- CUDNN installed
- At least 16 CPU cores (some steps of the pipeline parallelize well across multiple cores, so the more the better)
- 64 GB of RAM (128 GB+ might be required for bigger datasets)
- 200+ GB of disk space, mostly for datasets and their transformations; the exact amount depends on the chosen datasets and can be significantly higher
It was tested on:
- Ubuntu 18.04
- 56 core Xeon server
- 128 GB of RAM
- 8x NVIDIA RTX 2080 GPUs with 12 GB of memory
- CUDA 11.2
- 100 GB of local disk space
- Many terabytes of sshfs mounted storage
- Slurm cluster with CPU and Nvidia GPU nodes
- CUDA 11.2 (it was also tested on 11.5)
- CUDNN library installed
- Singularity module if running with containerization (recommended)
- If running without containerization, there is no procedure to configure the environment automatically.
  All the required modules (for example `parallel`) should be preinstalled and loaded in `~/.bashrc`.
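For illustration, a hypothetical `~/.bashrc` fragment that preloads such modules; the module names and versions are assumptions and depend on the cluster installation:

```bash
# Illustrative only: module names and versions depend on the cluster installation.
module load parallel        # GNU parallel, used by some pipeline steps
module load cuda/11.2       # CUDA toolkit matching the requirements above
module load cudnn           # CUDNN library
module load singularity     # only if running with containerization
```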
It was tested on a Mozilla Slurm cluster using Singularity containers. The pipeline can also be launched on the CSD3 HPC, but the main issue there is the time limit on long-running jobs; increasing the number of retries can help.
Snakemake workflows can run on Kubernetes, Google Cloud Life Sciences and other cloud platforms. The pipeline was not tested in this mode and might require modifications.
Please refer to the Cloud execution section of the Snakemake documentation.
It is also possible to deploy a Slurm cluster in the cloud, for example using Slurm on Google Cloud Platform.
- Clone the repo:

  ```
  git clone https://github.com/mozilla/firefox-translations-training.git
  cd firefox-translations-training
  ```
- Adjust environment settings in the `Makefile` (an example of overriding these settings on the command line is shown after this list):
  - Configure the paths to the data storage `SHARED_ROOT` and the CUDA libraries `CUDA_DIR`
  - Adjust `NUM_GPUS` (the number of GPUs per task that requires a GPU) and `WORKSPACE` (GPU memory pre-allocation for Marian)
  - (Optional) Set `GPUS` to select specific GPUs for local mode
  - (Optional) Choose a config file to use (the default is `configs/config.prod.yml`)
  - (Cluster mode) Adjust `CLUSTER_CORES` (the number of CPU cores on one cluster machine)
  - (Cluster mode) Use an appropriate `SLURM_PROFILE` from `profiles/`
- Configure the experiment and datasets in `configs/config.prod.yml` (or `configs/config.test.yml` for a test run)
- Change the source code if needed for the experiment
- (Cluster mode) Adjust Snakemake and cluster settings in the cluster profile.
  For `slurm-moz`: `profiles/slurm-moz/config.yml` and `profiles/slurm-moz/config.cluster.yml`.
  You can also modify `profiles/slurm-moz/submit.sh` or create a new Snakemake profile.
- (Cluster mode) It might require further tuning of requested resources in the `Snakefile`:
  - Use `threads` for a rule to adjust parallelism
  - Use `resources: mem_mb=<memory>` to adjust total memory requirements per task (the default is set in `profile/slurm-moz/config.yaml`)
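Instead of editing the `Makefile`, the same variables can usually be overridden per invocation (standard GNU Make behaviour). A minimal sketch, using the variable names listed above; the paths and values are examples only, not the repository defaults:

```bash
# Sketch: overriding Makefile settings on the command line instead of editing the file.
# Paths and values below are examples, not the repository defaults.
make dry-run \
  SHARED_ROOT=/data/shared \
  CUDA_DIR=/usr/local/cuda \
  NUM_GPUS=4 \
  WORKSPACE=8000 \
  GPUS="0 1 2 3"
```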
See also the Snakemake installation documentation.

- Install Mamba, a fast Conda package manager:

  ```
  make conda
  ```

- Install Snakemake:

  ```
  make snakemake
  ```

- Update git submodules:

  ```
  make git-modules
  ```

- (Optional) Install Singularity if running with containerization.

  Local mode: see the Singularity installation docs (requires root).

  Cluster mode: the way to load Singularity depends on the cluster installation, for example:

  ```
  module load singularity
  ```

- (Optional) Prepare a container image if using Singularity.

  Either pull the prebuilt image:

  ```
  make pull
  ```

  Or build it (requires root):

  ```
  make build
  ```
Dry run first to check that everything was installed correctly:

```
make dry-run
```

To run the pipeline locally without containerization:

```
make run-local
```

To test the whole pipeline end to end (it is supposed to run quickly and does not train anything useful):

```
make test
```

To run locally with containerization:

```
make run-local-container
```

To run on Slurm without containerization:

```
make run-slurm
```

With containerization (recommended):

```
make run-slurm-container
```
By default, all Snakemake rules are executed. To run the pipeline up to a specific rule, use:

```
make <run-command> TARGET=<non-wildcard-rule-or-path>
```

For example, to collect the corpus first:

```
make run-local TARGET=merge_corpus
```

You can also use a full file path, for example:

```
make <run-command> TARGET=/models/ru-en/bicleaner/teacher-base0/model.npz.best-ce-mean-words.npz
```

If you want to rerun a specific step or steps, delete the result files that are expected in the Snakemake rule output.
Snakemake might complain about a missing file and suggest running it with the `--cleanup-metadata` flag. In this case run:

```
make clean-meta TARGET=<missing-file-name>
```

and then as usual:

```
make <run-command>
```

To create a Snakemake HTML report, run:

```
make report
```
See the `Snakefile` for directory structure documentation.

The main directories inside `SHARED_ROOT` are:

- `data/<lang_pair>/<experiment>` - data produced by the pipeline jobs
- `logs/<lang_pair>/<experiment>` - logs of the jobs for troubleshooting
- `experiments/<lang_pair>/<experiment>` - saved experiment settings for future reference
- `models/<lang_pair>/<experiment>` - all models produced by the pipeline; the final compressed models are in the `exported` folder

For example, the exported models of the `ru-en` `test` experiment:

```
/models/ru-en/test/exported/model.ruen.intgemm.alphas.bin.gz
/models/ru-en/test/exported/lex.50.50.ruen.s2t.bin.gz
/models/ru-en/test/exported/vocab.ruen.spm.gz
```
The steps are based on the train-student recipe.

Step | Description | Bottleneck | Comments |
---|---|---|---|
Installation | Installing dependencies and compiling | CPU | Takes ~1 hour |
Data downloading | Downloads datasets, samples sentences | Network, Disk | Time depends on dataset size; sampling of huge mono datasets (100M+ sentences) is the most intensive operation. |
Data cleaning | Basic preprocessing, dataset-specific, language-specific, rule-based and other attempts to clean noisy data in parallel and mono datasets | CPU | Good parallelization across CPU cores. To make cleaning of a new language more efficient, add it to clean_parallel.py. |
Bicleaner | Filters noisy sentence pairs in a parallel corpus using bicleaner or bicleaner-ai depending on available language packs | CPU, GPU | If there are no pretrained language packs for bicleaner-ai, it uses bicleaner. If there are none for bicleaner either, this step is skipped. Cleaning thresholds are configurable per dataset; see the Dataset cleaning section. |
Merge and dedupe | Merges clean datasets and applies deduplication | CPU, Disk | |
Training s2s | Trains a backward shallow s2s model, which is useful for back-translations and ce-filtering | GPU | Inspired by a marian example. |
Augmentation with back-translations | Translates a mono corpus combined from monolingual datasets in the target language using the shallow s2s model | GPU | It is more useful for low-resource languages and can be skipped for others. |
Training teacher | Trains an ensemble of big transformer models on the augmented dataset | GPU | You might want to adjust the early stopping or after-epochs parameters depending on the dataset size. |
Fine-tuning teacher | Continues training the ensemble of teachers on parallel data only | GPU | You might want to adjust the early stopping parameters depending on the dataset size. |
Translation by teacher | Translates a corpus and monolingual data combined from MONO_DATASETS_SRC using the teacher model (ensemble is not supported yet) | GPU | The slowest part of the pipeline. Can take days. It is possible to speed it up by launching the same scripts (corpus, mono) in parallel from another machine with access to the same network directory. |
Cross-entropy filtering | Scores the translated corpus with the backward s2s model and removes the part of the corpus with the lowest scores to reduce noise | GPU, CPU, Disk | At this point the datasets are huge, so data is copied to a local disk to make things faster. |
Training alignments and shortlist | Trains alignments using fast_align and extracts a lexical shortlist using the extract_lex tool | CPU, Disk | Some tools require uncompressed datasets on disk, and they are huge at this point. Data is copied to a local disk to make things faster. Might take 100+ GB of local disk space depending on the dataset size. Good CPU parallelization. |
Training student | Trains a small transformer student model on filtered data and using alignments | GPU | |
Fine-tuning student | Fine-tunes the student model by emulating 8-bit GEMM during training | GPU | Converges very quickly and then degrades. It's quick, but you might want to reduce the early stopping threshold. |
Quantization | Applies 8-bit quantization to the fine-tuned student model and evaluates on CPU | CPU | CPU threads must be set to 1 for this step. |
Evaluation | Calculates metrics for all models (BLEU, chrF) using SacreBLEU | GPU | Uses the datasets.test configuration section. |
Export | Exports the trained model and shortlist to the [bergamot-translator](https://github.com/mozilla/bergamot-translator) format | | |
Dataset importers can be used in the `datasets` sections of the experiment config.

Example:

```
  train:
    - opus_ada83/v1
    - mtdata_newstest2014_ruen
```
Data source | Prefix | Name examples | Type | Comments |
---|---|---|---|---|
MTData | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run mtdata list -l ru-en to see datasets for a specific language pair. |
OPUS | opus | ParaCrawl/v7.1 | corpus | Many open-source datasets. Go to the website, choose a language pair, and check the links under the Moses column to see which names and versions are used in a link. |
SacreBLEU | sacrebleu | wmt20 | corpus | Official evaluation datasets available in SacreBLEU tool. Recommended to use in TEST_DATASETS . Look up supported datasets and language pairs in sacrebleu.dataset python module. |
Flores | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages. |
Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz" |
Paracrawl | paracrawl-mono | paracrawl8 | mono | Datasets that are crawled from the web. Only mono datasets are used in this importer. Parallel corpus is available using opus importer. |
News crawl | news-crawl | news.2019 | mono | Some news monolingual datasets from WMT21 |
Common crawl | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on WMT21 |
Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz" |
You can also use the find-corpus tool to find all datasets for an importer and get them formatted for use in the config:

```
conda env create -f envs/corpus.yml
conda activate corpus
python utils/find-corpus.py en ru opus
```
To add a new importer, just add a shell script to the `corpus` or `mono` folder that is named `<prefix>.sh` and accepts the same parameters as the other scripts in the same folder (see the sketch below).
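As an illustration, a hypothetical custom importer; the parameter order and output naming below are assumptions, so mirror an existing script from the same folder for the real interface:

```bash
#!/bin/bash
# Hypothetical importer sketch (e.g. saved as my-source.sh). The parameter order
# and output naming are assumptions for illustration; copy an existing importer
# from the same folder to get the actual interface.
set -euo pipefail

src=$1            # source language code, e.g. "ru"          (assumed)
trg=$2            # target language code, e.g. "en"          (assumed)
output_prefix=$3  # write <output_prefix>.<lang>.gz files    (assumed)
dataset=$4        # dataset name as specified in the config  (assumed)

# Download the data and write gzipped one-sentence-per-line files.
wget -qO- "https://example.com/${dataset}.${src}.txt" | gzip >"${output_prefix}.${src}.gz"
wget -qO- "https://example.com/${dataset}.${trg}.txt" | gzip >"${output_prefix}.${trg}.gz"
```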
Some datasets require fixes like detokenization. Dataset- and language-specific fixes are implemented in `pipeline/clean/fixes`. Naming convention (an illustrative example follows this list):

- `<dataset_name>.sh` for parallel dataset cleaning
- `<dataset_name>.<lang>.sh` for language-specific cleaning of a parallel or monolingual dataset
- `/` in the dataset name should be replaced with `_`
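For example, a hypothetical fix script; it assumes, purely for illustration, that fix scripts act as stdin-to-stdout filters, so check the existing scripts in `pipeline/clean/fixes` for the actual interface:

```bash
#!/bin/bash
# Hypothetical fix for a dataset, e.g. saved as opus_ExampleDataset_v1.sh.
# Assumes (for illustration only) that fixes read sentences on stdin and write
# fixed sentences to stdout; see existing scripts in pipeline/clean/fixes.
set -euo pipefail

# Undo a simple tokenization artifact: spaces around apostrophes ("don ' t" -> "don't").
sed "s/ ' /'/g"
```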
Some parallel datasets require more aggressive filtering. Dataset-specific Bicleaner thresholds can be set in the config. A threshold of `0` means skipping filtering entirely (useful for Paracrawl).

Example:

```
  experiment:
    ...
    bicleaner:
      default-threshold: 0.5
      dataset-thresholds:
        opus_ParaCrawl/v8: 0
        mtdata_neulab_tedtalksv1_train: 0.6
```
To see the training graphs, run tensorboard:

```
make install-tensorboard
make tensorboard
```

Then forward port 6006 (see the example below).
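If training runs on a remote machine, one common way to do that is SSH local port forwarding; the host below is a placeholder:

```bash
# Forward the remote tensorboard port 6006 to localhost:6006.
# "user@remote-host" is a placeholder for the training machine.
ssh -N -L 6006:localhost:6006 user@remote-host
```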
```
├ data
│ └ ru-en
│ └ test
│ ├ original
│ │ ├ corpus
│ │ │ ├ mtdata_JW300.en.gz
│ │ │ └ mtdata_JW300.ru.gz
│ │ ├ devset
│ │ │ ├ flores_dev.en.gz
│ │ │ └ flores_dev.ru.gz
│ │ ├ eval
│ │ │ ├ sacrebleu_wmt20.en.gz
│ │ │ └ sacrebleu_wmt20.ru.gz
│ │ ├ mono
│ │ │ ├ news-crawl_news.2020.ru.gz
│ │ │ └ news-crawl_news.2020.en.gz
│ │ ├ devset.ru.gz
│ │ └ devset.en.gz
│ ├ clean
│ │ ├ corpus
│ │ │ ├ mtdata_JW300.en.gz
│ │ │ └ mtdata_JW300.ru.gz
│ │ ├ mono
│ │ │ ├ news-crawl_news.2020.ru.gz
│ │ │ └ news-crawl_news.2020.en.gz
│ │ ├ mono.ru.gz
│ │ └ mono.en.gz
│ ├ biclean
│ │ ├ corpus
│ │ │ ├ mtdata_JW300.en.gz
│ │ │ └ mtdata_JW300.ru.gz
│ │ ├ corpus.ru.gz
│ │ └ corpus.en.gz
│ ├ translated
│ │ ├ mono.ru.gz
│ │ └ mono.en.gz
│ ├ augmented
│ │ ├ corpus.ru.gz
│ │ └ corpus.en.gz
│ ├ alignment
│ │ ├ corpus.aln.gz
│ │ └ lex.s2t.pruned.gz
│ ├ merged
│ │ ├ corpus.ru.gz
│ │ └ corpus.en.gz
│ └ filtered
│ ├ corpus.ru.gz
│ └ corpus.en.gz
├ models
│ └ ru-en
│ └ test
│ ├ backward
│ ├ teacher-base0
│ ├ teacher-base1
│ ├ teacher-finetuned0
│ ├ teacher-finetuned1
│ ├ student
│ ├ student-finetuned
│ ├ speed
│ ├ evaluation
│ │ ├ backward
│ │ ├ teacher-base0
│ │ ├ teacher-base1
│ │ ├ teacher-finetuned0
│ │ ├ teacher-finetuned1
│ │ ├ teacher-ensemble
│ │ ├ student
│ │ ├ student-finetuned
│ │ └ speed
│ └ exported
│
├ experiments
│ └ ru-en
│ └ test
│ └ config.sh
├ logs
│ └ ru-en
│ └ test
│ └ clean_corpus.log
```
All steps are independent and contain scripts that accept arguments, read input files from disk and output the results to disk. This allows writing the steps in any language (historically it is currently mostly bash and Python) and representing the pipeline as a directed acyclic graph (DAG).
The Snakemake workflow manager infers the DAG implicitly from the specified inputs and outputs of the steps. The workflow manager checks which files are missing and runs the corresponding jobs, either locally or on a cluster depending on the configuration.
Snakemake parallelizes steps that can be executed simultaneously. This is especially useful for teacher ensemble training and translation.
The main Snakemake process (scheduler) should be launched interactively. It runs job processes on the worker nodes in cluster mode or on a local machine in local mode.
- Scripts inside the `pipeline` directory are independent and operate only using input arguments, input files and global envs.
- All scripts test expected environment variables early.
- If a script step fails, it can be safely retried.
- Ideally, every script should start from the last unfinished step, checking the presence of intermediate results of previous steps.
- A script fails as early as possible.
- Maximum bash verbosity is set for easy debugging.
- Input data is always read only.
- Output data is placed in a new folder for script results.
- It is expected that the specified output folder might not exist and should be created by the script.
- A script creates a folder for intermediate files and cleans it in the end unless the intermediate files are useful for retries.
- Global variables are upper case, local variables are lower case.
- Scripts should utilize resources provided by Snakemake (number of threads, memory).
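A minimal sketch of a script skeleton that follows these conventions; the argument and variable names are illustrative and not taken from an actual pipeline step:

```bash
#!/bin/bash
# Illustrative skeleton of a pipeline step following the conventions above.
# Argument and variable names are examples, not an actual pipeline script.
set -x             # maximum bash verbosity for easy debugging
set -euo pipefail  # fail as early as possible

# Test expected environment variables early.
test -v SRC
test -v TRG

# Inputs are read only; results go to a dedicated output folder.
corpus_src=$1
corpus_trg=$2
output_dir=$3
threads=$4         # resources provided by Snakemake

# The output folder might not exist yet and should be created by the script.
mkdir -p "${output_dir}"
tmp_dir="${output_dir}/tmp"
mkdir -p "${tmp_dir}"

# ... do the actual work here, writing intermediate files to ${tmp_dir} ...

# Clean intermediate files in the end unless they are useful for retries.
rm -rf "${tmp_dir}"
```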
- V. M. Sánchez-Cartagena, M. Bañón, S. Ortiz-Rojas and G. Ramírez-Sánchez, "Prompsit's submission to WMT 2018 Parallel Corpus Filtering shared task", in Proceedings of the Third Conference on Machine Translation, Volume 2: Shared Task Papers. Brussels, Belgium: Association for Computational Linguistics, October 2018.
- Gema Ramírez-Sánchez, Jaume Zaragoza-Bernabeu, Marta Bañón and Sergio Ortiz Rojas, "Bifixer and Bicleaner: two open-source tools to clean your parallel data", in Proceedings of the 22nd Annual Conference of the European Association for Machine Translation. Lisboa, Portugal: European Association for Machine Translation, November 2020.
- Mölder, F., Jablonski, K.P., Letcher, B., Hall, M.B., Tomkins-Tinch, C.H., Sochat, V., Forster, J., Lee, S., Twardziok, S.O., Kanitz, A., Wilm, A., Holtgrewe, M., Rahmann, S., Nahnsen, S., Köster, J., 2021. Sustainable data analysis with Snakemake. F1000Res 10, 33.