- Project Mentor
  - Dr Uthayasanker Thayasivam
- Contributors
  - Thilakshi Fonseka
  - Rashmini Naranpanawa
  - Ravinga Perera
This research develops an NMT system based on the Transformer architecture for the under-resourced, domain-specific English-to-Sinhala translation task. Translation quality is improved by exploring effective ways of incorporating Part-of-Speech (POS) information and subword techniques.
This project consists of the following components:
- Transformer baseline
- Transformer with subword segmentation
  - Byte Pair Encoding
  - Unigram based subword regularization
- Transformer with Part-of-Speech (POS)
  - Input embedding
  - Positional encoding
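To make the Byte Pair Encoding component more concrete, here is a minimal, illustrative sketch of how BPE merge rules can be learned and applied. It is a toy re-implementation in plain Python, not the code used in this project, which relies on sentencepiece for preprocessing. The function names `learn_bpe` and `apply_bpe`, the `</w>` end-of-word marker, and the example corpus are assumptions made for this illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words (toy version)."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair fused into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                symbols[i:i + 2] = [left + right]
            else:
                i += 1
    return symbols


# Hypothetical toy corpus, for illustration only.
corpus = ["low", "low", "lower", "newest", "newest", "widest"]
merges = learn_bpe(corpus, num_merges=10)
print(apply_bpe("lowest", merges))
```

Frequent character sequences end up as single subword units, while rare words are split into smaller pieces, which is what helps with the limited vocabulary coverage of an under-resourced language pair. Unigram based subword regularization works differently: it samples among several possible segmentations during training rather than replaying a fixed merge list.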
The following are the architecture diagrams for POS integration with the input embedding and with the positional encoding, respectively.
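As a rough illustration of the two integration points, the sketch below shows one way a POS-tag vector could enter the model input: either alongside the token embedding or alongside the positional encoding. This is an assumption-laden toy in plain Python, not the implementation in src/POS-implementation. In particular, combining by element-wise addition, scaling the embedding by the square root of the dimension, the random lookup tables, and the names `make_table`, `sinusoidal_position`, and `encode` are all hypothetical choices made for this example.

```python
import math
import random


def make_table(symbols, dim, seed):
    """Build a random lookup table standing in for a learned embedding matrix."""
    rng = random.Random(seed)
    return {s: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for s in symbols}


def sinusoidal_position(pos, dim):
    """Sinusoidal positional encoding for one position (sin on even, cos on odd indices)."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def encode(tokens, tags, tok_table, tag_table, dim, mode):
    """Return one input vector per token, with POS injected at the chosen point.

    mode="embedding":  the POS vector is added to the token embedding before scaling.
    mode="positional": the POS vector is added to the positional encoding instead.
    Both combinations are assumptions for illustration.
    """
    scale = math.sqrt(dim)
    encoded = []
    for position, (tok, tag) in enumerate(zip(tokens, tags)):
        tok_vec, tag_vec = tok_table[tok], tag_table[tag]
        pe = sinusoidal_position(position, dim)
        if mode == "embedding":
            vec = [scale * (t + g) + p for t, g, p in zip(tok_vec, tag_vec, pe)]
        elif mode == "positional":
            vec = [scale * t + (p + g) for t, g, p in zip(tok_vec, tag_vec, pe)]
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        encoded.append(vec)
    return encoded


# Hypothetical example sentence with one POS tag per token.
DIM = 4
tokens = ["the", "cat", "sleeps"]
tags = ["DET", "NOUN", "VERB"]
tok_table = make_table(tokens, DIM, seed=0)
tag_table = make_table(tags, DIM, seed=1)
via_embedding = encode(tokens, tags, tok_table, tag_table, DIM, mode="embedding")
via_position = encode(tokens, tags, tok_table, tag_table, DIM, mode="positional")
```

In this sketch the two modes differ only in whether the POS vector is scaled together with the token embedding. The actual models may combine the signals differently, so treat this purely as a reading aid for the diagrams.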
The following instructions will guide you through reproducing our results.
We use fairseq for training, sentencepiece for preprocessing, and sacrebleu to produce BLEU scores.
Transformer Baseline
pip install fairseq sacrebleu
Transformer with subword segmentation
pip install fairseq sacrebleu sentencepiece
Transformer with POS
pip install sacrebleu sentencepiece
Since POS is implemented within the fairseq Transformer, navigate to the project directory and install fairseq as follows:
pip install --editable ./
- Navigate to src/Transformer-baseline. Follow the instructions given in the README.md.
- To train the Transformer BPE model, navigate to src/Subword-segmentation/Transformer-BPE. Follow the instructions given in the README.md.
- To train the Transformer subword regularization model, navigate to src/Subword-segmentation/Transformer-subword-regularization. Follow the instructions given in the README.md.
- Navigate to src/POS-implementation. Follow the instructions given in the README.md.
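Once a model has produced translations, BLEU can be computed with sacrebleu. The helper below is a minimal sketch of that step through the sacrebleu Python API. The function name `corpus_bleu_score` is hypothetical, and the sketch uses sacrebleu's default settings with a single reference per sentence; the exact scoring options used for the published results are described in the per-model README files, not here.

```python
def corpus_bleu_score(hypotheses, references):
    """Return the corpus-level BLEU score (0 to 100) given one reference per hypothesis."""
    # Imported lazily so this module can be loaded before sacrebleu is installed.
    import sacrebleu  # third-party: pip install sacrebleu

    if len(hypotheses) != len(references):
        raise ValueError("need exactly one reference per hypothesis")
    # sacrebleu expects a list of reference streams, hence the extra list.
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```

Here `hypotheses` would be the detokenized system outputs and `references` the corresponding gold translations, both as lists of strings in the same order.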
T. Fonseka, R. Naranpanawa, R. Perera and U. Thayasivam, "English to Sinhala Neural Machine Translation," 2020 International Conference on Asian Language Processing (IALP), Kuala Lumpur, Malaysia, 2020, pp. 305-309, doi: 10.1109/IALP51396.2020.9310462.
R. Naranpanawa, R. Perera, T. Fonseka and U. Thayasivam, "Analyzing Subword Techniques to Improve English to Sinhala Neural Machine Translation," International Journal of Asian Language Processing (IJALP), vol. 30, no. 04, p. 2050017, 2020, doi: 10.1142/s2717554520500174.
Apache License 2.0
Please read our code of conduct document.