The following exploratory work was presented at the CS6101 Module Projects Display at the 17th STEP.
Combining Intermediate Layers for Knowledge Distillation in Neural Machine Translation Models for Japanese -> English
This project investigates a recently introduced technique that combines intermediate layers, rather than skipping them, when performing knowledge distillation (KD) of NMT models. The language pair investigated is Japanese -> English, building on the recently published work by Yimeng Wu et al., which covers Portuguese -> English, Turkish -> English, and English -> German. They were able to distill comparable performance into a model with a 50% reduction in parameters. Their paper and results can be found at the following link:
Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers
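The core idea can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not the paper's exact formulation: it contrasts a skip-based layer mapping (as in Patient KD) with a combination-based mapping in which each student layer is matched against a fusion of several teacher layers. The function names are hypothetical, and the plain average used as the fusion is an assumption chosen for brevity.

```python
import torch
import torch.nn.functional as F

def pkd_hidden_loss(teacher_states, student_states, skip=2):
    """Skip-based KD (PKD-style, illustrative): each student layer mimics
    every `skip`-th teacher layer; the remaining teacher layers are ignored."""
    selected = teacher_states[::skip][:len(student_states)]
    return sum(F.mse_loss(s, t) for s, t in zip(student_states, selected))

def comb_hidden_loss(teacher_states, student_states):
    """Combination-based KD (illustrative): each student layer mimics a
    fusion of a contiguous group of teacher layers, so no teacher layer
    is dropped. A plain average is used as the fusion for simplicity."""
    group = len(teacher_states) // len(student_states)
    loss = 0.0
    for i, s in enumerate(student_states):
        chunk = teacher_states[i * group:(i + 1) * group]
        fused = torch.stack(chunk, dim=0).mean(dim=0)
        loss = loss + F.mse_loss(s, fused)
    return loss
```

The actual CKD code explores richer fusion and weighting schemes; the sketch is only meant to show why no teacher layer needs to be skipped under the COMB variants.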
We use JParaCrawl for our investigation, along with the source code released by Wu et al.
The following are the results for English -> Japanese, based on a training corpus of 2.6 million sentences from JParaCrawl.
| Model | BLEU score |
|---|---|
| Teacher | 23.1 |
| Regular KD | 20.3 |
| PKD | 19.3 |
| Regular COMB | 19.7 |
| Overlap COMB | 19.7 |
| Skip COMB | 19.4 |
| Cross COMB | 19.6 |
Based on our experiments, we do not observe any improvement over regular knowledge distillation (RKD) for any of the combination-based distillation variants, as shown in the table above, although all COMB approaches give a minor improvement over Patient KD (PKD), which skips some of the layers. Possible reasons for this observation: we did not perform extensive hyperparameter optimization, which may partly explain the obtained performance, so more experiments are needed before drawing any conclusions. In addition, no human evaluation was carried out, and BLEU alone cannot be relied upon to evaluate the models.
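For reference, the BLEU numbers above are corpus-level scores. A typical way to compute such a score from detokenized outputs is shown below; sacreBLEU is used purely as an illustration and is not necessarily the exact tooling used in our runs, and the sentences are placeholder examples.

```python
import sacrebleu

# Detokenized system outputs and their references (one reference stream here).
hypotheses = ["the cat sat on the mat", "knowledge distillation compresses models"]
references = [["the cat is sitting on the mat", "knowledge distillation compresses a model"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```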
Check README_CKD_Original.md
This repo is an exploration based on CKD_PyTorch, the original implementation of the paper "Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers" by Yimeng Wu, Peyman Passban, Mehdi Rezagholizadeh, and Qun Liu, in Proceedings of EMNLP 2020.