Layer Combination based Knowledge Distillation Experiments for Ja->En NMT

STEPS 17th Project

The following exploratory work was presented at the CS6101 Module Projects Display at the 17th STEPS.

Combining Intermediate Layers for Knowledge Distillation in Neural Machine Translation Models for Japanese -> English

This project investigates a recently introduced technique that combines intermediate layers, rather than skipping them, when performing knowledge distillation of NMT models. The language pair investigated is Japanese->English, building on the recently published work by Yimeng Wu et al. for Portuguese->English, Turkish->English, and English->German, in which they were able to distill comparable performance with a 50% reduction in parameters. Their results and paper can be found at the following link: Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers. We use JParaCrawl for our investigation and the source code from Yimeng's work.
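To make the difference from skip-style distillation concrete, the sketch below shows one way such a combination term could be written in PyTorch: each student layer is matched against a weighted combination of a group of teacher layers instead of a single selected layer. This is a minimal illustration, not the loss implemented in CKD_PyTorch; the grouping, uniform weights, and MSE objective are assumptions made for the example.

```python
import torch.nn.functional as F

def comb_distillation_loss(student_hiddens, teacher_hiddens, groups, weights=None):
    """Minimal sketch of combination-based intermediate-layer distillation.

    student_hiddens: list of student layer outputs, each of shape (batch, seq, dim)
    teacher_hiddens: list of teacher layer outputs, each of shape (batch, seq, dim)
    groups:          for each student layer, indices of the teacher layers to combine
                     (the grouping scheme here is an assumption, not the paper's exact one)
    weights:         optional per-group combination weights; defaults to a uniform average
    """
    loss = 0.0
    for s_idx, t_indices in enumerate(groups):
        if weights is None:
            w = [1.0 / len(t_indices)] * len(t_indices)
        else:
            w = weights[s_idx]
        # Combine the selected teacher layers instead of skipping down to a single one.
        combined = sum(w_i * teacher_hiddens[t_i] for w_i, t_i in zip(w, t_indices))
        loss = loss + F.mse_loss(student_hiddens[s_idx], combined)
    return loss / len(groups)

# Example: a 6-layer teacher distilled into a 3-layer student, where each
# student layer is matched to an average of two consecutive teacher layers.
# groups = [(0, 1), (2, 3), (4, 5)]
```

In training, this intermediate-layer term would normally be added to the usual word-level distillation loss and the translation cross-entropy loss, with the weighting handled as in the original implementation.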

Our Results

The following are the results for Japanese --> English, based on a training corpus of 2.6 million sentences from JParaCrawl.

MODEL          BLEU SCORE
Teacher        23.1
Regular KD     20.3
PKD            19.3
Regular COMB   19.7
Overlap COMB   19.7
Skip COMB      19.4
Cross COMB     19.6
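The variant names above correspond to different teacher-to-student layer mappings. The sketch below contrasts a PKD-style skip mapping with combination-style mappings for a hypothetical 6-layer teacher and 3-layer student; the concrete groupings shown are illustrative assumptions only, and the exact definitions of Regular, Overlap, Skip, and Cross COMB follow the original paper and the CKD_PyTorch code.

```python
# Illustrative layer mappings for a 6-layer teacher and a 3-layer student
# (teacher layers indexed 0-5, student layers 0-2). The groupings are
# assumptions for illustration, not the paper's exact schemes.

# PKD-style skip mapping: each student layer is matched to a single
# teacher layer, so the remaining teacher layers are ignored.
pkd_mapping = {0: [1], 1: [3], 2: [5]}

# COMB-style mapping: each student layer is matched to a combination of
# several teacher layers, so no teacher layer is discarded. The Regular,
# Overlap, Skip, and Cross variants differ in how these groups are formed
# (e.g. disjoint vs. overlapping groups of teacher layers).
regular_comb_mapping = {0: [0, 1], 1: [2, 3], 2: [4, 5]}
overlap_comb_mapping = {0: [0, 1, 2], 1: [2, 3, 4], 2: [3, 4, 5]}
```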

Discussions

Based on our experiments, we do not observe any improvement over regular knowledge distillation (RKD) for any of the combination-based distillation variants, as shown in the table above, although all COMB approaches show a minor improvement over Patient KD (PKD), which skips some of the layers. There are possible reasons for this observation: we did not perform extensive hyperparameter optimization, which could partly explain the obtained performance, so more experiments are needed before drawing any conclusions. We also did not carry out a human evaluation, and BLEU alone cannot be relied on for evaluating the models.
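For reference, corpus-level BLEU scores such as those reported above are commonly computed with a tool like sacrebleu. The snippet below is a generic illustration with made-up sentences, not necessarily the exact evaluation pipeline used in these experiments.

```python
import sacrebleu

# Hypothetical detokenized model outputs and references; in practice these
# would be the student/teacher translations and the held-out English references.
hypotheses = ["the cat sat on the mat .", "he went to school today ."]
references = [["the cat is sitting on the mat .", "he went to school today ."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```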

Requirements

Check README_CKD_Original.md

Acknowledgement

This repo is an exploration based on the original source at CKD_PyTorch, the original implementation of the paper Why Skip If You Can Combine: A Simple Knowledge Distillation Technique for Intermediate Layers by Yimeng Wu, Peyman Passban, Mehdi Rezagholizadeh, and Qun Liu, published in Proceedings of EMNLP 2020.
