Input sequences are encoded into context representations using BERT.
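Below is a minimal sketch of this encoding step, assuming PyTorch and the HuggingFace `transformers` library; the `bert-base-uncased` checkpoint and the helper name `encode_context` are illustrative choices, not specified above.

```python
import torch
from transformers import BertTokenizer, BertModel

# Illustrative checkpoint; any BERT variant follows the same pattern.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def encode_context(text: str) -> torch.Tensor:
    """Encode an input sequence into per-token context representations."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = bert(**inputs)
    # (1, seq_len, hidden_size): one contextual vector per input token.
    return outputs.last_hidden_state

context = encode_context("the quick brown fox jumps over the lazy dog")
```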
- First stage: a Transformer-based decoder generates a draft output sequence from the BERT context representations.
- Second stage: each word of the draft sequence is masked in turn and the masked draft is fed to BERT. Combining the input-sequence representation with the draft representation produced by BERT, a Transformer-based decoder predicts the refined word for each masked position (a sketch of both stages follows this list).
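The sketch below illustrates the two-stage procedure, again assuming PyTorch and HuggingFace `transformers`. The decoder sizes, the greedy decoding loop, and the names `first_stage` / `second_stage` are illustrative assumptions; in the actual model the decoders and projections are trained jointly rather than left randomly initialized as here.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
hidden, vocab = bert.config.hidden_size, bert.config.vocab_size

# Illustrative (untrained) components: token embeddings, two small
# Transformer decoders, and an output projection back to the vocabulary.
embed = nn.Embedding(vocab, hidden)
draft_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True), num_layers=2)
refine_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=hidden, nhead=8, batch_first=True), num_layers=2)
to_vocab = nn.Linear(hidden, vocab)

def first_stage(context: torch.Tensor, max_len: int = 20) -> torch.Tensor:
    """Greedily generate a draft sequence, cross-attending to the BERT context."""
    ids = torch.tensor([[tokenizer.cls_token_id]])
    for _ in range(max_len):
        out = draft_decoder(embed(ids), context)              # (1, cur_len, hidden)
        next_id = to_vocab(out[:, -1]).argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.sep_token_id:
            break
    return ids

def second_stage(context: torch.Tensor, draft_ids: torch.Tensor) -> torch.Tensor:
    """Mask each draft position in turn, encode the masked draft with BERT,
    and predict a refined word for that position with the refine decoder."""
    refined = draft_ids.clone()
    for pos in range(draft_ids.size(1)):
        masked = draft_ids.clone()
        masked[0, pos] = tokenizer.mask_token_id              # hide the current draft word
        with torch.no_grad():
            draft_repr = bert(masked).last_hidden_state       # BERT's view of the masked draft
        # Combine the draft representation with the input context via cross-attention.
        out = refine_decoder(draft_repr, context)
        refined[0, pos] = to_vocab(out[:, pos]).argmax().item()
    return refined

# Usage: encode the source with BERT, draft, then refine word by word.
src = tokenizer("the quick brown fox jumps over the lazy dog", return_tensors="pt")
with torch.no_grad():
    context = bert(**src).last_hidden_state
draft = first_stage(context)
print(tokenizer.decode(second_stage(context, draft)[0]))
```

Masking one draft position at a time lets BERT attend to the full bidirectional context of the remaining draft words when each refined word is predicted, which a left-to-right decoder alone cannot do.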