Add alignment prediction with SimAlign #77

daandouwe · 2020-10-25T11:39:45Z

Why?

SimAlign is an amazingly simple and effective way of obtaining word alignments from multilingual Transformer encoders. OpenKiwi is built on top of multilingual Transformers. Hence OpenKiwi can produce alignments.

The training objective of OpenKiwi might even improve the alignments.

The alignments could be used in ingenious ways in the quality predictions. For example:

The predicted BAD target words can be aligned with source tokens to highlight which source word might have caused the mistranslation (similar to the definition of 'source tags' in the WMT QE shared task).
The alignments themselves can be used to detect accuracy errors: if a content word in the source or target has no aligned counterpart, this might indicate an omission or a mistranslation.
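To make the two ideas concrete, here is a rough sketch of how they could look once alignments are available. The function names and the OK/BAD string tags are assumptions for illustration, not existing OpenKiwi code, and the second function ignores the content-word filter (it reports every unaligned source token).

```python
from typing import List, Tuple

# Hypothetical representation: (source_index, target_index) pairs.
Alignment = List[Tuple[int, int]]


def source_tags_from_alignments(
    alignments: Alignment, target_tags: List[str], num_source_tokens: int
) -> List[str]:
    """Tag a source token BAD if it is aligned to at least one BAD target token."""
    source_tags = ["OK"] * num_source_tokens
    for src_idx, tgt_idx in alignments:
        if target_tags[tgt_idx] == "BAD":
            source_tags[src_idx] = "BAD"
    return source_tags


def unaligned_source_tokens(
    alignments: Alignment, num_source_tokens: int
) -> List[int]:
    """Return indices of source tokens with no alignment (possible omissions)."""
    aligned = {src_idx for src_idx, _ in alignments}
    return [i for i in range(num_source_tokens) if i not in aligned]


# Four source tokens, four target tokens, last target word predicted BAD.
tags = source_tags_from_alignments(
    [(0, 0), (1, 1), (2, 2), (3, 3)], ["OK", "OK", "OK", "BAD"], 4
)
# Source token 2 has no aligned target token.
missing = unaligned_source_tokens([(0, 0), (1, 1), (3, 2)], 4)
```

Here `tags` is `["OK", "OK", "OK", "BAD"]` and `missing` is `[2]`. Restricting the second check to content words would need a POS tagger or stop-word list on top of this.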

To be investigated.

How?

Two options:

Pip install

We add SimAlign to the dependencies, and import from it. Challenge: we use the encoders in slightly different ways:

OpenKiwi forwards source and target simultaneously; SimAlign forwards source and target separately: https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py#L211
OpenKiwi has the encoder integrated into the model rather than saved to a path, which is what SimAlign expects (https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py#L51), and we don't want to have to save it to file separately.

Integrate code

Integrate the SimAlign code into OpenKiwi and adapt as needed. The decoding algorithms are left unchanged; only the model setup and forward pass need to change. The only file needed is: https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py
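For orientation, this is roughly what the decoding side looks like once word-level embeddings are in hand. It is a minimal pure-Python sketch of mutual-argmax decoding over a cosine similarity matrix (one of the decoding ideas SimAlign uses), not SimAlign's actual code. It assumes the source and target word vectors have already been sliced out of OpenKiwi's joint forward pass and pooled from subwords to words; `mutual_argmax_alignments` is a hypothetical name.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_argmax_alignments(
    src_vecs: List[Vector], tgt_vecs: List[Vector]
) -> List[Tuple[int, int]]:
    """Align i-j when j is the best target for i AND i is the best source for j."""
    if not src_vecs or not tgt_vecs:
        return []
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Best target index for every source word (row-wise argmax).
    best_tgt = [max(range(len(tgt_vecs)), key=row.__getitem__) for row in sim]
    # Best source index for every target word (column-wise argmax).
    best_src = [
        max(range(len(src_vecs)), key=lambda i: sim[i][j])
        for j in range(len(tgt_vecs))
    ]
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]


# Toy vectors: source word 0 matches target word 1, source word 1 matches target word 0.
pairs = mutual_argmax_alignments(
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]],
)
```

With these toy vectors `pairs` is `[(0, 1), (1, 0)]`; target word 2 stays unaligned because source word 0 prefers target word 1. In practice the similarity matrix would be computed with tensor operations rather than Python lists.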

Important notes:

We need to verify that SimAlign's licence (GNU General Public License) allows this
All development and changes to SimAlign need to be ported manually (instead of automatically through new version releases)
We should properly reference SimAlign where we use their code - acknowledgements are important!
OpenKiwi code becomes more complicated

Open questions

What should the output format be? I think List[Tuple[int, int]] for passing alignments around in code, and 'Pharaoh format' for saving to file: i-j k-l etc.
How do we add alignments dynamically to the predicted output? Just another field in the output object?
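As a sketch of the first question, converting between the in-memory list of index pairs and a Pharaoh-format line is small. The helper names `to_pharaoh` and `from_pharaoh` are made up for illustration, and sorting the pairs on output is a choice made here, not a requirement of the format.

```python
from typing import List, Tuple


def to_pharaoh(alignments: List[Tuple[int, int]]) -> str:
    """Serialize (source, target) index pairs as 'i-j k-l ...' on one line."""
    return " ".join(f"{i}-{j}" for i, j in sorted(alignments))


def from_pharaoh(line: str) -> List[Tuple[int, int]]:
    """Parse a Pharaoh-format line back into a list of index pairs."""
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


line = to_pharaoh([(1, 2), (0, 0)])
round_trip = from_pharaoh(line)
```

Here `line` is `"0-0 1-2"` and `round_trip` is `[(0, 0), (1, 2)]`; an empty line parses to an empty list.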

daandouwe added the enhancement New feature or request label Oct 25, 2020
