Add alignment prediction with SimAlign #77

Open
daandouwe opened this issue Oct 25, 2020 · 0 comments
Labels
enhancement New feature or request

Comments

@daandouwe
Collaborator

daandouwe commented Oct 25, 2020

Why?

SimAlign is an amazingly simple and effective way of obtaining word alignments from multilingual Transformer encoders. OpenKiwi is built on top of multilingual Transformers. Hence OpenKiwi can produce alignments.

The training objective of OpenKiwi might even improve the alignments.

The alignments could be used in ingenious ways in the quality predictions. For example:

  • The predicted BAD target words can be aligned with source tokens to highlight which source word might have caused the mistranslation (similar to the definition of 'source tags' in the WMT QE shared task)
  • The alignments themselves can be used to detect accuracy errors: if an alignment is missing between a content word in the source and the target, this might indicate an omission or a mistranslation.

To be investigated.
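
As a starting point for that investigation, here is a minimal sketch of both ideas. It assumes alignments are given as (source_index, target_index) pairs and that word tags are the strings "OK" and "BAD". The function names are hypothetical and not part of OpenKiwi or SimAlign, and the projection rule is a simple guess that may differ from the exact WMT definition of source tags.

```python
from typing import List, Set, Tuple

# Assumption: an alignment is a list of (source_index, target_index) pairs.
Alignment = List[Tuple[int, int]]


def project_bad_tags(
    target_tags: List[str], alignment: Alignment, source_len: int
) -> List[str]:
    """Tag a source token BAD if it is aligned to at least one BAD target token."""
    source_tags = ["OK"] * source_len
    for src, tgt in alignment:
        if target_tags[tgt] == "BAD":
            source_tags[src] = "BAD"
    return source_tags


def unaligned_source_tokens(alignment: Alignment, source_len: int) -> List[int]:
    """Return indices of source tokens that take part in no alignment link."""
    aligned: Set[int] = {src for src, _ in alignment}
    return [i for i in range(source_len) if i not in aligned]
```

Filtering the unaligned indices down to content words (for example with a stop-word list or a POS tagger) is left out here; how best to do that is part of what needs investigating.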

How?

Two options:

Pip install

We add SimAlign to the dependencies, and import from it. Challenge: we use the encoders in slightly different ways:

  1. OpenKiwi forwards source and target simultaneously; SimAlign forwards source and target separately, as two independent sentences: https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py#L211
  2. OpenKiwi has the encoder integrated into the model rather than saved to a path, which is what SimAlign expects (https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py#L51), and we don't want to have to save the encoder to file separately.

Integrate code

Integrate the SimAlign code into OpenKiwi and adapt as needed. All the decoding algorithms are left unchanged; only the model setup and forward pass need to be changed. The only file that is needed is: https://github.com/cisnlp/simalign/blob/master/simalign/simalign.py
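
To illustrate why the decoding can stay independent of the forward pass, here is a rough sketch in the spirit of SimAlign's argmax method (mutual best match on a similarity matrix). It is not a copy of SimAlign's code. It assumes the per-token source and target vectors have already been extracted from the encoder output, however the forward pass was done; the function names are hypothetical.

```python
from typing import List, Tuple

import numpy as np


def cosine_similarity_matrix(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """Cosine similarity between every source and every target token vector.

    src_vecs has shape (source_len, dim), tgt_vecs has shape (target_len, dim).
    """
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return src @ tgt.T


def argmax_alignment(sim: np.ndarray) -> List[Tuple[int, int]]:
    """Keep (i, j) only if j is the best target for i and i is the best source for j."""
    best_tgt = sim.argmax(axis=1)  # best target index for each source token
    best_src = sim.argmax(axis=0)  # best source index for each target token
    return [(i, int(j)) for i, j in enumerate(best_tgt) if best_src[j] == i]
```

With this split, only the code that produces `src_vecs` and `tgt_vecs` would need to know whether source and target were forwarded jointly or separately.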

Important notes:

  1. We need to verify that the licence allows this (GNU GENERAL PUBLIC LICENSE)
  2. All development and changes to SimAlign need to be ported manually (instead of automatically through new version releases)
  3. We should properly reference SimAlign where we use their code - acknowledgements are important!
  4. OpenKiwi code becomes more complicated

Open questions

  1. What should the output format be? I think List[Tuple[int, int]] for passing alignments around in code, and 'Pharaoh format' (i-j k-l etc.) for saving to file.
  2. How do we add alignments dynamically to the predicted output? Just another field in the output object?
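
For the first question, converting between the two proposed formats is small enough to sketch here. This assumes each pair is (source_index, target_index) and that one sentence pair maps to one line of text; the function names are hypothetical.

```python
from typing import List, Tuple


def to_pharaoh(alignment: List[Tuple[int, int]]) -> str:
    """Serialise alignment pairs as a single 'i-j k-l' line."""
    return " ".join(f"{i}-{j}" for i, j in alignment)


def from_pharaoh(line: str) -> List[Tuple[int, int]]:
    """Parse an 'i-j k-l' line back into a list of index pairs."""
    pairs = []
    for item in line.split():
        i, j = item.split("-")
        pairs.append((int(i), int(j)))
    return pairs
```

A sentence pair with no alignment links serialises to an empty line, which keeps the output file line-aligned with the input.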
@daandouwe daandouwe added the enhancement New feature or request label Oct 25, 2020