Allow training models with guided alignment #11

Closed
XapaJIaMnu opened this issue Mar 24, 2023 · 1 comment · Fixed by #26
Comments

@XapaJIaMnu
Contributor

Currently we do not have support for student-model training with guided alignment, as needed for #2. We should implement passing the guided alignments to the trainer, and also handle the more complicated case of augmenting the training data with tags while keeping the guided alignment consistent.
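
For concreteness, a minimal sketch of the kind of input this would involve, assuming the guided alignments travel with each sentence pair as Pharaoh-style `i-j` pairs in a third tab-separated column (the column layout here is my assumption for illustration, not a settled interface):

```python
# A minimal sketch, assuming alignments come as a third tab-separated column
# in the usual Pharaoh-style "src_idx-trg_idx" format. The exact column
# layout is an assumption for illustration, not a settled interface.

def parse_line(line: str):
    """Split a training line into source, target, and alignment pairs."""
    src, trg, alignment = line.rstrip("\n").split("\t")
    pairs = [tuple(map(int, pair.split("-"))) for pair in alignment.split()]
    return src, trg, pairs

src, trg, pairs = parse_line("Hello world !\tHallo Welt !\t0-0 1-1 2-2")
print(pairs)  # [(0, 0), (1, 1), (2, 2)]
```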

@XapaJIaMnu XapaJIaMnu added the enhancement New feature or request label Mar 24, 2023
@XapaJIaMnu XapaJIaMnu self-assigned this Mar 24, 2023
@jelmervdl
Contributor

  1. Marian eats untokenized sentence pairs, plus alignment info with (at least?) one alignment pair per sentence-piece token. Marian does the spm tokenization itself as well (but assumes it ends up with the same tokens you used for the alignment).
  2. Changing the surface form of a sentence changes the spm tokens it is encoded into. E.g. uppercasing, or adding stuff in the middle of the sentence, means you need to re-tokenize the sentence with spm.
  3. Having word boundaries for some languages (that don't use spaces, like Chinese) is very helpful for things like the tag modifier. It helps it not insert tags in the middle of words.
  4. I think we can assume spm tokens never cross word boundaries? Can we assume this? I.e. Hello_world will not be encoded with a lo_w token. Otherwise you run into issues similar to the ones mentioned here for HTML translation. (A quick check for this is sketched right after this list.)
  5. Assume that alignments between spm tokens are a direct proxy for alignments between words, and that there is no semantic meaning in, say, the second token of word A being aligned with the first token of word B; it just means A and B are aligned.
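
Regarding 4, this can be checked empirically. A minimal sketch using the sentencepiece Python bindings, with `model.spm` as a hypothetical path to the student vocabulary model:

```python
# Quick empirical check for point 4: if spm pieces never cross word
# boundaries, regrouping the pieces at every "▁" marker reproduces the
# whitespace-delimited words. "model.spm" is a placeholder path.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="model.spm")

def words_from_pieces(pieces):
    """Group spm pieces back into words; a new word starts at every "▁"."""
    words = []
    for piece in pieces:
        if piece.startswith("▁") or not words:
            words.append(piece.lstrip("▁"))
        else:
            words[-1] += piece
    return words

text = "Hello world"
pieces = sp.encode(text, out_type=str)
assert words_from_pieces(pieces) == text.split(), (pieces, text.split())
```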

If 4 is true, my approach is to

  1. group each span of spm tokens into words
  2. generalise the spm token alignment into an alignment between these words
  3. de-spm-tokenize each word
  4. apply modifiers at this level
  5. re-spm-tokenize each word
  6. re-create the token alignments based on the word alignments from 2
  7. feed hungry hungry Marian.
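
Roughly, steps 1-6 could look like the sketch below (not a final implementation; `sp` is a loaded SentencePieceProcessor, `modify` stands in for whatever word-level modifier runs in step 4, and `retokenize_with_alignment` is a name I just made up):

```python
# Rough sketch of steps 1-6 under the assumptions above. `sp` is a loaded
# SentencePieceProcessor and `modify` is a placeholder for a word-level
# modifier (tags, casing, ...); both names are hypothetical.

def retokenize_with_alignment(src_pieces, trg_pieces, piece_alignment, sp, modify):
    # 1. group each span of spm pieces into words (a new word starts at "▁")
    def to_words(pieces):
        spans, words = [], []
        for i, piece in enumerate(pieces):
            if piece.startswith("▁") or not spans:
                spans.append([i])
                words.append(piece.lstrip("▁"))
            else:
                spans[-1].append(i)
                words[-1] += piece
        return spans, words

    src_spans, src_words = to_words(src_pieces)
    trg_spans, trg_words = to_words(trg_pieces)

    # 2. generalise the piece alignment into a word alignment
    src_piece2word = {i: w for w, span in enumerate(src_spans) for i in span}
    trg_piece2word = {i: w for w, span in enumerate(trg_spans) for i in span}
    word_alignment = sorted({(src_piece2word[s], trg_piece2word[t])
                             for s, t in piece_alignment})

    # 3 + 4. the modifier operates on de-spm-tokenized words (and may update
    #        the word alignment if it adds or removes words)
    src_words, trg_words, word_alignment = modify(src_words, trg_words, word_alignment)

    # 5. re-spm-tokenize each word on its own
    src_word_pieces = [sp.encode(w, out_type=str) for w in src_words]
    trg_word_pieces = [sp.encode(w, out_type=str) for w in trg_words]

    # 6. re-create piece alignments from the word alignment: every piece of
    #    source word A is aligned with every piece of target word B
    def piece_ranges(word_pieces):
        ranges, n = [], 0
        for pieces in word_pieces:
            ranges.append(range(n, n + len(pieces)))
            n += len(pieces)
        return ranges

    src_ranges = piece_ranges(src_word_pieces)
    trg_ranges = piece_ranges(trg_word_pieces)
    new_alignment = [(s, t) for a, b in word_alignment
                     for s in src_ranges[a] for t in trg_ranges[b]]

    new_src = [p for pieces in src_word_pieces for p in pieces]
    new_trg = [p for pieces in trg_word_pieces for p in pieces]
    return new_src, new_trg, new_alignment
```

Step 7 is then just emitting `new_src`, `new_trg` and `new_alignment` in whatever format the trainer expects.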

@jelmervdl jelmervdl linked a pull request Jul 27, 2023 that will close this issue