We currently do not support student model training with guided alignment, which is needed for solving #2. We should implement passing the guided alignments to the trainer, and also deal with the complicated case of augmenting the training data with tags while keeping the guided alignment valid.
1. Marian eats untokenized sentence pairs, plus alignment info with (at least?) one alignment pair per SentencePiece (spm) token. Marian also does the spm tokenization itself, but assumes it ends up with the same tokens as the ones you used for alignment.
2. Changing the surface form of a sentence changes the spm tokens it is encoded as. I.e. after uppercasing, or after inserting something in the middle of the sentence, you need to re-tokenize the sentence with spm.
3. Having word boundaries for languages that don't use spaces (like Chinese) is very helpful for things like the tag modifier. It keeps the modifier from inserting tags in the middle of words.
4. I think we can assume spm tokens never cross word boundaries? Can we assume this? I.e. Hello_world will not be encoded with a lo_w token. Otherwise you run into issues similar to those mentioned here for HTML translation.
5. Assume that alignments between spm tokens are a direct proxy for alignments between words: there is no semantic meaning in the second token of word A being aligned with the first token of word B. It just means A and B are aligned.
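The word-boundary assumption above can be checked on tokenized output (or on the spm vocabulary) rather than taken on faith. Below is a minimal sketch, assuming the default SentencePiece convention of marking a word start with a leading "▁" (U+2581); the function names are hypothetical and not part of any existing tool. A piece that contains the marker anywhere other than its first character would span two words.

```python
SPM_SPACE = "\u2581"  # "▁": assumed word-start marker (SentencePiece default)


def crosses_word_boundary(piece):
    """Return True if the marker occurs after the first character of the piece."""
    return SPM_SPACE in piece[1:]


def find_crossing_pieces(pieces):
    """Return all pieces that would violate the word-boundary assumption."""
    return [piece for piece in pieces if crosses_word_boundary(piece)]


if __name__ == "__main__":
    # Hypothetical tokenization containing the problematic "lo▁w" piece.
    print(find_crossing_pieces(["\u2581Hel", "lo\u2581w", "orld"]))
```

Running this over the full vocabulary of the model would show whether the assumption holds for that particular model.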
If 4 is true, my approach is to:
1. group each span of tokens into words
2. generalise the spm token alignment into an alignment between these words
3. de-spm-tokenize each word
4. apply modifiers at this level
5. re-spm-tokenize each word
6. re-create the token alignments based on the word alignments from 2.
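The steps above could be sketched roughly as follows. This is an illustration under several assumptions, not the actual implementation: word starts are marked with a leading "▁", modifiers map one word to one word (tag insertion, which changes the word count, is not covered), and each word pair is expanded into all token pairs between the two words. `toy_encode` is a hypothetical stand-in for a real spm model, and all function names are made up for this example.

```python
SPM_SPACE = "\u2581"  # "▁": assumed word-start marker


def group_into_words(pieces):
    """Step 1: map every piece index to the index of the word it belongs to."""
    word_of, word = [], -1
    for piece in pieces:
        if piece.startswith(SPM_SPACE) or word < 0:
            word += 1
        word_of.append(word)
    return word_of


def to_word_alignment(token_pairs, src_word_of, trg_word_of):
    """Step 2: lift token-level pairs to a set of word-level pairs."""
    return sorted({(src_word_of[s], trg_word_of[t]) for s, t in token_pairs})


def detokenize_words(pieces, word_of):
    """Step 3: join the pieces of each word and drop the marker."""
    words = [""] * (word_of[-1] + 1)
    for piece, word in zip(pieces, word_of):
        words[word] += piece.replace(SPM_SPACE, "")
    return words


def toy_encode(word, size=3):
    """Hypothetical stand-in for spm: split a word into chunks of `size` chars."""
    chunks = [word[i:i + size] for i in range(0, len(word), size)]
    chunks[0] = SPM_SPACE + chunks[0]
    return chunks


def retokenize(words, encode):
    """Step 5: encode word by word, remembering which token indices each word got."""
    pieces, spans = [], []
    for word in words:
        new = encode(word)
        spans.append(range(len(pieces), len(pieces) + len(new)))
        pieces.extend(new)
    return pieces, spans


def realign(src_pieces, trg_pieces, token_pairs, modify, encode=toy_encode):
    src_word_of = group_into_words(src_pieces)
    trg_word_of = group_into_words(trg_pieces)
    word_pairs = to_word_alignment(token_pairs, src_word_of, trg_word_of)
    # Step 4: apply the (one-word-in, one-word-out) modifier to the source side.
    src_words = [modify(w) for w in detokenize_words(src_pieces, src_word_of)]
    trg_words = detokenize_words(trg_pieces, trg_word_of)
    new_src, src_spans = retokenize(src_words, encode)
    new_trg, trg_spans = retokenize(trg_words, encode)
    # Step 6: expand each word pair into all token pairs between the two words.
    new_pairs = [(s, t) for sw, tw in word_pairs
                 for s in src_spans[sw] for t in trg_spans[tw]]
    return new_src, new_trg, new_pairs


if __name__ == "__main__":
    src = ["\u2581Hel", "lo", "\u2581world"]
    trg = ["\u2581Hallo", "\u2581We", "lt"]
    print(realign(src, trg, [(0, 0), (1, 0), (2, 1), (2, 2)], str.upper))
```

Expanding to the full cross product guarantees every token of an aligned word gets at least one alignment pair, at the cost of denser alignments than the aligner originally produced; whether that is acceptable for guided alignment is an open question here.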