Allow training models with guided alignment #11

Closed
XapaJIaMnu opened this issue Mar 24, 2023 · 1 comment · Fixed by #26
Comments

@XapaJIaMnu
Contributor

Currently we do not have support for student-model training with guided alignment, as needed for #2. We should implement passing the guided alignments to the trainer, and also handle the more complicated case of augmenting the training data with tags while keeping the guided alignment consistent.
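
For concreteness, a minimal sketch of the kind of input this would involve, assuming the guided alignments travel with each sentence pair as Pharaoh-style `i-j` pairs in a third tab-separated column (the column layout here is my assumption for illustration, not a settled interface):

```python
# A minimal sketch, assuming alignments come as a third tab-separated column
# in the usual Pharaoh-style "src_idx-trg_idx" format. The exact column
# layout is an assumption for illustration, not a settled interface.

def parse_line(line: str):
    """Split a training line into source, target, and alignment pairs."""
    src, trg, alignment = line.rstrip("\n").split("\t")
    pairs = [tuple(map(int, pair.split("-"))) for pair in alignment.split()]
    return src, trg, pairs

src, trg, pairs = parse_line("Hello world !\tHallo Welt !\t0-0 1-1 2-2")
print(pairs)  # [(0, 0), (1, 1), (2, 2)]
```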

@XapaJIaMnu XapaJIaMnu added the enhancement New feature or request label Mar 24, 2023
@XapaJIaMnu XapaJIaMnu self-assigned this Mar 24, 2023
@jelmervdl
Contributor

  1. Marian eats untokenized sentence pairs, plus alignment info with (at least?) one alignment pair per sentence-piece token. Marian does the spm tokenization itself as well (but assumes it ends up with the same tokens you used for the alignment).
  2. Changing the surface form of a sentence changes the spm tokens it is encoded into. E.g. uppercasing, or adding stuff in the middle of the sentence, means you need to re-tokenize the sentence with spm.
  3. Having word boundaries for some languages (that don't use spaces, like Chinese) is very helpful for things like the tag modifier. It helps it not insert tags in the middle of words.
  4. I think we can assume spm tokens never cross word boundaries? Can we assume this? I.e. Hello_world will not be encoded with a lo_w token. Otherwise you run into issues similar to the ones mentioned here for HTML translation. (A quick check for this is sketched right after this list.)
  5. Assume that alignments between spm tokens are a direct proxy for alignments between words, and that there is no semantic meaning in, say, the second token of word A being aligned with the first token of word B; it just means A and B are aligned.
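
Regarding 4, this can be checked empirically. A minimal sketch using the sentencepiece Python bindings, with `model.spm` as a hypothetical path to the student vocabulary model:

```python
# Quick empirical check for point 4: if spm pieces never cross word
# boundaries, regrouping the pieces at every "▁" marker reproduces the
# whitespace-delimited words. "model.spm" is a placeholder path.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="model.spm")

def words_from_pieces(pieces):
    """Group spm pieces back into words; a new word starts at every "▁"."""
    words = []
    for piece in pieces:
        if piece.startswith("▁") or not words:
            words.append(piece.lstrip("▁"))
        else:
            words[-1] += piece
    return words

text = "Hello world"
pieces = sp.encode(text, out_type=str)
assert words_from_pieces(pieces) == text.split(), (pieces, text.split())
```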

If 4 is true, my approach is to

  1. group each span of spm tokens into words
  2. generalise the spm token alignment into an alignment between these words
  3. de-spm-tokenize each word
  4. apply modifiers at this level
  5. re-spm-tokenize each word
  6. re-create the token alignments based on the word alignments from 2
  7. feed hungry hungry Marian.
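
Roughly, steps 1-6 could look like the sketch below (not a final implementation; `sp` is a loaded SentencePieceProcessor, `modify` stands in for whatever word-level modifier runs in step 4, and `retokenize_with_alignment` is a name I just made up):

```python
# Rough sketch of steps 1-6 under the assumptions above. `sp` is a loaded
# SentencePieceProcessor and `modify` is a placeholder for a word-level
# modifier (tags, casing, ...); both names are hypothetical.

def retokenize_with_alignment(src_pieces, trg_pieces, piece_alignment, sp, modify):
    # 1. group each span of spm pieces into words (a new word starts at "▁")
    def to_words(pieces):
        spans, words = [], []
        for i, piece in enumerate(pieces):
            if piece.startswith("▁") or not spans:
                spans.append([i])
                words.append(piece.lstrip("▁"))
            else:
                spans[-1].append(i)
                words[-1] += piece
        return spans, words

    src_spans, src_words = to_words(src_pieces)
    trg_spans, trg_words = to_words(trg_pieces)

    # 2. generalise the piece alignment into a word alignment
    src_piece2word = {i: w for w, span in enumerate(src_spans) for i in span}
    trg_piece2word = {i: w for w, span in enumerate(trg_spans) for i in span}
    word_alignment = sorted({(src_piece2word[s], trg_piece2word[t])
                             for s, t in piece_alignment})

    # 3 + 4. the modifier operates on de-spm-tokenized words (and may update
    #        the word alignment if it adds or removes words)
    src_words, trg_words, word_alignment = modify(src_words, trg_words, word_alignment)

    # 5. re-spm-tokenize each word on its own
    src_word_pieces = [sp.encode(w, out_type=str) for w in src_words]
    trg_word_pieces = [sp.encode(w, out_type=str) for w in trg_words]

    # 6. re-create piece alignments from the word alignment: every piece of
    #    source word A is aligned with every piece of target word B
    def piece_ranges(word_pieces):
        ranges, n = [], 0
        for pieces in word_pieces:
            ranges.append(range(n, n + len(pieces)))
            n += len(pieces)
        return ranges

    src_ranges = piece_ranges(src_word_pieces)
    trg_ranges = piece_ranges(trg_word_pieces)
    new_alignment = [(s, t) for a, b in word_alignment
                     for s in src_ranges[a] for t in trg_ranges[b]]

    new_src = [p for pieces in src_word_pieces for p in pieces]
    new_trg = [p for pieces in trg_word_pieces for p in pieces]
    return new_src, new_trg, new_alignment
```

Step 7 is then just emitting `new_src`, `new_trg` and `new_alignment` in whatever format the trainer expects.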

@jelmervdl jelmervdl linked a pull request Jul 27, 2023 that will close this issue