Alignments are not updated for the PrefixModifier #57
Comments
I'm assuming you meant #38 :) I'm still of the opinion that a structure as discussed in #29 makes things easier. In a way I'm a bit annoyed at Marian that it consumes detokenized text but expects alignment pairs based on tokens. Just let me feed you the tokens and the alignment pairs then, or alignment pairs based on byte offsets. Something more robust than the current hack.
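(As a toy illustration of the two representations being contrasted here, not OpusTrainer or Marian code: token-index pairs over whitespace tokens versus offsets into the detokenized text. The sentences and offsets are made up.)

```python
src = "the grey cat"
trg = "die graue Katze"

# Token-index pairs over whitespace tokens, the "0-0 1-1 2-2" style that
# word aligners typically emit: src token i aligns to trg token j.
token_pairs = [(0, 0), (1, 1), (2, 2)]
src_tokens, trg_tokens = src.split(), trg.split()
for s, t in token_pairs:
    print(src_tokens[s], "->", trg_tokens[t])

# Character-offset pairs over the detokenized strings: these point at the raw
# text itself, so they stay valid regardless of how the text is retokenized.
offset_pairs = [((0, 3), (0, 3)), ((4, 8), (4, 9)), ((9, 12), (10, 15))]
for (ss, se), (ts, te) in offset_pairs:
    print(src[ss:se], "->", trg[ts:te])
```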
I was dealing with an issue in our training repo where alignments were invalid and Marian crashed because it does not check bounds when using guided alignment: mozilla/translations#450. The issue is coming from our use of OpusTrainer. It was even failing when I applied no rules, so something is up. I'm currently auditing our alignment usage for correctness and reviewing @eu9ene's PR to add the noise modifiers. I'll probably spend some time looking into this and report back.

Feel free to ignore these initial issues while I investigate, but I wasn't aware of the assumption of whitespace-tokenized input. Coming more from the Unicode and ICU world, where we had the word break iterator, which seems a bit more robust than plain whitespace tokenization, I'm curious what it would look like to integrate the behemoth of the ICU library into a Python tool like this. I'll investigate more and report back.
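(A minimal sketch of the kind of audit meant above, assuming the convention discussed in this thread that alignment pairs are indices into the whitespace-split source and target. The helper name and the exact report strings are hypothetical; this is not OpusTrainer or Marian code.)

```python
def check_alignment_bounds(src: str, trg: str, alignment: str) -> list[str]:
    """Return a list of problems found in an "i-j i-j ..." alignment column.

    Assumes alignment indices refer to whitespace-split tokens of src and trg.
    """
    src_len = len(src.split())
    trg_len = len(trg.split())
    problems = []
    for pair in alignment.split():
        try:
            s, t = (int(x) for x in pair.split("-"))
        except ValueError:
            problems.append(f"malformed pair {pair!r}")
            continue
        if not 0 <= s < src_len:
            problems.append(f"source index {s} out of range ({src_len} tokens)")
        if not 0 <= t < trg_len:
            problems.append(f"target index {t} out of range ({trg_len} tokens)")
    return problems


# Example: the second pair points past the end of the target.
print(check_alignment_bounds("a b c", "x y", "0-0 2-2"))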
We used whitespace because most tokenizers gave us tokens separated by whitespace, so really out of convention, I suppose. I was trying to avoid adding tokenizers inside OpusTrainer; I'd rather have them as external tools. But the Tags and Placeholder modifiers need to be aware of what tokenisation was or will be applied in order to tokenize the bits they add, and then all kinds of assumptions started creeping in :(
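(To make concrete what "updating alignments" in the issue title would mean: if a prefix is prepended to the source, every source-side index in the alignment pairs has to shift by the number of whitespace tokens the prefix adds. A hedged sketch, not the actual PrefixModifier code; the `__en__` prefix is just an illustrative string.)

```python
def shift_alignment_for_prefix(prefix: str, src: str, alignment: str) -> tuple[str, str]:
    """Prepend `prefix` to `src` and shift source-side alignment indices to match.

    Assumes whitespace tokenization on both sides, as discussed above.
    """
    offset = len(prefix.split())
    shifted = " ".join(
        f"{int(s) + offset}-{t}"
        for s, t in (pair.split("-") for pair in alignment.split())
    )
    return f"{prefix} {src}", shifted


new_src, new_alignment = shift_alignment_for_prefix("__en__", "hello world", "0-0 1-1")
print(new_src)        # __en__ hello world
print(new_alignment)  # 1-0 2-1
```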
The alignments are just passed through. For them to be valid for use with guided alignment, they will also need to use a custom tokenizer.
From the class docs:
Doing this work with proper tokenization would also support ZH and JA.
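(To illustrate the ZH point: whitespace splitting yields a single token for a Chinese sentence, so token-index alignments cannot express anything useful. A word-segmenting tokenizer such as ICU's word break iterator, mentioned above, produces the units one would actually align. The PyICU usage below is a hedged sketch and assumes `pip install PyICU`; the exact segmentation output may differ.)

```python
import icu

text = "我喜欢猫"  # "I like cats"

# Whitespace tokenization collapses the whole sentence into one token.
print(text.split())  # ['我喜欢猫']

# ICU word segmentation: the break iterator yields boundary offsets,
# which we turn back into tokens by slicing the original text.
bi = icu.BreakIterator.createWordInstance(icu.Locale("zh"))
bi.setText(text)
boundaries = [0] + list(bi)
tokens = [text[a:b] for a, b in zip(boundaries, boundaries[1:])]
print(tokens)  # e.g. ['我', '喜欢', '猫']
```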