Alignments are not updated for the PrefixModifier #57

Open
gregtatum opened this issue Mar 1, 2024 · 4 comments

Comments

@gregtatum
Contributor

The alignments are just passed through. For them to remain valid for use with guided alignment, the modifier would also need to use a custom tokenizer.

From the class docs:

Note that if the target is ZH, JA or basically any other language that is not space segmented,
this wouldn't work as we would be prefixing the whole sentence (or very large part of it).
This is not supported for now. A proper fix would be to use custom detokenizer like in the placeholders.

Doing this work with proper tokenization would also support ZH and JA.
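
To illustrate the bookkeeping involved, here is a minimal sketch (the function name and signature are mine, not OpusTrainer's), assuming whitespace tokenization on both sides and that the modifier prepends N target tokens to the source:

```python
def shift_alignments(pairs, num_prefix_tokens, prefix_target_indices=None):
    """Shift source indices of word-alignment pairs after prepending tokens to the source.

    pairs: list of (src_idx, trg_idx) tuples, e.g. parsed from "0-0 1-2 ...".
    num_prefix_tokens: how many tokens were prepended to the source.
    prefix_target_indices: optional target indices the prepended tokens were
        copied from, so the prefix tokens themselves can also be aligned.
    """
    shifted = [(s + num_prefix_tokens, t) for s, t in pairs]
    if prefix_target_indices is not None:
        # The i-th prepended source token aligns to the target token it was copied from.
        shifted = [(i, t) for i, t in enumerate(prefix_target_indices)] + shifted
    return shifted


# Example: the source gains a 2-token prefix copied from target tokens 3 and 4.
pairs = [(0, 0), (1, 2), (2, 1)]
print(shift_alignments(pairs, 2, prefix_target_indices=[3, 4]))
# [(0, 3), (1, 4), (2, 0), (3, 2), (4, 1)]
```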

@gregtatum
Contributor Author

gregtatum commented Mar 1, 2024

Based on the conversation in #36 #38, I'd like to rethink how this works.

@jelmervdl
Contributor

I'm assuming you meant #38 :)

I'm still of the opinion that a structure as discussed in #29 makes things easier.

In a way I'm a bit annoyed at Marian for consuming detokenized text but expecting alignment pairs based on tokens. Just let me feed it the tokens and the alignment pairs, or alignment pairs based on byte offsets. Something more robust than the current hack.
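
To illustrate the offset idea, here is a sketch of what such an interface could carry, assuming whitespace tokenization (this is not something Marian accepts today, just what the conversion would look like):

```python
def token_spans(text):
    """Character (start, end) offsets of whitespace-separated tokens in `text`."""
    spans, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        spans.append((start, start + len(token)))
        pos = start + len(token)
    return spans


def alignments_to_offsets(src, trg, pairs):
    """Convert token-index alignment pairs to character-offset span pairs."""
    src_spans, trg_spans = token_spans(src), token_spans(trg)
    return [(src_spans[s], trg_spans[t]) for s, t in pairs]


print(alignments_to_offsets("hello world", "bonjour le monde", [(0, 0), (1, 2)]))
# [((0, 5), (0, 7)), ((6, 11), (11, 16))]
```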

@gregtatum
Contributor Author

I was dealing with an issue in our training repo where alignments were invalid and Marian crashed because it does not check bounds when using guided alignment (mozilla/translations#450). The issue is coming from our use of OpusTrainer. It was failing even when I applied no rules, so something is up.
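
For reference, the invariant guided alignment needs, and that Marian itself doesn't verify, is just that every pair stays within the token counts of both sides. A rough sanity-check sketch (my own helper, not part of OpusTrainer or the training repo), assuming Moses-style "srcIdx-trgIdx" pairs over whitespace tokens:

```python
def validate_alignment(src, trg, alignment):
    """Check that every alignment pair is within the token bounds of src/trg.

    src, trg: whitespace-tokenized sentences (strings).
    alignment: space-separated "srcIdx-trgIdx" pairs, e.g. "0-0 1-2".
    """
    n_src, n_trg = len(src.split()), len(trg.split())
    for pair in alignment.split():
        s, t = (int(i) for i in pair.split("-"))
        if not (0 <= s < n_src and 0 <= t < n_trg):
            return False
    return True
```

Running something like this over the OpusTrainer output would at least show whether the out-of-bounds pairs are produced by a modifier or are already present in the input.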

I'm currently auditing our alignment usage for correctness, and reviewing @eu9ene's PR to add the noise modifiers. I'll probably spend some time looking into this and report back. Feel free to ignore these initial issues while I investigate, but I wasn't aware of the assumption of whitespace-tokenized input.

Coming more from the Unicode and ICU world, I'm a bit curious about the word break iterator, which seems more robust than plain whitespace tokenization, but I'm unsure what it looks like to integrate the behemoth of the ICU library into a Python tool world like this. I'll investigate more and report back.
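
For reference, word segmentation through ICU's break iterator would look roughly like this with the PyICU bindings (a sketch; whether pulling in PyICU as a dependency is acceptable is exactly the open question):

```python
from icu import BreakIterator, Locale


def icu_word_tokens(text, locale="ja"):
    """Segment `text` into word tokens using ICU's word break iterator."""
    bi = BreakIterator.createWordInstance(Locale(locale))
    bi.setText(text)
    start = bi.first()
    tokens = []
    for end in bi:  # iterating yields successive boundary offsets
        token = text[start:end]
        start = end
        if token.strip():  # drop whitespace-only segments
            tokens.append(token)
    return tokens


print(icu_word_tokens("私は猫が好きです"))
# e.g. ['私', 'は', '猫', 'が', '好き', 'です']
```

The segmentation shown is illustrative; exact token boundaries depend on the ICU version and its locale data.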

@jelmervdl
Contributor

We used whitespace because most tokenizers gave us tokens separated by whitespace, so really out of convention, I suppose.

I was trying to avoid adding tokenizers inside OpusTrainer; I'd rather have them as external tools. But the Tags and Placeholder modifiers need to be aware of what tokenisation was or will be applied in order to tokenize the bits they add, and then all kinds of assumptions started creeping in :(
