Alignments are not updated for the PrefixModifier #57

Open
gregtatum opened this issue Mar 1, 2024 · 4 comments

Comments

@gregtatum
Contributor

The alignments are just passed through. For them to remain valid for use with guided alignment, the modifier would also need to use a custom tokenizer.

From the class docs:

Note that if the target is ZH, JA or basically any other language that is not space segmented,
this wouldn't work as we would be prefixing the whole sentence (or very large part of it).
This is not supported for now. A proper fix would be to use custom detokenizer like in the placeholders.

Doing this work with proper tokenization would also support ZH and JA.
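
To illustrate the bookkeeping involved, here is a minimal sketch (the function name and signature are mine, not OpusTrainer's), assuming whitespace tokenization on both sides and that the modifier prepends N target tokens to the source:

```python
def shift_alignments(pairs, num_prefix_tokens, prefix_target_indices=None):
    """Shift source indices of word-alignment pairs after prepending tokens to the source.

    pairs: list of (src_idx, trg_idx) tuples, e.g. parsed from "0-0 1-2 ...".
    num_prefix_tokens: how many tokens were prepended to the source.
    prefix_target_indices: optional target indices the prepended tokens were
        copied from, so the prefix tokens themselves can also be aligned.
    """
    shifted = [(s + num_prefix_tokens, t) for s, t in pairs]
    if prefix_target_indices is not None:
        # The i-th prepended source token aligns to the target token it was copied from.
        shifted = [(i, t) for i, t in enumerate(prefix_target_indices)] + shifted
    return shifted


# Example: the source gains a 2-token prefix copied from target tokens 3 and 4.
pairs = [(0, 0), (1, 2), (2, 1)]
print(shift_alignments(pairs, 2, prefix_target_indices=[3, 4]))
# [(0, 3), (1, 4), (2, 0), (3, 2), (4, 1)]
```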

@gregtatum
Contributor Author

gregtatum commented Mar 1, 2024

Based on the conversation in #36 #38, I'd like to rethink how this works.

@jelmervdl
Contributor

I'm assuming you meant #38 :)

I'm still of the opinion that a structure as discussed in #29 makes things easier.

In a way I'm a bit annoyed at Marian for consuming detokenized text but expecting alignment pairs based on tokens. Just let me feed it the tokens and the alignment pairs, or alignment pairs based on byte offsets. Something more robust than the current hack.
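
To illustrate the offset idea, here is a sketch of what such an interface could carry, assuming whitespace tokenization (this is not something Marian accepts today, just what the conversion would look like):

```python
def token_spans(text):
    """Character (start, end) offsets of whitespace-separated tokens in `text`."""
    spans, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        spans.append((start, start + len(token)))
        pos = start + len(token)
    return spans


def alignments_to_offsets(src, trg, pairs):
    """Convert token-index alignment pairs to character-offset span pairs."""
    src_spans, trg_spans = token_spans(src), token_spans(trg)
    return [(src_spans[s], trg_spans[t]) for s, t in pairs]


print(alignments_to_offsets("hello world", "bonjour le monde", [(0, 0), (1, 2)]))
# [((0, 5), (0, 7)), ((6, 11), (11, 16))]
```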

@gregtatum
Contributor Author

I was dealing with an issue in our training repo where alignments were invalid and Marian crashed because it does not check bounds when using guided alignment (mozilla/translations#450). The issue is coming from our use of OpusTrainer. It was failing even when I applied no rules, so something is up.
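
For reference, the invariant guided alignment needs, and that Marian itself doesn't verify, is just that every pair stays within the token counts of both sides. A rough sanity-check sketch (my own helper, not part of OpusTrainer or the training repo), assuming Moses-style "srcIdx-trgIdx" pairs over whitespace tokens:

```python
def validate_alignment(src, trg, alignment):
    """Check that every alignment pair is within the token bounds of src/trg.

    src, trg: whitespace-tokenized sentences (strings).
    alignment: space-separated "srcIdx-trgIdx" pairs, e.g. "0-0 1-2".
    """
    n_src, n_trg = len(src.split()), len(trg.split())
    for pair in alignment.split():
        s, t = (int(i) for i in pair.split("-"))
        if not (0 <= s < n_src and 0 <= t < n_trg):
            return False
    return True
```

Running something like this over the OpusTrainer output would at least show whether the out-of-bounds pairs are produced by a modifier or are already present in the input.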

I'm currently auditing our alignment usage for correctness, and reviewing @eu9ene's PR to add the noise modifiers. I'll probably spend some time looking into this and report back. Feel free to ignore these initial issues while I investigate, but I wasn't aware of the assumption of whitespace-tokenized input.

Coming more from the Unicode and ICU world, I'm a bit curious about the word break iterator, which seems more robust than plain whitespace tokenization, but I'm unsure what it looks like to integrate the behemoth of the ICU library into a Python tool world like this. I'll investigate more and report back.
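
For reference, word segmentation through ICU's break iterator would look roughly like this with the PyICU bindings (a sketch; whether pulling in PyICU as a dependency is acceptable is exactly the open question):

```python
from icu import BreakIterator, Locale


def icu_word_tokens(text, locale="ja"):
    """Segment `text` into word tokens using ICU's word break iterator."""
    bi = BreakIterator.createWordInstance(Locale(locale))
    bi.setText(text)
    start = bi.first()
    tokens = []
    for end in bi:  # iterating yields successive boundary offsets
        token = text[start:end]
        start = end
        if token.strip():  # drop whitespace-only segments
            tokens.append(token)
    return tokens


print(icu_word_tokens("私は猫が好きです"))
# e.g. ['私', 'は', '猫', 'が', '好き', 'です']
```

The segmentation shown is illustrative; exact token boundaries depend on the ICU version and its locale data.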

@jelmervdl
Contributor

We used whitespace because most tokenizers gave us tokens separated by whitespace, so really out of convention, I suppose.

I was trying to avoid adding tokenizers inside OpusTrainer; I'd rather have them as external tools. But the Tags and Placeholder modifiers need to be aware of what tokenisation was or will be applied in order to tokenize the bits they add, and then all kinds of assumptions started creeping in :(
