
Alignment passthrough #26

Merged: 38 commits merged into main from alignment-passthrough, Aug 14, 2023
Conversation

@jelmervdl (Contributor) commented Jul 7, 2023

This should add the missing functionality to train a student with alignment without having to use hacks.

  • Add alignment support to the Placeholder modifier, and
  • implement "retokenization" using a sentencepiece vocab so we
    • use word alignments (as opposed to spm token alignments) in our modifiers
    • but can pass the alignment info to marian in a useful manner

Right now I'm implementing this as an extension of the Placeholder filter for compatibility reasons, but it can also be used as its own stand-alone filter, Retokenize.

The tokenizer parameter of that filter is a bit of a lie, as it always gives you detokenized output. It does, however, use that tokenizer to re-tokenize the detokenized output, so it can compute the correct alignment between the new tokens and their indices.
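To make the retokenization idea concrete, here is a minimal self-contained sketch of remapping word-level alignment pairs onto new (e.g. sentencepiece) tokens by character-span overlap. This is an illustration of the technique, not the actual opustrainer code; all names here are made up, and the "all combinations" expansion for 1->N mappings is discussed further down in this thread.

    from typing import List, Tuple

    Span = Tuple[int, int]  # (begin, end) character offsets, end-exclusive

    def token_spans(text: str, tokens: List[str]) -> List[Span]:
        """Locate each token's character span in `text`, scanning left to right."""
        spans, pos = [], 0
        for token in tokens:
            begin = text.index(token, pos)  # assumes tokens appear in order
            spans.append((begin, begin + len(token)))
            pos = begin + len(token)
        return spans

    def overlap_map(old_spans: List[Span], new_spans: List[Span]) -> List[List[int]]:
        """For each old token, the indices of all new tokens whose spans overlap it."""
        return [
            [i for i, (nb, ne) in enumerate(new_spans) if nb < oe and ob < ne]
            for ob, oe in old_spans
        ]

    def remap_pairs(pairs: List[Tuple[int, int]],
                    src_map: List[List[int]],
                    trg_map: List[List[int]]) -> List[Tuple[int, int]]:
        """Expand each old src-trg pair into all combinations of new tokens."""
        return sorted({
            (new_src, new_trg)
            for src, trg in pairs
            for new_src in src_map[src]
            for new_trg in trg_map[trg]
        })

    # Example: word tokens vs. subword-ish tokens over the same string.
    text = 'a simple'
    m = overlap_map(token_spans(text, ['a', 'simple']),
                    token_spans(text, ['a', 'sim', 'ple']))
    print(remap_pairs([(0, 0), (1, 1)], m, m))
    # [(0, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

In the real modifier the new spans would come from the sentencepiece tokenization of the detokenized sentence.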

XapaJIaMnu and others added 17 commits April 3, 2023 14:08
Also fixes the order. Does change the tests though.
# Conflicts:
#	contrib/test-data/clean.enzh.ref.06.4.none
#	contrib/test-data/clean.enzh.ref.06.4.trg
#	contrib/test-data/clean.zhen.ref.06.4.04.04.src
#	contrib/test-data/clean.zhen.ref.06.4.src
#	contrib/test_enzh_config.yml
#	contrib/test_enzh_noise_config.yml
#	src/opustrainer/modifiers/placeholders.py
#	tests/test_placeholders.py
Not end-to-end tests yet
Should be mostly compatible, except that the end-to-end tests are failing? Did I change the order of random calls?
Comment on lines +20 to +45
    out = tokenizer('\t'.join([
        'This is a simple test statement 🤣 .',
        #^0 ^1 ^2 ^3 ^4 ^5 ^6 ^7
        '这 是 一个 简单 的 测试 语 句 🤣 。',
        #^0 ^1 ^2 ^3 ^4 ^5 ^6 ^7 ^8 ^9
        '0-0 1-1 2-2 3-3 3-4 4-5 5-6 5-7 6-8 7-9',
    ]))
    self.assertEqual(out, '\t'.join([
        'This is a simple test statement 🤣.',
        # [This][ is][ a][ simple][ test][ statement][ ][] [] [] [🤣][.]
        #  ^0    ^1   ^2  ^3       ^4     ^5          ^6^7 ^8 ^9 ^10 ^11
        '这是一个简单的测试语句 🤣 。',
        # [这][是][一][个][简][单][的][测][试][语][句] [ ] [] [] [] [🤣][ 。]
        #  ^0  ^1  ^2  ^3  ^4  ^5  ^6  ^7  ^8  ^9  ^10 ^11 ^12 ^13 ^14 ^15 ^16
        '0-0 1-1 2-2 2-3 3-4 3-5 3-6 4-7 4-8 5-9 5-10 10-15 11-16',
        # 0-0 [This]      [这]   ->  0-0
        # 1-1 [is]        [是]   ->  1-1
        # 2-2 [a]         [一个] ->  2-2 2-3
        # 3-3 [simple]    [简单] ->  3-4 3-5
        # 3-4 [simple]    [的]   ->  3-6
        # 4-5 [test]      [测试] ->  4-7 4-8
        # 5-6 [statement] [语]   ->  5-9
        # 5-7 [statement] [句]   ->  5-10 (6-11)
        # 6-8 [🤣]        [🤣]   ->  (7-12 8-13 9-14) 10-15
        # 7-9 [.]         [。]   ->  11-16
    ]))
@jelmervdl (Contributor, Author):

This is probably the best example of what this code does: input is Moses tokens for both sides of the sentence pair.

  1. Both are detokenized using their own language-specific Moses detokenizers.
  2. That detokenized sentence is then returned.
  3. However, it also uses sentencepiece with the specified vocab to retokenize that detokenized sentence.
  4. It then computes how the original Moses tokens and the new sentencepiece tokens overlap, and adjusts the alignment pairs.

In the assertEqual comments above, the [..] bits are SPM tokens. The table at the bottom maps the old alignment pairs and their tokens to the new alignment pairs that make up the new alignment line above it (line 34).

Interesting finding while doing this: the byte-fallback necessary to encode the emoji cannot be expressed with a surface form or a slice, since SPM operates on unicode in its Python interface. So tokens 7, 8 and 9 are zero-length. I chose not to add alignment info for those, hoping that aligning only the final token of that byte-fallback sequence will be sufficient for the model to learn that the emojis align. Because I'm lazy. But in theory it should be feasible to detect this and add alignment pairs for all the parts of the byte-fallback sequence.

For other 1->N mappings (e.g. Moses [一个] maps to SPM [一][个]) I chose to generate alignment pairs for all combinations.
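To make the zero-length behaviour concrete: in a span-overlap remapping (like the sketch after the PR description above) with strict inequalities, a zero-length piece can never overlap anything, so byte-fallback pieces silently get no pairs. A tiny self-contained illustration (the spans here are hypothetical character offsets):

    # Overlap test on (begin, end) character spans, end-exclusive. Strict
    # inequalities mean a zero-length span (begin == end) overlaps nothing.
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    emoji_word = (32, 33)           # the Moses token [🤣] in the detokenized text
    byte_fallback = [(32, 32)] * 3  # zero-length SPM pieces 7, 8 and 9
    final_piece = (32, 33)          # the SPM piece that carries the emoji surface

    print([overlaps(emoji_word, p) for p in byte_fallback])  # [False, False, False]
    print(overlaps(emoji_word, final_piece))                 # True: only this piece gets a pair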

@jelmervdl (Contributor, Author) commented Jul 27, 2023

(I'd really, really like to replace the hidden end-to-end tests in test_placeholders.py with proper single-test-case unit tests that can be reasoned about, instead of tests that are opaque and hidden in external files. We already have test_endtoend.py for those.)

TODO: Regenerate the end-to-end test files.

@jelmervdl jelmervdl linked an issue Jul 27, 2023 that may be closed by this pull request
@jelmervdl jelmervdl marked this pull request as ready for review July 27, 2023 16:07
@XapaJIaMnu (Contributor) left a comment:

Looks mostly good. I would like to have examples of the new modifiers in the end-to-end tests and the documentation.

One thing that I forgot to tell you is that I tested decoding with pre-spm'd text and it works, because we have a one-to-one token match. This means that we should be able to apply spm modifications here rather than inside marian during training, and then we can solve the problem of sentences getting marked as tainted due to modifications.

            or pair.trg < 0 or pair.trg >= len(trg_tokens)
        ]
        if invalid_pairs:
            raise ValueError('Out-of-bound alignment pairs: ' + ' '.join(map(repr, invalid_pairs)))
Contributor:

We don't want to raise during training; let's have all of those be a warn_once. I have had training runs fail because of an off-by-one error in the alignments, and one bad line will kill a whole training run. We shouldn't have a fatal failure condition for alignments that are not accurate. (And for some reason, sometimes, fast_align did produce invalid alignments...)

Contributor Author:

Hmm, I agree. What should we do with lines that have out-of-bound pairs? I think we should just throw out all alignment pairs for that sentence, because there's no guarantee that the rest of them aren't off by one or something.

That would automatically also exclude them from the Tags modifier, and their alignment info wouldn't be passed on to Marian, so Marian wouldn't be trained with broken data (or worse, bork because of an index error).
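A rough sketch of that validate-and-drop behaviour (hypothetical names; this sketches the proposal, not the code as merged — further down the thread, whole lines end up being skipped instead):

    import logging

    def checked_pairs(pairs, src_tokens, trg_tokens):
        """Return the alignment pairs unchanged, or an empty list (with a
        warning) if any pair is out of bounds: the rest can't be trusted."""
        for src, trg in pairs:
            if not (0 <= src < len(src_tokens) and 0 <= trg < len(trg_tokens)):
                logging.warning('Out-of-bound alignment pair %d-%d; dropping all pairs', src, trg)
                return []
        return pairs

    print(checked_pairs([(0, 0), (5, 1)], ['hello', 'world'], ['hallo', 'wereld']))  # []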

Contributor:

I would throw out the lines, if we can do it cleanly. If alignment wasn't produced, this likely means the sentence is bad.

Contributor Author:

Okay, euhm. I can do that, but it has to happen at the modifier-is-applied-to-batch level in trainer.py, because that's the only place where we can remove lines at the moment. I'll change that.

There's a discussion somewhere with Graeme about changing modifiers to return a list of lines, which could then just add and remove lines. But that's probably too much to also add to this pull request.

Contributor:

Or maybe merge as-is and start another PR aimed at sanitizing input.

@graemenail (Contributor) left a comment:

Looks good; just some small comments beyond @XapaJIaMnu's review

@jelmervdl (Contributor, Author):

> One thing that I forgot to tell you is that I tested decoding with pre-spm'd text and it works, because we have a one-to-one token match. This means that we should be able to apply spm modifications here rather than inside marian during training, and then we can solve the problem of sentences getting marked as tainted due to modifications.

Oh cool! How would you pass this to Marian? Do you just give it the spm token ids as numbers?

I initially had much bigger plans for this change, but moved them over to #29 and tried to keep this change as backwards compatible as possible. Having an option to actually produce spm token ids and specify things like spm sampling parameters would fit easily in that structure.

@XapaJIaMnu (Contributor):

> > One thing that I forgot to tell you is that I tested decoding with pre-spm'd text and it works, because we have a one-to-one token match. This means that we should be able to apply spm modifications here rather than inside marian during training, and then we can solve the problem of sentences getting marked as tainted due to modifications.
>
> Oh cool! How would you pass this to Marian? Do you just give it the spm token ids as numbers?
>
> I initially had much bigger plans for this change, but moved them over to #29 and tried to keep this change as backwards compatible as possible. Having an option to actually produce spm token ids and specify things like spm sampling parameters would fit easily in that structure.

I was overeager here, hmm. It works for simple sentences, but not for complex ones, because spm_encode | spm_encode != spm_encode. My bad. Still, maybe I can bring it up at the next marian meeting to see if we can get such a simple option in.
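A quick way to see the non-idempotence (a sketch; the model path is a placeholder, and the exact pieces depend on the vocab):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file='vocab.enzh.spm')  # placeholder path

    text = 'This is a simple test statement.'
    once = ' '.join(sp.encode(text, out_type=str))
    twice = ' '.join(sp.encode(once, out_type=str))

    # The second pass re-segments the '▁'-prefixed pieces produced by the
    # first pass, so in general twice != once: spm_encode is not idempotent.
    print(once)
    print(twice)
    print(once == twice)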

I need a new modifier that actually crashes deterministically to test this.
This makes `spm_vocab` behave the same as the dataset paths in a yaml file without manual annotations.
These tests are more specifically for expected end-to-end behaviours
The epoch counter was increased one step too late for our code
When the batch size is bigger than the dataset, it will produce more data than the until condition asks for. Not sure whether we should treat this as a bug yet.
@jelmervdl (Contributor, Author):

Sorry this pull request exploded a bit once I tried to fix the end-to-end tests and discovered some bugs thanks to them.

I think it is mergeable now, assuming we still train marian with full text (re marian-nmt/marian-dev#1003) and we do the "oh this can be better" data type rewrite later (re #29).

Comments are welcome. Also on the fact that DatasetReader reads more iterations than it is strictly supposed to according to `until epoch N` when batch_size >> len(dataset), and whether we should fix that properly now or open an issue for it.
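A toy model of the overshoot, for the record (a sketch of the failure mode in general, not DatasetReader itself): a reader that always fills a whole batch before re-checking the epoch condition will overproduce whenever batch_size >> len(dataset).

    def read_batches(dataset, batch_size, until_epoch):
        epoch, batches = 0, []
        while epoch < until_epoch:
            batch = []
            while len(batch) < batch_size:
                batch.extend(dataset)  # one full pass over the data
                epoch += 1
            batches.append(batch)
        return batches

    batches = read_batches(list(range(10)), batch_size=100, until_epoch=2)
    print(sum(len(b) for b in batches))  # 100: ten epochs were read to fill one
                                         # batch, though `until epoch 2` implies 20 lines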

        except Exception as exc:
            raise CurriculumLoaderError(f"could not initialize modifier '{name}': {exc!s}") from exc

    def _match_modifier_signature(self, fn: Type[Modifier], settings: Dict[str, Any], basepath: str) -> Dict[str, Any]:
Contributor:

wow, it's like you are writing C++ here.

Contributor Author:

I don't think I could do this in C++; there aren't enough reflection APIs. But if I could do it with templates, I'd probably have had better type guarantees than whatever my Python LSP is coming up with.
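For the curious, the reflection in question presumably looks something like this minimal sketch using inspect (not the actual _match_modifier_signature implementation; the Tags class here is a made-up stand-in):

    import inspect
    from typing import Any, Dict, Type

    def match_modifier_signature(fn: Type, settings: Dict[str, Any]) -> Dict[str, Any]:
        """Keep only the settings that the modifier's __init__ actually accepts."""
        params = inspect.signature(fn.__init__).parameters
        return {name: value for name, value in settings.items() if name in params}

    class Tags:  # made-up modifier for illustration
        def __init__(self, probability: float, spm_vocab: str = None):
            self.probability, self.spm_vocab = probability, spm_vocab

    kwargs = match_modifier_signature(Tags, {'probability': 0.1, 'spm_vocab': 'vocab.spm', 'junk': 1})
    print(kwargs)  # {'probability': 0.1, 'spm_vocab': 'vocab.spm'}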

        self.assertEqual(process.returncode, returncode, msg=process.stderr)

        # Useful to generate the data, hehe
        if False and path.endswith('.expected.out'):
Contributor:

Hmmm, so we keep the output generating code in here and just change it to True when the tests change?

Contributor Author:

Yes. Or I can comment it out. Not sure why I did the if False thing; probably out of laziness.
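One common alternative to the if False toggle is gating regeneration on an environment variable, so the code stays exercised but never fires by accident. A sketch only, reusing the hypothetical path and process names from the test excerpt above:

    import os

    # Regenerate expected outputs with: UPDATE_EXPECTED=1 python -m unittest
    if os.environ.get('UPDATE_EXPECTED') and path.endswith('.expected.out'):
        with open(path, 'w', encoding='utf-8') as fh:
            fh.write(process.stdout)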

@XapaJIaMnu (Contributor) left a comment:

So we leave the "produce spm pieces directly" bit for when the PR with marian is merged?

Also, after the rework of the tests, we get 110 sentences as opposed to 200. It's still not an exact match of the number of lines in a batch, but closer. Should we just write this in the notes and call it a day?

How difficult would it be to drop invalid sentences with a warning? Last time I looked it got a bit hairy. Maybe just returning "" would drop the sentence altogether?

@jelmervdl (Contributor, Author):

> Also, after the rework of the tests, we get 110 sentences as opposed to 200. It's still not an exact match of the number of lines in a batch, but closer. Should we just write this in the notes and call it a day?

I opened #30 for this.

> How difficult would it be to drop invalid sentences with a warning? Last time I looked it got a bit hairy. Maybe just returning "" would drop the sentence altogether?

Just implemented that. With backtraces for debugging and all :)

@jelmervdl jelmervdl merged commit 3b8e0cc into main Aug 14, 2023
8 checks passed
@jelmervdl jelmervdl deleted the alignment-passthrough branch August 14, 2023 12:43
         try:
             yield fn(item)
         except Exception as exc:
-            raise Exception(f'Exception while processing item {n}: {item!r}') from exc
+            logger.log(f'Exception while processing line, skipping: {item!r}', 'WARNING',
Contributor:

This should ideally be log_once, otherwise we would get spammed every time we loop through the dataset. (I assume the exception kwargs would always be the same.)

Contributor Author:

It should be, but it might also not be hashable. I'll experiment.
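A log_once could be as simple as remembering which messages it has already emitted, which is where the hashability concern comes in (a sketch, not opustrainer's actual logger):

    import logging

    _seen = set()

    def log_once(message: str, level: int = logging.WARNING) -> None:
        """Log a message only the first time it is seen; requires whatever
        we key on (here the message string) to be hashable."""
        if message not in _seen:
            _seen.add(message)
            logging.log(level, message)

    log_once('Out-of-bound alignment pair 12-7')
    log_once('Out-of-bound alignment pair 12-7')  # silently skipped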


Successfully merging this pull request may close these issues.

Allow training models with guided alignment