Alignment passthrough #26

jelmervdl · 2023-07-07T15:27:05Z

This should add the missing functionality to train a student with alignment without having to use hacks.

Add alignment support to the Placeholder modifier, and
implement "retokenization" using a sentencepiece vocab so we
- use word alignments (as opposed to spm token alignments) in our modifiers
- but can pass the alignment info to marian in a useful manner

Right now I'm implementing this as an extension of the Placeholder filter for compatibility reasons. But it can be used as its own stand-alone filter, Retokenize.

The tokenizer parameter of that thing is a bit of a lie, as it always gives you the detokenized output. But it will use the tokenizer parameter to tokenize that detokenized output so it can compute the correct alignment between the new tokens and their indices.

Also fixes the order. Does change the tests though.

# Conflicts: # contrib/test-data/clean.enzh.ref.06.4.none # contrib/test-data/clean.enzh.ref.06.4.trg # contrib/test-data/clean.zhen.ref.06.4.04.04.src # contrib/test-data/clean.zhen.ref.06.4.src # contrib/test_enzh_config.yml # contrib/test_enzh_noise_config.yml # src/opustrainer/modifiers/placeholders.py # tests/test_placeholders.py

Not end-to-end tests yet

Should be mostly compatible, except that the end-to-end tests are failing? Did I change the order of random calls?

jelmervdl · 2023-07-27T15:43:41Z

tests/test_retokenizer.py

+    out = tokenizer('\t'.join([
+      'This is a simple test statement 🤣 .',
+      #^0   ^1 ^2 ^3    ^4   ^5        ^6 ^7
+      '这 是 一个 简单 的 测试 语 句 🤣 。',
+      #^0 ^1 ^2  ^3   ^4 ^5   ^6 ^7 ^8 ^9
+      '0-0 1-1 2-2 3-3 3-4 4-5 5-6 5-7 6-8 7-9',
+    ]))
+    self.assertEqual(out, '\t'.join([
+      'This is a simple test statement 🤣.',
+      #[This][ is][ a][ simple][ test][ statement][ ][] [] [] [🤣][.]
+      #^0    ^1   ^2  ^3       ^4     ^5          ^6 ^7 ^8 ^9 ^10 ^11 
+      '这是一个简单的测试语句 🤣 。',
+      #[这][是][一][个][简][单][的][测][试][语][句] [ ] []  []  []  [🤣][ 。]
+      #^0  ^1  ^2  ^3 ^4  ^5  ^6  ^7  ^8  ^9  ^10 ^11 ^12 ^13 ^14 ^15  ^16
+      '0-0 1-1 2-2 2-3 3-4 3-5 3-6 4-7 4-8 5-9 5-10 10-15 11-16',
+      # 0-0 [This]      [这]    0-0
+      # 1-1 [is]        [是]    1-1
+      # 2-2 [a]         [一个]  2-2 2-3
+      # 3-3 [simple]    [简单]  3-4 3-5
+      # 3-4 [simple]    [的]    3-6
+      # 4-5 [test]      [测试]  4-7 4-8
+      # 5-6 [statement] [语]    5-9
+      # 5-7 [statement] [句]    5-10 (6-11)
+      # 6-8 [🤣]        [🤣]   (7-12 8-13 9-14) 10-15
+      # 7-9 [.]         [。]    11-16
+    ]))


This is probably the best example of what this code does: Input is Moses tokens for both sides of the sentence pair.

Both are detokenized using their own language-specific Moses detokenizers.

That detokenized sentence is then returned.

However, it also uses sentencepiece with the specified vocab to retokenize that detokenized sentence.

It then computes how the original moses tokens and the new sentencepiece tokens overlap, and adjusts the alignment pairs.

In the assertEqual comments above, the [..] bits are SPM tokens. The table at the bottom is a mapping between old alignment pairs and their tokens to new alignment pairs that make up the line (34) above it.

Interesting finding while doing this: the byte-fallback necessary to encode the emoji cannot be expressed with a surface form or a slice since SPM operates on unicode in its Python interface. So tokens 7, 8 and 9 are zero-length. I chose to not add alignment info for those, hoping that only the final token of that byte-fallback sequence will be sufficient for the model to learn that the emoji faces align. Because I'm lazy. But in theory it should be feasible to detect this and add alignment pairs for all the parts of the byte-fallback sequence.

For other 1->N mappings (e.g. moses [一个] maps to spm [一][个]) I chose to generate alignment pairs for all combinations.

…ation still occurs

jelmervdl · 2023-07-27T16:06:42Z

(I'd really, really like to replace the hidden end-to-end tests in test_placeholders.py with proper single test case unit tests that can be reasoned about and aren't opaque and hidden in external files. We already have test_endtoend.py for those.)

TODO: Regenerate the end-to-end test files.

XapaJIaMnu

Looks mostly good. I would like to have example of the new modifiers in the end-to-end tests and the documentation.

One thing that I forgot to tell you is that i tested decoding with pre-spm'd text and it works, because we have one-to-one token match. This means that we should be able to apply spm modifications here and not inside marian during training and then we can solve the problem of the sentences that are getting marked as tainted due to modifications.

XapaJIaMnu · 2023-07-27T16:51:30Z

src/opustrainer/alignments.py

+            or pair.trg < 0 or pair.trg >= len(trg_tokens)
+        ]
+        if invalid_pairs:
+            raise ValueError('Out-of-bound alignment pairs: ' + ' '.join(map(repr, invalid_pairs)))


We don't want to Raise during training, let's have all of those be a warn_once. I had a few training runs fail because of a one off error in the alignments will kill a whole training run. We should have a fail condition for alignments are no accurate. (And for some reason, sometimes, fast align did produce invalid alignments...)

Hmm, I agree. What should we do with lines that have out-of-bound pairs? I think we should just throw out all alignment pairs in that case for that sentence because there's no guarantee that the rest of them are not off-by-one or something.

That would automatically also exclude them from the Tags modifier, and their alignment info wouldn't be passed on to Marian, so Marian would not be trained with broken data (or worse, bork because of an index error).

I would throw out the lines, if we can do it cleanly. If alignment wasn't produced, this likely means the sentence is bad.

Okay, euhm. I can do that, but it has to happen at the modifier-is-applied-to-batch level in trainer.py because that's the only place we can remove lines at the moment. I'll change that.

There's a discussion somewhere with Graeme about changing modifiers to return a list of lines, which then can just add and remove lines. But that's probably too much also add into this pull request.

or maybe then merge as is and start another PR aiming at sanitizing input.

src/opustrainer/modifiers/placeholders.py

src/opustrainer/modifiers/retokenize.py

src/opustrainer/modifiers/placeholders.py

src/opustrainer/modifiers/retokenize.py

tests/test_retokenizer.py

src/opustrainer/tokenizers.py

graemenail

Looks good; just some small comments beyond @XapaJIaMnu's review

src/opustrainer/tokenizers.py

src/opustrainer/modifiers/placeholders.py

tests/test_retokenizer.py

jelmervdl · 2023-07-28T09:44:18Z

One thing that I forgot to tell you is that i tested decoding with pre-spm'd text and it works, because we have one-to-one token match. This means that we should be able to apply spm modifications here and not inside marian during training and then we can solve the problem of the sentences that are getting marked as tainted due to modifications.

Oh cool! How would you pass this to Marian? Do you just give it the spm token ids as numbers?

I initially had much bigger plans for this change, but moved them over to #29 and tried to keep this change backwards compatible as much as possible. Having an option to actually produce spm token ids and specify things like spm sampling parameters would fit easily in that structure.

XapaJIaMnu · 2023-07-28T10:44:23Z

One thing that I forgot to tell you is that i tested decoding with pre-spm'd text and it works, because we have one-to-one token match. This means that we should be able to apply spm modifications here and not inside marian during training and then we can solve the problem of the sentences that are getting marked as tainted due to modifications.

Oh cool! How would you pass this to Marian? Do you just give it the spm token ids as numbers?

I initially had much bigger plans for this change, but moved them over to #29 and tried to keep this change backwards compatible as much as possible. Having an option to actually produce spm token ids and specify things like spm sampling parameters would fit easily in that structure.

I was overeager here, hmm. It works for simple sentences, but not for complex ones because spm_encode | spm_encode != spm_encode. My bad. Still, maybe can bring it up to the next marian meeting if we can get such a simple option in.

I need a new modifier that actually crashes deterministically to test this.

Nicely backwards compatible!

…`mode` arguments

This makes `spm_vocab` behave the same as the dataset paths in a yaml file without manual annotations.

These tests are more specifically for expected end-to-end behaviours

The epoch counter was increased one step too late for our code

When the batch size is bigger than the dataset, it will produce more data than the condition. Not sure whether we should treat this as a bug yet.

src/opustrainer/modifiers/retokenize.py

jelmervdl · 2023-08-09T22:14:03Z

Sorry this pull request exploded a bit once I tried to fix the end-to-end tests and discovered some bugs thanks to them.

I think it is mergeable now, assuming we still train marian with full text (re marian-nmt/marian-dev#1003) and we do the "oh this can be better" data type rewrite later (re #29).

Comments are welcome. Also on the DatasetReader reads more iterations than it is strictly supposed to according to until epoch N when batch_size is >> len(dataset), and whether we should fix that properly now or open an issue for it.

XapaJIaMnu · 2023-08-13T20:14:07Z

src/opustrainer/trainer.py

        except Exception as exc:
            raise CurriculumLoaderError(f"could not initialize modifier '{name}': {exc!s}") from exc

+    def _match_modifier_signature(self, fn:Type[Modifier], settings:Dict[str, Any], basepath:str) -> Dict[str,Any]:


wow, it's like you are writing C++ here.

I don't think I could do this in C++, not enough reflection APIs. But if I could do this with templates, I'd probably would have had better type guarantees than whatever my python LSP is coming up with.

XapaJIaMnu · 2023-08-13T20:16:14Z

tests/test_endtoend.py

+        self.assertEqual(process.returncode, returncode, msg=process.stderr)
+
+        # Useful to generate the data, hehe
+        if False and path.endswith('.expected.out'):


Hmmm, so we keep the output generating code in here and just change it to True when the tests change?

Yes. Or I can comment it out. Not sure why I did the if False thing, probably out of lazyness.

XapaJIaMnu

So we leave the produce spm pieces directly bit for when the PR with marian is merged?

Also, after the rework of the tests, we get 110 sentences, as opposed to 200. It's still not an exact match of the number of lines in a batch, but closer. Should we just write this in the notes and call it a day?

How difficult would it be to drop invalid sentences with a warning? Last time i looked it got a bit hairy. Maybe just return "" would just drop the sentence altogether?

jelmervdl · 2023-08-14T10:58:39Z

Also, after the rework of the tests, we get 110 sentences, as opposed to 200. It's still not an exact match of the number of lines in a batch, but closer. Should we just write this in the notes and call it a day?

I opened #30 for this.

How difficult would it be to drop invalid sentences with a warning? Last time i looked it got a bit hairy. Maybe just return "" would just drop the sentence altogether?

Just implemented that. With backtraces for debugging and all :)

XapaJIaMnu · 2023-08-14T14:17:48Z

src/opustrainer/trainer.py

        try:
            yield fn(item)
        except Exception as exc:
-            raise Exception(f'Exception while processing item {n}: {item!r}') from exc
+            logger.log(f'Exception while processing line, skipping: {item!r}', 'WARNING',


This should ideally be log_once otherwise we would get spammed everytime we loop through the dataset. (I assume the exception kwards would be the same always)

It should be, but it might also not be hashable. I'll experiment.

XapaJIaMnu and others added 17 commits April 3, 2023 14:08

wip

4ea23bb

Initial implementation of noise augmenters

a144ae2

Simplify code a bit

be4ebe5

Fix tests

7dcc2a4

Fix possible bug in test

7a1189f

Add specific tests for the three modes

6da73ff

Add alignment info to the simple tests

55b443c

Make placeholder modifier produce (corrected) alignment pairs

42b850e

Make sure get_placeholding_candidates returns the original instances

831d564

Also fixes the order. Does change the tests though.

update tests

2e3d29f

Attempt to improve the alignment fix-up

723760e

Fix unit tests

17733d4

Not end-to-end tests yet

Implement retokenize modifier

5c768e4

Merge remote-tracking branch 'origin/main' into alignment-passthrough

e0adec6

Let PlaceholderModifier use Retokenizer implementation for now

294a18d

Should be mostly compatible, except that the end-to-end tests are failing? Did I change the order of random calls?

Add unittest for spm retokenize in placeholders

d8b1b10

jelmervdl mentioned this pull request Jul 27, 2023

Use SentencePair struct instead of str internally #29

Open

jelmervdl commented Jul 27, 2023

View reviewed changes

jelmervdl added 4 commits July 27, 2023 16:51

Add test to confirm that even when no placeholder is added, retokeniz…

704bd65

…ation still occurs

Efficiency: don't bother calculating candidates if prob = 0.

38a3cae

Add tests covering spaces tokenizer

0c4868f

Document the spm_vocab option of the Tags modifier

aab72a4

jelmervdl requested review from XapaJIaMnu and graemenail July 27, 2023 16:06

jelmervdl linked an issue Jul 27, 2023 that may be closed by this pull request

Allow training models with guided alignment #11

Closed

jelmervdl marked this pull request as ready for review July 27, 2023 16:07

XapaJIaMnu requested changes Jul 27, 2023

View reviewed changes

graemenail reviewed Jul 27, 2023

View reviewed changes

src/opustrainer/tokenizers.py Outdated Show resolved Hide resolved

src/opustrainer/modifiers/placeholders.py Outdated Show resolved Hide resolved

tests/test_retokenizer.py Show resolved Hide resolved

jelmervdl added 4 commits July 28, 2023 11:18

Be nicer about issues with the alignment info

973906a

Explain the StopIteration bit

6b4abe0

Remove unreachable else

c200c9c

Remove debug code

126587d

jelmervdl added 12 commits July 28, 2023 12:11

Document and rename methods

106d832

Skip trainer backtrace test for now

b9ad9f6

I need a new modifier that actually crashes deterministically to test this.

Only print alignment info when spm_vocab is passed in

b822e8c

Nicely backwards compatible!

Make retokenize a little less O(n^2)

ef3c780

Replace placeholder-specific end-to-end tests with specific test for …

7069872

…`mode` arguments

Use Path in type signature of modifiers to resolve relative paths

6b62198

This makes `spm_vocab` behave the same as the dataset paths in a yaml file without manual annotations.

Rewrite end-to-end tests

7a80d2c

These tests are more specifically for expected end-to-end behaviours

Rewrite DatasetReader to not always produce n+1 lines

9603208

The epoch counter was increased one step too late for our code

Add option for batch size

a2248ad

When the batch size is bigger than the dataset, it will produce more data than the condition. Not sure whether we should treat this as a bug yet.

Add some comments to the tests

2f72e76

Fix missing sentencepiece dependency

4779dd6

Fix other pyproject.toml entries while we're at it

a17af46

XapaJIaMnu reviewed Aug 9, 2023

View reviewed changes

src/opustrainer/modifiers/retokenize.py Outdated Show resolved Hide resolved

XapaJIaMnu reviewed Aug 13, 2023

View reviewed changes

Make trainer skip lines that can't be processed by modifier

2479f09

jelmervdl merged commit 3b8e0cc into main Aug 14, 2023
8 checks passed

jelmervdl deleted the alignment-passthrough branch August 14, 2023 12:43

XapaJIaMnu reviewed Aug 14, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Alignment passthrough #26

Alignment passthrough #26

jelmervdl commented Jul 7, 2023 •

edited

Loading

jelmervdl Jul 27, 2023

jelmervdl commented Jul 27, 2023 •

edited

Loading

XapaJIaMnu left a comment

XapaJIaMnu Jul 27, 2023

jelmervdl Jul 28, 2023

XapaJIaMnu Aug 13, 2023

jelmervdl Aug 13, 2023

XapaJIaMnu Aug 14, 2023

graemenail left a comment

jelmervdl commented Jul 28, 2023

XapaJIaMnu commented Jul 28, 2023

jelmervdl commented Aug 9, 2023

XapaJIaMnu Aug 13, 2023

jelmervdl Aug 13, 2023

XapaJIaMnu Aug 13, 2023

jelmervdl Aug 13, 2023

XapaJIaMnu left a comment

jelmervdl commented Aug 14, 2023

XapaJIaMnu Aug 14, 2023

jelmervdl Aug 14, 2023

Alignment passthrough #26

Alignment passthrough #26

Conversation

jelmervdl commented Jul 7, 2023 • edited Loading

Choose a reason for hiding this comment

jelmervdl commented Jul 27, 2023 • edited Loading

XapaJIaMnu left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

graemenail left a comment

Choose a reason for hiding this comment

jelmervdl commented Jul 28, 2023

XapaJIaMnu commented Jul 28, 2023

jelmervdl commented Aug 9, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

XapaJIaMnu left a comment

Choose a reason for hiding this comment

jelmervdl commented Aug 14, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jelmervdl commented Jul 7, 2023 •

edited

Loading

jelmervdl commented Jul 27, 2023 •

edited

Loading