
Alignment passthrough #26

Merged 38 commits on Aug 14, 2023
Commits
- 4ea23bb wip (XapaJIaMnu, Apr 3, 2023)
- a144ae2 Initial implementation of noise augmenters (XapaJIaMnu, Apr 3, 2023)
- be4ebe5 Simplify code a bit (jelmervdl, Apr 4, 2023)
- 7dcc2a4 Fix tests (jelmervdl, Apr 4, 2023)
- 7a1189f Fix possible bug in test (jelmervdl, Apr 4, 2023)
- 6da73ff Add specific tests for the three modes (jelmervdl, Apr 4, 2023)
- 55b443c Add alignment info to the simple tests (jelmervdl, Apr 4, 2023)
- 42b850e Make placeholder modifier produce (corrected) alignment pairs (jelmervdl, Apr 4, 2023)
- 831d564 Make sure `get_placeholding_candidates` returns the original instances (jelmervdl, Apr 5, 2023)
- 2e3d29f update tests (jelmervdl, Apr 5, 2023)
- 38e54de Merge branch 'main' into alignment-passthrough (jelmervdl, Jul 6, 2023)
- 723760e Attempt to improve the alignment fix-up (jelmervdl, Jul 6, 2023)
- 17733d4 Fix unit tests (jelmervdl, Jul 7, 2023)
- 5c768e4 Implement retokenize modifier (jelmervdl, Jul 24, 2023)
- e0adec6 Merge remote-tracking branch 'origin/main' into alignment-passthrough (jelmervdl, Jul 25, 2023)
- 294a18d Let PlaceholderModifier use Retokenizer implementation for now (jelmervdl, Jul 27, 2023)
- d8b1b10 Add unittest for spm retokenize in placeholders (jelmervdl, Jul 27, 2023)
- 704bd65 Add test to confirm that even when no placeholder is added, retokeniz… (jelmervdl, Jul 27, 2023)
- 38a3cae Efficiency: don't bother calculating candidates if prob = 0. (jelmervdl, Jul 27, 2023)
- 0c4868f Add tests covering spaces tokenizer (jelmervdl, Jul 27, 2023)
- aab72a4 Document the `spm_vocab` option of the `Tags` modifier (jelmervdl, Jul 27, 2023)
- 973906a Be nicer about issues with the alignment info (jelmervdl, Jul 28, 2023)
- 6b4abe0 Explain the `StopIteration` bit (jelmervdl, Jul 28, 2023)
- c200c9c Remove unreachable else (jelmervdl, Jul 28, 2023)
- 126587d Remove debug code (jelmervdl, Jul 28, 2023)
- 106d832 Document and rename methods (jelmervdl, Jul 28, 2023)
- b9ad9f6 Skip trainer backtrace test for now (jelmervdl, Jul 28, 2023)
- b822e8c Only print alignment info when spm_vocab is passed in (jelmervdl, Aug 7, 2023)
- ef3c780 Make `retokenize` a little less `O(n^2)` (jelmervdl, Aug 7, 2023)
- 7069872 Replace placeholder-specific end-to-end tests with specific test for … (jelmervdl, Aug 7, 2023)
- 6b62198 Use `Path` in type signature of modifiers to resolve relative paths (jelmervdl, Aug 9, 2023)
- 7a80d2c Rewrite end-to-end tests (jelmervdl, Aug 9, 2023)
- 9603208 Rewrite DatasetReader to not always produce n+1 lines (jelmervdl, Aug 9, 2023)
- a2248ad Add option for batch size (jelmervdl, Aug 9, 2023)
- 2f72e76 Add some comments to the tests (jelmervdl, Aug 9, 2023)
- 4779dd6 Fix missing sentencepiece dependency (jelmervdl, Aug 9, 2023)
- a17af46 Fix other pyproject.toml entries while we're at it (jelmervdl, Aug 9, 2023)
- 2479f09 Make trainer skip lines that can't be processed by modifier (jelmervdl, Aug 14, 2023)
9 changes: 8 additions & 1 deletion README.md
@@ -165,21 +165,28 @@ modifiers:
#### Tags
Adds a placeholder tag to the source sentence that the model can use as a hint for how it should translate a particular word. The word to hint is chosen at random from the target sentence. Only words with a 1-to-1 mapping between source and target are considered.

This modifier needs a third column in the training data with per-word (technically: space-separated token) alignment information.
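For illustration, a training line with such an alignment column could look like the following (tab-separated columns; each `src-trg` pair gives the indices of aligned tokens — the sentences and alignments here are made up, not taken from a real dataset):

```
This is a test .	Dies ist ein Test .	0-0 1-1 2-2 3-3 4-4
```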

```yaml
- Tags: 0.05
custom_detok_src: null
custom_detok_trg: zh
spm_vocab: path/to/vocab.enzh.spm
template: "__source__ {src} __target__ {trg} __done__"
```

All options are optional.

You can specify custom detokenizer languages with `custom_detok_src` and `custom_detok_trg` if the dataset you're reading from was tokenized by the Moses tokenizer. This is helpful for languages that do not use spaces to delimit words. The default tokenisation strategy is splitting/joining on spaces.

The `spm_vocab` option can be used to recompute the alignment info to match the tokenisation from the sentencepiece vocabulary. This is mostly useful for Marian, which takes untokenised input but expects the alignment info to match the sentencepiece tokenisation it performs. Note that at the moment alignment info is only produced when `spm_vocab` is given.
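As a rough illustration of what recomputing the alignment involves (a sketch only, not the actual modifier code): once each space-separated word is split into subword pieces, every word-level alignment pair has to be expanded to all combinations of the subword indices on both sides. The function and variable names below are made up for the example.

```python
def remap_alignment(pairs, src_pieces, trg_pieces):
    """Expand word-level alignment pairs to subword-level pairs.

    pairs:       word-level (src_idx, trg_idx) alignment pairs
    src_pieces:  for each source word, the indices of its subword tokens
    trg_pieces:  the same for the target side
    """
    return [
        (s, t)
        for src_word, trg_word in pairs
        for s in src_pieces[src_word]
        for t in trg_pieces[trg_word]
    ]

# Suppose source word 1 was split into subword tokens 1 and 2 by sentencepiece,
# while every other word maps to a single subword token.
src_pieces = [[0], [1, 2]]
trg_pieces = [[0], [1]]
print(remap_alignment([(0, 0), (1, 1)], src_pieces, trg_pieces))
# → [(0, 0), (1, 1), (2, 1)]
```

The actual implementation also has to run the sentencepiece tokenizer to discover how each word was split; the sketch assumes that mapping is already known.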

The format of the hint that tells the translation model how to translate the selected word can be controlled with `template`. Here `{src}` and `{trg}` are replaced by the selected words from the source and target side of the sentence pair.
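The substitution itself amounts to plain string formatting. A minimal sketch (illustrative only; the function name and the exact insertion strategy are assumptions, not the actual implementation):

```python
template = "__source__ {src} __target__ {trg} __done__"

def add_hint(src_sentence: str, src_word: str, trg_word: str) -> str:
    """Replace the chosen source word with the filled-in hint template."""
    hint = template.format(src=src_word, trg=trg_word)
    return src_sentence.replace(src_word, hint, 1)

print(add_hint("the cat sat", "cat", "Katze"))
# → the __source__ cat __target__ Katze __done__ sat
```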

**Note**: Due to how most modifiers are implemented, they will have a normalising effect on spaces. Sequences of spaces will be collapsed into a single space. This is also true for the *Tags* modifier.

**Note**: Even if the probability of the *Tags* modifier is set to 0, it will still apply detokenisation and, optionally, recomputation of the alignment on every sentence pair, regardless of whether the pair was picked to be modified.

#### Prefix
Prepends a random subsection of the target sentence before the source sentence.

Expand Down
10 changes: 0 additions & 10 deletions contrib/test-data/clean.enzh.ref.06.4.none

This file was deleted.

10 changes: 0 additions & 10 deletions contrib/test-data/clean.enzh.ref.06.4.trg

This file was deleted.
