Alignment passthrough #26
Merged
Commits
38 commits
4ea23bb wip (XapaJIaMnu)
a144ae2 Initial implementation of noise augmenters (XapaJIaMnu)
be4ebe5 Simplify code a bit (jelmervdl)
7dcc2a4 Fix tests (jelmervdl)
7a1189f Fix possible bug in test (jelmervdl)
6da73ff Add specific tests for the three modes (jelmervdl)
55b443c Add alignment info to the simple tests (jelmervdl)
42b850e Make placeholder modifier produce (corrected) alignment pairs (jelmervdl)
831d564 Make sure `get_placeholding_candidates` returns the original instances (jelmervdl)
2e3d29f update tests (jelmervdl)
38e54de Merge branch 'main' into alignment-passthrough (jelmervdl)
723760e Attempt to improve the alignment fix-up (jelmervdl)
17733d4 Fix unit tests (jelmervdl)
5c768e4 Implement retokenize modifier (jelmervdl)
e0adec6 Merge remote-tracking branch 'origin/main' into alignment-passthrough (jelmervdl)
294a18d Let PlaceholderModifier use Retokenizer implementation for now (jelmervdl)
d8b1b10 Add unittest for spm retokenize in placeholders (jelmervdl)
704bd65 Add test to confirm that even when no placeholder is added, retokeniz… (jelmervdl)
38a3cae Efficiency: don't bother calculating candidates if prob = 0. (jelmervdl)
0c4868f Add tests covering spaces tokenizer (jelmervdl)
aab72a4 Document the `spm_vocab` option of the `Tags` modifier (jelmervdl)
973906a Be nicer about issues with the alignment info (jelmervdl)
6b4abe0 Explain the `StopIteration` bit (jelmervdl)
c200c9c Remove unreachable else (jelmervdl)
126587d Remove debug code (jelmervdl)
106d832 Document and rename methods (jelmervdl)
b9ad9f6 Skip trainer backtrace test for now (jelmervdl)
b822e8c Only print alignment info when spm_vocab is passed in (jelmervdl)
ef3c780 Make `retokenize` a little less `O(n^2)` (jelmervdl)
7069872 Replace placeholder-specific end-to-end tests with specific test for … (jelmervdl)
6b62198 Use `Path` in type signature of modifiers to resolve relative paths (jelmervdl)
7a80d2c Rewrite end-to-end tests (jelmervdl)
9603208 Rewrite DatasetReader to not always produce n+1 lines (jelmervdl)
a2248ad Add option for batch size (jelmervdl)
2f72e76 Add some comments to the tests (jelmervdl)
4779dd6 Fix missing sentencepiece dependency (jelmervdl)
a17af46 Fix other pyproject.toml entries while we're at it (jelmervdl)
2479f09 Make trainer skip lines that can't be processed by modifier (jelmervdl)
This should ideally be `log_once`, otherwise we would get spammed every time we loop through the dataset. (I assume the exception kwargs would always be the same.)
It should be, but it might also not be hashable. I'll experiment.
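To illustrate the hashability concern, here is a minimal sketch of a log-once helper that still deduplicates when a keyword argument is unhashable (for example a list). The names `log_once_sketch`, `_freeze` and `_seen` are hypothetical and are not the project's `log_once`; falling back to `repr()` for unhashable values is an assumption made for this example only.

```python
import logging
from typing import Any, Hashable, Set

logger = logging.getLogger("log_once_sketch")

# Keys of every (message, kwargs) combination that was already logged.
_seen: Set[Hashable] = set()


def _freeze(value: Any) -> Hashable:
    """Return the value itself if it is hashable, otherwise its repr()."""
    try:
        hash(value)
        return value
    except TypeError:
        # Assumption: repr() is stable enough to identify "the same" value.
        return repr(value)


def log_once_sketch(message: str, **kwargs: Any) -> bool:
    """Log a warning only the first time this message/kwargs pair is seen.

    Returns True if the message was logged, False if it was suppressed.
    """
    # Sort by keyword name so the key does not depend on argument order.
    frozen = tuple(sorted((name, _freeze(val)) for name, val in kwargs.items()))
    key = (message, frozen)
    if key in _seen:
        return False
    _seen.add(key)
    logger.warning("%s %r", message, kwargs)
    return True
```

With a helper along these lines, a line that fails on every pass through the dataset would be reported once rather than once per epoch. The trade-off is that `_seen` grows with the number of distinct messages.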