HF tokenizers: initial base tokenizer support #2350
Merged
Changes from all commits (11 commits, all by ebsmothers):

- 60bbd6f wip
- 0b11333 HF tokenizers: initial base tokenizer support
- cd1dcc1 Merge branch 'main' into hf-tokenizer
- 5f5fa8b prefer config over generation_config, address comments
- 0bb68a3 docstring updates
- 0a00c80 more comments
- 7e70de7 add link
- 538cf32 merge
- 9120af5 more robust BOS/EOS handling, update docs and test
- b1aba76 docs formatting
- 6407277 Add to api_ref_modules.rst
Diff (adds the tokenizers dependency):

@@ -23,6 +23,7 @@ dependencies = [
     "sentencepiece",
     "tiktoken",
     "blobfile>=2",
+    "tokenizers",

     # Miscellaneous
     "numpy",
Diff (new one-line file):

@@ -0,0 +1 @@
+{"bos_token_id": 0, "eos_token_id": -1}
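The one-line config above sets bos_token_id to 0 but eos_token_id to -1, which presumably exercises the "more robust BOS/EOS handling" from commit 9120af5. Below is a minimal stdlib-only sketch of such handling; the names (add_special_tokens, _valid_id) are hypothetical, and the convention that a missing or negative id means "no such token" is an assumption, not necessarily torchtune's actual logic:

```python
import json
from typing import List, Optional


def _valid_id(token_id) -> Optional[int]:
    # Assumed convention: None or a negative id (like the -1 in the
    # test config) means the special token is absent.
    return token_id if isinstance(token_id, int) and token_id >= 0 else None


def add_special_tokens(
    ids: List[int], config: dict, add_bos: bool = True, add_eos: bool = True
) -> List[int]:
    """Prepend BOS / append EOS according to a generation_config-style dict."""
    bos = _valid_id(config.get("bos_token_id"))
    eos = _valid_id(config.get("eos_token_id"))
    out = list(ids)
    if add_bos and bos is not None:
        out = [bos] + out
    if add_eos and eos is not None:
        out = out + [eos]
    return out


config = json.loads('{"bos_token_id": 0, "eos_token_id": -1}')
print(add_special_tokens([5, 6, 7], config))  # BOS prepended, EOS skipped -> [0, 5, 6, 7]
```

With this convention the test config yields token streams that start with BOS 0 and never receive an EOS, which is a reasonable fixture for checking that invalid ids are silently skipped rather than appended.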
I had thought we would make this a dev dependency and conditionally import it in hf_tokenizer.
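The conditional-import approach suggested here could look like the stdlib-only sketch below; optional_import is a hypothetical helper for illustration, not torchtune's actual code:

```python
import importlib
import importlib.util


def optional_import(module_name: str, feature: str):
    """Import module_name if it is installed; otherwise raise an
    ImportError with an actionable install hint."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is required for {feature}. "
            f"Install it with: pip install {module_name}"
        )
    return importlib.import_module(module_name)


# Hypothetical usage inside an HF tokenizer wrapper: the hard dependency
# is only hit when the feature is actually used.
def load_hf_tokenizer(path: str):
    tokenizers = optional_import("tokenizers", "HF tokenizer support")
    return tokenizers.Tokenizer.from_file(path)
```

The deferred import keeps the package importable without tokenizers installed, at the cost of errors surfacing at call time rather than install time.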
Yeah, I was on the fence about this. But the upstream dependencies of tokenizers are actually already a subset of our own, so we aren't pulling in a bunch of new transitive deps; it is really just tokenizers and nothing else. Because of that I feel OK keeping it as a core dep, but I'm happy to take the approach you're suggesting if you (or others) strongly disagree.
Just documenting: the tokenizers package is roughly 3 MB (https://pypi.org/project/tokenizers/#files).
Isn't torch close to 1 GB? Anyway, if everyone would prefer to make this an optional dep, I'm fine with it.
Yeah, 3 MB is nothing. I'm fine with it as a core dep.