HF tokenizers: initial base tokenizer support #2350
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2350
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 1 Pending as of commit 6407277 with merge base d5d85c5.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Just checked _hf_tokenizer. I haven't looked at the tests yet.
path (str): Path to tokenizer.json file
config_path (Optional[str]): Path to tokenizer_config.json file. Default: None
generation_config_path (Optional[str]): Path to generation_config.json file.
    Default: None
We have three paths. The last two are more specific, which is good. I think we could rename path -> tokenizer_json_path to make it explicit.
if self.generation_config:
    self.bos_id = self.generation_config.get("bos_token_id")
    self.eos_id = self.generation_config.get("eos_token_id")

if self.config:
    bos_token = self._get_token_from_config(self.config, "bos_token")
    eos_token = self._get_token_from_config(self.config, "eos_token")
    if bos_token is not None and self.bos_id is None:
        self.bos_id = self.hf_tokenizer.token_to_id(bos_token)
    if eos_token is not None and self.eos_id is None:
        self.eos_id = self.eos_id or self.hf_tokenizer.token_to_id(eos_token)
I don't know much about these files; should we add a sanity check to see if config.eos == generation_config.bos_token_id?
Sorry, maybe I'm confused... why would we expect eos to equal bos? But if you mean checking eos == eos and bos == bos across config and generation config, I think we can add that check.
Update: I opted not to add the check, mainly because it's just extra logic to validate something that shouldn't happen (the user passing a config and generation config that don't line up). In that case, we should just be explicit in our docs that we prefer config.json whenever it's passed. Let me know if there are any concerns with that, though.
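For context, _get_token_from_config is referenced in the hunk above but not shown in it. Below is a minimal sketch of what such a helper might look like, assuming tokenizer_config.json stores bos_token/eos_token either as a plain string or as an AddedToken-style dict with a "content" field; names and behavior here are an illustration, not the PR's exact implementation.

```python
from typing import Any, Dict, Optional


def _get_token_from_config(config: Dict[str, Any], key: str) -> Optional[str]:
    """Return the string form of a special token entry from tokenizer_config.json.

    HF configs store special tokens either as a bare string ("<s>") or as a dict
    like {"content": "<s>", ...}; handle both, returning None when the key is absent.
    """
    token = config.get(key)
    if isinstance(token, dict):
        return token.get("content")
    return token
```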
        self.generation_config = json.load(f)
else:
    self.generation_config = None
self._infer_bos_eos_tokens()
Let's raise an error if self.generation_config is None and self.config is None.
Also, apparently we always use one or the other. We could simplify the rest of the code with something like:
self.config = generation_config or config
Edit: actually, I don't think we can do that because they have different formats.
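A minimal sketch of the raise-an-error suggestion, written as a standalone helper; the function name and message text are illustrative, not taken from the PR:

```python
from typing import Any, Dict, Optional


def _validate_configs(
    config: Optional[Dict[str, Any]],
    generation_config: Optional[Dict[str, Any]],
) -> None:
    # Fail early if neither config file was provided, since BOS/EOS inference
    # would then have nothing to work with.
    if config is None and generation_config is None:
        raise ValueError(
            "Expected at least one of tokenizer_config.json or "
            "generation_config.json in order to infer BOS/EOS token ids."
        )
```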
Yeah, I think it's simplest if we just use config. At the same time, I read (on some random forum) that generation config should take precedence over config, so currently I'm doing that. One thing that's nice about generation config is that it directly gives the int value for eos and bos, whereas the regular config tends to give the string value, which then needs to be converted. Aside from that, config has the chat template, special tokens, and all the other stuff we would need. So right now I'm hedging a bit and keeping both around, because I suspect we may need both at different times.
Cite your sources.
Like I said, maybe not the actual source of truth here, especially as I look through the PreTrainedTokenizerBase class -- they seem to really only rely on tokenizer_config.json (ref). For what it's worth, litgpt also gives tokenizer_config.json precedence, and it seems to work OK for them (also shout-out to @Andrei-Aksionov, who helped give me some pointers here).
So perhaps I will switch the order to read from tokenizer config, then generation config. I trust the code itself + litgpt's implementation more than that one specific issue comment.
Oh, thanks for the shout-out 😊.
It's been a long time since I worked with tokenizers, but yes: we decided to rely on tokenizer.json and, if it's present in the repo, use HF Tokenizers. Before that, tokenizer.model (and thus SentencePiece) had precedence.
If you take a closer look at the init process of AutoTokenizer, you can see that if there is only a tokenizer.model file (a vocab for SP), that vocab is loaded and the SP tokenizer is converted to the HF Tokenizers format. The latter makes it easy to extend the vocab with special tokens (which often come with instruct versions of models). To do the same with SP you need much more fiddling (i.e. operating on the protobuf format).
So, basically, HF never works with the SP tokenizer directly, even if only its vocab is provided.
(I asked in the OpenCoder repo what to do when there is only a tokenizer.model plus additional tokens in tokenizer_config.json and I don't want to use AutoTokenizer, but never got an answer. Do they all rely on AutoTokenizer?)
In other words, it's good that you are working on supporting tokenizer.json.

> For what it's worth, litgpt also gives tokenizer_config.json precedence

FYI, after looking at the codebase: previously generation_config.json had precedence, but that was changed in this commit.
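To make the resulting precedence concrete, here is a rough sketch of BOS/EOS inference with tokenizer_config.json consulted first and generation_config.json as a fallback. It is a standalone illustration of the ordering discussed in this thread, not the PR's final code, and for simplicity it assumes the config stores bare token strings (AddedToken-style dicts would need a helper like the one sketched earlier):

```python
from typing import Any, Dict, Optional, Tuple

from tokenizers import Tokenizer


def infer_bos_eos_ids(
    tokenizer: Tokenizer,
    tokenizer_config: Optional[Dict[str, Any]],
    generation_config: Optional[Dict[str, Any]],
) -> Tuple[Optional[int], Optional[int]]:
    bos_id, eos_id = None, None

    # tokenizer_config.json stores token strings; convert them to ids.
    if tokenizer_config:
        bos_token = tokenizer_config.get("bos_token")
        eos_token = tokenizer_config.get("eos_token")
        if bos_token is not None:
            bos_id = tokenizer.token_to_id(bos_token)
        if eos_token is not None:
            eos_id = tokenizer.token_to_id(eos_token)

    # generation_config.json stores the ids directly; use it only as a fallback.
    if generation_config:
        if bos_id is None:
            bos_id = generation_config.get("bos_token_id")
        if eos_id is None:
            eos_id = generation_config.get("eos_token_id")

    return bos_id, eos_id
```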
@@ -23,6 +23,7 @@ dependencies = [
    "sentencepiece",
    "tiktoken",
    "blobfile>=2",
    "tokenizers",
I had thought we would make this a dev dependency and conditionally import it in hf_tokenizer.
Yeah, I was on the fence about this. But actually I think the upstream dependencies of tokenizers are already a subset of our upstream dependencies, so it's not like we're pulling in a bunch of new transitive deps; this really is just tokenizers and nothing else. Because of that I feel OK about having it as a core dep, but I'm happy to take the approach you're suggesting if you (or others) strongly disagree.
Just documenting: the tokenizers package is roughly 3 MB (https://pypi.org/project/tokenizers/#files).
Isn't torch close to 1 GB? Anyway, if everyone would prefer to make this an optional dep, I'm fine with it.
Yeah, 3 MB is nothing. I'm fine with it as a core dep.
neat!
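For reference, the optional-dependency alternative floated above would look roughly like the sketch below. This is not what the PR ended up doing (tokenizers stayed a core dependency), and the error message is illustrative:

```python
# Guarded import so the package is only required when the HF tokenizer is used.
try:
    from tokenizers import Tokenizer
except ImportError as err:
    raise ImportError(
        "The HF base tokenizer requires the 'tokenizers' package. "
        "Install it with `pip install tokenizers`."
    ) from err
```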
@pytest.fixture
def texts(self):
    return [
        "I can see the sun. But even if I cannot see the sun, I know that it exists.",
"I am not a cultivated man, brother, but I've thought a lot about this."
@@ -4,6 +4,7 @@
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

from ._hf_tokenizer import HFTokenizer
nit: it's weird that we spell out SentencePiece and TikToken, but not HuggingFace...
OK yeah we can spell it out
from typing import Any, Dict, List, Optional

from tokenizers import Tokenizer
from torchtune.modules.transforms.tokenizers._utils import BaseTokenizer
If we add one more submodule, Satan gives us a cookie and a "good job" sticker
class HFTokenizer(BaseTokenizer):
    """
    A wrapper around Hugging Face tokenizers. This class can be used to load from a
nit: point to the repo.
Args:
    tokenizer_json_path (str): Path to tokenizer.json file
    config_path (Optional[str]): Path to tokenizer_config.json file. Default: None
Should be called tokenizer_config_json_path. Yes, I know that's a mouthful, but it pattern-matches with the name of the file, and there are already other "configs" in a Hugging Face model dir.
def __init__(
    self,
    tokenizer_json_path: str,
    config_path: Optional[str] = None,
kwargs only
        self.generation_config = json.load(f)
else:
    self.generation_config = None
self._infer_bos_eos_tokens()
Should we give them an option to explicitly set the BOS and EOS if it can't be inferred from the file? Or are they just screwed?
I am hedging a bit here for future work -- eventually I would like to have a ModelTokenizer wrapping this. In that case I think we should require at least tokenizer_config.json, and explicitly giving EOS and BOS will probably not be enough.
    config_path: Optional[str] = None,
    generation_config_path: Optional[str] = None,
):
    self.hf_tokenizer = Tokenizer.from_file(tokenizer_json_path)
Is there a reason this has to be self.hf_tokenizer instead of just tokenizer? You already know it's an HFTokenizer class.
Yeah, it can just be self.tokenizer.
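As a quick reference for what that wrapped object provides, here is a small usage sketch of the underlying tokenizers library that this class delegates to (the file path and the "<s>" token are illustrative):

```python
from tokenizers import Tokenizer

# Load a serialized HF tokenizer from a local tokenizer.json.
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")

# encode() returns an Encoding object; the integer token ids live on .ids.
encoding = tokenizer.encode("I can see the sun.")
print(encoding.ids)

# Round-trip back to text, and look up a special token's id by its string form
# (token_to_id returns None if the token is not in the vocab).
print(tokenizer.decode(encoding.ids))
print(tokenizer.token_to_id("<s>"))
```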
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2350      +/-   ##
===========================================
- Coverage   65.61%   23.35%   -42.26%
===========================================
  Files         358      365        +7
  Lines       21367    22023      +656
===========================================
- Hits        14020     5144     -8876
- Misses       7347    16879     +9532
☔ View full report in Codecov by Sentry.
bueno
def __init__(
    self,
    *,
    tokenizer_json_path: str,
Could've sworn I made a comment here, but because tokenizer_json_path is required, it can be a positional argument. My thinking is that kwargs would apply to tokenizer_config_json_path and generation_config_path.
Ah, duh, good point. I should've thought about it more carefully anyway.
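Putting the two suggestions together, the signature being discussed would look roughly like the sketch below; the name tokenizer_config_json_path is the rename proposed earlier in the review, not necessarily what landed:

```python
from typing import Optional


class HFTokenizer:
    # The required tokenizer.json path stays positional; the optional config
    # paths become keyword-only.
    def __init__(
        self,
        tokenizer_json_path: str,
        *,
        tokenizer_config_json_path: Optional[str] = None,
        generation_config_path: Optional[str] = None,
    ) -> None:
        ...
```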
Fixes #2212

This is an initial PR to support general tokenizers from Hugging Face via a tokenizer.json file. This is just a starting point to parse relevant JSON files, infer BOS and EOS, and define the encode/decode methods. Special tokens are not handled here, nor is lining things up with our tokenize_messages abstraction. The goal here is to unblock initial usage of models that contain only tokenizer.json on the HF hub. With this PR, it should be possible to define a ModelTokenizer containing HFTokenizer as a component, similar to how our other ModelTokenizers contain either SentencePieceBaseTokenizer or TikTokenBaseTokenizer. This also means that we add a dependency on the tokenizers library, which I think is a necessity to support a broader set of models.

Test plan

Added a unit test to verify parity on our toy tiktoken tokenizer. The HF tokenizer.json version of this tokenizer was generated with this script.
Will also provide a script comparing a proper model tokenizer across torchtune and HF implementations (but this shouldn't go in our CI due to the larger tokenizer model size).
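The kind of cross-implementation parity check described in the test plan could look roughly like the sketch below; it compares the raw tokenizers backend against transformers' AutoTokenizer for an illustrative local model directory and is not the actual script referenced in the PR:

```python
from tokenizers import Tokenizer
from transformers import AutoTokenizer

model_dir = "path/to/hf_model"  # illustrative local HF model directory
text = "I can see the sun. But even if I cannot see the sun, I know that it exists."

fast = Tokenizer.from_file(f"{model_dir}/tokenizer.json")
reference = AutoTokenizer.from_pretrained(model_dir)

# Compare raw token ids with special tokens disabled on both sides; for a
# well-formed tokenizer directory these should agree.
fast_ids = fast.encode(text, add_special_tokens=False).ids
ref_ids = reference.encode(text, add_special_tokens=False)
print(fast_ids == ref_ids)
```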