
HF tokenizers: initial base tokenizer support #2350

Merged (11 commits) on Feb 13, 2025

Conversation

@ebsmothers (Contributor) commented Feb 5, 2025

Fixes #2212

This is an initial PR to support general tokenizers from Hugging Face via a tokenizer.json file. This is just a starting point to parse relevant JSON files, infer BOS and EOS, and define the encode/decode methods. Special tokens are not handled here, nor is lining things up with our tokenize_messages abstraction. The goal here is to unblock initial usage of models that contain only tokenizer.json on the HF hub. With this PR, it should be possible to define a ModelTokenizer containing HFTokenizer as a component, similar to how our other ModelTokenizers contain either SentencePieceBaseTokenizer or TikTokenBaseTokenizer. This also means that we add a dependency on the tokenizers library, which I think is a necessity to support a broader set of models.
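For context, a minimal sketch of what the underlying tokenizers usage looks like (an illustration only, not the PR's exact code; the file path and sample text are made up):

from tokenizers import Tokenizer

# Load a Hugging Face tokenizer.json directly (hypothetical path).
tokenizer = Tokenizer.from_file("/path/to/model/tokenizer.json")

# Plain encode/decode round trip; the PR additionally infers BOS/EOS ids
# from tokenizer_config.json / generation_config.json rather than relying
# on the library to add special tokens.
encoding = tokenizer.encode("Hello world", add_special_tokens=False)
print(encoding.ids)
print(tokenizer.decode(encoding.ids))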

Test plan

Added a unit test to verify parity on our toy tiktoken tokenizer. The HF tokenizer.json version of this tokenizer was generated with this script.

pytest tests/torchtune/modules/transforms/tokenizers/test_hf_tokenizer.py
...
========== 4 passed, 1 warning in 0.08s ============

Will also provide a script comparing a proper model tokenizer across torchtune and HF implementations (but this shouldn't go in our CI due to the larger tokenizer model size).
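The general shape of such a parity check might look like the following (a hedged sketch only, comparing the tokenizers backend directly against the transformers fast tokenizer rather than exercising the new torchtune wrapper; paths and texts are made up):

from tokenizers import Tokenizer
from transformers import AutoTokenizer

model_dir = "/path/to/local/model"  # hypothetical directory containing tokenizer.json

backend = Tokenizer.from_file(f"{model_dir}/tokenizer.json")
auto = AutoTokenizer.from_pretrained(model_dir)

for text in ["I can see the sun.", "But even if I cannot see the sun, I know that it exists."]:
    lhs = backend.encode(text, add_special_tokens=False).ids
    rhs = auto(text, add_special_tokens=False)["input_ids"]
    assert lhs == rhs, f"Mismatch for {text!r}: {lhs} vs {rhs}"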


pytorch-bot bot commented Feb 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2350

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 6407277 with merge base d5d85c5 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 5, 2025
@felipemello1 (Contributor) left a comment


just checked _hf_tokenizer, haven't looked at the tests yet

Comment on lines 20 to 23
path (str): Path to tokenizer.json file
config_path (Optional[str]): Path to tokenizer_config.json file. Default: None
generation_config_path (Optional[str]): Path to generation_config.json file.
    Default: None
Contributor


We have 3 paths. The last 2 are more specific, which is good. I think we could rename path -> tokenizer_json_path to make it explicit.

Comment on lines 71 to 81
if self.generation_config:
    self.bos_id = self.generation_config.get("bos_token_id")
    self.eos_id = self.generation_config.get("eos_token_id")

if self.config:
    bos_token = self._get_token_from_config(self.config, "bos_token")
    eos_token = self._get_token_from_config(self.config, "eos_token")
    if bos_token is not None and self.bos_id is None:
        self.bos_id = self.hf_tokenizer.token_to_id(bos_token)
    if eos_token is not None and self.eos_id is None:
        self.eos_id = self.eos_id or self.hf_tokenizer.token_to_id(eos_token)
Contributor


I don't know much about the files, should we add a sanity check to see if config.eos == generation_config.bos_token_id?

Contributor Author


Sorry, maybe I'm confused... why would we expect eos to equal bos? But if you mean checking eos == eos and bos == bos across config and generation config, I think we can add that check.

Contributor Author


Update: I opted not to add the check, mainly because it's just some extra logic to validate something that shouldn't happen (the user has config and generation config that don't line up). In that case, we should just be explicit in our docs that we prefer config.json whenever it's passed. Lmk if any concerns with that though

                self.generation_config = json.load(f)
        else:
            self.generation_config = None
        self._infer_bos_eos_tokens()
@felipemello1 (Contributor), Feb 6, 2025


Let's raise an error if self.generation_config is None and self.config is None.

Also, apparently we always use one or the other. We could simplify the rest of the code by doing something like:

self.config = generation_config or config

Edit: actually, I don't think we can do that because they have different formats.

Contributor Author


Yeah, I think it's simplest if we just use config. At the same time, I read (on some random forum) that generation config should take precedence over config, so currently I'm doing that. One thing that's nice about generation config is that it directly gives the int value for eos and bos, whereas the regular config tends to give the string value, which then needs to be converted. Aside from that, config has the chat template, special tokens, and all that other stuff we would need. So yeah, right now I am hedging a bit and keeping both around because I suspect we may need both at different times.
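To illustrate the difference being described (field names are the standard Hugging Face ones; the concrete values and path are made-up examples):

from tokenizers import Tokenizer

# Made-up example contents of the two config files.
generation_config = {"bos_token_id": 1, "eos_token_id": 2}    # already integer ids
tokenizer_config = {"bos_token": "<s>", "eos_token": "</s>"}  # token strings
# (In some models tokenizer_config.json stores dicts like {"content": "</s>", ...} instead.)

# The string form has to be mapped to an id via the loaded tokenizer:
tokenizer = Tokenizer.from_file("/path/to/tokenizer.json")  # hypothetical path
eos_id = tokenizer.token_to_id(tokenizer_config["eos_token"])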

Contributor


Cite your sources.

Contributor Author


Source

Like I said, maybe not the actual source-of-truth here. Especially as I look through the PreTrainedTokenizerBase class -- they seem to really only rely on tokenizer_config.json (ref). For what it's worth, litgpt also gives tokenizer_config.json precedence, and it seems to work OK for them (also shout-out to @Andrei-Aksionov who helped give me some pointers here).

So yeah perhaps I will switch the order to read from tokenizer config, then generation config. I do trust the code itself + litgpt's implementation more than that one specific issue comment.


Oh, thanks for the shout-out 😊.

It's been a long time since I worked with tokenizers, but yeah, we decided to rely on tokenizer.json: if it's in the repo, use HF Tokenizers. Before that, tokenizer.model (and thus SentencePiece) had precedence.

If you take a closer look at the init process of AutoTokenizer, you can see that if there is only a tokenizer.model file (an SP vocab), that vocab is loaded and the SP tokenizer is converted to the HF Tokenizers format. The latter makes it possible to extend the vocab with special tokens (which often come with instruct versions of models). Doing the same with SP requires much more fiddling (i.e. operating on the protobuf format).

So, basically, HF never works with the SP tokenizer directly, even if only its vocab is provided.
(I asked in the OpenCoder repo what to do when there is only a tokenizer.model plus additional tokens in tokenizer_config.json and I don't want to use AutoTokenizer, but never got an answer. Do they all rely on AutoTokenizer?)

In other words, it's good that you are working on supporting tokenizer.json.
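For reference, a hedged sketch of the conversion path described above (not code from this PR; it assumes transformers is installed and a local model directory that ships only a SentencePiece tokenizer.model):

from transformers import AutoTokenizer

# Loading the "fast" tokenizer converts the SentencePiece vocab into the
# HF Tokenizers format; backend_tokenizer is the underlying tokenizers.Tokenizer.
tok = AutoTokenizer.from_pretrained("/path/to/sp_only_model", use_fast=True)
tok.backend_tokenizer.save("/path/to/sp_only_model/tokenizer.json")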


For what it's worth, litgpt also gives tokenizer_config.json precedence

FYI, after looking at the codebase: previously generation_config.json had precedence, but it was changed in this commit.

@ebsmothers mentioned this pull request Feb 6, 2025
@@ -23,6 +23,7 @@ dependencies = [
"sentencepiece",
"tiktoken",
"blobfile>=2",
"tokenizers",
Contributor


I had thought we would make this a dev dependency and conditionally import this in hf_tokenizer
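A minimal sketch of the conditional-import pattern being suggested here (a standalone illustration, not the PR's code; the PR ultimately keeps tokenizers as a core dependency):

try:
    from tokenizers import Tokenizer
except ImportError:
    Tokenizer = None


class HFTokenizer:
    def __init__(self, tokenizer_json_path: str):
        if Tokenizer is None:
            raise ImportError(
                "HFTokenizer requires the 'tokenizers' package; "
                "install it with `pip install tokenizers`."
            )
        self.tokenizer = Tokenizer.from_file(tokenizer_json_path)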

Contributor Author


Yeah, I was on the fence about this. But actually I think the upstream dependencies of tokenizers are already a subset of our upstream dependencies, so it's not like we are pulling in a bunch of new transitive deps; this is really just tokenizers and nothing else. Because of that I feel OK about having it as a core dep, but happy to take the approach you're suggesting if you (or others) strongly disagree.

Contributor


Just documenting: the tokenizers package is roughly 3 MB (https://pypi.org/project/tokenizers/#files).

Contributor Author


Isn't torch close to 1 GB? Anyway, if everyone would prefer to make this an optional dep I'm fine with it.

Contributor


Yeah, 3 MB is nothing. I'm fine with it as a core dep.

@joecummings (Contributor) left a comment


neat!


    @pytest.fixture
    def texts(self):
        return [
            "I can see the sun. But even if I cannot see the sun, I know that it exists.",
Contributor


"I am not a cultivated man, brother, but I've thought a lot about this."

@@ -4,6 +4,7 @@
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

from ._hf_tokenizer import HFTokenizer
Contributor


nit: it's weird that we spell out SentencePiece and TikToken, but not HuggingFace...

Contributor Author


OK yeah we can spell it out

from typing import Any, Dict, List, Optional

from tokenizers import Tokenizer
from torchtune.modules.transforms.tokenizers._utils import BaseTokenizer
Contributor


If we add one more submodule, Satan gives us a cookie and a "good job" sticker


class HFTokenizer(BaseTokenizer):
    """
    A wrapper around Hugging Face tokenizers. This class can be used to load from a
Contributor


nit: point to the repo.


Args:
    tokenizer_json_path (str): Path to tokenizer.json file
    config_path (Optional[str]): Path to tokenizer_config.json file. Default: None
Contributor


Should be called tokenizer_config_json_path. Yes, I know that's a mouthful, but it pattern matches with the name of the file and there's already other "configs" in a Hugging Face model dir.

    def __init__(
        self,
        tokenizer_json_path: str,
        config_path: Optional[str] = None,
Contributor


kwargs only

                self.generation_config = json.load(f)
        else:
            self.generation_config = None
        self._infer_bos_eos_tokens()
Contributor


Should we give them an option to explicitly set the BOS and EOS if it can't be inferred from the file? Or are they just screwed?

Contributor Author


I am hedging a bit here for future work -- eventually I would like to have a ModelTokenizer wrapping this. In that case I think we should require at least tokenizer_config.json and explicitly giving EOS and BOS will probably not be enough

        config_path: Optional[str] = None,
        generation_config_path: Optional[str] = None,
    ):
        self.hf_tokenizer = Tokenizer.from_file(tokenizer_json_path)
Contributor


Is there a reason this has to be self.hf_tokenizer instead of just tokenizer? You already know it's a HFTokenizer class.

Contributor Author


Yeah it can just be self.tokenizer

codecov-commenter commented Feb 11, 2025

Codecov Report

Attention: Patch coverage is 30.43478% with 64 lines in your changes missing coverage. Please review.

Project coverage is 23.35%. Comparing base (9da35c7) to head (7e70de7).
Report is 3 commits behind head on main.

Files with missing lines | Patch % | Lines
...une/modules/transforms/tokenizers/_hf_tokenizer.py | 18.86% | 43 Missing ⚠️
...modules/transforms/tokenizers/test_hf_tokenizer.py | 44.73% | 21 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2350       +/-   ##
===========================================
- Coverage   65.61%   23.35%   -42.26%     
===========================================
  Files         358      365        +7     
  Lines       21367    22023      +656     
===========================================
- Hits        14020     5144     -8876     
- Misses       7347    16879     +9532     

☔ View full report in Codecov by Sentry.

@joecummings (Contributor) left a comment


bueno

    def __init__(
        self,
        *,
        tokenizer_json_path: str,
Contributor


Could've sworn I made a comment here, but b/c tokenizer_json_path is required, it can be a positional argument. My thinking is that kwargs would apply to tokenizer_config_json_path and generation_config_path

Contributor Author


Ah duh, good point. I should've thought about it more carefully anyways
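The resulting signature would look roughly like this (a sketch only; the config parameter names follow the suggestions in this thread and may not match the merged code exactly):

    def __init__(
        self,
        tokenizer_json_path: str,  # required, so positional is fine
        *,
        tokenizer_config_json_path: Optional[str] = None,  # keyword-only
        generation_config_path: Optional[str] = None,      # keyword-only
    ):
        ...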

@ebsmothers merged commit e1386d1 into pytorch:main Feb 13, 2025
17 checks passed

Successfully merging this pull request may close these issues.

[Question] what to do when model doesn't have tokenizer.model?
7 participants