
HF tokenizers: initial base tokenizer support #2350

Merged (11 commits) on Feb 13, 2025

Conversation

@ebsmothers (Contributor) commented Feb 5, 2025

Fixes #2212

This is an initial PR to support general tokenizers from Hugging Face via a tokenizer.json file. This is just a starting point to parse relevant JSON files, infer BOS and EOS, and define the encode/decode methods. Special tokens are not handled here, nor is lining things up with our tokenize_messages abstraction. The goal here is to unblock initial usage of models that contain only tokenizer.json on the HF hub. With this PR, it should be possible to define a ModelTokenizer containing HFTokenizer as a component, similar to how our other ModelTokenizers contain either SentencePieceBaseTokenizer or TikTokenBaseTokenizer. This also means that we add a dependency on the tokenizers library, which I think is a necessity to support a broader set of models.
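For context, a minimal sketch of what the underlying tokenizers usage looks like (an illustration only, not the PR's exact code; the file path and sample text are made up):

from tokenizers import Tokenizer

# Load a Hugging Face tokenizer.json directly (hypothetical path).
tokenizer = Tokenizer.from_file("/path/to/model/tokenizer.json")

# Plain encode/decode round trip; the PR additionally infers BOS/EOS ids
# from tokenizer_config.json / generation_config.json rather than relying
# on the library to add special tokens.
encoding = tokenizer.encode("Hello world", add_special_tokens=False)
print(encoding.ids)
print(tokenizer.decode(encoding.ids))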

Test plan

Added a unit test to verify parity on our toy tiktoken tokenizer. The HF tokenizer.json version of this tokenizer was generated with this script.

pytest tests/torchtune/modules/transforms/tokenizers/test_hf_tokenizer.py
...
========== 4 passed, 1 warning in 0.08s ============

Will also provide a script comparing a proper model tokenizer across torchtune and HF implementations (but this shouldn't go in our CI due to the larger tokenizer model size).
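The general shape of such a parity check might look like the following (a hedged sketch only, comparing the tokenizers backend directly against the transformers fast tokenizer rather than exercising the new torchtune wrapper; paths and texts are made up):

from tokenizers import Tokenizer
from transformers import AutoTokenizer

model_dir = "/path/to/local/model"  # hypothetical directory containing tokenizer.json

backend = Tokenizer.from_file(f"{model_dir}/tokenizer.json")
auto = AutoTokenizer.from_pretrained(model_dir)

for text in ["I can see the sun.", "But even if I cannot see the sun, I know that it exists."]:
    lhs = backend.encode(text, add_special_tokens=False).ids
    rhs = auto(text, add_special_tokens=False)["input_ids"]
    assert lhs == rhs, f"Mismatch for {text!r}: {lhs} vs {rhs}"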


pytorch-bot bot commented Feb 5, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2350

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 6407277 with merge base d5d85c5 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 5, 2025
@felipemello1 (Contributor) left a comment


just checked _hf_tokenizer, haven't looked at the tests yet

Comment on lines 20 to 23
path (str): Path to tokenizer.json file
config_path (Optional[str]): Path to tokenizer_config.json file. Default: None
generation_config_path (Optional[str]): Path to generation_config.json file.
    Default: None
Contributor


We have 3 paths. The last 2 are more specific, which is good. I think we could rename path -> tokenizer_json_path to make it explicit.

Comment on lines 71 to 81
if self.generation_config:
    self.bos_id = self.generation_config.get("bos_token_id")
    self.eos_id = self.generation_config.get("eos_token_id")

if self.config:
    bos_token = self._get_token_from_config(self.config, "bos_token")
    eos_token = self._get_token_from_config(self.config, "eos_token")
    if bos_token is not None and self.bos_id is None:
        self.bos_id = self.hf_tokenizer.token_to_id(bos_token)
    if eos_token is not None and self.eos_id is None:
        self.eos_id = self.eos_id or self.hf_tokenizer.token_to_id(eos_token)
Contributor


I don't know much about the files, should we add a sanity check to see if config.eos == generation_config.bos_token_id?

Contributor Author


Sorry, maybe I'm confused... why would we expect eos to equal bos? But if you mean checking eos == eos and bos == bos across config and generation config, I think we can add that check.

Contributor Author


Update: I opted not to add the check, mainly because it's just some extra logic to validate something that shouldn't happen (the user has config and generation config that don't line up). In that case, we should just be explicit in our docs that we prefer config.json whenever it's passed. Lmk if any concerns with that though

                self.generation_config = json.load(f)
        else:
            self.generation_config = None
        self._infer_bos_eos_tokens()
@felipemello1 (Contributor), Feb 6, 2025


Let's raise an error if self.generation_config is None and self.config is None.

Also, apparently we always use one or the other. We could simplify the rest of the code by doing something like:

self.config = generation_config or config

Edit: actually, I don't think we can do that because they have different formats.

Contributor Author


Yeah, I think it's simplest if we just use config. At the same time, I read (on some random forum) that generation config should take precedence over config, so currently I'm doing that. One thing that's nice about generation config is that it directly gives the int value for eos and bos, whereas the regular config tends to give the string value, which then needs to be converted. Aside from that, config has the chat template, special tokens, and all that other stuff we would need. So yeah, right now I am hedging a bit and keeping both around because I suspect we may need both at different times.
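To illustrate the difference being described (field names are the standard Hugging Face ones; the concrete values and path are made-up examples):

from tokenizers import Tokenizer

# Made-up example contents of the two config files.
generation_config = {"bos_token_id": 1, "eos_token_id": 2}    # already integer ids
tokenizer_config = {"bos_token": "<s>", "eos_token": "</s>"}  # token strings
# (In some models tokenizer_config.json stores dicts like {"content": "</s>", ...} instead.)

# The string form has to be mapped to an id via the loaded tokenizer:
tokenizer = Tokenizer.from_file("/path/to/tokenizer.json")  # hypothetical path
eos_id = tokenizer.token_to_id(tokenizer_config["eos_token"])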

Contributor


Cite your sources.

Contributor Author


Source

Like I said, maybe not the actual source-of-truth here. Especially as I look through the PreTrainedTokenizerBase class -- they seem to really only rely on tokenizer_config.json (ref). For what it's worth, litgpt also gives tokenizer_config.json precedence, and it seems to work OK for them (also shout-out to @Andrei-Aksionov who helped give me some pointers here).

So yeah perhaps I will switch the order to read from tokenizer config, then generation config. I do trust the code itself + litgpt's implementation more than that one specific issue comment.


Oh, thanks for the shout-out 😊.

It's been a long time since I worked with tokenizers, but yeah, we decided to rely on tokenizer.json: if it's in the repo, use HF Tokenizers. Before that, tokenizer.model (and thus SentencePiece) had precedence.

If you take a closer look at the init process of AutoTokenizer, you can see that if there is only a tokenizer.model file (an SP vocab), that vocab is loaded and the SP tokenizer is converted to the HF Tokenizers format. The latter makes it possible to extend the vocab with special tokens (which often come with instruct versions of models). Doing the same with SP requires much more fiddling (i.e. operating on the protobuf format).

So, basically, HF never works with the SP tokenizer directly, even if only its vocab is provided.
(I asked in the OpenCoder repo what to do when there is only a tokenizer.model plus additional tokens in tokenizer_config.json and I don't want to use AutoTokenizer, but never got an answer. Do they all rely on AutoTokenizer?)

In other words, it's good that you are working on supporting tokenizer.json.
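For reference, a hedged sketch of the conversion path described above (not code from this PR; it assumes transformers is installed and a local model directory that ships only a SentencePiece tokenizer.model):

from transformers import AutoTokenizer

# Loading the "fast" tokenizer converts the SentencePiece vocab into the
# HF Tokenizers format; backend_tokenizer is the underlying tokenizers.Tokenizer.
tok = AutoTokenizer.from_pretrained("/path/to/sp_only_model", use_fast=True)
tok.backend_tokenizer.save("/path/to/sp_only_model/tokenizer.json")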


For what it's worth, litgpt also gives tokenizer_config.json precedence

FYI, after looking at the codebase: previously generation_config.json had precedence, but it was changed in this commit.

@ebsmothers mentioned this pull request Feb 6, 2025
@@ -23,6 +23,7 @@ dependencies = [
"sentencepiece",
"tiktoken",
"blobfile>=2",
"tokenizers",
Contributor


I had thought we would make this a dev dependency and conditionally import this in hf_tokenizer
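A minimal sketch of the conditional-import pattern being suggested here (a standalone illustration, not the PR's code; the PR ultimately keeps tokenizers as a core dependency):

try:
    from tokenizers import Tokenizer
except ImportError:
    Tokenizer = None


class HFTokenizer:
    def __init__(self, tokenizer_json_path: str):
        if Tokenizer is None:
            raise ImportError(
                "HFTokenizer requires the 'tokenizers' package; "
                "install it with `pip install tokenizers`."
            )
        self.tokenizer = Tokenizer.from_file(tokenizer_json_path)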

Contributor Author


Yeah, I was on the fence about this. But actually I think the upstream dependencies of tokenizers are already a subset of our upstream dependencies, so it's not like we are pulling in a bunch of new transitive deps; this is really just tokenizers and nothing else. Because of that I feel OK about having it as a core dep, but happy to take the approach you're suggesting if you (or others) strongly disagree.

Contributor


Just documenting: the tokenizers package is roughly 3 MB (https://pypi.org/project/tokenizers/#files).

Contributor Author


Isn't torch close to 1 GB? Anyway, if everyone would prefer to make this an optional dep I'm fine with it.

Contributor


Yeah, 3 MB is nothing. I'm fine with it as a core dep.

@joecummings (Contributor) left a comment


neat!


    @pytest.fixture
    def texts(self):
        return [
            "I can see the sun. But even if I cannot see the sun, I know that it exists.",
Contributor


"I am not a cultivated man, brother, but I've thought a lot about this."

@@ -4,6 +4,7 @@
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

from ._hf_tokenizer import HFTokenizer
Contributor


nit: it's weird that we spell out SentencePiece and TikToken, but not HuggingFace...

Contributor Author


OK yeah we can spell it out

from typing import Any, Dict, List, Optional

from tokenizers import Tokenizer
from torchtune.modules.transforms.tokenizers._utils import BaseTokenizer
Contributor


If we add one more submodule, Satan gives us a cookie and a "good job" sticker


class HFTokenizer(BaseTokenizer):
    """
    A wrapper around Hugging Face tokenizers. This class can be used to load from a
Contributor


nit: point to the repo.


Args:
    tokenizer_json_path (str): Path to tokenizer.json file
    config_path (Optional[str]): Path to tokenizer_config.json file. Default: None
Contributor


Should be called tokenizer_config_json_path. Yes, I know that's a mouthful, but it pattern matches with the name of the file and there's already other "configs" in a Hugging Face model dir.

    def __init__(
        self,
        tokenizer_json_path: str,
        config_path: Optional[str] = None,
Contributor


kwargs only

                self.generation_config = json.load(f)
        else:
            self.generation_config = None
        self._infer_bos_eos_tokens()
Contributor


Should we give them an option to explicitly set the BOS and EOS if it can't be inferred from the file? Or are they just screwed?

Contributor Author


I am hedging a bit here for future work -- eventually I would like to have a ModelTokenizer wrapping this. In that case I think we should require at least tokenizer_config.json and explicitly giving EOS and BOS will probably not be enough

        config_path: Optional[str] = None,
        generation_config_path: Optional[str] = None,
    ):
        self.hf_tokenizer = Tokenizer.from_file(tokenizer_json_path)
Contributor


Is there a reason this has to be self.hf_tokenizer instead of just tokenizer? You already know it's a HFTokenizer class.

Contributor Author


Yeah it can just be self.tokenizer

codecov-commenter commented Feb 11, 2025

Codecov Report

Attention: Patch coverage is 30.43478% with 64 lines in your changes missing coverage. Please review.

Project coverage is 23.35%. Comparing base (9da35c7) to head (7e70de7).
Report is 3 commits behind head on main.

Files with missing lines | Patch % | Lines
...une/modules/transforms/tokenizers/_hf_tokenizer.py | 18.86% | 43 Missing ⚠️
...modules/transforms/tokenizers/test_hf_tokenizer.py | 44.73% | 21 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2350       +/-   ##
===========================================
- Coverage   65.61%   23.35%   -42.26%     
===========================================
  Files         358      365        +7     
  Lines       21367    22023      +656     
===========================================
- Hits        14020     5144     -8876     
- Misses       7347    16879     +9532     

☔ View full report in Codecov by Sentry.

@joecummings (Contributor) left a comment


bueno

    def __init__(
        self,
        *,
        tokenizer_json_path: str,
Contributor


Could've sworn I made a comment here, but b/c tokenizer_json_path is required, it can be a positional argument. My thinking is that kwargs would apply to tokenizer_config_json_path and generation_config_path

Contributor Author


Ah duh, good point. I should've thought about it more carefully anyways
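The resulting signature would look roughly like this (a sketch only; the config parameter names follow the suggestions in this thread and may not match the merged code exactly):

    def __init__(
        self,
        tokenizer_json_path: str,  # required, so positional is fine
        *,
        tokenizer_config_json_path: Optional[str] = None,  # keyword-only
        generation_config_path: Optional[str] = None,      # keyword-only
    ):
        ...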

@ebsmothers merged commit e1386d1 into pytorch:main Feb 13, 2025
17 checks passed

Successfully merging this pull request may close these issues.

[Question] what to do when model doesn't have tokenizer.model?
7 participants