HF tokenizers: initial base tokenizer support #2350

Merged · 11 commits · Feb 13, 2025
1 change: 1 addition & 0 deletions docs/source/api_ref_modules.rst
@@ -50,6 +50,7 @@ model specific tokenizers.

transforms.tokenizers.SentencePieceBaseTokenizer
transforms.tokenizers.TikTokenBaseTokenizer
transforms.tokenizers.HuggingFaceBaseTokenizer
transforms.tokenizers.ModelTokenizer
transforms.tokenizers.BaseTokenizer

24 changes: 24 additions & 0 deletions docs/source/basics/tokenizers.rst
@@ -222,6 +222,30 @@ to do the actual encoding and decoding.
print(sp_tokenizer.encode(text))
# [1, 6312, 28709, 1526, 2]

.. _hf_tokenizers:

Using Hugging Face tokenizers
-----------------------------

Sometimes tokenizers hosted on Hugging Face do not contain files compatible with one of torchtune's
existing tokenizer classes. In this case, we provide :class:`~torchtune.modules.transforms.tokenizers.HuggingFaceBaseTokenizer`
to parse the Hugging Face ``tokenizer.json`` file and define the correct ``encode`` and ``decode`` methods to
match torchtune's other :class:`~torchtune.modules.transforms.tokenizers.BaseTokenizer` classes. You should also pass the path to
either ``tokenizer_config.json`` or ``generation_config.json``, which will allow torchtune to infer BOS and EOS tokens.
Continuing with the Mistral example:

.. code-block:: python

hf_tokenizer = HuggingFaceBaseTokenizer(
tokenizer_json_path="/tmp/Mistral-7B-v0.1/tokenizer.json",
tokenizer_config_json_path="/tmp/Mistral-7B-v0.1/tokenizer_config.json",
)

text = "hello world"

print(hf_tokenizer.encode(text))
# [1, 6312, 28709, 1526, 2]
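To illustrate the BOS/EOS inference the docs mention, here is a hedged sketch of how special tokens can be read out of a Hugging Face ``tokenizer_config.json``. The helper ``infer_special_tokens`` is hypothetical (not torchtune's actual implementation); config entries may be plain strings or dicts with a ``content`` field, and both forms are handled below.

```python
# Hypothetical helper showing what a tokenizer_config.json's special-token
# entries look like; torchtune's actual parsing logic may differ.
def infer_special_tokens(config):
    def _token_str(entry):
        # Entries may be plain strings or dicts like {"content": "<s>", ...}
        if isinstance(entry, dict):
            return entry.get("content")
        return entry

    return _token_str(config.get("bos_token")), _token_str(config.get("eos_token"))


# Example config resembling a Mistral-style tokenizer_config.json
config = {
    "bos_token": "<s>",
    "eos_token": {"content": "</s>", "single_word": False},
}
print(infer_special_tokens(config))  # ('<s>', '</s>')
```

Once BOS and EOS are known, the base tokenizer can prepend and append them during ``encode``, which is why the token IDs above match the SentencePiece example earlier in the doc.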

.. _model_tokenizers:

Model tokenizers
1 change: 1 addition & 0 deletions pyproject.toml
@@ -23,6 +23,7 @@ dependencies = [
"sentencepiece",
"tiktoken",
"blobfile>=2",
"tokenizers",
Contributor:

I had thought we would make this a dev dependency and conditionally import this in hf_tokenizer

Contributor (Author):

Yeah, I was on the fence about this. But actually I think the upstream dependencies of tokenizers are already a subset of our upstream dependencies, so it's not like we are pulling in a bunch of new transitive deps; this is really just tokenizers and nothing else. Because of that I feel OK about having it as a core dep, but happy to take the approach you're suggesting if you (or others) strongly disagree.

Contributor:

Just documenting: the tokenizers package is roughly 3 MB (https://pypi.org/project/tokenizers/#files).

Contributor (Author):

Isn't torch close to 1 GB? Anyway, if everyone would prefer to make this an optional dep, I am fine with it.

Contributor:

Yeah, 3 MB is nothing. I'm fine with it as a core dep.


# Miscellaneous
"numpy",
1 change: 1 addition & 0 deletions tests/assets/generation_config.json
@@ -0,0 +1 @@
{"bos_token_id": 0, "eos_token_id": -1}
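This fixture supplies BOS/EOS token IDs via ``generation_config.json``, the fallback path described in the docs. A hedged sketch of reading such a file follows; ``infer_ids`` is hypothetical, and the treatment of the fixture's ``-1`` as "no EOS" is an assumption about what the test is exercising, not torchtune's confirmed behavior.

```python
import json

# Hypothetical reader for a generation_config.json like the fixture above;
# torchtune's real inference logic may differ.
def infer_ids(generation_config):
    bos = generation_config.get("bos_token_id")
    eos = generation_config.get("eos_token_id")
    # Assumption: a negative id (like the fixture's -1) marks the token
    # as unset, so we map it to None here.
    if isinstance(eos, int) and eos < 0:
        eos = None
    return bos, eos


fixture = json.loads('{"bos_token_id": 0, "eos_token_id": -1}')
print(infer_ids(fixture))  # (0, None)
```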