Expose getEncodingNameForModel in wasm version? #123

Open
jens-f opened this issue Nov 14, 2024 · 0 comments · May be fixed by #136

Comments

jens-f commented Nov 14, 2024

Would it be possible to expose the function getEncodingNameForModel, which is available in the JavaScript version, in the WASM version of the library as well?

The WASM version currently exposes encoding_for_model and get_encoding, and both create a tokenizer instance immediately. For our use case, we'd like to first translate the model name to the underlying encoding name and then instantiate the tokenizer with get_encoding ourselves. This would allow us to cache and reuse a single tokenizer across multiple models that use the same encoding.
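
For illustration, here is a minimal sketch of the caching pattern described above. It does not call the library: `resolveEncodingName` stands in for the requested model-to-encoding lookup, `createTokenizer` stands in for `get_encoding`, and the `Tokenizer` type, the small model table, and the `instancesCreated` counter are all hypothetical names made up for this example.

```typescript
// Hypothetical stand-in for a tokenizer instance returned by get_encoding.
interface Tokenizer {
  encodingName: string;
}

// Illustrative subset of a model -> encoding table (an assumption for this
// sketch, not the library's actual table).
const MODEL_TO_ENCODING = new Map<string, string>([
  ["gpt-4", "cl100k_base"],
  ["gpt-3.5-turbo", "cl100k_base"],
  ["gpt-4o", "o200k_base"],
]);

// Stand-in for the requested lookup function; throws for unknown models.
function resolveEncodingName(model: string): string {
  const name = MODEL_TO_ENCODING.get(model);
  if (name === undefined) {
    throw new Error(`Unknown model: ${model}`);
  }
  return name;
}

// Stand-in for get_encoding; counts how many instances get created.
let instancesCreated = 0;
function createTokenizer(encodingName: string): Tokenizer {
  instancesCreated += 1;
  return { encodingName };
}

// Cache keyed by encoding name rather than model name, so models that share
// an encoding also share one tokenizer instance.
const tokenizerCache = new Map<string, Tokenizer>();

function tokenizerForModel(model: string): Tokenizer {
  const encodingName = resolveEncodingName(model);
  let tokenizer = tokenizerCache.get(encodingName);
  if (tokenizer === undefined) {
    tokenizer = createTokenizer(encodingName);
    tokenizerCache.set(encodingName, tokenizer);
  }
  return tokenizer;
}
```

With only encoding_for_model available, the cache would have to be keyed by model name, so each model would get its own instance even when the encoding is shared.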

noseworthy added a commit to noseworthy/tiktoken that referenced this issue Feb 16, 2025
The `tiktoken-js` library includes a very helpful function,
`getEncodingNameForModel()`. In the Rust-based `tiktoken` package,
the equivalent logic is buried in the implementation of
`encoding_for_model()`.

This function is very useful when implementing an encoding cache based
on the model used. In this case, having a mapping from model ->
encoding and then caching based on the encoding name conserves
resources since so many models re-use the same encoding.

I've refactored the TypeScript definition generation a little so
that all types and the references to them are declared in the same
block, which avoids use-before-declaration warnings.

Finally, I've exposed a new `get_encoding_name_for_model()` function
that behaves similarly to the one in the `tiktoken-js` package, and used
it inside of `encoding_for_model()`. I also added a test to ensure that
this function can be called from TypeScript code and that it throws
an exception for invalid model names.

Fixes: dqbd#123
noseworthy linked a pull request Feb 16, 2025 that will close this issue
noseworthy added a commit to noseworthy/tiktoken that referenced this issue Feb 17, 2025
The `tiktoken-js` library includes a very helpful function,
`getEncodingNameForModel()`. In the Rust-based `tiktoken` package,
the equivalent logic is buried in the implementation of
`encoding_for_model()`.

This function is very useful when implementing an encoding cache based
on the model used. In this case, having a mapping from model ->
encoding and then caching based on the encoding name conserves
resources since so many models re-use the same encoding.

I've exposed a new `get_encoding_name_for_model()` function
that behaves similarly to the one in the `tiktoken-js` package, and used
it inside of `encoding_for_model()`.

Finally, I've also added a test to ensure that this function can be
called from TypeScript code and that it throws an exception for
invalid model names.

Fixes: dqbd#123