Expose getEncodingNameForModel in wasm version? #123
Would it be possible to expose the function `getEncodingNameForModel`, which is available in the JavaScript version, in the wasm version of the library as well? The wasm version currently exposes `encoding_for_model` and `get_encoding`, and both create a tokenizer instance immediately. For our use case, we'd like to first translate the model name to the underlying encoding and then instantiate the tokenizer with `get_encoding` ourselves. This would let us cache and reuse a single tokenizer across multiple models that use the same encoding.
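To make the request concrete, here is a minimal sketch of that caching pattern. `get_encoding` and `Tiktoken` are the wasm package's existing exports; `get_encoding_name_for_model` is only declared as a stub here, since exposing it is exactly what this issue asks for:

```ts
// Sketch only: get_encoding and Tiktoken are existing exports of the wasm
// package; get_encoding_name_for_model is the function this issue proposes,
// so it is declared as a stub rather than imported.
import { get_encoding, Tiktoken } from "tiktoken";

// Stand-in for the proposed export; today this model -> encoding mapping
// only exists inside the implementation of encoding_for_model().
declare function get_encoding_name_for_model(model: string): string;

// One tokenizer per encoding name, shared by every model using that encoding.
const tokenizers = new Map<string, Tiktoken>();

function tokenizerForModel(model: string): Tiktoken {
  const encodingName = get_encoding_name_for_model(model);
  let tokenizer = tokenizers.get(encodingName);
  if (tokenizer === undefined) {
    // get_encoding takes an encoding name such as "cl100k_base".
    tokenizer = get_encoding(encodingName as Parameters<typeof get_encoding>[0]);
    tokenizers.set(encodingName, tokenizer);
  }
  return tokenizer;
}
```

Because the cache is keyed on the encoding name rather than the model name, it holds a single `Tiktoken` instance for all models that share an encoding, instead of one instance per model.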
Comments

noseworthy added a commit to noseworthy/tiktoken that referenced this issue on Feb 16, 2025:

The `tiktoken-js` library includes a very helpful function, `getEncodingNameForModel()`. In the Rust-based `tiktoken` package, this mapping is buried in the implementation of `encoding_for_model()`. The function is very useful when implementing an encoding cache based on the model used: with a mapping from model to encoding, caching by encoding name conserves resources, since many models reuse the same encoding.

I've refactored the TypeScript definition generation a little so that all types are declared, and referenced, in the same block and there are no use-before-declaration warnings. Finally, I've exposed a new `get_encoding_name_for_model()` function that behaves like the one in the `tiktoken-js` package and used it inside `encoding_for_model()`. I also added a test to ensure the function can be called from TypeScript code and that it throws for invalid model names.

Fixes: dqbd#123
noseworthy added a commit to noseworthy/tiktoken that referenced this issue on Feb 17, 2025:

The `tiktoken-js` library includes a very helpful function, `getEncodingNameForModel()`. In the Rust-based `tiktoken` package, this mapping is buried in the implementation of `encoding_for_model()`. The function is very useful when implementing an encoding cache based on the model used: with a mapping from model to encoding, caching by encoding name conserves resources, since many models reuse the same encoding.

I've exposed a new `get_encoding_name_for_model()` function that behaves like the one in the `tiktoken-js` package and used it inside `encoding_for_model()`. Finally, I've also added a test to ensure the function can be called from TypeScript code and that it throws for invalid model names.

Fixes: dqbd#123
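As a usage note, a small sketch of how the export from the commits above might be called from TypeScript; the import assumes a `tiktoken` build that contains that commit, and the exact exception type is an assumption, since the commits only state that invalid model names throw:

```ts
// Assumes a tiktoken build that includes the commit above; this export is
// not in released versions at the time of this issue.
import { get_encoding_name_for_model } from "tiktoken";

try {
  // Resolves a model name to its encoding name, e.g. "cl100k_base".
  const name = get_encoding_name_for_model("gpt-4");
  console.log(`gpt-4 uses the ${name} encoding`);
} catch (err) {
  // The commit's test checks that unknown model names throw.
  console.error("unrecognized model name:", err);
}
```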