Expose getEncodingNameForModel in wasm version? #123
Would it be possible to expose the function `getEncodingNameForModel`, which is available in the JavaScript version, in the wasm version of the library as well? The wasm version currently exposes `encoding_for_model` and `get_encoding`, and both create a tokenizer instance immediately. For our use case, we'd like to first translate the model name to the underlying encoding and then instantiate the tokenizer with `get_encoding` ourselves. This would let us cache and reuse a single tokenizer across multiple models that use the same encoding.
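To make the request concrete, here is a minimal sketch of that caching pattern. `get_encoding` and `Tiktoken` are the wasm package's existing exports; `get_encoding_name_for_model` is only declared as a stub here, since exposing it is exactly what this issue asks for:

```ts
// Sketch only: get_encoding and Tiktoken are existing exports of the wasm
// package; get_encoding_name_for_model is the function this issue proposes,
// so it is declared as a stub rather than imported.
import { get_encoding, Tiktoken } from "tiktoken";

// Stand-in for the proposed export; today this model -> encoding mapping
// only exists inside the implementation of encoding_for_model().
declare function get_encoding_name_for_model(model: string): string;

// One tokenizer per encoding name, shared by every model using that encoding.
const tokenizers = new Map<string, Tiktoken>();

function tokenizerForModel(model: string): Tiktoken {
  const encodingName = get_encoding_name_for_model(model);
  let tokenizer = tokenizers.get(encodingName);
  if (tokenizer === undefined) {
    // get_encoding takes an encoding name such as "cl100k_base".
    tokenizer = get_encoding(encodingName as Parameters<typeof get_encoding>[0]);
    tokenizers.set(encodingName, tokenizer);
  }
  return tokenizer;
}
```

Because the cache is keyed on the encoding name rather than the model name, it holds a single `Tiktoken` instance for all models that share an encoding, instead of one instance per model.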
Comments

noseworthy added a commit to noseworthy/tiktoken that referenced this issue on Feb 16, 2025:

The `tiktoken-js` library includes a very helpful function, `getEncodingNameForModel()`. In the Rust-based `tiktoken` package, this mapping is buried in the implementation of `encoding_for_model()`. The function is very useful when implementing an encoding cache based on the model used: with a mapping from model to encoding, caching by encoding name conserves resources, since many models reuse the same encoding.

I've refactored the TypeScript definition generation a little so that all types are declared, and referenced, in the same block and there are no use-before-declaration warnings. Finally, I've exposed a new `get_encoding_name_for_model()` function that behaves like the one in the `tiktoken-js` package and used it inside `encoding_for_model()`. I also added a test to ensure the function can be called from TypeScript code and that it throws for invalid model names.

Fixes: dqbd#123
noseworthy added a commit to noseworthy/tiktoken that referenced this issue on Feb 17, 2025:

The `tiktoken-js` library includes a very helpful function, `getEncodingNameForModel()`. In the Rust-based `tiktoken` package, this mapping is buried in the implementation of `encoding_for_model()`. The function is very useful when implementing an encoding cache based on the model used: with a mapping from model to encoding, caching by encoding name conserves resources, since many models reuse the same encoding.

I've exposed a new `get_encoding_name_for_model()` function that behaves like the one in the `tiktoken-js` package and used it inside `encoding_for_model()`. Finally, I've also added a test to ensure the function can be called from TypeScript code and that it throws for invalid model names.

Fixes: dqbd#123
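As a usage note, a small sketch of how the export from the commits above might be called from TypeScript; the import assumes a `tiktoken` build that contains that commit, and the exact exception type is an assumption, since the commits only state that invalid model names throw:

```ts
// Assumes a tiktoken build that includes the commit above; this export is
// not in released versions at the time of this issue.
import { get_encoding_name_for_model } from "tiktoken";

try {
  // Resolves a model name to its encoding name, e.g. "cl100k_base".
  const name = get_encoding_name_for_model("gpt-4");
  console.log(`gpt-4 uses the ${name} encoding`);
} catch (err) {
  // The commit's test checks that unknown model names throw.
  console.error("unrecognized model name:", err);
}
```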