
Replace js-tiktoken BPE merge algorithm with faster heap based algorithm #101

Open
wants to merge 3 commits into main

Conversation


@mikolalysenko mikolalysenko commented Apr 19, 2024

There are several open issues noting that, in the worst case, the BPE merge algorithm in js-tiktoken takes time quadratic in the number of input characters for certain pathological inputs.

This PR fixes the problem by using a heap, which avoids recalculating the ranks of all token pairs after every merge. The same technique should also work for the Rust/WASM tokenizer, but it seems less important there since the native parsers are already quite fast.
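For illustration, here is a minimal TypeScript sketch of the heap-based approach; this is not the PR's actual code, and `TokenNode`, `Candidate`, `MinHeap`, and `bpeMerge` are illustrative names. The token sequence lives in a doubly linked list, candidate merges sit in a min-heap keyed by rank, and stale heap entries are discarded lazily, so each merge only re-ranks the two pairs adjacent to it:

```typescript
type Rank = number;

interface TokenNode {
  piece: string;            // current token text
  prev: TokenNode | null;
  next: TokenNode | null;
  dead: boolean;            // true once merged into its left neighbour
}

interface Candidate {
  rank: Rank;
  left: TokenNode;          // proposal: merge `left` with `left.next`
  pair: string;             // pair text at push time, for lazy invalidation
}

// Tiny binary min-heap keyed on candidate rank.
class MinHeap {
  private h: Candidate[] = [];
  get size() { return this.h.length; }
  push(c: Candidate) {
    const h = this.h;
    h.push(c);
    for (let i = h.length - 1; i > 0; ) {
      const p = (i - 1) >> 1;
      if (h[p].rank <= h[i].rank) break;
      [h[p], h[i]] = [h[i], h[p]];
      i = p;
    }
  }
  pop(): Candidate | undefined {
    const h = this.h;
    const top = h[0];
    const last = h.pop();
    if (h.length && last !== undefined) {
      h[0] = last;
      for (let i = 0; ; ) {
        const l = 2 * i + 1, r = 2 * i + 2;
        let m = i;
        if (l < h.length && h[l].rank < h[m].rank) m = l;
        if (r < h.length && h[r].rank < h[m].rank) m = r;
        if (m === i) break;
        [h[m], h[i]] = [h[i], h[m]];
        i = m;
      }
    }
    return top;
  }
}

// O(n log n) BPE merge: only the pairs adjacent to each merge get re-ranked.
function bpeMerge(pieces: string[], ranks: Map<string, Rank>): string[] {
  const nodes: TokenNode[] = pieces.map((piece) => ({ piece, prev: null, next: null, dead: false }));
  for (let i = 0; i < nodes.length; i++) {
    nodes[i].prev = nodes[i - 1] ?? null;
    nodes[i].next = nodes[i + 1] ?? null;
  }
  const heap = new MinHeap();
  const propose = (left: TokenNode | null) => {
    if (!left || !left.next) return;
    const pair = left.piece + left.next.piece;
    const rank = ranks.get(pair);
    if (rank !== undefined) heap.push({ rank, left, pair });
  };
  nodes.forEach(propose);

  while (heap.size > 0) {
    const { left, pair } = heap.pop()!;
    const right = left.next;
    // Skip stale proposals whose nodes were already merged or rewritten.
    if (left.dead || !right || right.dead || left.piece + right.piece !== pair) continue;
    left.piece += right.piece;   // merge `right` into `left`
    right.dead = true;
    left.next = right.next;
    if (right.next) right.next.prev = left;
    propose(left.prev);          // only the two neighbouring pairs change
    propose(left);
  }

  const out: string[] = [];
  for (let n: TokenNode | null = nodes[0] ?? null; n; n = n.next) out.push(n.piece);
  return out;
}
```

With the linked list plus lazy heap invalidation, each of the at most n−1 merges does O(log n) work, which removes the quadratic blow-up on pathological inputs.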

I also added a new test fixture and an example string that triggers the pathological behavior.

Related issues:

Note: This should probably be treated as a mild CVE, since an attacker could use this behavior to mount a denial of service against services that check user input with js-tiktoken.

@huytool157

@dqbd are we considering merging this? We also notice performance issues when the input is large.

@enricoros

@mikolalysenko have you benchmarked this vs. the baseline vs. WASM? Curious about 1, 1000, 1M, and 100M tokens.

@jonschlinkert

have you benchmarked this vs. the baseline vs. WASM? Curious about 1, 1000, 1M, and 100M tokens.

If there is a performance difference at different scales, we should consider toggling between the algorithms based on input length.
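As a rough illustration of that suggestion, a length-based toggle could look something like the sketch below; the threshold and both merge-function names are placeholders, to be tuned or replaced once benchmark numbers exist:

```typescript
type MergeFn = (pieces: string[], ranks: Map<string, number>) => string[];

// Placeholder cut-over point; a real value would come from benchmarking.
const HEAP_THRESHOLD = 500;

// `simpleMerge` stands in for the current js-tiktoken merge, `heapMerge` for
// the heap-based one from this PR; both names are illustrative.
function makeAdaptiveMerge(simpleMerge: MergeFn, heapMerge: MergeFn): MergeFn {
  return (pieces, ranks) =>
    pieces.length < HEAP_THRESHOLD ? simpleMerge(pieces, ranks) : heapMerge(pieces, ranks);
}
```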
