Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify byte_pair_merge #255

Merged
merged 2 commits into from
Feb 11, 2024
Merged

Simplify byte_pair_merge #255

merged 2 commits into from
Feb 11, 2024

Conversation

hauntsaninja
Copy link
Collaborator

@hauntsaninja hauntsaninja commented Feb 9, 2024

Based on suggestion in #239 (specifically 8f5dd7d)

Like that commit, this:

  • Does the init in a single loop and saves a loop if there are no merges
  • Simplifies get_rank and no longer uses it in init (so you don't need multiple skip values)

Unlike that commit:

  • We drop optimisations enabled by ignoring single tokens. These didn't show any benefit on benchmarks for me (this makes sense given typical piece sizes, but let me know if that's unexpected!). Given this, I opted for the simpler version.
  • I preserve some of the comments from the original that I think are still useful

Let me know what you think! Once we figure this one out, we'll look at the linearithmic fix (I have some thoughts there, still doing some benchmarking).

Co-authored-by: @paplorinc

src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
parts.push((i, rank));
}
parts.push((piece.len() - 1, Rank::MAX));
parts.push((piece.len(), Rank::MAX));

let get_rank = {
#[inline(always)]
Copy link
Contributor

@l0rinc l0rinc Feb 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you see any effect of the inlining here?
I didn't, and even the linter complained, this being a closure inheriting some paramters

Copy link
Collaborator Author

@hauntsaninja hauntsaninja Feb 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, I dimly remember it being useful in #31 (but it was also used in an additional place then). I can double check :-) Which linter?

src/lib.rs Show resolved Hide resolved
Copy link
Contributor

@l0rinc l0rinc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We drop optimisations enabled by ignoring single tokens.

the parts.len() > 3 means that once we're down to 2 tokens, we don't need more iterations, we don't have to try to merge it into a single token since we've already filtered those out - but that won't show up in the benchmarks, since that's basically constant time, so I agree, the code is simpler this way :)

Thanks for adding the comments back, please see my inline comments.
If you can, please add me as a coauthor here.

Thanks!

@hauntsaninja hauntsaninja merged commit 1b9faf2 into main Feb 11, 2024
42 checks passed
@hauntsaninja hauntsaninja deleted the byte-pair-merge branch February 11, 2024 08:20
@hauntsaninja
Copy link
Collaborator Author

Thank you for the original change and follow-up review! I've marked you as co-author on the commit :-)

tmm1 added a commit to anysphere/tiktoken-rs that referenced this pull request Oct 17, 2024
backport of openai#255

Co-authored-by: Shantanu <[email protected]>
Co-authored-by: Lőrinc Pap <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants