Simplify byte_pair_merge #255

hauntsaninja · 2024-02-09T22:56:46Z

Based on suggestion in #239 (specifically 8f5dd7d)

Like that commit, this:

Does the init in a single loop and saves a loop if there are no merges
Simplifies get_rank and no longer uses it in init (so you don't need multiple skip values)

Unlike that commit:

We drop optimisations enabled by ignoring single tokens. These didn't show any benefit on benchmarks for me (this makes sense given typical piece sizes, but let me know if that's unexpected!). Given this, I opted for the simpler version.
I preserve some of the comments from the original that I think are still useful
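
For readers without the diff open, the shape described in the two lists above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the names `byte_pair_merge_sketch` and `split_sketch`, the `Rank = u32` alias, and the assumption that `piece` has at least two bytes are all choices made for this example.

```rust
use std::collections::HashMap;

type Rank = u32; // assumption for this sketch

/// Returns (start, rank) entries; consecutive starts delimit the final tokens.
/// Assumes `piece.len() >= 2` (hypothetical precondition for this sketch).
fn byte_pair_merge_sketch(ranks: &HashMap<Vec<u8>, Rank>, piece: &[u8]) -> Vec<(usize, Rank)> {
    assert!(piece.len() >= 2, "sketch assumes at least two bytes");
    let mut parts: Vec<(usize, Rank)> = Vec::with_capacity(piece.len() + 1);

    // Single init loop: look up each adjacent byte pair directly and track the
    // minimum as we go, so the merge loop is skipped when nothing can merge.
    let mut min_rank: (Rank, usize) = (Rank::MAX, usize::MAX);
    for i in 0..piece.len() - 1 {
        let rank = ranks.get(&piece[i..i + 2]).copied().unwrap_or(Rank::MAX);
        if rank < min_rank.0 {
            min_rank = (rank, i);
        }
        parts.push((i, rank));
    }
    parts.push((piece.len() - 1, Rank::MAX));
    parts.push((piece.len(), Rank::MAX));

    // Only used inside the merge loop, so a single "no rank" value suffices.
    // The +3 is because parts[i + 1] has not been removed yet when this runs.
    let get_rank = |parts: &[(usize, Rank)], i: usize| -> Rank {
        if i + 3 < parts.len() {
            ranks
                .get(&piece[parts[i].0..parts[i + 3].0])
                .copied()
                .unwrap_or(Rank::MAX)
        } else {
            Rank::MAX
        }
    };

    while min_rank.0 != Rank::MAX {
        let i = min_rank.1;
        // Update the neighbouring ranks first, then drop the merged boundary.
        if i > 0 {
            parts[i - 1].1 = get_rank(&parts, i - 1);
        }
        parts[i].1 = get_rank(&parts, i);
        parts.remove(i + 1);

        // Linear rescan for the next lowest-rank pair (the last entry is a sentinel).
        min_rank = (Rank::MAX, usize::MAX);
        for (j, &(_, rank)) in parts[..parts.len() - 1].iter().enumerate() {
            if rank < min_rank.0 {
                min_rank = (rank, j);
            }
        }
    }
    parts
}

/// Hypothetical helper: turn the boundaries into byte slices.
fn split_sketch<'a>(ranks: &HashMap<Vec<u8>, Rank>, piece: &'a [u8]) -> Vec<&'a [u8]> {
    byte_pair_merge_sketch(ranks, piece)
        .windows(2)
        .map(|w| &piece[w[0].0..w[1].0])
        .collect()
}
```

With ranks `ab -> 0` and `cd -> 1`, `split_sketch(&ranks, b"abcd")` yields `ab` and `cd`. The rescan after each merge is linear in the number of parts, which is the quadratic behaviour the follow-up "linearithmic fix" refers to.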

Let me know what you think! Once we figure this one out, we'll look at the linearithmic fix (I have some thoughts there, still doing some benchmarking).

Co-authored-by: @paplorinc

l0rinc · 2024-02-10T11:21:42Z

src/lib.rs

+        parts.push((i, rank));
+    }
+    parts.push((piece.len() - 1, Rank::MAX));
+    parts.push((piece.len(), Rank::MAX));

    let get_rank = {
        #[inline(always)]


Did you see any effect from the inlining here?
I didn't, and even the linter complained, since this is a closure that captures some parameters.

hauntsaninja · Feb 11, 2024

Good question. I dimly remember it being useful in #31 (but it was also used in an additional place then). I can double check :-) Which linter?

l0rinc left a comment

"We drop optimisations enabled by ignoring single tokens."

The parts.len() > 3 check means that once we're down to 2 tokens, we don't need any more iterations: there's no need to try merging them into a single token, since we've already filtered those cases out. That won't show up in the benchmarks, though, since it's basically constant time, so I agree, the code is simpler this way :)
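
The early exit being discussed can be sketched as below. This is an illustration under stated assumptions rather than the code from either commit: `split_with_early_exit` is a hypothetical name, `Rank = u32` is assumed, and the whole-piece lookup is folded into the same function here even though it is described above as a filter that happens beforehand. With n tokens stored as n + 1 boundary entries, `parts.len() > 3` reads as "more than two tokens remain".

```rust
use std::collections::HashMap;

type Rank = u32; // assumption for this sketch

fn split_with_early_exit<'a>(ranks: &HashMap<Vec<u8>, Rank>, piece: &'a [u8]) -> Vec<&'a [u8]> {
    // Pre-filter: a piece that is already a single token never reaches the
    // merge loop, so merging the last two tokens into one can never succeed.
    if piece.len() < 2 || ranks.contains_key(piece) {
        return vec![piece];
    }
    let rank_of = |bytes: &[u8]| ranks.get(bytes).copied().unwrap_or(Rank::MAX);

    // parts[i] = (start offset, rank of merging token i with token i + 1)
    let mut parts: Vec<(usize, Rank)> = (0..piece.len() - 1)
        .map(|i| (i, rank_of(&piece[i..i + 2])))
        .collect();
    parts.push((piece.len() - 1, Rank::MAX));
    parts.push((piece.len(), Rank::MAX));

    // Rank of the pair starting at index k, measured before parts[i + 1] is removed.
    let pair_rank = |parts: &[(usize, Rank)], k: usize| -> Rank {
        if k + 3 < parts.len() {
            rank_of(&piece[parts[k].0..parts[k + 3].0])
        } else {
            Rank::MAX
        }
    };

    // Stop as soon as only two tokens are left (three boundary entries).
    while parts.len() > 3 {
        let mut min_rank: (Rank, usize) = (Rank::MAX, usize::MAX);
        for (j, &(_, rank)) in parts[..parts.len() - 1].iter().enumerate() {
            if rank < min_rank.0 {
                min_rank = (rank, j);
            }
        }
        if min_rank.0 == Rank::MAX {
            break; // nothing left to merge
        }
        let i = min_rank.1;
        if i > 0 {
            parts[i - 1].1 = pair_rank(&parts, i - 1);
        }
        parts[i].1 = pair_rank(&parts, i);
        parts.remove(i + 1);
    }
    parts.windows(2).map(|w| &piece[w[0].0..w[1].0]).collect()
}
```

The guard saves at most one scan and one lookup per piece, which is consistent with the observation that it is effectively constant time and does not show up in benchmarks.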

Thanks for adding the comments back, please see my inline comments.
If you can, please add me as a coauthor here.

Thanks!

Co-authored-by: Lőrinc Pap <[email protected]>

hauntsaninja · 2024-02-11T08:20:52Z

Thank you for the original change and follow-up review! I've marked you as co-author on the commit :-)

Simplify byte_pair_merge

66a57ba

l0rinc reviewed Feb 10, 2024

src/lib.rs (4 review comments, 3 of them on since-outdated code)

l0rinc approved these changes Feb 10, 2024

Apply suggestions from code review

2cc09e0

Co-authored-by: Lőrinc Pap <[email protected]>

l0rinc approved these changes Feb 11, 2024


hauntsaninja merged commit 1b9faf2 into main Feb 11, 2024
42 checks passed

hauntsaninja deleted the byte-pair-merge branch February 11, 2024 08:20

stephentoub mentioned this pull request Feb 20, 2024

Ensure tiktoken implementation up-to-date with OpenAI reference implementation dotnet/machinelearning#7019

Open

tmm1 mentioned this pull request Oct 17, 2024

Simplify byte_pair_merge anysphere/tiktoken-rs#17

Merged

tmm1 added a commit to anysphere/tiktoken-rs that referenced this pull request Oct 17, 2024

Simplify byte_pair_merge (#17)

0a951f9

backport of openai#255 Co-authored-by: Shantanu <[email protected]> Co-authored-by: Lőrinc Pap <[email protected]>
