An efficient implementation of BytePairTokenizer #36

gboduljak · 2024-01-17T02:01:39Z

As suggested in @angeloskath's code review on ml-explore/mlx-examples#315 (comment), an implementation of BytePairTokenizer seems useful for many use cases, but it is currently missing from mlx-data. I did some research on byte pair tokenization in transformers, and I think the implementation there is somewhat slow. More precisely, every time a merge could be done, it iterates over all adjacent symbol pairs to determine the best pair to merge. This implies quadratic time complexity in the number of symbols. However, the referenced paper describes an elegant linearithmic-time implementation. Since that implementation requires some pointer trickery, it seems that we could (relatively) easily implement it in C++ and expose it to Python.
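To make the cost concrete, here is a minimal sketch of the rescanning merge loop described above. It is a simplification written for illustration, not the actual transformers code: the function name `bpe_encode_naive` and the `merge_ranks` dictionary (mapping a symbol pair to its merge priority, lower meaning earlier) are assumptions, and it merges one occurrence per pass rather than all occurrences of the best pair.

```python
def bpe_encode_naive(symbols, merge_ranks):
    """Apply BPE merges by rescanning every adjacent pair before each merge.

    symbols: iterable of initial symbols (e.g. the characters of a word).
    merge_ranks: hypothetical dict mapping (left, right) -> rank.
    """
    symbols = list(symbols)
    while len(symbols) > 1:
        best_rank, best_idx = None, None
        # Full scan over all adjacent pairs: O(n) work per merge.
        for i in range(len(symbols) - 1):
            rank = merge_ranks.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_idx = rank, i
        if best_idx is None:
            break  # no mergeable pair is left
        # Replace the two symbols with their concatenation.
        merged = symbols[best_idx] + symbols[best_idx + 1]
        symbols[best_idx:best_idx + 2] = [merged]
    return symbols


merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_encode_naive("lower", merges))
```

With up to n - 1 merges and a linear scan before each one, the loop does quadratic work on a sequence of n symbols.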

I would appreciate your thoughts on:

1. Do we want an implementation of BytePairTokenizer in C++?
2. Do we want the faster C++ implementation of BytePairTokenizer described in the paper?

Paper: https://arxiv.org/pdf/2306.16837.pdf
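For illustration, one common way to avoid rescanning is to keep the symbols in a doubly linked list and the candidate pairs in a priority queue. The sketch below follows that general idea and is not necessarily the paper's exact algorithm; the name `bpe_encode_heap` and the `merge_ranks` dictionary (symbol pair to merge priority, lower first) are assumptions made for this example. It is written in Python for brevity, with the index arrays standing in for the pointers a C++ version would use.

```python
import heapq


def bpe_encode_heap(symbols, merge_ranks):
    """Apply BPE merges using a linked list plus a min-heap of candidate pairs.

    merge_ranks: hypothetical dict mapping (left, right) -> rank.
    """
    syms = list(symbols)
    n = len(syms)
    prev = list(range(-1, n - 1))  # index of the previous live node, -1 if none
    nxt = list(range(1, n + 1))    # index of the next live node, n if none
    alive = [True] * n
    heap = []

    def push(i):
        # Queue the pair starting at node i, if it is a known merge.
        j = nxt[i]
        if j < n:
            rank = merge_ranks.get((syms[i], syms[j]))
            if rank is not None:
                heapq.heappush(heap, (rank, i, syms[i], syms[j]))

    for i in range(n - 1):
        push(i)

    while heap:
        rank, i, left, right = heapq.heappop(heap)
        # Skip stale entries whose symbols changed since they were queued.
        if not alive[i] or syms[i] != left:
            continue
        j = nxt[i]
        if j >= n or syms[j] != right:
            continue
        # Merge node j into node i and unlink j from the list.
        syms[i] = left + right
        alive[j] = False
        nxt[i] = nxt[j]
        if nxt[j] < n:
            prev[nxt[j]] = i
        # Only the two pairs touching the merged node can be new.
        if prev[i] >= 0:
            push(prev[i])
        push(i)

    return [syms[i] for i in range(n) if alive[i]]


merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_encode_heap("lower", merges))
```

Each merge pushes at most two new heap entries, so the number of heap operations is O(n log n) for n initial symbols. Stale entries are discarded lazily when popped, which avoids having to delete from the middle of the heap.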


angeloskath · 2024-01-17T08:08:23Z

Hi @gboduljak!

Yeah, we would want a tokenizer in C++. I think for starters, implementing it similarly to the Python version but in C++ would be sufficient. BPE is quite a simple algorithm, and if a Python implementation is usable, I think a C++ one would be at least as usable (probably much faster), with the added benefit of allowing us to use threads.

Subsequently, we can optimize it if needed.

gboduljak mentioned this issue Jan 17, 2024

CLIP (ViT) ml-explore/mlx-examples#315

Merged


gboduljak mentioned this issue Jan 18, 2024

A draft implementation of BPE tokenizer #39

Closed
