feat(mlx_lm): basic speculative decoding support in mlx_lm.generate / mlx_lm.server #954
Marked as draft only because this is stacked on top of #948; the speculative decoding commit is 7d0e1cc.

Implementation based on #149. This basic version only supports bs=1, temp=0, and max_kv_size=None. Support for samplers, the rotating cache, and batching is deferred to future commits to keep this diff small.
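For reviewers, here is a minimal sketch of the greedy speculative decoding loop this implements (bs=1, temp=0). This is not the PR's actual code: `main_model`, `draft_model`, and `num_draft` are hypothetical stand-ins for the real model objects and CLI options, and the sketch omits KV caching entirely, re-running both models over the full sequence each step.

```python
import mlx.core as mx

def speculative_step(main_model, draft_model, tokens, num_draft=4):
    # `tokens` is a Python list of token ids; `main_model` / `draft_model`
    # are assumed to map a 1-D mx.array of ids to (seq_len, vocab) logits.

    # 1. Draft num_draft tokens greedily with the cheap model.
    draft = list(tokens)
    for _ in range(num_draft):
        logits = draft_model(mx.array(draft))
        draft.append(mx.argmax(logits[-1]).item())
    proposed = draft[len(tokens):]

    # 2. Verify all drafted tokens with a single main-model forward pass.
    logits = main_model(mx.array(draft))
    # logits[i] predicts token i + 1, so the greedy targets for the
    # proposed tokens are the argmaxes at positions len(tokens)-1 .. -2.
    targets = mx.argmax(logits[len(tokens) - 1 : -1], axis=-1).tolist()

    # 3. Keep the longest matching prefix; on the first mismatch, take the
    #    main model's token instead (so every step yields >= 1 new token).
    accepted = []
    for d, t in zip(proposed, targets):
        accepted.append(t)
        if d != t:
            break
    else:
        # All drafts matched: the main model's last logits give a bonus token.
        accepted.append(mx.argmax(logits[-1]).item())
    return tokens + accepted
```

The key property is that at temp=0 every emitted token is exactly what greedy decoding with the main model alone would produce, so the output is unchanged while the expensive model is invoked far fewer times.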
Note: per #948 (comment), the best speedup occurs when the main model is unquantized.
Test (>2x speedup on my machine):
```shell
python -m llms.mlx_lm.generate --temp 0 --max-kv-size 0 \
  --model mlx-community/SmolLM-1.7B-Instruct-fp16 \
  --draft-model mlx-community/SmolLM-135M-Instruct-4bit \
  --prompt "Develop a Python program that reads all the text files under a directory and returns top-5 words with the most number of occurrences." \
  --max-tokens 500
```