feat(mlx_lm): basic speculative decoding support in mlx_lm.generate / mlx_lm.server #954
Marked as draft only because this is stacked on top of #948; the speculative decoding commit is 7d0e1cc.

Implementation based on #149. This basic version only supports bs=1, temp=0, and max_kv_size=None. Support for samplers, the rotating cache, and batching is deferred to future commits to keep this diff small.
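For reviewers, here is a minimal sketch of the greedy speculative decoding loop this implements (bs=1, temp=0). This is not the PR's actual code: `main_model`, `draft_model`, and `num_draft` are hypothetical stand-ins for the real model objects and CLI options, and the sketch omits KV caching entirely, re-running both models over the full sequence each step.

```python
import mlx.core as mx

def speculative_step(main_model, draft_model, tokens, num_draft=4):
    # `tokens` is a Python list of token ids; `main_model` / `draft_model`
    # are assumed to map a 1-D mx.array of ids to (seq_len, vocab) logits.

    # 1. Draft num_draft tokens greedily with the cheap model.
    draft = list(tokens)
    for _ in range(num_draft):
        logits = draft_model(mx.array(draft))
        draft.append(mx.argmax(logits[-1]).item())
    proposed = draft[len(tokens):]

    # 2. Verify all drafted tokens with a single main-model forward pass.
    logits = main_model(mx.array(draft))
    # logits[i] predicts token i + 1, so the greedy targets for the
    # proposed tokens are the argmaxes at positions len(tokens)-1 .. -2.
    targets = mx.argmax(logits[len(tokens) - 1 : -1], axis=-1).tolist()

    # 3. Keep the longest matching prefix; on the first mismatch, take the
    #    main model's token instead (so every step yields >= 1 new token).
    accepted = []
    for d, t in zip(proposed, targets):
        accepted.append(t)
        if d != t:
            break
    else:
        # All drafts matched: the main model's last logits give a bonus token.
        accepted.append(mx.argmax(logits[-1]).item())
    return tokens + accepted
```

The key property is that at temp=0 every emitted token is exactly what greedy decoding with the main model alone would produce, so the output is unchanged while the expensive model is invoked far fewer times.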
Note: per #948 (comment), the best speedup occurs when the main model is unquantized.
Test (>2x speedup on my machine):
```shell
python -m llms.mlx_lm.generate --temp 0 --max-kv-size 0 \
  --model mlx-community/SmolLM-1.7B-Instruct-fp16 \
  --draft-model mlx-community/SmolLM-135M-Instruct-4bit \
  --prompt "Develop a Python program that reads all the text files under a directory and returns top-5 words with the most number of occurrences." \
  --max-tokens 500
```