Slow Speed Issue #1029
Comments
There are likely two factors here:
Thanks for the response!
FYI, we sped up the fused attention in MLX 0.19.0. It should be noticeably faster, though still a bit slower than llama.cpp at very long sequence lengths; there are still some optimizations to do there.
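For illustration, here is a minimal micro-benchmark sketch of the fused attention path at a long sequence length, assuming MLX >= 0.19 and `mx.fast.scaled_dot_product_attention`; the shapes and head counts are illustrative, not Llama 3.1's exact configuration:

```python
import time
import mlx.core as mx

# Illustrative shapes: one decoded query step against a ~32k-token KV cache.
B, H, L, D = 1, 32, 32768, 128
scale = D ** -0.5

q = mx.random.normal((B, H, 1, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))
mx.eval(q, k, v)

def naive_attention(q, k, v):
    # Unfused reference: explicit scores, softmax, and weighted sum.
    scores = mx.matmul(q * scale, mx.swapaxes(k, -1, -2))
    return mx.matmul(mx.softmax(scores, axis=-1), v)

def fused_attention(q, k, v):
    return mx.fast.scaled_dot_product_attention(q, k, v, scale=scale)

def bench(fn, repeats=20):
    mx.eval(fn(q, k, v))                      # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        mx.eval(fn(q, k, v))                  # force evaluation each iteration
    return (time.perf_counter() - start) / repeats

print(f"naive: {bench(naive_attention) * 1e3:.2f} ms/step")
print(f"fused: {bench(fused_attention) * 1e3:.2f} ms/step")
```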
That's awesome. Interestingly, running the same test with 4-bit produced a bug where it generated the full 1000 max tokens, repeating the last two paragraphs over and over:

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt -<../text/portugal.txt;say done

Here is my Full prompt (from a Wikipedia article), which is the same as in my original test. Just to see what happens, I increased --max-kv-size to 34k and --max-tokens to 2000, and it generated 2k tokens with the loop. Running the exact same command with 8-bit instead of 4-bit generated the correct text and stopped at the end without looping. In the previous test, it did not do that. Should I create a separate issue? Thanks!
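For reference, a rough sketch of driving the same 4-bit vs 8-bit comparison from the mlx_lm Python API instead of the CLI (the prompt path mirrors the command above; the repetition check is just an illustrative heuristic, and sampling options such as temperature, top-p, and max KV size are omitted here because their Python-side names vary across mlx_lm versions):

```python
from mlx_lm import load, generate

# Same long Wikipedia prompt as in the CLI command above. Note: the CLI applies
# the model's chat template to --prompt; this sketch passes the raw text, which
# is a simplification.
prompt = open("../text/portugal.txt").read()

for model_id in (
    "mlx-community/Meta-Llama-3.1-8B-Instruct-4bit",
    "mlx-community/Meta-Llama-3.1-8B-Instruct-8bit",
):
    model, tokenizer = load(model_id)
    text = generate(model, tokenizer, prompt=prompt, max_tokens=1000, verbose=False)

    # Crude check for the reported looping: does the last paragraph occur
    # more than once in the output?
    last_paragraph = text.strip().split("\n\n")[-1]
    print(model_id, "-> repeats last paragraph:", text.count(last_paragraph) > 1)
```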
If the behavior changed from 0.18.1 to 0.19, then yes, it would be good to file another issue for that. Small changes due to numerics might make sense, but if it went from working to not working, that doesn't sound good.
@chigkim are you still having this issue?
I ran the tests below on a MacBook Pro with an M3 Max and 64 GB of RAM. MLX seems to run much slower than llama.cpp with flash attention enabled.
Is this speed gap just a result of flash attention not being available in MLX? If so, it would be amazing to have flash attention!
I'm including the full commands and relevant logs below. Also, here is my Full prompt (from a Wikipedia article).
Lcpp-fa = llama.cpp with flash attention.
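For context, a minimal sketch of timing the MLX side of this comparison from Python rather than from the CLI's log output (the model name and prompt path mirror the commands in this issue; this measures end-to-end wall-clock time only, not prompt processing and generation separately):

```python
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Meta-Llama-3.1-8B-Instruct-8bit")
prompt = open("../text/portugal.txt").read()   # the long Wikipedia prompt

start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=1000)
elapsed = time.perf_counter() - start

# Rough throughput: generated tokens over total time (prompt processing included).
generated = len(tokenizer.encode(text))
print(f"{elapsed:.1f} s end-to-end, ~{generated / elapsed:.1f} tok/s overall")
```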
Thanks!