Llama-3.1-8B-Instruct-4bit keeps looping at the end. #1059
Comments
I ran the same command, below is the output I get which looks pretty reasonable. Can you share what you are seeing? Also could you share the machine, the OS, and the version of MLX as well?
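For reference, one way to collect those details is something like the following (a sketch; it assumes MLX exposes its version as `mlx.core.__version__`):

```bash
# Print OS/hardware info and the installed MLX version.
python -c "import platform, mlx.core as mx; print(platform.platform(), mx.__version__)"
# Show the installed package versions of mlx and mlx-lm.
pip show mlx mlx-lm
```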
Yeah, your output looks correct. MacBook Pro 16" with M3 Max, 64GB. Thanks!
Also, I noticed that your prompt processing speed is 643.576 tokens-per-sec, and mine is 417.707.
I was able to reproduce this with the same long prompt on M1 Max. I did a bit of bisecting, and it looks like after ml-explore/mlx#1509 I'm getting non-deterministic outputs.
Ok, I'll take a look at that. But it's not in MLX 0.19, so it can't really be the same issue as above.
@chigkim Were you building the main branch of MLX from source, or did you install MLX from PyPI?
For MLX, I installed from pip. For MLX-LM, I tried both from pip and from git.
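For anyone following along, the two install paths being compared look roughly like this (a sketch; it assumes the late-2024 layout in which mlx-lm lived under `llms/` in the mlx-examples repo):

```bash
# From PyPI (released wheels):
pip install mlx mlx-lm

# mlx-lm from git (editable install):
git clone https://github.com/ml-explore/mlx-examples.git
cd mlx-examples/llms
pip install -e .
```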
I've tried to reproduce this on several machines (M1 Max, M2 Ultra, M1 Ultra, and M3 Max), and so far I'm not seeing any issues in the output. Some questions/suggestions:
I just tried mlx==0.19.1 as well as 0.20.0 along with mlx-lm==0.19.3.
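Pinning those exact versions for a test is just (sketch):

```bash
pip install "mlx==0.19.1" "mlx-lm==0.19.3"   # or "mlx==0.20.0"
```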
I'm really stumped by this one, to be honest. I tried your exact command, which works fine on several machines:
Given that it works for me, I don't think there is a problem with the command or the prompt (unless you've changed it). Without the ability to reproduce, it's very difficult to debug. Anything you can do to fuzz around and see whether there are conditions the looping is sensitive to would be a big help in getting to the bottom of this. Some ideas:
It would be really great if you have time for any of those and can share the results. Thanks!
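The list of ideas isn't reproduced in this excerpt. As one illustration of that kind of probing (my own sketch, not one of the suggestions from the comment), a sweep over the seed with everything else held fixed could look like:

```bash
# Rerun the same long prompt with several seeds and save each output,
# then inspect the tail of each file for the repeating paragraphs.
for seed in 1 2 3 1000; do
  mlx_lm.generate \
    --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
    --max-kv-size 33000 --max-tokens 1000 \
    --temp 0.0 --top-p 0.9 --seed "$seed" \
    --prompt - < ../text/portugal.txt > "out_seed_${seed}.txt"
done
tail -n 5 out_seed_*.txt
```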
I deleted my environment and created a fresh one with |
That is really curious... There was a bug in one of our qmv kernels that was recently fixed. It might be possible (but I think pretty unlikely) that this would account for the looping behavior you are seeing. Do you only notice looping for the very long prompt, or do you also see it for shorter prompts? Also, you have a few arguments set. I'm wondering: if you turn them off or change them, does the looping go away?
I've tried without specifying max-kv-size, temp, top_p, and seed, but they all looped exactly the same way.
Ok. Let's see if it's fixed after our next release (which includes a fix for quantization in some cases, ml-explore/mlx#1577). If it's not fixed, I will try fuzzing around a bit to see if it can be reproduced on our side.
Do you guys need to requantize and update the model on HF, or can I just pull the main branch, install with pip install -e ., and test? |
No, you can use the same model (no requantization needed). You can test by pulling and building the main branch. It would be great to know whether that works for you or not.
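A rough sketch of what pulling and building the MLX main branch looks like (assumes a working CMake/Xcode toolchain; build flags may differ on your machine):

```bash
git clone https://github.com/ml-explore/mlx.git
cd mlx
pip install . --force-reinstall   # builds the Metal kernels from source
# then rerun the failing mlx_lm.generate command against the same 4-bit model
```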
Oh, that was merged 4 days ago, and I had already pulled the latest main branch when I tested. So the fix didn't work. :(
Ok, so I might have some potentially good news that could lead to something...
Interesting... You can see all the commits between v0.18.1 and v0.19.0 here: ml-explore/mlx@v0.18.1...v0.19.0. The commit in there that seems most likely to have changed something for LLM inference is the fused attention. Can you try building and testing the commits before and after to see if that is the case? So concretely, for including the fused attention:
And the commit just before:
That would be my first guess as to a related cause, but it would be good to check and see.
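The per-commit build step for a comparison like this is roughly the following (a sketch; `<commit>` is whichever SHA is being tested):

```bash
cd mlx
git checkout <commit>             # e.g. the fused-attention commit or its parent
pip install . --force-reinstall   # rebuild MLX at that commit
# rerun the same mlx_lm.generate command and compare the outputs
```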
Yep, that was it! Commit 50d8bed loops, but 9dd72cd doesn't.
Could you try running with Metal validation enabled to see if that gives us any clues? (Low probability, but when it hits, it hits well):
Also you can precompute the prompt cache to speed testing up:
Then use that to generate:
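The exact commands aren't shown in this excerpt. The prompt-cache workflow in mlx-lm is roughly along these lines (a sketch: the `mlx_lm.cache_prompt` entry point and the `--prompt-cache-file` flag are my assumptions about that era of mlx-lm and may differ from what was actually suggested):

```bash
# Assumed interface: process the 32k-token prompt once and save the KV cache.
mlx_lm.cache_prompt \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt-cache-file portugal.safetensors \
  --prompt - < ../text/portugal.txt

# Assumed interface: reuse the cached prompt on every generation run,
# so only the new tokens need to be computed.
mlx_lm.generate \
  --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt-cache-file portugal.safetensors \
  --max-tokens 1000 --temp 0.0 --seed 1000 \
  --prompt ""
```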
@chigkim Adding on to @awni's last message: can you run the following commands and report back the outputs?
The first two should have exactly the same output, looping or not. The next four should all have different outputs.
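The six commands themselves aren't reproduced above. As an illustration of the first check (my own sketch, not the original commands): with greedy decoding and a fixed seed, two identical runs should produce identical text, which can be verified with a diff once the per-run timing footer is stripped:

```bash
run_once () {
  mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
    --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --seed 1000 \
    --prompt - < ../text/portugal.txt \
    | grep -v -e "tokens-per-sec" -e "Peak memory"   # drop lines that vary between runs
}
run_once > run1.txt
run_once > run2.txt
diff run1.txt run2.txt && echo "generated text is identical"
```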
Hmm, the output doesn't seem to be any different. Am I supposed to look for something?
% METAL_DEVICE_WRAPPER_TYPE=1 METAL_DEBUG_ERROR_MODE=0 mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --max-kv-size 33000 --max-tokens 1000 --temp 0.0 --top-p 0.9 --seed 1000 --prompt -<../text/portugal.txt;say done
2024-11-13 22:46:18.124 python[16522:3704956] Metal API Validation Enabled
Fetching 6 files: 100%|████████████████████████| 6/6 [00:00<00:00, 45507.82it/s]
==========
Prompt: <|begin_of_text|><|start_header_id|>user<|end_header_id|>
Summarize the following:
......
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Portugal is a country located on the Iberian Peninsula in southwestern Europe.
.....
==========
Prompt: 32134 tokens, 414.286 tokens-per-sec
Generation: 1000 tokens, 32.263 tokens-per-sec
Peak memory: 12.535 GB
Nope, it would have been obvious if it threw a validation error. Thanks for checking.
Ooo, we're getting somewhere... |
I'm on mlx-lm v0.19.1.
Running the following command with the 4-bit model produced a bug where it generated the full 1000 max-tokens, repeating the last two paragraphs over and over.
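The command isn't included in this excerpt; based on the invocation echoed earlier in the thread (minus the Metal validation variables and the trailing `say done` chime), it was along these lines:

```bash
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --max-kv-size 33000 --max-tokens 1000 \
  --temp 0.0 --top-p 0.9 --seed 1000 \
  --prompt - < ../text/portugal.txt
```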
Here is my full prompt, taken from a Wikipedia article.
Just to see what happens, I increased --max-kv-size to 34k and --max-tokens to 2000, and it generated 2k tokens with the loop.
Running the exact same command with 8bit in place of 4bit generated the correct text (811 tokens) and stopped at the end without looping.