Garbage output on 30B model? #4
Comments
Hi, can you post an MD5 hash of your re-sharded 30B model? I had one more report of somebody having the same problem as you, but for me it worked fine. I re-sharded mine on a Windows PC with 128 GB RAM. As for 65B llama.cpp being worse than 13B here, that's because by default the llama.cpp readme instructs you to quantize the model from float16 to 4-bit precision, which decreases the amount of RAM needed but also leads to worse generated text. You can think of the 65B 4-bit model as having more knowledge than the 13B one, but being less able to communicate that knowledge due to the lowered precision. You can try to skip the quantization step and run the fp16 13B model with llama.cpp.
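For reference, one way to produce comparable hashes is to MD5 each re-sharded checkpoint file. Below is a minimal sketch; the `./models/30B_resharded/` directory and the `*.pth` naming are placeholders for wherever your reshard step wrote its output.

```python
import hashlib
from pathlib import Path

# Placeholder location of the re-sharded checkpoint; adjust to your setup.
SHARD_DIR = Path("./models/30B_resharded")

def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through MD5 so large shards don't need to fit in RAM."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

for shard in sorted(SHARD_DIR.glob("*.pth")):
    print(f"{md5_of_file(shard)}  {shard.name}")
```

Running the same script on both machines and diffing the output would quickly rule out a corrupted or mismatched reshard.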
Quantization is one part of the difference, but even the fp16 model is inferior to this project. I have reason to believe llama.cpp has bugs which cause quality degradation; my guess is they're aggravated by scale. I did this conversion on Linux x64 on a 200 GB system.
Well, it's good to have solid confirmation that the resharding isn't the issue! Here is my equivalent screenshot: 13B is plausible while 30B is not. Just to throw some other things out there, this is an M1 Max with 64 GB, Ventura 13.2.1 (22D68), Python 3.9.13, PyTorch 1.13.1. I did note that in your example the continuation does not follow grammatically from the prompt, but it is at least on the same topic. Probably not related, but you never know.
Strange, basically identical configuration to my setup... Could you try to pull the latest code from the repository and see if it gives you better results?
Hi @jankais3r, how can I download the Alpaca 30B and 65B models? I found the 30B here, but I need to figure out how to convert it.
Hi @LeiHao0, the 30B model you linked unfortunately won't work with LLaMA_MPS, because it has been quantized to 4-bit precision from the original float16. If you find a float16 Alpaca 30B model, let me know and I will add support for it in the script. Actually, I think it should be possible to run this model, but I am not sure how I feel about introducing q4 models to this repo... Maybe I'll add it with an identified
Hi @jankais3r, I found a language model called alpaca-30b on the Hugging Face website, which was trained in 8-bit mode. Although the model consumes 80 GB of memory (24-30 GB of which goes to swap), it appears to be working, since it's generating a few words now.
Using an M2 Max with 96 GB. Also got gibberish (and extremely slow) output for LLaMA 30B: "agneanal поча althoughUsers substtemperatureniulice"
I've been trying to get the 30B model up but my output is total garbage. Example:
Trust Delete викори геCES/$ свобо voicepull mediumвриLongrightarrow fed NormalJe zespo installer пробле конце attacks genu genericituteAX language hy Jurpring lange)))) Архивная
This is in contrast to the 7B and 13B models, which work well. From the readme, it seems like someone at least managed to measure memory usage on 30B, but is there any indication they were able to produce reasonable text?
Possibly related: I needed to do the resharding and arrow conversion on a second machine with more RAM. Could there be a problem with doing the conversion on a different machine than the one used for inference? Is there a 'known good' arrow version I could compare against? (See the checkpoint sanity-check sketch below.)
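As a cross-machine sanity check, a minimal sketch like the following loads one converted shard on the inference machine and flags tensors containing non-finite values, which would point at corruption introduced during conversion or transfer rather than a bug in the inference code. The file path is a placeholder; the exact layout the conversion scripts produce may differ.

```python
import torch

# Placeholder path; point this at one of the converted checkpoint files
# produced on the other machine after copying it over.
CKPT_PATH = "./models/30B/consolidated.00.pth"

# Load onto CPU so this works regardless of the MPS setup.
state = torch.load(CKPT_PATH, map_location="cpu")

# Count tensors containing NaN/Inf values and print their details.
bad = 0
for name, tensor in state.items():
    if not torch.is_tensor(tensor):
        continue
    non_finite = (~torch.isfinite(tensor.float())).sum().item()
    if non_finite:
        bad += 1
        print(f"{name}: shape={tuple(tensor.shape)}, dtype={tensor.dtype}, non-finite={non_finite}")
print(f"{bad} tensors with non-finite values out of {len(state)} entries")
```

If both machines report zero non-finite values and identical hashes, the conversion step is probably not the culprit.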
Side note: the readme frames the comparison with llama.cpp as being about performance, but in my experience the 13B model here produces much better quality output than 65B under llama.cpp. I have several theories about why that is, but they all suggest that getting 30B working here would produce stronger output than the alternatives.