-
I have a Proxmox server and I loaded a Debian 11 LXC in it just for koboldcpp. I have given it 16GB RAM, so there's plenty of memory for the model. The CPU of the host is an AMD Ryzen 5 PRO 4650G; I have no GPU. I have OpenBLAS installed on the container just like the instructions here explain. When I send a message, the "Processing Prompt [BLAS]" part goes pretty quickly (considering it's running on just a CPU) and I can see my CPU is pretty much fully used during that moment. But during the "Generating (x / 250 tokens)" part, CPU usage drops to under 5% (usually around 2-3%) and it takes forever to progress to each token. I've tried changing the thread count to various values like 5, 8, 10, and even 20 one time, but it changes nothing. Am I doing something stupid here?

edit: I discovered something that might be a clue: it seems to be a problem only for GGUF model files specifically. I downloaded another model that was a .bin and it was super fast. But any GGUF I try has the above problem.
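For what it's worth, koboldcpp's `--threads` setting generally works best at the physical core count, not the SMT thread count (6 on a Ryzen 5 PRO 4650G, not 12). A minimal sketch of picking a sane value, assuming a typical koboldcpp invocation (flag names may differ slightly between versions, and the model path is a placeholder):

```shell
# nproc reports logical CPUs (12 on a 4650G with SMT), so halve it
# to get a reasonable physical-core estimate.
logical=$(nproc)
threads=$(( logical / 2 ))
[ "$threads" -lt 1 ] && threads=1
echo "suggested --threads: $threads"

# Hypothetical launch line; adjust the model filename to your own:
# python koboldcpp.py --threads "$threads" your-model.Q4_K_S.gguf
```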
-
What model did you use? Link the GGUF so I can take a look. You can also try running without OpenBLAS.
-
I get 17.7GB RAM usage with a 13B model and 17 layers offloaded on an 8GB GPU. This lets it sit at 95%+ utilization on an 8370. That new fancy smarter-context magic thing seems to have really sped things up considerably.
16GB of RAM is not enough to load a 20B model. That's why it's so slow: you are probably hitting swap.
Please try a 7B model instead; maybe this one would be good: https://huggingface.co/TheBloke/airoboros-mistral2.2-7B-GGUF/blob/main/airoboros-mistral2.2-7b.Q4_K_S.gguf
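You can confirm the swap theory directly inside the container. A quick check, assuming a standard Linux `free`/`awk` setup (run it while koboldcpp is mid-generation):

```shell
# If the model doesn't fit in RAM, the kernel pushes pages to swap and
# token generation crawls while CPU sits nearly idle.
free -m

# Pull out just the "used" swap column (in MiB) from the Swap: line.
swap_used=$(free -m | awk '/^Swap:/ {print $3}')
echo "swap in use: ${swap_used} MiB"
```

A large and growing `swap in use` number during generation, combined with your 2-3% CPU usage, would match the swapping symptom exactly.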