-
I have a Proxmox server and I loaded a Debian 11 LXC in it just for koboldcpp. I have given it 16GB RAM, so there's plenty of memory for the model. The CPU of the host is an AMD Ryzen 5 PRO 4650G; I have no GPU. I have OpenBLAS installed on the container just like the instructions here explain. When I send a message, the "Processing Prompt [BLAS]" part goes pretty quickly (considering it's running on just a CPU) and I can see my CPU is pretty much fully used during that moment. But during the "Generating (x / 250 tokens)" part, CPU usage drops to under 5% (usually around 2-3%) and it takes forever to progress to each token. I've tried changing the thread count to various values like 5, 8, 10, and even 20 one time, but it changes nothing. Am I doing something stupid here?

edit: I discovered something that might be a clue: it seems to be a problem only for GGUF model files specifically. I downloaded another model that was a .bin and it was super fast. But any GGUF I try has the above problem.
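For what it's worth, koboldcpp's `--threads` setting generally works best at the physical core count, not the SMT thread count (6 on a Ryzen 5 PRO 4650G, not 12). A minimal sketch of picking a sane value, assuming a typical koboldcpp invocation (flag names may differ slightly between versions, and the model path is a placeholder):

```shell
# nproc reports logical CPUs (12 on a 4650G with SMT), so halve it
# to get a reasonable physical-core estimate.
logical=$(nproc)
threads=$(( logical / 2 ))
[ "$threads" -lt 1 ] && threads=1
echo "suggested --threads: $threads"

# Hypothetical launch line; adjust the model filename to your own:
# python koboldcpp.py --threads "$threads" your-model.Q4_K_S.gguf
```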
-
What model did you use? Link the GGUF so I can take a look. You can also try running without OpenBLAS.
-
I get 17.7GB RAM usage with a 13B model and 17 layers offloaded on an 8GB GPU. This lets it sit at 95%+ utilization on an 8370. That new fancy smarter-context magic thing seems to have really sped things up considerably.
16GB of RAM is not enough to load a 20B model. That's why it's so slow: you are probably hitting swap.
Please try a 7B model instead; maybe this one would be good: https://huggingface.co/TheBloke/airoboros-mistral2.2-7B-GGUF/blob/main/airoboros-mistral2.2-7b.Q4_K_S.gguf
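You can confirm the swap theory directly inside the container. A quick check, assuming a standard Linux `free`/`awk` setup (run it while koboldcpp is mid-generation):

```shell
# If the model doesn't fit in RAM, the kernel pushes pages to swap and
# token generation crawls while CPU sits nearly idle.
free -m

# Pull out just the "used" swap column (in MiB) from the Swap: line.
swap_used=$(free -m | awk '/^Swap:/ {print $3}')
echo "swap in use: ${swap_used} MiB"
```

A large and growing `swap in use` number during generation, combined with your 2-3% CPU usage, would match the swapping symptom exactly.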