Trying to speed up responses. #5784
-
One thing to keep in mind is that the bigger the context (essentially, the prompt), the more the model has to process when generating new text. If you don't need it to remember things from earlier in the conversation, there's a setting in the Parameters tab to truncate the prompt: it crops off the top once the prompt reaches that size and keeps it that way. Keep in mind that, depending on the system you're using, there may be some special content inserted at the beginning that doesn't get reinserted, and that could lead to the model forgetting some important instructions (sometimes you won't notice it, because the AI may recognize the pattern in what remains of the text and go along with it; but even then it may start drifting over time). A rough sketch of the idea is below.
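For anyone curious what that kind of truncation looks like, here is a minimal Python sketch of the concept, not the webui's actual implementation; the function name, the tokenizer interface, and the prompt layout are all illustrative assumptions. It crops the oldest turns first while always re-inserting the system prompt, which is exactly the part that can get lost if a naive truncation just chops from the top:

```python
# Minimal sketch of system-prompt-preserving context truncation.
# Assumptions (not the webui's real API): the tokenizer exposes
# .encode(text) -> list of token ids, and the prompt is built as
# system prompt + newline-joined conversation turns.

def truncate_prompt(system_prompt: str, history: list[str],
                    tokenizer, max_tokens: int) -> str:
    """Drop the oldest history turns until the prompt fits the budget,
    keeping the system prompt pinned at the top."""
    def n_tokens(text: str) -> int:
        return len(tokenizer.encode(text))

    kept = list(history)
    # Crop from the top (oldest turns first) until everything fits.
    while kept and n_tokens(system_prompt) + sum(n_tokens(t) for t in kept) > max_tokens:
        kept.pop(0)
    return "\n".join([system_prompt] + kept)
```

A truncation that instead sliced the raw token stream from the front would silently eat the system prompt, which is the failure mode described above.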
-
I am still very new to this, and I am playing around with different models and such, but I have noticed that my responses are slow and getting slower.
```
Output generated in 424.34 seconds (0.12 tokens/s, 49 tokens, context 2379, seed 1092770479)
Output generated in 479.82 seconds (0.24 tokens/s, 116 tokens, context 2520, seed 1564826966)
```
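(For reference, the tokens/s figure in those lines is just the token count divided by the wall-clock time; a quick Python check reproduces it:)

```python
# Reproduce the reported throughput from the two log lines: tokens / seconds.
for seconds, tokens in [(424.34, 49), (479.82, 116)]:
    print(f"{tokens / seconds:.2f} tokens/s")  # prints 0.12 and 0.24
```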
What can I do to speed this up? I am currently using TheBloke_LLaMA2-13B-Tiefighter-AWQ.
Any help or insight is appreciated. If you need anything else in terms of gear, I am using an RTX 3070 and a 12th Gen i7-12700K (20 cores).