Some of my benchmarks for v1.56 #646
TheBill2001 started this conversation in General
Hello! I decided to run some tests with the new release and try out Vulkan. These are the results I gathered from one of my Story mode saves.
System info:
Model info:
General launch config:
Note: For these tests, there was nothing using the GPU at all except for KoboldCpp. The nice thing about running Linux is that you can just turn off the desktop environment and use the TTY. Even then, about 250 MiB is used constantly, which I assume is driver overhead.
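For reference, something along these lines works for dropping to a TTY and checking idle VRAM — a minimal sketch, assuming a systemd-based distro and an NVIDIA card for the VRAM query; adjust for your init system and GPU vendor:

```sh
# Stop the graphical session and drop to a text-mode TTY
# (assumes systemd; multi-user.target is the non-graphical target)
sudo systemctl isolate multi-user.target

# Check how much VRAM is still in use with nothing else running
# (assumes an NVIDIA GPU; the constant ~250 MiB is likely driver overhead)
nvidia-smi --query-gpu=memory.used --format=csv

# Return to the desktop environment afterwards
sudo systemctl isolate graphical.target
```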
| Backend | Processing | Generation | Total |
| --- | --- | --- | --- |
| CuBLAS | 51.6ms/T | 551.5ms/T | 5311.9ms/T = 0.19T/s |
| CuBLAS + `lowvram` | 52.2ms/T | 441.6ms/T | 5256.8ms/T = 0.19T/s |
| CuBLAS + `lowvram` | 52.2ms/T | 430.2ms/T | 5213.6ms/T = 0.19T/s |
| CuBLAS + `lowvram` | 51.7ms/T | 434.0ms/T | 5204.1ms/T = 0.19T/s |
| CuBLAS + `lowvram` | 51.5ms/T | 427.0ms/T | 5182.0ms/T = 0.19T/s |
| Vulkan | 27.5ms/T | 666.2ms/T | 3200.8ms/T = 0.31T/s |
| Vulkan | 26.0ms/T | 581.1ms/T | 2979.0ms/T = 0.34T/s |
So Vulkan is significantly faster at processing than CuBLAS, but slower at generating.
For CuBLAS without `lowvram`, the maximum number of layers I can offload with 8K context size is 12. With `lowvram` enabled, I can offload up to 27 layers. That is a lot more than before v1.52, which added the partial per-layer KV offloading merge: before v1.52, I could only offload 18 layers (similar to what is stated in the Wiki), or 20 if I disabled the desktop environment.

For Vulkan, I can offload up to 8 layers before OOM. The generation speed is slower than CuBLAS, but when using Story mode with a lot of world info and memory changing, I would say Vulkan is very nice.
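For concreteness, launch commands along these lines correspond to the setups above — a sketch, not my exact config; the model filename is a placeholder, and the flags are the standard KoboldCpp command-line options:

```sh
# CuBLAS with lowvram: up to 27 layers fit at 8K context on my card
# (model.gguf is a placeholder for whatever model you are testing)
python koboldcpp.py --model model.gguf --usecublas lowvram --gpulayers 27 --contextsize 8192

# Same model on the Vulkan backend: only 8 layers fit before OOM here
python koboldcpp.py --model model.gguf --usevulkan --gpulayers 8 --contextsize 8192
```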