Some of my benchmarks for v1.56 #646
TheBill2001 started this conversation in General
Hello! I decided to run some tests with the new release and try out Vulkan. These are the results I gathered from one of my Story mode saves.
System info:
Model info:
General launch config:
Note: For these tests, there was nothing using the GPU at all except for KoboldCpp. The nice thing about running Linux is that you can just turn off the desktop environment and use the TTY. Even then, about 250 MiB is used constantly, which I assume is driver overhead.
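For reference, something along these lines works for dropping to a TTY and checking idle VRAM — a minimal sketch, assuming a systemd-based distro and an NVIDIA card for the VRAM query; adjust for your init system and GPU vendor:

```sh
# Stop the graphical session and drop to a text-mode TTY
# (assumes systemd; multi-user.target is the non-graphical target)
sudo systemctl isolate multi-user.target

# Check how much VRAM is still in use with nothing else running
# (assumes an NVIDIA GPU; the constant ~250 MiB is likely driver overhead)
nvidia-smi --query-gpu=memory.used --format=csv

# Return to the desktop environment afterwards
sudo systemctl isolate graphical.target
```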
| Backend | Processing | Generation | Total |
| --- | --- | --- | --- |
| CuBLAS | 51.6ms/T | 551.5ms/T | 5311.9ms/T = 0.19T/s |
| CuBLAS + `lowvram` | 52.2ms/T | 441.6ms/T | 5256.8ms/T = 0.19T/s |
| CuBLAS + `lowvram` | 52.2ms/T | 430.2ms/T | 5213.6ms/T = 0.19T/s |
| CuBLAS + `lowvram` | 51.7ms/T | 434.0ms/T | 5204.1ms/T = 0.19T/s |
| CuBLAS + `lowvram` | 51.5ms/T | 427.0ms/T | 5182.0ms/T = 0.19T/s |
| Vulkan | 27.5ms/T | 666.2ms/T | 3200.8ms/T = 0.31T/s |
| Vulkan | 26.0ms/T | 581.1ms/T | 2979.0ms/T = 0.34T/s |
So Vulkan is significantly faster at processing than CuBLAS, but slower at generating.
For CuBLAS without `lowvram`, the maximum number of layers I can offload with 8K context size is 12. With `lowvram` enabled, I can offload up to 27 layers. That is a lot more than before v1.52, which added the partial per-layer KV offloading merge: before v1.52, I could only offload 18 layers (similar to what is stated in the Wiki), or 20 if I disabled the desktop environment.

For Vulkan, I can offload up to 8 layers before OOM. The generation speed is slower than CuBLAS, but when using Story mode with a lot of world info and memory changing, I would say Vulkan is very nice.
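For concreteness, launch commands along these lines correspond to the setups above — a sketch, not my exact config; the model filename is a placeholder, and the flags are the standard KoboldCpp command-line options:

```sh
# CuBLAS with lowvram: up to 27 layers fit at 8K context on my card
# (model.gguf is a placeholder for whatever model you are testing)
python koboldcpp.py --model model.gguf --usecublas lowvram --gpulayers 27 --contextsize 8192

# Same model on the Vulkan backend: only 8 layers fit before OOM here
python koboldcpp.py --model model.gguf --usevulkan --gpulayers 8 --contextsize 8192
```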