I noticed that leaving MMQ on drastically speeds up prompt processing on my single 3090 GPU. This confuses me, since I'm using K-quants and my GPU has tensor cores, which supposedly should be faster than the MMQ kernels. Am I doing something wrong, or is this the expected behavior?
Edit: Not sure if this matters, but I usually offload only about 60% of the model's layers to VRAM and keep the rest in DDR4 system RAM, so I can run larger models at higher BPWs.
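For context, my invocation looks roughly like this (a minimal sketch; exact flag names vary between llama.cpp versions, and the model path and layer count below are placeholders rather than my exact values):

```
# -ngl / --n-gpu-layers : how many layers to offload to VRAM (~60% of the model here)
# --mul-mat-q           : use the MMQ kernels for quantized matmuls
#                         (on some builds this is already the default)
./main -m models/some-70b.Q4_K_M.gguf \
  --n-gpu-layers 48 \
  --mul-mat-q \
  -p "example prompt" -n 128
```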