In your conclusion, MPS performance is worse than llama.cpp's CPU performance at the same fp16 precision. Why? Is there some kernel that MPS doesn't support and that falls back to the CPU (which would hurt performance)?
You also said this:
I figure you mean that MPS shaders are compiled just-in-time, so performance is worse than ahead-of-time compiled CPU code? Am I wrong?
Hi, I wish I could give you a definitive answer, but unfortunately I am not familiar enough with PyTorch's MPS implementation to be able to confirm or deny your theory...
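One way to probe the fallback theory empirically: PyTorch raises an error when an op has no MPS kernel, unless the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable is set, in which case the op falls back to the CPU and emits a `UserWarning`. A minimal sketch (the matmul below is just a stand-in for whichever op you suspect; matmul itself is normally supported on MPS):

```python
import os

# Must be set before torch is imported: lets ops without an MPS kernel
# fall back to the CPU (with a UserWarning) instead of raising an error.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import warnings
import torch

assert torch.backends.mps.is_available(), "MPS backend not available"

x = torch.randn(1024, 1024, device="mps")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y = x @ x  # stand-in op; substitute the one you suspect is unsupported
    torch.mps.synchronize()  # make sure the kernel actually ran

# Any warning mentioning a CPU fallback means that op has no MPS kernel.
for w in caught:
    print(w.category.__name__, w.message)
```

If nothing is printed, the op ran natively on MPS, and a CPU fallback is probably not what is costing the performance.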
It's bandwidth. The model is bottlenecked by how quickly the processor can fetch weights from RAM. FP16 consumes 4x as many bits as Int4, and thus is 4x slower.
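As a rough back-of-the-envelope check of that 4x figure (the 7B parameter count and 100 GB/s bandwidth below are illustrative assumptions, not measurements of any particular machine):

```python
# Decoding is memory-bound: every weight is read once per generated
# token, so tokens/sec is roughly bandwidth / bytes-read-per-token.
params = 7e9        # assumed model size, e.g. a 7B-parameter model
bandwidth = 100e9   # assumed memory bandwidth in bytes/sec

for name, bits_per_weight in [("fp16", 16), ("int4", 4)]:
    bytes_per_token = params * bits_per_weight / 8
    print(f"{name}: ~{bandwidth / bytes_per_token:.1f} tokens/sec")

# fp16 reads ~14 GB per token vs ~3.5 GB for int4, hence the ~4x gap.
```

On Apple Silicon the CPU and GPU read weights from the same unified memory, so this arithmetic bounds both the llama.cpp CPU path and the MPS path alike.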