In your conclusion, MPS performance is worse than llama.cpp's CPU performance at the same fp16 precision. Why? Is there some kernel that MPS doesn't support and that falls back to the CPU (which would hurt performance)?
You also said this:
I figure you mean that MPS shaders are compiled just-in-time, so performance is worse than ahead-of-time compiled CPU code? Am I wrong?
Hi, I wish I could give you a definitive answer, but unfortunately I am not familiar enough with PyTorch's MPS implementation to be able to confirm or deny your theory...
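One way to probe the fallback theory empirically: PyTorch raises an error when an op has no MPS kernel, unless the `PYTORCH_ENABLE_MPS_FALLBACK` environment variable is set, in which case the op falls back to the CPU and emits a `UserWarning`. A minimal sketch (the matmul below is just a stand-in for whichever op you suspect; matmul itself is normally supported on MPS):

```python
import os

# Must be set before torch is imported: lets ops without an MPS kernel
# fall back to the CPU (with a UserWarning) instead of raising an error.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import warnings
import torch

assert torch.backends.mps.is_available(), "MPS backend not available"

x = torch.randn(1024, 1024, device="mps")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y = x @ x  # stand-in op; substitute the one you suspect is unsupported
    torch.mps.synchronize()  # make sure the kernel actually ran

# Any warning mentioning a CPU fallback means that op has no MPS kernel.
for w in caught:
    print(w.category.__name__, w.message)
```

If nothing is printed, the op ran natively on MPS, and a CPU fallback is probably not what is costing the performance.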
It's bandwidth. The model is bottlenecked by how quickly the processor can fetch weights from RAM. FP16 consumes 4x as many bits as Int4, and thus is 4x slower.
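As a rough back-of-the-envelope check of that 4x figure (the 7B parameter count and 100 GB/s bandwidth below are illustrative assumptions, not measurements of any particular machine):

```python
# Decoding is memory-bound: every weight is read once per generated
# token, so tokens/sec is roughly bandwidth / bytes-read-per-token.
params = 7e9        # assumed model size, e.g. a 7B-parameter model
bandwidth = 100e9   # assumed memory bandwidth in bytes/sec

for name, bits_per_weight in [("fp16", 16), ("int4", 4)]:
    bytes_per_token = params * bits_per_weight / 8
    print(f"{name}: ~{bandwidth / bytes_per_token:.1f} tokens/sec")

# fp16 reads ~14 GB per token vs ~3.5 GB for int4, hence the ~4x gap.
```

On Apple Silicon the CPU and GPU read weights from the same unified memory, so this arithmetic bounds both the llama.cpp CPU path and the MPS path alike.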