[DIRTY] Using m1 intrinsics for f16xf16 #4

Narsil · 2023-08-01T08:51:27Z

This is a very dirty PR, more a POC than anything else at this point.

It seems to work and be correct. (It passes in every scenario I tried.)
It is faster than without the intrinsics.

half-rs is taken from a fork (starkat99/half-rs#98) to get some intrinsics for pure f16 computation that do not exist upstream yet.

Then I hackishly added them into gemm:

Copy-pasted the code for f16 gemm (which does f16 -> f32simd -> matmul -> f16) to do purely f16 -> f16.

The code requires black_box atm for the compiler to be happy. This is most likely an error of mine in the half-rs intrinsics implementation (I used the arm! macro but do not understand how that affects the compiler).

I didn't re-optimize this afterwards to make sure cache lines were adapted or anything of the sort.

Current results:

GGML WITHOUT ACCELERATE (f32xf16) -> f32: 220ms (1 thread) - 197ms (8 threads)
GEMM (f16xf16) -> f16: 340ms (1 thread) - 110ms (8 threads)
M, N, K: 4096 x 128 x 11108

For reference, Accelerate seems to do ~25ms for the same op, and threading seems to decrease performance on it (which I guess is because Accelerate already uses threading underneath).
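
To put the timings above on a common scale, they can be converted to throughput. This is only a rough sketch: it assumes the usual 2 * M * N * K flop count for a matmul (one multiply and one add per inner-product term), and `gflops` is a hypothetical helper written for this illustration, not something from gemm or half-rs.

```rust
/// Hypothetical helper: throughput in GFLOP/s for an (M x K) by (K x N) matmul,
/// assuming 2 * M * N * K floating point operations in total.
fn gflops(m: u64, n: u64, k: u64, millis: f64) -> f64 {
    let flops = 2.0 * (m * n * k) as f64;
    // Convert milliseconds to seconds, then flop/s to GFLOP/s.
    flops / (millis / 1000.0) / 1e9
}

fn main() {
    // Problem size from the results above.
    let (m, n, k) = (4096, 128, 11108);

    // Timings in milliseconds, copied from the results above.
    let runs = [
        ("ggml (f32xf16) -> f32, 1 thread", 220.0),
        ("ggml (f32xf16) -> f32, 8 threads", 197.0),
        ("gemm (f16xf16) -> f16, 1 thread", 340.0),
        ("gemm (f16xf16) -> f16, 8 threads", 110.0),
        ("Accelerate (approximate)", 25.0),
    ];

    for (name, ms) in runs {
        println!("{name}: {:.1} GFLOP/s", gflops(m, n, k, ms));
    }
}
```

Under that assumption, the 8-thread f16 gemm run lands at roughly 106 GFLOP/s and Accelerate at roughly 466 GFLOP/s, so there is still about a 4x gap to close.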

Using m1 intrinsics for f16xf16

c7a1ceb

Narsil mentioned this pull request Aug 1, 2023

M1 f16 intrinsics sarah-quinones/gemm#13

Open

Removing black box.

a8f0280

Narsil marked this pull request as draft August 1, 2023 09:35

Narsil closed this Aug 1, 2023
