[DIRTY] Using m1 intrinsics for f16xf16 #4
Closed
This is a very dirty PR, more a POC than anything else at this point.

`half-rs` is using a fork (starkat99/half-rs#98) to get some intrinsics for pure f16 computing that do not currently exist. I then hackishly added them into gemm: I copy-pasted the code for f16 gemm (which does f16 -> f32 SIMD -> matmul -> f16) so that it works purely in `f16 -> f16`.

The code requires `black_box` atm for the compiler to be happy. This is most likely an error of mine in the `half-rs` intrinsics implementation (I used the `arm!` macro but do not understand how that affects the compiler).

I didn't re-optimize this afterwards to make sure cache lines were adapted or anything of the sort.
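For readers unfamiliar with the existing path (f16 -> f32 -> matmul -> f16), here is a minimal scalar sketch of the idea. It is not the gemm code from this PR: it assumes f16 values are stored as raw binary16 bits in a `u16` (stable Rust has no built-in f16), uses no SIMD, and every function name is hypothetical. The pure `f16 -> f16` path needs hardware fp16 arithmetic and cannot be shown this way.

```rust
// Sketch only: f16 values are raw IEEE 754 binary16 bits in a u16.
// All names here are hypothetical and not taken from gemm or half-rs.

/// Widen one binary16 value to f32 (exact: every f16 is representable in f32).
fn f16_to_f32(h: u16) -> f32 {
    let sign = if h & 0x8000 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((h >> 10) & 0x1f) as i32;
    let man = (h & 0x3ff) as f32;
    let mag = match exp {
        0 => man * 2f32.powi(-24), // zero or subnormal
        31 => {
            if man == 0.0 { f32::INFINITY } else { f32::NAN }
        }
        _ => (1.0 + man / 1024.0) * 2f32.powi(exp - 15), // normal
    };
    sign * mag
}

/// Narrow an f32 to binary16 bits with round-to-nearest-even.
fn f32_to_f16(x: f32) -> u16 {
    let bits = x.to_bits();
    let sign = ((bits >> 16) & 0x8000) as u16;
    let exp = ((bits >> 23) & 0xff) as i32;
    let man = bits & 0x7f_ffff;
    if exp == 255 {
        // Infinity or NaN (NaN keeps a non-zero mantissa bit).
        let payload: u16 = if man != 0 { 0x200 } else { 0 };
        return sign | 0x7c00 | payload;
    }
    let e = exp - 127 + 15; // re-biased exponent
    if e >= 31 {
        return sign | 0x7c00; // too large: overflow to infinity
    }
    if e <= 0 {
        if e < -10 {
            return sign; // too small: flush to signed zero
        }
        // Subnormal result: shift the 24-bit significand into place.
        let m = man | 0x80_0000;
        let shift = (14 - e) as u32;
        let half = 1u32 << (shift - 1);
        let rounded = m + (half - 1) + ((m >> shift) & 1);
        return sign | (rounded >> shift) as u16;
    }
    // Normal result: a rounding carry correctly bumps the exponent.
    let rounded = man + 0xfff + ((man >> 13) & 1);
    sign | (((e as u32) << 10) + (rounded >> 13)) as u16
}

/// Row-major C = A * B: widen inputs to f32, accumulate in f32, narrow once.
fn gemm_f16_via_f32(a: &[u16], b: &[u16], m: usize, k: usize, n: usize) -> Vec<u16> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let a32: Vec<f32> = a.iter().map(|&h| f16_to_f32(h)).collect();
    let b32: Vec<f32> = b.iter().map(|&h| f16_to_f32(h)).collect();
    let mut c = vec![0u16; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a32[i * k + p] * b32[p * n + j];
            }
            c[i * n + j] = f32_to_f16(acc);
        }
    }
    c
}

/// Usage: multiply two 2x2 matrices and read the result back as f32.
fn example_2x2() -> Vec<f32> {
    let a: Vec<u16> = [1.0f32, 2.0, 3.0, 4.0].iter().map(|&x| f32_to_f16(x)).collect();
    let b: Vec<u16> = [5.0f32, 6.0, 7.0, 8.0].iter().map(|&x| f32_to_f16(x)).collect();
    gemm_f16_via_f32(&a, &b, 2, 2, 2).iter().map(|&h| f16_to_f32(h)).collect()
}
```

The two conversion passes and the f32 accumulator are the overhead a pure f16 kernel would avoid, at the cost of accumulating in lower precision.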
Current results:
For reference, Accelerate seems to do ~25ms for the same op, and threading seems to decrease performance on it (which I guess is because Accelerate already uses threading underneath).