Add ARM Neon and scalar implementations of SIMD functions #359

The previous code implicitly caused a load; change it so the load intrinsic is explicitly invoked, as the others are. (This in fact makes no difference to the generated code.)

Many Intel intrinsics have a corresponding Neon equivalent. Other cases are more interesting:

* Neon's vmaxvq directly selects the maximum entry in a vector, so can be used to implement both the __max_16/__max_8 macros and the _mm_movemask_epi8 early loop exit. Introduce additional helper macros alongside __max_16/__max_8 so that the early loop exit can similarly be implemented differently on the two platforms.

* Full-width shifts can be done via vextq. This is defined close to the ksw_u8()/ksw_i16() functions (rather than in neon_sse.h) as it implicitly uses one of their local variables.

* ksw_i16() uses saturating *signed* 16-bit operations apart from _mm_subs_epu16; presumably the data is effectively still signed but we wish to keep it non-negative. The ARM intrinsics are more careful about type checking, so this requires an extra U16() helper macro.
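To make the first two points concrete, here is a minimal sketch of how a horizontal maximum, an "all lanes zero" early-exit test, and a one-byte full-width shift might be written for SSE2, AArch64 Neon, and plain scalar C. The names vec_u8, vec_load, vec_store, vec_hmax, vec_none_gt and vec_shl1 are hypothetical and chosen for illustration only; they are not the macros this PR adds, and the PR's actual helpers may be structured differently.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i vec_u8;
static inline vec_u8 vec_load(const uint8_t *p) { return _mm_loadu_si128((const __m128i *)p); }
static inline void vec_store(uint8_t *p, vec_u8 v) { _mm_storeu_si128((__m128i *)p, v); }
/* SSE2 has no horizontal max: fold the register onto itself four times. */
static inline uint8_t vec_hmax(vec_u8 v) {
    v = _mm_max_epu8(v, _mm_srli_si128(v, 8));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 4));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 2));
    v = _mm_max_epu8(v, _mm_srli_si128(v, 1));
    return (uint8_t)(_mm_cvtsi128_si32(v) & 0xff);
}
/* Early-exit test: true when the saturating difference a - b is zero in every lane. */
static inline int vec_none_gt(vec_u8 a, vec_u8 b) {
    vec_u8 d = _mm_subs_epu8(a, b);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(d, _mm_setzero_si128())) == 0xffff;
}
/* Full-width shift: move every byte one lane up, zero-filling lane 0. */
static inline vec_u8 vec_shl1(vec_u8 v) { return _mm_slli_si128(v, 1); }

#elif defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
typedef uint8x16_t vec_u8;
static inline vec_u8 vec_load(const uint8_t *p) { return vld1q_u8(p); }
static inline void vec_store(uint8_t *p, vec_u8 v) { vst1q_u8(p, v); }
/* vmaxvq_u8 reduces across all 16 lanes in one operation. */
static inline uint8_t vec_hmax(vec_u8 v) { return vmaxvq_u8(v); }
/* The same reduction replaces the compare + movemask pair. */
static inline int vec_none_gt(vec_u8 a, vec_u8 b) { return vmaxvq_u8(vqsubq_u8(a, b)) == 0; }
/* vextq_u8(zero, v, 15) yields { 0, v[0], ..., v[14] }. */
static inline vec_u8 vec_shl1(vec_u8 v) { return vextq_u8(vdupq_n_u8(0), v, 15); }

#else
/* Scalar fallback: a vector is just 16 bytes. */
typedef struct { uint8_t b[16]; } vec_u8;
static inline vec_u8 vec_load(const uint8_t *p) { vec_u8 v; memcpy(v.b, p, 16); return v; }
static inline void vec_store(uint8_t *p, vec_u8 v) { memcpy(p, v.b, 16); }
static inline uint8_t vec_hmax(vec_u8 v) {
    uint8_t m = 0;
    for (int i = 0; i < 16; i++) if (v.b[i] > m) m = v.b[i];
    return m;
}
static inline int vec_none_gt(vec_u8 a, vec_u8 b) {
    for (int i = 0; i < 16; i++) if (a.b[i] > b.b[i]) return 0;
    return 1;
}
static inline vec_u8 vec_shl1(vec_u8 v) {
    vec_u8 r;
    r.b[0] = 0;
    memcpy(r.b + 1, v.b, 15);
    return r;
}
#endif
```

All three variants expose the same interface, so a caller could write the inner loop once and break out of it when vec_none_gt() returns nonzero, regardless of platform.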

Make the native SSE2 code conditional on __SSE2__, which is defined by GCC/Clang/etc on x86-64 by default and on i386 with -msse2 etc.
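As a rough illustration of this kind of guard, the sketch below selects a backend purely from predefined compiler macros. SIMD_BACKEND and simd_backend() are hypothetical names used only for this example, and the use of __ARM_NEON to detect Neon is an assumption; the PR's actual conditionals may differ.

```c
/* Pick an implementation at compile time from predefined macros. */
#if defined(__SSE2__)
#define SIMD_BACKEND "sse2"    /* x86-64 by default; i386 when built with -msse2 */
#elif defined(__ARM_NEON)
#define SIMD_BACKEND "neon"    /* assumed Neon detection macro (ACLE) */
#else
#define SIMD_BACKEND "scalar"  /* portable fallback */
#endif

/* Report which backend this translation unit was compiled for. */
static const char *simd_backend(void) { return SIMD_BACKEND; }
```

Running `gcc -dM -E - </dev/null | grep SSE2` is one way to check whether a given GCC configuration defines __SSE2__ by default.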


Commits on Jun 17, 2022

Commits on Jun 20, 2022

Commits on Jun 26, 2022

Commits on Jun 27, 2022