WIP: DO NOT MERGE: ARMv8 NaN Correctness Proposal #1273

Syonyk · 2025-01-30T22:45:48Z

NaN handling in SIMDe is very far from hardware, in terms of properly handling and propagating input NaNs to operations. However, this is not a primary concern for many users of the library. Obviously, at least one user particularly cares...

My proposed solution, in this PR (general working sample code to discuss, but certainly not a final proposal!), is to add optional NaN checking after results have been generated. This permits the use of whatever vector acceleration may be applicable, instead of requiring the slow fallback path to be used.

The concept is simple enough: After results are generated in the non-native path, check for input operand NaN conditions. This is done with x86 SSE intrinsics, as I'm working on x86, and these should work properly on other platforms as well through emulation. If any NaN values are found, then the "ARM NaN Propagation Algorithm" is applied to the elements to create the correct output NaN in slots as required.
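For discussion, here is a minimal scalar sketch of the per-element fix-up described above. It is not the code in this PR. The helper names (`is_nan_bits`, `is_snan_bits`, `arm_nan_fixup_u32`) are hypothetical, and it assumes the ordering of ARM's `FPProcessNaNs` pseudocode with `FPCR.DN` clear: a signaling NaN in the first operand wins, then a signaling NaN in the second, then a quiet NaN in the first, then a quiet NaN in the second. It works on raw 32-bit lane patterns so that a signaling NaN is never passed through a floating-point register that might quiet it.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; none of these names exist in SIMDe. */

#define F32_QUIET_BIT 0x00400000u  /* top fraction bit: set = quiet NaN */

/* NaN: exponent all ones and a non-zero fraction (sign bit ignored). */
static int is_nan_bits(uint32_t u) {
  return (u & 0x7FFFFFFFu) > 0x7F800000u;
}

/* Signaling NaN: a NaN whose quiet bit is clear. */
static int is_snan_bits(uint32_t u) {
  return is_nan_bits(u) && (u & F32_QUIET_BIT) == 0;
}

/* Given the two input lanes and the already-computed result lane, return
 * the lane an ARMv8 core would produce under the assumptions above.
 * Sign and payload of the chosen NaN are preserved; an SNaN is quieted. */
static uint32_t arm_nan_fixup_u32(uint32_t a, uint32_t b, uint32_t result) {
  if (is_snan_bits(a)) return a | F32_QUIET_BIT;
  if (is_snan_bits(b)) return b | F32_QUIET_BIT;
  if (is_nan_bits(a))  return a;
  if (is_nan_bits(b))  return b;
  return result;  /* no input NaN: keep the accelerated result as-is */
}
```

A vectorized version would presumably build the same four masks with compares and select between lanes, touching the result only where at least one input is a NaN.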

Note that this does not solve the problem of hardware generating an inconsistent NaN on faulty operation - it only corrects propagating the NaNs through functions. I don't have tests written yet in the SIMDe test library, but in my own "throw large amounts of random data at the intrinsics" tester, it matches hardware exactly for silencing and propagating NaNs.

Feedback desired. Several points I am unclear on:

- Should this be gated on SIMDE_FAST_NANS or SIMDE_NO_FAST_NANS? By default, neither appears to be set. I am fine with this behavior being off by default.
- Is there a better way to handle the 64-bit vector into a 128-bit vector case? I would prefer to avoid MMX registers for performance reasons, as they alias the x87 floating-point registers.

