Non-SIMD math instructions are missing. #344

Open
mcourteaux opened this issue Feb 21, 2025 · 2 comments

Comments

@mcourteaux
Contributor

I was looking at some performance metrics from vg-renderer, and it spends a considerable amount of time in vg::vec2Dir, for example. Looking into that calculation, I saw that it calls bx::rsqrt, which seems like 100% the correct thing to do:

// Direction from a to b
inline Vec2 vec2Dir(const Vec2& a, const Vec2& b)
{
	const float dx = b.x - a.x;
	const float dy = b.y - a.y;
	const float lenSqr = dx * dx + dy * dy;
	const float invLen = lenSqr < VG_EPSILON ? 0.0f : bx::rsqrt(lenSqr);
	return{ dx * invLen, dy * invLen };
}

However, bx::rsqrt() does not have an implementation that maps to _mm_rsqrt_ss on x64 (SSE).
Instead, there is a mechanism that splats the value across the whole vector, compiles to a _mm_rsqrt_ps, and then extracts one element. I'd consider that wasteful, and it prevents the compiler or the micro-architecture from potentially vectorizing this.
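To make the comparison concrete, here is a minimal sketch of the two approaches on x86 SSE. The helper names rsqrtScalarEst and rsqrtSplatEst are hypothetical and not part of bx, and the non-SSE fallback is there only so the sketch builds on other targets. Both return the raw hardware estimate (roughly 12 bits of precision) with no refinement step.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#	include <xmmintrin.h>
#	define SKETCH_HAS_SSE 1
#else
#	define SKETCH_HAS_SSE 0
#endif

// Hypothetical helper: scalar path. Loads one float into the low lane,
// runs the scalar estimate instruction, and reads the low lane back.
inline float rsqrtScalarEst(float _a)
{
#if SKETCH_HAS_SSE
	return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(_a) ) );
#else
	return 1.0f / std::sqrt(_a); // Fallback for targets without SSE.
#endif
}

// Hypothetical helper: splat path. Broadcasts the float to all four lanes,
// runs the packed estimate instruction, and stores only the low lane.
inline float rsqrtSplatEst(float _a)
{
#if SKETCH_HAS_SSE
	const __m128 aa  = _mm_set1_ps(_a);
	const __m128 est = _mm_rsqrt_ps(aa);
	float result = 0.0f;
	_mm_store_ss(&result, est);
	return result;
#else
	return 1.0f / std::sqrt(_a); // Fallback for targets without SSE.
#endif
}
```

For an input of 4.0f, both helpers should return a value within about 1e-3 of 0.5f. Whether the scalar form is measurably faster depends on the compiler and the target CPU, so it is worth profiling before relying on it.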

So I'm wondering: are those functions missing? I can see that they don't really have a place right now, as all of the files are named simd_xxx, and this is an example of a non-SIMD instruction.

@mcourteaux
Contributor Author

For reference, I PR'd this in vg-renderer to work around this issue, and overall improve performance: jdryg/vg-renderer#43

@bkaradzic
Owner

Code you're talking about is here:

inline BX_CONSTEXPR_FUNC float rsqrt(float _a)
{
#if BX_SIMD_SUPPORTED
	if (isConstantEvaluated() )
	{
		return rsqrtRef(_a);
	}

	return rsqrtSimd(_a);
#else
	return rsqrtRef(_a);
#endif // BX_SIMD_SUPPORTED
}

SIMD implementation here:

inline BX_CONST_FUNC float rsqrtSimd(float _a)
{
	if (_a < kFloatSmallest)
	{
		return kFloatInfinity;
	}

	const simd128_t aa = simd_splat(_a);
#if BX_SIMD_NEON
	const simd128_t rsqrta = simd_rsqrt_nr(aa);
#else
	const simd128_t rsqrta = simd_rsqrt_ni(aa);
#endif // BX_SIMD_NEON

	float result = 0.0f;
	simd_stx(&result, rsqrta);

	return result;
}

> There is a mechanism that splats the value over the whole vector and compiles in a _mm_rsqrt_ps and then extracts one element, which I'd consider wasteful, and prevents the compiler, or micro-architectures from potentially vectorizing this.

You need to load the float into a SIMD register somehow, and splat is one way to do that. One component is extracted from the SIMD register because the expected result is a float.

Ideally, the vg-renderer SIMD functions in your PR should call the bx SIMD functions instead of using SSE intrinsics directly.
