
Hardware sqrt helper functions #37

Open
AngelTomkins opened this issue Mar 6, 2024 · 2 comments

Comments

@AngelTomkins

AngelTomkins commented Mar 6, 2024

In #32 it was shown that the hardware sqrt operation can replace a software one in nds_renderer.c. The default libnds sqrt function has extra checks that are unnecessary, and it does not support async computation. Libnds' sqrt checks that the hardware unit is not busy before sending the value. The hardware operation only takes ~30 main bus cycles according to my testing, so unless two hardware operations run back to back, there is no reason to check the busy flag twice. BlocksDS uses a simpler approach: it provides async functions for sending values to the hardware math coprocessor, so the CPU can do other work while the hardware math completes. In testing, I replaced the sqrt call referenced in #32 with an async send, followed by the (s > 0) check, and only then a wait for the hardware to finish the operation. This saves 10-20 microseconds per frame.

// Normalize the result
int s = (lights[i].nx * lights[i].nx + lights[i].ny * lights[i].ny + lights[i].nz * lights[i].nz) >> 8;

// Send the square root input to the hardware before comparing (s > 0); this saves 10-20 microseconds.
// devkitPro's libnds does not have helper functions for async hardware math. This should be
// put into a helper function.
REG_SQRTCNT = SQRT_64;
REG_SQRT_PARAM = (s64)s << 16;
if (s > 0) {
    while (REG_SQRTCNT & SQRT_BUSY);
    s = REG_SQRT_RESULT;
    lights[i].nx = (lights[i].nx << 16) / s;
    lights[i].ny = (lights[i].ny << 16) / s;
    lights[i].nz = (lights[i].nz << 16) / s;
}

If we do not switch to BlocksDS, I propose we at least put these functions in a header file for better interoperability with the hardware. My question is where this function should go so that it can be used by more than just nds_renderer.c if it proves useful later on. Should this be a function or a preprocessor define: #define sqrt_asynch(x) ...?
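
For illustration, the function variant could look roughly like the sketch below. The names hw_sqrt64_start and hw_sqrt64_wait are hypothetical and are not part of libnds. The register names are the ones used in the snippet above. The #else branch only provides host-side stand-ins so the sketch compiles off-target, and it assumes the build defines ARM9 when targeting the DS.

```c
#include <stdint.h>

#ifdef ARM9
#include <nds.h>  // real REG_SQRTCNT, REG_SQRT_PARAM, REG_SQRT_RESULT, SQRT_64, SQRT_BUSY
#else
// Host-only stand-ins (assumption): plain variables that perform no computation.
static volatile uint16_t REG_SQRTCNT;
static volatile int64_t  REG_SQRT_PARAM;
static volatile uint32_t REG_SQRT_RESULT;
#define SQRT_64   1
#define SQRT_BUSY (1 << 15)
#endif

// Hypothetical helper: start a 64-bit square root and return immediately,
// without checking the busy flag first.
static inline void hw_sqrt64_start(int64_t value) {
    REG_SQRTCNT = SQRT_64;
    REG_SQRT_PARAM = value;
}

// Hypothetical helper: wait until the unit is idle, then read the result.
static inline uint32_t hw_sqrt64_wait(void) {
    while (REG_SQRTCNT & SQRT_BUSY) {
        // spin until the hardware finishes
    }
    return REG_SQRT_RESULT;
}
```

With helpers like these, the normalization above would call hw_sqrt64_start((s64)s << 16) before the (s > 0) check and hw_sqrt64_wait() inside it. A static inline function keeps type checking and evaluates its argument once, which a #define would not guarantee.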

@Hydr8gon
Owner

Hydr8gon commented Mar 6, 2024

I think it's fine to do it inline in this case, and maybe consider a function if it becomes necessary. Though to be honest, the performance difference is so small that I don't think it really matters.

@Kuratius

Kuratius commented Mar 15, 2024

Related to this, I wrote a function that could use the hardware square root to compute a floating point square root.
It may be worth overriding the sqrtf used by <math.h>.

Where would be a good place to put the implementation?

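f32 fsqrt(f32 x){
    union{f32 f; u32 i;} xu;
    xu.f = x;
    // Grab the biased exponent field (bits 23-30); the sign bit is ignored
    s32 exponent = (xu.i & (0xff << 23));
    // Zero and denormals return 0; negative inputs, infinity and NaN are not handled
    if (exponent == 0) return 0.0;
    // Remove the bias, then halve the exponent
    exponent = exponent - (127 << 23);
    exponent = exponent >> 1; // right shift on negative number depends on compiler
    // Restore the implicit leading 1 and scale the mantissa up before the integer sqrt
    u64 mantissa = xu.i & ((1 << 23) - 1);
    mantissa = (mantissa + (1 << 23)) << 23;
    // Odd exponent: its low bit landed in bit 22, so fold the factor of 2 into the mantissa
    if ((exponent & (1 << 22)) > 0){
        mantissa = mantissa << 1;
    }
    u32 new_mantissa = (u32) sqrt(mantissa); // modify this line to use hardware sqrt
    // Re-bias the exponent and drop the implicit leading 1 of the new mantissa
    xu.i = ((exponent + (127 << 23)) & (0xff << 23)) | (new_mantissa & ((1 << 23) - 1));
    return xu.f;
}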
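
As a way to check the bit manipulation off-target, here is a minimal host-side sketch of the same idea. It is an assumption-laden variant, not the function above: fsqrt_sketch and isqrt64 are hypothetical names, isqrt64 is a portable software stand-in for the 64-bit hardware square root, and the exponent is halved without right-shifting a negative value. Like the original, it returns 0 for zero and denormals and does not handle negative inputs, infinity or NaN.

```c
#include <stdint.h>
#include <string.h>

// Software stand-in (assumption) for the hardware unit: returns floor(sqrt(n)).
// On the DS this body would be replaced by a write to the sqrt registers.
static uint32_t isqrt64(uint64_t n) {
    uint64_t root = 0;
    uint64_t bit = (uint64_t)1 << 62;  // highest power of four in 64 bits
    while (bit > n) bit >>= 2;
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)root;
}

static float fsqrt_sketch(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    int32_t biased = (int32_t)((bits >> 23) & 0xff);
    if (biased == 0) return 0.0f;      // zero and denormals, as in the original

    int32_t e = biased - 127;          // unbiased exponent
    // 1.m as a fixed-point value scaled by 2^46, so its root is scaled by 2^23
    uint64_t m = ((uint64_t)(bits & 0x7fffff) | 0x800000) << 23;
    if (e & 1) m <<= 1;                // odd exponent: move one factor of 2 into m
    int32_t half = (e - (e & 1)) / 2;  // floor(e / 2), also for negative e

    uint32_t r = isqrt64(m);           // lies in [2^23, 2^24)
    bits = ((uint32_t)(half + 127) << 23) | (r & 0x7fffff);
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Because the integer root is truncated rather than rounded, the result can differ from sqrtf in the last bit of the mantissa.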
