Add benchmark coverage for parabola based cosine approximation #2
Open

milianw wants to merge 4 commits into AZHenley:master from milianw:parabola-approx
Conversation
milianw force-pushed the parabola-approx branch 2 times, most recently from da856bd to a379bde on March 31, 2022 16:30
This covers both the original version by Nick from [1] and the slightly modified and optimized versions that I came up with a couple of years ago and shared at [2].

[1]: https://web.archive.org/web/20171228230531/http://forum.devmaster.net/t/fast-and-accurate-sine-cosine/9648
[2]: https://stackoverflow.com/a/28050328/35250

Note though that the original version [1] is only defined for the range [-pi, pi], but the accuracy test harness here tests the range [0, 2pi], which shines a bad light on those versions. My version [2] doesn't suffer from this accuracy issue: you can throw arbitrary input values at it. The performance is pretty good too; the imprecise version is even the fastest cos implementation on my machine now. The lookup table implementations are directly behind it, but I have to note: in real-world testing, cache eviction effects through interactions with the rest of the application code will further decrease the performance of lookup tables. Finally, this code is easily autovectorized by compilers like icc and even gcc.
On my machine, the results for all tests are as follows:

Compiler:
```
g++ (GCC) 11.2.0 compiling code with `-flto -march=native -O3`
```

CPU:
```
11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz
```

output:
```
ACCURACY
cos_taylor_literal_4terms_naive  19.9880092736029695
cos_taylor_literal_6terms_naive  1.4652889617438571
cos_taylor_literal_6terms_2pi    1.4652889617438571
cos_taylor_literal_6terms_pi     0.0001004702941281
cos_taylor_literal_6terms        0.0001004702941279
cos_taylor_literal_10terms       0.0000000000756514
cos_taylor_running_6terms        0.0001004702941287
cos_taylor_running_8terms        0.0000001352604422
cos_taylor_running_10terms       0.0000000000756513
cos_taylor_running_16terms       0.0000000000000009
cos_table_1                      0.4944578886434219
cos_table_0_1                    0.0499943500331001
cos_table_0_01                   0.0049999938268771
cos_table_0_001                  0.0004999999109268
cos_table_0_0001                 0.0000499999164148
cos_table_1_LERP                 0.1147496616359112
cos_table_0_1_LERP               0.0012496954434600
cos_table_0_01_LERP              0.0000124999013960
cos_table_0_001_LERP             0.0000001249999969
cos_table_0_0001_LERP            0.0000000012500020
cos_math_h                       0.0000000000000000
cos_parabola                     15.9999999810748665
cos_parabola_extra               63.2499998575883708
cos_parabola_opt                 0.0560095959541279
cos_parabola_extra_opt           0.0010902926026140

TIME
cos_taylor_literal_4terms_naive  0.3642890000000000
cos_taylor_literal_6terms_naive  0.5741620000000000
cos_taylor_literal_6terms_2pi    0.7144020000000000
cos_taylor_literal_6terms_pi     0.7745180000000000
cos_taylor_literal_6terms        0.7218470000000000
cos_taylor_literal_10terms       1.1426369999999999
cos_taylor_running_6terms        0.6787260000000001
cos_taylor_running_8terms        0.9333120000000000
cos_taylor_running_10terms       1.1113160000000000
cos_taylor_running_16terms       1.7794570000000001
cos_table_1                      0.2014240000000000
cos_table_0_1                    0.2031010000000000
cos_table_0_01                   0.2034710000000000
cos_table_0_001                  0.2042740000000000
cos_table_0_0001                 0.2036450000000000
cos_table_1_LERP                 0.3107120000000000
cos_table_0_1_LERP               0.3346280000000000
cos_table_0_01_LERP              0.3342020000000000
cos_table_0_001_LERP             0.3342410000000000
cos_table_0_0001_LERP            0.3327000000000000
cos_math_h                       0.7697060000000000
cos_parabola                     0.1096080000000000
cos_parabola_extra               0.1190130000000000
cos_parabola_opt                 0.1476240000000000
cos_parabola_extra_opt           0.2056920000000000
```
This range is often much better to approximate for 0-symmetric functions like cos. I.e. compare:

```
[0, 2pi]:
ACCURACY
cos_taylor_literal_4terms_naive  19.9880092736029695
cos_taylor_literal_6terms_naive  1.4652889617438571
cos_taylor_literal_6terms_2pi    1.4652889617438571
...
cos_parabola                     15.9999999810748665
cos_parabola_extra               63.2499998575883708

[-pi, pi]:
ACCURACY
cos_taylor_literal_4terms_naive  0.0239777873763927
cos_taylor_literal_6terms_naive  0.0001004702957825
cos_taylor_literal_6terms_2pi    0.0001004702957825
cos_parabola                     1.9999999739033667
cos_parabola_extra               3.3499999445446544
```
This is basically the opposite of the new -r arg: we now increase the value range to [-10pi, 10pi]. Anything outside [0, 2pi] will be abysmal for naive functions that don't account for this, see:

```
./benchmarks -R
Cosine benchmark

ACCURACY
cos_taylor_literal_4terms_naive  22237893.9080788344144821
cos_taylor_literal_6terms_naive  1693743289.4118604660034180
cos_taylor_literal_6terms_2pi    1.4652888121124259
cos_taylor_literal_6terms_pi     1.4652886805053995
cos_taylor_literal_6terms        1.4652886805053986
cos_taylor_literal_10terms       0.0003012239456650
cos_taylor_running_6terms        0.0001004702740058
cos_taylor_running_8terms        0.0000001352604069
cos_taylor_running_10terms       0.0000000000756512
cos_taylor_running_16terms       0.0000000000000014
cos_table_1                      0.4944578224012448
cos_table_0_1                    0.0499941818532710
cos_table_0_01                   0.0049999017702790
cos_table_0_001                  0.0004999070288996
cos_table_0_0001                 0.0000499070860950
cos_table_1_LERP                 0.1147496616359124
cos_table_0_1_LERP               0.0012496954434598
cos_table_0_01_LERP              0.0000124999013927
cos_table_0_001_LERP             0.0000001249999925
cos_table_0_0001_LERP            0.0000000012499975
cos_math_h                       0.0000000000000000
cos_parabola                     399.9999987128100543
cos_parabola_extra               36130.4497678874759004
cos_parabola_opt                 0.0560095959541315
cos_parabola_extra_opt           0.0010902926026148
```
No fancy compiler args are added, but they can be set manually using standard CMake procedures.
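For example, flags like the ones used in the benchmark run above can be passed through the standard `CMAKE_CXX_FLAGS` variable (a sketch; the `build` directory name is arbitrary):

```shell
# configure with optimization flags matching the benchmark run above;
# CMAKE_BUILD_TYPE and CMAKE_CXX_FLAGS are standard CMake variables
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_FLAGS="-flto -march=native -O3"
cmake --build build
```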
milianw force-pushed the parabola-approx branch from a379bde to 73e5430 on March 31, 2022 16:31