Fast atan and atan2 functions. #8388
base: main
Conversation
The GPU performance test was severely memory-bandwidth limited. This has been worked around by computing many (1024) arctans per output and summing them. Now (at least on my system) they are faster. See the updated performance reports.
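To illustrate the benchmarking idea (a NumPy sketch on the CPU, not the PR's actual Halide benchmark): summing many arctans per output element makes the kernel compute-bound rather than bandwidth-bound, so the measurement reflects arctan throughput. The values of `K` and the array size below are arbitrary illustrative choices.

```python
import time
import numpy as np

# Evaluate K arctans per output element and sum them, so arithmetic
# dominates memory traffic instead of the other way around.
K = 1024
x = np.random.rand(1 << 20).astype(np.float32)

t0 = time.perf_counter()
acc = np.zeros_like(x)
for k in range(K):
    acc += np.arctan(x + np.float32(k))  # K arctans per output element
t1 = time.perf_counter()

print(f"{x.size * K / (t1 - t0) / 1e9:.2f} G arctans/s")
```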
Okay, this is ready for review. Vulkan is slow, but that is apparently well known...
Oh dear... I don't even know what WebGPU is... @steven-johnson Is this supposed to be an actual platform that is fast, and where performance metrics make sense? Or can I treat it like Vulkan, where it's just "meh, at least some are faster..."?
https://en.wikipedia.org/wiki/WebGPU |
I don't think Vulkan is necessarily slow... I think the benchmark loop is including initialization overhead. See my follow-up here: #7202
Very cool! I have some concerns with the error metric though. Decimal digits of error isn't a great metric: e.g. having a value of 0.0001 when it's supposed to be zero is much, much worse than having a value of 0.3701 when it's supposed to be 0.37. Relative error isn't great either, due to the singularity at zero. A better metric is ULPs: the maximum number of distinct floating-point values between the answer and the correct answer. There are also cases where you want a hard constraint as opposed to a minimization: exp(0) should be exactly one, and I guess I decided its derivative should be exactly one too, which explains the difference in coefficients.
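For concreteness, here is one common way to compute the ULP distance between two float32 values (a sketch, not code from this PR): reinterpret the bits so that integer ordering matches float ordering, then take the difference.

```python
import struct

def ulp_distance(a: float, b: float) -> int:
    """Number of representable float32 values between a and b."""
    def key(x: float) -> int:
        (u,) = struct.unpack("<I", struct.pack("<f", x))
        # Remap the raw bits so integer order matches float order
        # (negative floats have descending bit patterns).
        return u if u < 0x80000000 else 0x80000000 - u
    return abs(key(a) - key(b))

# 0.0001 where the answer should be 0.0 is ~9.5e8 ULPs off, while
# 0.3701 where it should be 0.37 is only a few thousand ULPs off.
print(ulp_distance(0.0001, 0.0))   # ~9.5e8
print(ulp_distance(0.3701, 0.37))  # ~3.4e3
```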
@abadams I improved the optimization script a lot. I added support for ULP optimization: it optimizes very nicely for maximal bit error. When instead optimizing for MAE, we see the max ULP distance increase. I changed the default to the ULP-optimized one, but to keep the maximal absolute error under 1e-5, I had to choose the higher-degree polynomial. Overall still good.

@derek-gerstmann Thanks a lot for investigating the performance issue! I now also get very fast Vulkan performance. I wonder why the overhead is so huge in Vulkan and not there in the other backends?

Vulkan:
CUDA:
Vulkan is now even faster than CUDA! 🤯
@steven-johnson The build just broke on something LLVM-related, it seems... There is no related Halide commit. Does LLVM constantly update with every build? Edit: I found the commit: llvm/llvm-project@75c7bca. Fixed separately in PR #8391.
…nge (-1, 1) to test (-4, 4). Cleanup code/comments. Test performance for all approximations.
We rebuild LLVM once a day, at about 2 AM Pacific time.
@abadams I added the check that counts the number of wrong mantissa bits:
Pay attention to the
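A plausible way to implement such a check (a sketch, not the PR's actual code): the number of wrong low mantissa bits is roughly the log2 of the ULP distance between the approximation and the correctly rounded value.

```python
import math
import struct

def _ordered(x: float) -> int:
    # float32 bits remapped so integer order matches float order.
    (u,) = struct.unpack("<I", struct.pack("<f", x))
    return u if u < 0x80000000 else 0x80000000 - u

def wrong_mantissa_bits(approx: float, exact: float) -> int:
    # log2 of the ULP distance: roughly how many of the lowest
    # mantissa bits of the float32 approximation are wrong.
    return abs(_ordered(approx) - _ordered(exact)).bit_length()

# A 5-decimal pi is ~11 float32 ULPs off, i.e. ~4 low bits wrong:
print(wrong_mantissa_bits(3.14159, math.pi))  # 4
```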
* Partially apply clang-tidy fixes we don't use yet
  - Put a bunch of stuff into anonymous namespaces
  - Delete some redundant casts (e.g. casting an int to int)
  - Add some const refs to avoid copies
  - Remove meaningless inline qualifiers on in-class definitions and constexpr functions
  - Remove return-with-value from functions returning void
  - Delete a little dead code
  - Use std::min/max where appropriate
  - Don't use a variable after std::forwarding it. It may have been moved from.
  - Use std::string::empty instead of comparing length to zero
* Undo unintentional formatting change
* Restore some necessary casts
* Add NOLINT to silence older clang-tidy
LLVM as it is built on the buildbots depends on `-lrt`, which is not a target. Filter out non-target dependencies from consideration.
GCC 12 only supports _Float16 on x86. Support for ARM was added in GCC 13. This causes a build failure in the manylinux_2_28 images.
The instructions for which LLVM to acquire were stale.
* Update pip package metadata * Link to the CMake package docs from Doxygen * Fix invalid Doxygen annotation in Serialization.h
PyPI rejected this because of a spacing issue.
A few quirks in the Markdown parser were worked around here. The most notable is that the sequence `]:` causes Doxygen to interpret a would-be link as a trailing reference even if it is not at the start of a line. Duplicating the single bracket reference is a portable workaround, i.e. [winget] ~> [winget][winget] It also doesn't stop interpreting `@` directives inside inline code, so it warns about our use of the `@` as a decorator symbol inside Python.md.
Someone was using this as a reference expert schedule, but it was stale and a bit simplistic for large matrices. I rescheduled it to get a better fraction of peak. This also now demonstrates how to use rfactor to block an sgemm over the k axis.
Cut polynomial + merge it + later take care of other transcendentals.
Addresses #8243. Uses a polynomial approximation with odd powers only, which makes it automatically symmetric around 0. Coefficients are optimized using my script, which does an iterative weight-adjusted least-squares fit (also included in this PR; see below).
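As a sketch of the technique (using textbook coefficients from Abramowitz & Stegun 4.4.49 for illustration, not the coefficients this PR derives): reduce the argument to [-1, 1], then evaluate an odd polynomial with Horner's rule on the squared argument.

```python
import numpy as np

# Textbook coefficients for atan on [-1, 1] (Abramowitz & Stegun 4.4.49,
# max abs error around 1e-5); the PR derives its own set with the
# included optimizer.
C = np.float32([0.9998660, -0.3302995, 0.1801410, -0.0851330, 0.0208351])

def fast_atan(x):
    x = np.asarray(x, dtype=np.float32)
    # Range reduction: for |x| > 1, atan(x) = +/-pi/2 - atan(1/x).
    big = np.abs(x) > 1.0
    safe = np.where(big, x, np.float32(1))  # placeholder where 1/x is unused
    z = np.where(big, np.float32(1) / safe, x)
    z2 = z * z
    # Odd polynomial z*(c0 + c1*z^2 + ...), Horner's rule on z^2; odd
    # powers make the approximation exactly antisymmetric around 0.
    p = np.full_like(z, C[-1])
    for c in C[-2::-1]:
        p = p * z2 + c
    p = p * z
    return np.where(big, np.copysign(np.float32(np.pi / 2), x) - p, p)

print(fast_atan([0.5, 2.0, -3.0]))              # approx [0.4636, 1.1071, -1.2490]
print(np.arctan(np.float32([0.5, 2.0, -3.0])))  # reference
```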
Added API
I designed this new `ApproximationPrecision` such that it can be used for other vectorizable functions at a later point as well, such as `fast_sin` and `fast_cos`, if we want that at some point. Note that I chose the `MAE_1e_5` style of notation instead of `5Decimals`, because "5 decimals" suggests that there will be 5 correct decimals, which is technically less correct than saying that the maximal absolute error will be below `1e-5`.

Performance difference:

Linux/CPU (with precision `MAE_1e_5`):

On Linux/CUDA, it's slightly faster than the default LLVM implementation (there is no atan instruction in PTX):
On Linux/OpenCL, it is also slightly faster:
Precision tests:
Optimizer
This PR includes a Python optimization script to find the coefficients of the polynomials:
While I didn't do anything very scientific or look at research papers, I have a hunch that the results from this script are really good (and may actually converge to the optimum).
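For reference, the general idea of an iterative weight-adjusted least-squares fit (a Lawson-style iteration) can be sketched as follows; this is an illustration of the technique, not the script included in the PR, and the function name and parameters are made up:

```python
import numpy as np

def fit_poly_lawson(f, basis_powers, lo, hi, n=2048, iters=500):
    """Weight-adjusted least-squares polynomial fit. Re-weighting each
    sample by its current absolute error pushes the fit toward the
    minimax (equal-ripple) solution."""
    x = np.linspace(lo, hi, n)
    y = f(x)
    A = np.stack([x**p for p in basis_powers], axis=1)
    w = np.ones(n)
    for _ in range(iters):
        sw = np.sqrt(w)[:, None]
        coeffs, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)
        err = np.abs(A @ coeffs - y)
        w *= err + 1e-30  # Lawson update: boost high-error samples
        w /= w.sum()
    return coeffs, err.max()

# Odd powers x, x^3, ..., x^9 for atan on [0, 1]:
c, e = fit_poly_lawson(np.arctan, [1, 3, 5, 7, 9], 0.0, 1.0)
print(c, e)
```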
If my optimization makes sense, then I have a funny observation: I get different coefficients for all of the fast approximations we already have. See below.
Better coefficients for `exp()`?

My result:

versus current Halide code:
Halide/src/IROperator.cpp, lines 1432 to 1439 in 3cdeb53
Better coefficients for `sin()`?

Notice that my optimization gives a maximal error of 1.35e-11, instead of the promised 1e-5, with degree 6.

Versus:
Halide/src/IROperator.cpp, lines 1390 to 1394 in 3cdeb53
If this is true (I don't see a reason why it wouldn't be), that would mean we can remove a few terms to get a faster version that still provides the promised precision.
Better coefficients for `cos()`?

versus:
Halide/src/IROperator.cpp, lines 1396 to 1400 in 3cdeb53
Better coefficients for `log()`?

versus:
Halide/src/IROperator.cpp, lines 1357 to 1365 in 3cdeb53