Fast atan and atan2 functions. #8388

Open — wants to merge 35 commits into base: main
Changes from 19 commits (35 commits total)
69052fa
Fix for the removed DataLayout constructor.
mcourteaux Aug 13, 2024
e82d9ff
Fast vectorizable atan and atan2 functions.
mcourteaux Aug 10, 2024
11b442c
Default to not using fast atan versions if on CUDA.
mcourteaux Aug 10, 2024
dee28bc
Finished fast atan/atan2 functions and tests.
mcourteaux Aug 10, 2024
362f0ea
Correct attribution.
mcourteaux Aug 10, 2024
1bd7f7a
Clang-format
mcourteaux Aug 10, 2024
4f1e851
Weird WebAssembly limits...
mcourteaux Aug 11, 2024
f10396b
Small improvements to the optimization script.
mcourteaux Aug 11, 2024
de9d3b7
Polynomial optimization for log, exp, sin, cos with correct ranges.
mcourteaux Aug 11, 2024
d8e3225
Improve fast atan performance tests for GPU.
mcourteaux Aug 12, 2024
3bcd1a7
Bugfix fast_atan approximation. Fix correctness test to exceed the ra…
mcourteaux Aug 12, 2024
2aa0c7e
Cleanup
mcourteaux Aug 12, 2024
fd088f8
Enum class instead of enum for ApproximationPrecision.
mcourteaux Aug 12, 2024
62534d7
Weird Metal limits. There should be a better way...
mcourteaux Aug 12, 2024
c76e719
Skip test for WebGPU.
mcourteaux Aug 12, 2024
fc25944
Fast atan/atan2 polynomials reoptimized. New optimization strategy: ULP.
mcourteaux Aug 13, 2024
b5d0cad
Feedback Steven.
mcourteaux Aug 13, 2024
4d61c6a
More comments and test mantissa error.
mcourteaux Aug 14, 2024
ff28b99
Do not error when testing arctan performance on Metal / WebGPU.
mcourteaux Aug 14, 2024
5a435f0
Partially apply clang-tidy fixes we don't enforce yet (#8376)
abadams Aug 16, 2024
a4544be
Fix bundling error on buildbots (#8392)
alexreinking Aug 16, 2024
624f737
Fix incorrect std::array sizes in Target.cpp (#8396)
steven-johnson Aug 23, 2024
5ca88b7
Fix _Float16 detection on ARM64 GCC<13 (#8401)
alexreinking Aug 29, 2024
238f73c
Update README.md (#8404)
abadams Sep 2, 2024
b09f611
Support CMAKE_OSX_ARCHITECTURES (#8390)
alexreinking Sep 4, 2024
0614530
Pip packaging at last! (#8405)
alexreinking Sep 4, 2024
ae6dac4
Big documentation update (#8410)
alexreinking Sep 5, 2024
30b5938
Document how to find Halide from a pip installation (#8411)
alexreinking Sep 6, 2024
6f0da12
Merge pull request #8412
alexreinking Sep 6, 2024
44651f9
Fix classifier spelling (#8413)
alexreinking Sep 7, 2024
636ad8f
Make run-clang-tidy.sh work on macOS (#8416)
alexreinking Sep 9, 2024
51824df
Link to PyPI from Doxygen index.html (#8415)
alexreinking Sep 9, 2024
c9b2a76
Include our Markdown documentation in the Doxygen site. (#8417)
alexreinking Sep 10, 2024
a8966e9
Add missing backslash (#8419)
abadams Sep 15, 2024
9bcb9b7
Reschedule the matrix multiply performance app (#8418)
abadams Sep 15, 2024
8 changes: 4 additions & 4 deletions src/CodeGen_LLVM.cpp
@@ -1910,7 +1910,7 @@ Value *CodeGen_LLVM::codegen_buffer_pointer(const string &buffer, Halide::Type t

Value *CodeGen_LLVM::codegen_buffer_pointer(Value *base_address, Halide::Type type, Expr index) {
// Promote index to 64-bit on targets that use 64-bit pointers.
-    llvm::DataLayout d(module.get());
+    const llvm::DataLayout &d = module->getDataLayout();
if (promote_indices() && d.getPointerSize() == 8) {
index = promote_64(index);
}
@@ -1951,7 +1951,7 @@ Value *CodeGen_LLVM::codegen_buffer_pointer(Value *base_address, Halide::Type ty
}

// Promote index to 64-bit on targets that use 64-bit pointers.
-    llvm::DataLayout d(module.get());
+    const llvm::DataLayout &d = module->getDataLayout();
if (d.getPointerSize() == 8) {
llvm::Type *index_type = index->getType();
llvm::Type *desired_index_type = i64_t;
@@ -3286,7 +3286,7 @@ void CodeGen_LLVM::visit(const Call *op) {
} else if (op->is_intrinsic(Call::undef)) {
user_error << "undef not eliminated before code generation. Please report this as a Halide bug.\n";
} else if (op->is_intrinsic(Call::size_of_halide_buffer_t)) {
-    llvm::DataLayout d(module.get());
+    const llvm::DataLayout &d = module->getDataLayout();
value = ConstantInt::get(i32_t, (int)d.getTypeAllocSize(halide_buffer_t_type));
} else if (op->is_intrinsic(Call::strict_float)) {
IRBuilder<llvm::ConstantFolder, llvm::IRBuilderDefaultInserter>::FastMathFlagGuard guard(*builder);
@@ -4466,7 +4466,7 @@ Value *CodeGen_LLVM::create_alloca_at_entry(llvm::Type *t, int n, bool zero_init
Value *size = ConstantInt::get(i32_t, n);
AllocaInst *ptr = builder->CreateAlloca(t, size, name);
int align = native_vector_bits() / 8;
-    llvm::DataLayout d(module.get());
+    const llvm::DataLayout &d = module->getDataLayout();
int allocated_size = n * (int)d.getTypeAllocSize(t);
if (t->isVectorTy() || n > 1) {
ptr->setAlignment(llvm::Align(align));
158 changes: 158 additions & 0 deletions src/IROperator.cpp
@@ -1421,6 +1421,164 @@ Expr fast_cos(const Expr &x_full) {
return fast_sin_cos(x_full, false);
}

// A vectorizable atan and atan2 implementation.
// Based on the ideas presented in https://mazzo.li/posts/vectorized-atan2.html.
Expr fast_atan_approximation(const Expr &x_full, ApproximationPrecision precision, bool between_m1_and_p1) {
const float pi_over_two = 1.57079632679489661923f;
Expr x;
// if x > 1 -> atan(x) = Pi/2 - atan(1/x)
Expr x_gt_1 = abs(x_full) > 1.0f;
if (between_m1_and_p1) {
x = x_full;
} else {
x = select(x_gt_1, 1.0f / x_full, x_full);
}

// Coefficients obtained using src/polynomial_optimizer.py
// Note that the maximal errors were computed with numpy in double precision;
// the actual errors are slightly larger with single-precision floats (see correctness/fast_arctan.cpp).
// Also note that fractional ULP distances are an artifact of measuring the error in double
// precision: the reconstruction carried more bits of precision than the float32 target value.
// In practice, the maximal ULP error will be close to round(MaxUlpE).

// The table is huge, so let's put clang-format off and handle the layout manually:
// clang-format off
std::vector<float> c;
switch (precision) {
// == MSE Optimized == //
case ApproximationPrecision::MSE_Poly2: // (MSE=1.0264e-05, MAE=9.2149e-03, MaxUlpE=3.9855e+05)
c = {+9.762134539879e-01f, -2.000301999499e-01f};
break;
case ApproximationPrecision::MSE_Poly3: // (MSE=1.5776e-07, MAE=1.3239e-03, MaxUlpE=6.7246e+04)
c = {+9.959820734941e-01f, -2.922781275652e-01f, +8.301806798764e-02f};
break;
case ApproximationPrecision::MSE_Poly4: // (MSE=2.8490e-09, MAE=1.9922e-04, MaxUlpE=1.1422e+04)
c = {+9.993165406918e-01f, -3.222865011143e-01f, +1.490324612527e-01f, -4.086355921512e-02f};
break;
case ApproximationPrecision::MSE_Poly5: // (MSE=5.6675e-11, MAE=3.0801e-05, MaxUlpE=1.9456e+03)
c = {+9.998833730470e-01f, -3.305995351168e-01f, +1.814513158372e-01f, -8.717338298570e-02f,
+2.186719361787e-02f};
break;
case ApproximationPrecision::MSE_Poly6: // (MSE=1.2027e-12, MAE=4.8469e-06, MaxUlpE=3.3187e+02)
c = {+9.999800646964e-01f, -3.326943930673e-01f, +1.940196968486e-01f, -1.176947321238e-01f,
+5.408220801540e-02f, -1.229952788751e-02f};
break;
case ApproximationPrecision::MSE_Poly7: // (MSE=2.6729e-14, MAE=7.7227e-07, MaxUlpE=5.6646e+01)
c = {+9.999965889517e-01f, -3.331900904961e-01f, +1.982328680483e-01f, -1.329414694644e-01f,
+8.076237117606e-02f, -3.461248530394e-02f, +7.151152759080e-03f};
break;
case ApproximationPrecision::MSE_Poly8: // (MSE=6.1506e-16, MAE=1.2419e-07, MaxUlpE=9.6914e+00)
c = {+9.999994159669e-01f, -3.333022219271e-01f, +1.995110884308e-01f, -1.393321817395e-01f,
+9.709319573480e-02f, -5.688043380309e-02f, +2.256648487698e-02f, -4.257308331872e-03f};
break;

// == MAE Optimized == //
case ApproximationPrecision::MAE_1e_2:
case ApproximationPrecision::MAE_Poly2: // (MSE=1.2096e-05, MAE=4.9690e-03, MaxUlpE=4.6233e+05)
c = {+9.724104536788e-01f, -1.919812827495e-01f};
break;
case ApproximationPrecision::MAE_1e_3:
case ApproximationPrecision::MAE_Poly3: // (MSE=1.8394e-07, MAE=6.1071e-04, MaxUlpE=7.7667e+04)
c = {+9.953600796593e-01f, -2.887020515559e-01f, +7.935084373856e-02f};
break;
case ApproximationPrecision::MAE_1e_4:
case ApproximationPrecision::MAE_Poly4: // (MSE=3.2969e-09, MAE=8.1642e-05, MaxUlpE=1.3136e+04)
c = {+9.992141075707e-01f, -3.211780734117e-01f, +1.462720063085e-01f, -3.899151874271e-02f};
break;
case ApproximationPrecision::MAE_Poly5: // (MSE=6.5235e-11, MAE=1.1475e-05, MaxUlpE=2.2296e+03)
c = {+9.998663727249e-01f, -3.303055171903e-01f, +1.801624340886e-01f, -8.516115366058e-02f,
+2.084750202717e-02f};
break;
case ApproximationPrecision::MAE_1e_5:
case ApproximationPrecision::MAE_Poly6: // (MSE=1.3788e-12, MAE=1.6673e-06, MaxUlpE=3.7921e+02)
c = {+9.999772256973e-01f, -3.326229914097e-01f, +1.935414518077e-01f, -1.164292778405e-01f,
+5.265046001895e-02f, -1.172037220425e-02f};
break;
case ApproximationPrecision::MAE_1e_6:
case ApproximationPrecision::MAE_Poly7: // (MSE=3.0551e-14, MAE=2.4809e-07, MaxUlpE=6.4572e+01)
c = {+9.999961125922e-01f, -3.331737159104e-01f, +1.980784841430e-01f, -1.323346922675e-01f,
+7.962601662878e-02f, -3.360626486524e-02f, +6.812471171209e-03f};
break;
case ApproximationPrecision::MAE_Poly8: // (MSE=7.0132e-16, MAE=3.7579e-08, MaxUlpE=1.1023e+01)
c = {+9.999993357462e-01f, -3.332986153129e-01f, +1.994657492754e-01f, -1.390867909988e-01f,
+9.642330770840e-02f, -5.591422536378e-02f, +2.186431903729e-02f, -4.054954273090e-03f};
break;


// == Max ULP Optimized == //
case ApproximationPrecision::MULPE_Poly2: // (MSE=2.1006e-05, MAE=1.0755e-02, MaxUlpE=1.8221e+05)
c = {+9.891111216318e-01f, -2.144680385336e-01f};
break;
case ApproximationPrecision::MULPE_1e_2:
case ApproximationPrecision::MULPE_Poly3: // (MSE=3.5740e-07, MAE=1.3164e-03, MaxUlpE=2.2273e+04)
c = {+9.986650768126e-01f, -3.029909865833e-01f, +9.104044335898e-02f};
break;
case ApproximationPrecision::MULPE_1e_3:
case ApproximationPrecision::MULPE_Poly4: // (MSE=6.4750e-09, MAE=1.5485e-04, MaxUlpE=2.6199e+03)
c = {+9.998421981586e-01f, -3.262726405770e-01f, +1.562944595469e-01f, -4.462070448745e-02f};
break;
case ApproximationPrecision::MULPE_1e_4:
case ApproximationPrecision::MULPE_Poly5: // (MSE=1.3135e-10, MAE=2.5335e-05, MaxUlpE=4.2948e+02)
c = {+9.999741103798e-01f, -3.318237821017e-01f, +1.858860952571e-01f, -9.300240079057e-02f,
+2.438947597681e-02f};
break;
case ApproximationPrecision::MULPE_1e_5:
case ApproximationPrecision::MULPE_Poly6: // (MSE=3.0079e-12, MAE=3.5307e-06, MaxUlpE=5.9838e+01)
c = {+9.999963876702e-01f, -3.330364633925e-01f, +1.959597060284e-01f, -1.220687452250e-01f,
+5.834036471395e-02f, -1.379661708254e-02f};
break;
case ApproximationPrecision::MULPE_1e_6:
case ApproximationPrecision::MULPE_Poly7: // (MSE=6.3489e-14, MAE=4.8826e-07, MaxUlpE=8.2764e+00)
c = {+9.999994992400e-01f, -3.332734078379e-01f, +1.988954540598e-01f, -1.351537940907e-01f,
+8.431852775558e-02f, -3.734345976535e-02f, +7.955832300869e-03f};
break;
case ApproximationPrecision::MULPE_Poly8: // (MSE=1.3696e-15, MAE=7.5850e-08, MaxUlpE=1.2850e+00)
c = {+9.999999220612e-01f, -3.333208398432e-01f, +1.997085632112e-01f, -1.402570625577e-01f,
+9.930940122930e-02f, -5.971380457112e-02f, +2.440561807586e-02f, -4.733710058459e-03f};
break;
}
// clang-format on

Expr x2 = x * x;
Expr result = c.back();
for (size_t i = 1; i < c.size(); ++i) {
result = x2 * result + c[c.size() - i - 1];
}
result *= x;

if (!between_m1_and_p1) {
result = select(x_gt_1, select(x_full < 0, -pi_over_two, pi_over_two) - result, result);
}
return common_subexpression_elimination(result);
}
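The scheme above — an odd polynomial evaluated with Horner's rule on x², plus the reduction atan(x) = ±π/2 − atan(1/x) for |x| > 1 — can be sanity-checked outside of Halide. The following is an illustrative standalone Python sketch (not the PR's code path) using the MSE_Poly4 coefficients from the table; since it runs in double precision, the observed error should land near the quoted MAE of about 2e-4:

```python
import math

# MSE_Poly4 coefficients from the table above (quoted MAE ~1.99e-4 on [-1, 1]).
C = [+9.993165406918e-01, -3.222865011143e-01,
     +1.490324612527e-01, -4.086355921512e-02]

def fast_atan(x):
    # Range reduction: atan(x) = sign(x) * pi/2 - atan(1/x) for |x| > 1.
    inverted = abs(x) > 1.0
    t = 1.0 / x if inverted else x
    # Horner evaluation of the odd polynomial: t * (c0 + c1*t^2 + c2*t^4 + c3*t^6),
    # mirroring the loop over `c` in the C++ code.
    t2 = t * t
    r = C[-1]
    for c in reversed(C[:-1]):
        r = r * t2 + c
    r *= t
    if inverted:
        r = math.copysign(math.pi / 2, x) - r
    return r

# Maximum absolute error over a dense grid, including the |x| > 1 branch.
err = max(abs(fast_atan(i / 100.0) - math.atan(i / 100.0))
          for i in range(-1000, 1001))
```

Because the reduction is exact in real arithmetic, the error outside [-1, 1] stays at essentially the same level as inside it.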
Expr fast_atan(const Expr &x_full, ApproximationPrecision precision) {
return fast_atan_approximation(x_full, precision, false);
}

Expr fast_atan2(const Expr &y, const Expr &x, ApproximationPrecision precision) {
const float pi = 3.14159265358979323846f;
const float pi_over_two = 1.57079632679489661923f;
// Dividing the smaller number (in absolute value) by the larger one always yields
// a ratio in [-1, 1], which is the range over which the approximation works well.
// We can therefore also skip the inversion logic in the fast_atan_approximation function
// by passing true for "between_m1_and_p1". This improves both speed (1 division instead
// of 2) and numerical precision.
Expr swap = abs(y) > abs(x);
Expr atan_input = select(swap, x, y) / select(swap, y, x);
Expr ati = fast_atan_approximation(atan_input, precision, true);
Expr at = select(swap, select(atan_input >= 0.0f, pi_over_two, -pi_over_two) - ati, ati);
// This select statement is taken directly from the definition on Wikipedia.
// There might be optimizations to be done here, but I haven't tried that yet. -- Martijn
Expr result = select(
x > 0.0f, at,
x < 0.0f && y >= 0.0f, at + pi,
x < 0.0f && y < 0.0f, at - pi,
x == 0.0f && y > 0.0f, pi_over_two,
x == 0.0f && y < 0.0f, -pi_over_two,
0.0f);
return common_subexpression_elimination(result);
}
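The swap-and-quadrant-correction logic can be checked in isolation. The sketch below mirrors the select chain above in plain Python, substituting `math.atan` for the polynomial approximation (so any disagreement with `math.atan2` would come from the control flow, not the polynomial); the name `fast_atan2` here is only illustrative:

```python
import math

def fast_atan2(y, x):
    if x == 0.0:
        # The x == 0 arms of the select; no division needed.
        if y > 0.0:
            return math.pi / 2
        if y < 0.0:
            return -math.pi / 2
        return 0.0
    # Divide the smaller magnitude by the larger so the inner atan input is in [-1, 1].
    swap = abs(y) > abs(x)
    atan_input = (x / y) if swap else (y / x)
    ati = math.atan(atan_input)  # stand-in for the polynomial approximation
    at = ((math.pi / 2 if atan_input >= 0 else -math.pi / 2) - ati) if swap else ati
    if x > 0.0:
        return at
    # x < 0: shift into the correct quadrant.
    return at + math.pi if y >= 0.0 else at - math.pi
```

Checked over a grid of (y, x) pairs including the axes, this reproduces `math.atan2` to within double-precision rounding.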

Expr fast_exp(const Expr &x_full) {
user_assert(x_full.type() == Float(32)) << "fast_exp only works for Float(32)";

93 changes: 93 additions & 0 deletions src/IROperator.h
@@ -972,6 +972,99 @@ Expr fast_sin(const Expr &x);
Expr fast_cos(const Expr &x);
// @}

/**
* Enum that declares several options for functions that are approximated
* by polynomial expansions. These polynomials can be optimized for three
* different metrics: Mean Squared Error, Maximum Absolute Error, or
* Maximum Units in Last Place (ULP) Error.
*
 * Orthogonally to the optimization objective, these polynomials can vary
 * in degree. Higher-degree polynomials give more precise results.
 * Note that the `X` in the `PolyX` enum values refers to the number of terms
 * in the polynomial, not its degree. E.g., even symmetric functions may be
 * implemented using only even powers, for which `Poly3` would mean that the
 * terms [1, x^2, x^4] are used.
 *
 * Additionally, if you don't care about the number of terms in the polynomial
 * but do care about the maximal absolute error the approximation may have
 * over the domain, you may use the `MAE_1e_x` values and the implementation
 * will pick the polynomial degree that achieves this precision.
 */
enum class ApproximationPrecision {
/** Mean Squared Error Optimized. */
// @{
MSE_Poly2,
MSE_Poly3,
MSE_Poly4,
MSE_Poly5,
MSE_Poly6,
MSE_Poly7,
MSE_Poly8,
// @}

/** Number of terms in polynomial -- Optimized for Max Absolute Error. */
// @{
MAE_Poly2,
MAE_Poly3,
MAE_Poly4,
MAE_Poly5,
MAE_Poly6,
MAE_Poly7,
MAE_Poly8,
// @}

/** Number of terms in polynomial -- Optimized for Max ULP Error.
* ULP is "Units in Last Place", measured in IEEE 32-bit floats. */
// @{
MULPE_Poly2,
MULPE_Poly3,
MULPE_Poly4,
MULPE_Poly5,
MULPE_Poly6,
MULPE_Poly7,
MULPE_Poly8,
// @}

    /** Maximum-Absolute-Error optimized, selected by a target maximal absolute error. */
// @{
MAE_1e_2,
MAE_1e_3,
MAE_1e_4,
MAE_1e_5,
MAE_1e_6,
// @}

    /** Maximum-ULP-Error optimized, selected by a target maximal absolute error. */
// @{
MULPE_1e_2,
MULPE_1e_3,
MULPE_1e_4,
MULPE_1e_5,
MULPE_1e_6,
// @}
};

/** Fast vectorizable approximations for arctan and arctan2 for Float(32).
*
* Desired precision can be specified as either a maximum absolute error (MAE) or
* the number of terms in the polynomial approximation (see the ApproximationPrecision enum) which
* are optimized for either:
* - MSE (Mean Squared Error)
* - MAE (Maximum Absolute Error)
* - MULPE (Maximum Units in Last Place Error).
*
* The default (Max ULP Error Polynomial 6) has a MAE of 3.53e-6.
* For more info on the precision, see the table in IROperator.cpp.
*
 * Note: the polynomial uses odd powers, so the number of terms is not the degree of the polynomial.
 * Note: Poly8 is only useful to increase precision for atan, not for atan2.
 * Note: these functions do not seem to be reliably faster on WebGPU (as of August 2024).
*/
// @{
Expr fast_atan(const Expr &x, ApproximationPrecision precision = ApproximationPrecision::MULPE_Poly6);
Expr fast_atan2(const Expr &y, const Expr &x, ApproximationPrecision precision = ApproximationPrecision::MULPE_Poly6);
// @}
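The three metrics (MSE, MAE, MULPE) can be illustrated with a rough, self-contained Python sketch. It evaluates the MSE_Poly2 coefficients from the table in IROperator.cpp against `math.atan` over a uniform grid on [-1, 1]; the grid and domain are assumptions, so the values will only approximately match the table, which was produced by src/polynomial_optimizer.py:

```python
import math
import struct

def ulp_distance(a, b):
    # Distance in float32 representable steps ("units in the last place").
    def key(x):
        u = struct.unpack('<I', struct.pack('<f', x))[0]  # round to binary32, take bits
        return u if u < 0x80000000 else 0x80000000 - u    # make the ordering monotone
    return abs(key(a) - key(b))

# MSE_Poly2 coefficients from the table in IROperator.cpp
# (quoted: MSE=1.0264e-05, MAE=9.2149e-03, MaxUlpE=3.9855e+05).
C = [+9.762134539879e-01, -2.000301999499e-01]

def approx_atan(x):  # valid on [-1, 1] only
    return x * (C[0] + C[1] * x * x)

xs = [i / 10000.0 for i in range(-10000, 10001)]
errs = [approx_atan(x) - math.atan(x) for x in xs]
mse = sum(e * e for e in errs) / len(errs)
mae = max(abs(e) for e in errs)
mulpe = max(ulp_distance(approx_atan(x), math.atan(x)) for x in xs)
```

Note how a polynomial with a small absolute error still has a huge ULP error: near x = 0 the relative error dominates, which is exactly what the MULPE-optimized coefficient sets trade absolute error away to fix.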

/** Fast approximate cleanly vectorizable log for Float(32). Returns
* nonsense for x <= 0.0f. Accurate up to the last 5 bits of the
* mantissa. Vectorizes cleanly. */