Benchmarks & examples #5
Also, just because it came up on HN today, here's a nice explanation of a somewhat bonkers optimized fixed-length reduction that's worth pondering.
@sampsyo there is a conversation about code to compile in #27 (comment). There are a few interesting things in that PR; I am trying to separate them out a bit. From @lemaitre:
Interesting discussion; thanks for the ping! This is a good list from @lemaitre. To break things down a bit, I sort of think there are two competing goals here:
To make things more complicated, there is also a separate dimension to distinguish "algorithms" (kernels, such as GEMM) from "applications" (such as an H.264 encoder). The former is probably the better focus, at least early on.

I think the thing that makes benchmarking SIMD ISAs/algorithms especially hard (compared to "normal" benchmarks) is that there is a huge range in implementation qualities. A naively auto-vectorized version of a simple 3x3 matrix product, for example, will be far worse than an expert-tuned implementation—and the expert-tuned implementation will differ radically for different ISA targets! Which means that, if you invent a new SIMD ISA (or WASM extension), it can be really hard to say whether it's "better" without doing a bunch of incredibly painful tuning for that architecture.

The problem is that the straightforward/obvious implementations of a given benchmark could be misleading w/r/t what the "peak" performance could be, given infinite programmer effort. "Infinite programmer effort" may seem like a fantasy, but in practice, the world has shown that it is willing to expend seemingly-infinite effort for high-value libraries like BLAS. That's not feasible to do for early-stage designs, so there will inevitably be some artistic guesswork involved about how much a given ISA feature will matter for industrial-strength vectorized implementations.
After a discussion with @penzn and @RossTate, we realized it would be helpful to catalog some resources for auto-vectorizable and hand-vectorized code. These should be useful for benchmarking and staring at as examples to judge the flexibility of the "long vector" proposal.
Probably the canonical benchmark suite for small, vectorizable kernels is MediaBench.
It's old and dusty, though, and the code is just one big bunch of undifferentiated C, so it's not all that helpful at revealing tight, vectorizable inner loops.
Something that might be a bit more helpful is VectorBench, from the MIT folks who have been working on the problem of "revitalizing" vectorized code hand-tuned for old SIMD ISAs so they can be recompiled for new SIMD ISAs.
Namely, a good place to start would be the code they adapted from this repository of a great deal of hand-vectorized image processing kernels.
Unfortunately, even these simple kernels reveal how maddeningly complex hand-vectorized code can truly be. For example, binarizing an image (taking a grayscale image and rounding to a black-and-white image) should be super simple, right? Here's the core loop for that:
https://github.com/ermig1979/Simd/blob/ab23fb98f5bebfeeef170c8abbd1ab9d596b0246/src/Simd/SimdAvx512bwBinarization.cpp#L63-L74
And of course it gets way more gnarly if you're doing something less perfectly elementwise, like a stencil/filter type of thing:
https://github.com/ermig1979/Simd/blob/master/src/Simd/SimdAvx2Sobel.cpp
Perhaps a better way to start is to look at code that should be amenable to auto-vectorization but that is not yet hand-tuned with vector intrinsics. For example, NPB is an old and small benchmark suite with this characteristic.
VectorBench includes the OpenMP-annotated NPB suite. You can find such intriguing loop nests as this:
https://github.com/revec/VectorBench/blob/77073a06779a1dbd948bd5f5a37660946ae07750/scalar/NPB3.0-omp-C/BT/bt.c#L241-L255
Of course, it's up to the reader's imagination to divine the best way to vectorize these loops. To that end, I think the revec paper itself can be helpful.
For example, take a gander at Figure 1, which shows the original scalar code, a vectorized version for SSE2, and the desired vectorization for two other ISAs (AVX2 and AVX-512).