add documentation, clean up
bashbaug committed Sep 28, 2024
1 parent bd73dc4 commit 9860956
Showing 2 changed files with 27 additions and 8 deletions.
23 changes: 19 additions & 4 deletions samples/16_floatatomics/README.md
@@ -2,13 +2,27 @@

## Sample Purpose

TODO
This is an advanced sample that demonstrates how to perform floating-point atomic addition in a kernel.
The standard way to perform floating-point atomic addition is the [cl_ext_float_atomics](https://registry.khronos.org/OpenCL/extensions/ext/cl_ext_float_atomics.html) extension.
This extension adds device queries and built-in functions to optionally support floating-point atomic add, min, max, load, and store on 16-bit, 32-bit, and 64-bit floating-point types.
When the `cl_ext_float_atomics` extension is supported and the device supports 32-bit floating-point atomic adds, this sample uses the built-in functions added by this extension.

Inspired by: https://pipinspace.github.io/blog/atomic-float-addition-in-opencl.html
This sample also includes fallback implementations for when the `cl_ext_float_atomics` extension is not supported:

* For NVIDIA GPUs, this sample includes a fallback that does the floating-point atomic add using inline PTX assembly language.
* For AMD GPUs, this sample includes a fallback that calls a compiler intrinsic to do the floating-point atomic add.
* For other devices, this sample includes a fallback that emulates the floating-point atomic add using 32-bit `atomic_xchg` functions.
This fallback implementation cannot reliably return the "old" value that was in memory before performing the atomic add, so it is not suitable for every use, but it does work for some important use cases, such as reductions.
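
The exchange-based emulation described in the last bullet can be sketched on the host with `std::atomic<float>`. This is a minimal analogue under stated assumptions, not the sample's kernel code: `emulated_atomic_add` and `emulated_sum_of_ones` are hypothetical names, and `std::atomic<float>::exchange` stands in for the 32-bit `atomic_xchg` built-in.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical host-side analogue of the exchange-based fallback:
// adds `val` to `target` using only atomic exchanges.
inline void emulated_atomic_add(std::atomic<float>& target, float val)
{
    float pending = val;
    do {
        // Take whatever is in memory (leaving 0.0f behind) and fold it
        // into the value we still need to contribute.
        float sum = target.exchange(0.0f) + pending;
        // Put the sum back. If another thread stored a partial result in
        // the meantime, this exchange hands it to us and we retry with it.
        pending = target.exchange(sum);
    } while (pending != 0.0f);
    // Note: there is no single "old" value to return here, because the
    // total may be assembled over several exchanges.
}

// Hypothetical helper: accumulates `count` copies of 1.0f.
inline float emulated_sum_of_ones(int count)
{
    assert(count >= 0);
    std::atomic<float> total{0.0f};
    for (int i = 0; i < count; i++) {
        emulated_atomic_add(total, 1.0f);
    }
    return total.load();
}
```

Every contribution is either in memory or held in some caller's `pending` value, so the total is conserved once all callers have finished, which is why the approach suits reductions.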

This sample was inspired by the blog post: https://pipinspace.github.io/blog/atomic-float-addition-in-opencl.html

## Key APIs and Concepts

TODO
```c
CL_DEVICE_SINGLE_FP_ATOMIC_CAPABILITIES_EXT
__opencl_c_ext_fp32_global_atomic_add
atomic_fetch_add_explicit
```

## Command Line Options

@@ -17,5 +31,6 @@ TODO
| `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on.
| `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on.
| `-i <number>` | 16 | Specify the number of iterations to execute.
| `--gwx <number>` | 1024 | Specify the global work size to execute, which is also the number of floating-point atomics to perform.
| `--gwx <number>` | 16384 | Specify the global work size to execute, which is also the number of floating-point atomics to perform.
| `-e` | N/A | Unconditionally use the emulated floating-point atomic add.
| `-e` | N/A | Check intermediate results for correctness, requires non-emulated atomics, requires adding a positive value.
12 changes: 8 additions & 4 deletions samples/16_floatatomics/main.cpp
@@ -177,8 +177,7 @@ int main(
commandQueue.finish();

auto start = std::chrono::system_clock::now();
for( size_t i = 0; i < iterations; i++ )
{
for (size_t i = 0; i < iterations; i++) {
cl_float zero = 0.0f;
commandQueue.enqueueFillBuffer(
dst,
@@ -200,15 +199,20 @@ int main(

// basic validation
{
cl_float check = 0.0f;
for (size_t i = 0; i < gwx; i++) {
check += 1.0f;
}

cl_float result = 0.0f;
commandQueue.enqueueReadBuffer(
dst,
CL_TRUE,
0,
sizeof(result),
&result);
if (result != (float)gwx) {
printf("Error: expected %f, got %f!\n", (float)gwx, result);
if (result != check) {
printf("Error: expected %f, got %f!\n", check, result);
} else {
printf("Basic Validation: Success.\n");
}
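
One plausible reason for computing `check` with a loop instead of casting `gwx` (an assumption, since the commit does not state it) is that repeated single-precision addition of `1.0f` stops increasing at 2^24, so the expected value should follow the same arithmetic as the kernel. The sketch below illustrates the effect; `sequential_sum_of_ones` and `check_saturation` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Sum `count` copies of 1.0f in single precision, one add at a time,
// mirroring the loop that computes `check` in the validation code.
inline float sequential_sum_of_ones(std::size_t count)
{
    float check = 0.0f;
    for (std::size_t i = 0; i < count; i++) {
        check += 1.0f;
    }
    return check;
}

inline void check_saturation()
{
    const std::size_t limit = std::size_t(1) << 24; // 16777216

    // Up to 2^24 every integer is exactly representable as a float.
    assert(sequential_sum_of_ones(limit) == static_cast<float>(limit));

    // Beyond 2^24, adding 1.0f rounds back to the same value, so the
    // running sum stays at 2^24 while the cast keeps growing.
    assert(sequential_sum_of_ones(limit + 10) == static_cast<float>(limit));
    assert(static_cast<float>(limit + 10) != static_cast<float>(limit));
}
```

For global work sizes below 2^24, such as the default of 16384, both ways of computing the expected value agree.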