add documentation, clean up

bashbaug · Sep 28, 2024 · 9860956
1 parent bd73dc4
Showing 2 changed files with 27 additions and 8 deletions.
diff --git a/samples/16_floatatomics/README.md b/samples/16_floatatomics/README.md
@@ -2,13 +2,27 @@
 
 ## Sample Purpose
 
-TODO
+This is an advanced sample that demonstrates how to do floating-point atomic addition in a kernel.
+The most standard way to perform floating-point atomic addition uses the [cl_ext_float_atomics](https://registry.khronos.org/OpenCL/extensions/ext/cl_ext_float_atomics.html) extension.
+This extension adds device queries and built-in functions to optionally support floating-point atomic add, min, max, load, and store on 16-bit, 32-bit, and 64-bit floating-point types.
+When the `cl_ext_float_atomics` extension is supported, and 32-bit floating-point atomic adds are supported, this sample will use the built-in functions added by this extension.
 
-Inspired by: https://pipinspace.github.io/blog/atomic-float-addition-in-opencl.html
+This sample also includes fallback implementations for when the `cl_ext_float_atomics` extension is not supported:
+
+* For NVIDIA GPUs, this sample includes a fallback that does the floating-point atomic add using inline PTX assembly language.
+* For AMD GPUs, this sample includes a fallback that calls a compiler intrinsic to do the floating-point atomic add.
+* For other devices, this sample includes a fallback that emulates the floating-point atomic add using 32-bit `atomic_xchg` functions.
+This fallback implementation cannot reliably return the "old" value that was in memory before performing the atomic add, so it is not suitable for every use, but it does work for some important use cases, such as reductions.
+
+This sample was inspired by the blog post: https://pipinspace.github.io/blog/atomic-float-addition-in-opencl.html
 
 ## Key APIs and Concepts
 
-TODO
+```c
+CL_DEVICE_SINGLE_FP_ATOMIC_CAPABILITIES_EXT
+__opencl_c_ext_fp32_global_atomic_add
+atomic_fetch_add_explicit
+```
 
 ## Command Line Options
 
@@ -17,5 +31,6 @@ TODO
 | `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on.
 | `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on.
 | `-i <number>` | 16 | Specify the number of iterations to execute.
-| `--gwx <number>` | 1024 | Specify the global work size to execute, which is also the number of floating-point atomics to perform.
+| `--gwx <number>` | 16384 | Specify the global work size to execute, which is also the number of floating-point atomics to perform.
 | `-e` | N/A | Unconditionally use the emulated floating-point atomic add.
+| `-e` | N/A | Check intermediate results for correctness, requires non-emulated atomics, requires adding a positive value.
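
The exchange-based fallback described in the README hunk above can be sketched on the host side in C++. This is an analog only, not the sample's kernel code: `emulated_atomic_add` is a hypothetical name, and `std::atomic<float>::exchange` stands in for the OpenCL `atomic_xchg` built-in. The loop structure follows the general approach of the linked blog post and assumes that a stored value of zero means "nothing pending".

```cpp
#include <atomic>

// Hypothetical host-side analog of an exchange-only floating-point atomic add.
// It returns nothing because the value that was in memory before the add
// cannot be recovered reliably with this scheme.
inline void emulated_atomic_add(std::atomic<float>& target, float value) {
    float pending = value;
    // 1. Swap 0.0f into the target and take whatever sum was stored there.
    // 2. Add the pending amount to it and swap the new sum back in.
    // 3. If another thread deposited a value between the two exchanges, the
    //    second exchange hands that value back (non-zero), so carry it and retry.
    while ((pending = target.exchange(target.exchange(0.0f) + pending)) != 0.0f) {
        // retry with the displaced value
    }
}
```

Because every displaced value is eventually added back, the final total comes out right for reductions, although the target can briefly hold zero or a partial sum while an add is in flight.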
diff --git a/samples/16_floatatomics/main.cpp b/samples/16_floatatomics/main.cpp
@@ -177,8 +177,7 @@ int main(
         commandQueue.finish();
 
         auto start = std::chrono::system_clock::now();
-        for( size_t i = 0; i < iterations; i++ )
-        {
+        for (size_t i = 0; i < iterations; i++) {
             cl_float zero = 0.0f;
             commandQueue.enqueueFillBuffer(
                 dst,
@@ -200,15 +199,20 @@ int main(
 
     // basic validation
     {
+        cl_float check = 0.0f;
+        for (size_t i = 0; i < gwx; i++) {
+            check += 1.0f;
+        }
+
         cl_float result = 0.0f;
         commandQueue.enqueueReadBuffer(
             dst,
             CL_TRUE,
             0,
             sizeof(result),
             &result);
-        if (result != (float)gwx) {
-            printf("Error: expected %f, got %f!\n", (float)gwx, result);
+        if (result != check) {
+            printf("Error: expected %f, got %f!\n", check, result);
         } else {
             printf("Basic Validation: Success.\n");
         }
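
The commit does not say why the expected value is now accumulated in a loop rather than computed as `(float)gwx`. One plausible reading is that repeated single-precision addition stops being exact for large counts, so the reference should follow the same arithmetic as the kernel. The sketch below isolates that loop in a hypothetical helper, `expected_sum`, and assumes each work-item adds `1.0f`, as the validation code suggests.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the validation loop in the hunk above:
// build the reference value by adding 1.0f `count` times in single precision.
inline float expected_sum(std::size_t count) {
    float check = 0.0f;
    for (std::size_t i = 0; i < count; i++) {
        check += 1.0f;
    }
    return check;
}
```

With IEEE 754 single precision, `expected_sum(16384)` equals `16384.0f`, the same as a direct cast. Past 2^24 (16777216) the running sum can no longer advance by `1.0f`, so `expected_sum` stays at `16777216.0f` while `(float)count` keeps growing.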