final tidy up

bashbaug · Oct 3, 2024 · d1207e3
1 parent 656647f
commit d1207e3

Showing 2 changed files with 16 additions and 10 deletions.
diff --git a/samples/16_floatatomics/README.md b/samples/16_floatatomics/README.md
@@ -2,26 +2,31 @@
 
 ## Sample Purpose
 
-This is an advanced sample that demonstrates how to do atomic floating-point atomic addition in a kernel.
-The most standard way to perform floating-point atomic addition uses the [cl_ext_float_atomics](https://registry.khronos.org/OpenCL/extensions/ext/cl_ext_float_atomics.html) extension.
+This is an advanced sample that demonstrates how to do atomic floating-point addition in a kernel.
+The most standard way to perform atomic floating-point addition uses the [cl_ext_float_atomics](https://registry.khronos.org/OpenCL/extensions/ext/cl_ext_float_atomics.html) extension.
 This extension adds device queries and built-in functions to optionally support floating-point atomic add, min, max, load, and store on 16-bit, 32-bit, and 64-bit floating-point types.
-When the `cl_ext_float_atomics` extenison is supported, and 32-bit floating point atomic adds are supported, this sample will use the built-in functions added by this extension.
+When the `cl_ext_float_atomics` extension is supported, and 32-bit floating point atomic adds are supported, this sample will use the built-in functions added by this extension.
 
-This sample also fallback implentations when the `cl_ext_float_atomics` extension is not supported:
+This sample also includes fallback implementations when the `cl_ext_float_atomics` extension is not supported:
 
 * For NVIDIA GPUs, this sample includes a fallback that does the floating-point atomic add using inline PTX assembly language.
 * For AMD GPUs, this sample includes a fallback that calls a compiler intrinsic to do the floating-point atomic add.
-* For other devices, this sample includes a fallback that emulates the floating-point atomic add using 32-bit `atomic_xchg` functions.
-This fallback implementation cannot reliably return the "old" value that was in memory before performing the atomic add, so it is unsuitable for all usages, but it does work for some important uses-cases, such as reductions.
+* For other devices, this sample includes two fallback implementations:
+    * The first emulates the floating-point atomic add using 32-bit `atomic_xchg` functions.
+      This fallback implementation cannot reliably return the "old" value that was in memory before performing the atomic add, so it is unsuitable for all usages, but it does work for some important uses-cases, such as reductions.
+    * The second emulates the floating-point atomic add using 32-bit `atomic_cmpxchg` functions.
+      This is a slower emulation, but it is able to reliably return the "old" value that was in memory before performing the atomic add.
 
 This sample was inspired by the blog post: https://pipinspace.github.io/blog/atomic-float-addition-in-opencl.html
 
 ## Key APIs and Concepts
 
-```c
+```
 CL_DEVICE_SINGLE_FP_ATOMIC_CAPABILITIES_EXT
 __opencl_c_ext_fp32_global_atomic_add
 atomic_fetch_add_explicit
+atomic_xchg
+atomic_cmpxchg
 ```
 
 ## Command Line Options
@@ -31,6 +36,7 @@ atomic_fetch_add_explicit
 | `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute on the sample on.
 | `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on.
 | `-i <number>` | 16 | Specify the number of iterations to execute.
-| `--gwx <number>` | 16384 | Specify the global work size to execute, which is also the number of floating-point atomics to perform.
+| `--gwx <number>` | 16384 | Specify the global work size, which is also the number of floating-point atomics to perform.
 | `-e` | N/A | Unconditionally use the emulated floating-point atomic add.
-| `-e` | N/A | Check intermediate results for correctness, requires non-emulated atomics, requires adding a positive value.
+| `-s` | N/A | Unconditionally use the slower and safer emulated floating-point atomic add.
+| `-e` | N/A | Check intermediate results for correctness, unsupported for the faster emulated atomics, requires adding a positive value.
diff --git a/samples/16_floatatomics/main.cpp b/samples/16_floatatomics/main.cpp
@@ -102,7 +102,7 @@ int main(
         op.add<popl::Value<size_t>>("i", "iterations", "Iterations", iterations, &iterations);
         op.add<popl::Value<size_t>>("", "gwx", "Global Work Size X AKA Number of Atomics", gwx, &gwx);
         op.add<popl::Switch>("e", "emulate", "Unconditionally Emulate Float Atomics", &emulate);
-        op.add<popl::Switch>("s", "slow-emulate", "Unconditionally Emulate Float Atomics with Return Support", &slowEmulate);
+        op.add<popl::Switch>("s", "slow-emulate", "Unconditionally Emulate Float Atomics (slowly and safely)", &slowEmulate);
         op.add<popl::Switch>("c", "check", "Check Intermediate Results", &check);
 
         bool printUsage = false;
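
The two emulated fallbacks described in the README hunk above can be sketched on the host with `std::atomic`. This is only an illustration of the two techniques, not the kernel code from this sample: the helper names (`as_bits`, `as_float`, `emulated_add_xchg`, `emulated_add_cmpxchg`, `sanity_check`) are hypothetical, and `exchange` / `compare_exchange_weak` stand in for the OpenCL `atomic_xchg` / `atomic_cmpxchg` functions. The exchange-based variant follows the general approach from the blog post linked in the README.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

// Reinterpret a float as its 32-bit pattern and back (hypothetical helpers).
inline uint32_t as_bits(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
inline float as_float(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

// Exchange-based emulation: the faster fallback.
// It cannot report the value that was in memory before the add,
// so it returns nothing; that is enough for reductions.
inline void emulated_add_xchg(std::atomic<uint32_t>& target, float value) {
    float pending = value;
    do {
        // Take whatever is currently stored, leaving 0.0f behind.
        float taken = as_float(target.exchange(as_bits(0.0f)));
        // Store our running sum; whatever another thread added in the
        // meantime comes back to us and must be re-added on the next pass.
        pending = as_float(target.exchange(as_bits(taken + pending)));
    } while (pending != 0.0f);
}

// Compare-exchange-based emulation: the slower but safer fallback.
// It retries until the add is applied to an unchanged value, so the
// returned "old" value is reliable.
inline float emulated_add_cmpxchg(std::atomic<uint32_t>& target, float value) {
    uint32_t old_bits = target.load();
    uint32_t new_bits;
    do {
        new_bits = as_bits(as_float(old_bits) + value);
        // On failure, old_bits is refreshed with the current contents.
    } while (!target.compare_exchange_weak(old_bits, new_bits));
    return as_float(old_bits);
}

// Single-threaded sanity check of both helpers.
inline void sanity_check() {
    std::atomic<uint32_t> cell{as_bits(1.5f)};
    assert(emulated_add_cmpxchg(cell, 2.0f) == 1.5f);  // old value is returned
    assert(as_float(cell.load()) == 3.5f);
    emulated_add_xchg(cell, 0.5f);                     // no old value available
    assert(as_float(cell.load()) == 4.0f);
}
```

The trade-off mirrors the new `-e` and `-s` options: the exchange-based loop never compares against a stale value, but it briefly leaves `0.0f` in memory and has no meaningful return value, while the compare-exchange loop may retry under contention but always knows exactly which value it added to.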