Commit

final tidy up
bashbaug committed Oct 3, 2024
1 parent 656647f commit d1207e3
Showing 2 changed files with 16 additions and 10 deletions.
24 changes: 15 additions & 9 deletions samples/16_floatatomics/README.md
@@ -2,26 +2,31 @@

## Sample Purpose

This is an advanced sample that demonstrates how to do atomic floating-point atomic addition in a kernel.
The most standard way to perform floating-point atomic addition uses the [cl_ext_float_atomics](https://registry.khronos.org/OpenCL/extensions/ext/cl_ext_float_atomics.html) extension.
This is an advanced sample that demonstrates how to do atomic floating-point addition in a kernel.
The most standard way to perform atomic floating-point addition uses the [cl_ext_float_atomics](https://registry.khronos.org/OpenCL/extensions/ext/cl_ext_float_atomics.html) extension.
This extension adds device queries and built-in functions to optionally support floating-point atomic add, min, max, load, and store on 16-bit, 32-bit, and 64-bit floating-point types.
When the `cl_ext_float_atomics` extenison is supported, and 32-bit floating point atomic adds are supported, this sample will use the built-in functions added by this extension.
When the `cl_ext_float_atomics` extension is supported, and 32-bit floating point atomic adds are supported, this sample will use the built-in functions added by this extension.

This sample also fallback implentations when the `cl_ext_float_atomics` extension is not supported:
This sample also includes fallback implementations when the `cl_ext_float_atomics` extension is not supported:

* For NVIDIA GPUs, this sample includes a fallback that does the floating-point atomic add using inline PTX assembly language.
* For AMD GPUs, this sample includes a fallback that calls a compiler intrinsic to do the floating-point atomic add.
* For other devices, this sample includes a fallback that emulates the floating-point atomic add using 32-bit `atomic_xchg` functions.
This fallback implementation cannot reliably return the "old" value that was in memory before performing the atomic add, so it is unsuitable for all usages, but it does work for some important uses-cases, such as reductions.
* For other devices, this sample includes two fallback implementations:
* The first emulates the floating-point atomic add using 32-bit `atomic_xchg` functions.
This fallback implementation cannot reliably return the "old" value that was in memory before performing the atomic add, so it is not suitable for every usage, but it does work for some important use cases, such as reductions.
* The second emulates the floating-point atomic add using 32-bit `atomic_cmpxchg` functions.
This is a slower emulation, but it is able to reliably return the "old" value that was in memory before performing the atomic add.
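
The two emulation strategies described above can be sketched in host-side C++ as follows. This is only an illustration of the idea, not the sample's kernel code: the function names `atomic_add_xchg` and `atomic_add_cmpxchg` are hypothetical, and `std::atomic<float>` stands in for the 32-bit OpenCL C `atomic_xchg` and `atomic_cmpxchg` functions, which in a kernel would typically operate on the float's bit pattern.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper names; the sample's real kernels are written in OpenCL C.

// Faster emulation built only from atomic exchanges.
// It returns nothing because the value seen in memory may be a temporary zero.
inline void atomic_add_xchg(std::atomic<float>& target, float value) {
    float pending = value;
    // Swap the stored value out (leaving 0.0f), add the pending amount, and
    // swap the sum back in. A non-zero value displaced by the second swap was
    // added concurrently by another thread, so it becomes the next pending amount.
    while ((pending = target.exchange(target.exchange(0.0f) + pending)) != 0.0f) {
    }
}

// Slower emulation built from compare-and-exchange.
// It returns the value that was in memory immediately before the add.
inline float atomic_add_cmpxchg(std::atomic<float>& target, float value) {
    float expected = target.load();
    // On failure, compare_exchange_weak writes the current value into
    // `expected`, so the loop retries with fresh data.
    while (!target.compare_exchange_weak(expected, expected + value)) {
    }
    return expected;
}

// Single-threaded usage check of both helpers.
inline void demo() {
    std::atomic<float> sum{1.0f};
    atomic_add_xchg(sum, 2.5f);
    assert(sum.load() == 3.5f);

    float old = atomic_add_cmpxchg(sum, 0.5f);
    assert(old == 3.5f);
    assert(sum.load() == 4.0f);
}
```

In the exchange-based version, other threads can briefly observe `0.0f` in memory while an add is in flight, which is why it suits reductions but cannot report a meaningful "old" value. The compare-and-exchange version only publishes a new value when memory still holds the value it read, so the value it returns is reliable.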

This sample was inspired by the blog post: https://pipinspace.github.io/blog/atomic-float-addition-in-opencl.html

## Key APIs and Concepts

```c
```
CL_DEVICE_SINGLE_FP_ATOMIC_CAPABILITIES_EXT
__opencl_c_ext_fp32_global_atomic_add
atomic_fetch_add_explicit
atomic_xchg
atomic_cmpxchg
```

## Command Line Options
@@ -31,6 +36,7 @@ atomic_fetch_add_explicit
| `-d <index>` | 0 | Specify the index of the OpenCL device in the platform to execute the sample on.
| `-p <index>` | 0 | Specify the index of the OpenCL platform to execute the sample on.
| `-i <number>` | 16 | Specify the number of iterations to execute.
| `--gwx <number>` | 16384 | Specify the global work size to execute, which is also the number of floating-point atomics to perform.
| `--gwx <number>` | 16384 | Specify the global work size, which is also the number of floating-point atomics to perform.
| `-e` | N/A | Unconditionally use the emulated floating-point atomic add.
| `-e` | N/A | Check intermediate results for correctness, requires non-emulated atomics, requires adding a positive value.
| `-s` | N/A | Unconditionally use the slower and safer emulated floating-point atomic add.
| `-c` | N/A | Check intermediate results for correctness, unsupported for the faster emulated atomics, requires adding a positive value.
2 changes: 1 addition & 1 deletion samples/16_floatatomics/main.cpp
@@ -102,7 +102,7 @@ int main(
op.add<popl::Value<size_t>>("i", "iterations", "Iterations", iterations, &iterations);
op.add<popl::Value<size_t>>("", "gwx", "Global Work Size X AKA Number of Atomics", gwx, &gwx);
op.add<popl::Switch>("e", "emulate", "Unconditionally Emulate Float Atomics", &emulate);
op.add<popl::Switch>("s", "slow-emulate", "Unconditionally Emulate Float Atomics with Return Support", &slowEmulate);
op.add<popl::Switch>("s", "slow-emulate", "Unconditionally Emulate Float Atomics (slowly and safely)", &slowEmulate);
op.add<popl::Switch>("c", "check", "Check Intermediate Results", &check);

bool printUsage = false;