ToDo

The GPU reduction is actually slower than the CPU reduction, even for large array sizes.
Properly implement the leader and follower functions for chpl__initOnLocales in ChapelLocale.chpl.
Add the HSA runtime source code to the third-party directory instead of binaries (the binaries do not even work right now because of an HSA version mismatch).
Add test cases
Determine how to identify that a sublocale has GPU support. Right now it is based only on the sublocale ID (0 = CPU, 1 = GPU).
Check that the array is a rectangular 1D array before invoking GPU reductions.
Make sure execution on the parent locale also goes to the CPU sublocale, or decide what should happen when only the parent locale is specified.
Implement GPU reductions for all data types and reduction functions. Right now, only sum over int32 is implemented.
Decide how many queues should be created.
Decide on multi-GPU support.
Handle asynchronous GPU kernel execution. Right now execution on the GPU is always synchronous.
Fix data allocation methods for the GPU sublocale.
