
CALDGEMM Performance Optimization Guide (OpenCL CUDA)

David Rohr edited this page May 15, 2015 · 1 revision

////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Guidelines for OpenCL / CUDA: The CUDA part is not fully implemented yet. This guide is written as if it were fully integrated; feel free to implement the missing features for CUDA yourself :|.

The most important thing for OpenCL is the 3rd party library for the DGEMM kernel. CALDGEMM itself comes only with an unoptimized reference implementation. There is a sample 3rd party library with a template that shows how such a library has to work. In caldgemm_config.h there are also some options to tweak the performance of the integrated OpenCL kernels. The important ones here are ENABLE_TILED_KERNEL and DISABLE_SIMPLE_BUFFERS, but performance will in any case be much lower than with a proper 3rd party kernel.

In general, you should try to use OpenCL with GPU_C = 1. It is almost always better. Only in the case of a comparatively fast CPU (for example 2 * 12-core CPUs combined with a slow GPU like the 5870) is the GPU_C = 0 option possibly faster. In general, GPU_C = 0 works better with CAL, which is usually around 5% faster than OpenCL. So if you want to test it, CAL is probably the way to go (although it is no longer supported for newer GPUs).

OpenCL with the GPU_C = 0 setting behaves almost identically to CAL, so please follow the guide above. The following refers to OpenCL with GPU_C = 1 and to CUDA.

OpenCL with GPU_C = 1 transfers tiles of the C matrix completely to the GPU using strided submatrix transfers. There are no intermediate host buffers.

Therefore, there are no pre-/postprocessing threads on the CPU.

Due to the nature of GPU_C = 1, GPU pinning has practically no influence (except perhaps on device API internal buffers, which can be pinned to one GPU or the other). Hence, it makes sense to set the -UAx settings as described above for the -Gx settings, since this comes at zero cost. It is not strictly necessary, though; it works well without.

In general, you will want to use device-runtime-allocated memory. It is usually much faster than plain malloc. The -o c setting enforces device-runtime-allocated memory for OpenCL in any case. For this you need the -_ option. Be aware that some OpenCL drivers have problems allocating the large buffers required. If this leads to memory allocation problems, you should first try to fix the driver issue before you resort to disabling device-runtime-allocated memory.

The baseline for OpenCL will thus be something like

./dgemm_bench -O 1 -Oc 1 -o g -_ -Ol my_opencl_3rd_party_lib.so -w 1920 -h 3072 -UAx... -A -c -z -X -p -m ... -n ...

The most relevant optimization settings are:

-Oq (enable simple queuing, almost always faster)
-J 1 (enable small tiles)
-bb ? (choose the correct number of bbuffers)
-Op ? (choose the correct preallocation setting)
-Ox (exclude CPU from context)
-Ot (improved transposition kernel)
-Xb 1/2 (improved scheduler balancing)

Of course, you do not need -X -Xb if you use only a single GPU.
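
The baseline and the optimization settings above can be combined in a small wrapper. The following is only a sketch, not part of CALDGEMM: the function name build_dgemm_cmd and all of its parameter names are hypothetical, and the values for -m, -n, -bb, -Op and -Xb must be supplied by you, since this guide does not fix them. The -UAx setting is left out because its argument is not spelled out here and it is not strictly necessary. The sketch also assumes that the order of the flags does not matter. It prints the command line instead of running it, so you can inspect it first.

```shell
#!/bin/sh
# Sketch: assemble a dgemm_bench command line from the flags named in this guide.
# build_dgemm_cmd and its argument names are hypothetical helpers, not CALDGEMM tooling.
#
# Arguments (all must be supplied by the caller):
#   $1  number of GPUs in use
#   $2  path to the 3rd party OpenCL kernel library (for -Ol)
#   $3  value for -m
#   $4  value for -n
#   $5  value for -bb (number of bbuffers)
#   $6  value for -Op (preallocation setting)
#   $7  value for -Xb (1 or 2), only used with more than one GPU
build_dgemm_cmd() {
    num_gpus=$1
    kernel_lib=$2
    size_m=$3
    size_n=$4
    bbuffers=$5
    prealloc=$6
    xb_mode=$7

    # Baseline from this guide, without the optional -UAx setting.
    cmd="./dgemm_bench -O 1 -Oc 1 -o g -_ -Ol $kernel_lib -w 1920 -h 3072 -A -c -z"

    # -X is only needed with more than one GPU.
    if [ "$num_gpus" -gt 1 ]; then
        cmd="$cmd -X"
    fi

    cmd="$cmd -p -m $size_m -n $size_n"

    # Optimization settings listed above.
    cmd="$cmd -Oq -J 1 -bb $bbuffers -Op $prealloc -Ox -Ot"

    # Scheduler balancing is likewise only relevant for multiple GPUs.
    if [ "$num_gpus" -gt 1 ]; then
        cmd="$cmd -Xb $xb_mode"
    fi

    printf '%s\n' "$cmd"
}
```

A call such as build_dgemm_cmd 2 my_opencl_3rd_party_lib.so "$M" "$N" "$BB" "$OP" 1 prints a multi-GPU command line, where M, N, BB and OP are shell variables you set to values suited to your system. With 1 as the first argument, -X and -Xb are omitted.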

////////////////////////////////////////////////////////////////////////////////////////////////////////////////