- Scott Grauer-Gray ([email protected])
- William Killian ([email protected])
- John Cavazos ([email protected])
- Robert Searles ([email protected])
- Lifan Xu ([email protected])
- CUDA
- OpenCL
- HMPP
- OpenACC
- OpenMP
This benchmark suite is partially derived from the PolyBench benchmark suite developed by Louis-Noel Pouchet and available at http://www.cs.ucla.edu/~pouchet/software/polybench/
Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. Auto-tuning a High-Level Language Targeted to GPU Codes. Proceedings of Innovative Parallel Computing (InPar '12), 2012.
- correlation
- covariance
- 2mm
- 3mm
- atax
- bicg
- cholesky [*]
- doitgen
- gemm
- gemver
- gesummv
- mvt
- symm [*]
- syr2k
- syrk
- trisolv [*]
- trmm [*]
- durbin [*]
- dynprog [*]
- gramschmidt
- lu
- ludcmp [*]
- adi
- convolution-2d
- convolution-3d
- fdtd-2d
- jacobi-1d-imper
- jacobi-2d-imper
- seidel-2d [*]
[*] - not available for CUDA or OpenCL
- Set up
PATH
andLD_LIBRARY_PATH
environment variables to point to CUDA installation - Run
make
in target folder(s) with codes to generate executable(s) - Run the generated executable file(s).
- Set up
PATH
andLD_LIBRARY_PATH
environment variables to point to OpenCL installation - Set location of SDK in
common.mk
file in utilities folder (in OpenCL directory) - Run
make
in target folder(s) to generate executable(s) - Run the generated executable file(s).
- Set up
PATH
andLD_LIBRARY_PATH
environment variables to point to CUDA/OpenCL installation - Set up
HMPP/OpenACC
environment variables with sourcehmpp-env.sh
orcaps-env.sh
- Run
make exe
in target folder(s) with codes to generate executable(s) - Run the generated executable file(s).
- Set up
PATH
andLD_LIBRARY_PATH
environment variables for RoseACC (see RoseACC's Getting Started) - Run
make exe
in target folder(s) with codes to generate executable(s) - Run the generated executable file(s).
Parameters such as the input sizes, data type, and threshold for GPU-CPU output comparison can be modified using constants
within the codes and .h files. After modifying, run make clean
then make
on relevant code for modifications to take effect in resulting executable.
By default the STANDARD_DATASET
as defined in the .cuh/.h
file is used as the input size. The dataset choice can be adjusted from STANDARD_DATASET
to other
options (MINI_DATASET
, SMALL_DATASET
, etc) in the .cuh/.h
file, the dataset size can be adjusted by defining the input size manually in the .cuh/.h
file, or
the input size can be changed by simply adjusting the STANDARD_DATASET
so the program has different input dimensions.
Declares if the kernel will be run on the accelerator and CPU (with the run-time for each given and the outputs compared) or only on the accelerator. By default, RUN_ON_CPU
is defined so the kernel is run on both the accelerator and the CPU to make it easy to compare accelerator/CPU outputs and run-times. Commenting out or removing the #define RUN_ON_CPU
statement and re-compiling the code will cause the kernel to only be run on the accelerator.
By default, the DATA_TYPE
used in these codes are float
that can be changed to double
by changing the DATA_TYPE
typedef. Note that in OpenCL, the DATA_TYPE
needs to be changed in both the .h and .cl files, as the .cl files contain the kernel code and is compiled separately at run-time.
The PERCENT_DIFF_ERROR_THRESHOLD
refers to the percent difference (0.0-100.0) that the GPU and CPU results are allowed to differ and still be considered "matching"; this parameter can be adjusted for each code in the input code file.
Declares the type of accelerator to use for running the OpenCL kernel(s).
CL_DEVICE_TYPE_GPU
- run the OpenCL kernel on the GPU (default)CL_DEVICE_TYPE_CPU
- run the OpenCL kernel on the CPUCL_DEVICE_TYPE_ACCELERATOR
- run the OpenCL kernel on another accelerator such as the Intel Xeon Phi processor or IBM Cell Blade
These are passed as macro definitions during compilation time
(e.g -Dname_of_the_option
) or can be added with a #define
to the code.
-
POLYBENCH_STACK_ARRAYS
(only applies to allocation on host): use stack allocation instead of malloc [default: off] -
POLYBENCH_DUMP_ARRAYS
: dump all live-out arrays on stderr [default: off] -
POLYBENCH_CYCLE_ACCURATE_TIMER
: Use Time Stamp Counter to monitor the execution time of the kernel [default: off] -
MINI_DATASET
,SMALL_DATASET
,STANDARD_DATASET
,LARGE_DATASET
,EXTRALARGE_DATASET
: set the dataset size to be used [default:STANDARD_DATASET
] -
POLYBENCH_USE_C99_PROTO
: Use standard C99 prototype for the functions. -
POLYBENCH_USE_SCALAR_LB
: Use scalar loop bounds instead of parametric ones.
The following contributions have been made to this benchmark suite by the following people:
- Lifan Xu -- Original implementation of CUDA and OpenCL kernels
- Robert Searles -- Original implementation of HMPP kernels (version 2.x)
- Scott Grauer-Gray -- Modified implementations of CUDA and OpenCL
- William Killian -- Modified HMPP kernels (updated to 3.x), OpenACC kernels, OpenMP kernels
This work was funded in part by the U.S. National Science Foundation through the NSF Career award 0953667 and the Defense Advanced Research Projects Agency through the DARPA Computer Science Study Group (CSSG).