This benchmark evaluates the performance of GridTools, an open-source C++-embedded DSL for weather and climate codes. It implements an advection-diffusion solver using the finite difference method, with the operator-splitting approach to temporal integration that is typical of weather and climate codes: the two horizontal dimensions are discretized explicitly, while the third, vertical, dimension is treated implicitly.
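For reference, the governing equation is the advection-diffusion equation; the schematic split below only illustrates the described discretization, using a generic velocity field (u, v, w) and diffusion coefficient D rather than the exact notation of the benchmark source:
\frac{\partial \phi}{\partial t}
  = \underbrace{-u\,\partial_x \phi - v\,\partial_y \phi + D\left(\partial_x^2 \phi + \partial_y^2 \phi\right)}_{\text{horizontal terms, explicit step}}
  + \underbrace{\left(-w\,\partial_z \phi + D\,\partial_z^2 \phi\right)}_{\text{vertical terms, implicit step}}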
The benchmark requires version 1.1.2 of GridTools, which can be obtained from the GridTools repository:
$ git clone --branch v1.1.2 https://github.com/GridTools/gridtools.git
GridTools further depends on Boost (minimum version 1.67.0). A recent version of CMake is required to build and install GridTools (minimum version 3.14.5).
Follow the GridTools documentation for installation instructions. Note that GPU support has to be enabled when building GridTools if the benchmark is to be run on a GPU system.
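As an illustrative sketch only, an out-of-source installation might look like the following; the paths are placeholders, and the exact options (in particular the GPU-related ones) should be taken from the GridTools documentation:
$ cd /PATH/TO/GRIDTOOLS-SOURCE
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=/PATH/TO/GRIDTOOLS-INSTALLATION ..
$ make install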
The benchmark requires the GHEX library. It can be obtained using:
$ git clone https://github.com/GridTools/GHEX.git
Required:
- MPI
Optional:
- UCX (for the UCX transport layer, see below)
- PMIx (see below)
Additionally, CMake is required for building GHEX.
Once all necessary and optional dependencies have been installed, GHEX can be installed using CMake as follows:
$ cd /PATH/TO/GHEX-SOURCE
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=/PATH/TO/GHEX-INSTALLATION ..
To enable UCX support, additionally pass the following flags:
-DGHEX_USE_UCP=ON \
-DUCP_INCLUDE_DIR=/PATH/TO/UCX-INSTALLATION/include \
-DUCP_LIBRARY=/PATH/TO/UCX-INSTALLATION/lib/libucp.so
To enable PMIx, follow the same pattern and additionally define:
-DGHEX_USE_PMIx=ON \
-DPMIX_INCLUDE_DIR=/PATH/TO/PMIX-INSTALLATION/include \
-DPMIX_LIBRARY=/PATH/TO/PMIX-INSTALLATION/lib/libpmix.so
After successful configuration, type
$ make install
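Putting the above together, a configure step that enables both UCX and PMIx could look as follows (all paths are placeholders):
$ cmake -DCMAKE_INSTALL_PREFIX=/PATH/TO/GHEX-INSTALLATION \
    -DGHEX_USE_UCP=ON \
    -DUCP_INCLUDE_DIR=/PATH/TO/UCX-INSTALLATION/include \
    -DUCP_LIBRARY=/PATH/TO/UCX-INSTALLATION/lib/libucp.so \
    -DGHEX_USE_PMIx=ON \
    -DPMIX_INCLUDE_DIR=/PATH/TO/PMIX-INSTALLATION/include \
    -DPMIX_LIBRARY=/PATH/TO/PMIX-INSTALLATION/lib/libpmix.so \
    ..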
Once the dependencies are installed, CMake can be used to configure the build as follows:
$ cd /PATH/TO/GTBENCH-SOURCE
$ mkdir build && cd build
$ cmake ..
Depending on the setup, the GridTools and GHEX installation directories might have to be specified. This can be accomplished by passing additional arguments to cmake:
$ cmake -DGridTools_DIR=/PATH/TO/GRIDTOOLS-INSTALLATION \
-DGHEX_DIR=/PATH/TO/GHEX-INSTALLATION \
..
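Since GridTools_DIR and GHEX_DIR are standard CMake package hints, an alternative that works in many setups is to put both installation prefixes on CMAKE_PREFIX_PATH; this is a generic CMake mechanism, not something specific to this benchmark:
$ cmake -DCMAKE_PREFIX_PATH="/PATH/TO/GRIDTOOLS-INSTALLATION;/PATH/TO/GHEX-INSTALLATION" ..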
The GridTools backend specifies which hardware architecture to target. The backend can be selected by setting the GTBENCH_BACKEND option when configuring with CMake:
$ cmake -DGTBENCH_BACKEND=<BACKEND> ..
where <BACKEND> must be either x86, mc, or cuda. The x86 and mc backends are two different CPU backends of GridTools. On modern CPUs with a large vector width and/or many cores, the mc backend might perform significantly better. On CPUs without vectorization, or with a small vector width and limited parallelism, the x86 backend might perform better. The cuda backend currently supports running on NVIDIA CUDA-capable GPUs and – despite its name – also on AMD HIP-capable GPUs.
Note: This section is only relevant for GPU targets.
There are three GPU targets available, which are selected when configuring GridTools by setting the CMake parameter GT_CUDA_COMPILATION_TYPE:
- NVCC-CUDA: NVIDIA CUDA compilation using the NVIDIA NVCC compiler.
- Clang-CUDA: CUDA compilation using the Clang compiler.
- HIPCC-AMDGPU: AMD HIP compilation using AMD's HIP-Clang compiler. Note: the deprecated HCC compiler is not supported.
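For example, selecting the NVCC path when configuring GridTools itself might look like this (other GridTools options omitted, install prefix is a placeholder):
$ cmake -DGT_CUDA_COMPILATION_TYPE=NVCC-CUDA \
    -DCMAKE_INSTALL_PREFIX=/PATH/TO/GRIDTOOLS-INSTALLATION \
    ..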
The benchmark implementation provides several runtimes, implementing different scheduling and communication strategies. These can be selected using the CMake variable GTBENCH_RUNTIME:
$ cmake -DGTBENCH_RUNTIME=<RUNTIME> ..
where <RUNTIME> can be ghex_comm, gcl, simple_mpi, or single_node. The simple_mpi and single_node runtimes are intended for debugging purposes only.
- The single_node option is useful for performing "single-node" tests to understand kernel performance.
- The simple_mpi implementation uses simple two-sided MPI communication for halo exchanges.
- The gcl implementation uses an optimized MPI-based communication library shipped with GridTools.
- The ghex_comm option uses highly optimized distributed communication via the GHEX library, designed for best performance at scale. Additionally, this option enables a multi-threaded version of the benchmark, where a rank may own more than one sub-domain (over-subscription), which are delegated to separate threads. Note: the GridTools computations use OpenMP threads on the CPU backends, which are not affected by this parameter.
If the ghex_comm runtime has been selected, the underlying transport layer will be either UCX or MPI. The behaviour can be chosen by defining the CMake boolean variable GHEX_USE_UCP when configuring the GHEX library, see above.
The benchmark executable takes the global horizontal domain size as its command line parameters, given as --domain-size <NX> <NY>. The simulation is then performed on a total domain of NX×NY×60 grid points. To launch the benchmark, use the appropriate MPI launcher (mpirun, mpiexec, srun, or similar):
$ mpi_launcher <LAUNCHER_OPTIONS> ./benchmark --domain-size <NX> <NY>
Example output of a single-node benchmark run:
Running GTBENCH
Domain size: 100x100x60
Floating-point type: float
GridTools backend: cuda
Runtime: single_node
Median time: 0.198082s (95% confidence: 0.19754s - 0.200368s)
Columns per second: 50484.1 (95% confidence: 49908.1 - 50622.6)
For testing, the number of runs (and thus the run time) can be reduced as follows:
$ mpi_launcher <LAUNCHER_OPTIONS> ./benchmark --domain-size <NX> <NY> --runs <RUNS>
For example, run only once:
$ mpi_launcher ./benchmark --domain-size 24000 24000 --runs 1
Running GTBENCH
Domain size: 24000x24000x60
Floating-point type: float
GridTools backend: cuda
Runtime: ghex_comm
Median time: 8.97857s
Columns per second: 6.41528e+07
Note that no confidence intervals are given in this case, but they are required for the final benchmark runs.
Note that there are additional per-runtime command line options. Use ./benchmark --help to list all available options.
To make sure that the solver converges to the analytical solution of the advection-diffusion equation, we provide convergence tests. They may be helpful for verifying correctness after changes to the computation or runtime code, or after changes to the compiler optimization level. To run them, use:
$ mpi_launcher <LAUNCHER_OPTIONS> ./convergence_tests
The convergence tests can be run on 1, 2 or 4 MPI ranks.
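For example, with a generic MPI launcher (launcher name and options depend on the system):
$ mpirun -np 4 ./convergence_tests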
Example outputs for single and double precision configurations can be found in the files convergence_float.out and convergence_double.out.
Note that the expected convergence orders of some tests do not exactly match the theoretical order. This is due either to limited numerical precision or to a suboptimal range of tested spatial or temporal resolutions for a specific discretization.