
HPL-GPU and CALDGEMM How-To

This is a howto that explains how to set up, run, and tune CALDGEMM and HPL-GPU. CALDGEMM has four different backends: OpenCL, CAL, CUDA, or CPU only. In addition, three BLAS libraries are supported: Intel MKL, AMD ACML, and GotoBLAS. The user will have to select one backend and one BLAS library.

This howto will mostly cover the case where the backend is OpenCL running on AMD GPUs and the BLAS library is Intel MKL. Some side notes will explain the necessary changes to use other backends or BLAS libraries.

Many steps in this howto will be optional, or only necessary in certain cases.

Part 1: Distribution Requirements

In general, CALDGEMM and HPL-GPU should work on any Linux distribution. For this Howto, we assume an OpenSuSE 13.2 setup with minimal installation as baseline.

During this howto, we will need certain software from the standard OpenSuSE repository. Not all of these software packages will be required, depending on which path you follow (which BLAS library, which GPU backend, and which optional steps you need). The requirements are:

  • gcc-c++ (For compilation)
  • rpm-build (For building AMD driver package)
  • python (For adl3/atitweak utility)
  • gcc-fortran, mpc-devel, mpfr-devel, gmp-devel (For AMD ACML)
  • xdm (Display manager for headless X-Server)
  • nano (Text editor)

To install all these packages at once, so you do not need to bother with them during the howto, run as root:

zypper install gcc-c++ gcc-fortran rpm-build python xdm nano mpc-devel mpfr-devel gmp-devel

We will install HPL for a user called hpluser. Make sure that this user exists. We assume its home directory is /home/hpluser and that the $HOME environment variable points to this home directory.
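
If the user does not exist yet, a minimal way to create it is shown below (run as root; group, shell, and password handling are only examples and should follow your site policy):

useradd -m -d /home/hpluser hpluser
passwd hpluser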

Part 2: Environment

As a next step, we will have to set up the environment, both hardware and software. We will set up environment variables pointing to all the software packages (these are needed by the build process), some environment variables related to GPUs and to the X server, and certain ulimit values to allow large memory allocations.

  • In some cases, if you have an Intel CPU, you should disable HyperThreading in the BIOS before running CALDGEMM or HPL-GPU. Otherwise, the thread-pinning routines might get confused.

    • If you use the CAL backend, or if you use the OpenCL backend without GPU_C option (see CALDGEMM Command Line Options), you should disable HyperThreading.
    • If you use GotoBLAS as BLAS library, you should disable Hyperthreading.
    • If you use Intel MKL 2015 or later, the CUDA or OpenCL backend with GPU_C, and run on a Haswell system, it is suggested to enable HyperThreading.
  • CALDGEMM and HPL-GPU rely on anonymous huge pages for huge page allocation.

    • Please make sure that "Transparent Hugepage Support" in the Linux kernel is enabled and that the policy defaults to always (available since kernel 2.6.38); see the commands below for how to check this.
    • For older Linux kernels, there is also a deprecated option in CALDGEMM to provide huge page allocation. In order to use this, please enable #define USE_OLD_HUGE_MALLOC in caldgemm_config.h (see below).
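    • To check the current Transparent Hugepage policy and, if necessary, switch it to always at runtime, you can use the kernel's sysfs interface (run the echo command as root):
cat /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/enabled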
  • For the environment variables, we need to create a private library path, which we use to preload libraries by putting this first in $LD_LIBRARY_PATH. We also need a temp directory for downloads, etc.

mkdir $HOME/lib $HOME/tmp
  • To allow large memory allocations, we have to raise the ulimit limits for the non-root user hpluser. For this, we edit /etc/security/limits.conf
    • Add:
hpluser          -       memlock         unlimited
    • (hpluser is the username; adapt the line if you use a different one)
  • Next, we download the HPL-GPU and CALDGEMM software (because they contain some scripts required here). You can either install the latest release (by cloning the respective tag from the git repository, or by downloading the file) or you can check out the current master branch, for the very newest version. The master branch in the repository should usually be stable, while the test branch is used for development.
    • Downloading the files of the latest release: In this case, please unpack the files to $HOME and create symlinks $HOME/caldgemm and $HOME/hpl-gpu pointing to the versioned directories.
    • For cloning the master repository, do
cd $HOME
git clone https://github.com/davidrohr/caldgemm.git
git clone https://github.com/davidrohr/hpl-gpu.git
  • CALDGEMM comes with an example script for the environment: caldgemm_setenv.sh.sample. If you leave all directories as they are in this howto, you can just use this script as is. Otherwise, please change the script accordingly. The section Environment Variables contains a description of all relevant variables.

    • To bring the script in place:
cp $HOME/caldgemm/environment/caldgemm_setenv.sh.sample $HOME/caldgemm_setenv.sh
  • Now, you can set the environment via:
source $HOME/caldgemm_setenv.sh
    • If one of the `ulimit` commands in the script fails, you have probably not set up `/etc/security/limits.conf` properly as explained above.
    • If you want to have the proper environment available upon login as `hpluser`, add `source $HOME/caldgemm_setenv.sh` to `/home/hpluser/.bashrc`.
    • This script will modify your `DISPLAY` variable to enable a headless X setup. This will break SSH X-forwarding. Please refer to [[Headless System with X Server]] for details.
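    • To verify that the memlock limit is in effect, start a new shell as hpluser and run the command below; it should report unlimited:
ulimit -l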

Part 3: GPU Driver

In case of an AMD GPU, download the GPU driver from http://support.amd.com/en-us/download. For FirePro GPU cards, you need driver version 14.502.x or newer. Download the driver to /home/hpluser/tmp. We are using driver version 14.502.1040 (14.502.1040-linux-cert-retail.zip), which contains the driver installer file when you unpack it. In our example, the driver file is called amd-driver-installer-14.502.1040-x86.x86_64.run.

  • Build the proper RPM package for the SuSE distribution. You can get a list of all packages with the --listpkg option:
cd $HOME/tmp
./amd-driver-installer-14.502.1040-x86.x86_64.run --listpkg
  • From the listed packages, we select SuSE/SUSE132-AMD64 for OpenSuSE 13.2 and build the RPMs:
./amd-driver-installer-14.502.1040-x86.x86_64.run --buildpkg SuSE/SUSE132-AMD64
  • Now, we can install the driver:
zypper install --force fglrx*.rpm
  • For AMD GPUs, we have to create a proper X config and set a variable to allow large OpenCL buffers (we rename the old X config beforehand to really start from scratch, because aticonfig --initial sometimes has problems with an existing config):
mv /etc/X11/xorg.conf /etc/X11/xorg.old
aticonfig --initial --adapter=ALL
aticonfig --set-pcs-u32=MCIL,VmClientAddressSpaceGB,512

The setting in the last call should match the memory of your machine, i.e. the above line is for a server with 512 GB.

  • Now, we load the kernel module, and check whether it detected the GPUs:
modprobe fglrx
dmesg | grep fglrx

The last command should show something like:

[1437512.156586] <6>[fglrx] module loaded - fglrx 14.50.2 [Apr  6 2015] with 8 minors

Part 4: Headless X Setup (Optional)

This step is required for the following cases:

  • If you want to use CAL as GPU backend.
  • If you want to use the adl3 / atitweak utility to set AMD GPU's powertune feature.
  • If you want to use the aticonfig utility to monitor GPU clocks and temperatures on AMD GPUs.

A headless X setup is a setup where the server runs an X server with one screen per GPU, but the user does not log in on the X server; instead, the user logs in remotely via SSH. Sometimes, the X server handles certain GPU features, and in that case a running X server is needed, even though X itself is not used. Details can be found in the Headless System with X Server entry.

For a headless X Setup, we need to perform the following actions:

  • Create a proper X config. (We already did this while installing the GPU driver in the previous part via aticonfig --initial --adapter=ALL.)
  • Set xdm as Display Manager and disable some security features such that a user that logs in remotely can access the X server:
    • Edit: /etc/X11/xdm/xdm-config
    • Change:
 DisplayManager._0.authorize:    false           (Set to false)
    • Edit: /etc/sysconfig/displaymanager
    • Change:
DISPLAYMANAGER="xdm"
DISPLAYMANAGER_REMOTE_ACCESS="yes"
DISPLAYMANAGER_ROOT_LOGIN_REMOTE="yes"
DISPLAYMANAGER_XSERVER_TCP_PORT_6000_OPEN="yes"
  • Now, you can start the X server via
rcxdm start

and stop it with

rcxdm stop
  • After the startup, you'll have to wait a certain time for the X server to come up. You can trace the X-log via
tail -f /var/log/Xorg.0.log

until you see the following lines, which indicate that the GPUs are ready (one line per GPU in the server):

[1437157.875] (II) fglrx(0): Restoring Recent Mode via PCS is not supported in RANDR 1.2 capable environments
[1437157.875] (II) fglrx(1): Restoring Recent Mode via PCS is not supported in RANDR 1.2 capable environments
...
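
Once these lines appear, you can optionally verify from an SSH session that the X server accepts connections, for example with the aticonfig utility (assuming the X server runs on display :0; adapt if your setup differs):

DISPLAY=:0 aticonfig --odgt --adapter=ALL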

Part 5: Install the adl3 / atitweak utility (Optional)

This utility can be used to set the powertune level of AMD GPUs. It requires python and an X server. You can get it from github: https://github.com/mjmvisser/adl3:

cd $HOME
git clone https://github.com/mjmvisser/adl3.git

To verify that atitweak works, with a running X server, please execute:

$HOME/adl3/atitweak -k

In order to obtain the best performance in DGEMM (and hence in HPL) on AMD GPUs, you have to set powertune to raise the GPU's TDP (see HPL Tuning). Please be aware that this could damage your hardware, because it might raise the TDP beyond the specifications of the hardware. AMD GPUs offer a certain range in which powertune can be set, usually -20% to 20% or -50% to 50%. To set powertune to 50%, please run:

$HOME/adl3/atitweak -p 50

Please be aware that similar TDP limitations hold true for NVIDIA GPUs.

Part 6: Installing the BLAS library

CALDGEMM and HPL-GPU support three BLAS libraries: Intel MKL, AMD ACML, and GotoBLAS2. Other BLAS libraries might be supported but are not tested. Given that certain patches are required for GotoBLAS and ACML, it is likely that other BLAS libraries will require similar patches as well to work with CALDGEMM and HPL-GPU at full performance.

In general, this Wiki assumes usage of the Intel MKL library, but this part will explain the setup of all three libraries. In the later parts, you will have to select the proper one if you are not using MKL.

  • Installation of Intel MKL works quite out of the box. Just install it as provided from Intel and make sure the environment paths are set correctly. (caldgemm_setenv.sh.sample assumes an installation in $HOME/intel.) Two paths are relevant:
    • $MKL_PATH: This should point to the actual MKL directory.
    • $ICC_PATH: The MKL library requires the Intel OpenMP runtime library. If the Intel Compiler (ICC) is installed, it brings the OpenMP runtime with it; in that case, $ICC_PATH can point to the ICC directory. If ICC is not installed, MKL brings a redistributable version of the OpenMP runtime with it. A sketch of the two exports is shown below.
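    • For illustration, the two variables might be set like this (the paths are examples and depend on your Intel installation layout; caldgemm_setenv.sh.sample assumes an installation under $HOME/intel):
export MKL_PATH=$HOME/intel/mkl   # example path to the MKL directory
export ICC_PATH=$HOME/intel       # example path to the Intel installation providing the OpenMP runtime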
  • Installation of GotoBLAS is slightly more complicated. CALDGEMM requires a patch to be applied to GotoBLAS in order to reserve CPU cores for GPU-related tasks.
    • Download the latest GotoBLAS (1.13) from here to $HOME/tmp.
    • Unpack it to $HOME/GotoBLAS2:
cd $HOME
tar -zxf tmp/GotoBLAS2-1.13.tar.gz
  • Apply the GotoBLAS patch from CALDGEMM:
cd GotoBLAS2
patch -p0 < ../caldgemm/gotoblas_patch/gotoblas.patch
  • You may want to edit Makefile.rule and set TARGET to your CPU architecture (like NEHALEM or BARCELONA).
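    • For example, on an Intel Nehalem-class CPU (adapt the value to your hardware), the relevant line in Makefile.rule would be:
TARGET = NEHALEM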
  • Compile GotoBLAS:
make -j32
  • The build process should have created the library file $HOME/GotoBLAS2/libgoto2.a.
  • Installation of AMD ACML is even more complicated, as it requires a patch to GCC's libgomp OpenMP runtime library. Without this patch, libgomp will continuously terminate and recreate threads during HPL-GPU execution, resulting in very poor performance. In addition, ACML comes only with a BLAS interface, and we will have to compile a CBLAS interface separately.
    • Download the latest version of ACML from here to $HOME/tmp. (Currently, this is 6.1.0. Please use the Linux version for GCC / GFORTRAN.)
    • Unpack ACML:
mkdir $HOME/acml
cd $HOME/acml
tar -zxf ../tmp/acml-6.1.0.31-gfortran64.tgz
  • Download and build the CBLAS interface from Netlib:
cd $HOME/tmp
wget http://www.netlib.org/blas/blast-forum/cblas.tgz
cd ..
tar -zxf tmp/cblas.tgz
cd CBLAS
make alllib

This should have created a library file $HOME/CBLAS/lib/cblas_LINUX.a.

  • In the next step, we will create a patched GCC libgomp. We will download the GCC sources, apply the patch that comes with caldgemm, and compile libgomp only. Then, we copy the patched libgomp.so file to $HOME/lib where it gets precedence in $LD_LIBRARY_PATH over the system library.
    • Download the gcc sources that match your system gcc (you can obtain the version via gcc --version) from a mirror at gcc.gnu.org. In my case, I use the gwdg.de mirror and download version 4.8.3:
cd $HOME/tmp
wget ftp://ftp.gwdg.de/pub/misc/gcc/releases/gcc-4.8.3/gcc-4.8.3.tar.gz
    • The next step is to unpack the sources and apply the patch:
tar -zxf gcc-4.8.3.tar.gz
cd gcc-4.8.3
patch -p0 < $HOME/caldgemm/gcc_patch/libgomp.patch

Due to possible changes to libgomp, the patch will not work out of the box for all GCC versions. Fortunately, changes to the thread creation part in team.c are quite rare, so the patch usually just works - and if not, it can be adapted easily.

    • We have to configure gcc (the entire gcc, not libgomp separately) and compile all relevant parts for libgomp:

./configure --disable-multilib --disable-bootstrap
make all-target-libgomp -j32

This should have created the library file libgomp/.libs/libgomp.so in the respective architecture directory. (In my case /home/hpluser/tmp/gcc-4.8.3/x86_64-unknown-linux-gnu/libgomp/.libs/libgomp.so.) As a last step, we have to copy the library files to the library directory:

cp /home/hpluser/tmp/gcc-4.8.3/x86_64-unknown-linux-gnu/libgomp/.libs/libgomp.so* $HOME/lib

Part 7: Installing the SDK for the GPU Backend

If you do not want to use the CPU backend of CALDGEMM (which you probably do not want, because you want to run HPL-GPU on GPUs), you have to install at least one of the following SDKs.

  • Installation of AMD OpenCL SDK: The OpenCL development toolkit of AMD is called AMD APP SDK, and it is available here. Please install the most current version (currently 2.9.1) to the path indicated by $AMDAPPSDKROOT in your environment (caldgemm_setenv.sh.sample).

  • Installation of AMD CAL: AMD CAL is deprecated and the CAL headers are no longer included in the AMD APP SDK. The suggested method is to install the current APP SDK first (as detailed above for AMD OpenCL). Then, fetch an older version of the AMD APP SDK or AMD Stream SDK that still contains the CAL headers, and extract these headers to $AMDAPPSDKROOT/include/CAL.

  • Installation of CUDA: To use NVIDIA CUDA you need to install the NVIDIA kernel module and the NVIDIA CUDA SDK. You can find packages for most distributions at developer.nvidia.com/cuda-downloads.
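
As a quick sanity check that the OpenCL runtime detects your GPUs, you can run the clinfo utility that ships with the AMD APP SDK (its location may vary between SDK versions; the path below assumes a 64-bit installation under $AMDAPPSDKROOT) and check that your GPUs are listed:

$AMDAPPSDKROOT/bin/x86_64/clinfo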

Part 8: Installing 3rd party OpenCL DGEMM kernel libraries (Optional)

To obtain best performance with the OpenCL backend, you need a 3rd party library with an optimized DGEMM kernel for your particular GPU. These libraries are usually provided as binary files from the vendor.

In our example, we use binary DGEMM kernels provided by AMD for the Tahiti and Hawaii GPU families. For convenience, the shared object file with the kernels is shipped with CALDGEMM in the directory 3rd_party_dgemm_kernels. Please copy that file to the lib directory created earlier:

 cp $HOME/caldgemm/3rd_party_dgemm_kernels/amd_dgemm_2015_08_05/amddgemm.so $HOME/lib

For other GPUs, the kernel binary must be provided by the vendor:

  • In certain cases, a special OpenCL runtime is also required to run this kernel. This special OpenCL runtime should come with the library.
  • Please download the library and unpack all contained files to $HOME/lib.

Alternatively, you can create your own DGEMM kernel. There is an example in 3rd_party_template in the caldgemm repository.

Part 9: Install AMD Display Library (ADL) (Optional)

This part is optional and only required if you want to enable ADL support in CALDGEMM and / or HPL-GPU. With ADL support, monitoring of GPU temperatures is possible.

  • Download the SDK from the AMD Website.
  • Unpack it to $HOME/ADL.
  • In the following steps, you will have to modify the caldgemm and HPL configuration files:
    • In CALDGEMM, comment out #define _NO_ADL in caldgemm_config.h.
    • In HPL-GPU, enable the HPL_DEFS += -DHPL_GPU_TEMPERATURE_THRESHOLD=92 option in the HPL-GPU compile time configuration file. (See HPL Tuning.) Both edits are sketched below.
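
For illustration, the two edits could look like this (assuming the compile time configuration file is Make.Generic.Options as prepared in Part 13):

In caldgemm_config.h:
//#define _NO_ADL

In Make.Generic.Options:
HPL_DEFS += -DHPL_GPU_TEMPERATURE_THRESHOLD=92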

Part 10: Apply AMD CAL binary driver patch (Optional)

This step is only required if you want to use the CAL backend with the -o c option. (See CALDGEMM Command Line Options for a list of options.) In essence, this is the case if you are working on a system with AMD processors. Be aware that this patch is only available for some older AMD GPU driver versions. For current GPUs, the OpenCL backend is favored, which does not require such a binary patch.

  • Apply the patch to libaticaldd.so as explained in Catalyst Driver Patch.
  • Copy the patched library file to $HOME/lib.

Part 11: Installing MPI for Multi-Node runs (Optional)

MPI is only required if you want to run a distributed Linpack on multiple nodes. Three MPI implementations have been tested: OpenMPI, MPICH / MVAPICH, and Intel MPI. The suggested implementation is OpenMPI. If you want to use another MPI implementation, please comment out the respective MPdir, MPinc, and MPlib lines in the HPL compile time configuration as explained below.

For using OpenMPI, you can either install a package that comes with your distribution or compile a current version by yourself. Since the packages delivered by the distribution are usually outdated, it is suggested to compile the current version from the sources.

  • Please download the current version from www.open-mpi.org/software. You should use version 1.8 or higher.
  • Unpack, configure and compile OpenMPI. Make sure the $OPENMPI_PATH environment variable points to the correct folder.
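
A possible build sequence is sketched below (the version number and installation prefix are examples; adapt them to the tarball you downloaded and make sure $OPENMPI_PATH points to the chosen prefix):

cd $HOME/tmp
tar -jxf openmpi-1.8.8.tar.bz2
cd openmpi-1.8.8
./configure --prefix=$HOME/openmpi
make -j32
make install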

Part 12: Intel Threading Building Blocks (TBB) (Optional)

HPL-GPU requires the Intel Threading Building Blocks. There are three methods to provide this library:

  • You can provide a $TBB_PATH environment variable. HPL-GPU will then automatically search for the TBB headers in $TBB_PATH/include/tbb and for the shared object files in $TBB_PATH/lib/intel64.
  • You can create symbolic links in $HOME/hpl-gpu/include and $HOME/hpl-gpu/lib to tbb. For instance, in case you use the TBB library that comes with Intel MKL (with $INTELPATH the common Intel Software path), you can create the links in the following way:
cd $HOME/hpl-gpu/include
ln -s $INTELPATH/tbb/include/tbb
cd ../lib
for i in $INTELPATH/tbb/lib/intel64/gcc4.4/libtbb*; do ln -s $i; done
  • If you provide neither the $TBB_PATH variable nor the symbolic links, the build system will automatically download and compile TBB.
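
For the first method, a minimal sketch is shown below (the path is an example for a standalone TBB installation that provides include/tbb and lib/intel64; note that the TBB bundled with the Intel tools keeps its shared objects in a compiler-specific subdirectory of lib/intel64, in which case the symlink method above is easier):

export TBB_PATH=$HOME/tbb   # example path to a standalone TBB installation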

Part 13: Compiling CALDGEMM and HPL-GPU

This howto assumes that we want to build the OpenCL backend of CALDGEMM, which means that both CALDGEMM and HPL-GPU must link against OpenCL (libOpenCL.so). The standard config files that are provided assume the AMD OpenCL SDK and will search for the libraries in $AMDAPPSDKROOT/lib/x86_64. If your configuration differs, you will have to adapt the build scripts.

  • First, we have to setup the CALDGEMM configuration files:
cd $HOME/caldgemm
cp config_options.sample config_options.mak
cp caldgemm_config.sample caldgemm_config.h
  • Usually, you can leave caldgemm_config.h as is. You will only need to edit this file if you use the CAL backend. In that case, refer to the documentation in HPL Tuning.
  • Please edit config_options.mak, enable the GPU backend(s) you want to support, choose the BLAS backend, and set CONFIGURED to 1. In my case, this yields the following settings:
BLAS_BACKEND                            = MKL
INCLUDE_OPENCL                          = 1
INCLUDE_CAL                             = 0
INCLUDE_CUDA                            = 0
CONFIGURED                              = 1
  • Build CALDGEMM
make -j
  • Create a link to caldgemm inside HPL-GPU and prepare HPL compile time config files:
cd ..
cd hpl-gpu
ln -s ../caldgemm
cp setup/Make.Generic setup/Make.Generic.Options .
  • Edit Make.Generic.Options (HPL configuration file with tuning parameters)
    • Set HPL_CONFIG_MPI depending on whether you compiled OpenMPI or not.
    • You can leave HPL_CONFIG_VERBOSE at 3 for single-node tests. For multi-node setups it should be reduced.
    • Select one of the backends enabled in CALDGEMM's config_options.mak as HPL_CALDGEMM_BACKEND.
    • It should be safe to leave the rest as it is. Refer to HPL Tuning for a description of the parameters. My config file looks like:
HPL_CONFIG_MPI = 0
HPL_CONFIG_VERBOSE = 3
HPL_CALDGEMM_BACKEND = opencl
  • Now, we can compile HPL:
./build.sh
  • Finally, we have to set up HPL's runtime configuration files bin/Generic/HPL.dat and bin/Generic/HPL-GPU.conf:
    • In HPL.dat set:
      • NBs set to 1920
      • Ns must be set in the following way: (Ns * Ns * 8) must be smaller than the system memory, and Ns should be a multiple of NBs
        • Example (NBs = 1920, Memory Size 128 GB): 60 * 1920 = 115200, 115200 * 115200 * sizeof(double) = 106 168 320 000 --> 106 GB, OK for 128 GB of system memory: we use 115200
      • LOOKAHEADs: set to 2
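      • For illustration, with the values above (one problem size of 115200 and one block size of 1920), the corresponding lines in bin/Generic/HPL.dat would look roughly like this (the shipped file follows the usual HPL.dat layout; edit the existing values rather than replacing the file):
1            # of problems sizes (N)
115200       Ns
1            # of NBs
1920         NBs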
    • Refer to HPL Tuning for documentation on the options of HPL-GPU.conf.
    • If you want to configure a multi-node run where the participating nodes are inhomogeneous, please refer to Heterogeneous cluster with different node types for documentation on how to set up node-perf.dat for a heterogeneous cluster.

Part 14: Run HPL-GPU

Before we run HPL-GPU, we might want to perform two optional steps:

  1. Set Powertune
  • Start the X server (as root)
rcxdm start
  • Wait for some time for X to come up (see above).
  • Set powertune:
/home/hpluser/adl3/atitweak -p 20

(This will increase the TDP by 20%; the value can be set between 0 and 50.)

  • Stop X server again (HPL-GPU runs faster if X is shut down)
rcxdm stop
  2. Monitor GPU Temperature and Clocks:
  • The X server must be running to use the aticonfig utility.
    • (Be aware: you cannot start X while HPL is running!)
rcxdm start
  • Execute the following command in a second shell to monitor GPU parameters:
while true; do clear && aticonfig --odgc --odgt --adapter=ALL && sleep 2; done

Finally, everything is ready to start your HPL-GPU run:

cd $HOME/hpl-gpu/bin/Generic/
./xhpl
  • You can see the current performance during the run in the line that ends with "System Gflops *******", where ******* is the performance.
  • The final performance is shown in the line that starts like:
WC26L2C64 ................................ *******
  • At the end, HPL performs a verification of whether the result is correct.
    • It MUST print "PASSED"; otherwise there was a computational error.

Part 15: Tuning

In order to tune HPL-GPU performance on OpenCL, you should proceed in three steps:

  • Ensure sufficient PCI Express performance:
    • For this purpose, CALDGEMM provides a test tool in $HOME/caldgemm/memtest.
    • Compile it via:
cd $HOME/caldgemm/memtest
./build.sh
  • Then, run the test command listed in $HOME/caldgemm/memtest/cmd:
./mem -g -2 -c -1 -x -z -l -lh 3072 -lw 3072 -lx 20 -ly 20 -a -u
  • The relevant line in the output is the bidirectional aggregate (all devices) performance for the mode Buffer/Strided. The last column in that line, the value for MaxFlop, shows the highest DGEMM performance achievable given the measured PCI Express bandwidth. On my system, this line looks like:
Platform 0 Device all:   Buffer/Strided   -  to GPU: 43.458 GB/s (0.222 s)  -  to Host: 52.997 GB/s (0.182 s)  -  bidir: 51.317 GB/s (0.377 s)  -  MaxFlop 13137.197

so PCI Express does not limit the DGEMM performance as long as it is below 13 TFLOPS.

  • The next step is to optimize the DGEMM performance itself. This Wiki contains several entries with guidelines on how to tune DGEMM performance:
    • Section 2a) in HPL Tuning gives a rough guideline on the most important aspects and how DGEMM and HPL performance are related.
    • The CALDGEMM Performance Guide for CAL and OpenCL gives an overview over many CALDGEMM tuning options. However, not all of the remarks in there are relevant for the new OpenCL version with the GPU_C option, but it is a good starting place.
    • The CALDGEMM Performance Guide for OpenCL and CUDA provides guideline updates for the new GPU_C option. It supersedes the older guide for GPU_C where the guidelines differ, but it does not repeat all the basic guidelines from the older guide. In general, it is a good idea for an enthusiastic user to go through both guides to learn about the internals.
    • For reference, there is a list of all CALDGEMM Command Line Options and a set of CALDGEMM dgemm_bench examples.
  • As soon as CALDGEMM achieves good DGEMM performance, the next goal is to transfer and maintain this performance in HPL-GPU. Please refer to HPL Tuning for guidelines.