
HPL-GPU and CALDGEMM How-To

  • caldgemm: DGEMM Library
  • hpl: Our version of HPL, called HPL-GPU. It is based on HPL 2.0 and customized.
  • adl3: Tool to set AMD powertune
  • lib: Special OpenCL library from AMD for HPL
  • tmp/amd-driver-installer-14.50.2.-1006: Special AMD Driver
  • memtest: My DMA bandwidth test
  • .bashrc: Setup file for environment

Part 1: Distribution Requirements

In general, CALDGEMM and HPL-GPU should work on any Linux distribution. For this howto, we assume an OpenSuSE 13.2 setup with a minimal installation as the baseline.

During this howto, we will need certain software from the standard OpenSuSE repository. Not all of these software packages will be required; it depends on which path you follow (which BLAS library, which GPU backend, and which optional steps you need). The requirements are:

  • gcc-c++ (For compilation)
  • rpm-build (For building AMD driver package)
  • python (For adl3/atitweak utility)
  • gcc-fortran, mpc-devel, mpfr-devel, gmp-devel (For AMD ACML)
  • xdm (Display manager for headless X-Server)
  • nano (Text editor)

To just install all these packages at once, so you do not need to bother with them during the howto, run as root:

zypper install gcc-c++ gcc-fortran rpm-build python xdm nano mpc-devel mpfr-devel gmp-devel

We will install HPL for the user called hpluser. Make sure that this user exists. We assume its home directory is /home/hpluser and that the $HOME environment variable points to it.
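
If the user does not exist yet, it can be created in the usual way; a minimal sketch (the shell choice is an assumption, adapt as needed), run as root:

useradd -m -d /home/hpluser -s /bin/bash hpluser    # create user with home directory
passwd hpluser                                      # set a password for SSH logins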

Part 2: Environment

As a next step, we have to set up the environment. We will set environment variables pointing to all the software packages (these are needed by the build process), plus some environment variables related to GPUs and to the X server, and we will set certain ulimit values to allow large memory allocations.

  • First, we need to create a private library path, which we use to preload libraries by putting it first in $LD_LIBRARY_PATH. We also need a temp directory for downloads, etc.
mkdir $HOME/lib $HOME/tmp
  • To allow large memory allocations, we have to edit the ulimit limits for the non-root user hpluser. For this, we edit /etc/security/limits.conf
    • Add:
hpluser          -       memlock         unlimited          (hpluser is the username)
  • Next, we download the HPL-GPU and CALDGEMM software (because they contain some scripts required here). You can either install the latest release (by cloning the respective tag from the git repository, or by downloading the release archive), or you can check out the current master branch for the very newest version. The master branch in the repository should usually be stable, while the test branch is used for development.
    • Downloading the files of the latest release: In this case, please unpack the files to $HOME and create symlinks $HOME/caldgemm and $HOME/hpl-gpu pointing to the versioned directories.
    • For cloning the master branches of the repositories, do:
cd $HOME
git clone https://github.com/davidrohr/caldgemm.git
git clone https://github.com/davidrohr/hpl-gpu.git
  • CALDGEMM comes with an example script for the environment: caldgemm_setenv.sh.sample. If you leave all directories as they are in this howto, you can just use this script as is. Otherwise, please change the script accordingly.

    • To bring the script in place:
cp $HOME/caldgemm/environment/caldgemm_setenv.sh.sample $HOME/caldgemm_setenv.sh
  • Now, you can set the environment via:
source $HOME/caldgemm_setenv.sh
  • If one of the ulimit commands in the script fails, you have probably not set up /etc/security/limits.conf properly as explained above. (See the quick check after this list.)
  • If you want to have the proper environment available upon login as hpluser, add source $HOME/caldgemm_setenv.sh to /home/hpluser/.bashrc.
  • This script will modify your DISPLAY variable to enable a headless X setup. This will break SSH X-forwarding. Please refer to the Headless System with X Server entry for details.
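
To quickly check that both the limits.conf entry and the environment script took effect, open a fresh login shell as hpluser (a minimal sketch; the expected output is given in the comments):

ulimit -l                             # should print "unlimited" (the memlock limit)
echo $LD_LIBRARY_PATH | cut -d: -f1   # should print /home/hpluser/lib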

Part 3: GPU Driver

In case of an AMD GPU, download the GPU driver from http://support.amd.com/de-de/download. For FirePro GPU cards, you need driver version 14.502.x or newer. Download this driver to /home/hpluser/tmp. In the example, the driver file is called amd-driver-installer-14.502-150406a-182396E-Retail_End_User-x86.x86_64.run.

  • Build the proper RPM package for the SuSE distribution. You can get a list of all packages with the --listpkg option:
cd $HOME/tmp
./amd-driver-installer-14.502-150406a-182396E-Retail_End_User-x86.x86_64.run --listpkg
  • From the listed packages, we select SuSE/SUSE132-AMD64 for OpenSuSE 13.2 and build the RPMs:
./amd-driver-installer-14.502-150406a-182396E-Retail_End_User-x86.x86_64.run --buildpkg SuSE/SUSE132-AMD64
  • Now, we can install the driver:
zypper install --force fglrx*.rpm
  • For AMD GPUs, we have to create a proper X-config and set a variable to allow large OpenCL buffers:
aticonfig --initial --adapter=ALL
aticonfig --set-pcs-u32=MCIL,VmClientAddressSpaceGB,512

The value in the last call should match the memory of your machine, i.e. the above line is for a server with 512 GB of RAM.
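
For example, on a machine with 256 GB of memory, the corresponding call would be:

aticonfig --set-pcs-u32=MCIL,VmClientAddressSpaceGB,256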

  • Now, we load the kernel module, and check whether it detected the GPUs:
modprobe fglrx
dmesg | grep fglrx

The last command should show something like:

[1437512.156586] <6>[fglrx] module loaded - fglrx 14.50.2 [Apr  6 2015] with 8 minors
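
In addition, you can list the adapters the driver sees as a quick sanity check (one entry should appear per GPU):

aticonfig --list-adapters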

Part 4: Headless X Setup (Optional)

This step is required for the following cases:

  • If you want to use CAL as GPU backend.
  • If you want to use the adl3 / atitweak utility to set AMD GPU's powertune feature.
  • If you want to use the aticonfig utility to monitor GPU clocks and temperatures on AMD GPUs.

A headless X setup is one where the server runs an X server with one screen per GPU, but the user does not log in at the X server; instead, the user logs in remotely via SSH. Sometimes, the X server handles certain GPU features; in that case, a running X server is needed even though X itself is not used. Details can be found in the Headless System with X Server entry.

For a headless X Setup, we need to perform the following actions:

  • Create a proper X config. (We already did this while installing the GPU driver in the previous part via aticonfig --initial --adapter=ALL.)
  • Set xdm as Display Manager and disable some security features such that a user that logs in remotely can access the X server:
    • Edit: /etc/X11/xdm/xdm-config
    • Change:
 DisplayManager._0.authorize:    false           (Set to false)
    • Edit: /etc/sysconfig/displaymanager
    • Change:
DISPLAYMANAGER="xdm"
DISPLAYMANAGER_REMOTE_ACCESS="yes"
DISPLAYMANAGER_ROOT_LOGIN_REMOTE="yes"
DISPLAYMANAGER_XSERVER_TCP_PORT_6000_OPEN="yes"
  • Now, you can start the X server via
rcxdm start

and stop it with

rcxdm stop
  • After the startup, you'll have to wait a certain time for the X server to come up. You can trace the X-log via
tail -f /var/log/Xorg.0.log

until you see the following lines, which indicate that the GPUs are ready (one line per GPU in the server):

[1437157.875] (II) fglrx(0): Restoring Recent Mode via PCS is not supported in RANDR 1.2 capable environments
[1437157.875] (II) fglrx(1): Restoring Recent Mode via PCS is not supported in RANDR 1.2 capable environments
...
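
Once these lines have appeared, you can verify that the X server is reachable from an SSH session (a minimal sketch, assuming the headless server runs on display :0 as configured by caldgemm_setenv.sh):

export DISPLAY=:0
aticonfig --odgt --adapter=ALL    # should print one temperature per GPU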

Part 5: Install the adl3 / atitweak utility (Optional)

This utility can be used to set the powertune level of AMD GPUs. It requires python and a running X server. You can get it from GitHub at https://github.com/mjmvisser/adl3:

cd $HOME
git clone https://github.com/mjmvisser/adl3.git

To verify that atitweak works, please execute the following with a running X server:

$HOME/adl3/atitweak -k

In order to obtain the best DGEMM performance (and hence the best HPL performance) on AMD GPUs, you have to use powertune to raise the GPU's TDP (see HPL Tuning). Please be aware that this could damage your hardware, because it might raise the TDP beyond the specifications of the hardware. AMD GPUs offer a certain range in which powertune can be set, usually -20% to 20% or -50% to 50%. To set powertune to 50%, please run:

$HOME/adl3/atitweak -p 50

Please be aware that similar TDP limitations hold true for NVIDIA GPUs.
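
To revert to the stock TDP after your runs, the same option can be used:

$HOME/adl3/atitweak -p 0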

Part 6: Installing the BLAS library

CALDGEMM and HPL-GPU support three BLAS libraries: Intel MKL, AMD ACML, and GotoBLAS2. Other BLAS libraries might be supported but are not tested. Given that certain patches are required for GotoBLAS and ACML, other BLAS libraries will likely require similar patches as well to work with CALDGEMM and HPL-GPU at full performance.

In general, this Wiki assumes usage of the Intel MKL library, but this part will explain the setup of all three libraries. In the later parts, you will have to select the proper one if you are not using MKL.

  • Installation of Intel MKL works quite out of the box. Just install it as provided from Intel and make sure the environment paths are set correctly. (caldgemm_setenv.sh.sample assumes an installation in $HOME/intel.) Two paths are relevant:
    • $MKL_PATH: This should point to the actual MKL directory.
    • $ICC_PATH: The MKL library requires the Intel OpenMP runtime library. If the Intel Compiler (ICC) is installed, it brings the OpenMP runtime with it. In that case, $ICC_PATH can point to the ICC directory. If ICC is not installed, the MKL brings a redistributable version of the OpenMP runtime with it.
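
With the directory layout assumed by caldgemm_setenv.sh.sample, the two paths could be set as follows (only a sketch; the exact subdirectories depend on your MKL version and installation):

export MKL_PATH=$HOME/intel/mkl    # the actual MKL directory
export ICC_PATH=$HOME/intel        # where the Intel OpenMP runtime can be found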
  • Installation of GotoBLAS is slightly more complicated. CALDGEMM requires a patch to be applied to GotoBLAS in order to reserve CPU cores for GPU-related tasks.
    • Download the latest GotoBLAS (1.13) from here to $HOME/tmp.
    • Unpack it to $HOME/GotoBLAS2:
cd $HOME
tar -zxf tmp/GotoBLAS2-1.13.tar.gz
  • Apply the GotoBLAS patch from CALDGEMM:
cd GotoBLAS2
patch -p0 < ../caldgemm/gotoblas_patch/gotoblas.patch
  • You may want to edit Makefile.rule and set TARGET to your CPU architecture (like NEHALEM or BARCELONA).
  • Compile GotoBLAS:
make -j32
  • The build process should have created the library file $HOME/GotoBLAS2/libgoto2.a.
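
To confirm that the build succeeded, you can check that the DGEMM symbol is present in the static library (a quick sanity check):

nm $HOME/GotoBLAS2/libgoto2.a | grep -w dgemm_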
  • Installation of AMD ACML is even more complicated, as it requires a patch to GCC's libgomp OpenMP runtime library. Without this patch, libgomp will continuously terminate and recreate threads during HPL-GPU execution, resulting in very poor performance. In addition, ACML comes only with a BLAS interface, so we will have to compile a CBLAS interface separately.
    • Download the latest version of ACML from here to $HOME/tmp. (Currently, this is 6.1.0. Please use the Linux version for GCC / GFORTRAN.)
    • Unpack ACML:
mkdir $HOME/acml
cd $HOME/acml
tar -zxf ../tmp/acml-6.1.0.31-gfortran64.tgz
    • Download and build the CBLAS interface from netlib:
cd $HOME/tmp
wget http://www.netlib.org/blas/blast-forum/cblas.tgz
cd ..
tar -zxf tmp/cblas.tgz
cd CBLAS
make alllib

This should have created a library file $HOME/CBLAS/lib/cblas_LINUX.a.
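
Analogously to GotoBLAS above, a quick symbol check can confirm the CBLAS build (cblas_dgemm being the standard CBLAS DGEMM entry point):

nm $HOME/CBLAS/lib/cblas_LINUX.a | grep -w cblas_dgemm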

  • In the next step, we will create a patched GCC libgomp. We will download the GCC sources, apply the patch that comes with caldgemm, and compile libgomp only. Then, we copy the patched libgomp.so file to $HOME/lib, where it gets precedence in $LD_LIBRARY_PATH over the system library.
    • Download the gcc sources that match your system gcc (you can obtain the version via gcc --version) from a mirror at gcc.gnu.org. In my case, I use the gwdg.de mirror and download version 4.8.3:
cd $HOME/tmp
wget ftp://ftp.gwdg.de/pub/misc/gcc/releases/gcc-4.8.3/gcc-4.8.3.tar.gz
    • The next step is to unpack the sources and apply the patch:
tar -zxf gcc-4.8.3.tar.gz
cd gcc-4.8.3
patch -p0 < $HOME/caldgemm/gcc_patch/libgomp.patch

Due to possible changes to libgomp, the patch will not work out of the box for all GCC versions. Fortunately, changes to the thread creation part in team.c are quite rare, such that the patch usually just works - and if not, it can be adapted easily.

    • We have to configure gcc (the entire gcc, not libgomp separately) and compile all relevant parts for libgomp:

./configure --disable-multilib --disable-bootstrap
make all-target-libgomp -j32

This should have created the library file libgomp/.libs/libgomp.so in the respective architecture directory. (In my case, /home/hpluser/tmp/gcc-4.8.3/x86_64-unknown-linux-gnu/libgomp/.libs/libgomp.so.) As a last step, we have to copy the library files to the library directory:

cp /home/hpluser/tmp/gcc-4.8.3/x86_64-unknown-linux-gnu/libgomp/.libs/libgomp.so* $HOME/lib
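
After copying, the patched libgomp should be visible in the private library directory (which caldgemm_setenv.sh puts first in $LD_LIBRARY_PATH, so it shadows the system copy):

ls -l $HOME/lib/libgomp.so*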

Part 7: Installing the SDK for the GPU Backend

If you do not want to use the CPU backend of CALDGEMM (which you probably do not, because you want to run HPL-GPU on GPUs), you have to install at least one of the following SDKs.

  • Installation of AMD OpenCL SDK: The OpenCL development toolkit of AMD is called AMD APP SDK, and it is available here. Please install the most current version (currently 2.9.1) to the path indicated by $AMDAPPSDKROOT in your environment (caldgemm_setenv.sh.sample).

  • Installation of AMD CAL: AMD CAL is deprecated and the CAL headers are no longer included in the AMD APP SDK. The suggested method is to install the current APP SDK first (as detailed above for AMD OpenCL). Then, fetch an older version of the AMD APP SDK or AMD Stream SDK that still contains the CAL headers, and extract these headers to $AMDAPPSDKROOT/include/CAL.

  • Installation of CUDA: To use NVIDIA CUDA you need to install the NVIDIA kernel module and the NVIDIA CUDA SDK. You can find packages for most distributions at developer.nvidia.com/cuda-downloads.
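
A quick way to check that an OpenCL platform is visible after installing one of the SDKs is the clinfo tool that ships with the AMD APP SDK (the path below is an assumption; adjust it to your installation):

$AMDAPPSDKROOT/bin/x86_64/clinfo    # lists all OpenCL platforms and devices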

Part 8: Installing 3rd party OpenCL DGEMM kernel libraries (Optional)

To obtain best performance with the OpenCL backend, you need a 3rd party library with an optimized DGEMM kernel for your particular GPU. These libraries are usually provided as binary files from the vendor.

To obtain the kernels for AMD Tahiti and AMD Hawaii family GPUs, please download the library here: MISSING!!!

In certain cases, a special OpenCL runtime is also required to run this kernel. This special OpenCL runtime should come with the library.

Please download the library and unpack all contained files to $HOME/lib.

Part 9: Install the AMD Display Library (ADL) (Optional)

This part is optional and only required if you want to enable ADL support in CALDGEMM and/or HPL-GPU. With ADL support, monitoring of GPU temperatures is possible.

  • Download the SDK from the AMD Website.
  • Unpack it to $HOME/ADL.
  • In the following steps, you will have to modify the caldgemm and HPL configuration files:
    • In CALDGEMM, comment out #define _NO_ADL in caldgemm_config.h.
    • In HPL-GPU, enable the HPL_DEFS += -DHPL_GPU_TEMPERATURE_THRESHOLD=92 option in the HPL-GPU compile-time configuration file. (See HPL Tuning.)
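
The CALDGEMM part of this change can also be scripted; a hypothetical sed one-liner (assuming the define sits at the beginning of a line in caldgemm_config.h):

sed -i 's|^#define _NO_ADL|//#define _NO_ADL|' $HOME/caldgemm/caldgemm_config.h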

Part 10: Apply AMD CAL binary driver patch (Optional)

This step is only required if you want to use the CAL backend with the -o c option. (See CALDGEMM Command Line Options for a list of options.) Essentially, this is the case if you are working on a system with AMD processors. Be aware that this patch is only available for some older AMD GPU driver versions. For current GPUs, the OpenCL backend is favored, which does not require such a binary patch.

  • Apply the patch to libaticaldd.so as explained in Catalyst Driver Patch.
  • Copy the patched library file to $HOME/lib.

Part 11: Compiling CALDGEMM and HPL-GPU

  • First, we have to set up the CALDGEMM configuration files:
cd $HOME/caldgemm
cp config_options.sample config_options.mak
cp caldgemm_config.sample caldgemm_config.h
  • Usually, you can leave caldgemm_config.h as is. You will only need to edit this file if you use the CAL backend. In that case, refer to the documentation in HPL Tuning.
  • Please edit config_options.mak: enable the GPU backend(s) you want to support, choose the BLAS backend, and set CONFIGURED to 1. In my case, this yields the following settings:
BLAS_BACKEND                            = MKL
INCLUDE_OPENCL                          = 1
INCLUDE_CAL                             = 0
INCLUDE_CUDA                            = 0
CONFIGURED                              = 1
  • Build CALDGEMM and prepare the HPL-GPU directory:

hpluser@linux-zsrp:~/caldgemm> make
hpluser@linux-zsrp:~/caldgemm> cd ..
hpluser@linux-zsrp:~> cd hpl
hpluser@linux-zsrp:~/hpl> ln -s ../caldgemm
hpluser@linux-zsrp:~/hpl> cp setup/Make.Generic* .
  • Edit Make.Generic.Options (HPL configuration file with tuning parameters)
  • Disable MPI by setting HPL_CONFIG_MPI = 0 in Make.Generic.Options
hpluser@linux-zsrp:~/hpl> ./build.sh

The HPL runtime configuration is in hpl/bin/Generic/HPL.dat.

  • In HPL.dat, set:
  • NBs: set to 1920
  • Ns: (Ns * Ns * 8) must be smaller than the system memory in bytes; Ns should be a multiple of NBs (a small helper sketch follows below)
    • Example: 60 * 1920 = 115200, and 115200 * 115200 * 8 = 106 168 320 000 bytes --> 106 GB, which is OK for 128 GB of system memory, so use 115200
  • LOOKAHEADs: set to 2
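
The Ns rule above can be turned into a small helper; this is only a sketch (the 0.9 safety factor is an assumption to leave some memory headroom for the OS and buffers):

# largest multiple of NB=1920 whose matrix (Ns*Ns*8 bytes) fits into ~90% of 128 GB
awk 'BEGIN { mem=128; nb=1920; print int(sqrt(mem*0.9*1024^3/8)/nb)*nb }'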

Part 12: Set Powertune

Log in as root:

  • Start the X server:
rcxdm start
  • Wait about 10 seconds, then set powertune:
/home/hpluser/adl3/atitweak -p 30

(This will increase the TDP by 30%; values between 0 and 50 can be set.)

  • Stop the X server again:
rcxdm stop

Part 13: Monitor GPU Temperature and Clocks (Optional)

(You cannot start monitoring while HPL is running!) Log in as root:

rcxdm start
while true; do clear && aticonfig --odgc --odgt --adapter=ALL && sleep 2; done

Part 14: Run

Log in as hpluser

hpluser@linux-zsrp:~> cd hpl/bin/Generic/
hpluser@linux-zsrp:~/hpl/bin/Generic> ./xhpl
  • You can see the current performance during the run in the lines that end with "System Gflops *******", where ******* is the performance.
  • The final performance result is in the line: WC26L2C64 ................................ *******
  • Verification: it must print "PASSED"; otherwise a computational error occurred.
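
To keep the output for later inspection, the run can be captured to a file (a simple pattern; the file name is arbitrary):

./xhpl | tee hpl_run.log        # run and capture the output
grep -c PASSED hpl_run.log      # must be non-zero, otherwise verification failed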