
HPL-GPU and CALDGEMM How-To

This howto covers the following components and directories:

  • caldgemm: DGEMM library
  • hpl: Our customized version of HPL, called HPL-GPU, based on HPL 2.0
  • adl3: Tool to set AMD powertune
  • lib: Special OpenCL library from AMD for HPL
  • tmp/amd-driver-installer-14.50.2.-1006: Special AMD Driver
  • memtest: My DMA bandwidth test
  • .bashrc: Setup file for environment

Part 1: Distribution Requirements

In general, CALDGEMM and HPL-GPU should work on any Linux distribution. For this Howto, we assume an OpenSuSE 13.2 setup with minimal installation as baseline.

During this howto, we will need certain software from the standard OpenSuSE repository. Not all of these software packages will be required, depending on which path you follow (which BLAS library, which GPU backend, and which optional steps you need). The requirements are:

  • gcc-c++ (For compilation)
  • rpm-build (For building AMD driver package)
  • python (For adl3/atitweak utility)
  • gcc-fortran, mpc-devel, mpfr-devel, gmp-devel (For AMD ACML)
  • xdm (Display manager for headless X-Server)
  • nano (Text editor)

To install all these packages up front, so you do not need to bother with them during the howto, run as root:

zypper install gcc-c++ gcc-fortran rpm-build python xdm nano mpc-devel mpfr-devel gmp-devel

We will install HPL for the user called hpluser. Make sure that this user exists. We assume its home directory is /home/hpluser and that the $HOME environment variable points to it.
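If the user does not exist yet, you can create it as root, for example (a minimal sketch; adjust groups and shell to your site's conventions):

useradd -m hpluser          (creates the user including the home directory /home/hpluser)
passwd hpluser              (assign a password)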

Part 2: Environment

As a next step, we have to set up the environment. We will set environment variables pointing to all the software packages (these are needed by the build process) and some environment variables related to the GPUs and to the X server, and we will set certain ulimit values to allow large memory allocations.

  • First, we need to create a private library path, which we use to preload libraries by putting this first in $LD_LIBRARY_PATH. We also need a temp directory for downloads, etc.
mkdir $HOME/lib $HOME/tmp
  • To allow large memory allocations, we have to raise the memlock limit for the non-root user hpluser. For this, we edit /etc/security/limits.conf
    • Add:
hpluser          -       memlock         unlimited          (hpluser is the username)
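To verify that the limit took effect, log in again as hpluser (limits.conf only applies to new sessions) and check:

ulimit -l          (should print "unlimited")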
  • Next, we download the HPL-GPU and CALDGEMM software (they contain some scripts required here). You can either install the latest release (by cloning the respective tag from the git repository, or by downloading the release files), or you can check out the current master branch for the very newest version. The master branch in the repository should usually be stable, while the test branch is used for development.
    • Downloading the files of the latest release: In this case, please unpack the files to $HOME and create symlinks $HOME/caldgemm and $HOME/hpl-gpu pointing to the versioned directories.
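For example (the archive names are placeholders for the actual release files):

cd $HOME
tar -zxf tmp/caldgemm-<version>.tar.gz
tar -zxf tmp/hpl-gpu-<version>.tar.gz
ln -s caldgemm-<version> caldgemm
ln -s hpl-gpu-<version> hpl-gpu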
    • For cloning the master repository, do
cd $HOME
git clone https://github.com/davidrohr/caldgemm.git
git clone https://github.com/davidrohr/hpl-gpu.git
  • CALDGEMM comes with an example script for the environment: caldgemm_setenv.sh.sample. If you leave all directories as they are in this howto, you can just use this script as is. Otherwise, please change the script accordingly.

    • To bring the script in place:
cp $HOME/caldgemm/environment/caldgemm_setenv.sh.sample $HOME/caldgemm_setenv.sh
  • Now, you can set the environment via:
source $HOME/caldgemm_setenv.sh
  • If one of the ulimit commands in the script fails, you have probably not set up /etc/security/limits.conf properly as explained above.
  • If you want to have the proper environment available upon login as hpluser, add source $HOME/caldgemm_setenv.sh to /home/hpluser/.bashrc, for example:
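echo 'source $HOME/caldgemm_setenv.sh' >> /home/hpluser/.bashrc          (run as hpluser; the single quotes keep $HOME unexpanded until login)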
  • This script will modify your DISPLAY variable to enable a headless X setup. This will break SSH X-forwarding. Please refer to the Headless System with X Server entry for details.

Part 3: GPU Driver

In case of an AMD GPU, download the GPU driver from http://support.amd.com/de-de/download. For FirePro GPU cards, you need driver version 14.502.x or newer. Download this driver to /home/hpluser/tmp. In the example, the driver file is called amd-driver-installer-14.502-150406a-182396E-Retail_End_User-x86.x86_64.run.

  • Build the proper RPM package for the SuSE distribution. You can get a list of all packages with the --listpkg option:
cd $HOME/tmp
./amd-driver-installer-14.502-150406a-182396E-Retail_End_User-x86.x86_64.run --listpkg
  • From the listed packages, we select SuSE/SUSE132-AMD64 for OpenSuSE 13.2 and build the RPMs:
./amd-driver-installer-14.502-150406a-182396E-Retail_End_User-x86.x86_64.run --buildpkg SuSE/SUSE132-AMD64
  • Now, we can install the driver:
zypper install --force fglrx*.rpm
  • For AMD GPUs, we have to create a proper X-config and set a variable to allow large OpenCL buffers:
aticonfig --initial --adapter=ALL
aticonfig --set-pcs-u32=MCIL,VmClientAddressSpaceGB,512

The value in the last call should match the memory of your machine, i.e. the above line is for a server with 512 GB of RAM.
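You can check the installed memory, for example, via:

free -g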

  • Now, we load the kernel module, and check whether it detected the GPUs:
modprobe fglrx
dmesg | grep fglrx

The last command should show something like:

[1437512.156586] <6>[fglrx] module loaded - fglrx 14.50.2 [Apr  6 2015] with 8 minors

Part 4: Headless X Setup (Optional)

This step is required for the following cases:

  • If you want to use CAL as GPU backend.
  • If you want to use the adl3 / atitweak utility to set AMD GPU's powertune feature.
  • If you want to use the aticonfig utility to monitor GPU clocks and temperatures on AMD GPUs.

A headless X setup is a setup where the server runs an X server with one screen per GPU, but the user does not log into the X server directly; instead, he logs in remotely via SSH. Sometimes, the X server handles certain GPU features, and in that case a running X server is needed, even though X itself is not used. Details can be found in the Headless System with X Server entry.

For a headless X Setup, we need to perform the following actions:

  • Create a proper X config. (We already did this while installing the GPU driver in the previous part via aticonfig --initial --adapter=ALL.)
  • Set xdm as Display Manager and disable some security features such that a user that logs in remotely can access the X server:
    • Edit: /etc/X11/xdm/xdm-config
    • Change:
 DisplayManager._0.authorize:    false           (Set to false)
    • Edit: /etc/sysconfig/displaymanager
    • Change:
DISPLAYMANAGER="xdm"
DISPLAYMANAGER_REMOTE_ACCESS="yes"
DISPLAYMANAGER_ROOT_LOGIN_REMOTE="yes"
DISPLAYMANAGER_XSERVER_TCP_PORT_6000_OPEN="yes"
  • Now, you can start the X server via
rcxdm start

and stop it with

rcxdm stop
  • After the startup, you'll have to wait a certain time for the X server to come up. You can trace the X-log via
tail -f /var/log/Xorg.0.log

until you see the following lines, which indicate that the GPUs are ready (one line per GPU in the server):

[1437157.875] (II) fglrx(0): Restoring Recent Mode via PCS is not supported in RANDR 1.2 capable environments
[1437157.875] (II) fglrx(1): Restoring Recent Mode via PCS is not supported in RANDR 1.2 capable environments
...
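Once the GPUs are ready, you can verify from an SSH session that the X server is accessible, for example (assuming DISPLAY was set by caldgemm_setenv.sh; the xdpyinfo utility may need to be installed separately):

xdpyinfo | head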

Part 5: Install the adl3 / atitweak utility (Optional)

This utility can be used to set the powertune level of AMD GPUs. It requires python and an X server. You can get it from github: https://github.com/mjmvisser/adl3:

cd $HOME
git clone https://github.com/mjmvisser/adl3.git

To verify that atitweak works (with a running X server), please execute:

$HOME/adl3/atitweak -k

In order to obtain the best performance in DGEMM (and hence in HPL) on AMD GPUs, you have to set powertune to raise the GPUs' TDP (see HPL Tuning). Please be aware that this could damage your hardware, because it might raise the TDP beyond the specifications of the hardware. AMD GPUs offer a certain range in which powertune can be set, usually -20% to 20% or -50% to 50%. To set powertune to 50%, please run:

$HOME/adl3/atitweak -p 50

Please be aware that similar TDP limitations hold true for NVIDIA GPUs.
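On NVIDIA GPUs, the power limit can be inspected and adjusted with nvidia-smi, for example (the 250 W value is only an illustration; stay within the range supported by your board):

nvidia-smi -q -d POWER          (show the current power limits)
nvidia-smi -pl 250              (set the power limit to 250 W; requires root)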

Part 6: Installing the BLAS library

CALDGEMM and HPL-GPU support three BLAS libraries: Intel MKL, AMD ACML, and GotoBLAS2. Other BLAS libraries might be supported but are not tested. Since certain patches are required for GotoBLAS2 and ACML, it is likely that other BLAS libraries will require similar patches as well to work with CALDGEMM and HPL-GPU at full performance.

In general, this Wiki assumes usage of the Intel MKL library, but this part will explain the setup of all three libraries. In the later parts, you will have to select the proper one if you are not using MKL.

  • Installation of Intel MKL works quite out of the box. Just install it as provided from Intel and make sure the environment paths are set correctly. (caldgemm_setenv.sh.sample assumes an installation in $HOME/intel.) Two paths are relevant:
    • $MKL_PATH: This should point to the actual MKL directory.
    • $ICC_PATH: The MKL library requires the Intel OpenMP runtime library. If the Intel Compiler (ICC) is installed, it brings the OpenMP runtime with it. In that case, $ICC_PATH can point to the ICC directory. If ICC is not installed, the MKL brings a redistributable version of the OpenMP runtime with it.
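For example, the corresponding entries in the environment script could look like this (the paths are illustrative and depend on your MKL version and install location):

export MKL_PATH=$HOME/intel/mkl
export ICC_PATH=$HOME/intel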
  • Installation of GotoBLAS2 is slightly more complicated. CALDGEMM requires a patch to be applied to GotoBLAS2 in order to reserve CPU cores for GPU-related tasks.
    • Download the latest GotoBLAS2 (1.13) from here to $HOME/tmp.
    • Unpack it to $HOME/GotoBLAS2:
cd $HOME
tar -zxf tmp/GotoBLAS2-1.13.tar.gz
  • Apply the GotoBLAS patch from CALDGEMM:
cd GotoBLAS2
patch -p0 < ../caldgemm/gotoblas_patch/gotoblas.patch
  • You may want to edit Makefile.rule and set TARGET to your CPU architecture (like NEHALEM or BARCELONA).
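For example, for an Intel Nehalem CPU the relevant line in Makefile.rule would read:

TARGET = NEHALEM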
  • Compile GotoBLAS:
make -j32
  • The build process should have created the library file $HOME/GotoBLAS2/libgoto2.a.
  • Installation of AMD ACML is even more complicated, as it requires a patch to GCC's libgomp OpenMP runtime library. Without this patch, libgomp will continuously terminate and recreate threads during HPL-GPU execution, resulting in very poor performance. In addition, ACML comes only with a BLAS interface, and we will have to compile a CBLAS interface separately.
    • Download the latest version of ACML from here to $HOME/tmp. (Currently, this is 6.1.0. Please use the Linux version for GCC / GFORTRAN.)
    • Unpack ACML:
mkdir $HOME/acml
cd $HOME/acml
tar -zxf ../tmp/acml-6.1.0.31-gfortran64.tgz
    • Download and build the CBLAS reference implementation from Netlib:
cd $HOME/tmp
wget http://www.netlib.org/blas/blast-forum/cblas.tgz
cd ..
tar -zxf tmp/cblas.tgz
cd CBLAS
make alllib

This should have created a library file $HOME/CBLAS/lib/cblas_LINUX.a.
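As a quick sanity check, you can verify that the DGEMM wrapper symbol is present in the archive:

nm $HOME/CBLAS/lib/cblas_LINUX.a | grep -i cblas_dgemm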

Part 1: Build CALDGEMM and HPL-GPU

Login as hpluser:

hpluser@linux-zsrp:~> cd caldgemm/
hpluser@linux-zsrp:~/caldgemm> cd amd_dgemm_hawai
hpluser@linux-zsrp:~/caldgemm/amd_dgemm_hawai> make          (builds the AMD DGEMM kernel)
hpluser@linux-zsrp:~/caldgemm/amd_dgemm_hawai> cd ..
hpluser@linux-zsrp:~/caldgemm> cp config_options.sample config_options.mak
hpluser@linux-zsrp:~/caldgemm> cp caldgemm_config.sample caldgemm_config.h
  • Edit config_options.mak: enable OpenCL, disable CAL and CUDA, and set CONFIGURED=1.
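The relevant settings could then look as follows (a sketch only; the exact option names are documented in config_options.sample, so follow the comments there):

INCLUDE_OPENCL = 1
INCLUDE_CAL = 0
INCLUDE_CUDA = 0
CONFIGURED = 1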

  • Build CALDGEMM

hpluser@linux-zsrp:~/caldgemm> make
hpluser@linux-zsrp:~/caldgemm> cd ..
hpluser@linux-zsrp:~> cd hpl-gpu
hpluser@linux-zsrp:~/hpl-gpu> ln -s ../caldgemm
hpluser@linux-zsrp:~/hpl-gpu> cp setup/Make.Generic* .
  • Edit Make.Generic.Options (the HPL configuration file with tuning parameters)
  • Set HPL_CONFIG_MPI=0 in Make.Generic.Options to disable MPI
hpluser@linux-zsrp:~/hpl-gpu> ./build.sh

The HPL runtime configuration is in hpl-gpu/bin/Generic/HPL.dat.

  • In HPL.dat, set:
    • NBs: set to 1920
    • Ns: (Ns * Ns * 8) must be smaller than the system memory in bytes, and Ns should be a multiple of NBs (see the sketch below)
      • Example: 60 * 1920 = 115200, and 115200 * 115200 * 8 = 106,168,320,000 bytes, i.e. about 106 GB. This fits into 128 GB of system memory, so use Ns = 115200.
    • LOOKAHEADs: set to 2
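A small sketch to compute a suitable Ns (the 0.9 headroom factor is an assumption to leave memory for the OS and GPU buffers; adjust gb and nb to your system):

awk -v gb=128 -v nb=1920 'BEGIN { n = int(sqrt(gb * 0.9 * 1e9 / 8) / nb) * nb; print n }'

For gb=128 this prints 119040, slightly more aggressive than the 115200 of the example above.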

Part 2: Set Powertune

Login as root:

  • Start the X server:
rcxdm start
  • Wait for about 10 seconds, then set powertune:
/home/hpluser/adl3/atitweak -p 30

(This will increase the TDP by 30%; values between 0 and 50 can be set.)

rcxdm stop

Part 3: Monitor GPU Temperature and Clocks (Optional)

(You cannot start monitoring while HPL is already running!) Login as root:

rcxdm start
while true; do clear && aticonfig --odgc --odgt --adapter=ALL && sleep 2; done

Part 4: Run

Login as hpluser:

hpluser@linux-zsrp:~> cd hpl-gpu/bin/Generic/
hpluser@linux-zsrp:~/hpl-gpu/bin/Generic> ./xhpl
  • You can see the current performance during the run in the lines that end with "System Gflops *******", where ******* is the performance.
  • The final performance result is in the line: WC26L2C64 ................................ *******
  • Verification: it must print "PASSED"; otherwise, a computational error occurred.