spock
Use recent ROCm modules to get the best performance. As of Jan 2022, rocm/4.3.0 is the default version; rocm/4.5.0 is available but has not yet been performance tested.
The PrgEnv-gnu build process works smoothly. As of Jan 2022, some routines fail testing (gbmm, gels, heev, he2hb, hegv, gbnorm: memory segfaults; gemmA, sterf: MPI failures). Note that Cray libsci provides only LAPACK 3.5 compatibility (gelqf, gesvd, ge2tb require LAPACK >= 3.7). These bugs are being worked on.
module load craype-accel-amd-gfx908
module load PrgEnv-gnu; module load rocm/4.3.0
export CPATH=${ROCM_PATH}/include
$ module -t list
gcc/11.2.0
cray-libsci/21.08.1.2
PrgEnv-gnu/8.2.0
rocm/4.3.0
...
$ cat make.inc
CXX=CC
FC=ftn
CXXFLAGS=-I${ROCM_PATH}/include -craype-verbose
LDFLAGS=-L${ROCM_PATH}/lib -craype-verbose
blas=libsci
gpu_backend=hip
hip_arch=gfx908
mpi=1
$ nice make -j 3
...
# If you get an error message building BLAS++, try adding the ROCM library path
$ env LIBRARY_PATH="${ROCM_PATH}/lib:$LIBRARY_PATH" nice make -j 3
...
# Run on one node, using 4 MPI processes, process-per-gpu.
# Note: The warm-up run will be very slow
$ env MPICH_SMP_SINGLE_COPY_MODE=CMA OMP_MAX_ACTIVE_LEVELS=10 OMP_MAX_TASK_PRIORITY=0 OMP_PROC_BIND=TRUE srun -A CSC391 -p ecp -t 0:20:0 -N1 -n4 -c16 --gpus-per-task=1 --gpu-bind=closest --threads-per-core=1 ./test/tester --type d --nb 640 --dim 61440 --check y --ref n --origin h --target d --repeat 4 gemm
SLATE version 2021.05.02, id bb599ae4
input: /autofs/nccs-svm1_home1/ayarkhan/icl/spock/slate-dev/./test/tester --type d --nb 640 --dim 61440 --check y --ref n --origin h --target d --repeat 4 gemm
2022-01-11 14:03:33, MPI size 4, OpenMP threads 16, GPU devices available 1
type origin target norm transA transB m n k nrhs alpha beta nb p q la error time(s) gflops ref_time(s) ref_gflops status
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 640 2 2 1 5.05e-16 27.728 16728.983 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 640 2 2 1 5.03e-16 18.575 24972.102 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 640 2 2 1 5.06e-16 18.609 24926.647 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 640 2 2 1 5.06e-16 18.659 24859.702 NA NA pass
All tests passed.
# After the warm-up run, the dgemm shown above sustains about 24.9 TFlops on 4 MI100 GPUs (about 6.2 TFlops per GPU).
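The gflops column can be reproduced by hand: dgemm performs 2*m*n*k floating-point operations, so with m = n = k = 61440 and the 18.575 s steady-state time from the second run above:

```shell
# dgemm does 2*m*n*k flops; m = n = k = 61440 and t = 18.575 s
# (the second run in the log) reproduce the reported ~24972 gflops.
awk 'BEGIN { n = 61440; t = 18.575; printf "%.0f GFlop/s\n", 2 * n^3 / t / 1e9 }'
```

The first iteration is slower because of one-time GPU initialization, which is why the warm-up run is excluded from this estimate.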
These instructions work with the slate-dev repository master branch. This build uses the Cray programming environment, compilers, and libsci math libraries. Since libsci only supports an older LAPACK standard, the slate-lapackpp dependency was updated to allow SLATE to build with LAPACK versions < 3.7.1 (unsupported LQ routines are skipped).
$ git id
c627a4ce
$ git branch
* master
$ module -t list
libfabric/1.11.0.4.75
cray-libsci/21.06.1.1
PrgEnv-cray/8.1.0
rocm/4.2.0
...
$ cat make.inc
CXX=CC
FC=ftn
CXXFLAGS=-I${ROCM_PATH}/include -g
LDFLAGS=-L${ROCM_PATH}/lib -g
LIBRARY_PATH=${ROCM_PATH}/lib:${SCALAPACK_PATH}
blas=libsci
gpu_backend=hip
mpi=1
$ nice make -j 3
...
# GEMM on a single node, using 4 processes (via MPI), 16 cores/threads-per-process
$ env CRAY_OMP_CHECK_AFFINITY=FALSE OMP_NUM_THREADS=16 srun -A CSC391 -p ecp -t 0:20:0 -N1 -n4 -c16 --gpus-per-task=1 --gpu-bind=closest --threads-per-core=1 ./test/tester --type d --nb 2048 --dim 61440 --check y --ref n --origin h --target d --repeat 4 gemm
WARNING: omp_set_nested has been deprecated in OpenMP 5.0.
SLATE version 2021.05.02, id c627a4ce
input: ./spock/slate-dev/test/tester --type d --nb 2048 --dim 61440 --check y --ref n --origin h --target d --repeat 4 gemm
2021-08-25 17:26:46, MPI size 4, OpenMP threads 16, GPU devices available 1
type origin target norm transA transB m n k nrhs alpha beta nb p q la error time(s) gflops ref_time(s) ref_gflops status
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 2048 2 2 1 4.66e-16 40.127 11559.589 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 2048 2 2 1 4.65e-16 29.206 15882.058 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 2048 2 2 1 4.65e-16 21.208 21871.854 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 2048 2 2 1 4.65e-16 21.212 21867.621 NA NA pass
All tests passed.
LibSci v21.06 (check `CC --cray-print-opts`) supports LAPACK 3.5.0, so some kernels (e.g., `tpmlqt`) do not exist in LibSci. Hence, Netlib LAPACK is used.
git clone https://github.com/Reference-LAPACK/lapack.git
cd lapack
mkdir build && cd build
CC=cc CXX=CC FC=ftn cmake .. -DBUILD_SHARED_LIBS=ON -DLAPACKE_WITH_TMG=ON -DCBLAS=OFF -DUSE_OPTIMIZED_BLAS=ON
make -j 20
export LAPACK_PATH=$PWD/lib
cd ../..
The installation steps here are tested for commit 859efbd of SLATE.
git clone --recursive https://bitbucket.org/icl/slate.git
cd slate
Add the following lines to `GNUmakefile` after line 290:
# if LibSci
else ifeq ($(blas),libsci)
FLAGS += -DSLATE_WITH_LIBSCI
# no LIBS to add
scalapack =
export CPATH=${ROCM_PATH}/include
export LD_LIBRARY_PATH=${LAPACK_PATH}:$LD_LIBRARY_PATH
`make.inc` file for SLATE:
CXX=CC
FC=ftn
CXXFLAGS=-I${ROCM_PATH}/include
LDFLAGS=-L${ROCM_PATH}/lib -L${LAPACK_PATH} -llapack -llapacke
LIBRARY_PATH=${ROCM_PATH}/lib:${LAPACK_PATH}
blas=libsci
gpu_backend=hip
mpi=1
Run `make -j`. The submodules will be configured. After the configuration, change the LAPACK version in `lapackpp/include/lapack/defines.h` as follows:
#define LAPACK_VERSION 30700
Add the following include path to `CXXFLAGS` in `lapackpp/make.inc`:
-I${LAPACK_PATH}/../include
Set `LIBS` in `lapackpp/make.inc` as follows:
LIBS = -L${LAPACK_PATH} -llapack -llapacke
Run `make clean` in the `lapackpp` folder.
Run `make -j 20` in the `slate` folder.
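If you would rather script the lapackpp version pin above than edit the file by hand, sed can do it; a minimal sketch on a scratch file (in the real tree, point DEFS at lapackpp/include/lapack/defines.h, and note the 30500 starting value here is only an illustration of what lapackpp may have detected):

```shell
# Demonstrate the edit on a scratch copy of defines.h.
DEFS=$(mktemp)
echo '#define LAPACK_VERSION 30500' > "$DEFS"
# Rewrite whatever version lapackpp detected to 3.7.0.
sed -i 's/#define LAPACK_VERSION [0-9]*/#define LAPACK_VERSION 30700/' "$DEFS"
grep 'LAPACK_VERSION' "$DEFS"
```

This assumes GNU sed (`-i` without an argument); on BSD sed use `-i ''`.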
The following command will run DGEMM on one MI100. (Substitute your account number.) The performance should be around 6 TF/s.
export OMP_NUM_THREADS=1 && srun -A CSC391 -p ecp -t 0:15:00 -N 1 -n 1 --ntasks-per-node=1 --cpus-per-task=${OMP_NUM_THREADS} --threads-per-core=1 --gpus-per-task=1 --gpu-bind=closest -J testjob -o %x-%j.out ./test/tester --type d --nb 2048 --dim 1234,36864 --grid 1x1 --check n --ref n --origin h --target d --repeat 3 gemm
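As a rough sanity check on that 6 TF/s figure (an estimate, not taken from a log), the expected per-iteration time for the largest size (36864) follows from the same 2*n^3 flop count:

```shell
# At ~6 TFlop/s, a square dgemm with n = 36864 (2*n^3 flops) should
# take roughly 17 s per iteration.
awk 'BEGIN { n = 36864; rate = 6.0e12; printf "%.1f s\n", 2 * n^3 / rate }'
```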