spock
Use recent ROCm modules to get the best performance. As of Jan 2022, rocm/4.3.0 is the default version; rocm/4.5.0 is available but has not yet been performance tested.
The PrgEnv-gnu build process works smoothly. As of Jan 2022, some routines fail testing (gbmm, gels, heev, he2hb, hegv, gbnorm: memory segfaults; gemmA, sterf: MPI failures). Note that Cray libsci provides only LAPACK 3.5 compatibility (gelqf, gesvd, ge2tb require LAPACK >= 3.7). These bugs are being worked on.
module load craype-accel-amd-gfx908
module load PrgEnv-gnu; module load rocm/4.3.0
export CPATH=${ROCM_PATH}/include
$ module -t list
gcc/11.2.0
cray-libsci/21.08.1.2
PrgEnv-gnu/8.2.0
rocm/4.3.0
...
$ cat make.inc
CXX=CC
FC=ftn
CXXFLAGS=-I${ROCM_PATH}/include -craype-verbose
LDFLAGS=-L${ROCM_PATH}/lib -craype-verbose
blas=libsci
gpu_backend=hip
hip_arch=gfx908
mpi=1
$ nice make -j 3
...
# If you get an error message building BLAS++, try adding the ROCM library path
$ env LIBRARY_PATH="${ROCM_PATH}/lib:$LIBRARY_PATH" nice make -j 3
...
# Run on one node, using 4 MPI processes, process-per-gpu.
# Note: The warm-up run will be very slow
$ env MPICH_SMP_SINGLE_COPY_MODE=CMA OMP_MAX_ACTIVE_LEVELS=10 OMP_MAX_TASK_PRIORITY=0 OMP_PROC_BIND=TRUE srun -A CSC391 -p ecp -t 0:20:0 -N1 -n4 -c16 --gpus-per-task=1 --gpu-bind=closest --threads-per-core=1 ./test/tester --type d --nb 640 --dim 61440 --check y --ref n --origin h --target d --repeat 4 gemm
SLATE version 2021.05.02, id bb599ae4
input: /autofs/nccs-svm1_home1/ayarkhan/icl/spock/slate-dev/./test/tester --type d --nb 640 --dim 61440 --check y --ref n --origin h --target d --repeat 4 gemm
2022-01-11 14:03:33, MPI size 4, OpenMP threads 16, GPU devices available 1
type origin target norm transA transB m n k nrhs alpha beta nb p q la error time(s) gflops ref_time(s) ref_gflops status
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 640 2 2 1 5.05e-16 27.728 16728.983 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 640 2 2 1 5.03e-16 18.575 24972.102 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 640 2 2 1 5.06e-16 18.609 24926.647 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 640 2 2 1 5.06e-16 18.659 24859.702 NA NA pass
All tests passed.
# After the warm-up run, the dgemm shown above sustains about 24.9 TFlops on 4 MI100 GPUs (about 6.2 TFlops per GPU).
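The gflops column can be reproduced by hand: dgemm performs 2*m*n*k floating-point operations, so with m = n = k = 61440 and the 18.575 s steady-state time from the second run above:

```shell
# dgemm does 2*m*n*k flops; m = n = k = 61440 and t = 18.575 s
# (the second run in the log) reproduce the reported ~24972 gflops.
awk 'BEGIN { n = 61440; t = 18.575; printf "%.0f GFlop/s\n", 2 * n^3 / t / 1e9 }'
```

The first iteration is slower because of one-time GPU initialization, which is why the warm-up run is excluded from this estimate.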
These instructions work with the slate-dev repository master branch. This build uses the Cray programming environment, compilers, and libsci math libraries. Since libsci only supports an older LAPACK standard, the slate-lapackpp dependency was updated to allow SLATE to build with LAPACK versions < 3.7.1 (unsupported LQ routines are skipped).
$ git id
c627a4ce
$ git branch
* master
$ module -t list
libfabric/1.11.0.4.75
cray-libsci/21.06.1.1
PrgEnv-cray/8.1.0
rocm/4.2.0
...
$ cat make.inc
CXX=CC
FC=ftn
CXXFLAGS=-I${ROCM_PATH}/include -g
LDFLAGS=-L${ROCM_PATH}/lib -g
LIBRARY_PATH=${ROCM_PATH}/lib:${SCALAPACK_PATH}
blas=libsci
gpu_backend=hip
mpi=1
$ nice make -j 3
...
# GEMM on a single node, using 4 processes (via MPI), 16 cores/threads-per-process
$ env CRAY_OMP_CHECK_AFFINITY=FALSE OMP_NUM_THREADS=16 srun -A CSC391 -p ecp -t 0:20:0 -N1 -n4 -c16 --gpus-per-task=1 --gpu-bind=closest --threads-per-core=1 ./test/tester --type d --nb 2048 --dim 61440 --check y --ref n --origin h --target d --repeat 4 gemm
WARNING: omp_set_nested has been deprecated in OpenMP 5.0.
SLATE version 2021.05.02, id c627a4ce
input: ./spock/slate-dev/test/tester --type d --nb 2048 --dim 61440 --check y --ref n --origin h --target d --repeat 4 gemm
2021-08-25 17:26:46, MPI size 4, OpenMP threads 16, GPU devices available 1
type origin target norm transA transB m n k nrhs alpha beta nb p q la error time(s) gflops ref_time(s) ref_gflops status
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 2048 2 2 1 4.66e-16 40.127 11559.589 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 2048 2 2 1 4.65e-16 29.206 15882.058 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 2048 2 2 1 4.65e-16 21.208 21871.854 NA NA pass
d host devices 1 notrans notrans 61440 61440 61440 10 3.14+1.41i 2.72+1.73i 2048 2 2 1 4.65e-16 21.212 21867.621 NA NA pass
All tests passed.
LibSci v21.06 (check `CC --cray-print-opts`) supports LAPACK 3.5.0, so some kernels (e.g., `tpmlqt`) do not exist in LibSci. Hence, Netlib LAPACK is used.
git clone https://github.com/Reference-LAPACK/lapack.git
cd lapack
mkdir build && cd build
CC=cc CXX=CC FC=ftn cmake .. -DBUILD_SHARED_LIBS=ON -DLAPACKE_WITH_TMG=ON -DCBLAS=OFF -DUSE_OPTIMIZED_BLAS=ON
make -j 20
export LAPACK_PATH=$PWD/lib
cd ../..
The installation steps here are tested for commit 859efbd of SLATE.
git clone --recursive https://bitbucket.org/icl/slate.git
cd slate
Add the following lines to `GNUmakefile` after line 290:
# if LibSci
else ifeq ($(blas),libsci)
FLAGS += -DSLATE_WITH_LIBSCI
# no LIBS to add
scalapack =
export CPATH=${ROCM_PATH}/include
export LD_LIBRARY_PATH=${LAPACK_PATH}:$LD_LIBRARY_PATH
`make.inc` file for SLATE:
CXX=CC
FC=ftn
CXXFLAGS=-I${ROCM_PATH}/include
LDFLAGS=-L${ROCM_PATH}/lib -L${LAPACK_PATH} -llapack -llapacke
LIBRARY_PATH=${ROCM_PATH}/lib:${LAPACK_PATH}
blas=libsci
gpu_backend=hip
mpi=1
Run `make -j`. The submodules will be configured. After the configuration, change the LAPACK version in `lapackpp/include/lapack/defines.h` as follows:
#define LAPACK_VERSION 30700
Add the following include path to `CXXFLAGS` in `lapackpp/make.inc`:
-I${LAPACK_PATH}/../include
Set `LIBS` in `lapackpp/make.inc` as follows:
LIBS = -L${LAPACK_PATH} -llapack -llapacke
Run `make clean` in the `lapackpp` folder.
Run `make -j 20` in the `slate` folder.
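If you would rather script the lapackpp version pin above than edit the file by hand, sed can do it; a minimal sketch on a scratch file (in the real tree, point DEFS at lapackpp/include/lapack/defines.h, and note the 30500 starting value here is only an illustration of what lapackpp may have detected):

```shell
# Demonstrate the edit on a scratch copy of defines.h.
DEFS=$(mktemp)
echo '#define LAPACK_VERSION 30500' > "$DEFS"
# Rewrite whatever version lapackpp detected to 3.7.0.
sed -i 's/#define LAPACK_VERSION [0-9]*/#define LAPACK_VERSION 30700/' "$DEFS"
grep 'LAPACK_VERSION' "$DEFS"
```

This assumes GNU sed (`-i` without an argument); on BSD sed use `-i ''`.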
The following command will run DGEMM on one MI100. (Substitute your account number.) The performance should be around 6 TF/s.
export OMP_NUM_THREADS=1 && srun -A CSC391 -p ecp -t 0:15:00 -N 1 -n 1 --ntasks-per-node=1 --cpus-per-task=${OMP_NUM_THREADS} --threads-per-core=1 --gpus-per-task=1 --gpu-bind=closest -J testjob -o %x-%j.out ./test/tester --type d --nb 2048 --dim 1234,36864 --grid 1x1 --check n --ref n --origin h --target d --repeat 3 gemm
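As a rough sanity check on that 6 TF/s figure (an estimate, not taken from a log), the expected per-iteration time for the largest size (36864) follows from the same 2*n^3 flop count:

```shell
# At ~6 TFlop/s, a square dgemm with n = 36864 (2*n^3 flops) should
# take roughly 17 s per iteration.
awk 'BEGIN { n = 36864; rate = 6.0e12; printf "%.1f s\n", 2 * n^3 / rate }'
```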