summit
[TOC]
On Summit, SLATE is compiled using the GCC compiler suite.
#!bash
# Load up modules to compile SLATE
module unload xl
module load gcc/8.1.1
module load essl
module load cuda/11
module load spectrum-mpi
module load netlib-lapack
module load netlib-scalapack
The modules do not set the necessary LIBRARY_PATH. Update the LIBRARY_PATH so that needed libraries are found during compilation.
#!bash
export LIBRARY_PATH=$OLCF_ESSL_ROOT/lib64:$LIBRARY_PATH
export LIBRARY_PATH=$CUDA_DIR/lib64:$LIBRARY_PATH
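Before building, it can help to verify that the expected modules are loaded and that the ESSL and CUDA libraries are visible under the module-provided paths. A quick sketch (the library file names are illustrative):
#!bash
# Confirm the expected modules are loaded
module list
# ESSL and CUDA libraries should be visible under the module-provided paths
ls $OLCF_ESSL_ROOT/lib64/libessl*
ls $CUDA_DIR/lib64/libcublas*
echo $LIBRARY_PATH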
SLATE can be built using GNU Makefile or CMake.
The GNU Makefile build can use a make.inc configuration file where the compiler and BLAS library are specified.
#!bash
# Setup slate/make.inc to specify the compiler
# cat make.inc
CXX=mpicxx
FC=mpif90
blas=essl
# To build from scratch do "make distclean"
# This will build blaspp, lapackpp, testsweeper and slate.
nice make -j 4
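Alternatively, a CMake build might look like the following sketch. Only generic CMake options are shown; the SLATE-specific options (e.g., BLAS selection) should be taken from SLATE's own install notes, and the install prefix is a placeholder.
#!bash
# Out-of-source CMake build (sketch; check SLATE's install notes for project-specific options)
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=mpicxx \
      -DCMAKE_Fortran_COMPILER=mpif90 \
      -DCMAKE_INSTALL_PREFIX=$HOME/slate-install \
      ..
make -j 4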
The following simple test allocates 2 Summit nodes interactively and runs a small sample GEMM execution.
#!bash
# Change to test directory
cd test
# For tests allocate 2 Summit nodes in an interactive session
# Note SLATE wants nodes allocated smt1 (simultaneous multithreading level 1).
bsub -P YOURPROJECT -alloc_flags smt1 -nnodes 2 -W 0:90 -Is /bin/bash
# Spectrum MPI tunings needed for maximum bandwidth
# https://docs.olcf.ornl.gov/systems/summit_user_guide.html#spectrum-mpi-tunings-needed-for-maximum-bandwidth
export PAMI_ENABLE_STRIPING=1 PAMI_IBV_ADAPTER_AFFINITY=1 PAMI_IBV_DEVICE_NAME="mlx5_0:1,mlx5_3:1" PAMI_IBV_DEVICE_NAME_1="mlx5_3:1,mlx5_0:1"
# Set OMP_DISPLAY_ENV=true to display the OpenMP settings (left disabled here)
export OMP_DISPLAY_ENV=false
# Tell OpenMP that there is nested parallelism
export OMP_NESTED=true
# Run tests using one process-per-socket (4 processes)
# Small gemm test on --target device (GPUs).
jsrun -n4 -a1 -c21 -g3 -brs ./tester --type d --nb 448 --dim 16000 --check n --ref n --target d gemm
SLATE version 2020.10.00, id d9dbd28
input: ./tester --type d --nb 448 --dim 16000 --check n --ref n --target d gemm
2021-01-04 11:59:38, MPI size 4, OpenMP threads 21, CUDA devices available 3
type origin target norm transA transB m n k alpha beta nb p q la error time(s) gflops ref_time(s) ref_gflops status
# You can also run tests using one process-per-gpu (2 nodes with 6 GPUs/node = 12 MPI processes)
# In this setup each MPI process will use 1 GPU and 7 CPU cores.
# Note: The performance may be lower using process-per-gpu binding.
# Small gemm test on --target device (GPUs) using process-per-gpu binding
jsrun -n12 -a1 -c7 -g1 -brs ./tester --type d --nb 448 --dim 8000 --grid 4x3 --check y --ref y --target d gemm
SLATE version 2021.05.02, id 8807a501
input: ./tester --type d --nb 448 --dim 8000 --grid 4x3 --check y --ref y --target d --tol 200 gemm
2021-09-09 18:22:08, MPI size 12, OpenMP threads 7, GPU devices available 1
type origin target norm transA transB m n k nrhs alpha beta nb p q la error time(s) gflops ref_time(s) ref_gflops status
d host devices 1 notrans notrans 8000 8000 8000 10 3.14+1.41i 2.72+1.73i 448 4 3 1 2.20e-17 0.154 6653.311 6.430 159.248 pass
All tests passed.
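The same tests can also be submitted as a batch job instead of an interactive session. A minimal LSF script sketch, assuming the module environment above; the project name, job name, walltime, and path are placeholders:
#!bash
#BSUB -P YOURPROJECT
#BSUB -W 1:30
#BSUB -nnodes 2
#BSUB -alloc_flags smt1
#BSUB -J slate_gemm_test

# Placeholder path to the SLATE checkout
cd $HOME/slate/test

# Spectrum MPI tunings and OpenMP settings as above
export PAMI_ENABLE_STRIPING=1 PAMI_IBV_ADAPTER_AFFINITY=1 PAMI_IBV_DEVICE_NAME="mlx5_0:1,mlx5_3:1" PAMI_IBV_DEVICE_NAME_1="mlx5_3:1,mlx5_0:1"
export OMP_NESTED=true

jsrun -n4 -a1 -c21 -g3 -brs ./tester --type d --nb 448 --dim 16000 --check n --ref n --target d gemm
Submit the script with bsub, e.g. "bsub submit_gemm.lsf" (the script name is a placeholder).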
Getting good performance from SLATE may require some tuning (tile size --nb, the p x q process grid layout, etc.). As an example, a sweep over a range of tile sizes can be used to find an appropriate --nb.
#!bash
# Sweep over block sizes for tuning --nb for potrf
jsrun -n4 -a1 -c21 -g3 -brs ./tester --type d --nb 192:1024:64 --dim 64000 --check n --ref n --target d potrf
SLATE version 2020.10.00, id d9dbd28
input: ./tester --type d --nb 192:1024:64 --dim 64000 --check n --ref n --target d potrf
2021-01-04 12:10:32, MPI size 4, OpenMP threads 21, CUDA devices available 3
type origin target dev-dist uplo n nrhs nb p q la error time(s) gflops ref_time(s) ref_gflops status
d host devices column lower 64000 10 192 2 2 1 NA 11.589 7540.475 NA NA no check
d host devices column lower 64000 10 256 2 2 1 NA 6.178 14143.806 NA NA no check
d host devices column lower 64000 10 320 2 2 1 NA 4.659 18755.573 NA NA no check
d host devices column lower 64000 10 384 2 2 1 NA 4.844 18038.195 NA NA no check
d host devices column lower 64000 10 448 2 2 1 NA 5.307 16466.368 NA NA no check
d host devices column lower 64000 10 512 2 2 1 NA 5.013 17429.662 NA NA no check
d host devices column lower 64000 10 576 2 2 1 NA 5.159 16937.867 NA NA no check
d host devices column lower 64000 10 640 2 2 1 NA 5.159 16937.807 NA NA no check
d host devices column lower 64000 10 704 2 2 1 NA 5.062 17263.743 NA NA no check
d host devices column lower 64000 10 768 2 2 1 NA 5.251 16641.912 NA NA no check
d host devices column lower 64000 10 832 2 2 1 NA 5.202 16797.125 NA NA no check
d host devices column lower 64000 10 896 2 2 1 NA 5.487 15926.100 NA NA no check
d host devices column lower 64000 10 960 2 2 1 NA 5.489 15919.523 NA NA no check
d host devices column lower 64000 10 1024 2 2 1 NA 5.604 15593.451 NA NA no check
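# If the sweep output above is captured to a file (hypothetical name sweep_potrf.out,
# e.g. by appending "| tee sweep_potrf.out" to the jsrun line), the best tile size can
# be picked out of it. Column positions assume the potrf output format shown above
# (nb is column 8, gflops is column 14).
awk '$1 == "d" { print $8, $14 }' sweep_potrf.out | sort -g -k2 | tail -1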
# Sweep over block sizes for tuning --nb for gemm
jsrun -n4 -a1 -c21 -g3 -brs ./tester --type d --nb 192:1024:64 --dim 64000 --grid 2x2 --check n --ref n --target d gemm
# Run a potrf sweep using tile size --nb 512
jsrun -n4 -a1 -c21 -g3 -brs ./tester --type d --nb 512 --dim 16000,32000,64000,128000,192000 --grid 2x2 --check y --ref n --target d potrf
SLATE version 2020.10.00, id d9dbd28
input: ./tester --type d --nb 512 --dim 16000,32000,64000,128000,192000 --grid 2x2 --check y --ref n --target d potrf
2021-01-06 12:45:06, MPI size 4, OpenMP threads 21, CUDA devices available 3
type origin target dev-dist uplo n nrhs nb p q la error time(s) gflops ref_time(s) ref_gflops status
d host devices column lower 16000 10 512 2 2 1 1.56e-19 0.569 2397.794 NA NA pass
d host devices column lower 32000 10 512 2 2 1 1.08e-19 1.380 7915.930 NA NA pass
d host devices column lower 64000 10 512 2 2 1 7.39e-20 4.803 18192.179 NA NA pass
d host devices column lower 128000 10 512 2 2 1 5.18e-20 18.657 37469.000 NA NA pass
d host devices column lower 192000 10 512 2 2 1 4.21e-20 48.822 48324.347 NA NA pass
All tests passed.
SLATE performance depends on the jsrun configuration used on a given number of nodes, i.e., the number of resource sets (-n), the number of tasks per resource set (-a), and the number of cores per resource set (-c). The following jsrun commands can be used to test SLATE GEMM on Summit using 16 nodes and 96 GPUs:
#!bash
# 1 MPI rank per node, 6 GPUs per rank
jsrun -n16 -a1 -c42 -g6 -brs ./tester --type d --nb 896 --dim 1234,16000,32000,64000,128000,192000,256000,320000,384000 --grid 4x4 --check n --ref n --target d \
    --lookahead 1 --repeat 1 gemm
# 2 MPI ranks per node, 3 GPUs per rank
jsrun -n32 -a1 -c21 -g3 -brs ./tester --type d --nb 896 --dim 1234,16000,32000,64000,128000,192000,256000,320000,384000 --grid 8x4 --check n --ref n --target d \
    --lookahead 1 --repeat 1 gemm
# 6 MPI ranks per node, 1 GPU per rank
#BSUB -alloc_flags "smt1 gpudefault"
jsrun -n96 -a1 -c7 -g1 -brs ./tester --type d --nb 896 --dim 1234,16000,32000,64000,128000,192000,256000,320000,384000 --grid 12x8 --check n --ref n --target d --lookahead 1 --repeat 1 gemm
The following figure shows the performance of SLATE GEMM using the previous jsrun commands. The best performance is achieved with 2 ranks/node x 3 GPUs/rank (-n32 -a1 -c21 -g3) or 6 ranks/node x 1 GPU/rank (-n96 -a1 -c7 -g1).
We used the following jsrun commands to test ELPA using GPUs. ELPA uses one MPI process per CPU core, which requires more than one MPI rank to target the same GPU. Therefore the default GPU compute mode (EXCLUSIVE_PROCESS) has to be changed to the shared default mode by setting "-alloc_flags gpudefault".
#!bash
# 42 MPI ranks per node, 3 GPUs per resource set
#BSUB -alloc_flags "smt1 gpudefault"
jsrun -n32 -a21 -c21 -g3 -brs ./validate_real_double_eigenvectors_2stage_all_kernels_gpu_random 40000 40000 32
# 42 MPI ranks per node, 2 GPUs per resource set
#BSUB -alloc_flags "smt1 gpudefault"
jsrun -n48 -a14 -c14 -g2 -brs ./validate_real_double_eigenvectors_2stage_all_kernels_gpu_random 40000 40000 32
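# To confirm that "gpudefault" took effect, the compute mode can be queried from
# inside the allocation; with gpudefault it should report "Default" rather than
# "Exclusive_Process".
jsrun -n1 -g1 nvidia-smi --query-gpu=compute_mode --format=csv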