Neil Lindquist edited this page Aug 1, 2023 · 1 revision

Summit at Oak Ridge National Laboratory

[TOC]

Installation

On Summit, SLATE is compiled using the GCC compiler suite.

#!bash
# Load up modules to compile SLATE
module unload xl
module load gcc/8.1.1
module load essl
module load cuda/11
module load spectrum-mpi
module load netlib-lapack
module load netlib-scalapack

The modules do not set the necessary LIBRARY_PATH. Update the LIBRARY_PATH so that needed libraries are found during compilation.

#!bash
export LIBRARY_PATH=$OLCF_ESSL_ROOT/lib64:$LIBRARY_PATH
export LIBRARY_PATH=$CUDA_DIR/lib64:$LIBRARY_PATH

SLATE can be built using a GNU Makefile or CMake.

The GNU Makefile build reads a make.inc configuration file where the compiler and BLAS library are specified.

#!bash
# Setup slate/make.inc to specify the compiler
# cat make.inc
CXX=mpicxx
FC=mpif90
blas=essl

# To build from scratch do "make distclean"

# This will build blaspp, lapackpp, testsweeper and slate.
nice make -j 4
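The CMake route makes the same choices through cache variables. A minimal sketch, assuming the modules above are loaded; the `blas=essl` option name follows the make.inc convention and the exact option names may differ between SLATE releases, so check the INSTALL notes for your version.

```shell
# Alternative: configure and build with CMake out-of-source
# (a sketch; option names may differ between SLATE releases).
mkdir -p build && cd build
cmake -DCMAKE_CXX_COMPILER=mpicxx \
      -DCMAKE_Fortran_COMPILER=mpif90 \
      -Dblas=essl \
      ..
make -j 4
```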

Running

The following simple test allocates 2 Summit nodes interactively and runs a small GEMM example.

#!bash

# Change to test directory
cd test

# For tests allocate 2 summit nodes in an interactive session
# Note: SLATE wants nodes allocated with smt1 (simultaneous multithreading level 1).
bsub -P YOURPROJECT -alloc_flags smt1 -nnodes 2 -W 0:90 -Is /bin/bash

# Spectrum MPI tunings needed for maximum bandwidth
# https://docs.olcf.ornl.gov/systems/summit_user_guide.html#spectrum-mpi-tunings-needed-for-maximum-bandwidth
export PAMI_ENABLE_STRIPING=1 PAMI_IBV_ADAPTER_AFFINITY=1 PAMI_IBV_DEVICE_NAME="mlx5_0:1,mlx5_3:1" PAMI_IBV_DEVICE_NAME_1="mlx5_3:1,mlx5_0:1"

# Set to true to display OpenMP settings
export OMP_DISPLAY_ENV=false 
# Tell OpenMP that there is nested parallism
export OMP_NESTED=true

# Run tests using one process-per-socket (4 processes)

# Small gemm test on --target device (GPUs).
jsrun -n4 -a1 -c21 -g3 -brs ./tester --type d --nb 448 --dim 16000 --check n --ref n --target d gemm
SLATE version 2020.10.00, id d9dbd28
input: ./tester --type d --nb 448 --dim 16000 --check n --ref n --target d gemm
2021-01-04 11:59:38, MPI size 4, OpenMP threads 21, CUDA devices available 3
type     origin   target     norm   transA   transB       m       n       k     alpha      beta     nb     p     q  la      error       time(s)        gflops   ref_time(s)    ref_gflops  status


# You can also run tests using one process per GPU (2 nodes with 6 GPUs/node = 12 MPI processes).
# In this setup each MPI process uses 1 GPU and 7 CPU cores.
# Note: The performance may be lower using process-per-gpu binding.

# Small gemm test on --target device (GPUs) using process-per-gpu binding
jsrun -n12 -a1 -c7 -g1 -brs ./tester --type d --nb 448 --dim 8000 --grid 4x3 --check y --ref y --target d gemm
SLATE version 2021.05.02, id 8807a501
input: ./tester --type d --nb 448 --dim 8000 --grid 4x3 --check y --ref y --target d --tol 200 gemm
2021-09-09 18:22:08, MPI size 12, OpenMP threads 7, GPU devices available 1
type     origin   target     norm   transA   transB       m       n       k    nrhs        alpha         beta     nb       p       q  la      error       time(s)        gflops   ref_time(s)    ref_gflops  status
   d       host  devices        1  notrans  notrans    8000    8000    8000      10   3.14+1.41i   2.72+1.73i    448       4       3   1   2.20e-17         0.154      6653.311         6.430       159.248  pass
All tests passed.
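Instead of an interactive session, the same run can be submitted as a batch job. A sketch of the equivalent LSF script; YOURPROJECT, the walltime, and the test directory path are placeholders to adjust for your project and build tree.

```shell
#!/bin/bash
# Batch-job version of the interactive run above.
# YOURPROJECT, the walltime, and the cd path are placeholders.
#BSUB -P YOURPROJECT
#BSUB -alloc_flags smt1
#BSUB -nnodes 2
#BSUB -W 0:30

export OMP_NESTED=true
cd /path/to/slate/test   # adjust to your build tree

jsrun -n4 -a1 -c21 -g3 -brs ./tester --type d --nb 448 --dim 16000 --check n --ref n --target d gemm
```

Submit it with `bsub <script>.lsf` and monitor with `bjobs`.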

Performance

Getting good performance from SLATE may require some tuning (tile size --nb, process grid p x q layout, etc.). As an example, a sweep over a range of tile sizes can be used to find an appropriate --nb.

#!bash
# Sweep over block sizes for tuning --nb for potrf
jsrun -n4 -a1 -c21 -g3 -brs ./tester --type d --nb 192:1024:64 --dim 64000 --check n --ref n --target d potrf
SLATE version 2020.10.00, id d9dbd28
input: ./tester --type d --nb 192:1024:64 --dim 64000 --check n --ref n --target d potrf
2021-01-04 12:10:32, MPI size 4, OpenMP threads 21, CUDA devices available 3
type     origin   target   dev-dist    uplo       n    nrhs     nb     p     q  la      error       time(s)        gflops   ref_time(s)    ref_gflops  status
   d       host  devices     column   lower   64000      10    192     2     2   1         NA        11.589      7540.475            NA            NA  no check
   d       host  devices     column   lower   64000      10    256     2     2   1         NA         6.178     14143.806            NA            NA  no check
   d       host  devices     column   lower   64000      10    320     2     2   1         NA         4.659     18755.573            NA            NA  no check
   d       host  devices     column   lower   64000      10    384     2     2   1         NA         4.844     18038.195            NA            NA  no check
   d       host  devices     column   lower   64000      10    448     2     2   1         NA         5.307     16466.368            NA            NA  no check
   d       host  devices     column   lower   64000      10    512     2     2   1         NA         5.013     17429.662            NA            NA  no check
   d       host  devices     column   lower   64000      10    576     2     2   1         NA         5.159     16937.867            NA            NA  no check
   d       host  devices     column   lower   64000      10    640     2     2   1         NA         5.159     16937.807            NA            NA  no check
   d       host  devices     column   lower   64000      10    704     2     2   1         NA         5.062     17263.743            NA            NA  no check
   d       host  devices     column   lower   64000      10    768     2     2   1         NA         5.251     16641.912            NA            NA  no check
   d       host  devices     column   lower   64000      10    832     2     2   1         NA         5.202     16797.125            NA            NA  no check
   d       host  devices     column   lower   64000      10    896     2     2   1         NA         5.487     15926.100            NA            NA  no check
   d       host  devices     column   lower   64000      10    960     2     2   1         NA         5.489     15919.523            NA            NA  no check
   d       host  devices     column   lower   64000      10   1024     2     2   1         NA         5.604     15593.451            NA            NA  no check

# Sweep over block sizes for tuning --nb for gemm
jsrun -n4 -a1 -c21 -g3 -brs ./tester --type d --nb 192:1024:64 --dim 64000 --grid 2x2 --check n --ref n --target d gemm

# Run a potrf sweep using tile size --nb 512
jsrun -n4 -a1 -c21 -g3 -brs ./tester --type d --nb 512 --dim 16000,32000,64000,128000,192000 --grid 2x2 --check y --ref n --target d potrf
SLATE version 2020.10.00, id d9dbd28
input: ./tester --type d --nb 512 --dim 16000,32000,64000,128000,192000 --grid 2x2 --check y --ref n --target d potrf
2021-01-06 12:45:06, MPI size 4, OpenMP threads 21, CUDA devices available 3
type     origin   target   dev-dist    uplo       n    nrhs     nb     p     q  la      error       time(s)        gflops   ref_time(s)    ref_gflops  status
   d       host  devices     column   lower   16000      10    512     2     2   1   1.56e-19         0.569      2397.794            NA            NA  pass
   d       host  devices     column   lower   32000      10    512     2     2   1   1.08e-19         1.380      7915.930            NA            NA  pass
   d       host  devices     column   lower   64000      10    512     2     2   1   7.39e-20         4.803     18192.179            NA            NA  pass
   d       host  devices     column   lower  128000      10    512     2     2   1   5.18e-20        18.657     37469.000            NA            NA  pass
   d       host  devices     column   lower  192000      10    512     2     2   1   4.21e-20        48.822     48324.347            NA            NA  pass
All tests passed.
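When sweeping, the best tile size can be read off mechanically from a saved tester log. A small awk sketch, assuming the potrf output layout shown above (nb in column 8, gflops in column 14) and a hypothetical log file name:

```shell
# Print the nb with the highest gflops from a saved sweep log (sweep.log
# is a placeholder name; column numbers assume the potrf output above).
awk '$1 == "d" { if ($14 + 0 > best) { best = $14; nb = $8 } }
     END { print nb, best }' sweep.log
```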

Testing SLATE using various modes

SLATE performance depends on the jsrun configuration used on a given number of nodes, i.e., the number of resource sets (-n), tasks per resource set (-a), and cores per resource set (-c). The following jsrun commands can be used to test SLATE GEMM on Summit using 16 nodes and 96 GPUs:

#!bash
# 1 MPI rank per node, 6 GPUs per rank
jsrun -n16 -a1 -c42 -g6 -brs ./tester --type d --nb 896 --dim 1234,16000,32000,64000,128000,192000,256000,320000,384000 --grid 4x4 --check n --ref n --target d --lookahead 1 --repeat 1 gemm

# 2 MPI ranks per node, 3 GPUs per rank
jsrun -n32 -a1 -c21 -g3 -brs ./tester --type d --nb 896 --dim 1234,16000,32000,64000,128000,192000,256000,320000,384000 --grid 8x4 --check n --ref n --target d --lookahead 1 --repeat 1 gemm

# 6 MPI ranks per node, 1 GPU per rank
#BSUB -alloc_flags "smt1 gpudefault"
jsrun -n96 -a1 -c7 -g1 -brs ./tester --type d --nb 896 --dim 1234,16000,32000,64000,128000,192000,256000,320000,384000 --grid 12x8 --check n --ref n --target d --lookahead 1 --repeat 1 gemm

The following figure shows the performance of SLATE GEMM using the previous jsrun commands. The best performance is achieved with 2 ranks/node x 3 GPUs/rank (-n32 -a1 -c21 -g3) or 6 ranks/node x 1 GPU/rank (-n96 -a1 -c7 -g1).

[Figure: SLATE GEMM performance on 16 Summit nodes for the three jsrun configurations above]

Testing ELPA

We used the following jsrun commands to test ELPA using GPUs. ELPA uses one MPI process per CPU core, which requires more than one MPI rank to target the same GPU. Therefore the default GPU compute mode (EXCLUSIVE_PROCESS) has to be changed to the shared default mode by setting "-alloc_flags gpudefault".

#!bash
# 42 MPI ranks per node (2 resource sets of 21 ranks, 3 GPUs per resource set)
#BSUB -alloc_flags "smt1 gpudefault"
jsrun -n32 -a21 -c21 -g3 -brs ./validate_real_double_eigenvectors_2stage_all_kernels_gpu_random 40000 40000 32

# 42 MPI ranks per node (3 resource sets of 14 ranks, 2 GPUs per resource set)
#BSUB -alloc_flags "smt1 gpudefault"
jsrun -n48 -a14 -c14 -g2 -brs ./validate_real_double_eigenvectors_2stage_all_kernels_gpu_random 40000 40000 32
