saturn
Recommended modules, circa 12-2020 (in ~/.bash_profile):
module purge
module load git
module load cmake/3.18.2 # latest
module load gcc/7.3.0 # GNU gcc & g++. CUDA is picky about gcc version.
module load llvm # clang & clang++
module load icc/2018 # Intel icc & icpc. SLATE fails with icc/2019.
module load cuda/11.1.0 # latest
module load intel-mpi
module load intel-mkl
module load openblas
module load python # python 3
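After loading the modules, it can be useful to confirm that the expected compilers and tools are actually on the PATH. The following is a minimal sketch; the helper name `check_tools` is hypothetical, and which command names each module provides (for example `nvcc` for cuda or `mpicxx` for intel-mpi) is an assumption based on the comments above.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 if all are found, 1 if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (command names are assumptions, adjust as needed):
# check_tools git cmake gcc g++ clang icc nvcc mpicxx
```

Because the function returns non-zero when anything is missing, it can guard a build, e.g. `check_tools gcc nvcc mpicxx && nice make -j4`.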
- Load the following modules:
module load gcc/7.3.0
module load cuda/11.1.0
module load intel-mpi
module load intel-mkl
- Set make.inc with GNU compilers:
CXX = mpicxx
FC = mpif90
blas = mkl
blas_fortran = gfortran # default
mkl_blacs = intelmpi # default
cuda_arch = pascal # default (gtx1060 are pascal)
Note mpi=1, cuda=1, openmp=1 should be set automatically.
- Load the following modules:
module load gcc/7.3.0
module load icc/2018
module load cuda/11.1.0
module load intel-mpi # was mpi/intel/2018
module load intel-mkl
- Set make.inc with Intel compilers:
CXX = mpiicpc
FC = mpiifort
LIBS = -lifcore
blas = mkl
blas_fortran = ifort # was mkl_intel = 1
mkl_blacs = intelmpi # default
cuda_arch = pascal # default (gtx1060 are pascal)
Note mpi=1, cuda=1, openmp=1 should be set automatically.
- Unfortunately, it seems -std=c++17 breaks the Intel compiler. Editing the GNUmakefile to use -std=c++11 allows most files to be compiled, but there is an error in omp taskloop in listBcastMT.
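The two make.inc variants listed above can also be generated from a small script, which helps when switching between toolchains. This is only a sketch: the function name `write_make_inc` and its arguments are hypothetical, the settings are copied from the lists above, and the inline comments are left out to keep the values free of trailing whitespace.

```shell
# Write a make.inc for the given toolchain ("gnu" or "intel").
# The second argument is the output path (defaults to ./make.inc).
write_make_inc() {
    toolchain=$1
    out=${2:-make.inc}
    case "$toolchain" in
        gnu)
            cat > "$out" <<'EOF'
CXX = mpicxx
FC = mpif90
blas = mkl
blas_fortran = gfortran
mkl_blacs = intelmpi
cuda_arch = pascal
EOF
            ;;
        intel)
            cat > "$out" <<'EOF'
CXX = mpiicpc
FC = mpiifort
LIBS = -lifcore
blas = mkl
blas_fortran = ifort
mkl_blacs = intelmpi
cuda_arch = pascal
EOF
            ;;
        *)
            echo "usage: write_make_inc gnu|intel [path]" >&2
            return 1
            ;;
    esac
}
```

For example, `write_make_inc gnu ~/slate/make.inc` would write the GNU settings, assuming the SLATE checkout lives in ~/slate as in the prompts shown in this page.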
On the head node, use nice make -j4.
It is faster to compile in an interactive job than on the head node, e.g.:
# Get node with gtx1060 GPU (b01 - b04) for 240 minutes.
[saturn ~]$ salloc -N 1 -C gtx1060 -t 240 srun --pty bash
# Compile and run interactively on that node.
[b01 ~/slate]$ nice make -j20
[b01 ~/slate]$ ./test/tester gemm
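The same build and test could also be run non-interactively as a batch job. The sketch below writes a batch script with the same resource request as the salloc line above (1 node, gtx1060 constraint, 240 minutes). The file name `build_slate.sbatch` is hypothetical, and it is an assumption that saturn accepts sbatch jobs and that the script is submitted from the SLATE source directory.

```shell
# Create a Slurm batch script mirroring the interactive session above.
cat > build_slate.sbatch <<'EOF'
#!/bin/bash
#SBATCH -N 1
#SBATCH -C gtx1060
#SBATCH -t 240

# Stop at the first failure so the tester is not run after a failed build.
set -e

nice make -j20
./test/tester gemm
EOF

echo "wrote build_slate.sbatch"
```

It would then be submitted with `sbatch build_slate.sbatch` from ~/slate; by default Slurm runs the job in the directory it was submitted from.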
Submission command on saturn b nodes, assuming Intel MPI and MKL BLAS:
salloc -N 4 -w b[01-04] --tasks-per-node 1 env OMP_NUM_THREADS=20 OMP_NESTED=true \
OMP_DISPLAY_ENV=true MKL_NUM_THREADS=1 MKL_VERBOSE=0 \
mpirun -n 4 -env I_MPI_DEBUG=3 ./test/tester --type --nb 352 --dim $[1024*4] \
--grid 2x2 --target d --lookahead 1 --ref n --check n --repeat 2 gemm
--tasks-per-node 1: otherwise the processes may get bound to the same node
-env I_MPI_DEBUG=3: makes Intel MPI print the process-to-node mapping
OMP_NUM_THREADS=20: avoids hyperthreading
MKL_VERBOSE=1: gives lots of output from MKL per function (threads used, etc.)
OMP_DISPLAY_ENV=true: shows that OpenMP is set up correctly
This debugging output can be disabled once the process/thread binding is known to be working properly.
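The numeric arguments in the command above are related: `-n 4` matches the 2x2 process grid, and `--dim $[1024*4]` expands to 4096. The sketch below derives both values from variables. The variable names are hypothetical, and it assumes the rank count should equal the product of the grid dimensions, as it does in the command above. It uses `$((...))`, the POSIX form of the older bash `$[...]` arithmetic syntax.

```shell
# Process grid p x q, as passed to --grid.
p=2
q=2

# Number of MPI ranks, assumed to equal p*q as in the command above.
ntasks=$((p * q))

# Matrix dimension; equivalent to the older $[1024*4] syntax.
dim=$((1024 * 4))

# Print the derived arguments instead of launching anything.
echo "-n $ntasks --dim $dim --grid ${p}x${q}"
# prints: -n 4 --dim 4096 --grid 2x2
```

Changing `p` and `q` keeps the rank count and the grid in sync; the `-N` and `-w` arguments to salloc would need to be adjusted separately.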
The saturn b nodes have gaming-level NVIDIA GPUs. Single-precision performance is reasonable, but double-precision performance is slow. These nodes are good for development and debugging, but performance bottlenecks may only show up when testing on faster GPUs (e.g., NVIDIA V100 GPUs on Summit at ORNL).