# Running 3D BCP on OLCF Summit
**Step 1.** Follow the instructions on Running ExaRL on Summit to set up the environment and clone the repository:

```shell
source /ccs/proj/ast153/ExaLearn/set_summit_env.sh
conda activate exarl_summit
git clone --recursive https://github.com/exalearn/ExaRL.git
```
**Step 2.** If you already have the ExaRL repository and need to update the submodules, use:

```shell
cd ExaRL
git submodule update --init --recursive
```
**Step 3.** Build the TDLG library.

CPU version: open the Makefile

```shell
vi ./envs/env_vault/LibTDLG/TDLG_CPU/Makefile
```

and make sure the following settings are used:

```makefile
CFLAGS = -std=c++17 -I./ -DPREC_FP32 -DOMP -fvisibility=default -O2
LFLAGS = -lm -lgomp
```

Save the Makefile and compile:

```shell
cd ./envs/env_vault/LibTDLG/TDLG_CPU/
make
```
GPU version: open the Makefile

```shell
vi ./envs/env_vault/LibTDLG/TDLG_GPU/Makefile
```

and make sure the following two settings are used. Change the CUDA version in the `LFLAGS` setting accordingly (the example below uses CUDA 10.1.243; `module list` will show the version currently loaded):

```makefile
LFLAGS = -lm -L/sw/summit/cuda/10.1.243/lib64 -lcudart -lcuda
CUDA_FLAGS = -arch=sm_70
```

Save the Makefile and compile:

```shell
cd ./envs/env_vault/LibTDLG/TDLG_GPU/
make
```
Detailed instructions can be found here.
**Step 4.** Edit the environment configuration:

```shell
vim ./envs/env_vault/env_cfg/ExaLearnBlockCoPolymerTDLG-v3.json
```

Modify `app_core` to either "cpu" or "gpu" (for "cpu", you may want to update the `app_threads` value accordingly). Set the number of threads in `app_threads` to manage the resources that each instance of the TDLG environment will use. Also, change the environment path in the value of `app_dir`; for example, it may need to be set as `"app_dir" : "./envs/env_vault/LibTDLG/"`.
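As a sketch, after these edits the relevant fields of `ExaLearnBlockCoPolymerTDLG-v3.json` might look like the fragment below. Only the three fields discussed above are shown; the rest of the file, and the exact value types (e.g. whether `app_threads` is a number or a string), follow whatever is already in the repository. The value 7 matches the thread settings used in the jsrun examples later on:

```json
{
    "app_core": "gpu",
    "app_threads": 7,
    "app_dir": "./envs/env_vault/LibTDLG/"
}
```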
**Step 5.** Run the test driver from the repository root:

```shell
cd ../../../../
python driver/test_tdlg_v3.py
```
**Step 6.** Below is an example bsub file. For maximum performance, `OMP_NUM_THREADS` in the conda environment should be set to the same value as the `--bind packed` setting in `jsrun`; these two settings should also match the `app_threads` setting in the configuration from step 4.
```shell
#!/bin/bash
#BSUB -P <project>
#BSUB -W 02:00
#BSUB -nnodes 16
#BSUB -J ExaBCP_TDLG
#BSUB -o ExaBCP_TDLG.%J
#BSUB -e ExaBCP_TDLG.%J
source /ccs/proj/ast153/ExaLearn/set_summit_env.sh
conda activate exarl_summit
export OMP_NUM_THREADS=7
export OMP_PLACES=cores
export OMP_PROC_BIND=close
jsrun --nrs 96 --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 --latency_priority CPU-CPU --launch_distribution packed --bind packed:7 python ./driver/driver_example.py --output_dir '<output folder>' --env 'ExaLearnBlockCoPolymerTDLG-v3' --agent 'DQN-LSTM-v0' --n_episodes '10' --n_steps '10' --run_type 'static-omp'
```
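A quick way to sanity-check the launch geometry above (the variable names below are purely illustrative): each Summit node hosts 6 resource sets of 7 cores and 1 GPU each, so 16 nodes give 96 resource sets, matching `--nrs 96`, while `--cpu_per_rs 7` matches `OMP_NUM_THREADS=7` and `--bind packed:7`.

```shell
# Illustrative check of the jsrun geometry in the script above.
nodes=16        # from '#BSUB -nnodes 16'
rs_per_host=6   # from '--rs_per_host 6'
cpu_per_rs=7    # from '--cpu_per_rs 7'; should equal OMP_NUM_THREADS
nrs=$((nodes * rs_per_host))
echo "resource sets: $nrs"           # prints: resource sets: 96 (as in --nrs 96)
echo "threads per task: $cpu_per_rs" # should match app_threads and --bind packed:7
```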
Example script to run the CPU/OpenMP version of TDLG on multiple Summit nodes (pay attention to the passed arguments; you may not need to pass some of them, or may need to change their values):
```shell
#!/bin/bash
#BSUB -P AST153
#BSUB -W 1:30
#BSUB -nnodes 32
#BSUB -J ExaRL_TDLG
#BSUB -o ExaRL_TDLG.%J
#BSUB -e ExaRL_TDLG.%J
source /ccs/proj/ast153/ExaLearn/set_summit_env.sh
conda activate exarl_summit
export OMP_NUM_THREADS=7
for t in 1 2 4 8 16 32
do
    export nres=$((t*6))
    echo "---------------------------------------"
    echo "Running ExaRL with TDLG-CPU on $t nodes"
    echo "---------------------------------------"
    jsrun --nrs $nres --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 --latency_priority GPU-CPU --launch_distribution packed --bind packed:7 python driver/driver_example.py --output_dir /gpfs/alpine/ast153/scratch/$USER/ --env ExaLearnBlockCoPolymerTDLG-v3 --n_episodes 500 --n_steps 60 --learner_type async --agent DQN-v0 --model_type LSTM --action_type fixed
done
```
The TDLG_CPU code uses OpenMP 4.0 pragmas for shared-memory parallelization. The following average execution times over a number of processes were observed for the `driver/driver_example.py` script run for 10 steps and 10 episodes. The table below compares execution times (in seconds) of the ExaCartPole-V1 and 3D BCP (TDLG_CPU) environments; the TDLG_CPU environment uses 7 OpenMP threads. The ExaCartPole runs are about 20x faster than TDLG_CPU.
Nodes(Processes) | TDLG_CPU | ExaCartPole-V1 |
---|---|---|
1(6) | 394.40 | 18.16 |
2(12) | 538.90 | 38.71 |
4(24) | 801.46 | 71.59 |
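The roughly-20x figure quoted above can be checked directly from the single-node row of the table (a quick arithmetic sketch):

```shell
# TDLG_CPU vs ExaCartPole-V1 on 1 node (6 processes): 394.40 s vs 18.16 s
awk 'BEGIN { printf "speedup: %.1fx\n", 394.40 / 18.16 }'   # prints: speedup: 21.7x
```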
It is possible to improve the CPU times further by using aggressive compiler optimizations and minor tweaks to the BCP C++ code. For instance, the following changes were found to improve performance by ~8-20%: 1) in `update_field.cpp` and `update_field_noisy.cpp`, the blocking factors (`bsz`, `bsx`, `bsy`) were changed from 8 to 4, and 2) fast-math optimization flags were added when building TDLG_CPU (these can produce non-IEEE-754-compliant arithmetic):

```makefile
CFLAGS = -std=c++17 -mcpu=native -ffast-math -funsafe-math-optimizations -I./ -fopenmp -fopenmp-simd -DPREC_FP32 -DOMP -fvisibility=default -O3
```

Results are as follows (compare with the table above):
Nodes(Processes) | TDLG_CPU (with aggressive GCC compiler options) |
---|---|
1(6) | 313.04 |
2(12) | 461.12 |
4(24) | 737.59 |
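The ~8-20% range quoted above can be reproduced from the two tables (a quick per-row check):

```shell
# Percentage improvement of the optimized build over the baseline, per row:
awk 'BEGIN {
  printf "1 node:  %.1f%%\n", 100 * (394.40 - 313.04) / 394.40  # 20.6%
  printf "2 nodes: %.1f%%\n", 100 * (538.90 - 461.12) / 538.90  # 14.4%
  printf "4 nodes: %.1f%%\n", 100 * (801.46 - 737.59) / 801.46  #  8.0%
}'
```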
Similar optimizations are possible by building TDLG_CPU with the IBM XL compiler. To do so, modify the Makefile as follows:

```makefile
CC = xlC
PROG = LibTDLG_CPU
CFLAGS = -std=gnu++1y -qsmp=omp -qsimd -I./ -DPREC_FP32 -DOMP -O3 -qhot -qarch=pwr9 -qtune=pwr9
LFLAGS = -lm -Wl,-rpath=/sw/summit/xl/16.1.1-5/lib/ -lxlsmp
OBJ = *.o
MYLIB = $(PROG).so

Lib: $(OBJ)
	$(CC) FH_RS.o update_field_noisy.o update_field.o -shared -o ./$(MYLIB) $(CFLAGS) $(LFLAGS)

$(OBJ):
	$(CC) $(CFLAGS) -c -qpic *.cpp

clean:
	rm -f *.o *~
```
The correctness of the 3D BCP code is assessed using the `driver/test_tdlg_v3.py` script (which runs for a single episode and 10 steps); it prints the difference in results (the cumulative reward over 10 steps is 10) and the average per-step execution time. The following table lists the execution time per step and the results difference.
TDLG_CPU (best GCC) | TDLG_CPU (best XL) | TDLG_GPU (Summit) | TDLG_GPU (Darwin) |
---|---|---|---|
1.69s, 4.0E-06 | 2.012s, 4.0E-06 | 0.502s, 4.0E-06 | 0.502s, 4.0E-06 |
The TDLG_GPU version used the default one-task-per-GPU configuration on Summit. By comparison, when CUDA MPS is turned on to run TDLG_GPU, the results are about 3x slower per step (mostly due to very high kernel launch overhead).
The above results use versions before the develop tag v0.1.
On Cori GPU (using `devel_v0.3`), the results are:
# CPU | # GPU | Avg elapsed time (sec) |
---|---|---|
5 | 1 | 38 |
10 | 2 | 69 |
20 | 4 | 119 |
40 | 8 | 219 |