
Running 3D BCP on OLCF Summit

Sayan Ghosh edited this page Sep 29, 2020 · 7 revisions

1. To set up the proper conda environment for ExaRL:

Follow the instructions on the Running ExaRL on Summit page:

source /ccs/proj/ast153/ExaLearn/set_summit_env.sh
conda activate exarl_summit

2. Clone the ExaRL repo with the submodules:

git clone --recursive https://github.com/exalearn/ExaRL.git

If you already have the ExaRL repo and need to update the submodules, please use:

cd ExaRL
git submodule update --init --recursive

3. Compile the TDLG simulator:

CPU version:

vi ./envs/env_vault/LibTDLG/TDLG_CPU/Makefile

Make sure the following settings are used:

CFLAGS = -std=c++17 -I./ -DPREC_FP32 -DOMP -fvisibility=default -O2
LFLAGS = -lm -lgomp

Save the above Makefile and compile:

cd ./envs/env_vault/LibTDLG/TDLG_CPU/
make

GPU version:

vi ./envs/env_vault/LibTDLG/TDLG_GPU/Makefile

Make sure the following two settings are used, changing the CUDA version in LFLAGS accordingly (the example below uses CUDA 10.1.243; module list shows the version currently loaded). The -arch=sm_70 flag targets the Volta V100 GPUs on Summit.

LFLAGS = -lm -L/sw/summit/cuda/10.1.243/lib64 -lcudart -lcuda
CUDA_FLAGS = -arch=sm_70

Save the above Makefile settings and compile:

cd ./envs/env_vault/LibTDLG/TDLG_GPU/
make

Detailed instructions can be found here

4. Update the TDLG configuration:

vim ./envs/env_vault/env_cfg/ExaLearnBlockCoPolymerTDLG-v3.json
Modify "app_core" to either "cpu" or "gpu". For the CPU version, set "app_threads" to control the number of OpenMP threads each instance of the TDLG environment will use. Also change the value of "app_dir" to the path of the environment; for example, "app_dir" : "./envs/env_vault/LibTDLG/".
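Assuming the CPU build with the 7-thread setup used by the job scripts on this page, the relevant fields of the JSON might look like the following sketch (all other fields omitted):

```json
{
  "app_core": "cpu",
  "app_threads": 7,
  "app_dir": "./envs/env_vault/LibTDLG/"
}
```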

5. Try a testing run:

cd ../../../../
python driver/test_tdlg_v3.py

6. For job submission:

Below is an example bsub file. For maximum performance, OMP_NUM_THREADS should be set to the same value as the --bind packed setting passed to jsrun; both should also match the "app_threads" setting in the configuration shown in step 4.

#!/bin/bash
#BSUB -P <project>
#BSUB -W 02:00
#BSUB -nnodes 16
#BSUB -J ExaBCP_TDLG
#BSUB -o ExaBCP_TDLG.%J
#BSUB -e ExaBCP_TDLG.%J

source /ccs/proj/ast153/ExaLearn/set_summit_env.sh
conda activate exarl_summit

export OMP_NUM_THREADS=7
export OMP_PLACES=cores
export OMP_PROC_BIND=close

jsrun --nrs 96 --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 --latency_priority CPU-CPU --launch_distribution packed --bind packed:7 python ./driver/driver_example.py --output_dir '<output folder>' --env 'ExaLearnBlockCoPolymerTDLG-v3' --agent 'DQN-LSTM-v0' --n_episodes '10' --n_steps '10' --run_type 'static-omp'
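The jsrun numbers above line up with Summit's node layout (6 GPUs per node, 7 usable cores per GPU); a purely illustrative sketch of the arithmetic:

```python
# Resource-set arithmetic for the 16-node example job above.
nodes = 16
rs_per_host = 6   # one resource set per GPU on a Summit node
cpu_per_rs = 7    # matches OMP_NUM_THREADS and --bind packed:7

nrs = nodes * rs_per_host                  # total resource sets passed via --nrs
cores_per_node = rs_per_host * cpu_per_rs  # cores consumed on each node

print(nrs, cores_per_node)  # 96 42
```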

Example script to run the CPU/OpenMP version of TDLG on multiple Summit nodes (pay attention to the passed arguments; you may not need some of them, or may need to change their values):

#!/bin/bash
#BSUB -P AST153
#BSUB -W 1:30
#BSUB -nnodes 32
#BSUB -J ExaRL_TDLG
#BSUB -o ExaRL_TDLG.%J
#BSUB -e ExaRL_TDLG.%J

source /ccs/proj/ast153/ExaLearn/set_summit_env.sh
conda activate exarl_summit
export OMP_NUM_THREADS=7

for t in 1 2 4 8 16 32
do
export nres=$((t*6))
echo "---------------------------------------"
echo "Running ExaRL with TDLG-CPU on $t nodes"
echo "---------------------------------------"
jsrun --nrs $nres --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 --latency_priority GPU-CPU --launch_distribution packed --bind packed:7 python driver/driver_example.py --output_dir /gpfs/alpine/ast153/scratch/$USER/ --env ExaLearnBlockCoPolymerTDLG-v3 --n_episodes 500 --n_steps 60 --learner_type async --agent DQN-v0 --model_type LSTM --action_type fixed
done
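In the loop above, nres scales the resource-set count with the node count t (six resource sets, one per GPU, per node); the values it takes are:

```python
# nres = t * 6 across the node counts swept in the script above.
node_counts = [1, 2, 4, 8, 16, 32]
nres_values = [t * 6 for t in node_counts]
print(nres_values)  # [6, 12, 24, 48, 96, 192]
```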

Execution time analysis on Summit

The TDLG_CPU code uses OpenMP 4.0 pragmas for shared-memory parallelization. The following average execution times were observed for the driver/driver_example.py script run for 10 episodes of 10 steps each, over varying numbers of processes. The table below compares execution times (in seconds) of the ExaCartPole-V1 and 3D BCP (TDLG_CPU) environments; the TDLG_CPU environment uses 7 OpenMP threads. On a single node, the ExaCartPole runs are about 20x faster than TDLG_CPU.

| Nodes (Processes) | TDLG_CPU | ExaCartPole-V1 |
|-------------------|----------|----------------|
| 1 (6)             | 394.40   | 18.16          |
| 2 (12)            | 538.90   | 38.71          |
| 4 (24)            | 801.46   | 71.59          |
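The single-node speedup quoted above can be checked from the table:

```python
# 1-node (6-process) row of the table above, times in seconds.
tdlg_cpu = 394.40
cartpole = 18.16
ratio = tdlg_cpu / cartpole
print(round(ratio, 1))  # ~21.7, i.e. roughly the 20x quoted above
```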

It is possible to improve the CPU times further with aggressive compiler optimizations and minor tweaks in the BCP C++ code. For instance, the following two changes were found to improve performance by ~8-20%:

1. In update_field.cpp and update_field_noisy.cpp, changed the blocking factors (bsz, bsx, bsy) from 8 to 4.
2. Added fast-math optimization flags when building TDLG_CPU (these can produce non-IEEE-754-compliant arithmetic):

CFLAGS = -std=c++17 -mcpu=native -ffast-math -funsafe-math-optimizations -I./ -fopenmp -fopenmp-simd -DPREC_FP32 -DOMP -fvisibility=default -O3

Results are as follows (compare with the table above):

| Nodes (Processes) | TDLG_CPU (with aggressive GCC compiler options) |
|-------------------|-------------------------------------------------|
| 1 (6)             | 313.04                                          |
| 2 (12)            | 461.12                                          |
| 4 (24)            | 737.59                                          |
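The ~8-20% improvement quoted above can be recovered by comparing the two tables row by row:

```python
# Baseline vs. aggressively optimized TDLG_CPU times (seconds), per row.
baseline  = [394.40, 538.90, 801.46]
optimized = [313.04, 461.12, 737.59]
gains = [round(100 * (b - o) / b, 1) for b, o in zip(baseline, optimized)]
print(gains)  # [20.6, 14.4, 8.0] percent improvement
```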

Similar optimizations are possible when building TDLG_CPU with the IBM XL compiler; to do so, modify the Makefile as follows:

CC = xlC
PROG = LibTDLG_CPU
CFLAGS = -std=gnu++1y -qsmp=omp -qsimd -I./ -DPREC_FP32 -DOMP -O3 -qhot -qarch=pwr9 -qtune=pwr9
LFLAGS = -lm -Wl,-rpath=/sw/summit/xl/16.1.1-5/lib/ -lxlsmp
OBJ = *.o
MYLIB = $(PROG).so
Lib:$(OBJ)
    $(CC) FH_RS.o update_field_noisy.o update_field.o -shared -o ./$(MYLIB) $(CFLAGS) $(LFLAGS)
$(OBJ):
    $(CC) $(CFLAGS) -c -qpic *.cpp
clean:
    rm -f *.o *~

The correctness of the 3D BCP code is assessed using the driver/test_tdlg_v3.py script (which runs a single episode of 10 steps), which prints the results difference (the cumulative reward over 10 steps should be 10) and the average per-step execution time. The following table lists the execution time per step and the results difference.

| TDLG_CPU (best GCC) | TDLG_CPU (best XL) | TDLG_GPU (Summit) | TDLG_GPU (Darwin) |
|---------------------|--------------------|-------------------|-------------------|
| 1.69s, 4.0E-06      | 2.012s, 4.0E-06    | 0.502s, 4.0E-06   | 0.502s, 4.0E-06   |
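From the per-step times above, the GPU build's advantage over the CPU builds works out to roughly:

```python
# Per-step execution times (seconds) from the correctness table above.
gcc_cpu = 1.69    # best GCC CPU build
xl_cpu = 2.012    # best XL CPU build
gpu = 0.502       # GPU build (Summit or Darwin)
gcc_over_gpu = gcc_cpu / gpu
xl_over_gpu = xl_cpu / gpu
print(round(gcc_over_gpu, 1), round(xl_over_gpu, 1))  # 3.4 4.0
```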

The TDLG_GPU version used the default one-task-per-GPU configuration on Summit. By comparison, when CUDA MPS is turned on while running TDLG_GPU, the results are about 3x slower per step (mostly due to very high kernel launch overhead). The results above use versions before develop tag v0.1. On Cori GPU (using devel_v0.3), the results are:

| # CPUs | # GPUs | Avg. elapsed time (sec) |
|--------|--------|-------------------------|
| 5      | 1      | 38                      |
| 10     | 2      | 69                      |
| 20     | 4      | 119                     |
| 40     | 8      | 219                     |