ROGER Parallel Execution
The Basic Fusion program itself is a serial program that uses no parallel libraries. One program instance is designed to generate a single granule of data, i.e. one Terra orbit. Large-scale datasets are generated by running many instances in an embarrassingly parallel workflow. This document outlines one possible way of submitting parallel jobs on ROGER.
The first step is to compile the program. The steps are outlined in the root README.md file. To reiterate briefly: ROGER provides a series of pre-installed libraries and packages, which can be listed using the module avail command. The Makefile.roger file takes advantage of these modules and assumes that they have been loaded. It is also possible to compile the program using privately installed libraries. In that case, a new Makefile would have to be written so that the compiler and linker can find all of the libraries specified in the README.md files. The HDF4 and HDF5 libraries can be downloaded from the HDFgroup.org website.
Before executing the program, a database containing information on the file paths of all the input files must be generated. Please refer to this wiki page on how to generate and query the database.
The Fusion program can be executed using the GNU parallel tool. GNU parallel takes as input a list of commands, each either an explicit program call or a shell script, and distributes those commands among the compute nodes. GNU parallel is run on compute nodes, not login nodes, so calls to parallel must be submitted to the job scheduler (discussed later on). The commands.txt file (the name is arbitrary) contains the explicit Fusion program calls and is passed to GNU parallel. This file should be generated by a shell script that passes the appropriate arguments to each instance of the program. One example of a 3-instance commands.txt file:
./basicFusion ./TERRA_FUS_69365.h5 ./input69365.txt ./orbit_info.bin 2>> ./errors69365.txt
./basicFusion ./TERRA_FUS_69366.h5 ./input69366.txt ./orbit_info.bin 2>> ./errors69366.txt
./basicFusion ./TERRA_FUS_69367.h5 ./input69367.txt ./orbit_info.bin 2>> ./errors69367.txt
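A file like the one above could be produced with a short shell loop. The following is only a minimal sketch, not the project's own generation script: the variable names START_ORBIT, END_ORBIT and COMMANDS_FILE are hypothetical, and the file naming simply copies the pattern of the three example lines.

```shell
#!/bin/bash
# Sketch: write one basicFusion call per orbit into a commands file.
# START_ORBIT, END_ORBIT and COMMANDS_FILE are illustrative assumptions,
# not names used by the Basic Fusion repository.
START_ORBIT=69365
END_ORBIT=69367
COMMANDS_FILE=commands.txt

# Truncate the file first so that rerunning the script does not
# append duplicate commands.
: > "$COMMANDS_FILE"

for orbit in $(seq "$START_ORBIT" "$END_ORBIT"); do
    # Same argument layout as the example commands above; stderr of each
    # instance is appended to a per-orbit error file.
    echo "./basicFusion ./TERRA_FUS_${orbit}.h5 ./input${orbit}.txt ./orbit_info.bin 2>> ./errors${orbit}.txt" >> "$COMMANDS_FILE"
done
```

GNU parallel reads commands from standard input when no command is given on its own command line, so inside a job script the resulting file could be run with something like parallel --jobs 3 < commands.txt. The job count of 3 is an assumption chosen to match the example.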