ROGER Parallel Execution
The Basic Fusion program itself is a purely serial program; it does not use any parallel libraries. One program instance is designed to generate a single granule of data, i.e. one Terra orbit. The program's large-scale datasets are therefore generated using an embarrassingly parallel workflow. This document outlines one possible way of submitting parallel jobs on ROGER.
The first step is to compile the program. The steps to do this are outlined in the root README.md file; to briefly reiterate, ROGER provides a set of pre-installed libraries and packages that can be found using the module avail command. The Makefile.roger file takes advantage of these modules and assumes that they have been loaded. It is also possible to compile the program using privately installed libraries; in that case, a new Makefile would have to be written so that the compiler and linker have proper visibility of all of the libraries specified in the README.md files. The HDF4 and HDF5 libraries can be downloaded from the HDFgroup.org website.
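A minimal sketch of the build step is shown below. The module names are assumptions; run module avail on ROGER to find the names and versions actually provided.

```bash
# Load the pre-installed libraries that Makefile.roger expects.
# NOTE: these module names are assumptions; check `module avail` on ROGER.
module load hdf4
module load hdf5

# Build from the repository root with the ROGER-specific Makefile
# (how the Makefile is invoked may differ; see the root README.md).
make -f Makefile.roger
```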
Before executing the program, a database containing information on the file paths of all the input files must be generated. Please refer to this wiki page on how to generate and query the database.
One of the prerequisites for the program is the input text file. This text file contains the paths of all the input HDF4 and HDF5 files necessary for file fusion. Once the database has been generated, a provided script queries the database and orders the files to produce this text file. The script is located at metadata-input/genInput/genFusionInput.sh.
It takes as arguments the desired orbit number and the output .txt filename, and queries the database. Because the database does not return the files in an ordered manner, and because the database itself cannot perform file verification, the script also orders the files properly and performs extensive file verification to ensure that the generated text file contains no errors. This ordering and verification step prevents the fusion program from crashing on bad input.
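For example, to generate the input file for orbit 69365 (the same call that appears, with error redirection added, in the commands file shown further below):
./genFusionInput.sh 69365 "../fusionTestInput/input/input69365.txt"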
One of these input files must be generated for each instance of the program (one instance per orbit, since the fusion granularity is defined as one Terra orbit). Because running more than two or three database queries at once is computationally expensive, it is recommended that you submit these queries to the compute nodes for parallel processing:
- Write a script that generates a commands.txt file for GNU parallel containing all of the necessary database queries, one query per line (a minimal sketch of such a script appears after the example below).
- Create a PBS (Portable Batch System) script that calls GNU parallel, passing it your commands.txt file.
- Submit your PBS script using qsub.
An example commands.txt file for the database queries:
./genFusionInput.sh 69365 "../fusionTestInput/input/input69365.txt" 2> errors/errors69365.txt
./genFusionInput.sh 69366 "../fusionTestInput/input/input69366.txt" 2> errors/errors69366.txt
./genFusionInput.sh 69367 "../fusionTestInput/input/input69367.txt" 2> errors/errors69367.txt
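One way to write the generating script, sketched here with a placeholder orbit range and directory layout (adjust these to your own setup):

```bash
#!/bin/bash
# Sketch: write one genFusionInput.sh call per orbit into a GNU parallel
# commands file. The orbit range and paths below are assumptions.

START_ORBIT=69365
END_ORBIT=69367
OUT_DIR="../fusionTestInput/input"
ERR_DIR="errors"
CMD_FILE="commands.txt"

mkdir -p "$ERR_DIR"
> "$CMD_FILE"   # start with an empty commands file

for orbit in $(seq "$START_ORBIT" "$END_ORBIT"); do
    echo "./genFusionInput.sh $orbit \"$OUT_DIR/input$orbit.txt\" 2> $ERR_DIR/errors$orbit.txt" >> "$CMD_FILE"
done
```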
Examples of using GNU parallel can be found in the ROGER user guide.
The Fusion program itself is also executed using the GNU parallel tool. GNU parallel takes as input a list of commands to be executed, either explicit program calls or shell scripts, and distributes those commands amongst the compute nodes automatically. GNU parallel is run on compute nodes, not login nodes, so calls to parallel must be submitted through the job scheduler (discussed below). The commands.txt file (or whatever you choose to name it) contains the explicit Fusion program calls and is passed to GNU parallel. This file should be generated by a shell script that passes the appropriate arguments to each instance of the program. One example of a 3-instance commands.txt file:
./basicFusion ./TERRA_FUS_69365.h5 ./input69365.txt ./orbit_info.bin 2> ./errors69365.txt
./basicFusion ./TERRA_FUS_69366.h5 ./input69366.txt ./orbit_info.bin 2> ./errors69366.txt
./basicFusion ./TERRA_FUS_69367.h5 ./input69367.txt ./orbit_info.bin 2> ./errors69367.txt
Each line contains:
- Path to the basicFusion executable
- Path of the output HDF5 file
- Path to the input text file that contains the paths of the necessary HDF4 and HDF5 files
- Path to the orbit_info.bin file
- Redirection of error messages (stderr) to a per-orbit log file
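As an illustration only, a PBS script that runs GNU parallel over this commands file might look like the sketch below. The resource requests, module names, and node-distribution approach are assumptions; the ROGER documentation referenced at the end of this page has authoritative examples.

```bash
#!/bin/bash
#PBS -N basicFusion
#PBS -l nodes=2:ppn=20
#PBS -l walltime=04:00:00
#PBS -j oe
# The node count, processors per node, and walltime above are placeholders.

cd "$PBS_O_WORKDIR"

# Load GNU parallel and the HDF libraries (module names are assumptions;
# check `module avail` on ROGER for the real ones).
module load parallel hdf4 hdf5

# $PBS_NODEFILE lists the assigned hosts once per core. Reduce it to one
# entry per node so GNU parallel can ssh to each node and spread the
# commands in commands.txt across them, running -j jobs per node.
sort -u "$PBS_NODEFILE" > nodes.txt
parallel --sshloginfile nodes.txt --workdir "$PBS_O_WORKDIR" -j 20 < commands.txt
```

If the basicFusion executable depends on environment variables set by the loaded modules (for example LD_LIBRARY_PATH), those may need to be forwarded to the remote shells, e.g. with GNU parallel's --env option.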
This commands file is then passed to GNU parallel (as documented on the linked ROGER page) inside the PBS script. Submit the PBS script to the job scheduler using qsub [PBS script name]. You can view the status of your job using qstat | grep [NCSA username]. A status of Q means the job is enqueued, R means it is running, and C means it is complete.
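For example, assuming the PBS script above was saved as runFusion.pbs (a hypothetical name) and your NCSA username is jdoe:

```bash
qsub runFusion.pbs      # submit the job; prints the assigned job ID
qstat | grep jdoe       # check the job's status (Q, R, or C)
```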
Example PBS scripts that contain the necessary parameters can be found in the ROGER documentation.