How do I run a case
Before you can run the job, make sure the batch queue variables are set correctly for the specific run being targeted. Currently this is done by manually editing $CASE.run. Carefully check the batch queue submission lines and make sure that you have appropriate account numbers, time limits, and stdout file names. The ccsm_timing.$CASE.$datestamp files report the "Model Throughput" in output like the following:
Overall Metrics:
Model Cost: 327.14 pe-hrs/simulated_year (scale= 0.50)
Model Throughput: 4.70 simulated_years/day
The model throughput is the estimated number of model years that you can run in a wallclock day. Based on this, you can maximize the $CASE.run queue limit and change $STOP_OPTION and $STOP_N in env_run.xml to match. For example, say a model's throughput is 4.7 simulated_years/day and the run is on yellowstone, where the maximum runtime limit is 6 hours. Then 4.7 model years / 24 hours * 6 hours = 1.175 model years per job. On massively parallel computers, there is always some variability in how long a job takes to run. On some machines, you may need to leave as much as 20% buffer time in your run to guarantee that jobs finish reliably before the time limit. For that reason we will set our model to run only one model year per job. Continuing to assume that the run is on yellowstone, in $CASE.yellowstone.run set:
#BSUB -W 6:00
and xmlchange should be invoked as follows in $CASEROOT:
./xmlchange STOP_OPTION=nyears
./xmlchange STOP_N=1
./xmlchange REST_OPTION=nyears
./xmlchange REST_N=1
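The sizing arithmetic above can be sketched in shell. This is a minimal illustration, not part of CESM: the helper names max_years and slack_percent and their arguments are hypothetical, and awk is used only for the floating-point math.

```shell
# Sketch: size a batch job from the "Model Throughput" figure.
# max_years and slack_percent are hypothetical helpers, not CESM tools.

# Model years that fit in one job.
#   $1 throughput in simulated_years/day
#   $2 wallclock limit of the queue, in hours
max_years() {
    LC_ALL=C awk -v tput="$1" -v hours="$2" \
        'BEGIN { printf "%.3f\n", tput / 24.0 * hours }'
}

# Percentage of the wallclock limit left unused when running a given
# number of model years per job.
#   $1 throughput, $2 wallclock limit in hours, $3 model years per job
slack_percent() {
    LC_ALL=C awk -v tput="$1" -v hours="$2" -v years="$3" \
        'BEGIN { printf "%.1f\n", (1.0 - years * 24.0 / tput / hours) * 100.0 }'
}

max_years 4.7 6          # prints 1.175
slack_percent 4.7 6 1    # prints 14.9
```

With the numbers from the example, one model year per job leaves roughly 15% of the 6-hour limit as buffer; machines with more timing variability may call for a shorter run length or a longer limit.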
Once you have configured and built the model, submit $CASE.run to your machine's batch queue system using the $CASE.submit command.
> $CASE.submit
You can see a complete example of how to run a case in the basic example.
When executed, the run script, $CASE.run:
- Will not execute the build script. Building CESM is now done only via an interactive call to the build script, $CASE.build.
- Will check that locked files are consistent with the current xml files, run the buildnml script for each component, and verify that required input data is present on local disk (in $DIN_LOC_ROOT).
- Will run the CESM model.
- Upon completion, will put timing information in $LOGDIR/timing and copy log files back to $LOGDIR.
- If $DOUT_S is TRUE, component history, log, diagnostic, and restart files will be moved from $RUNDIR to the short-term archive directory, $DOUT_S_ROOT.
- If $DOUT_L_MS is TRUE, the long-term archiver, $CASE.l_archive, will be submitted to the batch queue upon successful completion of the run.
- If $RESUBMIT > 0, will resubmit $CASE.run.
If the job runs to completion, you should have "SUCCESSFUL TERMINATION OF CPL7-CCSM" near the end of your STDOUT file. New data should be in the subdirectories under $DOUT_S_ROOT, or if you have long-term archiving turned on, it should be automatically moved to subdirectories under $DOUT_L_MSROOT.
If the job failed, there are several places where you should look for information. Start with the STDOUT and STDERR file(s) in $CASEROOT. If you don't find an obvious error message there, the $RUNDIR/$model.log.$datestamp files will probably give you a hint. First check cpl.log.$datestamp, because it will often tell you when the model failed. Then check the rest of the component log files. Please see troubleshooting runtime errors for more information.
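The checks described above can be sketched as a small shell helper. It is only an illustration: check_run and its two arguments are hypothetical, and the only facts taken from this page are the success string and the cpl.log.$datestamp naming.

```shell
# Sketch: report whether a run finished and, if not, which coupler log
# to read first. check_run is a hypothetical helper, not a CESM script.
#   $1 path to the job's STDOUT file
#   $2 path to the run directory ($RUNDIR)
check_run() {
    if grep -q "SUCCESSFUL TERMINATION OF CPL7-CCSM" "$1"; then
        echo "run completed"
        return 0
    fi
    echo "run did not complete; check first:"
    # Newest coupler log by modification time.
    ls -t "$2"/cpl.log.* 2>/dev/null | head -n 1
    return 1
}
```

A call would look like check_run followed by the STDOUT file name and $RUNDIR; a non-zero exit status means the success string was not found.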
REMINDER: Once you have a successful first run, you must set CONTINUE_RUN to TRUE in env_run.xml before resubmitting; otherwise the job will not progress. You may also need to modify the STOP_OPTION, STOP_N and/or STOP_DATE, REST_OPTION, REST_N and/or REST_DATE, and RESUBMIT variables in env_run.xml before resubmitting.
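A minimal sketch of that step, assuming it is run from $CASEROOT and that xmlchange accepts the same VAR=value form used earlier on this page. The wrapper name continue_case and its argument are hypothetical, and the RESUBMIT value is whatever your experiment needs.

```shell
# Sketch: switch a case from its initial run to continuation mode.
# continue_case is a hypothetical wrapper; it must be called from
# $CASEROOT, where the xmlchange script lives.
#   $1 number of automatic resubmissions wanted (defaults to 0)
continue_case() {
    n_resubmit="${1:-0}"
    ./xmlchange CONTINUE_RUN=TRUE || return 1
    ./xmlchange RESUBMIT="$n_resubmit" || return 1
}
```

For example, calling continue_case 3 before resubmitting would request a continuation run that resubmits itself three more times.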
Restart files are written by each active component (and some data components) at intervals dictated by the driver via the setting of the env_run.xml variables $REST_OPTION and $REST_N. Restart files allow the model to stop and then start again with bit-for-bit exact capability (i.e. the model output is exactly the same as if it had never been stopped). The driver coordinates the writing of restart files as well as the time evolution of the model. All components receive restart and stop information from the driver and write restarts or stop as specified by the driver.
It is important to note that runs initialized as branch or hybrid runs require restart/initial files from previous model runs (as specified by the variables $RUN_REFCASE and $RUN_REFDATE). These required files must be prestaged by the user to the case $RUNDIR (normally $EXEROOT/run) before the model run starts. This is normally done by just copying the contents of the relevant $RUN_REFCASE/rest/$RUN_REFDATE.00000 directory.
Whenever a component writes a restart file, it also writes a restart pointer file of the form rpointer.$component. The restart pointer file contains the name of the restart file that was just written by the component. Upon a restart, each component reads its restart pointer file to determine the filename(s) to read in order to continue the model run. As an example, the following pointer files will be created for a component set using fully active model components.
- rpointer.atm
- rpointer.drv
- rpointer.ice
- rpointer.lnd
- rpointer.rof
- rpointer.cism
- rpointer.ocn.ovf
- rpointer.ocn.restart
If short-term archiving is turned on, then the model archives the component restart datasets and pointer files into $DOUT_S_ROOT/rest/yyyy-mm-dd-sssss, where yyyy-mm-dd-sssss is the model date at the time of the restart (see below for more details). If long-term archiving is also enabled, these restarts are then archived in $DOUT_L_MSROOT/rest. $DOUT_S_ROOT and $DOUT_L_MSROOT are set in env_run.xml and can be changed at any time during the run.
If a run encounters problems and crashes, you will normally have to back up to a previous restart. Assuming that short-term archiving is enabled, you will need to find the latest $DOUT_S_ROOT/rest/yyyy-mm-dd-sssss/ directory that was created and copy the contents of that directory into your run directory ($RUNDIR). You can then continue the run and these restarts will be used. It is important to make sure the new rpointer.* files overwrite the rpointer.* files that were in $RUNDIR, or the job may not restart in the correct place.
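The recovery step above can be sketched as follows. This is an illustration under stated assumptions, not a CESM utility: restore_latest_restart is a hypothetical helper, and it relies on the zero-padded yyyy-mm-dd-sssss directory names sorting chronologically as plain text.

```shell
# Sketch: copy the newest short-term-archived restart set back into the
# run directory. restore_latest_restart is a hypothetical helper.
#   $1 short-term archive root ($DOUT_S_ROOT)
#   $2 run directory ($RUNDIR)
restore_latest_restart() {
    # Names are yyyy-mm-dd-sssss, so a plain text sort is chronological.
    latest=$(ls -d "$1"/rest/*/ 2>/dev/null | sort | tail -n 1)
    if [ -z "$latest" ]; then
        echo "no restart directories under $1/rest" >&2
        return 1
    fi
    # cp overwrites existing files, including stale rpointer.* in $RUNDIR.
    cp "$latest"* "$2"/ || return 1
    echo "restored restarts from $latest"
}
```

Because cp replaces files of the same name, the rpointer.* files from the archive take the place of the ones left behind by the crashed run.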
Occasionally, when a run has problems restarting, it is because the rpointer files are out of sync with the restart files. The rpointer files are text files and can easily be edited to match the correct dates of the restart and history files. All the restart files should have the same date.
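One way to spot out-of-sync pointer files before editing them is sketched below. The helper names are hypothetical, and the sketch assumes that every restart filename recorded in an rpointer file embeds its date as yyyy-mm-dd-sssss; adjust the pattern if your filenames differ.

```shell
# Sketch: detect rpointer files that point at different restart dates.
# rpointer_dates and rpointers_in_sync are hypothetical helpers.
#   $1 run directory ($RUNDIR)

# Print each distinct yyyy-mm-dd-sssss date found in the pointer files.
rpointer_dates() {
    cat "$1"/rpointer.* 2>/dev/null \
        | grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}-[0-9]{5}' \
        | sort -u
}

# Succeed only if all pointer files reference exactly one date.
rpointers_in_sync() {
    count=$(rpointer_dates "$1" | wc -l)
    [ "$((count))" -eq 1 ]
}
```

If rpointers_in_sync fails, the output of rpointer_dates shows which dates are in play, which helps decide which pointer files to edit.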
All component log files are copied to the directory specified by the env_run.xml variable $LOGDIR, which by default is set to $CASEROOT/logs. This location is where log files are copied when the job completes successfully. If the job aborts, the log files will NOT be copied out of the $RUNDIR directory.
Once a model run has completed successfully, the output data flow will depend on whether or not short-term archiving is enabled (as set by the env_run.xml variable $DOUT_S). By default, short-term archiving will be done.
If no short-term archiving is performed, then all model output data will remain in the run directory, as specified by the env_run.xml variable $RUNDIR. Furthermore, if short-term archiving is disabled, then long-term archiving will not be allowed.
If short-term archiving is enabled, the component output files will be moved to the short-term archiving area on local disk, as specified by $DOUT_S_ROOT. The directory $DOUT_S_ROOT is normally set to $EXEROOT/../archive/$CASE and will contain the following directory structure:
atm/
    hist/ logs/
cpl/
    hist/ logs/
glc/
    logs/
ice/
    hist/ logs/
lnd/
    hist/ logs/
ocn/
    hist/ logs/
rest/
    yyyy-mm-dd-sssss/
    ....
    yyyy-mm-dd-sssss/
hist/ contains component history output for the run.
logs/ contains component log files created during the run. In addition to $LOGDIR, log files are also copied to the short-term archiving directory and therefore are available for long-term archiving.
rest/ contains a subset of directories that each contain a consistent set of restart files, initial files and rpointer files. Each sub-directory has a unique name corresponding to the model year, month, day and seconds into the day where the files were created (e.g. 1852-01-01-00000/). The contents of any restart directory can be used to create a branch run or a hybrid run or back up to a previous restart date.
For long production runs that generate many gigabytes of data, you will normally want to move the output data from local disk to a long-term archival location. Long-term archiving can be activated by setting $DOUT_L_MS to TRUE in env_run.xml. By default, the value of this variable is FALSE, and long-term archiving is disabled. If the value is set to TRUE, then the following additional variables are relevant: $DOUT_L_MSROOT, $DOUT_S_ROOT, and $DOUT_S (see variables for output data management).
As was mentioned above, if long-term archiving is enabled, files will be moved out of $DOUT_S_ROOT to $DOUT_L_MSROOT by $CASE.l_archive, which is run as a separate batch job after the successful completion of a model run.