global_ens: add restart capability to atmos grid2grid and precip stats jobs #604
Full Test
Jobs for the full test have been submitted. COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0.
jevs_global_ens_cmce_atmos_grid2grid_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o206752404
jevs_global_ens_ecme_atmos_grid2grid_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o206752480
jevs_global_ens_gefs_atmos_grid2grid_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206752548
jevs_global_ens_naefs_atmos_grid2grid_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206752662
jevs_global_ens_cmce_atmos_precip_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o206752755
jevs_global_ens_ecme_atmos_precip_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o206752988
jevs_global_ens_gefs_atmos_precip_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o206753216
jevs_global_ens_naefs_atmos_precip_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o206753242 |
@malloryprow, please copy the jevs_global_ens_gefs_atmos_grid2grid_stats.sh (may need to increase memory) These two jobs need the exec/evs_g2g_adjustCMC.x file to run. All other jobs have successfully completed. |
jevs_global_ens_gefs_atmos_grid2grid_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206855987
jevs_global_ens_naefs_atmos_grid2grid_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206856015 |
All jobs have completed successfully. Please do the restart test. |
Will do! The memory for jevs_global_ens_cmce_atmos_precip_stats.sh is quite high (mem=100GB) for what it is using (update_job_usage: Memory usage: mem=2568832kb). I think it can be 10GB like the ecme and naefs jobs. Please update the dev driver and ecf script! |
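For anyone checking the memory figures quoted above, a minimal sketch of the unit conversion follows. The helper name kb_to_gb is made up for illustration, and it assumes that "kb" in the PBS usage line means 1024 bytes, so 1 GB is taken as 1024*1024 kb.

```shell
#!/bin/bash
# Hypothetical helper (not part of EVS): convert a PBS memory figure such as
# "2568832kb" to GB, assuming 1 GB = 1024 * 1024 kb.
kb_to_gb() {
    local kb="${1%kb}"   # strip a trailing "kb" suffix if present
    awk -v kb="$kb" 'BEGIN { printf "%.2f\n", kb / (1024 * 1024) }'
}

used_gb=$(kb_to_gb 2568832kb)
echo "used ${used_gb} GB of the 100 GB requested"
```

Under that assumption the reported usage is about 2.45 GB, which fits within the suggested 10GB request.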
Restart
I moved /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs into /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/full_test. COMOUT for the restart testing will be /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0. I believe a few restart grid2grid jobs are still running.
jevs_global_ens_cmce_atmos_grid2grid_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o206871666
jevs_global_ens_cmce_atmos_precip_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o206872251
jevs_global_ens_ecme_atmos_grid2grid_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o206873673
jevs_global_ens_ecme_atmos_precip_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o206873677
jevs_global_ens_gefs_atmos_grid2grid_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206874214
jevs_global_ens_gefs_atmos_precip_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o206874217
jevs_global_ens_naefs_atmos_grid2grid_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206874568
jevs_global_ens_naefs_atmos_precip_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o206874484 |
Noticed one more thing with dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.sh should |
I'm looking at the jobs some more now that the restart testing is complete. I'm noticing that the walltime for the restart runs is very close to the full run (the full run times are from today's parallel logs), particularly for the grid2grid runs.
|
@malloryprow @GwenChen-NOAA I agree. These runtimes indicate that the restart capability is not working as intended. The restart runtimes are often just as long as (or longer than) the full run runtimes. Are the restart runs possibly not using the restart files being produced? |
dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.sh
@AliciaBentley-NOAA, the restart runs do use the restart files as intended. @malloryprow made a mistake. For jevs_global_ens_gefs_atmos_grid2grid_stats.sh (full run), resources_used.walltime is 00:57:20 in /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/full_test/jevs_global_ens_gefs_atmos_grid2grid_stats.o206855987, which is longer than the restart run. So, the restart runtime of all jobs is less than or equal to the full runtime. The benefit of restart is not great for the grid2grid jobs, which run the GenEnsProd, EnsembleStat, and GridStat tasks. The GenEnsProd task runs very fast, but the EnsembleStat and GridStat tasks take up most of the runtime. So, after any interruption to the EnsembleStat or GridStat tasks, rerunning these two tasks will take about the same time as the full run. |
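To compare walltimes like the 00:57:20 quoted above without reading HH:MM:SS values by eye, a small conversion helper can be used. This is only a sketch: hms_to_seconds is a made-up name, and the restart value in the demo is a placeholder, not a figure from these logs.

```shell
#!/bin/bash
# Hypothetical helper (not part of EVS): convert an HH:MM:SS walltime,
# as reported in resources_used.walltime, to seconds.
hms_to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    # 10# forces base 10 so fields like "08" are not parsed as octal.
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

full=$(hms_to_seconds 00:57:20)      # full-run walltime quoted in the thread
restart=$(hms_to_seconds 00:50:00)   # placeholder value for illustration only
if [ "$restart" -le "$full" ]; then
    echo "restart run saved $(( full - restart )) seconds"
else
    echo "restart run was $(( restart - full )) seconds longer"
fi
```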
The restart test looks successful to me. |
So if the run has gotten through EnsembleStat up to forecast hour 168 and then gets interrupted, when the job is restarted will it start again at forecast hour 0, or at forecast hour 168 where it got interrupted during EnsembleStat? I feel NCO is going to keep a close eye on this since it got a waiver for EVS v1.0. |
@malloryprow @GwenChen-NOAA Thanks for the discussion. In operations, the purpose of having restart capabilities is to considerably reduce a job's runtime when that job gets part way through running and unexpectedly crashes. When jobs crash part way through running, NCO needs to rerun/finish the job as quickly as possible in order for the ops supercomputer to catch back up to where it should be. For example, if a job that typically takes 1 hour to run crashes at 45 minutes, restart capabilities should allow the job to complete in ~15 minutes when it is rerun. My worry is that NCO will not be satisfied with the restart runtimes in these examples and may even send EVS v2.0 back to us to fix. We'd like to avoid that. Do either of you know how the restart capabilities that Gwen added to global_ens differ from the other restart capabilities in EVS that do considerably reduce runtimes? For example, which component of EVS did Gwen use as an example to add these restart capability updates? Examining that EVS component and the code in this PR might reveal where things differ and allow us to get the reduced runtimes that NCO expects. Thanks! |
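As a generic illustration of the behavior described above (a rerun that only does the work that had not finished before the crash), here is a minimal sketch. It is not the implementation in this PR: run_with_restart, the task_f*.stat file naming, and the per-forecast-hour granularity are all assumptions made for the example.

```shell
#!/bin/bash
# Hypothetical sketch (not the EVS scripts): skip any forecast hour whose
# output file already exists, so a rerun only does the unfinished work.
run_with_restart() {
    local outdir="$1"; shift
    local fhr out
    for fhr in "$@"; do
        out="${outdir}/task_f${fhr}.stat"
        # A non-empty output file marks this forecast hour as already done.
        if [ -s "$out" ]; then
            echo "skip f${fhr}"
            continue
        fi
        echo "run f${fhr}"
        # Placeholder for the real work. Write to a temporary name first so a
        # job killed mid-write never leaves a file that looks complete.
        echo "stats for f${fhr}" > "${out}.tmp" && mv "${out}.tmp" "$out"
    done
}

# Demo: pretend f000 and f024 finished before the crash.
demo_dir=$(mktemp -d)
echo "done" > "${demo_dir}/task_f000.stat"
echo "done" > "${demo_dir}/task_f024.stat"
run_with_restart "$demo_dir" 000 024 048 072
rm -rf "$demo_dir"
```

With checkpoints this fine-grained, the time saved on a rerun scales with how far the first attempt got. If the smallest skippable unit is a whole long-running task, an interruption inside that task saves little.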
The forecast hour loop is set within the METplus job (EnsembleStat or GridStat) by setting the The restart setup in this PR mimics the restart setup in NARRE restart (PR #465) that @AliciaBentley-NOAA provided to me. I think this is a common setup for all ensemble stats jobs restart. |
@malloryprow, please run the restart test. Thanks! |
I want to do a full test for the two jobs that had changes with the latest commit, especially jevs_global_ens_gefs_atmos_grid2grid_stats.sh, to make sure the memory increase covers it! I moved the old directories in COMOUT to /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0/stats/global_ens/full_test_try1, so current testing output is in /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0/stats/global_ens.
Full Test
1. jevs_global_ens_gefs_atmos_grid2grid_stats
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o208477253
2. jevs_global_ens_naefs_atmos_grid2grid_stats
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o208477238 |
You will need to recompile evs_g2g_adjustCMC.f if you want to test the warning message changes. |
I did! |
@malloryprow, the results of the jevs_global_ens_gefs_atmos_grid2grid_stats and jevs_global_ens_naefs_atmos_grid2grid_stats jobs look good. If you can run the restart test today, I will check it before I leave for the AGU meeting next week. |
Hi @GwenChen-NOAA, my work day was over at 3:30pm. Given that you are at AGU now, we will put this testing on hold until you return from AGU. |
Can you do a sync fork when you get a chance, @GwenChen-NOAA? |
RESTART TESTS - grid2grid
COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0.
1. jevs_global_ens_cmce_atmos_grid2grid_stats.sh (interrupted run with walltime of 15 minutes)
Interrupted Run Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o209305213
2. jevs_global_ens_ecme_atmos_grid2grid_stats.sh (interrupted run with walltime of 35 minutes)
Interrupted Run Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o209305343
3. jevs_global_ens_gefs_atmos_grid2grid_stats.sh (interrupted run with walltime of 20 minutes)
Interrupted Run Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o209305414
4. jevs_global_ens_naefs_atmos_grid2grid_stats.sh (interrupted run with walltime of 10 minutes)
Interrupted Run Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o209305624 |
It looks like the number of lines in the final stat file is double that in the parallel.
/lfs/h2/emc/vpppg/noscrub/emc.vpppg/evs/v2.0/stats/global_ens/cmce.20241214/evs.stats.cmce.atmos.grid2grid.v20241214.stat: 15625
I think there needs to be an update to where
|
Just did. |
@malloryprow, do you know a way to restrict stat_analysis to only grab small stat files in top level of |
Can you do |
Try that and let me know how it works! If it doesn't, we can look into a different solution. |
Ok. |
@malloryprow, it works. The new final stat files are the same size as those in the parallel. The restart test looks good to me. You may run the precip jobs to confirm. |
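The exact change applied here is not shown in the thread. As a general illustration of how a final stat file can end up with twice the expected lines, and of one way to look only at the top level of a directory, consider the sketch below. The helper count_stat_files and the directory layout are assumptions made for the example; it relies on the -maxdepth option of find, which limits how deep the search descends.

```shell
#!/bin/bash
# Hypothetical sketch (not the fix used in this PR): count *.stat files
# either only at the top level of a directory or in the whole tree.
count_stat_files() {
    # $1 = directory, $2 = "top" (top level only) or "all" (recursive)
    if [ "$2" = "top" ]; then
        find "$1" -maxdepth 1 -type f -name '*.stat' | wc -l | tr -d ' '
    else
        find "$1" -type f -name '*.stat' | wc -l | tr -d ' '
    fi
}

# Demo layout: each small stat file also has a copy in a subdirectory,
# so a recursive search sees every record twice.
demo_dir=$(mktemp -d)
mkdir -p "${demo_dir}/copies"
for fhr in 000 024; do
    echo "record f${fhr}" > "${demo_dir}/small_f${fhr}.stat"
    cp "${demo_dir}/small_f${fhr}.stat" "${demo_dir}/copies/"
done
echo "top level only: $(count_stat_files "$demo_dir" top)"
echo "whole tree:     $(count_stat_files "$demo_dir" all)"
rm -rf "$demo_dir"
```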
COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0.
I reran the grid2grid jobs with the change, so essentially all the small stats files are there but will generate a new final stats file for each job. They look a lot better now.
1. jevs_global_ens_cmce_atmos_grid2grid_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o209325989
2. jevs_global_ens_ecme_atmos_grid2grid_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o209325997
3. jevs_global_ens_gefs_atmos_grid2grid_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o209326005
4. jevs_global_ens_naefs_atmos_grid2grid_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o209326013
It looks like all the precip jobs are less than 15 minutes so they don't need restart. I just ran a full test for them, but I can try to test restart for them if you'd like. Some of them are pretty quick,
5. jevs_global_ens_cmce_atmos_precip_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o209327102
6. jevs_global_ens_ecme_atmos_precip_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o209327117
7. jevs_global_ens_gefs_atmos_precip_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o209327267
8. jevs_global_ens_naefs_atmos_precip_stats.sh
Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o209327396 |
The full precip runs ran
|
The new test run looks very good to me. Yes, all precip jobs are under 15 min. I added the restart since |
Ah, got it. Would you like me to test it then to make sure the changes work even though restart isn't technically needed? |
Sure, in case it's needed in the future. |
COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0. The interrupted runs each ran for 5 min.
1. jevs_global_ens_cmce_atmos_precip_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o209329557
2. jevs_global_ens_ecme_atmos_precip_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o209329612
3. jevs_global_ens_gefs_atmos_precip_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o209329612
4. jevs_global_ens_naefs_atmos_precip_stats.sh
Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o209329612 |
The precip test run is successful. I think this PR is ready to be merged. |
Changes look good and testing successful.
Thank you @GwenChen-NOAA!
@malloryprow @GwenChen-NOAA Before I approve this PR to be merged, can we confirm that the ecme and naefs .ecf scripts that were changed in this PR match their corresponding dev drivers? There are four .ecf scripts changed and only two dev drivers. Once this is confirmed, please go ahead and merge.
It looks like the changes were already in the corresponding dev drivers https://github.com/GwenChen-NOAA/EVS/blob/grid2grid_restart/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.sh |
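One way to spot-check that an .ecf script and its dev driver request the same resources is to compare their PBS resource lines. The sketch below is only an illustration: the helper names are made up, and it assumes both files carry their resource requests on lines starting with "#PBS -l".

```shell
#!/bin/bash
# Hypothetical helpers (not part of EVS): compare the "#PBS -l" resource
# lines of two job scripts, ignoring whitespace differences and line order.
pbs_resources() {
    grep -E '^#PBS[[:space:]]+-l' "$1" | sed -E 's/[[:space:]]+/ /g' | sort
}

same_resources() {
    [ "$(pbs_resources "$1")" = "$(pbs_resources "$2")" ]
}

# Demo with two throwaway files standing in for a dev driver and an ecf script.
demo_dir=$(mktemp -d)
printf '#PBS -l walltime=00:15:00\n#PBS -l select=1:ncpus=1:mem=10GB\n' > "${demo_dir}/driver.sh"
printf '#PBS -l select=1:ncpus=1:mem=10GB\n#PBS  -l walltime=00:15:00\n' > "${demo_dir}/job.ecf"
if same_resources "${demo_dir}/driver.sh" "${demo_dir}/job.ecf"; then
    echo "resource requests match"
else
    echo "resource requests differ"
fi
rm -rf "$demo_dir"
```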
Description of Changes
This PR adds restart capability to atmos grid2grid and precip stats jobs for the global_ens component. This PR addresses Issue #532.
Developer Questions and Checklist
[...] ${USER} where necessary throughout the code.
[...] HOMEevs are removed from the code.
[...] dev/drivers/scripts or dev/modulefiles have been made in the corresponding ecf/scripts and ecf/defs/evs-nco.def?
Testing Instructions
(1) Set up jobs
a. Symlink the EVS_fix directory locally as "fix".
b. Copy the exec directory from EVS prod package:
cp -r /lfs/h1/ops/prod/packages/evs.v1.0.13/exec $HOMEevs
c. In the driver scripts, edit the following environment variables:
HOMEevs - set to your test EVS directory
COMIN - set to /lfs/h2/emc/vpppg/noscrub/emc.vpppg/${NET}/$evs_ver_2d
COMOUT - set to your test output directory
(2) Run jobs
Run the following jobs in EVS/dev/drivers/scripts/stats/global_ens:
qsub jevs_global_ens_cmce_atmos_grid2grid_stats.sh
qsub jevs_global_ens_ecme_atmos_grid2grid_stats.sh
qsub jevs_global_ens_gefs_atmos_grid2grid_stats.sh
qsub jevs_global_ens_naefs_atmos_grid2grid_stats.sh
qsub jevs_global_ens_cmce_atmos_precip_stats.sh
qsub jevs_global_ens_ecme_atmos_precip_stats.sh
qsub jevs_global_ens_gefs_atmos_precip_stats.sh
qsub jevs_global_ens_naefs_atmos_precip_stats.sh
Check that the log files are free of errors.
(3) Test restart capability
After a successful full run, save the final stat file for comparison. Run the job again and stop it at a random point using qdel. Then, resubmit the job using qsub. The resubmitted job should run for less time than the full run, and the final stat file from the resubmitted job should be the same as the one from the full run. The log file from the resubmitted job should also be free of errors.
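The comparison in step (3) can be scripted. The sketch below assumes the saved full-run stat file and the one from the resubmitted job are both available as ordinary files. The helper name stat_files_match and the demo file names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of EVS): report whether the final stat file
# from the resubmitted job is byte-for-byte identical to the saved one.
stat_files_match() {
    local full="$1" restart="$2"
    if cmp -s "$full" "$restart"; then
        echo "MATCH: $(wc -l < "$restart" | tr -d ' ') lines"
        return 0
    fi
    echo "DIFFER: full=$(wc -l < "$full" | tr -d ' ') lines, restart=$(wc -l < "$restart" | tr -d ' ') lines"
    return 1
}

# Demo with throwaway files: one identical copy and one with doubled records.
demo_dir=$(mktemp -d)
printf 'rec1\nrec2\n' > "${demo_dir}/full.stat"
cp "${demo_dir}/full.stat" "${demo_dir}/restart_ok.stat"
cat "${demo_dir}/full.stat" "${demo_dir}/full.stat" > "${demo_dir}/restart_doubled.stat"
stat_files_match "${demo_dir}/full.stat" "${demo_dir}/restart_ok.stat"
stat_files_match "${demo_dir}/full.stat" "${demo_dir}/restart_doubled.stat" || true
rm -rf "$demo_dir"
```

If the two files differ only in record order, sorting both before comparing may be more appropriate. Whether order matters for these stat files is not stated here.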