global_ens: add restart capability to atmos grid2grid and precip stats jobs #604

Merged
merged 14 commits into from
Dec 16, 2024


GwenChen-NOAA
Contributor

@GwenChen-NOAA GwenChen-NOAA commented Nov 15, 2024

Description of Changes

This PR adds restart capability to atmos grid2grid and precip stats jobs for the global_ens component. This PR addresses Issue #532.

Developer Questions and Checklist

  • Is this a high priority PR? If so, why and is there a date it needs to be merged by? Yes.
  • Do you have any planned upcoming annual leave/PTO? Yes, 11/27-29
  • Are there any changes needed for when the jobs are supposed to run? No.
  • The code changes follow NCO's EE2 Standards.
  • The developer's name is removed throughout the code, and ${USER} is used where necessary.
  • References to the feature branch for HOMEevs are removed from the code.
  • J-Job environment variables, COMIN and COMOUT directories, and output follow what has been defined for EVS.
  • Jobs over 15 minutes in runtime have restart capability.
  • If applicable, changes in the dev/drivers/scripts or dev/modulefiles have been made in the corresponding ecf/scripts and ecf/defs/evs-nco.def?
  • Jobs contain the appropriate file checking and don't run METplus for any missing data.
  • Code is using METplus wrappers structure and not calling MET executables directly.
  • Log is free of any ERRORs or WARNINGs.

Testing Instructions

(1) Set up jobs
a. Symlink the EVS_fix directory locally as "fix".
b. Copy the exec directory from EVS prod package:
cp -r /lfs/h1/ops/prod/packages/evs.v1.0.13/exec $HOMEevs

c. In the driver scripts, edit the following environment variables:

HOMEevs - set to your test EVS directory
COMIN - set to /lfs/h2/emc/vpppg/noscrub/emc.vpppg/${NET}/$evs_ver_2d
COMOUT - set to your test output directory

(2) Run jobs
Run the following jobs in EVS/dev/drivers/scripts/stats/global_ens:

qsub jevs_global_ens_cmce_atmos_grid2grid_stats.sh
qsub jevs_global_ens_ecme_atmos_grid2grid_stats.sh
qsub jevs_global_ens_gefs_atmos_grid2grid_stats.sh
qsub jevs_global_ens_naefs_atmos_grid2grid_stats.sh
qsub jevs_global_ens_cmce_atmos_precip_stats.sh
qsub jevs_global_ens_ecme_atmos_precip_stats.sh
qsub jevs_global_ens_gefs_atmos_precip_stats.sh
qsub jevs_global_ens_naefs_atmos_precip_stats.sh

Check that the log files are free of errors.

(3) Test restart capability
After a successful full run, save the final stat file for comparison. Run the job again and stop it at a random point using qdel. Then resubmit the job using qsub. The resubmitted job should take less time than the full run, and its final stat file should be identical to the one from the full run. The log file from the resubmitted job should also be free of errors.
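The comparison in step (3) can be scripted. The following is a minimal sketch, not part of the PR: the function name and the throwaway file names and contents are hypothetical, and the demo uses temporary files in place of real EVS stat output.

```python
import filecmp
import tempfile
from pathlib import Path


def stat_files_match(full_run_stat, restart_stat):
    """Return True when the two final stat files are byte-for-byte identical."""
    # shallow=False forces a content comparison instead of trusting size/mtime.
    return filecmp.cmp(full_run_stat, restart_stat, shallow=False)


def demo():
    """Compare throwaway files that stand in for the saved and restarted output."""
    with tempfile.TemporaryDirectory() as tmp:
        full = Path(tmp) / "full_run.stat"        # hypothetical file name
        restart = Path(tmp) / "restart_run.stat"  # hypothetical file name
        full.write_text("HEADER\nrow 1\nrow 2\n")
        restart.write_text("HEADER\nrow 1\nrow 2\n")
        same = stat_files_match(full, restart)
        # Duplicate the data rows to mimic a restart that wrote extra lines.
        restart.write_text("HEADER\nrow 1\nrow 2\nrow 1\nrow 2\n")
        different = stat_files_match(full, restart)
    return same, different


if __name__ == "__main__":
    print(demo())
```

In practice the two arguments would be the stat file saved from the full run and the one written by the resubmitted job.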

@GwenChen-NOAA GwenChen-NOAA changed the title from "Add restart capability to atmos grid2grid and precip stats jobs" to "global_ens: add restart capability to atmos grid2grid and precip stats jobs" Nov 16, 2024
@malloryprow malloryprow self-assigned this Nov 18, 2024
@malloryprow malloryprow added the enhancement New feature or request label Nov 18, 2024
@malloryprow malloryprow added this to the EVS v2.0.0 milestone Nov 18, 2024
@malloryprow
Contributor

Full Test

Jobs for the full test have been submitted. COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0.

jevs_global_ens_cmce_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o206752404
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_grid2grid_stats.206752404.dbqs01

jevs_global_ens_ecme_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o206752480
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_grid2grid_stats.206752480.dbqs01

jevs_global_ens_gefs_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206752548
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.206752548.dbqs01

jevs_global_ens_naefs_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206752662
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.206752662.dbqs01

jevs_global_ens_cmce_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o206752755
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_precip_stats.206752755.dbqs01

jevs_global_ens_ecme_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o206752988
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_precip_stats.206752988.dbqs01

jevs_global_ens_gefs_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o206753216
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_precip_stats.206753216.dbqs01

jevs_global_ens_naefs_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o206753242
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_precip_stats.206753242.dbqs01

@GwenChen-NOAA
Contributor Author

@malloryprow, please copy the exec directory from EVS prod package (cp -r /lfs/h1/ops/prod/packages/evs.v1.0.13/exec $HOMEevs) and then rerun the following two jobs:

jevs_global_ens_gefs_atmos_grid2grid_stats.sh (may need to increase memory)
jevs_global_ens_naefs_atmos_grid2grid_stats.sh

These two jobs need the exec/evs_g2g_adjustCMC.x file to run. All other jobs have successfully completed.

@malloryprow
Contributor

jevs_global_ens_gefs_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206855987
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.206855987.dbqs01

jevs_global_ens_naefs_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206856015
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.206856015.dbqs01

@GwenChen-NOAA
Contributor Author

All jobs have completed successfully. Please do the restart test.

@malloryprow
Contributor

Will do!

The memory for jevs_global_ens_cmce_atmos_precip_stats.sh is quite high (mem=100GB) for what it is using (update_job_usage: Memory usage: mem=2568832kb). I think it can be 10GB like the ecme and naefs jobs. Please update the dev driver and ecf script!

@malloryprow
Contributor

Restart

I moved /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs into /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/full_test. COMOUT for the restart testing will be /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0.

I believe a few restart grid2grid jobs are still running.

jevs_global_ens_cmce_atmos_grid2grid_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o206871666
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_grid2grid_stats.206871666.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o206872712
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_grid2grid_stats.206872712.dbqs01

jevs_global_ens_cmce_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o206872251
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_precip_stats.206872251.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o206872739
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_precip_stats.206872739.dbqs01

jevs_global_ens_ecme_atmos_grid2grid_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o206873673
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_grid2grid_stats.206873673.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o206875508
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_grid2grid_stats.206875508.dbqs01

jevs_global_ens_ecme_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o206873677
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_precip_stats.206873677.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o206873949
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_precip_stats.206873949.dbqs01

jevs_global_ens_gefs_atmos_grid2grid_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206874214
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.206874214.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o206875487
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.206875487.dbqs01

jevs_global_ens_gefs_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o206874217
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_precip_stats.206874217.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o206874402
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_precip_stats.206874402.dbqs01

jevs_global_ens_naefs_atmos_grid2grid_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206874568
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.206874568.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o206875290
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.206875290.dbqs01

jevs_global_ens_naefs_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o206874484
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_precip_stats.206874484.dbqs01
Restart Log: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o206875001
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_precip_stats.206875001.dbqs01

@malloryprow
Contributor

I noticed one more thing: dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.sh should use select=1:ncpus=1. The ecf script is fine. Sorry I didn't catch that earlier with the memory.

@malloryprow
Contributor

I'm looking at the jobs some more now that the restart testing is complete. I'm noticing that the walltime for the restart runs is very close to that of the full runs (the full run times are from today's parallel logs), particularly for the grid2grid jobs.

jevs_global_ens_cmce_atmos_grid2grid_stats.sh (full run): resources_used.walltime = 00:29:12
jevs_global_ens_cmce_atmos_grid2grid_stats.sh (interrupted run): resources_used.walltime = 00:12:06
jevs_global_ens_cmce_atmos_grid2grid_stats.sh (restart run): resources_used.walltime = 00:27:07

jevs_global_ens_ecme_atmos_grid2grid_stats.sh (full run): resources_used.walltime = 01:05:23
jevs_global_ens_ecme_atmos_grid2grid_stats.sh (interrupted run): resources_used.walltime = 00:29:47
jevs_global_ens_ecme_atmos_grid2grid_stats.sh (restart run): resources_used.walltime = 00:51:19

jevs_global_ens_gefs_atmos_grid2grid_stats.sh (full run): resources_used.walltime = 00:54:14
jevs_global_ens_gefs_atmos_grid2grid_stats.sh (interrupted run): resources_used.walltime = 00:18:41
jevs_global_ens_gefs_atmos_grid2grid_stats.sh (restart run): resources_used.walltime = 00:56:17

jevs_global_ens_naefs_atmos_grid2grid_stats.sh (full run): resources_used.walltime = 00:19:08
jevs_global_ens_naefs_atmos_grid2grid_stats.sh (interrupted run): resources_used.walltime = 00:07:55
jevs_global_ens_naefs_atmos_grid2grid_stats.sh (restart run): resources_used.walltime = 00:19:08

jevs_global_ens_cmce_atmos_precip_stats.sh (full run): resources_used.walltime = 00:03:33
jevs_global_ens_cmce_atmos_precip_stats.sh (interrupted run): resources_used.walltime = 00:02:26
jevs_global_ens_cmce_atmos_precip_stats.sh (restart run): resources_used.walltime = 00:02:51

jevs_global_ens_ecme_atmos_precip_stats.sh (full run): resources_used.walltime = 00:08:15
jevs_global_ens_ecme_atmos_precip_stats.sh (interrupted run): resources_used.walltime = 00:06:03
jevs_global_ens_ecme_atmos_precip_stats.sh (restart run): resources_used.walltime = 00:07:21

jevs_global_ens_gefs_atmos_precip_stats.sh (full run): resources_used.walltime = 00:05:48
jevs_global_ens_gefs_atmos_precip_stats.sh (interrupted run): resources_used.walltime = 00:04:21
jevs_global_ens_gefs_atmos_precip_stats.sh (restart run): resources_used.walltime = 00:04:57

jevs_global_ens_naefs_atmos_precip_stats.sh (full run): resources_used.walltime = 00:03:32
jevs_global_ens_naefs_atmos_precip_stats.sh (interrupted run): resources_used.walltime = 00:03:22
jevs_global_ens_naefs_atmos_precip_stats.sh (restart run): resources_used.walltime = 00:02:51
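To make the comparison concrete, here is a small sketch that expresses each grid2grid restart walltime as a fraction of the full-run walltime. The walltimes are copied from the list above; the helper names are illustrative only.

```python
def walltime_to_seconds(walltime):
    """Convert a PBS-style HH:MM:SS walltime string to seconds."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def restart_fraction(full, restart):
    """Restart walltime divided by full-run walltime (1.0 means no savings)."""
    return walltime_to_seconds(restart) / walltime_to_seconds(full)


# (full run, restart run) walltimes for the grid2grid jobs listed above.
GRID2GRID_WALLTIMES = {
    "cmce": ("00:29:12", "00:27:07"),
    "ecme": ("01:05:23", "00:51:19"),
    "gefs": ("00:54:14", "00:56:17"),
    "naefs": ("00:19:08", "00:19:08"),
}

if __name__ == "__main__":
    for model, (full, restart) in GRID2GRID_WALLTIMES.items():
        fraction = restart_fraction(full, restart)
        print(f"{model}: restart took {fraction:.0%} of the full run")
```

A fraction at or above 1.0 means the restart saved no time compared with the full run.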

@AliciaBentley-NOAA
Contributor

@malloryprow @GwenChen-NOAA I agree. These runtimes indicate that the restart capability is not working as intended. The restart runtimes are often just as long (or longer) than the full run runtimes. Are the restart runs possibly not using the restart files being produced?

@GwenChen-NOAA
Contributor Author

> These runtimes indicate that the restart capability is not working as intended. The restart runtimes are often just as long (or longer) than the full run runtimes. Are the restart runs possibly not using the restart files being produced?

@AliciaBentley-NOAA, the restart runs do use the restart files as intended. @malloryprow made a mistake: the resources_used.walltime for jevs_global_ens_gefs_atmos_grid2grid_stats.sh (full run) is 00:57:20 in /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/full_test/jevs_global_ens_gefs_atmos_grid2grid_stats.o206855987, which is longer than the restart run. So the restart runtime of every job is less than or equal to the full runtime.

The benefit of restart is limited for the grid2grid jobs, which run the GenEnsProd, EnsembleStat, and GridStat tasks. GenEnsProd runs very fast, but EnsembleStat and GridStat account for most of the runtime. So if either of those two tasks is interrupted, rerunning them takes about as long as the full run.
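The effect described above can be illustrated with a toy model in which a restart can only skip tasks that finished before the interruption. This is a sketch under stated assumptions, not the EVS implementation: the task durations below are hypothetical round numbers chosen only so that the last two tasks dominate the runtime.

```python
def restart_runtime(task_minutes, interrupted_at):
    """Minutes a restart needs when only fully finished tasks can be skipped.

    task_minutes: ordered list of (task name, duration in minutes).
    interrupted_at: minutes into the original run when it was stopped.
    """
    finished_through = 0
    remaining = 0
    for _name, minutes in task_minutes:
        finished_through += minutes
        if finished_through > interrupted_at:
            # The task had not finished when the job stopped, so it reruns
            # from its beginning, as does every task after it.
            remaining += minutes
    return remaining


# Hypothetical durations (minutes), for illustration only.
TASKS = [("GenEnsProd", 2), ("EnsembleStat", 30), ("GridStat", 25)]

if __name__ == "__main__":
    for stop in (1, 20, 40):
        print(f"interrupted at {stop} min -> restart needs {restart_runtime(TASKS, stop)} min")
```

In this model, a job stopped 20 minutes in has only completed the 2-minute task, so the restart still needs 55 of the 57 minutes.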

@GwenChen-NOAA
Contributor Author

The restart test looks successful to me.

@malloryprow
Contributor

So say the run is partway through EnsembleStat, at forecast hour 168, and then gets interrupted. When the job is restarted, will it start again at forecast hour 0, or at forecast hour 168 where it was interrupted during EnsembleStat?

I feel NCO is going to have a close eye on this since it got a waiver for EVS v1.0.

@AliciaBentley-NOAA
Contributor

AliciaBentley-NOAA commented Nov 20, 2024

@malloryprow @GwenChen-NOAA Thanks for the discussion.

In operations, the purpose of having restart capabilities is to considerably reduce a job's runtime when that job got part way through running and unexpectedly crashed. When jobs crash part way through running, NCO needs to rerun/finish the job as quickly as possible in order for the ops supercomputer to catch back up to where it should be. For example, if a job that typically takes 1 hour to run crashes at 45 minutes, restart capabilities should allow the job to complete in ~15 minutes when it is rerun. My worry is that NCO will not be satisfied with the restart runtimes in these examples and may even send EVS v2.0 back to us to fix. We'd like to avoid that.

Do either of you know how the restart capabilities that Gwen added to global_ens differ from the other restart capabilities in EVS that do considerably reduce runtimes? For example, which component of EVS did Gwen use as a model for these restart updates? Comparing that EVS component with the code in this PR might reveal where things differ and allow us to get the reduced runtimes that NCO expects. Thanks!

@GwenChen-NOAA
Contributor Author

The forecast hour loop is set within the METplus job (EnsembleStat or GridStat) by the VALID_BEG, VALID_END, and VALID_INCREMENT options in the config file. So if the METplus job is interrupted and then restarted, it will start from VALID_BEG again. This is a limitation of METplus, since METplus tools are not designed with restart capability.

The restart setup in this PR mimics the NARRE restart setup (PR #465) that @AliciaBentley-NOAA provided to me. I think this is a common setup for restart in all ensemble stats jobs.

@GwenChen-NOAA
Contributor Author

@malloryprow, please run the restart test. Thanks!

@malloryprow
Contributor

I want to do a full test for the two jobs that changed with the latest commit, especially jevs_global_ens_gefs_atmos_grid2grid_stats.sh, to make sure the memory increase covers it!

I moved the old directories in COMOUT to /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0/stats/global_ens/full_test_try1, so current testing output is in /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0/stats/global_ens.

Full Test

1. jevs_global_ens_gefs_atmos_grid2grid_stats

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o208477253
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.208477253.dbqs01

2. jevs_global_ens_naefs_atmos_grid2grid_stats

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o208477238
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.208477238.dbqs01

@GwenChen-NOAA
Contributor Author

You will need to recompile evs_g2g_adjustCMC.f if you want to test the warning message changes.

@malloryprow
Contributor

> You will need to recompile evs_g2g_adjustCMC.f if you want to test the warning message changes.

I did!

@GwenChen-NOAA
Contributor Author

@malloryprow, the results of the jevs_global_ens_gefs_atmos_grid2grid_stats and jevs_global_ens_naefs_atmos_grid2grid_stats jobs look good. If you can run the restart test today, I will check it before I leave for the AGU meeting next week.

@malloryprow
Contributor

Hi @GwenChen-NOAA, my work day was over at 3:30pm. Given that you are at AGU now, we will put this testing on hold until you return from AGU.

@malloryprow
Contributor

Can you do a sync fork when you get a chance, @GwenChen-NOAA?

@malloryprow
Contributor

RESTART TESTS - grid2grid

COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0.

1. jevs_global_ens_cmce_atmos_grid2grid_stats.sh (interrupted run with walltime of 15 minutes)

Interrupted Run Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o209305213
Interrupted Run DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_grid2grid_stats.209305213.dbqs01
Restart Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o209306324
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_grid2grid_stats.209306324.dbqs01

2. jevs_global_ens_ecme_atmos_grid2grid_stats.sh (interrupted run with walltime of 35 minutes)

Interrupted Run Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o209305343
Interrupted Run DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_grid2grid_stats.209305343.dbqs01
Restart Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o209307529
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_grid2grid_stats.209307529.dbqs01

3. jevs_global_ens_gefs_atmos_grid2grid_stats.sh (interrupted run with walltime of 20 minutes)

Interrupted Run Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o209305414
Interrupted Run DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.209305414.dbqs01
Restart Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o209306835
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.209306835.dbqs01

4. jevs_global_ens_naefs_atmos_grid2grid_stats.sh (interrupted run with walltime of 10 minutes)

Interrupted Run Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o209305624
Interrupted Run DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.209305624.dbqs01
Restart Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o209306387
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.209306387.dbqs01

@malloryprow
Contributor

It looks like the number of lines in the final stat file is double that in the parallel.

/lfs/h2/emc/vpppg/noscrub/emc.vpppg/evs/v2.0/stats/global_ens/cmce.20241214/evs.stats.cmce.atmos.grid2grid.v20241214.stat: 15625
/lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0/stats/global_ens/cmce.20241214: 31249

If you subtract 1 from each for the header line, it is exactly double.

I think -lookin for stat_analysis needs to point somewhere else. It is currently /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0/stats/global_ens/atmos.20241214/cmce/grid2grid, which holds both the files in the top level of that directory and everything in restart.

12/16 13:11:10.319 metplus.5154bc9c INFO: COMMAND: /apps/ops/para/libs/intel/19.1.3.304/met/12.0.0-beta5/bin/stat_analysis -v 2 -lookin /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0/stats/global_ens/atmos.20241214/cmce/grid2grid -config /apps/ops/para/libs/intel/19.1.3.304/metplus/6.0.0-beta5/parm/met_config/STATAnalysisConfig_wrapped
DEBUG 1: Start stat_analysis by mallory.row(32288) at 2024-12-16 13:11:10Z  cmd: /apps/ops/para/libs/intel/19.1.3.304/met/12.0.0-beta5/bin/stat_analysis -v 2 -lookin /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0/stats/global_ens/atmos.20241214/cmce/grid2grid -config /apps/ops/para/libs/intel/19.1.3.304/metplus/6.0.0-beta5/parm/met_config/STATAnalysisConfig_wrapped
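The "exactly double" observation can be checked with the two line counts reported above. This is a small worked example; the helper name is illustrative, and the single header line per file is the assumption stated in the comment itself.

```python
def data_lines(total_lines, header_lines=1):
    """Number of data rows in a stat file after removing the header line(s)."""
    return total_lines - header_lines


parallel_total = 15625  # line count reported for the parallel final stat file
restart_total = 31249   # line count reported for the restart-test final stat file

# Ratio of data rows between the restart-test file and the parallel file.
ratio = data_lines(restart_total) / data_lines(parallel_total)

if __name__ == "__main__":
    print(data_lines(parallel_total), data_lines(restart_total), ratio)
```

A ratio of exactly 2 is consistent with every small stat file being read twice, once from the top-level directory and once from the restart copy.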

@GwenChen-NOAA
Contributor Author

> Can you do a sync fork when you get a chance, @GwenChen-NOAA?

Just did.

@GwenChen-NOAA
Contributor Author

> It looks like the numbers of lines in the final stat file is double of that in the parallel.
>
> /lfs/h2/emc/vpppg/noscrub/emc.vpppg/evs/v2.0/stats/global_ens/cmce.20241214/evs.stats.cmce.atmos.grid2grid.v20241214.stat: 15625
> /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0/stats/global_ens/cmce.20241214: 31249
> If you subtract 1 from each for the header line, it is exactly double.
>
> I think there needs to be an update to where -lookin for stat_analysis is pointing to. It is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0/stats/global_ens/atmos.20241214/cmce/grid2grid where both the file in top level of that directory and everything in restart are being kept.

@malloryprow, do you know a way to restrict stat_analysis to grab only the small stat files in the top level of -lookin? It is set on line 83 of the StatAnlysis_fcstGENS_obsAnalysis_GatherByDay.conf file.

@malloryprow
Contributor

Can you do MODEL1_STAT_ANALYSIS_LOOKIN_DIR = {ENV[stat_file_dir]}/*.stat?
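The difference between pointing at the directory and pointing at a top-level *.stat pattern can be illustrated with plain file globbing. This is only an analogy for the behavior reported in this thread, not a description of how stat_analysis is implemented; the small stat file names are hypothetical, and only the restart subdirectory name comes from the discussion.

```python
import tempfile
from pathlib import Path


def top_level_stat_files(stat_dir):
    """Names of *.stat files directly inside stat_dir (subdirectories ignored)."""
    return sorted(p.name for p in stat_dir.glob("*.stat"))


def all_stat_files(stat_dir):
    """Relative paths of *.stat files anywhere under stat_dir, searched recursively."""
    return sorted(p.relative_to(stat_dir).as_posix() for p in stat_dir.rglob("*.stat"))


def demo():
    """Build a throwaway tree with a restart copy of each small stat file."""
    with tempfile.TemporaryDirectory() as tmp:
        stat_dir = Path(tmp)
        (stat_dir / "restart").mkdir()
        for name in ("small_f000.stat", "small_f024.stat"):  # hypothetical names
            (stat_dir / name).write_text("HEADER\nrow\n")
            (stat_dir / "restart" / name).write_text("HEADER\nrow\n")
        return top_level_stat_files(stat_dir), all_stat_files(stat_dir)


if __name__ == "__main__":
    top, everything = demo()
    print(len(top), len(everything))
```

The recursive search finds twice as many files as the top-level pattern, which matches the doubling seen in the final stat file.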

@malloryprow
Contributor

Try that and let me know how it works! If it doesn't, we can look into a different solution.

@GwenChen-NOAA
Contributor Author

> Try that and let me know how it works! If it doesn't, we can look into a different solution.

Ok.

@GwenChen-NOAA
Contributor Author

@malloryprow, it works. The new final stat files are the same size as those in the parallel. The restart test looks good to me. You may run the precip jobs to confirm.

@malloryprow
Contributor

COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0.

I reran the grid2grid jobs with the change. Essentially all the small stat files were already there, so each job only generated a new final stat file. They look a lot better now.

1. jevs_global_ens_cmce_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_grid2grid_stats.o209325989
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_grid2grid_stats.209325989.dbqs01

2. jevs_global_ens_ecme_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_grid2grid_stats.o209325997
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_grid2grid_stats.209325997.dbqs01

3. jevs_global_ens_gefs_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_grid2grid_stats.o209326005
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_grid2grid_stats.209326005.dbqs01

4. jevs_global_ens_naefs_atmos_grid2grid_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_grid2grid_stats.o209326013
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_grid2grid_stats.209326013.dbqs01

It looks like all the precip jobs run in less than 15 minutes, so they don't need restart. I just ran a full test for them, but I can try to test restart for them if you'd like. Some of them are pretty quick:

/lfs/h2/emc/ptmp/emc.vpppg/output/jevs_global_ens_cmce_atmos_precip_stats.o209297002: resources_used.walltime = 00:03:31
/lfs/h2/emc/ptmp/emc.vpppg/output/jevs_global_ens_ecme_atmos_precip_stats.o209297003: resources_used.walltime = 00:08:16
/lfs/h2/emc/ptmp/emc.vpppg/output/jevs_global_ens_gefs_atmos_precip_stats.o209297001: resources_used.walltime = 00:05:00
/lfs/h2/emc/ptmp/emc.vpppg/output/jevs_global_ens_naefs_atmos_precip_stats.o209297004: resources_used.walltime = 00:03:31

5. jevs_global_ens_cmce_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o209327102
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_precip_stats.209327102.dbqs01

6. jevs_global_ens_ecme_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o209327117
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_precip_stats.209327117.dbqs01

7. jevs_global_ens_gefs_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o209327267
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_precip_stats.209327267.dbqs01

8. jevs_global_ens_naefs_atmos_precip_stats.sh

Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o209327396
DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_precip_stats.209327396.dbqs01

@malloryprow
Contributor

The full precip runs finished with these walltimes:

jevs_global_ens_cmce_atmos_precip_stats.o209327102: resources_used.walltime = 00:08:13
jevs_global_ens_ecme_atmos_precip_stats.o209327117: resources_used.walltime = 00:13:05
jevs_global_ens_gefs_atmos_precip_stats.o209327267: resources_used.walltime = 00:11:06
jevs_global_ens_naefs_atmos_precip_stats.o209327396: resources_used.walltime = 00:08:15
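
The 15-minute check above can also be scripted. This is a minimal sketch, assuming the `resources_used.walltime = HH:MM:SS` line format shown in this thread; the helper names `parse_walltimes` and `needs_restart` are hypothetical and not part of EVS.

```python
import re
from datetime import timedelta

# Threshold from the PR checklist: jobs over 15 minutes need restart capability.
RESTART_THRESHOLD = timedelta(minutes=15)

# Matches lines such as
#   jevs_..._stats.o209327102: resources_used.walltime = 00:08:13
WALLTIME_RE = re.compile(
    r"(?P<job>jevs_\w+)\.o\d+:\s*resources_used\.walltime\s*=\s*"
    r"(?P<h>\d+):(?P<m>\d+):(?P<s>\d+)"
)

SAMPLE = """\
jevs_global_ens_cmce_atmos_precip_stats.o209327102: resources_used.walltime = 00:08:13
jevs_global_ens_ecme_atmos_precip_stats.o209327117: resources_used.walltime = 00:13:05
jevs_global_ens_gefs_atmos_precip_stats.o209327267: resources_used.walltime = 00:11:06
jevs_global_ens_naefs_atmos_precip_stats.o209327396: resources_used.walltime = 00:08:15
"""


def parse_walltimes(text):
    """Return {job name: walltime} for every walltime line found in text."""
    walltimes = {}
    for match in WALLTIME_RE.finditer(text):
        walltimes[match["job"]] = timedelta(
            hours=int(match["h"]),
            minutes=int(match["m"]),
            seconds=int(match["s"]),
        )
    return walltimes


def needs_restart(walltimes, threshold=RESTART_THRESHOLD):
    """Return the sorted job names whose walltime exceeds the threshold."""
    return sorted(job for job, wt in walltimes.items() if wt > threshold)


if __name__ == "__main__":
    over = needs_restart(parse_walltimes(SAMPLE))
    print("Jobs over 15 minutes:", over or "none")
```

With the four walltimes listed above, no job exceeds the threshold, which matches the conclusion that restart is not strictly required for the precip jobs.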

@GwenChen-NOAA
Contributor Author

The new test run looks very good to me. Yes, all precip jobs are under 15 min. I added the restart since precip is included in the evs_global_ens_atmos_grid2grid.sh script.

@malloryprow
Contributor

malloryprow commented Dec 16, 2024

Ah, got it. Would you like me to test it then, to make sure the changes work even though restart isn't technically needed?

@GwenChen-NOAA
Contributor Author

Ah, got it. Would you like me to test it then, to make sure the changes work even though restart isn't technically needed?

Sure, in case it's needed in the future.

@malloryprow
Contributor

COMOUT is /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/evs/v2.0. The interrupted runs each ran for 5 min.

1. jevs_global_ens_cmce_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o209329557
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_precip_stats.209329557.dbqs01
Restart Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_cmce_atmos_precip_stats.o209330424
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_cmce_atmos_precip_stats.209330424.dbqs01

2. jevs_global_ens_ecme_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o209329612
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_precip_stats.209329612.dbqs01
Restart Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_ecme_atmos_precip_stats.o209330432
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_ecme_atmos_precip_stats.209330432.dbqs01

3. jevs_global_ens_gefs_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o209329621
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_precip_stats.209329621.dbqs01
Restart Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_gefs_atmos_precip_stats.o209330478
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_gefs_atmos_precip_stats.209330478.dbqs01

4. jevs_global_ens_naefs_atmos_precip_stats.sh

Interrupted Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o209329642
Interrupted DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_precip_stats.209329642.dbqs01
Restart Log File: /lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/dev/drivers/scripts/stats/global_ens/jevs_global_ens_naefs_atmos_precip_stats.o209330504
Restart DATA: /lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/jevs_global_ens_naefs_atmos_precip_stats.209330504.dbqs01
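
When reviewing a list like this, it can help to confirm that each log file and DATA directory refer to the same job. The sketch below assumes the naming convention visible in the paths in this thread (`<job>.o<jobid>` for logs and `<job>.<jobid>.<host>` for DATA directories); the function names are hypothetical and not part of EVS.

```python
import re
from pathlib import PurePosixPath

# Assumed naming, inferred from the paths in this thread:
#   log file:  <job>.o<jobid>
#   DATA dir:  <job>.<jobid>.<host>
LOG_RE = re.compile(r"^(?P<job>.+)\.o(?P<id>\d+)$")
DATA_RE = re.compile(r"^(?P<job>.+?)\.(?P<id>\d+)\.[^.]+$")


def log_key(log_path):
    """Return (job name, job id) parsed from a log file path."""
    match = LOG_RE.match(PurePosixPath(log_path).name)
    if match is None:
        raise ValueError(f"unrecognized log file name: {log_path}")
    return match["job"], match["id"]


def data_key(data_path):
    """Return (job name, job id) parsed from a DATA directory path."""
    match = DATA_RE.match(PurePosixPath(data_path).name)
    if match is None:
        raise ValueError(f"unrecognized DATA directory name: {data_path}")
    return match["job"], match["id"]


def pair_matches(log_path, data_path):
    """True when the log file and DATA directory name the same job and id."""
    return log_key(log_path) == data_key(data_path)


if __name__ == "__main__":
    log = (
        "/lfs/h2/emc/vpppg/noscrub/mallory.row/verification/EVS_PRs/pr604/EVS/"
        "dev/drivers/scripts/stats/global_ens/"
        "jevs_global_ens_cmce_atmos_precip_stats.o209329557"
    )
    data = (
        "/lfs/h2/emc/stmp/mallory.row/evs_test/prod/tmp/"
        "jevs_global_ens_cmce_atmos_precip_stats.209329557.dbqs01"
    )
    print(pair_matches(log, data))
```

A `False` result would point to a copy-paste slip in either the log path or the DATA path for that entry.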

@GwenChen-NOAA
Contributor Author

The precip test run is successful. I think this PR is ready to be merged.

Contributor

@malloryprow malloryprow left a comment

Changes look good and testing was successful.

Thank you @GwenChen-NOAA!

Contributor

@AliciaBentley-NOAA AliciaBentley-NOAA left a comment

@malloryprow @GwenChen-NOAA Before I approve this PR to be merged, can we confirm that the ecme and naefs .ecf scripts changed in this PR match their corresponding dev drivers? Four .ecf scripts were changed but only two dev drivers. Once this is confirmed, please go ahead and merge.

CC @malloryprow @GwenChen-NOAA

@malloryprow
Contributor

@malloryprow malloryprow merged commit 0e2b25a into NOAA-EMC:develop Dec 16, 2024