Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/mesoscale restart NAM/RAP stats #652

Open
wants to merge 20 commits into
base: develop
Choose a base branch
from

Conversation

PerryShafran-NOAA
Copy link
Contributor

Note to developers: You must use this PR template!

Description of Changes

Please include a summary of the changes and the related GitHub issue(s). Please also include relevant motivation and context.

Relating to issue #533 , restart for NAM and RAP stats, which is related to Bugzilla 1547. This PR updates the restart so each file is copied to the restart directory as it is created.

Developer Questions and Checklist

  • Is this a high priority PR? If so, why and is there a date it needs to be merged by?

No.

  • Do you have any planned upcoming annual leave/PTO?

No.

  • Are there any changes needed in the times when the jobs are supposed to run/kick-off?

There might be but that will not be tested in this PR and will be considered for a future PR.

  • The code changes follow NCO's EE2 Standards.
  • Developer's name is removed throughout the code and have used ${USER} where necessary throughout the code.
  • References the feature branch for HOMEevs are removed from the code.
  • J-Job environment variables, COMIN and COMOUT directories, and output follow what has been defined for EVS.
  • Jobs over 15 minutes in runtime have restart capability.
  • If applicable, changes in the dev/drivers/scripts or dev/modulefiles have been made in the corresponding ecf/scripts and ecf/defs/evs-nco.def?
  • Jobs contain the appropriate file checking and don't run METplus for any missing data.
  • Code is using METplus wrappers structure and not calling MET executables directly.
  • Log is free of any ERRORs or WARNINGs.

Testing Instructions

Please include testing instructions for the PR assignee. Include all relevant input datasets needed to run the tests.

  1. Clone the repository.
  2. Checkout the branch feature/mesoscale_restart.
  3. Link the fix directory.
  4. Go to the directory dev/drivers/scripts/stats/mesoscale
  5. In the files jevs_mesoscale_nam_stats.sh and jevs_mesoscale_rap_stats.sh, add in the line:
    export EVSINspcotlk=/lfs/h2/emc/vpppg/noscrub/emc.vpppg/evs/v2.0/prep/cam
  6. Run each of these scripts, using the options qsub -v vhr=07
    a) Run as normal to get a baseline for the final size of the stats file. Save this file to compare to the final file for restart.
    b) Set the wallclock to 15 minutes to get an interrupted run. You know the run is interrupted when you see the wallclock exceeded message and some SIGTERM error messages, and there is no final stats file written.
    c) Set the wallclock to the standard time (1 hr for NAM, 1 hr 30 min for RAP) and run again, without clearing the small stats directory. This will get you the final stats file to compare to the baseline from step 6a).

@malloryprow malloryprow self-assigned this Jan 16, 2025
@malloryprow malloryprow linked an issue Jan 16, 2025 that may be closed by this pull request
@malloryprow malloryprow added this to the EVS v2.0.0 milestone Jan 16, 2025
@malloryprow
Copy link
Contributor

A few things:

  1. I noticed this in Feature/href MPMD and other fixes  #647. The completed files for restart are being written directly to COMOUT. It should be written to DATA and then copied to COMOUT if SENDCOM=YES.

  2. Since the function copy_data_to_restart is copying to COMOUT. There should be a check for SENDCOM=YES.

  3. There are many lines in ush/mesoscale/mesoscale_stats_grid2obs_create_job_script.py that start with #python -c or #{metplus_launcher}. If this is being written to the job file as a commented out line, is it needed at all?

@PerryShafran-NOAA
Copy link
Contributor Author

PerryShafran-NOAA commented Jan 16, 2025

I'd like @MarcelCaron-NOAA to comment on these items. He had already addressed 3 elsewhere, but it's worth repeating here, but I don't recall the reason for that, but there was a reason.

I also thought that files were generated in the working directories and then copied to COMOUT, again the impression that Marcel gave me. Is this not the case?

@malloryprow
Copy link
Contributor

For 1., I see export RESTART_DIR="${COMOUTsmall}/restart/c${vhr}" in parm/evs_config/mesoscale/config.evs.prod.stats.mesoscale.atmos.grid2obs.[nam][rap]. Then in ush/mesoscale/mesoscale_stats_grid2obs_create_job_script.py one line 38 this get read in by RESTART_DIR = os.environ['RESTART_DIR']. Then on lines 413-418, for example, is mesoscale_util.mark_job_completed("+ f"\"{os.path.join(RESTART_DIR, COMPLETED_JOBS_FILE)}. The same sort of logic and be followed for 2. but look at copy_data_to_restart.

@PerryShafran-NOAA
Copy link
Contributor Author

@malloryprow I'm looking through one of my .o files, and I can only see instances of data being copied from the working directory DATA to the COMOUT restart directory.

However, I do believe that you may be thinking about this?

26 + python -c 'import mesoscale_util; mesoscale_util.mark_job_completed("/lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/stats/
mesoscale/atmos.20250113/nam/grid2obs/restart/c07/completed_jobs.txt", "job36", job_type="generate")'

So basically the completed_jobs.txt file is actually the file that's being directly written in COMOUT, but data itself is being generated in DATA and copied to COMOUT. Just this file is the one that is being written directly. Is this file the issue?

@PerryShafran-NOAA
Copy link
Contributor Author

OK, your comment confirms that it is the completed_jobs.txt file that's at issue. I'll need to figure out how this file can be written in DATA and then copied over to COMOUT at the end of the run.

@malloryprow
Copy link
Contributor

Okay about the restart; the code looks like it is doing that but it must not be! Is the code checking that SENDCOM=YES before copying to COMOUT?

And yup, the completed file can't be written directly to COMOUT just like all other files.

@PerryShafran-NOAA
Copy link
Contributor Author

As far as I know there aren't any checks for SENDCOM anywhere in the python scripts for restart.

So there are two issues here to deal with:

  1. The completed_jobs.txt file is being written directly to COMOUT and it shouldn't be.
  2. There are no checks for SENDCOM.

But all data itself (obs files, stat files, etc) are being written in DATA and then copied over to COMOUT. Only the completed_jobs.txt file is being written directly in COMOUT.

@malloryprow
Copy link
Contributor

Yes, sounds good! Let me know if you need any help!

@PerryShafran-NOAA
Copy link
Contributor Author

I think I may need help but I'll contact you off this PR thread and not muddle this thread with coding assistance.

@MarcelCaron-NOAA
Copy link
Contributor

I'd like @MarcelCaron-NOAA to comment on these items. He had already addressed 3 elsewhere, but it's worth repeating here, but I don't recall the reason for that, but there was a reason.

The commented-out commands help with debugging. If only "generate/job2" completed successfully, we could skip writing it in a restart run (though that would require code changes). However, during development and debugging, it's useful to see on the card what was skipped and what the environment would have been, for example when trying to track down missing data or test-run the job on its own. This way, the successfully completed job card reflects what would have run, and, when submitted during restart, completes quickly.

@MarcelCaron-NOAA
Copy link
Contributor

  • The completed_jobs.txt file is being written directly to COMOUT and it shouldn't be.
  • There are no checks for SENDCOM.
  • I haven't looked deeply but this seems to be the case. The fix would likely be here (for cam, same function in the equivalent file for mesoscale). To fix, we could maybe write SENDCOM to each job card, feed in the DATA and SENDCOM values to the above function, grab the filename from "completed_jobs_file" path, write the file to DATA/filename rather than to completed_jobs_file, then check SENDCOM and copy DATA/filename to completed_jobs_file.
  • Yup, cam also copies to comout without checking if SENDCOM=YES. The fix may be ~simple if we export "SENDCOM" in each job card, then feed that value into "copy_data_to_restart()", checking the value maybe before lines 718-720 in ush/cam/cam_util.py and the equivalent lines in ush/mesoscale/mesoscale_util.py.

@MarcelCaron-NOAA
Copy link
Contributor

MarcelCaron-NOAA commented Jan 16, 2025

  • The completed_jobs.txt file is being written directly to COMOUT and it shouldn't be.
  • There are no checks for SENDCOM.
  • I haven't looked deeply but this seems to be the case. The fix would likely be here (for cam, same function in the equivalent file for mesoscale). To fix, we could maybe write SENDCOM to each job card, feed in the DATA and SENDCOM values to the above function, grab the filename from "completed_jobs_file" path, write the file to DATA/filename rather than to completed_jobs_file, then check SENDCOM and copy DATA/filename to completed_jobs_file.
  • Yup, cam also copies to comout without checking if SENDCOM=YES. The fix may be ~simple if we export "SENDCOM" in each job card, then feed that value into "copy_data_to_restart()", checking the value maybe before lines 718-720 in ush/cam/cam_util.py and the equivalent lines in ush/mesoscale/mesoscale_util.py.

On second thought, SENDCOM can be checked in the job card itself ("if SENDCOM, then copy to restart")

@PerryShafran-NOAA
Copy link
Contributor Author

  • I haven't looked deeply but this seems to be the case. The fix would likely be here (for cam, same function in the equivalent file for mesoscale). To fix, we could maybe write SENDCOM to each job card, feed in the DATA and SENDCOM values to the above function, grab the filename from "completed_jobs_file" path, write the file to DATA/filename rather than to completed_jobs_file, then check SENDCOM and copy DATA/filename to completed_jobs_file.

@MarcelCaron-NOAA Just wanting to confirm, the link "here" for the fix would be the fix to ensure that the completed_jobs.txt file is written in DATA and then copied to COMOUT?

@PerryShafran-NOAA
Copy link
Contributor Author

@MarcelCaron-NOAA It should be noted that the block of code highlighted in your link to the fix already exists in my mesoscale_util.py file. See line 382 here:

https://github.com/PerryShafran-NOAA/EVS/blob/feature/mesoscale_restart/ush/mesoscale/mesoscale_util.py

@MarcelCaron-NOAA
Copy link
Contributor

MarcelCaron-NOAA commented Jan 16, 2025

  • I haven't looked deeply but this seems to be the case. The fix would likely be here (for cam, same function in the equivalent file for mesoscale). To fix, we could maybe write SENDCOM to each job card, feed in the DATA and SENDCOM values to the above function, grab the filename from "completed_jobs_file" path, write the file to DATA/filename rather than to completed_jobs_file, then check SENDCOM and copy DATA/filename to completed_jobs_file.

@MarcelCaron-NOAA Just wanting to confirm, the link "here" for the fix would be the fix to ensure that the completed_jobs.txt file is written in DATA and then copied to COMOUT?

@PerryShafran-NOAA Yes, the fix to write in DATA then copy to COMOUT

@PerryShafran-NOAA
Copy link
Contributor Author

@malloryprow @MarcelCaron-NOAA I think the way to solve this would be to do the following, and I think it would be in mesoscale_stats_grid2obs_create_job_script.py is to do the following:

In each case where we see something like:

                  job_cmd_list_iterative.append(
                       "python -c "
                       + f"'import mesoscale_util; mesoscale_util.mark_job_completed("
                       + f"\"{os.path.join(RESTART_DIR, COMPLETED_JOBS_FILE)}\", "
                       + f"\"job{njob}\", job_type=\"{job_type}\")'"

Which is where the completed jobs file is currently written to the restart directory RESTART_DIR, we need to change RESTART_DIR to something else, maybe a COMPLETED_JOBS_DIR where we can define in the parm file and read in to this script. Then after this line, we can add a copy_data_to_restart line that starts like this:

                   job_cmd_list_iterative.append(
                       f'python -c '
                       + '\"import mesoscale_util as cutil; cutil.copy_data_to_restart('
                       + '\\\"${DATA}\\\", \\\"${RESTART_DIR}\\\", '

The additional copy_data_to_restart would then copy this completed jobs file each time to the RESTART_DIR directory, so it continues to grow in COMOUT like it does now. It would have to be done every time a job is completed, as the file contains critical information as to when a job is completed which would be needed in a restart case.

The question I have is I'm not to sure how to formulate the copy_data_to_restart command for the completed jobs file. If I could have some assistance there from either of you, that would be great.

@MarcelCaron-NOAA
Copy link
Contributor

@MarcelCaron-NOAA It should be noted that the block of code highlighted in your link to the fix already exists in my mesoscale_util.py file. See line 382 here:

https://github.com/PerryShafran-NOAA/EVS/blob/feature/mesoscale_restart/ush/mesoscale/mesoscale_util.py

@PerryShafran-NOAA yes the mesoscale bit of code would be the relevant bit for the fix to mesoscale. I only linked cam because it's not currently being worked on and hasn't been modified from develop!

@PerryShafran-NOAA
Copy link
Contributor Author

@MarcelCaron-NOAA It should be noted that the block of code highlighted in your link to the fix already exists in my mesoscale_util.py file. See line 382 here:
https://github.com/PerryShafran-NOAA/EVS/blob/feature/mesoscale_restart/ush/mesoscale/mesoscale_util.py

@PerryShafran-NOAA yes the mesoscale bit of code would be the relevant bit for the fix to mesoscale. I only linked cam because it's not currently being worked on and hasn't been modified from develop!

I'm saying that your fix from the cam script already exists in the mesoscale script, if I'm looking at my own code correctly.

@PerryShafran-NOAA
Copy link
Contributor Author

To clarify, this block of code is what you cite as the fix to write the completed jobs file in DATA and then copy to restart:

def mark_job_completed(completed_jobs_file, job_name, job_type=""):
    with open(completed_jobs_file, 'a') as f:
        if job_type:
            f.write(job_type + "_" + job_name + "\n")
        else:
            f.write(job_name + "\n")

This block of code is already in the mesoscale_util.py script. And as far as I can tell, the file is still being written directly to COMOUT even with this block of code in there.

Mallory had identified the issue as stemming from the block of code in mesoscale_stats_grid2obs_create_job_script.py that writes directly to COMOUT, which is this:

        job_cmd_list.append(
            "python -c "
            + f"'import mesoscale_util; mesoscale_util.mark_job_completed("
            + f"\"{os.path.join(RESTART_DIR, COMPLETED_JOBS_FILE)}\", "
            + f"\"job{njob}\", job_type=\"{job_type}\")'"

This needs to be modified I think.

@malloryprow
Copy link
Contributor

The second block of code is calling the first block of code. Marcel is saying the the first block needs to be modified.

@PerryShafran-NOAA
Copy link
Contributor Author

OK, so the block of code, the def_mark_job_completed block, is what needs to be modified from what it is now?

@malloryprow
Copy link
Contributor

Yes, I believe so.

@PerryShafran-NOAA
Copy link
Contributor Author

Oh, OK, that makes this more clear. Though I'm not entirely sure how this is fixed, but it sounds like Marcel is developing the fix in cam.

Can we start to run the review now, though? We could get the restart working and merged, and then potentially add the remaining issues to the Fixes and Additions list for a future PR. I'm also OK with taking the time to get these fixes in now so we get it done and don't have to worry about it later.

@malloryprow
Copy link
Contributor

I think I'd like to see this fix in this PR then we can say that restart for grid2obs for mesoscale is complete rather than waiting on another PR that addresses issues with the restart.

@PerryShafran-NOAA
Copy link
Contributor Author

OK, fair enough.

@MarcelCaron-NOAA I think you said you're working on the fix for CAM?

@MarcelCaron-NOAA
Copy link
Contributor

@PerryShafran-NOAA No I am not currently working on this fix, but it does need to be added to cam. I don't think it will be a complicated fix, but I'm not sure how quickly I'll be able to get to it.

@PerryShafran-NOAA
Copy link
Contributor Author

Understood. Perhaps I can work on this with @malloryprow to get a fix in this code.

@PerryShafran-NOAA
Copy link
Contributor Author

@malloryprow I have successfully addressed the issues from above:

  1. The completed jobs file is now being written in DATA, the working directory, and is sent to COMOUT instead of being written directly to COMOUT.
  2. There are SENDCOM checks in place for both the data and also the completed jobs file.

I think we can now re-start testing for this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mesoscale: Add restart for mesoscale grid2obs stats
3 participants