Feature/mesoscale restart NAM/RAP stats #652
base: develop
Conversation
A few things:
|
I'd like @MarcelCaron-NOAA to comment on these items. He had already addressed item 3 elsewhere, but it's worth repeating here; I don't recall the reason for it, but there was one. I also thought that files were generated in the working directories and then copied to COMOUT; that was the impression Marcel gave me. Is this not the case? |
For 1., I see export |
@malloryprow I'm looking through one of my .o files, and I can only see instances of data being copied from the working directory DATA to the COMOUT restart directory. However, I do believe that you may be thinking about this?
So basically the completed_jobs.txt file is actually the file that's being directly written in COMOUT, but data itself is being generated in DATA and copied to COMOUT. Just this file is the one that is being written directly. Is this file the issue? |
OK, your comment confirms that it is the completed_jobs.txt file that's at issue. I'll need to figure out how this file can be written in DATA and then copied over to COMOUT at the end of the run. |
Okay about the restart; the code looks like it is doing that but it must not be! Is the code checking that And yup, the completed file can't be written directly to COMOUT just like all other files. |
As far as I know there aren't any checks for SENDCOM anywhere in the python scripts for restart. So there are two issues here to deal with:
But all data itself (obs files, stat files, etc) are being written in DATA and then copied over to COMOUT. Only the completed_jobs.txt file is being written directly in COMOUT. |
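A minimal sketch of what a SENDCOM check around a restart copy could look like. The helper name `copy_if_sendcom` and the convention that `SENDCOM` is `YES` when copying to COMOUT is enabled are assumptions for illustration; this is not the existing EVS code.

```python
import os
import shutil


def copy_if_sendcom(src_file, dest_dir, env=None):
    """Copy src_file into dest_dir only when SENDCOM is set to YES.

    Hypothetical helper: returns the destination path if the copy
    happened, otherwise None.
    """
    env = os.environ if env is None else env
    # Assumption: SENDCOM=YES means output may be sent to COMOUT.
    if env.get("SENDCOM", "NO").upper() != "YES":
        return None
    # Nothing to copy if the working-directory file was never created.
    if not os.path.isfile(src_file):
        return None
    os.makedirs(dest_dir, exist_ok=True)
    dest_file = os.path.join(dest_dir, os.path.basename(src_file))
    shutil.copy2(src_file, dest_file)
    return dest_file
```

Passing `env` explicitly makes the check easy to exercise without changing the real environment.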
Yes, sounds good! Let me know if you need any help! |
I think I may need help but I'll contact you off this PR thread and not muddle this thread with coding assistance. |
The commented-out commands help with debugging. If only "generate/job2" completed successfully, we could skip writing it in a restart run (though that would require code changes). However, during development and debugging, it's useful to see on the card what was skipped and what the environment would have been, for example when trying to track down missing data or test-run the job on its own. This way, the successfully completed job card reflects what would have run, and, when submitted during restart, completes quickly. |
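As an illustration of the behavior described above, here is a rough sketch of a job card writer that keeps the commands of an already completed job but comments them out. The function name `write_job_card` and its arguments are hypothetical and not taken from the EVS scripts.

```python
def write_job_card(path, job_name, commands, completed_jobs):
    """Write a job script; comment out its commands if the job already ran.

    Hypothetical helper: returns True when the job was marked as skipped.
    """
    skip = job_name in completed_jobs
    with open(path, "w") as f:
        f.write("#!/bin/bash\n")
        f.write("set -x\n")
        if skip:
            f.write(f"# {job_name} already completed; commands kept for reference\n")
        for cmd in commands:
            # A completed job keeps its commands on the card, commented out,
            # so the card still shows what would have run.
            f.write(f"# {cmd}\n" if skip else f"{cmd}\n")
    return skip
```

A card written this way for a completed job contains no active commands, so it finishes almost immediately when submitted during a restart.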
|
On second thought, |
@MarcelCaron-NOAA Just wanting to confirm, the link "here" for the fix would be the fix to ensure that the completed_jobs.txt file is written in DATA and then copied to COMOUT? |
@MarcelCaron-NOAA It should be noted that the block of code highlighted in your link to the fix already exists in my mesoscale_util.py file. See line 382 here: |
@PerryShafran-NOAA Yes, the fix to write in DATA then copy to COMOUT |
@malloryprow @MarcelCaron-NOAA I think the way to solve this, in mesoscale_stats_grid2obs_create_job_script.py, would be the following. In each case where we see something like:
This is where the completed jobs file is currently written to the restart directory RESTART_DIR. We need to change RESTART_DIR to something else, maybe a COMPLETED_JOBS_DIR that we can define in the parm file and read into this script. Then after this line, we can add a copy_data_to_restart line that starts like this:
The additional copy_data_to_restart would then copy this completed jobs file to the RESTART_DIR directory each time, so it continues to grow in COMOUT like it does now. It would have to be done every time a job is completed, as the file contains critical information about when a job is completed, which would be needed in a restart case. The question I have is that I'm not too sure how to formulate the copy_data_to_restart command for the completed jobs file. If I could have some assistance there from either of you, that would be great. |
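One way the idea above could look, as a rough sketch only: append to the completed jobs file in a working directory, then refresh the copy in the restart directory after every completed job. The function name `mark_job_completed_in_data` and its parameters are hypothetical; the sketch uses plain `shutil` calls rather than the real `copy_data_to_restart`, whose signature is not shown in this thread.

```python
import os
import shutil


def mark_job_completed_in_data(completed_jobs_dir, restart_dir, job_name,
                               filename="completed_jobs.txt"):
    """Record job_name in the working directory, then copy to restart.

    Hypothetical helper: completed_jobs_dir stands in for a directory
    under DATA, restart_dir for the restart directory under COMOUT.
    """
    os.makedirs(completed_jobs_dir, exist_ok=True)
    os.makedirs(restart_dir, exist_ok=True)
    work_file = os.path.join(completed_jobs_dir, filename)
    restart_file = os.path.join(restart_dir, filename)
    # On a restart, seed the working copy from the restart directory so
    # jobs finished by the interrupted run are not lost.
    if not os.path.exists(work_file) and os.path.exists(restart_file):
        shutil.copy2(restart_file, work_file)
    # The only direct write happens in the working directory.
    with open(work_file, "a") as f:
        f.write(job_name + "\n")
    # Copy to a temporary name first, then rename, so a job killed
    # mid-copy cannot leave a truncated file in the restart directory.
    tmp_file = restart_file + ".tmp"
    shutil.copy2(work_file, tmp_file)
    os.replace(tmp_file, restart_file)
    return restart_file
```

Copying after every completed job keeps the restart copy current even if the run later hits the wallclock limit.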
@PerryShafran-NOAA yes the mesoscale bit of code would be the relevant bit for the fix to mesoscale. I only linked cam because it's not currently being worked on and hasn't been modified from develop! |
I'm saying that your fix from the cam script already exists in the mesoscale script, if I'm looking at my own code correctly. |
To clarify, this block of code is what you cite as the fix to write the completed jobs file in DATA and then copy to restart:
This block of code is already in the mesoscale_util.py script. And as far as I can tell, the file is still being written directly to COMOUT even with this block of code in there. Mallory had identified the issue as stemming from the block of code in mesoscale_stats_grid2obs_create_job_script.py that writes directly to COMOUT, which is this:
This needs to be modified I think. |
The second block of code is calling the first block of code. Marcel is saying that the first block needs to be modified. |
OK, so the block of code, the def_mark_job_completed block, is what needs to be modified from what it is now? |
Yes, I believe so. |
Oh, OK, that makes this more clear. Though I'm not entirely sure how this is fixed, but it sounds like Marcel is developing the fix in cam. Can we start to run the review now, though? We could get the restart working and merged, and then potentially add the remaining issues to the Fixes and Additions list for a future PR. I'm also OK with taking the time to get these fixes in now so we get it done and don't have to worry about it later. |
I think I'd like to see this fix in this PR then we can say that restart for grid2obs for mesoscale is complete rather than waiting on another PR that addresses issues with the restart. |
OK, fair enough. @MarcelCaron-NOAA I think you said you're working on the fix for CAM? |
@PerryShafran-NOAA No I am not currently working on this fix, but it does need to be added to cam. I don't think it will be a complicated fix, but I'm not sure how quickly I'll be able to get to it. |
Understood. Perhaps I can work on this with @malloryprow to get a fix in this code. |
@malloryprow I have successfully addressed the issues from above:
I think we can now re-start testing for this PR. |
Note to developers: You must use this PR template!
Description of Changes
Relating to issue #533, restart for NAM and RAP stats, which is related to Bugzilla 1547. This PR updates the restart so each file is copied to the restart directory as it is created.
Developer Questions and Checklist
No.
No.
There might be but that will not be tested in this PR and will be considered for a future PR.
${USER} where necessary throughout the code.
HOMEevs are removed from the code.
dev/drivers/scripts or dev/modulefiles have been made in the corresponding ecf/scripts and ecf/defs/evs-nco.def?
Testing Instructions
export EVSINspcotlk=/lfs/h2/emc/vpppg/noscrub/emc.vpppg/evs/v2.0/prep/cam
qsub -v vhr=07
a) Run as normal to get a baseline for the final size of the stats file. Save this file to compare to the final file for restart.
b) Set the wallclock to 15 minutes to get an interrupted run. You know the run is interrupted when you see the wallclock exceeded message and some SIGTERM error messages, and there is no final stats file written.
c) Set the wallclock to the standard time (1 hr for NAM, 1 hr 30 min for RAP) and run again, without clearing the small stats directory. This will get you the final stats file to compare to the baseline from step 6a).
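The comparison between the baseline and restart stats files could be scripted along these lines. This is a sketch under assumptions: the function name `compare_stat_files` is hypothetical, the stats files are treated as plain text, and lines are sorted before comparing in case a restarted run writes them in a different order.

```python
import os


def compare_stat_files(baseline_path, restart_path):
    """Compare the final stats file from a restart against a baseline.

    Hypothetical helper: returns sizes in bytes, line counts, and
    whether the two files hold the same lines regardless of order.
    """
    with open(baseline_path) as f:
        baseline_lines = sorted(f.readlines())
    with open(restart_path) as f:
        restart_lines = sorted(f.readlines())
    return {
        "baseline_bytes": os.path.getsize(baseline_path),
        "restart_bytes": os.path.getsize(restart_path),
        "baseline_lines": len(baseline_lines),
        "restart_lines": len(restart_lines),
        "same_lines": baseline_lines == restart_lines,
    }
```

Matching sizes and matching sorted lines together give a quick indication that the restarted run reproduced the baseline.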