Go through UFS-CAT tutorial steps for running SCHISM #75

Closed
janahaddad opened this issue Apr 29, 2024 · 41 comments

@janahaddad
Collaborator

janahaddad commented Apr 29, 2024

https://drive.google.com/drive/folders/1uQt3x8_O2dV7g9nD9y5bSuSRHoPOqsEb?hl=en UFS-CAT tutorial docs

@Armaghan-NOAA

Armaghan-NOAA commented Apr 29, 2024

@uturuncoglu and/or @pvelissariou1 I do not have access permission to the folder below:

/work2/noaa/nems/tufuk/RT which I believe is needed for running rt according to:

(https://drive.google.com/drive/folders/1uQt3x8_O2dV7g9nD9y5bSuSRHoPOqsEb?hl=en)

Could you please grant access? I am not part of the nems project, so maybe you could copy the needed folder (RT) to one of the projects I belong to:
noaa-hpc nosofs nos-surge

for this specific example I am using this path:
/work2/noaa/nos-surge/aabed/rt

@pvelissariou1
Collaborator

@Armaghan-NOAA, try to use /work/noaa/nems/tufuk/RT instead and see if that works for you. Edit the rt.sh script and change the line
from: DISKNM=/work/noaa/nosofs/${USER}/RT
to: DISKNM=/work/noaa/nems/tufuk/RT
That is, search rt.sh for "orion" and "hercules" to find the DISKNM= lines for these two platforms.
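
A minimal sketch of that edit, run against a small mock file rather than the real rt.sh. The mock layout, the temporary file, and the names rt_copy and new_disknm are assumptions for illustration only. It rewrites every DISKNM= line it finds, so on the real script it is safer to check the grep output first and limit the change to the orion and hercules blocks.

```shell
# Mock stand-in for the relevant part of rt.sh (assumption: the real file is
# much longer and its layout may differ).
rt_copy=$(mktemp)
cat > "$rt_copy" <<'EOF'
  orion)
    DISKNM=/work/noaa/nosofs/${USER}/RT
    ;;
  hercules)
    DISKNM=/work/noaa/nosofs/${USER}/RT
    ;;
EOF

# Show the platform labels and DISKNM lines together with their line numbers.
grep -n -E 'orion|hercules|DISKNM=' "$rt_copy"

# Rewrite each DISKNM assignment, keeping the indentation and leaving a .bak copy.
# Note: no spaces around "=" in the replacement.
new_disknm=/work/noaa/nems/tufuk/RT
sed -i.bak -E "s|^([[:space:]]*)DISKNM=.*|\1DISKNM=${new_disknm}|" "$rt_copy"

# Count the lines that now point at the new location.
disknm_count=$(grep -c "DISKNM=${new_disknm}\$" "$rt_copy")
echo "DISKNM lines now pointing at ${new_disknm}: ${disknm_count}"
```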

@Armaghan-NOAA

Armaghan-NOAA commented Apr 30, 2024

@pvelissariou1 @uturuncoglu I got this error:
(base) [armaghan@hercules-login-3 tests]$ ./rt.sh -l rt_coastal.conf -a nems -k -n coastal_ike_shinnecock_atm2sch intel
Regression Testing Script Started
hercules-login-3.hpc.msstate.edu
The -n option needs [testname] AND [compiler] in quotes, i.e. -n "control_p8 intel"
rt.sh finished
rt.sh: Cleaning up...
rt.sh: Exiting.

Please let me know how to solve this on Hercules. Also, following the recorded training, I have not loaded any modules yet. Is that not needed?

@uturuncoglu
Collaborator

@Armaghan-NOAA Please try the following: ./rt.sh -l rt_coastal.conf -a nems -k -n "coastal_ike_shinnecock_atm2sch intel". The way of running the regression tests changed slightly with the recent sync. Please document all issues so that we can update the app-level documentation. We also have issues with using -l together with -n. Let me sync the model again to get the fix from the ufs-weather-model level.
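
The quotes matter because of ordinary shell word splitting: without them, the test name and the compiler reach the script as two separate arguments, so an option such as -n presumably receives only the test name. A minimal sketch, where count_args is a hypothetical helper and not part of rt.sh:

```shell
# Hypothetical helper: report how many arguments the shell actually passed.
count_args() {
  echo "$#"
}

# Unquoted: "-n", the test name, and the compiler arrive as three arguments.
unquoted_count=$(count_args -n coastal_ike_shinnecock_atm2sch intel)

# Quoted: the test name and compiler travel together as one argument after -n.
quoted_count=$(count_args -n "coastal_ike_shinnecock_atm2sch intel")

echo "unquoted: ${unquoted_count} arguments, quoted: ${quoted_count} arguments"
```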

@pvelissariou1
Collaborator

@uturuncoglu She doesn't have a nems account. Most likely: ./rt.sh -l rt_coastal.conf -a coast (or coastal) -k -n "coastal_ike_shinnecock_atm2sch intel"

@uturuncoglu
Collaborator

@pvelissariou1 Yes, correct. Sorry for the confusion. I will also sync the model again and push the recent changes soon. That will fix the issue with rt.sh.

@pvelissariou1
Collaborator

@uturuncoglu Thanks, after the sync I will check on my side as well.

@uturuncoglu
Collaborator

@pvelissariou1 BTW, it seems there is also a minor typo on ufs-weather-model develop - https://github.com/ufs-community/ufs-weather-model/blob/04bbc15f9abfb25a8864cdeaebdbd439c4332c95/tests/rt.sh#L809 but since we are modifying the folder, it is fine. I am trying to add the fix to my land PR at the UFS level.

@Armaghan-NOAA

Armaghan-NOAA commented Apr 30, 2024

@pvelissariou1 @uturuncoglu Just to let you know, I also tried coastal and coast; neither worked:

(base) [armaghan@hercules-login-3 tests]$ ./rt.sh -l rt_coastal.conf -a coastal -k -n coastal_ike_shinnecock_atm2sch intel
Regression Testing Script Started
hercules-login-3.hpc.msstate.edu
The -n option needs [testname] AND [compiler] in quotes, i.e. -n "control_p8 intel"
rt.sh finished
rt.sh: Cleaning up...
rt.sh: Exiting.

(base) [armaghan@hercules-login-3 tests]$ ./rt.sh -l rt_coastal.conf -a coast -k -n coastal_ike_shinnecock_atm2sch intel
Regression Testing Script Started
hercules-login-3.hpc.msstate.edu
The -n option needs [testname] AND [compiler] in quotes, i.e. -n "control_p8 intel"
rt.sh finished
rt.sh: Cleaning up...
rt.sh: Exiting.

Please let me know whenever you think the system is ready for me to try again. Thank you.

@pvelissariou1 @uturuncoglu, I am a member of these projects, according to the following command:
(base) [armaghan@hercules-login-3 tests]$ groups
noaa-hpc nosofs nos-surge

@pvelissariou1
Collaborator

@Armaghan-NOAA You should have access to at least one of the nos-surge or coast accounts. To check which accounts you have access to, do:

  • module load contrib noaatools
  • run from the commandline: saccount_params
    The last command will show you the list of the accounts you are in.

@Armaghan-NOAA

Armaghan-NOAA commented Apr 30, 2024

Here is what I get:
(base) [armaghan@hercules-login-3 tests]$ module load contrib noaatools
(base) [armaghan@hercules-login-3 tests]$ saccount_params
Account Params -- Information regarding project associations for armaghan
Home Quota (/home/armaghan) Used: 642 MB Quota: 8192 MB Grace: 10240

    Project: noaatest
            Directory: /work/noaa/noaatest DiskInUse=0 GB, Quota=0 GB, Files=0, FileQUota=0
            Directory: /work2/noaa/noaatest DiskInUse=0 GB, Quota=0 GB, Files=0, FileQUota=0

    Project: nos-surge
            FairShare=0.053 (47/53)
            Partition Access: ALL
            Available QOSes: batch,debug,novel,ood,urgent,windfall

            Directory: /work2/noaa/nos-surge DiskInUse=78870 GB, Quota=95000 GB, Files=10256451, FileQUota=0

    Project: nosofs
            FairShare=0.630 (39/53)
            Partition Access: ALL
            Available QOSes: batch,debug,novel,ood,urgent,windfall

            Directory: /work/noaa/nosofs DiskInUse=45385 GB, Quota=47500 GB, Files=7021336, FileQUota=0
            Directory: /work2/noaa/nosofs DiskInUse=5199 GB, Quota=71250 GB, Files=30614, FileQUota=0

Note: for an explanation of the meaning of these values and general scheduling information see:
https://rdhpcs-common-docs.rdhpcs.noaa.gov/wiki/index.php/SLURM_Fair-share
Note: the parenthetical values after project fairshare indiciate the rank
of the project with respect to all other allocated projects. If the first
number is lower, your project is likely to have higher priority than
other projects. (Of course, other factors weigh in to scheduling.)

@pvelissariou1 so I see noaatest, nos-surge, and nosofs
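
To pull just the project names out of output like the above, a small awk filter is enough. This is a sketch that assumes the "Project: NAME" line format shown in this thread; sample_output is a trimmed copy of that output, not a live call to saccount_params.

```shell
# Trimmed copy of the saccount_params output quoted in this thread.
sample_output=$(cat <<'EOF'
    Project: noaatest
            Directory: /work/noaa/noaatest DiskInUse=0 GB, Quota=0 GB, Files=0, FileQUota=0
    Project: nos-surge
            FairShare=0.053 (47/53)
    Project: nosofs
            FairShare=0.630 (39/53)
EOF
)

# Keep the second field of every line whose first field is "Project:",
# then join the names onto one line.
projects=$(printf '%s\n' "$sample_output" | awk '$1 == "Project:" {print $2}' | paste -sd' ' -)
echo "projects: ${projects}"
```

On the cluster, the same awk filter could be fed from saccount_params directly instead of the sample text.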

@pvelissariou1
Collaborator

Ok, try to use the nos-surge account

@uturuncoglu
Collaborator

@Armaghan-NOAA BTW, I think you are still not putting the test name and compiler together in quotes. Also, I have not pushed the sync (incl. the fix related to rt.sh) yet. I am still waiting for the tests on Orion to finish.

@Armaghan-NOAA

got the same error:
(base) [armaghan@hercules-login-3 tests]$ ./rt.sh -l rt_coastal.conf -a nos-surge -k -n coastal_ike_shinnecock_atm2sch intel
Regression Testing Script Started
hercules-login-3.hpc.msstate.edu
The -n option needs [testname] AND [compiler] in quotes, i.e. -n "control_p8 intel"
rt.sh finished
rt.sh: Cleaning up...
rt.sh: Exiting.

Is the command I am using correct?

@pvelissariou1
Collaborator

Try: ./rt.sh -l rt_coastal.conf -a nos-surge -k -n "coastal_ike_shinnecock_atm2sch intel"

@uturuncoglu
Collaborator

@Armaghan-NOAA @pvelissariou1 Again, this will not work until I push the sync. Until then it will run the first job in the file.

@uturuncoglu
Collaborator

uturuncoglu commented Apr 30, 2024

@Armaghan-NOAA @pvelissariou1 Okay. I have just synced again. So, we have the fix in ufs-coastal too.

@Armaghan-NOAA

@pvelissariou1 @uturuncoglu it seems this time it runs but I am getting two other errors (codes 127 & 843) shown below. Could you help?

(base) [armaghan@hercules-login-3 tests]$ ./rt.sh -l rt_coastal.conf -a nos-surge -k -n "coastal_ike_shinnecock_atm2sch intel"
Regression Testing Script Started
hercules-login-3.hpc.msstate.edu
Machine: hercules
Account: nos-surge
rt.sh: Setting up hercules...

+ [[ false == true ]]
+ module use /work/noaa/epic/role-epic/spack-stack/hercules/modulefiles
+ '[' -z '' ']'
+ case "$-" in
+ __lmod_sh_dbg=x
+ '[' -n x ']'
+ set +x
Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for Lmod's output
Shell debugging restarted
+ unset __lmod_sh_dbg
+ return 0
+ [[ false == true ]]
+ QUEUE=batch
+ COMPILE_QUEUE=batch
+ PARTITION=hercules
+ dprefix=/work2/noaa/stmp/armaghan
+ DISKNM=/work/noaa/epic/hercules/UFS-WM_RT
+ DISKNM = DISKNM=/work/noaa/nems/tufuk/RT
./rt.sh: line 843: DISKNM: command not found
++ handle_error 127 843
++ echo 'rt.sh: Getting error information...'
rt.sh: Getting error information...
++ local exit_code=127
++ local exit_line=843
++ echo 'Exited at line 843 having code 127'
Exited at line 843 having code 127
++ rt_trap
++ echo 'rt.sh: Exited abnormally, killing workflow and cleaning up'
rt.sh: Exited abnormally, killing workflow and cleaning up
++ [[ false == true ]]
++ [[ false == true ]]
++ cleanup
++ echo 'rt.sh: Cleaning up...'
rt.sh: Cleaning up...
+++ awk '{print $2}'
++ awk_info=511157
++ [[ 511157 == \5\1\1\1\5\7 ]]
++ rm -rf /work2/noaa/nos-surge/aabed/rt/ufs-coastal/tests/lock
++ [[ false == true ]]
++ trap 0
++ echo 'rt.sh: Exiting.'
rt.sh: Exiting.
++ Exit

@pvelissariou1
Collaborator

@Armaghan-NOAA What is this line I see in your log: DISKNM = DISKNM=/work/noaa/nems/tufuk/RT?
Just replace DISKNM=/work/noaa/epic/hercules/UFS-WM_RT with:
DISKNM=/work/noaa/nems/tufuk/RT
Have you done that?
Do you want to set up a meeting Thursday after 11:00 CDT?
Have you cloned the updated ufs-coastal?

@Armaghan-NOAA

Thanks for pointing that out. I am now getting this error:
ERROR: STMP: /work2/noaa/stmp/armaghan/stmp -- DOES NOT EXIST
Could you guide? @pvelissariou1 @uturuncoglu

@uturuncoglu
Collaborator

I think you set the DISKNM wrong. In the log it is something like
DISKNM = DISKNM=/work/noaa/nems/tufuk/RT
So, it needs to be DISKNM=/work/noaa/nems/tufuk/RT.
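
For context, exit code 127 is what the shell returns when a command is not found. With spaces around the equals sign, the shell treats DISKNM as a command name instead of performing an assignment. A minimal sketch that reproduces both forms (the subshell and the status variable are only there to make the demonstration safe to run):

```shell
# With spaces, the shell looks for a command called DISKNM and fails with 127.
status=0
( DISKNM = /work/noaa/nems/tufuk/RT ) 2>/dev/null || status=$?
echo "exit status with spaces around '=': ${status}"

# Without spaces, this is a plain variable assignment.
DISKNM=/work/noaa/nems/tufuk/RT
echo "DISKNM is now: ${DISKNM}"
```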

@pvelissariou1
Collaborator

@Armaghan-NOAA You had better change it to /work2/noaa/armaghan/stmp. In your rt.sh file, go to the hercules block and change:
dprefix=/work2/noaa/stmp/${USER}
to:
dprefix=/work2/noaa/${USER}/stmp

@uturuncoglu
Collaborator

uturuncoglu commented Apr 30, 2024

@Armaghan-NOAA You could create that directory by hand. Just run mkdir -p /work2/noaa/stmp/armaghan/stmp

@Armaghan-NOAA

Armaghan-NOAA commented Apr 30, 2024

@uturuncoglu I do not have permission to make a folder there:

(base) [armaghan@hercules-login-3 stmp]$ pwd
/work2/noaa/stmp
(base) [armaghan@hercules-login-3 stmp]$ mkdir armaghan
mkdir: cannot create directory ‘armaghan’: Permission denied

@Armaghan-NOAA

@pvelissariou1 got this error:
ERROR: STMP: /work2/noaa/armaghan/stmp/stmp -- DOES NOT EXIST

@pvelissariou1
Collaborator

@Armaghan-NOAA As the error says, you need to create the parent directory manually.
Anyway, I checked hercules and orion to find where your user directories in /work and /work2 are. You should have a folder named armaghan (your user name) in the /work and /work2 filesystems:
hercules/orion
/work/noaa/nosofs : no_user_folder_found, need to create one
/work2/noaa/nos-surge : found aabed, need to rename it to: armaghan
/work2/noaa/nosofs : no_user_folder_found, need to create one
commands:
mkdir -p /work/noaa/nosofs/${USER}
mkdir -p /work2/noaa/nosofs/${USER}
mv /work2/noaa/nos-surge/aabed /work2/noaa/nos-surge/${USER}

Then adjust the dprefix variable for orion/hercules in your rt.sh accordingly, as:
dprefix=/work2/noaa/nos-surge/${USER} (the stmp folder will be created for you)
OR if you prefer:
dprefix=/work2/noaa/nosofs/${USER}
NOTE: /work2/noaa/nos-surge has about 15 TB of disk space left and /work2/noaa/nosofs has about 65 TB left.

@Armaghan-NOAA

@pvelissariou1 I am getting this error now:
ERROR: STMP: /work2/noaa/armaghan/stmp/stmp -- DOES NOT EXIST
Why is it trying to access an armaghan folder directly under noaa, although I specified nos-surge?

@pvelissariou1
Collaborator

@Armaghan-NOAA You are setting the variables in your rt.sh incorrectly. All paths are like: /work2/noaa/nos-surge/armaghan or /work2/noaa/nosofs/armaghan
Set the dprefix variable in your rt.sh like this: dprefix=/work2/noaa/nos-surge/${USER}
If it somehow still complains that it cannot find the stmp directory, then create it manually: mkdir -p /work2/noaa/nos-surge/${USER}/stmp
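
A minimal sketch of checking write access before creating the stmp directory. Here project_dir is a temporary directory standing in for a project area such as /work2/noaa/nos-surge, and the variable names are assumptions for illustration; on the cluster you would set project_dir to the real path.

```shell
# Stand-in for a project directory such as /work2/noaa/nos-surge (assumption).
project_dir=$(mktemp -d)
user_name=${USER:-$(id -un)}

# The errors in this thread suggest rt.sh expects a stmp folder under dprefix.
dprefix="${project_dir}/${user_name}"
stmp_dir="${dprefix}/stmp"

if [ -w "$project_dir" ]; then
  # -p also creates the missing parent (the user folder).
  mkdir -p "$stmp_dir"
  echo "created ${stmp_dir}"
else
  echo "no write permission in ${project_dir}; choose another project directory" >&2
fi
```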

@Armaghan-NOAA

Armaghan-NOAA commented May 1, 2024

@pvelissariou1 I was able to fix the issue and create the needed path.
Here is the new error:
find: ‘/work/noaa/nems/tufuk/RT/NEMSfv3gfs/develop-20240417/’: No such file or directory
rt.sh: Getting error information...

@pvelissariou1
Collaborator

@Armaghan-NOAA, @janahaddad If you want, we can have a meeting tomorrow (after 11:00pm CDT) to go through this and other items in UFS-Coastal.

@Armaghan-NOAA

@pvelissariou1 that would be great if we can meet after 11 tomorrow.

@janahaddad
Collaborator Author

@pvelissariou1 @Armaghan-NOAA may I suggest we review at Monday's UFS-Coastal tag-up, if Takis is able to join?

@pvelissariou1
Collaborator

I will join Monday as well as tomorrow

@janahaddad
Collaborator Author

Ok, thanks Takis

@janahaddad
Collaborator Author

Per our meeting earlier, this now works for me on Hera using the -c option to create a new baseline.

A couple of notes:

  • When Takis walked me through this process a couple of months ago, it threw a warning letting me know that it wasn't finding the baseline folder. Looks like that's gone, but it would be nice to have it so the user knows a new baseline is needed.
  • The order of the option flags matters. It seems to matter only for the flags in front of the test name and compiler. Was it always like this? Either way, good to keep in mind. See below:
(base) [Jana.Haddad@hfe10 tests]$ ./rt.sh -l rt_coastal.conf -a coastal -k -n -c "coastal_ike_shinnecock_atm2sch intel"
******Regression Testing Script Started******
hfe10
The -n option needs [testname] AND [compiler] in quotes, i.e. -n "control_p8 intel"
rt.sh finished
rt.sh: Cleaning up...
rt.sh: Exiting.

correct flag order:

(base) [Jana.Haddad@hfe10 tests]$ ./rt.sh -l rt_coastal.conf -a coastal -c -k -n "coastal_ike_shinnecock_atm2sch intel"
******Regression Testing Script Started******
hfe10
Machine: hera
Account: coastal
rt.sh: Setting up hera...
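
One likely explanation for the ordering behavior: an option that takes a value consumes whatever argument comes next, so in the failing command -n receives "-c" instead of the quoted test name. The sketch below uses the shell's getopts builtin with a hypothetical option string loosely modelled on the flags in this thread; the real rt.sh option list and parsing may differ.

```shell
# Hypothetical parser: returns the value that -n ends up with.
parse_n() {
  local opt n_value=""
  local OPTIND=1
  # "n:" means -n takes a value; -c and -k are plain switches (assumption).
  while getopts "a:cl:kn:" opt; do
    case "$opt" in
      n) n_value=$OPTARG ;;
      *) ;;
    esac
  done
  printf '%s\n' "$n_value"
}

# -n directly followed by -c: the value of -n becomes "-c".
wrong_order=$(parse_n -l rt_coastal.conf -a coastal -k -n -c "coastal_ike_shinnecock_atm2sch intel")

# -n directly followed by the quoted pair: the value is the test name and compiler.
right_order=$(parse_n -l rt_coastal.conf -a coastal -c -k -n "coastal_ike_shinnecock_atm2sch intel")

echo "wrong order, -n got: ${wrong_order}"
echo "right order, -n got: ${right_order}"
```

Keeping the quoted test name and compiler immediately after -n avoids the problem regardless of where the other switches go.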

@Armaghan-NOAA

I also ran the UFS-CAT on Hercules and have the outputs. The next step would be merging the outputs. @pvelissariou1 @janahaddad could you explain what merging the outputs does and how it helps?

@pvelissariou1
Collaborator

@Armaghan-NOAA The program to use for merging the outputs is combine_output11. Before running it, make sure that you load the same modules you loaded when compiling ufs-coastal. Example: module use YOUR_UFS_COASTAL_DIR/modulefiles and then module load ufs_hercules.intel.
Run combine_output11 -h to see the available options. Change to the directory where the outputs directory is located (not inside outputs); combine_output11 searches for an outputs directory. Then run the program to generate the combined output files.
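
A small pre-flight check along those lines, as a sketch only. The helper has_outputs_dir and the variable run_dir are hypothetical, and whether combine_output11 ends up on PATH after loading the modules is an assumption; only the -h option mentioned above is used.

```shell
# Hypothetical helper: succeeds when "$1" contains an outputs/ subdirectory,
# i.e. it is the directory combine_output11 should be started from.
has_outputs_dir() {
  [ -d "$1/outputs" ]
}

# Directory to check; defaults to the current directory.
run_dir=${1:-$PWD}

if ! has_outputs_dir "$run_dir"; then
  echo "no outputs/ directory under ${run_dir}; change to its parent directory first" >&2
elif ! command -v combine_output11 >/dev/null 2>&1; then
  echo "combine_output11 not found on PATH; load the ufs-coastal modules or use its full path" >&2
else
  # Only print the available options, as suggested above.
  (cd "$run_dir" && combine_output11 -h)
fi
```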

@pvelissariou1
Collaborator

pvelissariou1 commented May 9, 2024 via email

@janahaddad
Collaborator Author

@Armaghan-NOAA successfully ran this RT, closing with new ticket #94

@github-project-automation github-project-automation bot moved this from In Progress to Done in ufs-coastal project May 17, 2024
@Armaghan-NOAA

The latest outputs from running UFS-CAT are in this path: /work2/noaa/nosofs/armaghan/stmp/armaghan/FV3_RT/REGRESSION_TEST/coastal_ike_shinnecock_atm2sch_intel/outputs

@Armaghan-NOAA

https://drive.google.com/drive/folders/1uQt3x8_O2dV7g9nD9y5bSuSRHoPOqsEb?hl=en UFS-CAT tutorial docs

another resource would be: #46
