GEOSgcm GNU Debug no longer runs after MAPL 2.48 #3073

Closed
mathomp4 opened this issue Oct 4, 2024 · 2 comments · Fixed by #3107
Labels: 🪲 Bug, ❗ High Priority

Comments


mathomp4 commented Oct 4, 2024

Runs of GEOSgcm with GNU (13 or 14) are now failing with:

        EXTDATA: DEBUG: ExtData Run_: READ_LOOP: Done
[borgj101:222883] *** An error occurred in MPI_Wait
[borgj101:222883] *** reported by process [2176581633,0]
[borgj101:222883] *** on communicator MPI COMMUNICATOR 22 CREATE FROM 21
[borgj101:222883] *** MPI_ERR_TRUNCATE: message truncated
[borgj101:222883] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[borgj101:222883] ***    and potentially your MPI job)
>> Error << /discover/swdev/gmao_SIteam/MPI/openmpi/4.1.6-SLES15/gcc-13.2.0/bin/mpirun  -np 96 /discover/nobackup/mathomp4/Experiments/stock-2024Oct04-1day-c24-GNU-DEBUG-ExtDataDebug/scratch/GEOSgcm.x --logging_config logging.yaml: status = 15; at /gpfsm/dnb34/mathomp4/SystemTests/builds/AGCM_GNU/CURRENT/GEOSgcm/install-Debug/bin/esma_mpirun line 377.
GEOSgcm Run Status: -1

So after ExtData runs, we crash. Running with DDT showed it crashing here:

if (spec%regrid_method /= REGRID_METHOD_NEAREST_STOD) then
   call ESMF_FieldRegrid(src_field, dst_field, &
        & routeHandle=route_handle, &
        & dynamicMask=this%dynamic_mask, &
        & termorderflag=ESMF_TERMORDER_SRCSEQ, &
        & zeroregion=ESMF_REGION_SELECT, &
        & rc=status)
   _VERIFY(status)

with an error code of ESMC_RC_NOT_IMPL (which seems to be the catch-all for ESMF_FieldRegrid failures).

Now, tracking down in my nightly tests when the GNU develop Debug runs started failing, I landed on exactly the time @bena-nasa converted ExtData to containers and things became non-zero-diff (see #3025), aka PR #3007.

So, as a final test, I checked out three ExtData2G files as they were at commit d888902 (aka before #3007):

git checkout d888902b764e3c5d6cf7971eec2fe766ff3b1d2a -- gridcomps/ExtData2G/CMakeLists.txt
git checkout d888902b764e3c5d6cf7971eec2fe766ff3b1d2a -- gridcomps/ExtData2G/ExtDataGridCompNG.F90
git checkout d888902b764e3c5d6cf7971eec2fe766ff3b1d2a -- gridcomps/ExtData2G/ExtDataOldTypesCreator.F90

and that does run. So GNU does not like something in the d888902...cae216f range. :(
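If the range needs narrowing further, git bisect can automate the search. The following is only a sketch, not something done in this issue: `find_first_bad` is a hypothetical helper name, and the test command passed to it is assumed to exit 0 when a commit runs cleanly and non-zero (other than 125, which git treats as "skip") when it fails.

```shell
#!/usr/bin/env bash
# Sketch: find the first failing commit between a known-good and a known-bad commit.
# Usage: find_first_bad GOOD BAD CMD [ARGS...]
#   CMD must exit 0 on a good commit and 1-124 or 126-127 on a bad one.
find_first_bad() {
  local good=$1 bad=$2
  shift 2
  # "git bisect start <bad> <good>" marks both endpoints in one step.
  git bisect start "$bad" "$good" >/dev/null || return 1
  # Let git check out each candidate commit and run the test command on it.
  git bisect run "$@" >/dev/null
  # When the run finishes, refs/bisect/bad points at the first bad commit.
  git rev-parse --short refs/bisect/bad
  # Return the working tree to where it was before bisecting.
  git bisect reset >/dev/null 2>&1
}
```

Under those assumptions, a call such as `find_first_bad d888902 cae216f ./run_gnu_debug.sh` would print the short hash of the first failing commit, where `./run_gnu_debug.sh` is a placeholder for a script that builds with GNU Debug and runs the model. Each step involves a full rebuild and run, so this is slow, but it needs no manual file checkouts.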

mathomp4 added the 🪲 Bug and ❗ High Priority labels on Oct 4, 2024
mathomp4 pinned this issue on Oct 4, 2024

mathomp4 commented Oct 4, 2024

Note: @tclune reminded me of ESMF logging. I did try that, but all we saw were the errors due to #1976 (which we should fix as well).

darianboggs unpinned this issue on Oct 7, 2024
darianboggs pinned this issue on Oct 7, 2024
@bena-nasa
Collaborator

I think I figured it out, PR coming.
