Bug: Jacobian runs not completing with MERRA2 meteorology in v2.0 #281

Open
sabourbaray opened this issue Oct 9, 2024 · 5 comments

@sabourbaray
Contributor

Name: Sabour Baray
Institution: Environment and Climate Change Canada


Version: I am using a fork of IMI v2.0 adapted for the ECCC HPC system (operational-eccc/v2.0a) together with GCClassic 14.4.1.

Description: The IMI runs normally with GEOS-FP meteorology at 0.25° resolution. With MERRA2 meteorology at 0.5° resolution, the inversion stops at the Jacobian component. Jacobian 0000 completes successfully, which suggests the MERRA2 met files are not the problem, but every subsequent simulation halts immediately after "Begin Time Stepping" (see below). GEOS-Chem produces no error message, but I do get a "stack smashing detected" warning before a segmentation fault. The error persists when NumJacobianTracers is set to 1.

R E S T A R T   F I L E   I N P U T

Min and Max of each species in restart file [mol/mol]:
Species   1,      CH4: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
Species   2, CH4_0031: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
Species   3, CH4_0032: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
Species   4, CH4_0033: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
Species   5, CH4_0034: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
Species   6, CH4_0035: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
Species   7, CH4_0036: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
Species   8, CH4_0037: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
Species   9, CH4_0038: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
Species  10, CH4_0039: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
Species  11, CH4_0040: Min = 9.999999717E-10  Max = 9.999999717E-10  Sum = 8.958291728E-04
===============================================================================
Min and Max of each species in BC file [mol/mol]:
Species   1,      CH4: Min = 9.999999717E-10  Max = 1.000000083E-09
Species   2, CH4_0031: Min = 9.999999717E-10  Max = 1.000000083E-09
Species   3, CH4_0032: Min = 9.999999717E-10  Max = 1.000000083E-09
Species   4, CH4_0033: Min = 9.999999717E-10  Max = 1.000000083E-09
Species   5, CH4_0034: Min = 9.999999717E-10  Max = 1.000000083E-09
Species   6, CH4_0035: Min = 9.999999717E-10  Max = 1.000000083E-09
Species   7, CH4_0036: Min = 9.999999717E-10  Max = 1.000000083E-09
Species   8, CH4_0037: Min = 9.999999717E-10  Max = 1.000000083E-09
Species   9, CH4_0038: Min = 9.999999717E-10  Max = 1.000000083E-09
Species  10, CH4_0039: Min = 9.999999717E-10  Max = 1.000000083E-09
Species  11, CH4_0040: Min = 9.999999717E-10  Max = 1.000000083E-09
 GET_BOUNDARY_CONDITIONS: Done reading BCs at 2018/05/01 00:00 using 
           0           1

********************************************
* B e g i n   T i m e   S t e p p i n g !! *
********************************************

---> DATE: 2018/05/01  UTC: 00:00
 HEMCO already called for this timestep. Returning.
NASA-GSFC Tracer Transport Module successfully initialized

Rollback and input files testing: Note that this issue first appears in IMI versions released after the update to Jacobian tracers. When I roll back to IMI v1.2 (operational-eccc/v1.2) and GCClassic v12.0.1, I can run normally with MERRA2 meteorology at 0.5° resolution for any time period between 2018 and 2023.

Follow-up questions: Are any other users able to reproduce this issue on their machines? I would like to know whether this is due to an error on my end, or whether there is a bug in the way the tracers are programmed for this grid resolution in the new Jacobian runs.

@msulprizio
Collaborator

Hi @sabourbaray. I was able to run the IMI with MERRA-2 using the latest state of dev (commit 573f4a4). My configuration file is attached. I used the same setup as the default out-of-the-box Permian case, just switching met to MERRA2 and resolution to 0.5x0.625. I've also attached my resulting visualization notebook for reference.

Can you try to run the IMI with the settings in my configuration file and see if you get the same errors? I noticed you made several changes to your config file (domain, buffer cells, etc.), so verifying if you can at least get the out-of-the-box example run going will help us narrow down what the issue is.

If you're still having the issue, can you try turning on verbose in both geoschem_config.rc and HEMCO_Config.rc for your jacobian runs? It may be easiest to do this in the template_run files and then regenerate your jacobian_run directories. That should provide additional printout so we can see exactly where the model is seg faulting.
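
For the HEMCO side, the switch could be scripted before the run directories are regenerated. The following is a minimal Python sketch and not part of the IMI: it assumes HEMCO_Config.rc carries a boolean "Verbose: false" line (the style used by recent HEMCO versions), and the function name enable_verbose is hypothetical. The GEOS-Chem configuration file would need its own edit.

```python
import re

# Assumption: the setting appears on its own line as "Verbose: false"
# (any amount of spaces or tabs around the colon).
_VERBOSE_OFF = re.compile(r"^([ \t]*Verbose[ \t]*:[ \t]*)(?i:false)\b", re.MULTILINE)


def enable_verbose(config_text: str) -> str:
    """Return config_text with 'Verbose: false' switched to 'Verbose: true'.

    Text that does not contain a matching line is returned unchanged.
    """
    return _VERBOSE_OFF.sub(r"\g<1>true", config_text)
```

Reading the template file, passing its text through enable_verbose, and writing the result back would be enough; the original column alignment of the setting is preserved.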

@sabourbaray
Contributor Author

Thanks @msulprizio. I'm running into the same bug but I have a lot of extra information to help narrow it down a bit. See below:


  1. I have tried again using the out-of-the-box config file (with small modifications for our HPC environment), and have also tried updating my code to dev (commit 573f4a4) with the same bug at the same point.

  2. Using the verbose output from geoschem_config.rc and HEMCO_Config.rc, I am noticing incorrect values in HEMCO when it processes the StateVector.nc file and CH4_Emis_Prior. For example, when HEMCO is checking CH4_STATE_VECTOR, the container is registered without problems:

Register_Base: Checking CH4_STATE_VECTOR
HEMCO: Entering Get_targetID (HCO_CONFIG_MOD.F90) ( 4)
HEMCO: Leaving Get_targetID (HCO_CONFIG_MOD.F90) ( 4)
 Container ID     :          243
 Assigned targetID:          243
HEMCO: Entering ReadList_Set (HCO_READLIST_MOD.F90) ( 4)
 New container set to ReadList:
Container CH4_STATE_VECTOR
    -->Data type       :            1
    -->Container ID    :          243
    -->Target ID       :          243
    -->File data home?           -999
    -->Source file     : /home/rsb001/geoschem/imi_output_dir/Test_IMI_Permian_merra2/StateVector.nc
    -->ncRead?            T
    -->Shared data file?  F
    -->Source parameter: StateVector
    -->Year range      :         2009        2009
    -->Month range     :            1           1
    -->Day range       :            1           1
    -->Hour range      :            0           0
    -->SpaceDim        :            2
    -->Array dimension :            0           0
    -->Array sum       :   0.0000000E+00
    -->Array min & max :   0.0000000E+00  0.0000000E+00
    -->Time dimension  :            0
    -->Delta t[h]      :            0
    -->Local time?        F
    -->OrigUnit        : 1
    -->Concentration?     F
    -->Coverage        :         -999
    -->Extension Nr    :         -999
    -->Species name    : *
    -->HEMCO species ID:            0
    -->Category        :            1
    -->Hierarchy       :            1
    -->2D emitted into :    1.00000000000000       and    1.00000000000000
HEMCO: Leaving ReadList_Set (HCO_READLIST_MOD.F90) ( 4)
 Base field registered: CH4_STATE_VECTOR

But when it is processing the container:

 Processing container: CH4_STATE_VECTOR
Parsing source file and replacing tokens
Opening file: /home/rsb001/geoschem/imi_output_dir/Test_IMI_Permian_merra2/StateVector.nc
HEMCO: Opening /home/rsb001/geoschem/imi_output_dir/Test_IMI_Permian_merra2/StateVector.nc
HEMCO: Entering GET_TIMEIDX (HCOIO_UTIL_MOD.F90) ( 6)
Number of time slices found:            0
         preferred datetime:  200901010000.
             selected tidx1:              0
       assigned delta t [h]:              0
local time?  F
HEMCO: Leaving GET_TIMEIDX (HCOIO_UTIL_MOD.F90) ( 6)
Reading variable StateVector
HEMCO WARNING: Data is treated as unitless, but file attribute suggests it is not: none. File: /home/rsb001/geoschem/imi_output_dir/Test_IMI_Permian_merra2/StateVector.nc
--> LOCATION: HCOIO_READ (HCOIO_READ_STD_MOD.F90)
Based on srcUnit attribute (1), no unit conversion is performed.
  ==> Use map_a2a regridding
HEMCO: Leaving HCOIO_READ (HCOIO_READ_STD_MOD.F90) ( 5)
HEMCO: Leaving HCOIO_DATAREAD (HCOIO_DATAREAD_MOD.F90) ( 4)
HEMCO: Entering tIDx_Assign (HCO_TIDX_MOD.F90) ( 4)
HEMCO: Leaving tIDx_Assign (HCO_TIDX_MOD.F90) ( 4)
HEMCO: Entering EmisList_Pass (HCO_EMISLIST_MOD.F90) ( 4)
HEMCO: Entering EmisList_Add (HCO_EMISLIST.F90) ( 5)
HEMCO: Entering Add2EmisList (HCO_EMISLIST_MOD.F90) ( 6)
HEMCO: Leaving Add2EmisList (HCO_EMISLIST_MOD.F90) ( 6)
Container added to EmisList:
Container CH4_STATE_VECTOR
   -->Data type       :            1
   -->Container ID    :          243
   -->Target ID       :          243
   -->File data home?              1
   -->Source file     : /home/rsb001/geoschem/imi_output_dir/Test_IMI_Permian_merra2/StateVector.nc
   -->ncRead?            T
   -->Shared data file?  F
   -->Source parameter: StateVector
   -->Year range      :         2009        2009
   -->Month range     :            1           1
   -->Day range       :            1           1
   -->Hour range      :            0           0
   -->SpaceDim        :            2
   -->Array dimension :          160         120
   -->Array sum       :  -1.8627670E+35
   -->Array min & max :  -9.9999998E+30   28.00000
   -->Time dimension  :            1
   -->Delta t[h]      :            0
   -->Local time?        F
   -->Tempres         : Constant
   -->OrigUnit        : 1
   -->Concentration?     F
   -->Coverage        :         -999
   -->Extension Nr    :         -999
   -->Species name    : *
   -->HEMCO species ID:            0
   -->Category        :            1
   -->Hierarchy       :            1
   -->2D emitted into :    1.00000000000000       and    1.00000000000000

Similarly, I am seeing incorrect values when HEMCO is processing CH4_Emis_Prior:

 Processing container: CH4_Emis_Prior
 Parsing source file and replacing tokens
 Opening file: ../../hemco_prior_emis/OutputDir/HEMCO_sa_diagnostics.201805010000.nc
HEMCO: Opening ../../hemco_prior_emis/OutputDir/HEMCO_sa_diagnostics.201805010000.nc
HEMCO: Entering GET_TIMEIDX (HCOIO_UTIL_MOD.F90) ( 6)
 Number of time slices found:            1
 Time slice range :    201805010000.000        201805010000.000
          preferred datetime:  201805010000.
              selected tidx1:              1
    corresponding datetime 1:  201805010000.
        assigned delta t [h]:              0
 local time?  F
HEMCO: Leaving GET_TIMEIDX (HCOIO_UTIL_MOD.F90) ( 6)
 Reading variable EmisCH4_Total_ExclSoilAbs
 Unit conversion settings:
 - Year, month        :         2018           5
 Data was in units of kg/m2/s - unit conversion factor is    1.00000000000000
   ==> Use map_a2a regridding
HEMCO: Leaving HCOIO_READ (HCOIO_READ_STD_MOD.F90) ( 5)
HEMCO: Leaving HCOIO_DATAREAD (HCOIO_DATAREAD_MOD.F90) ( 4)
HEMCO: Entering tIDx_Assign (HCO_TIDX_MOD.F90) ( 4)
HEMCO: Leaving tIDx_Assign (HCO_TIDX_MOD.F90) ( 4)
HEMCO: Entering EmisList_Pass (HCO_EMISLIST_MOD.F90) ( 4)
HEMCO: Entering EmisList_Add (HCO_EMISLIST.F90) ( 5)
HEMCO: Entering Add2EmisList (HCO_EMISLIST_MOD.F90) ( 6)
HEMCO: Leaving Add2EmisList (HCO_EMISLIST_MOD.F90) ( 6)
Container added to EmisList:
Container CH4_Emis_Prior
    -->Data type       :            1
    -->Container ID    :            1
    -->Target ID       :            1
    -->File data home?              1
    -->Source file     : ../../hemco_prior_emis/OutputDir/HEMCO_sa_diagnostics.$YYYY$MM$DD0000.nc
    -->ncRead?            T
    -->Shared data file?  F
    -->Source parameter: EmisCH4_Total_ExclSoilAbs
    -->Year range      :         1900        2050
    -->Month range     :            1          12
    -->Day range       :            1          31
    -->Hour range      :            0           0
    -->SpaceDim        :            2
    -->Array dimension :          160         120
    -->Array sum       :  -1.8626670E+35
    -->Array min & max :  -9.9999998E+30  1.1463717E-09
    -->Time dimension  :            1
    -->Delta t[h]      :            0
    -->Local time?        F
    -->Tempres         : Constant
    -->OrigUnit        : kg/m2/s
    -->Concentration?     F
    -->Coverage        :         -999
    -->Extension Nr    :            0
    -->Species name    : CH4
    -->HEMCO species ID:            1
    -->Category        :            1
    -->Hierarchy       :          500
    -->2D emitted into :    1.00000000000000       and    1.00000000000000
HEMCO: Leaving EmisList_Add (HCO_EMISLIST.F90) ( 5)
HEMCO: Leaving EmisList_Pass (HCO_EMISLIST_MOD.F90) ( 4)
HEMCO: Entering tIDx_Assign (HCO_TIDX_MOD.F90) ( 4)
HEMCO: Leaving tIDx_Assign (HCO_TIDX_MOD.F90) ( 4)
HEMCO: Entering EmisList_Pass (HCO_EMISLIST_MOD.F90) ( 4)
HEMCO: Entering EmisList_Add (HCO_EMISLIST.F90) ( 5)
HEMCO: Entering Add2EmisList (HCO_EMISLIST_MOD.F90) ( 6)
HEMCO: Leaving Add2EmisList (HCO_EMISLIST_MOD.F90) ( 6)

Investigating both of these .nc files, I don't see any obvious spurious values, although I did notice that StateVector.nc had one NaN value in this run directory. When I swapped in a state vector file produced by an earlier v1.2 version of the IMI, which did not contain the NaN value, the same result occurred. I also don't notice anything unusual in the Perturbations_000x.txt files.
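
A check of this kind can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the -1.0e30 threshold are assumptions, and the values are expected as a flat sequence of floats (for example the flattened StateVector variable after it has been read from the file with a NetCDF library).

```python
import math

# Assumption: anything below this threshold is a fill sentinel, not data.
FILL_THRESHOLD = -1.0e30


def summarize(values):
    """Count NaNs and fill-like values and report min/max of the remaining data."""
    n_nan = 0
    n_fill = 0
    valid = []
    for v in values:
        if math.isnan(v):
            n_nan += 1
        elif v < FILL_THRESHOLD:
            n_fill += 1
        else:
            valid.append(v)
    return {
        "n_total": n_nan + n_fill + len(valid),
        "n_nan": n_nan,
        "n_fill": n_fill,
        "valid_min": min(valid) if valid else None,
        "valid_max": max(valid) if valid else None,
    }


def estimate_fill_cells(array_sum, fill_value):
    """Rough number of fill cells implied by a sum dominated by the fill value."""
    return round(array_sum / fill_value)
```

Applied to the HEMCO printout for CH4_STATE_VECTOR, estimate_fill_cells(-1.8627670e35, -9.9999998e30) gives 18628, which would mean that roughly 97% of the 160 × 120 = 19,200 cells hold the fill value after regridding. This is only an estimate, since it takes the printed single-precision sum at face value and ignores the small contribution of the valid cells.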

  3. I suspected this might be related to the conda environment (for example, a bug in my version of the Python libraries affecting the state vector generation), so I replicated the most recent imi_env.yml environment file, but I got the same result.

  4. I tried running gcclassic built with -DCMAKE_BUILD_TYPE=Debug and can give the location of the arithmetic error:

Use country-specific values for SCALE_ELEM_0001
- Source file: Perturbations_0001.txt

Thread 1 "gcclassic" received signal SIGFPE, Arithmetic exception.
0x00000000010e39dc in hcoio_util_mod::hcoio_readcountryvalues (hcostate=0x15554a25f240, 
    lct=0x1553c984e320, rc=0)
    at /home/rsb001/geoschem/imi_output_dir/Test_Permian_1week_merra2_2.0a/jacobian_runs/Test_Permian_1week_merra2_2.0a_0001/CodeDir/src/HEMCO/src/Core/hcoio_util_mod.F90:2292
2292	          CIDS = NINT(CNTR)

So in summary, the issue appears to be related to how HEMCO processes the StateVector.nc and HEMCO_sa_diagnostics.$YYYY$MM$DD0000.nc files at this grid resolution. The resulting arrays contain fill values of about -1.0E+31 (printed as -9.9999998E+30), and the model crashes when it begins time stepping.
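
One plausible reading of the SIGFPE at CIDS = NINT(CNTR) is that a fill value of this size cannot be rounded into an integer. The Python sketch below illustrates that reasoning under two assumptions that are not confirmed in this thread: that CIDS is a default 32-bit integer, and that the debug build traps invalid floating-point conversions. The function name is hypothetical.

```python
import math

# Range of a signed 32-bit integer (assumed kind of CIDS).
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def fits_default_integer(value: float) -> bool:
    """Return True if value, rounded to the nearest integer, fits in 32 bits.

    Python's round() breaks ties differently from Fortran's NINT, which is
    irrelevant for values as far outside the range as a -1e31 fill value.
    """
    if math.isnan(value) or math.isinf(value):
        return False
    return INT32_MIN <= round(value) <= INT32_MAX
```

With the values printed above, 28.0 passes this check while -9.9999998E+30 does not, which would be consistent with the exception at line 2292 if unmasked fill cells reach that statement.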

It's strange this doesn't happen on your machines at Harvard or on our machines when running at 0.25° resolution. I am hoping there is something simple on my end that I've overlooked. Open to any suggestions for what to try next.

@msulprizio
Collaborator

Thanks @sabourbaray. Can you confirm that you're using the latest HEMCO version in your runs? I assume that for the out-of-the-box run you cloned the version of GCClassic pegged to that version of the IMI and ran git submodule update --init --recursive to automatically check out the matching versions of GEOS-Chem and HEMCO. I just want to make sure you have the latest fixes in HEMCO. There was a bug in earlier HEMCO versions where lat/lon bounds were offset for nested grids.

@sabourbaray
Contributor Author

Hi @msulprizio. The version of GCClassic is 14.4.1 and HEMCO is 3.9.1.

@msulprizio
Collaborator

OK. The fix I was referring to is geoschem/HEMCO#229, which went into HEMCO 3.7.1, so that should not affect you.
