[Bug]: Investigate differences in xESMF and ESMF regridding that lead to "Model - Observation" plot diffs #938

Open
tomvothecoder opened this issue Feb 14, 2025 · 10 comments · May be fixed by #940

tomvothecoder (Collaborator) commented Feb 14, 2025

What happened?

We found various "Model - Observation" plot differences in #931. I attributed them to differences in how xESMF and ESMF are implemented, which we should expect.

A few potential causes:

  1. Expected differences in xESMF and ESMF, as mentioned in [Bug]: Debug zppy diffs for v3.0.0 #931.
  2. Using conservative instead of conservative_normed in _align_grids_to_lower_res() results in abnormal data at the latitude edge of polar plots.
     • Found by @chengzhuzhang in a comment below -- fixed by Standardize regridding to conservative_normed #940.
  3. Misaligned bounds dimension names result in slightly different outputs (opened [Bug]: Misaligned bound values can affect regridded variable results slightly with xESMF #945 to address this separately, if desired). This is a non-issue; we should expect different bounds values since the grid sizes differ between the datasets.
     • While debugging this polar plot case further, I found that the test and reference datasets have different bounds dimension names (the test has nbnd, the ref has bnds). Since the bounds dimension names don't align, xESMF seems to produce different regridding results (maybe a different interpretation of grid cells?). This contributes to larger floating point differences compared to the CDAT code.
     • However, if I drop those pre-existing bounds and then regrid, xESMF is much closer to the CDAT code (a sketch of this workaround follows the list). I still noticed the abnormal data at the latitude edge of the plot, though (due to 2. above).
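
A minimal sketch of the drop-and-regenerate workaround from 3. (paths are illustrative, not the real LCRC files; it assumes xcdat's bounds accessor regenerates consistent bounds on both datasets before regridding):

import xcdat as xc

ds_test = xc.open_dataset("test.nc")  # bounds dims: ('lat', 'nbnd'), ('lon', 'nbnd')
ds_ref = xc.open_dataset("ref.nc")    # bounds dims: ('lat', 'bnds'), ('lon', 'bnds')

# Drop the pre-existing, misaligned lat/lon bounds and let xcdat regenerate
# them so both datasets carry consistently derived bounds.
ds_test = ds_test.drop_vars(["lat_bnds", "lon_bnds"]).bounds.add_missing_bounds(axes=["X", "Y"])
ds_ref = ds_ref.drop_vars(["lat_bnds", "lon_bnds"]).bounds.add_missing_bounds(axes=["X", "Y"])

output_grid = ds_test.regridder.grid
prect = ds_ref.regridder.horizontal(
    "PRECT", output_grid, tool="xesmf", method="conservative_normed"
)["PRECT"]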

What did you expect to happen? Are there any possible answers you came across?

Minimal Complete Verifiable Example (MVCE)

"""
This script demonstrates the difference between regridding with and without
pre-existing lat/lon bounds.

It uses two datasets, one with lat/lon bounds and one without. The script
finds that the regridded arrays are not exactly the same, even though the
input arrays are the same.

The bounds dimension names differ between the two datasets, which causes
the regridding results to differ.
"""
import xcdat as xc
import numpy as np

ds_a = xc.open_dataset("/lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-14-branch-930-polar-after/polar/GPCP_v3.2/test.nc")
ds_b = xc.open_dataset("/lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-14-branch-930-polar-after/polar/GPCP_v3.2/ref.nc")

# 1. Check the bounds dimension names
print("Bounds in ds_a:", ds_a["lon_bnds"].dims, ds_a["lat_bnds"].dims)
print("Bounds in ds_b:", ds_b["lon_bnds"].dims, ds_b["lat_bnds"].dims)

"""
Bounds in ds_a: ('lon', 'nbnd') ('lat', 'nbnd')
Bounds in ds_b: ('lon', 'bnds') ('lat', 'bnds')

Notice how the bounds dimension names are different ('nbnd' vs. 'bnds'). This
seems to affect results when regridding.
"""

# 2. Regrid with pre-existing lat/lon bounds. -- This one produces larger diffs compared to CDAT
output_grid = ds_a.regridder.grid
prect_a = ds_b.regridder.horizontal("PRECT", output_grid, tool="xesmf", method="conservative")["PRECT"]

# 3. Regrid without pre-existing bounds. -- This one is close to CDAT.
ds_a_no_bnds = ds_a.drop_vars(["lon_bnds", "lat_bnds"])
ds_b_no_bnds = ds_b.drop_vars(["lon_bnds", "lat_bnds"])
output_grid_no_bnds = ds_a_no_bnds.regridder.grid
prect_b = ds_b_no_bnds.regridder.horizontal("PRECT", output_grid_no_bnds, tool="xesmf", method="conservative")["PRECT"]

# 4. Print the stats -- they are "relatively" close but still different (which we shouldn't expect).
def print_stats(arr1, arr2, label1="Array 1", label2="Array 2"):
    stats = {
        "Min": (np.min(arr1), np.min(arr2)),
        "Max": (np.max(arr1), np.max(arr2)),
        "Mean": (np.mean(arr1), np.mean(arr2)),
        "Sum": (np.sum(arr1), np.sum(arr2)),
        "Std": (np.std(arr1), np.std(arr2)),
    }

    print(f"\n{'Stat':<10} {label1:<15} {label2:<15}")
    print("-" * 40)
    for stat, values in stats.items():
        print(f"{stat:<10} {values[0]:<15.6f} {values[1]:<15.6f}")

print_stats(prect_a.values, prect_b.values, "PRECT (with bounds)", "PRECT (no bounds)")

"""
Stat       PRECT (with bounds) PRECT (no bounds)
----------------------------------------
Min        0.165452        0.164364
Max        9.341431        10.046874
Mean       1.394370        1.398573
Sum        20078.931641    20139.449219
Std        1.016806        1.039437
"""

# 5. Compare arrays for floating point differences -- notice how they don't align
try:
    np.testing.assert_allclose(prect_a.values, prect_b.values, rtol=1e-5, atol=0)
except AssertionError as e:
    print("Arrays are not within rtol 1e-5")
    print(e)
else:
    print("Arrays are within rtol 1e-5")

"""
Arrays are not within rtol 1e-5

Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 14392 / 14400 (99.9%)
Max absolute difference among violations: 3.6998303
Max relative difference among violations: 1.2256833
 ACTUAL: array([[1.123808, 1.435736, 1.278921, ..., 1.565927, 1.16721 , 1.126871],
       [2.288256, 2.331007, 2.508708, ..., 2.673229, 2.40844 , 2.271105],
       [1.978262, 2.111453, 1.774192, ..., 2.471611, 2.190455, 2.001195],...
 DESIRED: array([[2.356618, 2.652923, 2.578529, ..., 2.912792, 2.402574, 2.316819],
       [2.049907, 2.211725, 2.076397, ..., 2.646537, 2.309698, 2.113865],
       [1.979461, 2.010734, 1.869982, ..., 2.347431, 2.147887, 1.941646],...

"""

Relevant log output

2025-02-14 17:10:58,325 [WARNING]: dataset.py(open_dataset:120) >> "No time coordinates were found in this dataset to decode. If time coordinates were expected to exist, make sure they are detectable by setting the CF 'axis' or 'standard_name' attribute (e.g., ds['time'].attrs['axis'] = 'T' or ds['time'].attrs['standard_name'] = 'time'). Afterwards, try decoding again with `xcdat.decode_time`."
2025-02-14 17:10:58,325 [WARNING]: dataset.py(open_dataset:120) >> "No time coordinates were found in this dataset to decode. If time coordinates were expected to exist, make sure they are detectable by setting the CF 'axis' or 'standard_name' attribute (e.g., ds['time'].attrs['axis'] = 'T' or ds['time'].attrs['standard_name'] = 'time'). Afterwards, try decoding again with `xcdat.decode_time`."
2025-02-14 17:10:58,356 [WARNING]: dataset.py(open_dataset:120) >> "No time coordinates were found in this dataset to decode. If time coordinates were expected to exist, make sure they are detectable by setting the CF 'axis' or 'standard_name' attribute (e.g., ds['time'].attrs['axis'] = 'T' or ds['time'].attrs['standard_name'] = 'time'). Afterwards, try decoding again with `xcdat.decode_time`."
2025-02-14 17:10:58,356 [WARNING]: dataset.py(open_dataset:120) >> "No time coordinates were found in this dataset to decode. If time coordinates were expected to exist, make sure they are detectable by setting the CF 'axis' or 'standard_name' attribute (e.g., ds['time'].attrs['axis'] = 'T' or ds['time'].attrs['standard_name'] = 'time'). Afterwards, try decoding again with `xcdat.decode_time`."
Bounds in ds_a: ('lon', 'nbnd') ('lat', 'nbnd')
Bounds in ds_b: ('lon', 'bnds') ('lat', 'bnds')

Stat       PRECT (with bounds) PRECT (no bounds)
----------------------------------------
Min        0.165452        0.164364       
Max        9.341431        10.046874      
Mean       1.394370        1.398573       
Sum        20078.931641    20139.449219   
Std        1.016806        1.039437       
Arrays are not within rtol 1e-5

Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 14392 / 14400 (99.9%)
Max absolute difference among violations: 3.6998303
Max relative difference among violations: 1.2256833
 ACTUAL: array([[1.123808, 1.435736, 1.278921, ..., 1.565927, 1.16721 , 1.126871],
       [2.288256, 2.331007, 2.508708, ..., 2.673229, 2.40844 , 2.271105],
       [1.978262, 2.111453, 1.774192, ..., 2.471611, 2.190455, 2.001195],...
 DESIRED: array([[2.356618, 2.652923, 2.578529, ..., 2.912792, 2.402574, 2.316819],
       [2.049907, 2.211725, 2.076397, ..., 2.646537, 2.309698, 2.113865],
       [1.979461, 2.010734, 1.869982, ..., 2.347431, 2.147887, 1.941646],...

Anything else we need to know?

No response

Environment

v3.0.0rc2 and latest main

@tomvothecoder tomvothecoder added the bug Bug fix (will increment patch version) label Feb 14, 2025
@tomvothecoder tomvothecoder self-assigned this Feb 14, 2025
chengzhuzhang (Contributor) commented Feb 17, 2025

@tomvothecoder thank you for continuing to investigate this issue. Since the comment on the polar image diffs shows abnormal values over the boundary, it reminded me that I had to use the conservative_normed regridding method instead of conservative when remapping masked datasets. In the polar set, with subset data, we should also use the normalized method to account for the fractional area over the boundary when remapping. I ran a command-line test using conservative_normed, and the seemingly extreme values over the edge latitude no longer appear.

[Image: polar plot regridded with conservative_normed, no abnormal values at the latitude edge]

The command line I used is as follows:

e3sm_diags polar --no_viewer --reference_data_path '/lcrc/group/e3sm/diagnostics/observations/Atm/climatology/' --test_data_path '/lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/test_pr651_both_commits_20250117/v2.LR.historical_0201/post/atm/180x360_aave/clim/2yr' --results_dir '/lcrc/group/e3sm/public_html/zhang40/cdat-migration-fy24/25-02-11-branch-930-zppy-diffs-regrid-contour-conservative_norm' --case_id 'GPCP_v3.2' --run_type 'model_vs_obs' --sets 'polar' --variables 'PRECT' --seasons 'ANN' --regions 'polar_S' --regrid_tool 'esmf' --regrid_method 'conservative_normed' --multiprocessing --num_workers '8' --main_title 'PRECT ANN polar_S' --backend 'cartopy' --save_netcdf --output_format 'png' --canvas_size_w '1212' --canvas_size_h '1628' --figsize '8.5' '11.0' --dpi '150' --arrows  --test_name 'v2.LR.historical_0201' --short_test_name 'v2.LR.historical_0201' --test_colormap 'WhiteBlueGreenYellowRed.rgb' --ref_name 'GPCP_v3.2' --reference_name 'GPCP v3.2' --reference_colormap 'WhiteBlueGreenYellowRed.rgb' --diff_title 'Model - Observations' --diff_colormap 'BrBG' --diff_levels '-2' '-1.5' '-1' '-0.75' '-0.5' '-0.25' '0.25' '0.5' '0.75' '1' '1.5' '2' --granulate 'variables' 'seasons' 'plevs' 'regions' --selectors 'sets' 'seasons'

I would suggest we use conservative_normed as the standard regridding method; however, based on the performance benchmarked here, it will cause some performance degradation. I think we can test a full run, and if it helps significantly with addressing the diffs, we can go with it.
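
For reference, the same switch via the Python API is a one-parameter change (a sketch based on e3sm_diags' documented CoreParameter/runner usage; the paths are hypothetical):

from e3sm_diags.parameter.core_parameter import CoreParameter
from e3sm_diags.run import runner

param = CoreParameter()
param.sets = ["polar"]
param.reference_data_path = "/path/to/obs/climatology"  # hypothetical
param.test_data_path = "/path/to/test/climatology"      # hypothetical
param.results_dir = "/path/to/results"                  # hypothetical
param.regrid_tool = "esmf"  # accepted for backwards compatibility; mapped to "xesmf"
param.regrid_method = "conservative_normed"  # instead of "conservative" or "bilinear"

runner.sets_to_run = ["polar"]
runner.run_diags([param])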

However, I needed to use the 3.0.0rc2 tag to run the command line; there appears to be a bug with the --no_viewer flag in 3.0.0rc3, for which I will file an issue.

tomvothecoder (Collaborator, Author) commented

I would suggest we use conservative_normed as the standard regridding method; however, based on the performance benchmarked here, it will cause some performance degradation. I think we can test a full run, and if it helps significantly with addressing the diffs, we can go with it.

@chengzhuzhang Thank you for figuring this out! I will implement the fix, do a full run, then perform a regression test.

However, I needed to use the 3.0.0rc2 tag to run the command line; there appears to be a bug with the --no_viewer flag in 3.0.0rc3, for which I will file an issue.

Thanks for filing the issue here. I'll try to address it before E3SM Unified is released.

chengzhuzhang (Contributor) commented

@tomvothecoder in the datasets where we saw diffs, I noticed most are using the bilinear regridding method, specified in the cfg file.
It also reminded me that we had to use bilinear to work around the cdms regridding problem (discussion). With the new backend, I think we should revert the changes made here, that is, go back to using conservative_normed for all variables.

tomvothecoder (Collaborator, Author) commented

@tomvothecoder in the datasets where we saw diffs, I noticed most are using the bilinear regridding method, specified in the cfg file. It also reminded me that we had to use bilinear to work around the cdms regridding problem (discussion). With the new backend, I think we should revert the changes made here, that is, go back to using conservative_normed for all variables.

Got it. I just pushed changes to #940 to remove "bilinear" for all applicable variables.

I am doing another test run to see if that improves the diffs for variables in #931.

tomvothecoder (Collaborator, Author) commented Feb 18, 2025

@tomvothecoder in the datasets where we saw diffs, I noticed most are using the bilinear regridding method, specified in the cfg file. It also reminded me that we had to use bilinear to work around the cdms regridding problem (discussion). With the new backend, I think we should revert the changes made here, that is, go back to using conservative_normed for all variables.

Got it. I just pushed changes to #940 to remove "bilinear" for all applicable variables.

I am doing another test run to see if that improves the diffs for variables in #931.

Actually for these variables, I don't know if comparing diffs makes sense. Shouldn't xESMF with conservative_normed be more correct/expected than CDAT with bilinear?

Links to the runs: #940 (comment)

chengzhuzhang (Contributor) commented Feb 18, 2025

@tomvothecoder could you help me update e3sm_diags/driver/__init__.py to use the file e3sm_ne30_ocean_land_mask.nc, which you can find in /home/ac.zhang40/test_zppy/, and run the all-conservative_normed (no bilinear) test again?

Update: Actually, we don't really need this change for now, because the program in this case just uses the land/ocean fraction included in the climatology dataset. I don't know how to explain why conservative_normed gives a smaller land area.

chengzhuzhang (Contributor) commented Feb 18, 2025

@tomvothecoder I think the regrid_method in the _apply_land_sea_mask function needs to be updated as well:

def subset_and_align_datasets(

Maybe we should just go ahead and use conservative_normed as the default, and test?

tomvothecoder (Collaborator, Author) commented

@tomvothecoder could you help me update e3sm_diags/driver/__init__.py to use the file e3sm_ne30_ocean_land_mask.nc, which you can find in /home/ac.zhang40/test_zppy/, and run the all-conservative_normed (no bilinear) test again?

Update: Actually, we don't really need this change for now, because the program in this case just uses the land/ocean fraction included in the climatology dataset. I don't know how to explain why conservative_normed gives a smaller land area.

I ended up running with the new default land-sea mask and don't see any obvious changes to the plot. This makes sense given what you said about the land/ocean fraction from the climatology dataset being used.

@tomvothecoder I think the regrid_method in the _apply_land_sea_mask function needs to be updated as well:

The default regridding method is conservative, which already gets converted to conservative_normed in _apply_land_sea_mask() (code below):

def _apply_land_sea_mask(
    ds: xr.Dataset,
    ds_mask: xr.Dataset,
    var_key: str,
    region: Literal["land", "ocean"],
    regrid_tool: str,
    regrid_method: str,
) -> xr.Dataset:
    """Apply a land or sea mask based on the region ("land" or "ocean").

    Parameters
    ----------
    ds : xr.Dataset
        The dataset containing the variable.
    ds_mask : xr.Dataset
        The dataset containing the land sea region mask variable(s).
    var_key : str
        The key of the variable.
    region : Literal["land", "ocean"]
        The region to mask.
    regrid_tool : {"esmf", "xesmf", "regrid2"}
        The regridding tool to use. Note, "esmf" is accepted for backwards
        compatibility with e3sm_diags and is simply updated to "xesmf".
    regrid_method : str
        The regridding method to use. Refer to [1]_ for more information on
        these options.

        esmf/xesmf options:
          - "bilinear"
          - "conservative"
          - "conservative_normed" -- equivalent to "conservative" in cdms2 ESMF
          - "patch"
          - "nearest_s2d"
          - "nearest_d2s"

        regrid2 options:
          - "conservative"

    Returns
    -------
    xr.Dataset
        The Dataset with the land or sea mask applied to the variable.
    """
    # TODO: Remove this conditional once "esmf" references are updated to
    # "xesmf" throughout the codebase.
    if regrid_tool == "esmf":
        regrid_tool = "xesmf"

    # TODO: Remove this conditional once "conservative" references are updated
    # to "conservative_normed" throughout the codebase.
    # NOTE: This is equivalent to "conservative" in cdms2 ESMF. If
    # "conservative" is chosen, it is updated to "conservative_normed". This
    # logic can be removed once the CoreParameter.regrid_method default
    # value is updated to "conservative_normed" and all sets have been
    # refactored to use this function.
    if regrid_method == "conservative":
        regrid_method = "conservative_normed"

chengzhuzhang (Contributor) commented

@tomvothecoder thanks for being thorough; I should look more carefully... I'm trying to come up with other potential explanations.

chengzhuzhang (Contributor) commented

The symptom of missing values bleeding into the regridded field is in line with what is described here: https://xesmf.readthedocs.io/en/latest/notebooks/Masking.html. I don't remember whether adding a mask variable is natively handled in xcdat. There is also a comment in the xesmf repo that suggests adding skipna=True to the regridder call, which is said to fill in more grid cells (see the sketch below).
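
A minimal sketch of both suggestions, following the xESMF masking docs (ds_in and ds_out stand for the source dataset and target grid, PRECT is assumed to be 2D here, and none of this is currently wired into xcdat/e3sm_diags):

import xarray as xr
import xesmf as xe

# xESMF looks for a variable literally named "mask" on the input grid
# (1 = valid cell, 0 = masked). With "conservative_normed", the weights are
# then normalized by the valid (unmasked) fraction of each destination cell.
ds_in["mask"] = xr.where(ds_in["PRECT"].notnull(), 1, 0)

regridder = xe.Regridder(ds_in, ds_out, method="conservative_normed")

# skipna=True (the suggestion from the xesmf issue) ignores NaN source cells
# at apply time and renormalizes, which can fill in more destination cells.
prect_out = regridder(ds_in["PRECT"], skipna=True)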
