[Bug]: Investigate differences in xESMF and ESMF regridding that lead to "Model - Observation" plot diffs #938

Open
tomvothecoder opened this issue Feb 14, 2025 · 10 comments · May be fixed by #940

tomvothecoder (Collaborator) commented Feb 14, 2025

What happened?

We found various "Model - Observation" plot differences in #931. I attributed them to differences in how xESMF and ESMF are implemented, which we should expect.

A few potential causes:

  1. Expected differences in xESMF and ESMF, as mentioned in [Bug]: Debug zppy diffs for v3.0.0 #931.
  2. Using conservative instead of conservative_normed in _align_grids_to_lower_res() results in abnormal data at the latitude edge of polar plots.
     • Found by @chengzhuzhang in a comment below -- fixed by Standardize regridding to conservative_normed #940.
  3. Misaligned bounds dimension names result in slightly different outputs (opened [Bug]: Misaligned bound values can affect regridded variable results slightly with xESMF #945 to address this separately, if desired). This is a non-issue; we should expect different bounds values since the grid sizes differ between the datasets.
     • While debugging this polar plot case further, I found that the test and reference datasets have different bounds dimension names (the test has nbnd, the ref has bnds). Since the bounds dimension names don't align, xESMF seems to produce different regridding results (maybe a different interpretation of grid cells?). This contributes to larger floating point differences compared to the CDAT code.
     • However, if I drop those pre-existing bounds and then regrid, xESMF is much closer to the CDAT code (a sketch of this workaround follows the list). I still noticed the abnormal data at the latitude edge of the plot, though (due to 2. above).
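
A minimal sketch of the drop-and-regenerate workaround from 3. (paths are illustrative, not the real LCRC files; it assumes xcdat's bounds accessor regenerates consistent bounds on both datasets before regridding):

import xcdat as xc

ds_test = xc.open_dataset("test.nc")  # bounds dims: ('lat', 'nbnd'), ('lon', 'nbnd')
ds_ref = xc.open_dataset("ref.nc")    # bounds dims: ('lat', 'bnds'), ('lon', 'bnds')

# Drop the pre-existing, misaligned lat/lon bounds and let xcdat regenerate
# them so both datasets carry consistently derived bounds.
ds_test = ds_test.drop_vars(["lat_bnds", "lon_bnds"]).bounds.add_missing_bounds(axes=["X", "Y"])
ds_ref = ds_ref.drop_vars(["lat_bnds", "lon_bnds"]).bounds.add_missing_bounds(axes=["X", "Y"])

output_grid = ds_test.regridder.grid
prect = ds_ref.regridder.horizontal(
    "PRECT", output_grid, tool="xesmf", method="conservative_normed"
)["PRECT"]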

What did you expect to happen? Are there any possible answers you came across?

Minimal Complete Verifiable Example (MVCE)

"""
This script demonstrates the difference between regridding with and without
pre-existing lat/lon bounds.

It uses two datasets, one with lat/lon bounds and one without. The script
finds that the regridded arrays are not exactly the same, even though the
input arrays are the same.

The bounds dimension names differ between the two datasets, which causes
the regridding results to differ.
"""
import xcdat as xc
import numpy as np

ds_a = xc.open_dataset("/lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-14-branch-930-polar-after/polar/GPCP_v3.2/test.nc")
ds_b = xc.open_dataset("/lcrc/group/e3sm/public_html/cdat-migration-fy24/25-02-14-branch-930-polar-after/polar/GPCP_v3.2/ref.nc")

# 1. Check the bounds dimension names
print("Bounds in ds_a:", ds_a["lon_bnds"].dims, ds_a["lat_bnds"].dims)
print("Bounds in ds_b:", ds_b["lon_bnds"].dims, ds_b["lat_bnds"].dims)

"""
Bounds in ds_a: ('lon', 'nbnd') ('lat', 'nbnd')
Bounds in ds_b: ('lon', 'bnds') ('lat', 'bnds')

Notice how the bounds dimension names are different ('nbnd' vs. 'bnds'). This
seems to affect results when regridding.
"""

# 2. Regrid with pre-existing lat/lon bounds. -- This one produces larger diffs compared to CDAT
output_grid = ds_a.regridder.grid
prect_a = ds_b.regridder.horizontal("PRECT", output_grid, tool="xesmf", method="conservative")["PRECT"]

# 3. Regrid without pre-existing bounds. -- This one is close to CDAT.
ds_a_no_bnds = ds_a.drop_vars(["lon_bnds", "lat_bnds"])
ds_b_no_bnds = ds_b.drop_vars(["lon_bnds", "lat_bnds"])
output_grid_no_bnds = ds_a_no_bnds.regridder.grid
prect_b = ds_b_no_bnds.regridder.horizontal("PRECT", output_grid_no_bnds, tool="xesmf", method="conservative")["PRECT"]

# 4. Print the stats -- they are "relatively" close but still different (which we shouldn't expect).
def print_stats(arr1, arr2, label1="Array 1", label2="Array 2"):
    stats = {
        "Min": (np.min(arr1), np.min(arr2)),
        "Max": (np.max(arr1), np.max(arr2)),
        "Mean": (np.mean(arr1), np.mean(arr2)),
        "Sum": (np.sum(arr1), np.sum(arr2)),
        "Std": (np.std(arr1), np.std(arr2)),
    }

    print(f"\n{'Stat':<10} {label1:<15} {label2:<15}")
    print("-" * 40)
    for stat, values in stats.items():
        print(f"{stat:<10} {values[0]:<15.6f} {values[1]:<15.6f}")

print_stats(prect_a.values, prect_b.values, "PRECT (with bounds)", "PRECT (no bounds)")

"""
Stat       PRECT (with bounds) PRECT (no bounds)
----------------------------------------
Min        0.165452        0.164364
Max        9.341431        10.046874
Mean       1.394370        1.398573
Sum        20078.931641    20139.449219
Std        1.016806        1.039437
"""

# 5. Compare arrays for floating point differences -- notice how they don't align
try:
    np.testing.assert_allclose(prect_a.values, prect_b.values, rtol=1e-5, atol=0)
except AssertionError as e:
    print("Arrays are not within rtol 1e-5")
    print(e)
else:
    print("Arrays are within rtol 1e-5")

"""
Arrays are not within rtol 1e-5

Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 14392 / 14400 (99.9%)
Max absolute difference among violations: 3.6998303
Max relative difference among violations: 1.2256833
 ACTUAL: array([[1.123808, 1.435736, 1.278921, ..., 1.565927, 1.16721 , 1.126871],
       [2.288256, 2.331007, 2.508708, ..., 2.673229, 2.40844 , 2.271105],
       [1.978262, 2.111453, 1.774192, ..., 2.471611, 2.190455, 2.001195],...
 DESIRED: array([[2.356618, 2.652923, 2.578529, ..., 2.912792, 2.402574, 2.316819],
       [2.049907, 2.211725, 2.076397, ..., 2.646537, 2.309698, 2.113865],
       [1.979461, 2.010734, 1.869982, ..., 2.347431, 2.147887, 1.941646],...

"""

Relevant log output

2025-02-14 17:10:58,325 [WARNING]: dataset.py(open_dataset:120) >> "No time coordinates were found in this dataset to decode. If time coordinates were expected to exist, make sure they are detectable by setting the CF 'axis' or 'standard_name' attribute (e.g., ds['time'].attrs['axis'] = 'T' or ds['time'].attrs['standard_name'] = 'time'). Afterwards, try decoding again with `xcdat.decode_time`."
2025-02-14 17:10:58,325 [WARNING]: dataset.py(open_dataset:120) >> "No time coordinates were found in this dataset to decode. If time coordinates were expected to exist, make sure they are detectable by setting the CF 'axis' or 'standard_name' attribute (e.g., ds['time'].attrs['axis'] = 'T' or ds['time'].attrs['standard_name'] = 'time'). Afterwards, try decoding again with `xcdat.decode_time`."
2025-02-14 17:10:58,356 [WARNING]: dataset.py(open_dataset:120) >> "No time coordinates were found in this dataset to decode. If time coordinates were expected to exist, make sure they are detectable by setting the CF 'axis' or 'standard_name' attribute (e.g., ds['time'].attrs['axis'] = 'T' or ds['time'].attrs['standard_name'] = 'time'). Afterwards, try decoding again with `xcdat.decode_time`."
2025-02-14 17:10:58,356 [WARNING]: dataset.py(open_dataset:120) >> "No time coordinates were found in this dataset to decode. If time coordinates were expected to exist, make sure they are detectable by setting the CF 'axis' or 'standard_name' attribute (e.g., ds['time'].attrs['axis'] = 'T' or ds['time'].attrs['standard_name'] = 'time'). Afterwards, try decoding again with `xcdat.decode_time`."
Bounds in ds_a: ('lon', 'nbnd') ('lat', 'nbnd')
Bounds in ds_b: ('lon', 'bnds') ('lat', 'bnds')

Stat       PRECT (with bounds) PRECT (no bounds)
----------------------------------------
Min        0.165452        0.164364       
Max        9.341431        10.046874      
Mean       1.394370        1.398573       
Sum        20078.931641    20139.449219   
Std        1.016806        1.039437       
Arrays are not within rtol 1e-5

Not equal to tolerance rtol=1e-05, atol=0

Mismatched elements: 14392 / 14400 (99.9%)
Max absolute difference among violations: 3.6998303
Max relative difference among violations: 1.2256833
 ACTUAL: array([[1.123808, 1.435736, 1.278921, ..., 1.565927, 1.16721 , 1.126871],
       [2.288256, 2.331007, 2.508708, ..., 2.673229, 2.40844 , 2.271105],
       [1.978262, 2.111453, 1.774192, ..., 2.471611, 2.190455, 2.001195],...
 DESIRED: array([[2.356618, 2.652923, 2.578529, ..., 2.912792, 2.402574, 2.316819],
       [2.049907, 2.211725, 2.076397, ..., 2.646537, 2.309698, 2.113865],
       [1.979461, 2.010734, 1.869982, ..., 2.347431, 2.147887, 1.941646],...

Anything else we need to know?

No response

Environment

v3.0.0rc2 and latest main

@tomvothecoder tomvothecoder added the bug Bug fix (will increment patch version) label Feb 14, 2025
@tomvothecoder tomvothecoder self-assigned this Feb 14, 2025
chengzhuzhang (Contributor) commented Feb 17, 2025

@tomvothecoder thank you for continuing to investigate this issue. Since the comment on the polar image diffs shows abnormal values over the boundary, it reminded me that I had to use the conservative_normed regridding method instead of conservative when remapping masked datasets. In the polar set, with subset data, we should also use the normalized method to account for the fractional area over the boundary when remapping. I ran a command-line test using conservative_normed, and the seemingly extreme values over the edge latitude no longer appear.

[Image: polar plot regridded with conservative_normed, no abnormal values at the latitude edge]

The command line I used is as follows:

e3sm_diags polar --no_viewer --reference_data_path '/lcrc/group/e3sm/diagnostics/observations/Atm/climatology/' --test_data_path '/lcrc/group/e3sm/ac.forsyth2/zppy_weekly_comprehensive_v2_output/test_pr651_both_commits_20250117/v2.LR.historical_0201/post/atm/180x360_aave/clim/2yr' --results_dir '/lcrc/group/e3sm/public_html/zhang40/cdat-migration-fy24/25-02-11-branch-930-zppy-diffs-regrid-contour-conservative_norm' --case_id 'GPCP_v3.2' --run_type 'model_vs_obs' --sets 'polar' --variables 'PRECT' --seasons 'ANN' --regions 'polar_S' --regrid_tool 'esmf' --regrid_method 'conservative_normed' --multiprocessing --num_workers '8' --main_title 'PRECT ANN polar_S' --backend 'cartopy' --save_netcdf --output_format 'png' --canvas_size_w '1212' --canvas_size_h '1628' --figsize '8.5' '11.0' --dpi '150' --arrows  --test_name 'v2.LR.historical_0201' --short_test_name 'v2.LR.historical_0201' --test_colormap 'WhiteBlueGreenYellowRed.rgb' --ref_name 'GPCP_v3.2' --reference_name 'GPCP v3.2' --reference_colormap 'WhiteBlueGreenYellowRed.rgb' --diff_title 'Model - Observations' --diff_colormap 'BrBG' --diff_levels '-2' '-1.5' '-1' '-0.75' '-0.5' '-0.25' '0.25' '0.5' '0.75' '1' '1.5' '2' --granulate 'variables' 'seasons' 'plevs' 'regions' --selectors 'sets' 'seasons'

I would suggest we use conservative_normed as the standard regridding method; however, based on the performance benchmarked here, it will cause some performance degradation. I think we can test a full run, and if it helps significantly with addressing the diffs, we can go with it.
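
For reference, the same switch via the Python API is a one-parameter change (a sketch based on e3sm_diags' documented CoreParameter/runner usage; the paths are hypothetical):

from e3sm_diags.parameter.core_parameter import CoreParameter
from e3sm_diags.run import runner

param = CoreParameter()
param.sets = ["polar"]
param.reference_data_path = "/path/to/obs/climatology"  # hypothetical
param.test_data_path = "/path/to/test/climatology"      # hypothetical
param.results_dir = "/path/to/results"                  # hypothetical
param.regrid_tool = "esmf"  # accepted for backwards compatibility; mapped to "xesmf"
param.regrid_method = "conservative_normed"  # instead of "conservative" or "bilinear"

runner.sets_to_run = ["polar"]
runner.run_diags([param])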

However, I needed to use the 3.0.0rc2 tag to run the command line; there appears to be a bug with the --no_viewer flag in 3.0.0rc3, for which I will file an issue.

tomvothecoder (Collaborator, Author) commented

I would suggest we use conservative_normed as the standard regridding method; however, based on the performance benchmarked here, it will cause some performance degradation. I think we can test a full run, and if it helps significantly with addressing the diffs, we can go with it.

@chengzhuzhang Thank you for figuring this out! I will implement the fix, do a full run, then perform a regression test.

However, I needed to use the 3.0.0rc2 tag to run the command line; there appears to be a bug with the --no_viewer flag in 3.0.0rc3, for which I will file an issue.

Thanks for filing the issue here. I'll try to address it before E3SM Unified is released.

chengzhuzhang (Contributor) commented

@tomvothecoder in the datasets where we saw diffs, I noticed most are using the bilinear regridding method, specified in the cfg file.
It also reminded me that we had to use bilinear to work around the cdms regridding problem (discussion). With the new backend, I think we should revert the changes made here, that is, go back to using conservative_normed for all variables.

tomvothecoder (Collaborator, Author) commented

@tomvothecoder in the datasets where we saw diffs, I noticed most are using the bilinear regridding method, specified in the cfg file. It also reminded me that we had to use bilinear to work around the cdms regridding problem (discussion). With the new backend, I think we should revert the changes made here, that is, go back to using conservative_normed for all variables.

Got it. I just pushed changes to #940 to remove "bilinear" for all applicable variables.

I am doing another test run to see if that improves the diffs for variables in #931.

tomvothecoder (Collaborator, Author) commented Feb 18, 2025

@tomvothecoder in the datasets where we saw diffs, I noticed most are using the bilinear regridding method, specified in the cfg file. It also reminded me that we had to use bilinear to work around the cdms regridding problem (discussion). With the new backend, I think we should revert the changes made here, that is, go back to using conservative_normed for all variables.

Got it. I just pushed changes to #940 to remove "bilinear" for all applicable variables.

I am doing another test run to see if that improves the diffs for variables in #931.

Actually for these variables, I don't know if comparing diffs makes sense. Shouldn't xESMF with conservative_normed be more correct/expected than CDAT with bilinear?

Links to the runs: #940 (comment)

chengzhuzhang (Contributor) commented Feb 18, 2025

@tomvothecoder could you help me update e3sm_diags/driver/__init__.py to use the file e3sm_ne30_ocean_land_mask.nc, which you can find in /home/ac.zhang40/test_zppy/, and run the all-conservative_normed (no bilinear) test again?

Update: Actually, we don't really need this change for now, because the program in this case just uses the land/ocean fraction included in the climatology dataset. I don't know how to explain why conservative_normed gives a smaller land area.

chengzhuzhang (Contributor) commented Feb 18, 2025

@tomvothecoder I think the regrid_method in the _apply_land_sea_mask function needs to be updated as well:

def subset_and_align_datasets(

Maybe we should just go ahead and use conservative_normed as the default, and test?

tomvothecoder (Collaborator, Author) commented

@tomvothecoder could you help me update e3sm_diags/driver/__init__.py to use the file e3sm_ne30_ocean_land_mask.nc, which you can find in /home/ac.zhang40/test_zppy/, and run the all-conservative_normed (no bilinear) test again?

Update: Actually, we don't really need this change for now, because the program in this case just uses the land/ocean fraction included in the climatology dataset. I don't know how to explain why conservative_normed gives a smaller land area.

I ended up running with the new default land-sea mask and don't see any obvious changes to the plot. This makes sense given what you said about the land/ocean fraction from the climatology dataset being used.

@tomvothecoder I think the regrid_method in the _apply_land_sea_mask function needs to be updated as well:

The default regridding method is conservative, which already gets converted to conservative_normed in _apply_land_sea_mask() (code below):

def _apply_land_sea_mask(
    ds: xr.Dataset,
    ds_mask: xr.Dataset,
    var_key: str,
    region: Literal["land", "ocean"],
    regrid_tool: str,
    regrid_method: str,
) -> xr.Dataset:
    """Apply a land or sea mask based on the region ("land" or "ocean").

    Parameters
    ----------
    ds : xr.Dataset
        The dataset containing the variable.
    ds_mask : xr.Dataset
        The dataset containing the land sea region mask variable(s).
    var_key : str
        The key of the variable.
    region : Literal["land", "ocean"]
        The region to mask.
    regrid_tool : {"esmf", "xesmf", "regrid2"}
        The regridding tool to use. Note, "esmf" is accepted for backwards
        compatibility with e3sm_diags and is simply updated to "xesmf".
    regrid_method : str
        The regridding method to use. Refer to [1]_ for more information on
        these options.

        esmf/xesmf options:
          - "bilinear"
          - "conservative"
          - "conservative_normed" -- equivalent to "conservative" in cdms2 ESMF
          - "patch"
          - "nearest_s2d"
          - "nearest_d2s"

        regrid2 options:
          - "conservative"

    Returns
    -------
    xr.Dataset
        The Dataset with the land or sea mask applied to the variable.
    """
    # TODO: Remove this conditional once "esmf" references are updated to
    # "xesmf" throughout the codebase.
    if regrid_tool == "esmf":
        regrid_tool = "xesmf"

    # TODO: Remove this conditional once "conservative" references are updated
    # to "conservative_normed" throughout the codebase.
    # NOTE: This is equivalent to "conservative" in cdms2 ESMF. If
    # "conservative" is chosen, it is updated to "conservative_normed". This
    # logic can be removed once the CoreParameter.regrid_method default
    # value is updated to "conservative_normed" and all sets have been
    # refactored to use this function.
    if regrid_method == "conservative":
        regrid_method = "conservative_normed"

chengzhuzhang (Contributor) commented

@tomvothecoder thanks for being thorough; I should look more carefully... I'm trying to come up with other potential explanations.

chengzhuzhang (Contributor) commented

The symptom of missing values bleeding into the regridded field is in line with what is described here: https://xesmf.readthedocs.io/en/latest/notebooks/Masking.html. I don't remember whether adding a mask variable is natively handled in xcdat. There is also a comment in the xesmf repo that suggests adding skipna=True to the regridder call, which is said to fill in more grid cells (see the sketch below).
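
A minimal sketch of both suggestions, following the xESMF masking docs (ds_in and ds_out stand for the source dataset and target grid, PRECT is assumed to be 2D here, and none of this is currently wired into xcdat/e3sm_diags):

import xarray as xr
import xesmf as xe

# xESMF looks for a variable literally named "mask" on the input grid
# (1 = valid cell, 0 = masked). With "conservative_normed", the weights are
# then normalized by the valid (unmasked) fraction of each destination cell.
ds_in["mask"] = xr.where(ds_in["PRECT"].notnull(), 1, 0)

regridder = xe.Regridder(ds_in, ds_out, method="conservative_normed")

# skipna=True (the suggestion from the xesmf issue) ignores NaN source cells
# at apply time and renormalizes, which can fill in more destination cells.
prect_out = regridder(ds_in["PRECT"], skipna=True)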
