Expand checks in compare_ts_and_hist notebooks #33

Open
mnlevy1981 opened this issue Oct 9, 2020 · 1 comment

Comments

@mnlevy1981
Contributor

These notebooks are designed to verify that the time series files we generate are bit-for-bit identical to the history files produced by the model. Right now, the notebooks rely on diag_metadata.yaml to determine which variables are compared, which means:

  1. Only a subset of variables from pop.h are checked
  2. For the 3D fields listed in the YAML file, we only check a subset of the vertical levels
  3. The other streams (pop.h.nday1, pop.h.nyear1, cice.h, cice.h1) are not checked at all

Perhaps a smart parallelization technique would make it feasible to check all variables across all streams?
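
As a rough illustration of the parallelization idea, every (stream, variable) pair can be treated as an independent task and handed to a pool of workers. The sketch below is an assumption, not code from this repository: `check_all`, `check_one`, and `variables_by_stream` are hypothetical names, and a thread pool from the standard library stands in for whatever scheduler (for example a dask cluster) would actually be used.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(variables_by_stream, check_one, max_workers=4):
    """Run check_one(stream, var) for every pair and collect the results.

    variables_by_stream: dict mapping a stream name to a list of variable names
    check_one: hypothetical callable returning True when the variable matches
    """
    # Flatten the dict into independent (stream, variable) tasks
    tasks = [(s, v) for s, vs in variables_by_stream.items() for v in vs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves task order, so results line up with tasks
        results = pool.map(lambda task: check_one(*task), tasks)
        return dict(zip(tasks, results))
```

Whether this helps in practice depends on how much memory each individual comparison needs, since the tasks run concurrently.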

@mnlevy1981
Contributor Author

As of d604c92 in #29, I am no longer running da.identical() to compare data, but I am verifying that time series files exist for every variable in the CESM history files. This is done for all five streams: pop.h, pop.h.nday1, pop.h.nyear1, cice.h, and cice.h1.
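
For readers unfamiliar with this kind of existence check, a minimal sketch follows. It is not the code from d604c92: `find_missing_timeseries` and the `get_timeseries_files` callable are hypothetical, and the sketch assumes the callable returns the expected file paths for a variable, whether or not they exist on disk.

```python
import os


def find_missing_timeseries(variables, get_timeseries_files):
    """Return {var: [missing paths]} for variables lacking time series files.

    get_timeseries_files: hypothetical callable mapping a variable name to
    the list of paths where its time series files are expected.
    """
    missing = {}
    for var in variables:
        paths = get_timeseries_files(var)
        absent = [p for p in paths if not os.path.isfile(p)]
        # A variable with no expected files at all is also reported
        if not paths or absent:
            missing[var] = absent
    return missing
```

An empty result means every variable has at least one expected file and all of its expected files are present.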

I tried running

import xarray as xr

# `case`, `year`, and `stream` are not defined in this snippet
open_mfdataset_kwargs = dict(data_vars="minimal", compat="override", coords="minimal", parallel=True)

history_filenames = case.get_history_files(year, stream)
ds_hist = xr.open_mfdataset(history_filenames, **open_mfdataset_kwargs)
# To check every time-dependent variable instead of just TEMP:
# vars_to_check = [var for var in ds_hist.data_vars if "time" in ds_hist[var].coords and var != "time_bound"]
vars_to_check = ["TEMP"]
for var in vars_to_check:
    timeseries_filenames = case.get_timeseries_files(year, stream, var)
    ds_ts = xr.open_mfdataset(timeseries_filenames, **open_mfdataset_kwargs)
    # Limiting the comparison to a single level works fine:
    # da_hist = ds_hist[var].isel(z_t=0)
    # da_ts = ds_ts[var].isel(z_t=0)
    # Comparing the full 3D field blows memory, even with dask (cluster.scale(12)):
    da_hist = ds_hist[var]
    da_ts = ds_ts[var]
    if da_hist.identical(da_ts):
        print(f"{var} is the same in history and time series")
    else:
        print(f"{var} is DIFFERENT in history and time series")

and, as the inline comments indicate, it ran out of memory on the full 3D field even with cluster.scale(12), while comparing a single level was fine in serial or parallel. In fact, I saw a modest performance gain from running the single-level comparison in parallel (wall time dropped from 25.1 s to 16.4 s):

with isel(z_t=0)
----
Parallel, cluster.scale(n=8):
CPU times: user 4.28 s, sys: 92.3 ms, total: 4.38 s
Wall time: 16.4 s

Serial:
CPU times: user 19.7 s, sys: 3.17 s, total: 22.9 s
Wall time: 25.1 s
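
Since the single-level comparison fits in memory, one possible follow-up is to loop over the vertical levels and compare them one at a time. The sketch below is untested against this data and makes assumptions: `compare_by_level` is a hypothetical helper, and its arguments are expected to behave like xarray.DataArray objects (only `.sizes`, `.isel()`, and `.identical()` are used). It may keep peak memory closer to the single-level case, at the cost of one comparison per level.

```python
def compare_by_level(da_hist, da_ts, dim="z_t"):
    """Return the indices along `dim` where the two arrays are not identical.

    Both arguments are expected to behave like xarray.DataArray objects.
    """
    if da_hist.sizes[dim] != da_ts.sizes[dim]:
        raise ValueError(f"{dim} has different lengths in the two arrays")
    mismatched = []
    for k in range(da_hist.sizes[dim]):
        # Each iteration touches a single level, as in the isel(z_t=0) test
        if not da_hist.isel({dim: k}).identical(da_ts.isel({dim: k})):
            mismatched.append(k)
    return mismatched
```

In the loop above, `da_hist.identical(da_ts)` would become `not compare_by_level(da_hist, da_ts)`. An empty list means every level matched.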
