
Support for pickling #125

Open
norlandrhagen opened this issue Jan 8, 2025 · 6 comments

@norlandrhagen
Contributor

norlandrhagen commented Jan 8, 2025

Hi there,

@jbusecke and I were playing around with using obstore to read netCDF files into Xarray as part of an Apache Beam data ingestion pipeline. This requires serializing the objects that are passed between stages. We ran into an error when Beam tried to pickle a dataset, and we reproduced the behavior without Beam.

Anyhow, we're wondering whether it's possible to pickle obstore-backed Xarray datasets, or if this is a limitation we don't understand.

Create the example dataset

import xarray as xr 
ds = xr.tutorial.open_dataset('air_temperature')
ds.to_netcdf('air.nc')

This works with fsspec

import fsspec 
import xarray as xr 
import cloudpickle 

fs = fsspec.filesystem('local')
ds_fsspec = xr.open_dataset(fs.open('air.nc'), engine='h5netcdf', chunks={})

with open('fsspec.pkl', 'wb') as f:
    cloudpickle.dump(ds_fsspec, f)

This fails with obstore's fsspec wrapper

from obstore.fsspec import AsyncFsspecStore
from obstore.store import LocalStore 
from pathlib import Path
import xarray as xr 

store = LocalStore(prefix=Path("."))
fss = AsyncFsspecStore(store)
ds = xr.open_dataset(fss.open('air.nc'), engine='h5netcdf', chunks={})


import cloudpickle 
with open('obstore.pkl', 'wb') as f:
    cloudpickle.dump(ds, f)

TypeError: cannot pickle 'builtins.LocalStore' object

traceback:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 14
     12 import cloudpickle
     13 with open('obstore.pkl', 'wb') as f:
---> 14     cloudpickle.dump(ds, f)

File ~/miniforge3/envs/obstore/lib/python3.12/site-packages/cloudpickle/cloudpickle.py:1511, in dump(obj, file, protocol, buffer_callback)
   1498 def dump(obj, file, protocol=None, buffer_callback=None):
   1499     """Serialize obj as bytes streamed into file
   1500
   1501     protocol defaults to cloudpickle.DEFAULT_PROTOCOL which is an alias to
   (...)
   1509     next).
   1510     """
-> 1511     Pickler(file, protocol=protocol, buffer_callback=buffer_callback).dump(obj)

File ~/miniforge3/envs/obstore/lib/python3.12/site-packages/cloudpickle/cloudpickle.py:1295, in Pickler.dump(self, obj)
   1293 def dump(self, obj):
   1294     try:
-> 1295         return super().dump(obj)
   1296     except RuntimeError as e:
   1297         if len(e.args) > 0 and "recursion" in e.args[0]:

TypeError: cannot pickle 'builtins.LocalStore' object

We get the same error when trying to pickle the LocalStore(prefix=Path(".")) object.

Thanks!

  • Raphael & Julius

cc @sharkinsspatial

@kylebarron
Member

No, pickling isn't currently supported.

It looks like pickling is generally possible to implement via the __getstate__ and __setstate__ dunders (https://github.com/pola-rs/polars/pull/1119/files#diff-dba26b8ff6a62c615f608293aa15bccf907a10e949f69082ab5f5678ceb6582dR1145-R1157).

However, ObjectStore doesn't provide any serialization methods for its instances, because that would leak a lot of internal implementation details.

It is possible to take a builder and access its string-compatible key-value config. However, not all config parameters can be represented as strings; e.g. the RetryConfig isn't accessible from the builder, nor possible to set via a string.

So I'd say that pickling is out of scope for this library. I don't see any easy way to implement it short of re-implementing all the configuration that object-store already has internally.
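For context, the dunder-based approach could look something like this minimal Python sketch. `ReconstructableStore`, its `config` dict, and the stand-in "native handle" are all hypothetical, not obstore API; the point is just that __getstate__ persists only plain data and __setstate__ rebuilds the unpicklable part:

```python
import pickle


class ReconstructableStore:
    """Hypothetical wrapper: rebuild an unpicklable native handle from config."""

    def __init__(self, config):
        self.config = config      # plain, picklable data
        self._native = object()   # stand-in for an unpicklable native store handle

    def __getstate__(self):
        # Persist only the config; drop the native handle.
        return {"config": self.config}

    def __setstate__(self, state):
        self.config = state["config"]
        self._native = object()   # recreate the native handle from config


store = ReconstructableStore({"prefix": "."})
restored = pickle.loads(pickle.dumps(store))
```

The catch described above is exactly that obstore's stores don't expose enough config to do this rebuild step faithfully.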

@kylebarron kylebarron changed the title Usage question around pickling Support for pickling Jan 8, 2025
@kylebarron
Member

I think the most likely approach to support pickling would be to have a separate class like S3StoreConfig that manages a HashMap<AmazonS3ConfigKey, String>, and that could be serializable. But again it wouldn't be able to accurately persist all configuration values, just the ones representable by strings. And even then, we wouldn't be able to persist client options, retry options, or encryption options.

I'm not sure if that half-baked solution would be good enough, or more trouble than it's worth.
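A rough Python analogue of that idea: `S3StoreConfig` below is a hypothetical stand-in for a class managing a `HashMap<AmazonS3ConfigKey, String>`, not anything obstore ships. It is trivially picklable precisely because it only holds string key-value pairs, which also illustrates the limitation: retry, client, and encryption options have nowhere to live.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class S3StoreConfig:
    """Hypothetical string-only config, analogous to HashMap<AmazonS3ConfigKey, String>."""

    options: dict[str, str] = field(default_factory=dict)

    def build(self):
        # A real implementation would construct the native store here;
        # this sketch just returns the options to show what round-trips.
        return dict(self.options)


cfg = S3StoreConfig({"bucket": "my-bucket", "region": "us-east-1"})
restored = pickle.loads(pickle.dumps(cfg))  # only string settings survive
```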

@jbusecke

jbusecke commented Jan 9, 2025

Thanks for the detailed explanation @kylebarron! I'll have to digest this for a bit, and I'm wondering what feasible directions there are for pangeo-forge given this constraint...

Generally, I would think this can be closed as an issue, if you agree?

@kylebarron
Member

We can leave it open; a few people have asked about it already.

@maxrjones
Member

Some form of pickling will also be required for obstore-backed Zarr storage (as noted in the description of the failing tests in zarr-developers/zarr-python#1661 (comment)). Joe Hamman mentioned having a vision for an MVP for serializing Zarr.Storage.ObjectStore at the Pangeo hack day. @jhamman, do you mind sharing whether pickling could be implemented in zarr-developers/zarr-python#1661 rather than in obstore itself? I think that would at least allow NetCDF -> VirtualiZarr -> Obstore + Zarr -> Dask/Beam, even if NetCDF -> Obstore + h5netcdf -> Dask/Beam isn't practical.

@jhamman

jhamman commented Jan 9, 2025

It shouldn't be a problem to do this in Zarr. We just need to package up the config needed to recreate the ObjectStore class in __getstate__ and __setstate__. I'll leave some comments over in the Zarr PR.
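A minimal sketch of what that could look like on the Zarr side, assuming a hypothetical wrapper that pickles a factory plus its kwargs instead of the live store (none of these names are real zarr-python or obstore API):

```python
import pickle


class ZarrObjectStoreWrapper:
    """Hypothetical Zarr store wrapper: pickle the recipe, not the live store."""

    def __init__(self, store_factory, **kwargs):
        self._factory = store_factory
        self._kwargs = kwargs
        self._store = store_factory(**kwargs)  # unpicklable backing object

    def __getstate__(self):
        # Persist the factory and its kwargs; the live store is rebuilt later.
        return {"factory": self._factory, "kwargs": self._kwargs}

    def __setstate__(self, state):
        self._factory = state["factory"]
        self._kwargs = state["kwargs"]
        self._store = self._factory(**self._kwargs)


def make_store(prefix="."):
    return object()  # stand-in for an unpicklable native store like LocalStore


zstore = ZarrObjectStoreWrapper(make_store, prefix="/data")
restored = pickle.loads(pickle.dumps(zstore))
```

This works as long as the kwargs themselves are picklable, which is the same string-representable-config constraint discussed above.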
