Support obstore as storage for df.to_parquet() #164
Comments
A PR for this would be great! cc @martindurant who implemented it in #63 |
Does obstore support multi-part uploads, or would whole files need to be buffered in memory before a call to put()? |
It uses multipart uploads automatically (there's some mention of this in https://developmentseed.org/obstore/latest/api/put/) (unless You can pass a sync or async iterator to |
That's not much use from the file API, I'm afraid - you need to be able to write, send some data and return; so the iterator would be the wrong way around. |
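To illustrate the "wrong way around" point: a writer such as to_parquet pushes chunks by calling write(), while an iterator-based upload pulls chunks. Bridging the two without buffering the whole file needs something like a bounded queue and a second thread. The sketch below uses only the standard library; QueueWriter, fake_put and upload_via_queue are hypothetical names, and fake_put is a stand-in for an uploader that consumes an iterator, not obstore's actual put().

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


class QueueWriter:
    """Hypothetical sink: write() pushes chunks onto a bounded queue."""

    def __init__(self, maxsize=4):
        # A bounded queue applies back-pressure, so at most `maxsize`
        # chunks are held in memory at any time.
        self._q = queue.Queue(maxsize=maxsize)

    def write(self, data):
        self._q.put(bytes(data))
        return len(data)

    def close(self):
        self._q.put(_DONE)

    def __iter__(self):
        # The consuming side pulls chunks until the sentinel arrives.
        while True:
            item = self._q.get()
            if item is _DONE:
                return
            yield item


def fake_put(chunks):
    """Stand-in for an uploader that pulls from an iterator (assumption)."""
    return b"".join(chunks)


def upload_via_queue(produce):
    """Run the pulling uploader in a thread while `produce` pushes writes."""
    writer = QueueWriter()
    result = {}
    worker = threading.Thread(target=lambda: result.update(body=fake_put(writer)))
    worker.start()
    try:
        produce(writer)
    finally:
        writer.close()
    worker.join()
    return result["body"]
```

This works, but it costs a thread per open file, which is part of why a writable file API is the more natural fit here.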
Hi @kylebarron, I tried it out using the iterator. However, just as @martindurant mentioned, the whole file will be buffered in memory, which is not ideal. The Rust part probably needs to be updated to support this. |
At the moment, obstore doesn't implement a file API at all, I guess this is in the name. As well as buffering in memory, you could also use fsspec's |
obstore implements a file API for reading but it's not yet implemented for writing. See https://developmentseed.org/obstore/v0.3.0/api/file/ I'm open to having a similar API for writing that wraps https://docs.rs/object_store/latest/object_store/buffered/struct.BufWriter.html |
I'm not sure I fully understand this use case, but if the iterator is the wrong way around, then maybe instead of a "push-based" iterator, you could pass in a "pull-based" file-like object? That's already supported. |
Hi, Please let me know if I have misunderstood anything. Thanks! |
Ah that makes sense (sorry, going through my notifications while pretty tired). I think the easiest way to solve this is to implement the writable file API that https://docs.rs/object_store/latest/object_store/buffered/struct.BufWriter.html implements. (It should be pretty similar to the existing readable file API) |
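As a rough picture of what a writable file API could look like from the Python side, here is a minimal sketch of a buffered writer that accumulates bytes and hands off fixed-size parts as soon as they are full, so the whole file never sits in memory. It is not obstore's or BufWriter's implementation: BufferedPartWriter, upload_part and complete are hypothetical names, and the two callbacks stand in for whatever multipart calls the real backend would make.

```python
import io


class BufferedPartWriter(io.RawIOBase):
    """Hypothetical writable file: flushes full parts, finishes on close()."""

    def __init__(self, upload_part, complete, part_size=5 * 1024 * 1024):
        super().__init__()
        self._upload_part = upload_part  # called with each part's bytes
        self._complete = complete        # called once after the last part
        self._part_size = part_size
        self._buf = bytearray()

    def writable(self):
        return True

    def write(self, data):
        self._buf += data
        # Send every full part immediately and keep only the remainder.
        while len(self._buf) >= self._part_size:
            self._upload_part(bytes(self._buf[: self._part_size]))
            del self._buf[: self._part_size]
        return len(data)

    def close(self):
        if self.closed:
            return
        # Flush the final, possibly short, part and finish the upload.
        if self._buf:
            self._upload_part(bytes(self._buf))
            self._buf.clear()
        self._complete()
        super().close()
```

Because it subclasses io.RawIOBase it can be used in a with block, which is how a caller like pandas would typically drive it: write, write, ..., close.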
#167 is a stub that adds writable file support. We basically just need constructors to create the |
Also, once we get this working, we should add an example to the docs that it works with pandas (I assume your |
Currently, I'm facing a small problem. What s3fs and gcsfs do is split the path into bucket name and path in every function that uses it. Should we do something like that? And where should we add split_path, in fsspec.py or in Rust? |
Can we add that as a helper function in the |
The filesystems here are not yet registered with fsspec, so you can only use them explicitly. To use them with fsspec.open, you would, yes, need to split off the bucket part to pass to obstore, but also map the URL protocols to implementations. |
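For reference, a minimal sketch of the kind of split_path helper being discussed, assuming paths look like "s3://bucket/key" or "bucket/key". The three-part return value (protocol, bucket, key) is an assumption for illustration, not the signature that ended up in any PR.

```python
def split_path(path):
    """Split 's3://bucket/key' or 'bucket/key' into (protocol, bucket, key).

    Hypothetical helper; protocol is None when the path has no '://'.
    """
    protocol, sep, rest = path.partition("://")
    if not sep:
        # No protocol prefix: treat the whole string as 'bucket/key'.
        protocol, rest = None, path
    bucket, _, key = rest.lstrip("/").partition("/")
    return protocol, bucket, key
```

The protocol part is what would be mapped to an implementation, and the bucket part is what would be passed on to the store.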
I've added the |
Where is the PR? I can have a look. It depends on whether we want to allow creating, say, an "obs-s3" instance which doesn't yet have an obstore instance with bucket configured, but creates them on demand. I assume your _open does something like this, so that if there were an fsspec.open() with a list of various buckets, you would end up making a bunch of obstore instances all at once. |
We support that for most URL protocols now: https://developmentseed.org/obstore/v0.4.0-beta.1/api/store/#obstore.store.from_url
It's the same one: #165
That could make sense. We could store a cache of |
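A minimal sketch of the on-demand idea, assuming one store per (protocol, bucket) pair: StoreCache and make_store are hypothetical names, and make_store stands in for whatever actually constructs the store (possibly something built on the from_url function linked above, which is not exercised here).

```python
class StoreCache:
    """Hypothetical cache that creates one store per (protocol, bucket)."""

    def __init__(self, make_store):
        # make_store(protocol, bucket) is an assumed factory callable.
        self._make_store = make_store
        self._stores = {}

    def get(self, protocol, bucket):
        key = (protocol, bucket)
        if key not in self._stores:
            # First access for this bucket: build the store lazily.
            self._stores[key] = self._make_store(protocol, bucket)
        return self._stores[key]
```

With this shape, an fsspec.open() over paths in several buckets would create each store the first time its bucket is seen and reuse it afterwards.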
The way I use fsspec with obstore is to create the store instance first and pass the store instance to
import fsspec
from obstore.fsspec import AsyncFsspecStore
from obstore.store import S3Store

# Create the store for one bucket up front
store = S3Store.from_env(
    "my-s3-bucket",
    endpoint="http://localhost:30002",
    access_key_id="minio",
    secret_access_key="miniostorage",
    virtual_hosted_style_request=False,  # do not include the bucket name in the domain name
    client_options={"timeout": "99999s", "allow_http": "true"},
)
# Register the obstore-backed filesystem for the "s3" protocol and hand it the store
fsspec.register_implementation("s3", AsyncFsspecStore)
obstore_fs = fsspec.filesystem("s3", store=store)
path = "s3://my-s3-bucket/uploaded/file.txt"
with obstore_fs.open(path, "wb") as f:
    f.write(b"something")
I think creating an instance in AsyncFsspecStore is better. I will have a look at how to implement this and maybe open a new PR? Adding it to #165 may make that PR too large. |
Hi @kylebarron @martindurant , I just created another PR here: #198, which use Thanks! |
Currently, using obstore's AsyncFsspecStore for df.to_parquet() does not work (no error, but also no files uploaded). It seems like we only need to add a write() method to the BufferedFileSimple class and use obstore.put() there.
obstore/obstore/python/obstore/fsspec.py
Lines 177 to 186 in b40d59b
I would like to ask whether there is any existing progress on this. If not, I can help work on it.