Replies: 8 comments 1 reply
-
Some answers to the first part
Indeed, the interface has gone through various changes, and most of what you describe is a failure of the docs to keep up. We no longer attempt to support async use in a non-async context or vice versa; we no longer support setting the loop to anything other than the current one (in a coroutine). These changes came in piecemeal and were implemented differently in different backends.
-
True. These file-like classes are supposed to look like ordinary (synchronous) Python file objects. There should be an AsyncGCSFile for streamed reading, which would look very much the same as the http version, since both use aiohttp.
What does this workflow look like? Do you know a priori the ranges you will need, or is it more random within chunks of the file? If the latter, it's a pretty hard problem.
-
dask does this routinely. The async backend work happens in its own dedicated IO thread in this case, and other threads wait on it. So hogging the GIL is not a problem during IO-dominated work, but doing IO while another thread does CPU-bound work could be.
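A minimal sketch of that pattern in general terms (not dask's or fsspec's actual code): a dedicated thread owns the event loop, and worker threads block on futures submitted to it.

```python
import asyncio
import threading

# Dedicated IO thread: owns the event loop and runs it forever.
io_loop = asyncio.new_event_loop()
threading.Thread(target=io_loop.run_forever, daemon=True).start()

async def fetch(i):
    # Stand-in for an async call such as fs._cat_file(...).
    await asyncio.sleep(0.1)
    return f"chunk-{i}"

def blocking_fetch(i):
    # Callable from any worker thread; blocks that thread only, while the
    # IO loop keeps servicing other coroutines concurrently.
    return asyncio.run_coroutine_threadsafe(fetch(i), io_loop).result()

print([blocking_fetch(i) for i in range(3)])
```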
-
Have you seen BackgroundBlockCache (in fsspec.caching)?
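Selecting it goes through the normal open() interface, roughly like this (the registered cache name and block size are from memory, so treat them as assumptions):

```python
import gcsfs

fs = gcsfs.GCSFileSystem()
# "background" should map to BackgroundBlockCache, which prefetches the
# next block in a helper thread while the current block is being consumed.
with fs.open("my-bucket/huge-file.bin", "rb",
             cache_type="background", block_size=32 * 2**20) as f:
    header = f.read(4096)
```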
-
In this context, this means an individual file has concrete state (current position) but write() operations on it are async. Since the calls to the remote store are already async (initiate upload, upload chunk, finalise), it wouldn't be too hard to have multiple files writing concurrently. However, a truly streaming API is different: aiohttp doesn't have push-style async, only pull (i.e., you supply an async generator), so there can be no true push-style async write.
By the way, put_file could be made into concurrent requests (since we know the input size and local seeking is cheap); I think the API allows it, but with block sizes big enough, we suppose the latency becomes unimportant.
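A rough sketch of the "multiple files writing concurrently" case (paths are made up; error handling omitted):

```python
import asyncio
import gcsfs

async def upload_all(pairs):
    fs = gcsfs.GCSFileSystem(asynchronous=True)
    # Each upload is itself a sequence of async calls (initiate, upload
    # chunks, finalise); gather lets several files proceed concurrently.
    await asyncio.gather(
        *(fs._put_file(local, remote) for local, remote in pairs)
    )

# asyncio.run(upload_all([("a.bin", "bucket/a.bin"), ("b.bin", "bucket/b.bin")]))
```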
-
I'm not convinced that this is necessary
Correct.
-
(I'm happy to talk about this live to make a work plan)
-
Hey @martindurant, sorry for the delayed response! Let's chat. What's the best way to arrange this? I can contact you via LinkedIn.
-
Hi @martindurant 👋🏼
Thanks for all your hard work on the `fsspec` ecosystem. I have been using it a lot and dug into a lot of the source code. I have some advanced use cases, and it makes life so much better!

I am working on a library to read/write and parse structured binary data (SEG-Y, i.e. seismology data) from object stores using `fsspec` and `gcsfs`. I want to have full async compatibility in my API (to be used in async web backends) and started looking into the async implementation in `gcsfs`. During development, I encountered limitations and UX issues with the async APIs, caching, and semi-random file operations. My library is not open source yet, but it will be soon, and it involves various access patterns. I can give you some reproducible code anytime; there are open-source datasets on an HTTP file system. However, it may not be an apples-to-apples comparison with `gcsfs`.

Problem Definition
I will break this down into sections.

`GCSFileSystem` async behavior is ambiguous and not well-documented

The documentation states that we should set the `asynchronous` flag to `True` or `False` depending on whether we are going to use the `GCSFileSystem` in an async context. However, the async functions with the `_*` prefix seem to work even when the `asynchronous` flag is set to `False`. Both of the code snippets below work.
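Something like the following (a minimal sketch; the object path is a placeholder and credential/session setup is omitted):

```python
import asyncio
import gcsfs

PATH = "my-bucket/data/part-000.bin"  # placeholder object

async def flag_off():
    fs = gcsfs.GCSFileSystem()  # asynchronous defaults to False
    return await fs._cat_file(PATH)

async def flag_on():
    fs = gcsfs.GCSFileSystem(asynchronous=True)
    return await fs._cat_file(PATH)

# Both complete without complaint when run inside an event loop:
# asyncio.run(flag_off())
# asyncio.run(flag_on())
```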
Also, `GCSFileSystem` doesn't have an awaitable `_connect*` method, even though the source code refers to one when raising an exception. And it appears that setting `asynchronous` to `True` sets the `_loop` attribute of the file system to `None` in the `__init__` of `AsyncFileSystem`. What loop are we using then?

In summary, how `GCSFileSystem` creates/re-uses the `loop` when the async flag is on/off is unclear. This could be more of a documentation enhancement topic.

`GCSFile` doesn't support `async`
It is stated in the documentation that opening a file asynchronously should be done after fetching the file to a local store. This is fair for smaller files, but I am working with 2 TB files on the object store, so that is not feasible. This is where an `AsyncStreamedFile` with a caching implementation becomes useful.

In my library, I need both the cached and the async byte-range fetching features of `gcsfs`. Unfortunately:
- `GCSFile` objects don't support concurrent byte-range requests or reads.
- I can use `GCSFileSystem`'s `_cat_ranges` in this context to fetch concurrently, but caching is unavailable there (see the sketch below).
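For example, the concurrent-but-uncached path looks roughly like this (path and byte ranges are placeholders):

```python
import asyncio
import gcsfs

async def fetch_ranges():
    fs = gcsfs.GCSFileSystem(asynchronous=True)
    path = "my-bucket/huge-file.sgy"   # placeholder 2 TB object
    starts = [0, 3600, 1_000_000]      # placeholder byte offsets
    ends = [240, 3840, 1_000_240]
    # One request per (start, end) pair, issued concurrently; nothing is
    # cached, so re-reading a nearby range goes back to the object store.
    return await fs._cat_ranges([path] * len(starts), starts, ends)

# chunks = asyncio.run(fetch_ranges())
```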
Can't use caching without `open`ing a file, making some operations slow

As mentioned above, I can read with `GCSFileSystem`'s `_*` async methods. However, the caching is part of `GCSFile` and unavailable at this level, so I have to choose between concurrent reads and caching. The cases with small, semi-random reads of close-together byte ranges (but random access overall) could benefit significantly from caching.
Proposed Solution(s)

I am sure I have many blind spots, so before considering these solutions, do you have any recommendations to make my use case possible? I haven't tried it, but would something like using the sync parts of the `gcsfs` library with asyncio + threads from my client code solve this (see the sketch below)? I am afraid this may not work due to race conditions in caching (`lru_cache` claims to be thread-safe, but the other cache implementations don't seem to be) and blocking code that might hog the GIL, etc. Or maybe there are much better solutions out there? Another user-side option may be to implement a separate cache that works with `GCSFileSystem` + async.
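The client-side workaround I have in mind is roughly this (a sketch only; whether the caches tolerate it is exactly my question):

```python
import asyncio
import gcsfs

def read_block(path: str, offset: int, length: int) -> bytes:
    # Plain blocking gcsfs read, which keeps GCSFile's caching.
    fs = gcsfs.GCSFileSystem()
    with fs.open(path, "rb", cache_type="readahead") as f:
        f.seek(offset)
        return f.read(length)

async def read_block_async(path: str, offset: int, length: int) -> bytes:
    # Run the blocking read on a worker thread so the event loop stays free.
    return await asyncio.to_thread(read_block, path, offset, length)
```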
Besides that, here are my thoughts on a more unified feature set that could potentially be used with other `fsspec` implementations.

Implement `GCSAsyncStreamedFile`
The reference HTTP implementation and `s3fs` appear to implement this; could we do it in a similar way here? This will not get around the caching issues. However, please take a look at the section below.
Implement a cache based on aio-libs/async-lru

aio-libs/async-lru implements an async version of `functools.lru_cache`. We could use this to implement async versions of some of the existing caching algorithms (rough sketch below). If I understand correctly, the caching implementations live in `fsspec` and must be done in that project.
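Something like this is what I mean (the names, block size, and block-keyed signature are all mine, just to show the shape):

```python
import asyncio
from async_lru import alru_cache
import gcsfs

fs = gcsfs.GCSFileSystem(asynchronous=True)
BLOCK = 8 * 2**20  # 8 MiB blocks

@alru_cache(maxsize=32)
async def read_block(path: str, block_number: int) -> bytes:
    start = block_number * BLOCK
    return await fs._cat_file(path, start=start, end=start + BLOCK)

async def read_range(path: str, start: int, end: int) -> bytes:
    # Fetch the blocks covering [start, end) concurrently; repeated or
    # overlapping requests are served from the async LRU cache.
    blocks = range(start // BLOCK, (end - 1) // BLOCK + 1)
    parts = await asyncio.gather(*(read_block(path, b) for b in blocks))
    base = blocks[0] * BLOCK
    return b"".join(parts)[start - base : end - base]
```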
Resources

I can spend some time implementing a proof-of-concept `GCSAsyncStreamedFile` and an initial `AsyncReadAheadCache`; however, I may need some guidance/help with testing, documentation, and the structure of the code to be implemented. I also haven't given much thought to file writing, so I may need some guidance there as well.
Cheers!