Replies: 8 comments 1 reply
-
Some answers to the first part
Indeed, the interface has gone through various changes, and most of what you describe is a failure of the docs to keep up. We no longer attempt to support async use in a non-async context or vice versa; we no longer support setting the loop to anything other than the current one (in a coroutine). These changes came in piecemeal and were implemented differently in different backends.
-
True. These file-like classes are supposed to look like ordinary (synchronous) Python file objects. There should be an AsyncGCSFile for streamed reading, which would look very much the same as the http version, since both use aiohttp.
What does this workflow look like? Do you know a priori the ranges you will need, or is it more random within chunks of the file? If the latter, it's a pretty hard problem.
-
dask does this routinely. The async backend work happens in its own dedicated IO thread in this case, and other threads wait on it. So hogging the GIL is not a problem during IO-dominated work, but doing IO while another thread does CPU-bound work could be.
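A minimal sketch of that pattern in general terms (not dask's or fsspec's actual code): a dedicated thread owns the event loop, and worker threads block on futures submitted to it.

```python
import asyncio
import threading

# Dedicated IO thread: owns the event loop and runs it forever.
io_loop = asyncio.new_event_loop()
threading.Thread(target=io_loop.run_forever, daemon=True).start()

async def fetch(i):
    # Stand-in for an async call such as fs._cat_file(...).
    await asyncio.sleep(0.1)
    return f"chunk-{i}"

def blocking_fetch(i):
    # Callable from any worker thread; blocks that thread only, while the
    # IO loop keeps servicing other coroutines concurrently.
    return asyncio.run_coroutine_threadsafe(fetch(i), io_loop).result()

print([blocking_fetch(i) for i in range(3)])
```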
-
Have you seen BackgroundBlockCache (in fsspec.caching)?
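Selecting it goes through the normal open() interface, roughly like this (the registered cache name and block size are from memory, so treat them as assumptions):

```python
import gcsfs

fs = gcsfs.GCSFileSystem()
# "background" should map to BackgroundBlockCache, which prefetches the
# next block in a helper thread while the current block is being consumed.
with fs.open("my-bucket/huge-file.bin", "rb",
             cache_type="background", block_size=32 * 2**20) as f:
    header = f.read(4096)
```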
-
In this context, this means an individual file has concrete state (current position) but write() operations on it are async. Since the calls to the remote store are already async (initiate upload, upload chunk, finalise), it wouldn't be too hard to have multiple files writing concurrently. However, a truly streaming API is different: aiohttp doesn't have push-style async, only pull (i.e., you supply an async generator), so there can be no true push-style async write.
By the way, put_file could be made into concurrent requests (since we know the input size and local seeking is cheap); I think the API allows it, but with block sizes big enough, we suppose the latency becomes unimportant.
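A rough sketch of the "multiple files writing concurrently" case (paths are made up; error handling omitted):

```python
import asyncio
import gcsfs

async def upload_all(pairs):
    fs = gcsfs.GCSFileSystem(asynchronous=True)
    # Each upload is itself a sequence of async calls (initiate, upload
    # chunks, finalise); gather lets several files proceed concurrently.
    await asyncio.gather(
        *(fs._put_file(local, remote) for local, remote in pairs)
    )

# asyncio.run(upload_all([("a.bin", "bucket/a.bin"), ("b.bin", "bucket/b.bin")]))
```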
-
I'm not convinced that this is necessary
Correct.
-
(I'm happy to talk about this live to make a work plan)
-
Hey @martindurant, sorry for the delayed response! Let's chat. What's the best way to arrange this? I can contact you via LinkedIn.
-
Hi @martindurant 👋🏼
Thanks for all your hard work on the `fsspec` ecosystem. I have been using it a lot and dug into a lot of the source code. I have some advanced use cases, and it makes life so much better!

I am working on a library to read/write and parse structured binary data (SEG-Y, i.e. seismology data) from object stores using `fsspec` and `gcsfs`. I want to have full async compatibility in my API (to be used in async web backends) and started looking into the async implementation in `gcsfs`. During development, I encountered limitations and UX issues with the async APIs, caching, and semi-random file operations. My library is not open source yet, but it will be soon, and it involves various access patterns. I can give you some reproducible code anytime; there are open-source datasets on an HTTP file system. However, it may not be an apples-to-apples comparison with `gcsfs`.

Problem Definition
I will break this down into sections.

`GCSFileSystem` async behavior is ambiguous and not well-documented

The documentation states that we should set the `asynchronous` flag to `True` or `False` depending on whether we are going to use the `GCSFileSystem` in an async context. However, the async functions with the `_*` prefix seem to work even when the `asynchronous` flag is set to `False`. Both of the code snippets below work.
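Something like the following (a minimal sketch; the object path is a placeholder and credential/session setup is omitted):

```python
import asyncio
import gcsfs

PATH = "my-bucket/data/part-000.bin"  # placeholder object

async def flag_off():
    fs = gcsfs.GCSFileSystem()  # asynchronous defaults to False
    return await fs._cat_file(PATH)

async def flag_on():
    fs = gcsfs.GCSFileSystem(asynchronous=True)
    return await fs._cat_file(PATH)

# Both complete without complaint when run inside an event loop:
# asyncio.run(flag_off())
# asyncio.run(flag_on())
```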
Also, `GCSFileSystem` doesn't have an awaitable `_connect*` method, even though the source code refers to one when raising an exception. And it appears that setting `asynchronous` to `True` sets the `_loop` attribute of the file system to `None` in the `__init__` of `AsyncFileSystem`. What loop are we using then?

In summary, how `GCSFileSystem` creates/re-uses the `loop` when the async flag is on/off is unclear. This could be more of a documentation enhancement topic.

`GCSFile` doesn't support `async`
It is stated in the documentation that opening a file asynchronously should be done after fetching the file to a local store. This is fair for smaller files, but I am working with 2 TB files on the object store, so that is not feasible. This is where an `AsyncStreamedFile` with a caching implementation becomes useful.

In my library, I need both the cached and the async byte-range fetching features of `gcsfs`. Unfortunately:
- `GCSFile` objects don't support concurrent byte-range requests or reads.
- I can use `GCSFileSystem`'s `_cat_ranges` in this context to fetch concurrently, but caching is unavailable there (see the sketch below).
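For example, the concurrent-but-uncached path looks roughly like this (path and byte ranges are placeholders):

```python
import asyncio
import gcsfs

async def fetch_ranges():
    fs = gcsfs.GCSFileSystem(asynchronous=True)
    path = "my-bucket/huge-file.sgy"   # placeholder 2 TB object
    starts = [0, 3600, 1_000_000]      # placeholder byte offsets
    ends = [240, 3840, 1_000_240]
    # One request per (start, end) pair, issued concurrently; nothing is
    # cached, so re-reading a nearby range goes back to the object store.
    return await fs._cat_ranges([path] * len(starts), starts, ends)

# chunks = asyncio.run(fetch_ranges())
```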
Can't use caching without `open`ing a file, making some operations slow

As mentioned above, I can read with `GCSFileSystem`'s `_*` async methods. However, the caching is part of `GCSFile` and unavailable at this level, so I have to choose between concurrent reads and caching. The cases with small, semi-random reads of close-together byte ranges (but random access overall) could benefit significantly from caching.
Proposed Solution(s)

I am sure I have many blind spots, so before considering these solutions, do you have any recommendations to make my use case possible? I haven't tried it, but would something like using the sync parts of the `gcsfs` library with asyncio + threads from my client code solve this (see the sketch below)? I am afraid this may not work due to race conditions in caching (`lru_cache` claims to be thread-safe, but the other cache implementations don't seem to be) and blocking code that might hog the GIL, etc. Or maybe there are much better solutions out there? Another user-side option may be to implement a separate cache that works with `GCSFileSystem` + async.
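The client-side workaround I have in mind is roughly this (a sketch only; whether the caches tolerate it is exactly my question):

```python
import asyncio
import gcsfs

def read_block(path: str, offset: int, length: int) -> bytes:
    # Plain blocking gcsfs read, which keeps GCSFile's caching.
    fs = gcsfs.GCSFileSystem()
    with fs.open(path, "rb", cache_type="readahead") as f:
        f.seek(offset)
        return f.read(length)

async def read_block_async(path: str, offset: int, length: int) -> bytes:
    # Run the blocking read on a worker thread so the event loop stays free.
    return await asyncio.to_thread(read_block, path, offset, length)
```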
Besides that, here are my thoughts on a more unified feature set that could potentially be used with other `fsspec` implementations.

Implement `GCSAsyncStreamedFile`
The reference HTTP implementation and `s3fs` appear to implement this; could we do it in a similar way here? This will not get around the caching issues. However, please take a look at the section below.
Implement a cache based on aio-libs/async-lru

aio-libs/async-lru implements an async version of `functools.lru_cache`. We could use this to implement async versions of some of the existing caching algorithms (rough sketch below). If I understand correctly, the caching implementations live in `fsspec` and must be done in that project.
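Something like this is what I mean (the names, block size, and block-keyed signature are all mine, just to show the shape):

```python
import asyncio
from async_lru import alru_cache
import gcsfs

fs = gcsfs.GCSFileSystem(asynchronous=True)
BLOCK = 8 * 2**20  # 8 MiB blocks

@alru_cache(maxsize=32)
async def read_block(path: str, block_number: int) -> bytes:
    start = block_number * BLOCK
    return await fs._cat_file(path, start=start, end=start + BLOCK)

async def read_range(path: str, start: int, end: int) -> bytes:
    # Fetch the blocks covering [start, end) concurrently; repeated or
    # overlapping requests are served from the async LRU cache.
    blocks = range(start // BLOCK, (end - 1) // BLOCK + 1)
    parts = await asyncio.gather(*(read_block(path, b) for b in blocks))
    base = blocks[0] * BLOCK
    return b"".join(parts)[start - base : end - base]
```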
Resources

I can spend some time implementing a proof-of-concept `GCSAsyncStreamedFile` and an initial `AsyncReadAheadCache`; however, I may need some guidance/help with testing, documentation, and the structure of the code to be implemented. I also haven't given much thought to file writing, so I may need some guidance there as well.
Cheers!