When a large ndarray is stored as a binary block with compression, the whole block (or at least its beginning) needs to be read and decompressed even when only a small subarray is requested. "Chunking" remedies this: instead of storing an ndarray as a single binary block, it is stored as a set of smaller blocks that are compressed and stored independently.
Are there plans to support this? Can this be implemented as an extension?
One simple approach would be to introduce a new YAML tag core/chunked-ndarray whose value is a sequence of chunks, each pairing an offset with an ndarray, for example:
chunky: !core/chunked-ndarray-1.0.0
  - !core/ndarray-chunk-1.0.0
    offset: [0, 0]
    data: !core/ndarray-1.0.0
      source: ...  # the usual ndarray stuff here
  - !core/ndarray-chunk-1.0.0
    offset: [100, 0]
    data: !core/ndarray-1.0.0
      source: ...  # the usual ndarray stuff here
  - !core/ndarray-chunk-1.0.0
    offset: [0, 100]
    data: !core/ndarray-1.0.0
      source: ...  # the usual ndarray stuff here
  # possibly more chunks here
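To make the proposed layout concrete, a reader for such a tag could reassemble the full array from the (offset, data) pairs. A minimal NumPy sketch (the function name and chunk representation are hypothetical, not part of any existing ASDF API):

```python
import numpy as np

def assemble_chunks(shape, chunks, dtype=float):
    """Reassemble a full array from (offset, data) chunk pairs.

    `chunks` is a list of (offset, ndarray) pairs, mirroring the
    hypothetical core/ndarray-chunk entries sketched above.
    """
    full = np.zeros(shape, dtype=dtype)
    for offset, data in chunks:
        # Place each chunk at its offset within the full array.
        index = tuple(slice(o, o + n) for o, n in zip(offset, data.shape))
        full[index] = data
    return full

# Three 100x100 chunks at the offsets from the example above.
chunks = [
    ((0, 0), np.ones((100, 100))),
    ((100, 0), np.full((100, 100), 2.0)),
    ((0, 100), np.full((100, 100), 3.0)),
]
full = assemble_chunks((200, 200), chunks)
```

A real implementation would of course read (and decompress) only the chunks overlapping the requested subarray, rather than materializing the whole array as done here.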
Has there been any work in this direction?
There has been some work adding support for the zarr storage format within ASDF. This is implemented via an extension: https://github.com/asdf-format/asdf-zarr It's a new package, so please let me know if it's something you plan to use "in production" (so we can give it another review); also, feel free to give it a try and open issues if you find anything. The extension offers a few options:
storing the zarr data inside ASDF blocks (one chunk per block; I think this is most similar to what you described)
referencing external zarr storage (a DirectoryStore of "flat files", an S3 store, or any of the many store formats zarr supports).
The use of zarr also opens up a second place where compression can be controlled (which can get a bit confusing).
@braingram Nice! We are currently discussing storage formats, and both ASDF and Zarr are contenders that have various advantages and disadvantages. On the surface, using Zarr chunking with ASDF single-file storage seems like an excellent choice. I will have a look.