Replies: 7 comments 9 replies
-
cc @zarr-developers/python-core-devs
-
um, shards???
-
I would be in favour of either a tuple-of-tuples (one tuple per dimension) (a-la dask), or an array of the same dimension as the zarr array where chunks[(i, j, k)] would contain the chunk shape of the chunk that is at position i along dim 0, j along dim 1, and k along dim 2. Both forms have the same information, so probably the former is better — we could provide function(s) for converting between the various forms. (And potentially, having chunks return a simple tuple-of-int for the most common use case of uniform chunk size might not be a bad idea. But I understand that it complicates the API significantly.)
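A minimal sketch of one such conversion, going from the per-axis form to the explicit shape of every chunk (the function name is illustrative, not an existing zarr-python API):

```python
from itertools import product

def explicit_chunk_shapes(per_axis_chunks):
    # Expand per-axis chunk sizes, e.g. ((10, 20), (10, 20)), into the
    # shape of each chunk in the grid, in row-major order.
    return list(product(*per_axis_chunks))
```

For `((10, 20), (10, 20))` this yields the four shapes of the 2×2 chunk grid: `(10, 10)`, `(10, 20)`, `(20, 10)`, `(20, 20)`.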
-
TIL 😅 Very handy! I actually have ~no experience with the filters and codecs APIs so I don't have useful input here... But seeing a diversity of example uses would help me form some ideas about common APIs...
-
For the record, ZEP0003 proposes chunks in the per-axis list style, as dask uses. If we want to change that, we should do so by updating the ZEP (and accepting it). I would vote to keep the current structure for simplicity, because it's what dask uses (one of the primary consumers), and because it leaves open the possibility of mixing regular and variable-length chunking for the axes of a given array.
-
This was essentially a convention and there was no technical difference between filters and compressors - the compressor object would be evaluated just the same (and can be used) as a filter, as if it were the last item in the filter list. I think we probably even have a "silent encoder" of numpy->bytes if the output of the last in the chain isn't already 1D, and a "silent decoder" if decoding yields bytes of the right number to fit in the output array, but I'm not certain on these points. The inputs and outputs are also "array-like" (i.e., python buffer protocol, effectively), so the real difference is the dimensionality requirement of the encoded and decoded data for each codec. That isn't defined anywhere except by trying it and in the codec documentation. Nevertheless, I think we can characterise all existing codecs and put them in the three V3 codec categories. There are not that many! For some unknown third-party codec appearing in entrypoints or registered at runtime (imagecodecs?), if the author doesn't have the time to provide the characterisation, we have options:
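To make the "compressor is just the last filter" point concrete, here is a rough numcodecs-style sketch of the v2 encode path (illustrative only, not the actual zarr-python internals):

```python
def encode_v2(chunk, filters, compressor):
    # Apply each filter in order; every codec exposes a numcodecs-style encode().
    buf = chunk
    for f in filters or []:
        buf = f.encode(buf)
    # The compressor behaves exactly like one more filter appended to the chain.
    if compressor is not None:
        buf = compressor.encode(buf)
    return buf
```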
-
I think the answer to this (partly) comes from whether v3 aims to be as backwards compatible with v2 as possible, or is being seen as a big break and a chance to introduce (well documented, but breaking) API changes. There seems to be a lot of back and forth on this in various issues, and it seems to me like it would be helpful for the core devs to discuss and make a decision on this.
-
I don't think we have settled the overall shape of the top-level `Array` object in `zarr-python` 3.x. This should be a priority for the stable release. So, I'm opening this discussion so people can suggest ideas / brainstorm / vent about how the `Array` object should look and behave in v3. I will open with a summary of the specific challenges we need to solve, and some ideas I have for each one.

### The `Array` API in 3.x

Here's an annotated outline of the shape of the `Array` object in v3 today. I'm just going to enumerate the properties of the class and ignore the methods for now. You can see the source code for this object here.

### v2-only attributes
The following attributes are present in the `zarr-python` 2.x `Array` class (source) but not present in the 3.x `Array` class. The reasons for these not being implemented in 3.x vary from "we haven't figured out v3 semantics for this" (`filters`, `compressor`, `synchronizer`), to "we haven't gotten around to it yet" (`write_empty_chunks`), and also "this triggers traversal over all the chunks and might not be a good idea for an array attribute" (`nchunks_initialized`):

- `chunk_store`
- `compressor`
- `filters`
- `synchronizer`
- `itemsize`
- `nbytes`
- `nbytes_stored`
- `cdata_shape`
- `nchunks`
- `nchunks_initialized`
- `is_view`
- `oindex`
- `vindex`
- `blocks`
- `write_empty_chunks`
- `meta_array`
I'm happy to discuss the v3 future for any one of these attributes. We may need to spin those discussions out into separate issues.

### specific challenges

I will enumerate some specific challenges with the `Array` API that we need to solve in 3.x.

### sharding

Zarr V3 introduces the possibility of creating sharded chunks, i.e. chunks that contain subchunks, each addressable as a contiguous byte range within the chunk. If you are reading from a sharded array, you will want to iterate over the subchunks. This means we need to make this property of an array simple to specify when creating an array, and simple to access when an array is already created.

Neither of these things is true today. We do not have an `Array` attribute that conveys the subchunk size. Instead, here is how you would get the subchunks of an array:
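(A rough sketch, assuming the subchunk shape can be read out of the array's v3 metadata document, where sharding appears as the `sharding_indexed` codec; `get_subchunks` is illustrative, not an established helper.)

```python
def get_subchunks(array_metadata: dict) -> tuple[int, ...] | None:
    # Return the subchunk (inner chunk) shape of a sharded array,
    # or None if no sharding codec is present.
    for codec in array_metadata.get("codecs", []):
        if codec.get("name") == "sharding_indexed":
            return tuple(codec["configuration"]["chunk_shape"])
    return None
```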
How should we make this subchunk information specifiable and accessible from the `Array` object? The simple solution would be to add an `Array.subchunks` attribute that uses the `get_subchunks` routine I sketched out, and to add a `subchunks` keyword argument to `Array.create`. Maybe people have other ideas, or proposals for a better name than "subchunks". Note that this `subchunks` property is defined inside the array serialization / compression routines, which are also specified in `Array.create`, so adding a `subchunks` keyword argument for array creation would impact other parts of the `Array.create` API.
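If we go that route, creation could look something like this (the `subchunks` keyword is the hypothetical addition under discussion; the other arguments mirror the existing creation API):

```python
import zarr

# Hypothetical: `subchunks=` does not exist today; it is the proposal above.
arr = zarr.create(
    shape=(10_000, 10_000),
    chunks=(1_000, 1_000),    # outer chunk (shard) shape
    subchunks=(100, 100),     # inner chunk shape within each shard
    dtype="float32",
)
```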
### `chunks` and `chunk_shape`

`Array.create` specifies the same information with two keyword arguments: `chunks` is a v2-specific argument, and `chunk_shape` is a v3-specific argument. We should pick one and use it for both v2 and v3. But see the next point.

### chunk grids
Zarr v2 uses a regular chunk grid, which means that all chunks are the same size, which means that a single chunk shape is a complete description of the chunk grid. Hence, in 2.x, `chunks` is just a tuple of ints. But in Zarr v3, the chunk grid is an extension point and there is an active proposal to add support for a rectilinear chunk grid, i.e. a chunk grid where the chunks do not have the same shape. In this case, there are two ways to specify the chunk shape: an explicit list of chunk sizes, `[(10,10), (20,10), (20,10), (20, 20)]`, or a list of chunk sizes per axis, `[(10, 20), (10, 20)]`.

So for zarr-python 3.x, we should have a plan for what the `chunks` attribute will look like for rectilinear chunk grids. We could also consider solving the sharding problem with the `chunks` attribute, e.g. by defining an object with specific attributes for chunks (shards) and subchunks.
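One hedged sketch of what such an object could look like (names and fields are illustrative only):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkLayout:
    # Illustrative container for a richer `chunks` attribute.
    shards: tuple[int, ...]                   # shape of the stored (outer) chunks
    subchunks: tuple[int, ...] | None = None  # shape of the inner chunks, if sharded
```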
### serialization: compressor, filters, codecs

Zarr V2 metadata defines `filters` (a collection of chunk <-> chunk transformations) and `compressor` (a single chunk <-> byte array transformation).

Zarr V3 metadata instead has a single `codecs` attribute, which is a structured list that may contain some number of chunk <-> chunk transformations (`ArrayArrayCodec`), must contain one and only one chunk <-> byte array transformation (termed `ArrayBytesCodec`), and may contain some number of byte array <-> byte array transformations (`BytesBytesCodec`). In `zarr-python` 3.x these categories correspond to the codec base classes `ArrayArrayCodec`, `ArrayBytesCodec` and `BytesBytesCodec`. Sharding is implemented as an `ArrayBytesCodec`, which contains its own collection of codecs.
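For orientation, here is roughly what a v3 `codecs` list looks like in metadata, with one entry from each category (codec names follow the v3 spec; the configurations are illustrative):

```python
codecs = [
    {"name": "transpose", "configuration": {"order": [1, 0]}},  # ArrayArrayCodec
    {"name": "bytes", "configuration": {"endian": "little"}},   # ArrayBytesCodec
    {"name": "gzip", "configuration": {"level": 5}},            # BytesBytesCodec
]
```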
We have a few challenges for the v3 API. Each one of these is a potential discussion point:

- Can we express both v2 array serialization (`list[filter]`, `compressor`) and v3 array serialization (`list[ArrayArrayCodec]`, `ArrayBytesCodec[subcodecs]`, `list[BytesBytesCodec]`) with the same API?
- For sharded arrays, the subchunk layout is defined inside the `ShardingCodec`. We need to provide a uniform interface to this information.
- `Array.create` takes all the codecs in a single list, and it isn't obvious how to construct that list so that sharding happens. I don't think users will be happy with this (see the previous points about sharding).
### discussion

I am curious to hear what the community thinks about any of these points. The `Array` object has perhaps the most user contact of any class in `zarr-python`. It's imperative that we end up with a design that most users can be happy with (or, at least, a design that they are not unhappy with).