Replies: 7 comments 9 replies
-
cc @zarr-developers/python-core-devs
-
um, shards???
-
I would be in favour of either a tuple-of-tuples (one tuple per dimension) (a-la dask), or an array of the same dimension as the zarr array where chunks[(i, j, k)] would contain the chunk shape of the chunk that is at position i along dim 0, j along dim 1, and k along dim 2. Both forms have the same information, so probably the former is better — we could provide function(s) for converting between the various forms. (And potentially, having chunks return a simple tuple-of-int for the most common use case of uniform chunk size might not be a bad idea. But I understand that it complicates the API significantly.)
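A minimal sketch of one such conversion, going from the per-axis form to the explicit shape of every chunk (the function name is illustrative, not an existing zarr-python API):

```python
from itertools import product

def explicit_chunk_shapes(per_axis_chunks):
    # Expand per-axis chunk sizes, e.g. ((10, 20), (10, 20)), into the
    # shape of each chunk in the grid, in row-major order.
    return list(product(*per_axis_chunks))
```

For `((10, 20), (10, 20))` this yields the four shapes of the 2×2 chunk grid: `(10, 10)`, `(10, 20)`, `(20, 10)`, `(20, 20)`.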
-
TIL 😅 Very handy! I actually have ~no experience with the filters and codecs APIs so I don't have useful input here... But seeing a diversity of example uses would help me form some ideas about common APIs...
-
For the record, ZEP0003 proposes chunks in the per-axis list style, as dask uses. If we want to change that, we should do so by updating the ZEP (and accepting it). I would vote to keep the current structure for simplicity, because it's what dask uses (one of the primary consumers), and because it leaves open the possibility of mixing regular and variable-length chunking for the axes of a given array.
-
This was essentially a convention and there was no technical difference between filters and compressors - the compressor object would be evaluated just the same (and can be used) as a filter, as if it were the last item in the filter list. I think we probably even have a "silent encoder" of numpy->bytes if the output of the last in the chain isn't already 1D, and a "silent decoder" if decoding yields bytes of the right number to fit in the output array, but I'm not certain on these points. The inputs and outputs are also "array-like" (i.e., python buffer protocol, effectively), so the real difference is the dimensionality requirement of the encoded and decoded data for each codec. That isn't defined anywhere except by trying it and in the codec documentation. Nevertheless, I think we can characterise all existing codecs and put them in the three V3 codec categories. There are not that many! For some unknown third-party codec appearing in entrypoints or registered at runtime (imagecodecs?), if the author doesn't have the time to provide the characterisation, we have options:
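To make the "compressor is just the last filter" point concrete, here is a rough numcodecs-style sketch of the v2 encode path (illustrative only, not the actual zarr-python internals):

```python
def encode_v2(chunk, filters, compressor):
    # Apply each filter in order; every codec exposes a numcodecs-style encode().
    buf = chunk
    for f in filters or []:
        buf = f.encode(buf)
    # The compressor behaves exactly like one more filter appended to the chain.
    if compressor is not None:
        buf = compressor.encode(buf)
    return buf
```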
-
I think the answer to this (partly) comes from whether v3 aims to be as backwards compatible with v2 as possible, or is being seen as a big break and a chance to introduce (well documented, but breaking) API changes. There seems to be a lot of back and forth on this in various issues, and it seems to me like it would be helpful for the core devs to discuss and make a decision on this.
-
I don't think we have settled the overall shape of the top-level `Array` object in `zarr-python` 3.x. This should be a priority for the stable release. So, I'm opening this discussion so people can suggest ideas / brainstorm / vent about how the `Array` object should look and behave in v3. I will open with a summary of the specific challenges we need to solve, and some ideas I have for each one.

### The `Array` API in 3.x

Here's an annotated outline of the shape of the `Array` object in v3 today. I'm just going to enumerate the properties of the class and ignore the methods for now. You can see the source code for this object here.

### v2-only attributes
The following attributes are present in the `zarr-python` 2.x `Array` class (source) but not present in the 3.x `Array` class. The reasons for these not being implemented in 3.x vary from "we haven't figured out v3 semantics for this" (`filters`, `compressor`, `synchronizer`), to "we haven't gotten around to it yet" (`write_empty_chunks`), and also "this triggers traversal over all the chunks and might not be a good idea for an array attribute" (`nchunks_initialized`):

- `chunk_store`
- `compressor`
- `filters`
- `synchronizer`
- `itemsize`
- `nbytes`
- `nbytes_stored`
- `cdata_shape`
- `nchunks`
- `nchunks_initialized`
- `is_view`
- `oindex`
- `vindex`
- `blocks`
- `write_empty_chunks`
- `meta_array`
I'm happy to discuss the v3 future for any one of these attributes. We may need to spin those discussions out into separate issues.

### specific challenges

I will enumerate some specific challenges with the `Array` API that we need to solve in 3.x.

### sharding

Zarr V3 introduces the possibility of creating sharded chunks, i.e. chunks that contain subchunks, each addressable as a contiguous byte range within the chunk. If you are reading from a sharded array, you will want to iterate over the subchunks. This means we need to make this property of an array simple to specify when creating an array, and simple to access when an array is already created.

Neither of these things is true today. We do not have an `Array` attribute that conveys the subchunk size. Instead, here is how you would get the subchunks of an array:
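(A rough sketch, assuming the subchunk shape can be read out of the array's v3 metadata document, where sharding appears as the `sharding_indexed` codec; `get_subchunks` is illustrative, not an established helper.)

```python
def get_subchunks(array_metadata: dict) -> tuple[int, ...] | None:
    # Return the subchunk (inner chunk) shape of a sharded array,
    # or None if no sharding codec is present.
    for codec in array_metadata.get("codecs", []):
        if codec.get("name") == "sharding_indexed":
            return tuple(codec["configuration"]["chunk_shape"])
    return None
```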
How should we make this subchunk information specifiable and accessible from the `Array` object? The simple solution would be to add an `Array.subchunks` attribute that uses the `get_subchunks` routine I sketched out, and to add a `subchunks` keyword argument to `Array.create`. Maybe people have other ideas, or proposals for a better name than "subchunks". Note that this `subchunks` property is defined inside the array serialization / compression routines, which are also specified in `Array.create`, so adding a `subchunks` keyword argument for array creation would impact other parts of the `Array.create` API.
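If we go that route, creation could look something like this (the `subchunks` keyword is the hypothetical addition under discussion; the other arguments mirror the existing creation API):

```python
import zarr

# Hypothetical: `subchunks=` does not exist today; it is the proposal above.
arr = zarr.create(
    shape=(10_000, 10_000),
    chunks=(1_000, 1_000),    # outer chunk (shard) shape
    subchunks=(100, 100),     # inner chunk shape within each shard
    dtype="float32",
)
```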
### `chunks` and `chunk_shape`

`Array.create` specifies the same information with two keyword arguments: `chunks` is a v2-specific argument, and `chunk_shape` is a v3-specific argument. We should pick one and use it for both v2 and v3. But see the next point.

### chunk grids
Zarr v2 uses a regular chunk grid, which means that all chunks are the same size, which means that a single chunk shape is a complete description of the chunk grid. Hence, in 2.x, `chunks` is just a tuple of ints. But in Zarr v3, the chunk grid is an extension point and there is an active proposal to add support for a rectilinear chunk grid, i.e. a chunk grid where the chunks do not have the same shape. In this case, there are two ways to specify the chunk shape: an explicit list of chunk sizes, `[(10,10), (20,10), (20,10), (20, 20)]`, or a list of chunk sizes per axis, `[(10, 20), (10, 20)]`.

So for zarr-python 3.x, we should have a plan for what the `chunks` attribute will look like for rectilinear chunk grids. We could also consider solving the sharding problem with the `chunks` attribute, e.g. by defining an object with specific attributes for chunks (shards) and subchunks.
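One hedged sketch of what such an object could look like (names and fields are illustrative only):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChunkLayout:
    # Illustrative container for a richer `chunks` attribute.
    shards: tuple[int, ...]                   # shape of the stored (outer) chunks
    subchunks: tuple[int, ...] | None = None  # shape of the inner chunks, if sharded
```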
### serialization: compressor, filters, codecs

Zarr V2 metadata defines `filters` (a collection of chunk <-> chunk transformations) and `compressor` (a single chunk <-> byte array transformation).

Zarr V3 metadata instead has a single `codecs` attribute, which is a structured list that may contain some number of chunk <-> chunk transformations (`ArrayArrayCodec`), must contain one and only one chunk <-> byte array transformation (termed `ArrayBytesCodec`), and may contain some number of byte array <-> byte array transformations (`BytesBytesCodec`). In `zarr-python` 3.x these categories correspond to the codec base classes `ArrayArrayCodec`, `ArrayBytesCodec` and `BytesBytesCodec`. Sharding is implemented as an `ArrayBytesCodec`, which contains its own collection of codecs.
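For orientation, here is roughly what a v3 `codecs` list looks like in metadata, with one entry from each category (codec names follow the v3 spec; the configurations are illustrative):

```python
codecs = [
    {"name": "transpose", "configuration": {"order": [1, 0]}},  # ArrayArrayCodec
    {"name": "bytes", "configuration": {"endian": "little"}},   # ArrayBytesCodec
    {"name": "gzip", "configuration": {"level": 5}},            # BytesBytesCodec
]
```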
We have a few challenges for the v3 API. Each one of these is a potential discussion point:

- Can we express both v2 array serialization (`list[filter]`, `compressor`) and v3 array serialization (`list[ArrayArrayCodec]`, `ArrayBytesCodec[subcodecs]`, `list[BytesBytesCodec]`) with the same API?
- For sharded arrays, the subchunk layout is defined inside the `ShardingCodec`. We need to provide a uniform interface to this information.
- `Array.create` takes all the codecs in a single list, and it isn't obvious how to construct that list so that sharding happens. I don't think users will be happy with this (see the previous points about sharding).
### discussion

I am curious to hear what the community thinks about any of these points. The `Array` object has perhaps the most user contact of any class in `zarr-python`. It's imperative that we end up with a design that most users can be happy with (or, at least, a design that they are not unhappy with).