-
-
Notifications
You must be signed in to change notification settings - Fork 282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: change array creation signature to allow sharding specification [do not merge] #2169
base: main
Are you sure you want to change the base?
Conversation
_codecs = tuple(codecs) if codecs is not None else (BytesCodec(),) | ||
|
||
if shard_shape is not None: | ||
_codecs = (ShardingCodec(chunk_shape=shard_shape, codecs=_codecs),) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This effectively hard-codes sharding into the spec, something like a sharded=True
flag that might have existed on the CHunkSpec object. How do you expect this to extend to variable chunking or other schemes that might be created in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This proposal can use whatever specification for variable length chunks we come up with, e.g. tuples of tuples of ints. You could specify variable length chunking with no sharding via something like chunks = {'write_shape': ((10,5), (1,2,3)}
, and variable length chunking with sharding via something like chunks = {'write_shape: ((10,5), (1,2,3)), 'read_shape': (1,1)}
. The read shape would have to checked for consistency with all the unique chunk shapes in this case. We would of course need to widen the type of ChunkSpec
for this to accept tuple[tuple[int, ...]]
for the write_shape
keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like a sharded=True flag that might have existed on the CHunkSpec object.
If we had such a flag on the chunkspec object, then it would semantically collide with read_shape
: {'write_shape': (10,10), 'read_shape': (2,2), sharding: False}
would not be valid, because there's no way to have read_shape
and write_shape
differ without sharding. BTW when I say "sharding" i don't mean "the sharding codec", I mean the general concept of packing multiple subchunks into a single file. If a non-codec implementation of sharding emerges, then I would like to imagine that this API could wrap that.
The goal of this PR is to demonstrate one strategy to simplify the creation of arrays that use sharding. Don't consider merging this until we get a good look at some alternatives.
This PR alters the
Array.create
routine, removing thechunk_shape
kwarg and instead beefing up the semantics of thechunks
kwarg. Specifically, thechunks
kwarg supports a new variant,ChunkSpec
, which aims to compactly specify both the chunk shape of an array as well as the (optional) sub-chunk shape.ChunkSpec
is a typed dictionary with two keys:read_shape
andwrite_shape
.write_shape
specifies the shape of array chunks that can be written concurrently, i.e. the shape in array coordinates of the chunk files.read_shape
specifies the shape of array chunks that can be read concurrently, i.e. the shape in array coordinates of the sub-chunks contained in a chunk constructed with a sharding codec.chunks = None
orchunks = {}
(we support the latter case because of how non-total typeddicts work) toArray.create
will automatically specify chunks using old v2 logic.chunks = {'write_shape': (20, 20)}
ORchunks = {'read_shape': (20, 20)}
toArray.create
will configure that array with no sharding and a chunk size of (20,20).chunks = {'write_shape': (20, 20), 'read_shape': (10,10)}
toArray.create
will configure that array with sharding, with a sub-chunk size of (10,10), and a chunk size of (20,20). This will also route all the of the user-specifiedcodecs
, if any, to the sharding codec.Note that this PR does not change the signature of the array class itself. That would be a separate effort.
addresses #2170
TODO: