feature(group): add group setitem api #2393

jhamman · 2024-10-17T05:29:00Z

I think #251 was fixed in 2.x at some point but it this PR also makes it work with v3.

closes #251
closes #508

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/tutorial.rst
Changes documented in docs/release.rst
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

jni · 2024-10-17T09:33:34Z

Are you aiming to get this merged before 3.0 @jhamman? And if not, what is the recommended way to do this with the current beta? My naive attempt in napari/napari#7215 (here to translate:

grp[key] = arr

is:

grp.create_array(name=key, shape=arr.shape, dtype=arr.dtype, data=arr)

(which is a little clunky, maybe shape and dtype should be optional if data is provided?)

d-v-b · 2024-10-17T09:37:43Z

putting aside the value of v2 compatibility, I'm not a fan of this API for a few reasons:

a zarr group is two mappings: its attributes and the members. Using __setitem__ for just one of those two mappings is arbitrary and unfair 🧑‍⚖️
using __setitem__ to create an array hides the full configuration options of array creation and can lead users to over-rely on zarr defaults
using __getitem__ to retrieve an array similarly offers no place to configure the runtime array-access properties like write_empty_chunks, caching, etc.

for the first point, we could scope all the hierarchy mutation stuff to a group.members attribute, e.g. group.members['array'] = array. but I would still want a way for users to configure the default array parameters, e.g. via our new config object.

grp.create_array(name=key, shape=arr.shape, dtype=arr.dtype, data=arr)
(which is a little clunky, maybe shape and dtype should be optional if data is provided?)

@jni I agree that this is clunky, and it's silly to expose data and dtype. I think it would be better if we separated "create a brand new array" from "create an array like this one" into two different methods.

jni · 2024-10-17T09:49:10Z

Regardless of Citizens United 😂, I don't think attributes have human rights @d-v-b, so I don't think "unfair" is really valid here. Which leaves "arbitrary". I think generally in hierarchical APIs like these it's very common for attributes to be under a .attrs, um, attribute? I think "commonly known API" is not arbitrary, but being a nice player in an ecosystem.

As to defaults, I am a big fan of them, let's have more of those. (See also #2083 😉)

Either way, I am looking for guidance as to what I should do in napari for now, and also, that guidance should probably go into #2102. 🙏

d-v-b · 2024-10-17T10:00:21Z

Regardless of Citizens United 😂, I don't think attributes have human rights @d-v-b, so I don't think "unfair" is really valid here. Which leaves "arbitrary". I think generally in hierarchical APIs like these it's very common for attributes to be under a .attrs, um, attribute? I think "commonly known API" is not arbitrary, but being a nice player in an ecosystem.

I'm not aware of the hierarchical API ecosystem; I'm coming from a more basic "how do we model the data" perspective, and from that POV, arbitrary symmetry breaking is almost always a mistake. For example, if you had to make a JSON serializable model of a zarr group, you have three options. (Recall that a zarr group is a dict of attributes and a dict of arrays / groups)

option 1:
```
group: {"array_1": {... model of array_1}, "group_1": {...model of group_1}, "attrs": {... the attributes}
```
This sucks because we can have name collisions between arrays / groups and our attributes. Namely, an array or group named attrs will break this model.
option 2:
```
group: {"attr_key_1": {... attrs under attribute key 1}, "members": {...the arrays and groups}}`
```
This sucks for the same reason as above. We are mixing our attribute namespace with our group model namespace.
option 3:
```
group: {"attrs": {... the attributes}, "members": {...the arrays and groups}}`
```
This separates the attributes and members namespaces. This is the only one of the three that's a usable JSON data model.

We could of course ignore this in our Group API and pretend that a group can be modeled as dict[str, Array | Group], but that's a decision motivated by either laziness or deference to bad decisions made in the past. The right motivation is fairness :) which should lead us to explicitly separate the members and the attributes IF we want a dict-like data model for a group.

jni · 2024-10-17T11:23:45Z

that's a decision motivated by either laziness or deference to bad decisions made in the past.

ok, fair. ;)

jni · 2024-10-17T11:26:40Z

(but, sometimes, deference to the past can be important. I won't make that judgement call here. But hdf5, and (consequently?) NetCDF, both use the .attrs-style API, if I'm not mistaken. And zarr v2??)

d-v-b · 2024-10-17T11:37:14Z

(but, sometimes, deference to the past can be important. I won't make that judgement call here. But hdf5, and (consequently?) NetCDF, both use the .attrs-style API, if I'm not mistaken. And zarr v2??)

and I would argue it's a mistake in all three cases. FWIW I didn't think about this much until I did pydantic-zarr, i.e. I had to embed the zarr data model into JSON, at which point I realized these problems with the Zarr v2 API (and the APIs that it was inspired from). since then I can't unsee it 🙈

jhamman · 2024-10-17T12:36:02Z

My opinion here is that we should include this in 3.0 but consider the following options:

rename AsyncGroup.setitem to something more in line with where we want the GroupAPI to go
on the sync Group
turn Group.arrays into a property
use the arrays.setter as a fast path
deprecate Group.__setitem__ in favor of Group.arrays['a'] = arr

@d-v-b - how would this API feel to you?

d-v-b · 2024-10-17T12:54:17Z

My opinion here is that we should include this in 3.0 but consider the following options:
* rename `AsyncGroup.setitem` to something more in line with where we want the GroupAPI to go

* on the sync `Group`

* turn `Group.arrays` into a property

* use the `arrays.setter` as a fast path

* deprecate `Group.__setitem__` in favor of `Group.arrays['a'] = arr`
@d-v-b - how would this API feel to you?

This is definitely an improvement, and I don't think I'm opposed to it. I'm most excited about the deprecation part :)

Should we put default array codec settings + chunking in the config, and use that for invocations like Group.arrays['foo'] = arr? And similarly for __getitem__, although I don't think we have as much configuration surface are there. That way at least this stuff is configurable.

jhamman · 2024-10-17T13:55:24Z

src/zarr/core/group.py

+    def __iter__(self) -> Generator[tuple[str, Array], None]:
+        for name, async_array in self._group._sync_iter(self._group._async_group.arrays()):
+            yield name, Array(async_array)
+
+    def __call__(self) -> Generator[tuple[str, Array], None]:
+        return iter(self)


I could use some opinions here!

On one hand, iter(group.arrays) is nice and tidy. But it also doesn't match the existing API (group.arrays()) and there are other methods where symetry seems useful (array_keys, array_values, etc.).

@d-v-b - you'll have an opinion here :)

what breaks if we remove __call__? I feel like that's redundant if you can just iterate. And then we could add keys and values methods to the ArraysProxy object that lossily call __iter__, thereby replacing array_keys and array_values?

We end up with a kind of a weird data structure -- something that's asymptotically a dict? But I think that also matches the semantics of "lots of stuff stored on disk with unique names".

what breaks if we remove call

lots of stuff that looks like this (from our test suite):

> assert sorted(group.arrays(), key=lambda x: x[0]) == expected_arrays E TypeError: 'ArraysProxy' object is not callable

Looking at 2.18, I'm realizing we also are missing the recurse option to Group.arrays(recurse: bool = False)...

All of this makes me think it may be asking too much to overload .arrays with setiem/getitem/iter/etc.

jhamman · 2024-10-17T23:08:32Z

I just ran across this blast from the past: #508

…into feature/group-setitem

jhamman · 2024-10-23T22:38:08Z

This is ready for a review. I've got taken the least invasive route. I think we may want to deprecate this but I see no harm in keeping it now. We can debate its removal after the 3.0 release.

jni

Thanks @jhamman! Personally I quite like soft deprecations as a way to gently move a community towards other APIs.

tests/test_group.py

jhamman · 2024-10-24T22:45:24Z

@d-v-b - I'm planning to merge this in its current state. If you would like to come behind with a set of deprecations and/or features that add a separate namespace for Group.arrays, go for it!

feature(group): add group setitem api

77115ff

jni mentioned this pull request Oct 17, 2024

Fix zarr 3.0 compatibility napari/napari#7215

Merged

arrays proxy

ed67cc3

jhamman commented Oct 17, 2024

View reviewed changes

jhamman added 4 commits October 23, 2024 14:40

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

6e8f3da

…into feature/group-setitem

rollback to simple version

4ab2698

rollback deprecation

056398e

rollback ...

804cf12

jhamman requested review from d-v-b and jni October 23, 2024 22:38

jni approved these changes Oct 24, 2024

View reviewed changes

tests/test_group.py Outdated Show resolved Hide resolved

jhamman commented Oct 24, 2024

View reviewed changes

tests/test_group.py Outdated Show resolved Hide resolved

jhamman and others added 2 commits October 24, 2024 15:43

Update tests/test_group.py

2dafa7e

Merge branch 'main' into feature/group-setitem

255f384

jhamman merged commit f4af51c into zarr-developers:main Oct 24, 2024
28 checks passed

jhamman deleted the feature/group-setitem branch October 24, 2024 22:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature(group): add group setitem api #2393

feature(group): add group setitem api #2393

jhamman commented Oct 17, 2024 •

edited

Loading

jni commented Oct 17, 2024

d-v-b commented Oct 17, 2024

jni commented Oct 17, 2024

d-v-b commented Oct 17, 2024 •

edited

Loading

jni commented Oct 17, 2024

jni commented Oct 17, 2024

d-v-b commented Oct 17, 2024

jhamman commented Oct 17, 2024

d-v-b commented Oct 17, 2024

jhamman Oct 17, 2024

d-v-b Oct 17, 2024

jhamman Oct 17, 2024 •

edited

Loading

jhamman commented Oct 17, 2024

jhamman commented Oct 23, 2024

jni left a comment

jhamman commented Oct 24, 2024

feature(group): add group setitem api #2393

feature(group): add group setitem api #2393

Conversation

jhamman commented Oct 17, 2024 • edited Loading

jni commented Oct 17, 2024

d-v-b commented Oct 17, 2024

jni commented Oct 17, 2024

d-v-b commented Oct 17, 2024 • edited Loading

jni commented Oct 17, 2024

jni commented Oct 17, 2024

d-v-b commented Oct 17, 2024

jhamman commented Oct 17, 2024

d-v-b commented Oct 17, 2024

jhamman Oct 17, 2024

Choose a reason for hiding this comment

d-v-b Oct 17, 2024

Choose a reason for hiding this comment

jhamman Oct 17, 2024 • edited Loading

Choose a reason for hiding this comment

jhamman commented Oct 17, 2024

jhamman commented Oct 23, 2024

jni left a comment

Choose a reason for hiding this comment

jhamman commented Oct 24, 2024

jhamman commented Oct 17, 2024 •

edited

Loading

d-v-b commented Oct 17, 2024 •

edited

Loading

jhamman Oct 17, 2024 •

edited

Loading