Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(group): add group setitem api #2393

Merged
merged 8 commits into from
Oct 24, 2024

Conversation

jhamman
Copy link
Member

@jhamman jhamman commented Oct 17, 2024

I think #251 was fixed in 2.x at some point but it this PR also makes it work with v3.

closes #251
closes #508

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@jni
Copy link
Contributor

jni commented Oct 17, 2024

Are you aiming to get this merged before 3.0 @jhamman? And if not, what is the recommended way to do this with the current beta? My naive attempt in napari/napari#7215 (here to translate:

grp[key] = arr

is:

grp.create_array(name=key, shape=arr.shape, dtype=arr.dtype, data=arr)

(which is a little clunky, maybe shape and dtype should be optional if data is provided?)

@d-v-b
Copy link
Contributor

d-v-b commented Oct 17, 2024

putting aside the value of v2 compatibility, I'm not a fan of this API for a few reasons:

  • a zarr group is two mappings: its attributes and the members. Using __setitem__ for just one of those two mappings is arbitrary and unfair 🧑‍⚖️
  • using __setitem__ to create an array hides the full configuration options of array creation and can lead users to over-rely on zarr defaults
  • using __getitem__ to retrieve an array similarly offers no place to configure the runtime array-access properties like write_empty_chunks, caching, etc.

for the first point, we could scope all the hierarchy mutation stuff to a group.members attribute, e.g. group.members['array'] = array. but I would still want a way for users to configure the default array parameters, e.g. via our new config object.

grp.create_array(name=key, shape=arr.shape, dtype=arr.dtype, data=arr)

(which is a little clunky, maybe shape and dtype should be optional if data is provided?)

@jni I agree that this is clunky, and it's silly to expose data and dtype. I think it would be better if we separated "create a brand new array" from "create an array like this one" into two different methods.

@jni
Copy link
Contributor

jni commented Oct 17, 2024

Regardless of Citizens United 😂, I don't think attributes have human rights @d-v-b, so I don't think "unfair" is really valid here. Which leaves "arbitrary". I think generally in hierarchical APIs like these it's very common for attributes to be under a .attrs, um, attribute? I think "commonly known API" is not arbitrary, but being a nice player in an ecosystem.

As to defaults, I am a big fan of them, let's have more of those. (See also #2083 😉)

Either way, I am looking for guidance as to what I should do in napari for now, and also, that guidance should probably go into #2102. 🙏

@d-v-b
Copy link
Contributor

d-v-b commented Oct 17, 2024

Regardless of Citizens United 😂, I don't think attributes have human rights @d-v-b, so I don't think "unfair" is really valid here. Which leaves "arbitrary". I think generally in hierarchical APIs like these it's very common for attributes to be under a .attrs, um, attribute? I think "commonly known API" is not arbitrary, but being a nice player in an ecosystem.

I'm not aware of the hierarchical API ecosystem; I'm coming from a more basic "how do we model the data" perspective, and from that POV, arbitrary symmetry breaking is almost always a mistake. For example, if you had to make a JSON serializable model of a zarr group, you have three options. (Recall that a zarr group is a dict of attributes and a dict of arrays / groups)

  • option 1:

    group: {"array_1": {... model of array_1}, "group_1": {...model of group_1}, "attrs": {... the attributes}
    

    This sucks because we can have name collisions between arrays / groups and our attributes. Namely, an array or group named attrs will break this model.

  • option 2:

    group: {"attr_key_1": {... attrs under attribute key 1}, "members": {...the arrays and groups}}`
    

    This sucks for the same reason as above. We are mixing our attribute namespace with our group model namespace.

  • option 3:

    group: {"attrs": {... the attributes}, "members": {...the arrays and groups}}`
    

    This separates the attributes and members namespaces. This is the only one of the three that's a usable JSON data model.

We could of course ignore this in our Group API and pretend that a group can be modeled as dict[str, Array | Group], but that's a decision motivated by either laziness or deference to bad decisions made in the past. The right motivation is fairness :) which should lead us to explicitly separate the members and the attributes IF we want a dict-like data model for a group.

@jni
Copy link
Contributor

jni commented Oct 17, 2024

that's a decision motivated by either laziness or deference to bad decisions made in the past.

ok, fair. ;)

@jni
Copy link
Contributor

jni commented Oct 17, 2024

(but, sometimes, deference to the past can be important. I won't make that judgement call here. But hdf5, and (consequently?) NetCDF, both use the .attrs-style API, if I'm not mistaken. And zarr v2??)

@d-v-b
Copy link
Contributor

d-v-b commented Oct 17, 2024

(but, sometimes, deference to the past can be important. I won't make that judgement call here. But hdf5, and (consequently?) NetCDF, both use the .attrs-style API, if I'm not mistaken. And zarr v2??)

and I would argue it's a mistake in all three cases. FWIW I didn't think about this much until I did pydantic-zarr, i.e. I had to embed the zarr data model into JSON, at which point I realized these problems with the Zarr v2 API (and the APIs that it was inspired from). since then I can't unsee it 🙈

@jhamman
Copy link
Member Author

jhamman commented Oct 17, 2024

My opinion here is that we should include this in 3.0 but consider the following options:

  • rename AsyncGroup.setitem to something more in line with where we want the GroupAPI to go
  • on the sync Group
  • turn Group.arrays into a property
  • use the arrays.setter as a fast path
  • deprecate Group.__setitem__ in favor of Group.arrays['a'] = arr

@d-v-b - how would this API feel to you?

@d-v-b
Copy link
Contributor

d-v-b commented Oct 17, 2024

My opinion here is that we should include this in 3.0 but consider the following options:

* rename `AsyncGroup.setitem` to something more in line with where we want the GroupAPI to go

* on the sync `Group`

* turn `Group.arrays` into a property

* use the `arrays.setter` as a fast path

* deprecate `Group.__setitem__` in favor of `Group.arrays['a'] = arr`

@d-v-b - how would this API feel to you?

This is definitely an improvement, and I don't think I'm opposed to it. I'm most excited about the deprecation part :)

Should we put default array codec settings + chunking in the config, and use that for invocations like Group.arrays['foo'] = arr? And similarly for __getitem__, although I don't think we have as much configuration surface are there. That way at least this stuff is configurable.

Comment on lines 319 to 324
def __iter__(self) -> Generator[tuple[str, Array], None]:
for name, async_array in self._group._sync_iter(self._group._async_group.arrays()):
yield name, Array(async_array)

def __call__(self) -> Generator[tuple[str, Array], None]:
return iter(self)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could use some opinions here!

On one hand, iter(group.arrays) is nice and tidy. But it also doesn't match the existing API (group.arrays()) and there are other methods where symetry seems useful (array_keys, array_values, etc.).

@d-v-b - you'll have an opinion here :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what breaks if we remove __call__? I feel like that's redundant if you can just iterate. And then we could add keys and values methods to the ArraysProxy object that lossily call __iter__, thereby replacing array_keys and array_values?

We end up with a kind of a weird data structure -- something that's asymptotically a dict? But I think that also matches the semantics of "lots of stuff stored on disk with unique names".

Copy link
Member Author

@jhamman jhamman Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what breaks if we remove call

lots of stuff that looks like this (from our test suite):

>       assert sorted(group.arrays(), key=lambda x: x[0]) == expected_arrays
E       TypeError: 'ArraysProxy' object is not callable

Looking at 2.18, I'm realizing we also are missing the recurse option to Group.arrays(recurse: bool = False)...

All of this makes me think it may be asking too much to overload .arrays with setiem/getitem/iter/etc.

@jhamman
Copy link
Member Author

jhamman commented Oct 17, 2024

I just ran across this blast from the past: #508

@jhamman
Copy link
Member Author

jhamman commented Oct 23, 2024

This is ready for a review. I've got taken the least invasive route. I think we may want to deprecate this but I see no harm in keeping it now. We can debate its removal after the 3.0 release.

@jhamman jhamman requested review from d-v-b and jni October 23, 2024 22:38
Copy link
Contributor

@jni jni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jhamman! Personally I quite like soft deprecations as a way to gently move a community towards other APIs.

tests/test_group.py Outdated Show resolved Hide resolved
tests/test_group.py Outdated Show resolved Hide resolved
@jhamman
Copy link
Member Author

jhamman commented Oct 24, 2024

@d-v-b - I'm planning to merge this in its current state. If you would like to come behind with a set of deprecations and/or features that add a separate namespace for Group.arrays, go for it!

@jhamman jhamman merged commit f4af51c into zarr-developers:main Oct 24, 2024
28 checks passed
@jhamman jhamman deleted the feature/group-setitem branch October 24, 2024 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document Group.__setitem__ or remove it Storing Dask Arrays directly to Zarr Group fails
3 participants