-
Another idea: once we have a well-defined LINDI Zarr, we could potentially wrap it in something that looks like an h5py object and feed that into pynwb.
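A minimal sketch of what that wrapper could look like, assuming the LINDI Zarr is readable with the zarr package. `LindiGroup` and `LindiDataset` are hypothetical names, and a real shim would need to cover much more of the h5py API that pynwb touches (references, soft links, writable attrs, etc.):

```python
import zarr


class LindiDataset:
    """Duck-types the parts of h5py.Dataset that a reader typically uses."""

    def __init__(self, zarr_array):
        self._arr = zarr_array

    @property
    def shape(self):
        return self._arr.shape

    @property
    def dtype(self):
        return self._arr.dtype

    @property
    def attrs(self):
        return dict(self._arr.attrs)

    def __getitem__(self, key):
        return self._arr[key]


class LindiGroup:
    """Duck-types h5py.Group: attrs and item access."""

    def __init__(self, zarr_group):
        self._grp = zarr_group

    @property
    def attrs(self):
        return dict(self._grp.attrs)

    def __getitem__(self, name):
        item = self._grp[name]
        if isinstance(item, zarr.Group):
            return LindiGroup(item)
        return LindiDataset(item)
```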
-
Just for reference: for hdmf-zarr, the storage specification is described here: https://hdmf-zarr.readthedocs.io/en/latest/storage.html and the mapping from h5py filters to zarr is implemented here: https://github.com/hdmf-dev/hdmf-zarr/blob/897d3b95ddfadd9364b2e2bc019e6ade86e920ce/src/hdmf_zarr/utils.py#L473-L539

I agree that having this mapping formally defined here is important, but it would be useful to have it be compliant with hdmf-zarr if possible. If that is not possible, then we should discuss how we can get the two to work together. One part that is not present in HDF5 files (in general) but is in NWB is the notion of
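For illustration, a toy version of that kind of filter mapping might look like the following; this is a sketch of the idea only, not hdmf-zarr's actual implementation (which is at the link above and handles more cases and filter options):

```python
import numcodecs


def h5_filter_to_codec(filter_name: str, level: int = 4):
    """Map a common h5py/HDF5 filter name to a numcodecs codec for zarr."""
    if filter_name == "gzip":
        return numcodecs.Zlib(level=level)
    if filter_name == "lzf":
        # No exact numcodecs equivalent; Blosc with lz4 is a common stand-in.
        return numcodecs.Blosc(cname="lz4", clevel=level)
    raise ValueError(f"no zarr codec mapping for HDF5 filter {filter_name!r}")
```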
Another item to add to this list is "Links to Groups/Datasets that are stored in a Group". In
Could you clarify what part of the Zarr spec you are referring to with
Having the ability to assemble a file from parts of other files will be very powerful. It would be neat to be able to create such files directly from PyNWB, but that's probably a bit further down the road.
I agree. You are raising a number of good points here, but I think we can break these up into separate features.
-
I agree. I described some limitations of my approach to hdf5-to-json (and thus also to kerchunk) here: https://github.com/rly/h5tojson/blob/main/h5tojson/h5tojson.py#L3. We could make a table.
Yes, but I'm not sure whether that should be done here. It may be best done through the builder system of pynwb, so that we create PyNWB objects that provide validation and convenience functions on top of the raw data.
Agreed.
Very much agreed, and something we have been thinking heavily about. A pro of this approach is that existing NWB files are compatible: you can make a JSON file where the dataset references point to datasets of an existing singular NWB HDF5 (or Zarr) file, and you can easily generate such a JSON file on the fly from a given singular NWB HDF5 file. The underlying NWB HDF5 file need not change, though we should establish a rule that the non-array data extracted to JSON must be consistent with the internal file, or say that metadata in the JSON takes precedence if there is disagreement -- this rule would make editing NWB data easier, since editing some aspects of HDF5 is non-trivial.

A downside of this approach is that an NWB LINDI dataset (or whatever we want to call it) would be distributed across files, and it is easy to create broken links and incomplete datasets when sharing individual files. But that is a necessary consequence of allowing modularity. We can mitigate it by storing URIs that point to DANDI (or elsewhere) whenever possible. We should discuss more whether it makes sense to allow links to both groups and datasets, or just datasets, or just groups.
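For concreteness, a hand-written illustration of that layout (the internal paths, byte offsets, and URL are all made up): the group/array metadata lives inline in the JSON, while each chunk key points at a byte range inside the original, unmodified NWB HDF5 file:

```python
import json

refs = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "acquisition/ts/data/.zarray": json.dumps({
        "zarr_format": 2, "shape": [1000], "chunks": [1000],
        "dtype": "<f8", "compressor": None, "fill_value": None,
        "filters": None, "order": "C",
    }),
    # [url, byte_offset, num_bytes] into the existing HDF5 file
    "acquisition/ts/data/0": [
        "https://api.dandiarchive.org/.../example.nwb", 4096, 8000,
    ],
}
```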
I am in favor of both when possible: URIs back to the original data if they come from a repository, and relative paths so that if the referenced data is already local, we don't need to resolve and download the URI. If the relative path breaks (the referenced file is deleted or moved), the URI is still there.
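As a sketch, the resolution rule could be as simple as the following (the argument names are placeholders for whatever fields the reference entry ends up storing):

```python
from pathlib import Path


def resolve_reference(base_dir: Path, relative_path: str, uri: str) -> str:
    """Prefer a local relative path; fall back to the stored URI."""
    local = base_dir / relative_path
    if local.exists():
        return str(local)  # already local, no download needed
    return uri  # relative path broke (moved/deleted); use the repository URI
```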
For now, I think that is OK.
Yes, I think between our forks, we have this already or something close to it (depending on what we decide as the .zarr.json format).
Yes.
I haven't tried zarr to .zarr.json via kerchunk yet. Does that work?
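I haven't checked whether kerchunk covers that direction, but since a zarr DirectoryStore is already a key-to-file mapping, a naive version is easy to sketch by hand: walk the store, inline the small metadata files, and emit one whole-file reference per chunk file:

```python
import json
from pathlib import Path


def directory_store_to_refs(store_dir: str) -> dict:
    root = Path(store_dir)
    refs = {}
    for p in sorted(root.rglob("*")):
        if not p.is_file():
            continue
        key = p.relative_to(root).as_posix()
        if key.startswith(".") or "/." in key:
            refs[key] = p.read_text()  # inline .zgroup/.zarray/.zattrs
        else:
            refs[key] = [str(p)]  # whole-file reference to one chunk
    return refs


refs_json = {"version": 1, "refs": directory_store_to_refs("example.zarr")}
Path("example.zarr.json").write_text(json.dumps(refs_json))
```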
What about having the amendment just replace the JSON file? The file is a text file that can be version controlled. Otherwise, you can have complex chains of amendments that become hard to process. Linking back to the original may be complicated if you are not using version control, though. Some related discussion from an earlier attempt here: hdmf-dev/hdmf#677
-
@rly
Jotting down some notes and goals here. We should probably split this out into separate issues/discussions.
Establish a well-defined one-to-one correspondence between a subset of hdf5 and a subset of zarr, where the subset of hdf5 ideally includes everything supported by NWB. I know that hdmf_zarr addresses this to some extent, but I think it's important to spell this out rigorously, and then provide straightforward implementations of the transformations (in both directions) without using the nwb builder system.
In addition, the subset of zarr should be contained in a larger allowed subset of zarr that is supported by nwb but does not correspond to hdf5 (I am thinking of custom codecs for compression).
Side note: I don't think that zarr_dtype should be required on every dataset, since it can be inferred from the data type of the zarr dataset.
Need to address
I think .zarr.json files that are compatible with fsspec ReferenceFileSystem and zarr DirectoryStore should be supported, and loadable in pynwb.
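That loading path already works today with fsspec + zarr, e.g. (placeholder file name):

```python
import fsspec
import zarr

# "reference" is fsspec's ReferenceFileSystem protocol; fo= takes a path
# to the reference JSON (or the refs dict itself).
fs = fsspec.filesystem("reference", fo="example.zarr.json")
root = zarr.open(fs.get_mapper(""), mode="r")
print(root.tree())
```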
We should support the templates field in .zarr.json files.
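For reference, templates are part of version 1 of the fsspec ReferenceFileSystem spec: named URL fragments that reference entries substitute via {{name}}, which keeps the JSON small when many chunks point at the same file. A minimal example (placeholder URL):

```python
refs_with_templates = {
    "version": 1,
    "templates": {"u": "https://example.org/data"},  # placeholder base URL
    "refs": {
        "data/0.0": ["{{u}}/chunks.bin", 0, 1024],
        "data/0.1": ["{{u}}/chunks.bin", 1024, 1024],
    },
}
```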
I think we should extend beyond what is supported by zarr DirectoryStore (custom zarr storage backend) to allow
We need to figure out whether references within a .zarr.json are going to point to absolute urls or relative paths. There are advantages to both.
I've been using .zarr.json extension, but maybe there is a better one to use.
Provide an nwb-to-.zarr.json converter that can operate on remote files and takes parameters, including a num_chunks_per_dataset_threshold that determines when to include a link to an external file rather than embedding per-chunk references within the generated file. To be clear, even the embedded chunks are themselves byte-range references to an external file, not inline data; the threshold only controls whether every chunk is listed individually.
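A sketch of that threshold logic; num_chunks_per_dataset_threshold and the .external key are proposals here, not an existing format:

```python
def dataset_refs(name, chunk_byte_ranges, source_url, threshold=1000):
    """Emit reference entries for one dataset (1-D chunk keys for brevity)."""
    if len(chunk_byte_ranges) > threshold:
        # Too many chunks: emit a single link to the external dataset and
        # let the reader resolve individual chunks lazily.
        return {f"{name}/.external": source_url}
    # Otherwise embed one reference per chunk; note these still point at
    # byte ranges in the external file, not inline data.
    return {
        f"{name}/{i}": [source_url, offset, length]
        for i, (offset, length) in enumerate(chunk_byte_ranges)
    }
```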
Figure out how to interface with pynwb for both reading and writing to .zarr.json files
Provide some kind of consolidation scheme where you can start with a zarr directory store with tens of thousands of chunk files and produce a .zarr.json file plus a much smaller number of files, with the .zarr.json pointing to byte ranges within those files. This is more suited to uploading to cloud storage.
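A sketch of one such scheme (file names are placeholders): concatenate the chunk files into a single data file and write a .zarr.json whose references are byte ranges into it:

```python
import json
from pathlib import Path


def consolidate(store_dir: str, out_data: str, out_json: str) -> None:
    root = Path(store_dir)
    refs, offset = {}, 0
    with open(out_data, "wb") as dest:
        for p in sorted(root.rglob("*")):
            if not p.is_file():
                continue
            key = p.relative_to(root).as_posix()
            if key.startswith(".") or "/." in key:
                refs[key] = p.read_text()  # inline zarr metadata
                continue
            data = p.read_bytes()
            dest.write(data)
            refs[key] = [out_data, offset, len(data)]  # byte range in blob
            offset += len(data)
    Path(out_json).write_text(json.dumps({"version": 1, "refs": refs}))
```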
Support the notion of an NWB amendment (or augmentation), where the file is not a proper nwb file but is an amendment to an existing one. Perhaps provide a pointer to the original? Like an include field?
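Purely as a speculative illustration of the include idea: the amendment's JSON names the file it amends, and its refs override or extend the original's:

```python
amendment = {
    "version": 1,
    # hypothetical field pointing back at the file being amended
    "include": "https://example.org/original.nwb.zarr.json",
    "refs": {
        "processing/new_module/.zgroup": '{"zarr_format": 2}',
    },
}
```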