-
Another idea: once we have a well-defined LINDI Zarr, we could potentially wrap it in something that looks like an h5py object and feed that into pynwb.
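A minimal sketch of what that wrapper could look like, assuming the LINDI Zarr is readable with the zarr package. `LindiGroup` and `LindiDataset` are hypothetical names, and a real shim would need to cover much more of the h5py API that pynwb touches (references, soft links, writable attrs, etc.):

```python
import zarr


class LindiDataset:
    """Duck-types the parts of h5py.Dataset that a reader typically uses."""

    def __init__(self, zarr_array):
        self._arr = zarr_array

    @property
    def shape(self):
        return self._arr.shape

    @property
    def dtype(self):
        return self._arr.dtype

    @property
    def attrs(self):
        return dict(self._arr.attrs)

    def __getitem__(self, key):
        return self._arr[key]


class LindiGroup:
    """Duck-types h5py.Group: attrs and item access."""

    def __init__(self, zarr_group):
        self._grp = zarr_group

    @property
    def attrs(self):
        return dict(self._grp.attrs)

    def __getitem__(self, name):
        item = self._grp[name]
        if isinstance(item, zarr.Group):
            return LindiGroup(item)
        return LindiDataset(item)
```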
-
Just for reference: for hdmf-zarr, the storage specification is described here: https://hdmf-zarr.readthedocs.io/en/latest/storage.html and the mapping from h5py filters to zarr is implemented here: https://github.com/hdmf-dev/hdmf-zarr/blob/897d3b95ddfadd9364b2e2bc019e6ade86e920ce/src/hdmf_zarr/utils.py#L473-L539

I agree that having this mapping formally defined here is important, but it would be useful to have it be compliant with hdmf-zarr if possible. If that is not possible, then we should discuss how we can get the two to work together. One part that is not present in HDF5 files (in general) but is in NWB is the notion of
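For illustration, a toy version of that kind of filter mapping might look like the following; this is a sketch of the idea only, not hdmf-zarr's actual implementation (which is at the link above and handles more cases and filter options):

```python
import numcodecs


def h5_filter_to_codec(filter_name: str, level: int = 4):
    """Map a common h5py/HDF5 filter name to a numcodecs codec for zarr."""
    if filter_name == "gzip":
        return numcodecs.Zlib(level=level)
    if filter_name == "lzf":
        # No exact numcodecs equivalent; Blosc with lz4 is a common stand-in.
        return numcodecs.Blosc(cname="lz4", clevel=level)
    raise ValueError(f"no zarr codec mapping for HDF5 filter {filter_name!r}")
```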
Another item to add to this list is "Links to Groups/Datasets that are stored in a Group". In
Could you clarify what part of the Zarr spec you are referring to with
Having the ability to assemble a file from parts of other files will be very powerful. It would be neat to be able to create such files directly from PyNWB, but that's probably a bit further down the road.
I agree. You are raising a number of good points here, but I think we can break these up into separate features.
-
I agree. I described some limitations of my approach to hdf5-to-json (and thus also to kerchunk) here: https://github.com/rly/h5tojson/blob/main/h5tojson/h5tojson.py#L3. We could make a table.
Yes, but I'm not sure whether that should be done here. It may be best done through the builder system of pynwb, so that we create PyNWB objects that provide validation and convenience functions on top of the raw data.
Agreed.
Very much agreed, and something we have been thinking heavily about. A pro of this approach is that existing NWB files are compatible: you can make a JSON file where the dataset references point to datasets of an existing singular NWB HDF5 (or Zarr) file, and you can easily generate such a JSON file on the fly from a given singular NWB HDF5 file. The underlying NWB HDF5 file need not change, though we should establish a rule that the non-array data extracted to JSON must be consistent with the internal file, or say that metadata in the JSON takes precedence if there is disagreement -- this rule would make editing NWB data easier, since editing some aspects of HDF5 is non-trivial.

A downside of this approach is that an NWB LINDI dataset (or whatever we want to call it) would be distributed across files, and it is easy to create broken links and incomplete datasets when sharing individual files. But that is a necessary consequence of allowing modularity. We can mitigate it by storing URIs that point to DANDI (or elsewhere) whenever possible. We should discuss more whether it makes sense to allow links to both groups and datasets, or just datasets, or just groups.
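For concreteness, a hand-written illustration of that layout (the internal paths, byte offsets, and URL are all made up): the group/array metadata lives inline in the JSON, while each chunk key points at a byte range inside the original, unmodified NWB HDF5 file:

```python
import json

refs = {
    ".zgroup": json.dumps({"zarr_format": 2}),
    "acquisition/ts/data/.zarray": json.dumps({
        "zarr_format": 2, "shape": [1000], "chunks": [1000],
        "dtype": "<f8", "compressor": None, "fill_value": None,
        "filters": None, "order": "C",
    }),
    # [url, byte_offset, num_bytes] into the existing HDF5 file
    "acquisition/ts/data/0": [
        "https://api.dandiarchive.org/.../example.nwb", 4096, 8000,
    ],
}
```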
I am in favor of both when possible: URIs back to the original data if they come from a repository, and relative paths so that if the referenced data is already local, we don't need to resolve and download the URI. If the relative path breaks (the referenced file is deleted or moved), the URI is still there.
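As a sketch, the resolution rule could be as simple as the following (the argument names are placeholders for whatever fields the reference entry ends up storing):

```python
from pathlib import Path


def resolve_reference(base_dir: Path, relative_path: str, uri: str) -> str:
    """Prefer a local relative path; fall back to the stored URI."""
    local = base_dir / relative_path
    if local.exists():
        return str(local)  # already local, no download needed
    return uri  # relative path broke (moved/deleted); use the repository URI
```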
For now, I think that is OK.
Yes, I think between our forks, we have this already or something close to it (depending on what we decide as the .zarr.json format).
Yes.
I haven't tried zarr to .zarr.json via kerchunk yet. Does that work?
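I haven't checked whether kerchunk covers that direction, but since a zarr DirectoryStore is already a key-to-file mapping, a naive version is easy to sketch by hand: walk the store, inline the small metadata files, and emit one whole-file reference per chunk file:

```python
import json
from pathlib import Path


def directory_store_to_refs(store_dir: str) -> dict:
    root = Path(store_dir)
    refs = {}
    for p in sorted(root.rglob("*")):
        if not p.is_file():
            continue
        key = p.relative_to(root).as_posix()
        if key.startswith(".") or "/." in key:
            refs[key] = p.read_text()  # inline .zgroup/.zarray/.zattrs
        else:
            refs[key] = [str(p)]  # whole-file reference to one chunk
    return refs


refs_json = {"version": 1, "refs": directory_store_to_refs("example.zarr")}
Path("example.zarr.json").write_text(json.dumps(refs_json))
```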
What about having the amendment just replace the JSON file? The file is a text file that can be version controlled. Otherwise, you can have complex chains of amendments that become hard to process. Linking back to the original may be complicated if you are not using version control, though. Some related discussion from an earlier attempt here: hdmf-dev/hdmf#677
-
@rly
Jotting down some notes and goals here. We should probably split this out into separate issues/discussions.
Establish a well-defined one-to-one correspondence between a subset of hdf5 and a subset of zarr, where the subset of hdf5 ideally includes everything supported by NWB. I know that hdmf_zarr addresses this to some extent, but I think it's important to spell this out rigorously, and then provide straightforward implementations of the transformations (in both directions) without using the nwb builder system.
In addition, the subset of zarr should be contained in a larger allowed subset of zarr that is supported by nwb but does not correspond to hdf5 (I am thinking of custom codecs for compression).
Side note: I don't think that zarr_dtype should be required on every dataset, since it can be inferred from the data type of the zarr dataset.
Need to address
I think .zarr.json files that are compatible with fsspec ReferenceFileSystem and zarr DirectoryStore should be supported, and loadable in pynwb.
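That loading path already works today with fsspec + zarr, e.g. (placeholder file name):

```python
import fsspec
import zarr

# "reference" is fsspec's ReferenceFileSystem protocol; fo= takes a path
# to the reference JSON (or the refs dict itself).
fs = fsspec.filesystem("reference", fo="example.zarr.json")
root = zarr.open(fs.get_mapper(""), mode="r")
print(root.tree())
```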
We should support the templates field in .zarr.json files.
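For reference, templates are part of version 1 of the fsspec ReferenceFileSystem spec: named URL fragments that reference entries substitute via {{name}}, which keeps the JSON small when many chunks point at the same file. A minimal example (placeholder URL):

```python
refs_with_templates = {
    "version": 1,
    "templates": {"u": "https://example.org/data"},  # placeholder base URL
    "refs": {
        "data/0.0": ["{{u}}/chunks.bin", 0, 1024],
        "data/0.1": ["{{u}}/chunks.bin", 1024, 1024],
    },
}
```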
I think we should extend beyond what is supported by zarr DirectoryStore (custom zarr storage backend) to allow
We need to figure out whether references within a .zarr.json are going to point to absolute urls or relative paths. There are advantages to both.
I've been using .zarr.json extension, but maybe there is a better one to use.
Provide an nwb-to-.zarr.json converter that can operate on remote files and takes parameters, including a num_chunks_per_dataset_threshold that determines when to include a link to an external file rather than embedding per-chunk references within the generated file. To be clear, even the embedded chunks are themselves byte-range references to an external file, not inline data; the threshold only controls whether every chunk is listed individually.
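A sketch of that threshold logic; num_chunks_per_dataset_threshold and the .external key are proposals here, not an existing format:

```python
def dataset_refs(name, chunk_byte_ranges, source_url, threshold=1000):
    """Emit reference entries for one dataset (1-D chunk keys for brevity)."""
    if len(chunk_byte_ranges) > threshold:
        # Too many chunks: emit a single link to the external dataset and
        # let the reader resolve individual chunks lazily.
        return {f"{name}/.external": source_url}
    # Otherwise embed one reference per chunk; note these still point at
    # byte ranges in the external file, not inline data.
    return {
        f"{name}/{i}": [source_url, offset, length]
        for i, (offset, length) in enumerate(chunk_byte_ranges)
    }
```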
Figure out how to interface with pynwb for both reading and writing to .zarr.json files
Provide some kind of consolidation scheme where you can start with a zarr directory store with tens of thousands of chunk files and produce a .zarr.json file plus a much smaller number of files, with the .zarr.json pointing to byte ranges within those files. This is more suited to uploading to cloud storage.
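A sketch of one such scheme (file names are placeholders): concatenate the chunk files into a single data file and write a .zarr.json whose references are byte ranges into it:

```python
import json
from pathlib import Path


def consolidate(store_dir: str, out_data: str, out_json: str) -> None:
    root = Path(store_dir)
    refs, offset = {}, 0
    with open(out_data, "wb") as dest:
        for p in sorted(root.rglob("*")):
            if not p.is_file():
                continue
            key = p.relative_to(root).as_posix()
            if key.startswith(".") or "/." in key:
                refs[key] = p.read_text()  # inline zarr metadata
                continue
            data = p.read_bytes()
            dest.write(data)
            refs[key] = [out_data, offset, len(data)]  # byte range in blob
            offset += len(data)
    Path(out_json).write_text(json.dumps({"version": 1, "refs": refs}))
```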
Support the notion of an NWB amendment (or augmentation), where the file is not a proper nwb file but is an amendment to an existing one. Perhaps provide a pointer to the original? Like an include field?
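Purely as a speculative illustration of the include idea: the amendment's JSON names the file it amends, and its refs override or extend the original's:

```python
amendment = {
    "version": 1,
    # hypothetical field pointing back at the file being amended
    "include": "https://example.org/original.nwb.zarr.json",
    "refs": {
        "processing/new_module/.zgroup": '{"zarr_format": 2}',
    },
}
```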