Replies: 5 comments 6 replies
-
I think the filter mapping should be essentially the same.
-
The key difference is that one describes the mapping from the schema language to the storage, while the other describes the mapping between storage primitives only. I think both are useful, but for the purpose of lindi, I think the latter is more relevant since lindi probably won't deal with schema.
-
For references, I think this could be encapsulated in a way that hdmf-zarr could understand either option. It would be useful to discuss how we want to implement references.
-
If the main difference is the formatting of links/references, then we can probably manage compatibility in the hdmf-zarr backend. I think we can probably change some of the format in hdmf-zarr but would need a mechanism for backward compatibility. Assigning and storing UUIDs for all objects would, I think, be useful to consider here. In particular, when linking across remote files, having an easy way to validate that the object being linked to is the right one will be useful.
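A minimal sketch of what that validation could look like, assuming each reference carries both a path and the target's object_id (the field names and helper below are illustrative, not the actual hdmf-zarr attribute layout):

```python
import zarr


def resolve_and_validate(root: zarr.Group, ref: dict):
    """Follow a reference dict and check that the target's stored object_id
    matches the one recorded in the reference.

    `ref` is assumed to look like {"path": "/acquisition/ts1", "object_id": "..."}
    -- illustrative field names, not a fixed spec.
    """
    target = root[ref["path"].lstrip("/")]
    expected = ref.get("object_id")
    actual = target.attrs.get("object_id")
    if expected is not None and actual != expected:
        raise ValueError(
            f"Reference target at {ref['path']} has object_id {actual!r}, "
            f"expected {expected!r} (link may be stale)"
        )
    return target
```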
-
@oruebel I'm moving this discussion to this new thread.
Thanks for pointing to those references. Kerchunk has its own filter mapping function, ...; it will be interesting to see how it differs.
I feel like it would be better to define a more general hdf5 -> zarr transform rather than separate hdmf-zarr and hdmf-h5 specifications. You'd specify what happens to attributes, datasets, groups, references, etc., rather than name, doc, groups, datasets, ..., neurodata_type, ... I realize it's a subtle difference, but it's a simplification.
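For illustration only, such a storage-level transform might be organized around HDF5 primitives rather than schema keys, roughly along these lines (the right-hand sides are placeholders for discussion, not a settled spec):

```python
# Hypothetical outline of a general HDF5 -> Zarr transform, keyed by storage
# primitive rather than by schema fields like name/doc/neurodata_type.
# The descriptions are placeholders for discussion, not a settled spec.
HDF5_TO_ZARR_RULES = {
    "group": "becomes a zarr group",
    "dataset": "becomes a zarr array; scalar datasets get shape [1]",
    "attribute": "becomes a zarr attribute (JSON-encoded value)",
    "object reference": "stored as a dict with a _REFERENCE key",
    "region reference": "a _REFERENCE dict plus a 'region' field",
    "compound dtype": "JSON-encoded (or variable-length string)",
}
```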
Regarding compatibility with hdmf-zarr... I agree that would be useful. I don't know how practical it would be to make adjustments to hdmf-zarr, but I would propose the following:
As I mentioned, `zarr_dtype` shouldn't be needed in most cases. For the case of references, it seems strange to me that `zarr_dtype = 'object'` would be needed. I would propose replacing the current reference encoding with something like a top-level `_REFERENCE` field (I'm not sure about the purpose of `source` here). In the case of region references, a "region" field could be added.
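To make this concrete, here is a hypothetical before/after for a single reference attribute. These are not the exact blocks from the original comment; the field names (`zarr_dtype`, `value`, `_REFERENCE`, `source`, `object_id`, `source_object_id`, `region`) and values are assumptions based on the surrounding discussion:

```python
# Hypothetical current hdmf-zarr style encoding of an object reference
# stored in a zarr attribute (field names/values are assumptions):
current_encoding = {
    "zarr_dtype": "object",
    "value": {
        "source": ".",  # purpose of "source" unclear (see above)
        "path": "/processing/behavior/position",
        "object_id": "aa1b...",
        "source_object_id": "bb2c...",
    },
}

# Proposed replacement: drop zarr_dtype and mark the value itself with a
# _REFERENCE key; a region reference could add a "region" field.
proposed_encoding = {
    "_REFERENCE": {
        "source": ".",
        "path": "/processing/behavior/position",
        "object_id": "aa1b...",
        "source_object_id": "bb2c...",
        # "region": [[0, 100]],  # e.g. for region references
    },
}
```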
This `_REFERENCE` field is somewhat consistent with an attribute that kerchunk/fsspec uses called `_ARRAY_DIMENSIONS`, which is applied to every dataset. This is how one can identify scalar datasets -- the underlying zarr array has shape `[1]` but the `_ARRAY_DIMENSIONS` is `[]`, so the reader can know it is a scalar.
Compound types could always be json-encoded (or maybe a variable-length string), with embedded references formatted as above.
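As a small illustration of the scalar-dataset convention and the JSON-encoded compound idea described above (the dataset name and values are made up for the example):

```python
import json
import zarr

# A scalar HDF5 dataset stored as a zarr array of shape (1,); the empty
# _ARRAY_DIMENSIONS list is what tells the reader it is really a scalar.
root = zarr.group()
arr = root.create_dataset("session_start_time_offset", data=[3600.0])
arr.attrs["_ARRAY_DIMENSIONS"] = []  # shape is [1], but logically a scalar

# A compound value could be JSON-encoded, with any embedded reference
# written in the proposed _REFERENCE form:
arr.attrs["compound_example"] = json.dumps(
    {"label": "start", "target": {"_REFERENCE": {"path": "/acquisition/ts1"}}}
)
```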
So this is what I'd propose. If hdmf-zarr needs to stay as is, one option is to have a converter: since the .zarr.json files are going to be relatively small, we could just have a conversion utility that is applied before sending into hdmf-zarr.
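A rough sketch of what such a converter could look like, assuming the only change needed is rewriting reference attributes between the two layouts sketched earlier (the function and key names are hypothetical, and the real structure of the .zarr.json metadata would need to be confirmed):

```python
import json


def convert_references_for_hdmf_zarr(zarr_json_path: str, out_path: str) -> None:
    """Rewrite a .zarr.json file so that proposed _REFERENCE attributes are
    re-expressed in the zarr_dtype='object' style sketched above.
    Purely illustrative; key names are assumptions, not a fixed spec."""
    with open(zarr_json_path) as f:
        meta = json.load(f)

    def rewrite(value):
        # Recursively walk the JSON tree, converting any _REFERENCE dicts.
        if isinstance(value, dict):
            if "_REFERENCE" in value:
                return {"zarr_dtype": "object", "value": value["_REFERENCE"]}
            return {k: rewrite(v) for k, v in value.items()}
        if isinstance(value, list):
            return [rewrite(v) for v in value]
        return value

    with open(out_path, "w") as f:
        json.dump(rewrite(meta), f, indent=2)
```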