Replies: 5 comments 6 replies
-
I think the filter mapping should be essentially the same.
-
The key difference is that one describes the mapping from the schema language to the storage, while the other describes the mapping between storage primitives only. I think both are useful, but for the purpose of lindi, I think the latter is more relevant since lindi probably won't deal with schema.
-
For references, I think this could be encapsulated in a way that hdmf-zarr could understand either option. It would be useful to discuss how we want to implement references.
-
If the main difference is the formatting of links/references, then we can probably manage compatibility in the hdmf-zarr backend. I think we can probably change some of the format in hdmf-zarr but would need a mechanism for backward compatibility. Assigning and storing UUIDs for all objects would, I think, be useful to consider here. In particular, when linking across remote files, having an easy way to validate that the object being linked to is the right one will be useful.
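A minimal sketch of what that validation could look like, assuming each reference carries both a path and the target's object_id (the field names and helper below are illustrative, not the actual hdmf-zarr attribute layout):

```python
import zarr


def resolve_and_validate(root: zarr.Group, ref: dict):
    """Follow a reference dict and check that the target's stored object_id
    matches the one recorded in the reference.

    `ref` is assumed to look like {"path": "/acquisition/ts1", "object_id": "..."}
    -- illustrative field names, not a fixed spec.
    """
    target = root[ref["path"].lstrip("/")]
    expected = ref.get("object_id")
    actual = target.attrs.get("object_id")
    if expected is not None and actual != expected:
        raise ValueError(
            f"Reference target at {ref['path']} has object_id {actual!r}, "
            f"expected {expected!r} (link may be stale)"
        )
    return target
```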
-
@oruebel I'm moving this discussion to this new thread.
Thanks for pointing to those references. Kerchunk has its own filter mapping function, ...; it will be interesting to see how it differs.
I feel like it would be better to define a more general hdf5 -> zarr transform rather than separate hdmf-zarr and hdmf-h5 specifications. You'd specify what happens to attributes, datasets, groups, references, etc., rather than name, doc, groups, datasets, ..., neurodata_type, ... I realize it's a subtle difference, but it's a simplification.
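For illustration only, such a storage-level transform might be organized around HDF5 primitives rather than schema keys, roughly along these lines (the right-hand sides are placeholders for discussion, not a settled spec):

```python
# Hypothetical outline of a general HDF5 -> Zarr transform, keyed by storage
# primitive rather than by schema fields like name/doc/neurodata_type.
# The descriptions are placeholders for discussion, not a settled spec.
HDF5_TO_ZARR_RULES = {
    "group": "becomes a zarr group",
    "dataset": "becomes a zarr array; scalar datasets get shape [1]",
    "attribute": "becomes a zarr attribute (JSON-encoded value)",
    "object reference": "stored as a dict with a _REFERENCE key",
    "region reference": "a _REFERENCE dict plus a 'region' field",
    "compound dtype": "JSON-encoded (or variable-length string)",
}
```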
Regarding compatibility with hdmf-zarr... I agree that would be useful. I don't know how practical it would be to make adjustments to hdmf-zarr, but I would propose the following:
As I mentioned, `zarr_dtype` shouldn't be needed in most cases. For the case of references, it seems strange to me that `zarr_dtype = 'object'` would be needed. I would propose replacing the current reference encoding with something like a top-level `_REFERENCE` field (I'm not sure about the purpose of `source` here). In the case of region references, a "region" field could be added.
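To make this concrete, here is a hypothetical before/after for a single reference attribute. These are not the exact blocks from the original comment; the field names (`zarr_dtype`, `value`, `_REFERENCE`, `source`, `object_id`, `source_object_id`, `region`) and values are assumptions based on the surrounding discussion:

```python
# Hypothetical current hdmf-zarr style encoding of an object reference
# stored in a zarr attribute (field names/values are assumptions):
current_encoding = {
    "zarr_dtype": "object",
    "value": {
        "source": ".",  # purpose of "source" unclear (see above)
        "path": "/processing/behavior/position",
        "object_id": "aa1b...",
        "source_object_id": "bb2c...",
    },
}

# Proposed replacement: drop zarr_dtype and mark the value itself with a
# _REFERENCE key; a region reference could add a "region" field.
proposed_encoding = {
    "_REFERENCE": {
        "source": ".",
        "path": "/processing/behavior/position",
        "object_id": "aa1b...",
        "source_object_id": "bb2c...",
        # "region": [[0, 100]],  # e.g. for region references
    },
}
```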
This `_REFERENCE` field is somewhat consistent with an attribute that kerchunk/fsspec uses called `_ARRAY_DIMENSIONS`, which is applied to every dataset. This is how one can identify scalar datasets -- the underlying zarr array has shape `[1]` but the `_ARRAY_DIMENSIONS` is `[]`, so the reader can know it is a scalar.
Compound types could always be json-encoded (or maybe a variable-length string), with embedded references formatted as above.
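As a small illustration of the scalar-dataset convention and the JSON-encoded compound idea described above (the dataset name and values are made up for the example):

```python
import json
import zarr

# A scalar HDF5 dataset stored as a zarr array of shape (1,); the empty
# _ARRAY_DIMENSIONS list is what tells the reader it is really a scalar.
root = zarr.group()
arr = root.create_dataset("session_start_time_offset", data=[3600.0])
arr.attrs["_ARRAY_DIMENSIONS"] = []  # shape is [1], but logically a scalar

# A compound value could be JSON-encoded, with any embedded reference
# written in the proposed _REFERENCE form:
arr.attrs["compound_example"] = json.dumps(
    {"label": "start", "target": {"_REFERENCE": {"path": "/acquisition/ts1"}}}
)
```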
So this is what I'd propose. If hdmf-zarr needs to stay as is, one option is to have a converter: since the .zarr.json files are going to be relatively small, we could just have a conversion utility that is applied before sending into hdmf-zarr.
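A rough sketch of what such a converter could look like, assuming the only change needed is rewriting reference attributes between the two layouts sketched earlier (the function and key names are hypothetical, and the real structure of the .zarr.json metadata would need to be confirmed):

```python
import json


def convert_references_for_hdmf_zarr(zarr_json_path: str, out_path: str) -> None:
    """Rewrite a .zarr.json file so that proposed _REFERENCE attributes are
    re-expressed in the zarr_dtype='object' style sketched above.
    Purely illustrative; key names are assumptions, not a fixed spec."""
    with open(zarr_json_path) as f:
        meta = json.load(f)

    def rewrite(value):
        # Recursively walk the JSON tree, converting any _REFERENCE dicts.
        if isinstance(value, dict):
            if "_REFERENCE" in value:
                return {"zarr_dtype": "object", "value": value["_REFERENCE"]}
            return {k: rewrite(v) for k, v in value.items()}
        if isinstance(value, list):
            return [rewrite(v) for v in value]
        return value

    with open(out_path, "w") as f:
        json.dump(rewrite(meta), f, indent=2)
```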