
improve html representation of datasets #1100

Merged
merged 33 commits into hdmf-dev:dev from h-mayorquin:improve_html_repr_of_data on Nov 5, 2024

Conversation

h-mayorquin
Contributor

@h-mayorquin h-mayorquin commented Apr 19, 2024

Motivation

Improve the display of the data in the HTML representation of containers. Note that this PR focuses on datasets that have already been written to disk. For in-memory representations, the question of what to do with data wrapped in an iterator or a DataIO subtype can, I think, be addressed in another PR.

How to test the behavior?

HDF5

I have been using this script:

from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.file import mock_NWBFile
from hdmf.backends.hdf5.h5_utils import H5DataIO
from pynwb.testing.mock.ophys import mock_ImagingPlane, mock_TwoPhotonSeries

import numpy as np

data = np.random.rand(500_000, 384)
timestamps = np.arange(500_000)
data = H5DataIO(data=data, compression=True, chunks=True)

nwbfile = mock_NWBFile()
electrical_series = mock_ElectricalSeries(data=data, nwbfile=nwbfile, rate=None, timestamps=timestamps)

imaging_plane = mock_ImagingPlane(grid_spacing=[1.0, 1.0], nwbfile=nwbfile)


data = H5DataIO(data=np.random.rand(2, 2, 2), compression=True, chunks=True)
two_photon_series = mock_TwoPhotonSeries(name="TwoPhotonSeries", imaging_plane=imaging_plane, data=data, nwbfile=nwbfile)


# Write to file
from pynwb import NWBHDF5IO
with NWBHDF5IO('ecephys_tutorial.nwb', 'w') as io:
    io.write(nwbfile)



from pynwb import NWBHDF5IO

io = NWBHDF5IO('ecephys_tutorial.nwb', 'r')
nwbfile = io.read()
nwbfile

[screenshot: HTML representation of the HDF5-backed NWBFile with the new dataset display]

Zarr

import os

import numpy as np
from numcodecs import Blosc, Delta
from hdmf_zarr import ZarrDataIO
from hdmf_zarr.nwb import NWBZarrIO
from pynwb.testing.mock.file import mock_NWBFile
from pynwb.testing.mock.ecephys import mock_ElectricalSeries

filters = [Delta(dtype="i4")]

data_with_zarr_data_io = ZarrDataIO(
    data=np.arange(100000000, dtype='i4').reshape(10000, 10000),
    chunks=(1000, 1000),
    compressor=Blosc(cname='zstd', clevel=3, shuffle=Blosc.SHUFFLE),
    # filters=filters,
)

timestamps = np.arange(10000)

data = data_with_zarr_data_io

nwbfile = mock_NWBFile()
electrical_series_name = "ElectricalSeries"
rate = None
electrical_series = mock_ElectricalSeries(name=electrical_series_name, data=data, nwbfile=nwbfile, timestamps=timestamps, rate=None)


path = "zarr_test.nwb.zarr"
absolute_path = os.path.abspath(path)
with NWBZarrIO(path=path, mode="w") as io:
    io.write(nwbfile)
    
from hdmf_zarr.nwb import NWBZarrIO

path = "zarr_test.nwb.zarr"

io = NWBZarrIO(path=path, mode="r")
nwbfile = io.read()
nwbfile

[screenshot: HTML representation of the Zarr-backed NWBFile]

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

@h-mayorquin h-mayorquin marked this pull request as ready for review April 23, 2024 15:14

codecov bot commented Apr 23, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 8 lines in your changes missing coverage. Please review.

Project coverage is 89.12%. Comparing base (06a62b9) to head (01f8f8f).
Report is 1 commit behind head on dev.

Files with missing lines Patch % Lines
src/hdmf/utils.py 87.50% 2 Missing and 2 partials ⚠️
src/hdmf/backends/hdf5/h5tools.py 84.61% 1 Missing and 1 partial ⚠️
src/hdmf/container.py 84.61% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1100      +/-   ##
==========================================
+ Coverage   89.08%   89.12%   +0.03%     
==========================================
  Files          45       45              
  Lines        9890     9944      +54     
  Branches     2816     2825       +9     
==========================================
+ Hits         8811     8863      +52     
+ Misses        763      762       -1     
- Partials      316      319       +3     


@rly rly requested a review from stephprince April 24, 2024 01:21
@rly rly added the category: enhancement improvements of code or code behavior label Apr 24, 2024
@rly rly added this to the 3.14.0 milestone Apr 24, 2024
Contributor

@stephprince stephprince left a comment


This looks great! Thanks for the PR.

Could you add tests for the data html representation with hdf5 and zarr? I think we mainly have string equivalence tests for this kind of thing.

I'm also wondering if it would be nice to have the hdf5 dataset info displayed in a table format similar to the zarr arrays, to make it more consistent across backends. I think we should be able to replicate this using the hdf5 dataset info as input to a method like this one: https://github.com/zarr-developers/zarr-python/blob/9d046ea0d2878af7d15b3de3ec3036fe31661340/zarr/util.py#L402
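
A rough sketch of that idea (editor's illustration, not necessarily the code that ended up in the PR; the chosen fields and styling are assumptions) would collect a few h5py.Dataset properties into a dict and render them as a two-column HTML table, mirroring zarr's info table:

import h5py

def h5_dataset_info_html(dataset: h5py.Dataset) -> str:
    # collect basic properties of the HDF5 dataset (all standard h5py attributes)
    info = {
        "Data type": str(dataset.dtype),
        "Shape": str(dataset.shape),
        "Chunk shape": str(dataset.chunks),
        "Compression": str(dataset.compression),
        "Uncompressed size (bytes)": str(dataset.size * dataset.dtype.itemsize),
    }
    # render as a simple two-column table, similar in spirit to zarr's info table
    rows = "".join(f"<tr><th>{key}</th><td>{value}</td></tr>" for key, value in info.items())
    return f"<table>{rows}</table>"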

@h-mayorquin
Contributor Author

OK, I added table formatting for HDF5:

[screenshot: HDF5 dataset info rendered as a table in the HTML representation]

@h-mayorquin
Contributor Author

h-mayorquin commented Apr 26, 2024

@stephprince
Concerning the tests: yes, I can do that, but can you help me create a container that contains array data? I just don't have experience with the bare-bones object. This is my attempt:

import numpy as np
from hdmf.container import Container

container = Container(name="Container")
container.__fields__ = {
    "name": "data",
    "description": "test data",
}

test_data = np.array([1, 2, 3, 4, 5])
setattr(container, "data", test_data)
container.fields

But the data is not added as a field. How can I move forward?
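
Editor's note: one possible way to get an array-backed container for such a test (an assumption, not a suggestion from the maintainers) is to use an existing Data subclass such as VectorData, which accepts an array directly:

import numpy as np
from hdmf.common import VectorData

test_data = np.array([1, 2, 3, 4, 5])
vector_data = VectorData(name="data", description="test data", data=test_data)
print(vector_data.data)  # the array is accessible via the container's data attribute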

@h-mayorquin
Copy link
Contributor Author

Related:

hdmf-dev/hdmf-zarr#186

@h-mayorquin
Copy link
Contributor Author

I added handling for the division by zero; check out what happens with external files (like video):

[screenshot: HTML representation of a dataset backed by external files]

From this example:

import remfile
import h5py
# `dandiset` is assumed to have been created beforehand, e.g. with
# DandiAPIClient().get_dandiset(...) from the dandi package.

asset_path = "sub-CSHL049/sub-CSHL049_ses-c99d53e6-c317-4c53-99ba-070b26673ac4_behavior+ecephys+image.nwb"
recording_asset = dandiset.get_asset_by_path(path=asset_path)
url = recording_asset.get_content_url(follow_redirects=True, strip_query=True)
file_path = url

rfile = remfile.File(file_path)
file = h5py.File(rfile, 'r')

from pynwb import NWBHDF5IO

io = NWBHDF5IO(file=file, mode='r')

nwbfile = io.read()
nwbfile
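
Editor's note: the division by zero presumably arises when computing a compression ratio and the stored size is zero, as it is for externally stored data; a hypothetical guard looks roughly like this (not necessarily the exact code added in the PR):

# dataset is an h5py.Dataset; get_storage_size() can be 0, e.g. for data not stored in the file itself
compressed_size = dataset.id.get_storage_size()
uncompressed_size = dataset.size * dataset.dtype.itemsize
compression_ratio = uncompressed_size / compressed_size if compressed_size != 0 else "unavailable"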

@rly
Contributor

rly commented Oct 2, 2024

@stephprince when you have time, can you review this?

@rly rly modified the milestones: 3.14.5, 3.14.6 Oct 3, 2024
@stephprince
Copy link
Contributor

Rereading through this discussion, I believe where we left off is that we want to remove the backend-specific logic from the Container class. To do so, it was proposed that:

In this PR we:

  • Add HDMFIO.generate_dataset_html(dataset) which would implement a minimalist representation
  • Implement HDF5IO.generate_dataset_html(h5py.Dataset) to represent an h5py.Dataset

In a separate PR on hdmf_zarr we would:

  • implement ZarrIO.generate_dataset_html(Zarr.array)

In the Container class, it would look like this:

read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
if read_io is not None:
    html_repr = read_io.generate_dataset_html(my_dataset)
else:
    # The file was not read from disk, so the dataset should be a numpy array or a list
    pass  # handle the basic in-memory representation here

@h-mayorquin did you want to do this? Otherwise I can go ahead and make the proposed changes to finish up this PR.

@h-mayorquin
Contributor Author

Hi, @stephprince

I think this is a good summary.

I am not sure how to decouple HDF5IO.generate_dataset_html(h5py.Dataset) here, as hdmf seems tightly coupled with HDF5. Or is the idea that we only want to exclude Zarr?

This has been on the back of my mind for a while, but every time I had other priorities. It would be great if you have time to finish it.

@stephprince
Contributor

@h-mayorquin yes I can take a look at it and finish it up

@stephprince
Contributor

I have pushed the updates we discussed:

  • added utility functions generate_array_html_repr and get_basic_array_info to the utils module to get basic array info and generate an array html table (a usage sketch follows this list)
  • added a static HDMFIO.generate_dataset_html() method; the HDF5/Zarr implementations collect information from the dataset and then generate the actual html representation
  • updated Container._generate_array_html() to use these methods
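
A minimal usage sketch of the new utilities (editor's illustration; the exact signatures may differ slightly, and the label argument mirrors the Zarr example below):

import numpy as np
from hdmf.utils import get_basic_array_info, generate_array_html_repr

array = np.random.rand(100, 10)
array_info = get_basic_array_info(array)  # basic info such as dtype, shape, and size
html_repr = generate_array_html_repr(array_info, array, "NumPy array")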

I tested a Zarr implementation that looks like this and can submit a PR in hdmf_zarr for that:

def generate_dataset_html(dataset):
    """Generates an html representation for a dataset for the ZarrIO class"""

    # get info from zarr array and generate html repr
    zarr_info_dict = {k:v for k, v in dataset.info_items()}
    repr_html = generate_array_html_repr(zarr_info_dict, dataset, "Zarr Array")

    return repr_html
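
For comparison, a minimal sketch of what the HDF5 counterpart might look like (editor's illustration; the exact keys and label are assumptions, not necessarily the merged HDF5IO implementation):

from hdmf.utils import generate_array_html_repr

def generate_dataset_html(dataset):
    """Generates an html representation for an h5py.Dataset for the HDF5IO class"""

    # collect basic info from the h5py dataset and generate the html repr
    hdf5_info_dict = {
        "Data type": dataset.dtype,
        "Shape": dataset.shape,
        "Chunk shape": dataset.chunks,
        "Compression": dataset.compression,
        "Uncompressed size (bytes)": dataset.size * dataset.dtype.itemsize,
    }
    repr_html = generate_array_html_repr(hdf5_info_dict, dataset, "HDF5 Dataset")

    return repr_html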

@oruebel @h-mayorquin if you could please review and let me know if there are any remaining concerns

@h-mayorquin
Contributor Author

Looks good to me, thanks for taking this on.

oruebel previously approved these changes Oct 31, 2024
@stephprince stephprince merged commit be602e5 into hdmf-dev:dev Nov 5, 2024
29 checks passed
@h-mayorquin h-mayorquin deleted the improve_html_repr_of_data branch November 5, 2024 21:50
Labels
category: enhancement improvements of code or code behavior

4 participants