
[Bug] Image iterator reveals that list of references gets an incorrect data shape #1237

Open
h-mayorquin opened this issue Jan 21, 2025 · 0 comments · May be fixed by #1238
h-mayorquin commented Jan 21, 2025

I have been working on an iterator for conversions involving a large number of images to prevent excessive RAM usage when loading the images as arrays. Together with @oruebel, I developed a working example that performs well for cases involving only an Images container.

Click For Minimal Example with Image Iterator
from pathlib import Path

import numpy as np
from PIL import Image
from scipy.datasets import face

from hdmf.data_utils import AbstractDataChunkIterator, DataChunk

from pynwb.base import ImageReferences
from pynwb.image import GrayscaleImage, Images, IndexSeries, RGBImage
from pynwb.testing.mock.file import mock_NWBFile

class SingleImageIterator(AbstractDataChunkIterator):
    """Simple iterator that returns a single image. It avoids loading the entire image
    into memory at initialization and instead loads it lazily at write time."""

    def __init__(self, filename):
        self._filename = Path(filename)
        
        # Get image information without loading the full image
        with Image.open(self._filename) as img:
            self.image_mode = img.mode
            self._image_shape = img.size[::-1]  # PIL uses (width, height) instead of (height, width)
            self._max_shape = (None, None)
            
            self.number_of_bands = len(img.getbands())
            if self.number_of_bands > 1:
                self._image_shape += (self.number_of_bands,)
                self._max_shape += (self.number_of_bands,)
            
            # Calculate file size in bytes
            self._size_bytes = self._filename.stat().st_size
            # Calculate approximate memory size when loaded as numpy array
            self._memory_size = np.prod(self._image_shape) * np.dtype(float).itemsize

        self._images_returned = 0  # Number of images returned in __next__

    def __iter__(self):
        """Return the iterator object"""
        return self

    def __next__(self):
        """
        Return the DataChunk with the single full image
        """
        if self._images_returned == 0:
            data = np.asarray(Image.open(self._filename))
            selection = (slice(None),) * data.ndim
            self._images_returned += 1
            return DataChunk(data=data, selection=selection)
        else:
            raise StopIteration

    def recommended_chunk_shape(self):
        """
        Recommend the chunk shape for the data array.
        """
        return self._image_shape

    def recommended_data_shape(self):
        """
        Recommend the initial shape for the data array.
        """
        return self._image_shape

    @property
    def dtype(self):
        """
        Define the data type of the array
        """
        return np.dtype(float)

    @property
    def maxshape(self):
        """
        Property describing the maximum shape of the data array that is being iterated over
        """
        return self._max_shape


    def __len__(self):
        return self._image_shape[0]

    @property
    def size_info(self):
        """
        Return dictionary with size information
        """
        return {
            'file_size_bytes': self._size_bytes,
            'memory_size_bytes': self._memory_size,
            'shape': self._image_shape,
            'mode': self.image_mode,
            'bands': self.number_of_bands
        }



gs_image_array = face(gray=True)
rgb_image_array = face()

use_iterator = True
if not use_iterator:
    gs_face_object = GrayscaleImage(
        name="gs_face",
        data=gs_image_array,
        description="Grayscale version of a raccoon.",
        resolution=35.433071,
    )

    rgb_face_object = RGBImage(
        name="rgb_face",
        data=rgb_image_array,
        resolution=70.0,
        description="RGB version of a raccoon.",
    )
    
else:
    # Save the images to disk
    image_folder = Path("./images")
    image_folder.mkdir(parents=True, exist_ok=True)
    
    gs_file_path = image_folder / "gs_face.png"
    rgb_file_path = image_folder / "rgb_face.png"
    
    Image.fromarray(gs_image_array).save(gs_file_path)
    Image.fromarray(rgb_image_array).save(rgb_file_path)

    gs_face_object = GrayscaleImage(
        name="gs_face",
        data=SingleImageIterator(filename=gs_file_path),
        description="Grayscale version of a raccoon.",
        resolution=35.433071,
    )
    
    rgb_face_object = RGBImage(
        name="rgb_face",
        data=SingleImageIterator(filename=rgb_file_path),
        resolution=70.0,
        description="RGB version of a raccoon.",
    )

images = [gs_face_object, rgb_face_object]
order_of_images = ImageReferences("order_of_images", images)

image_container = Images(
    name="raccoons",
    images=images,
    description="A collection of raccoons.",
    order_of_images=order_of_images,
)

idx_series = IndexSeries(
    name="stimuli",
    data=np.asarray([0, 1, 0, 1], dtype=np.uint64),
    indexed_images=image_container,
    unit="N/A",
    timestamps=[0.1, 0.2, 0.3, 0.4],
)


nwbfile = mock_NWBFile()
nwbfile.add_stimulus_template(image_container)
nwbfile.add_stimulus(idx_series)

from pynwb import NWBHDF5IO

nwbfile_path = "test_images.nwb"
with NWBHDF5IO(nwbfile_path, "w") as io:
    io.write(nwbfile)
  
    
# nwbfile_read = NWBHDF5IO(nwbfile_path, "r").read()
# nwbfile_read

However, a problem arises when I try to use the images as templates referenced by an IndexSeries, which requires an ImageReferences object. The ImageReferences object contains a list of images whose data attribute is an iterator. Writing the NWB file then fails because hdmf.utils.get_data_shape attempts to infer the shape of the data by indexing into the iterator, which is not subscriptable. This results in an error:

Click For Error Trace
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 174
    172 nwbfile_path = "test_images.nwb"
    173 with NWBHDF5IO(nwbfile_path, "w") as io:
--> 174     io.write(nwbfile)
    177 # nwbfile_read = NWBHDF5IO(nwbfile_path, "r").read()
    178 # nwbfile_read

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/backends/hdf5/h5tools.py:395, in HDF5IO.write(self, **kwargs)
    390     raise UnsupportedOperation(("Cannot write to file %s in mode '%s'. "
    391                                 "Please use mode 'r+', 'w', 'w-', 'x', or 'a'")
    392                                % (self.source, self.__mode))
    394 cache_spec = popargs('cache_spec', kwargs)
--> 395 super().write(**kwargs)
    396 if cache_spec:
    397     self.__cache_spec()

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/backends/io.py:98, in HDMFIO.write(self, **kwargs)
     95     herd.to_zip(path=self.herd_path)
     97 """Write a container to the IO source."""
---> 98 f_builder = self.__manager.build(container, source=self.__source, root=True)
     99 self.write_builder(f_builder, **kwargs)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/manager.py:173, in BuildManager.build(self, **kwargs)
    169             raise ValueError("Cannot change container_source once set: '%s' %s.%s"
    170                              % (container.name, container.__class__.__module__,
    171                                 container.__class__.__name__))
    172 # NOTE: if exporting, then existing cached builder will be ignored and overridden with new build result
--> 173 result = self.__type_map.build(container, self, source=source, spec_ext=spec_ext, export=export)
    174 self.prebuilt(container, result)
    175 self.__active_prebuilt(result)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/manager.py:788, in TypeMap.build(self, **kwargs)
    786 if manager is None:
    787     manager = BuildManager(self)
--> 788 builder = obj_mapper.build(container, manager, builder=builder, source=source, spec_ext=spec_ext, export=export)
    790 # add additional attributes (namespace, data_type, object_id) to builder
    791 namespace, data_type = self.get_container_ns_dt(container)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/objectmapper.py:733, in ObjectMapper.build(self, **kwargs)
    731         builder = GroupBuilder(name, parent=parent, source=source)
    732     self.__add_datasets(builder, self.__spec.datasets, container, manager, source, export)
--> 733     self.__add_groups(builder, self.__spec.groups, container, manager, source, export)
    734     self.__add_links(builder, self.__spec.links, container, manager, source, export)
    735 else:

File ~/development/hdmf/src/hdmf/build/objectmapper.py:1130, in ObjectMapper.__add_groups(self, builder, groups, container, build_manager, source, export)
   1128 self.__add_datasets(sub_builder, spec.datasets, container, build_manager, source, export)
   1129 self.__add_links(sub_builder, spec.links, container, build_manager, source, export)
-> 1130 self.__add_groups(sub_builder, spec.groups, container, build_manager, source, export)
   1131 empty = sub_builder.is_empty()
   1132 if not empty or (empty and spec.required):

File ~/development/hdmf/src/hdmf/build/objectmapper.py:1130, in ObjectMapper.__add_groups(self, builder, groups, container, build_manager, source, export)
   1128 self.__add_datasets(sub_builder, spec.datasets, container, build_manager, source, export)
   1129 self.__add_links(sub_builder, spec.links, container, build_manager, source, export)
-> 1130 self.__add_groups(sub_builder, spec.groups, container, build_manager, source, export)
   1131 empty = sub_builder.is_empty()
   1132 if not empty or (empty and spec.required):

File ~/development/hdmf/src/hdmf/build/objectmapper.py:1143, in ObjectMapper.__add_groups(self, builder, groups, container, build_manager, source, export)
   1141 self.__check_quantity(attr_value, spec, container)
   1142 if attr_value is not None:
-> 1143     self.__add_containers(builder, spec, attr_value, build_manager, source, container, export)

File ~/development/hdmf/src/hdmf/build/objectmapper.py:1201, in ObjectMapper.__add_containers(self, builder, spec, value, build_manager, source, parent_container, export)
   1199 elif isinstance(value, list):
   1200     for container in value:
-> 1201         self.__add_containers(builder, spec, container, build_manager, source, parent_container, export)
   1202 else:  # pragma: no cover
   1203     msg = ("Received %s, expected AbstractContainer or a list of AbstractContainers."
   1204            % value.__class__.__name__)

File ~/development/hdmf/src/hdmf/build/objectmapper.py:1163, in ObjectMapper.__add_containers(self, builder, spec, value, build_manager, source, parent_container, export)
   1161 self.logger.debug("    Building newly instantiated %s '%s'" % (value.__class__.__name__, value.name))
   1162 if isinstance(spec, BaseStorageSpec):
-> 1163     new_builder = build_manager.build(value, source=source, spec_ext=spec, export=export)
   1164 else:
   1165     new_builder = build_manager.build(value, source=source, export=export)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/manager.py:173, in BuildManager.build(self, **kwargs)
    169             raise ValueError("Cannot change container_source once set: '%s' %s.%s"
    170                              % (container.name, container.__class__.__module__,
    171                                 container.__class__.__name__))
    172 # NOTE: if exporting, then existing cached builder will be ignored and overridden with new build result
--> 173 result = self.__type_map.build(container, self, source=source, spec_ext=spec_ext, export=export)
    174 self.prebuilt(container, result)
    175 self.__active_prebuilt(result)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/manager.py:788, in TypeMap.build(self, **kwargs)
    786 if manager is None:
    787     manager = BuildManager(self)
--> 788 builder = obj_mapper.build(container, manager, builder=builder, source=source, spec_ext=spec_ext, export=export)
    790 # add additional attributes (namespace, data_type, object_id) to builder
    791 namespace, data_type = self.get_container_ns_dt(container)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/objectmapper.py:734, in ObjectMapper.build(self, **kwargs)
    732     self.__add_datasets(builder, self.__spec.datasets, container, manager, source, export)
    733     self.__add_groups(builder, self.__spec.groups, container, manager, source, export)
--> 734     self.__add_links(builder, self.__spec.links, container, manager, source, export)
    735 else:
    736     if builder is None:

File ~/development/hdmf/src/hdmf/build/objectmapper.py:1066, in ObjectMapper.__add_links(self, builder, links, container, build_manager, source, export)
   1064     self.logger.debug("        Skipping link - no attribute value")
   1065     continue
-> 1066 self.__add_containers(builder, spec, attr_value, build_manager, source, container, export)

File ~/development/hdmf/src/hdmf/build/objectmapper.py:1165, in ObjectMapper.__add_containers(self, builder, spec, value, build_manager, source, parent_container, export)
   1163     new_builder = build_manager.build(value, source=source, spec_ext=spec, export=export)
   1164 else:
-> 1165     new_builder = build_manager.build(value, source=source, export=export)
   1166 # use spec to determine what kind of HDF5 object this AbstractContainer corresponds to
   1167 if isinstance(spec, LinkSpec) or value.parent is not parent_container:

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/manager.py:173, in BuildManager.build(self, **kwargs)
    169             raise ValueError("Cannot change container_source once set: '%s' %s.%s"
    170                              % (container.name, container.__class__.__module__,
    171                                 container.__class__.__name__))
    172 # NOTE: if exporting, then existing cached builder will be ignored and overridden with new build result
--> 173 result = self.__type_map.build(container, self, source=source, spec_ext=spec_ext, export=export)
    174 self.prebuilt(container, result)
    175 self.__active_prebuilt(result)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/manager.py:788, in TypeMap.build(self, **kwargs)
    786 if manager is None:
    787     manager = BuildManager(self)
--> 788 builder = obj_mapper.build(container, manager, builder=builder, source=source, spec_ext=spec_ext, export=export)
    790 # add additional attributes (namespace, data_type, object_id) to builder
    791 namespace, data_type = self.get_container_ns_dt(container)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/objectmapper.py:732, in ObjectMapper.build(self, **kwargs)
    730 if builder is None:
    731     builder = GroupBuilder(name, parent=parent, source=source)
--> 732 self.__add_datasets(builder, self.__spec.datasets, container, manager, source, export)
    733 self.__add_groups(builder, self.__spec.groups, container, manager, source, export)
    734 self.__add_links(builder, self.__spec.links, container, manager, source, export)

File ~/development/hdmf/src/hdmf/build/objectmapper.py:1113, in ObjectMapper.__add_datasets(self, builder, datasets, container, build_manager, source, export)
   1108 else:
   1109     self.logger.debug("        Adding typed dataset for spec name: %s, %s: %s, %s: %s"
   1110                       % (repr(spec.name),
   1111                          spec.def_key(), repr(spec.data_type_def),
   1112                          spec.inc_key(), repr(spec.data_type_inc)))
-> 1113     self.__add_containers(builder, spec, attr_value, build_manager, source, container, export)

File ~/development/hdmf/src/hdmf/build/objectmapper.py:1163, in ObjectMapper.__add_containers(self, builder, spec, value, build_manager, source, parent_container, export)
   1161 self.logger.debug("    Building newly instantiated %s '%s'" % (value.__class__.__name__, value.name))
   1162 if isinstance(spec, BaseStorageSpec):
-> 1163     new_builder = build_manager.build(value, source=source, spec_ext=spec, export=export)
   1164 else:
   1165     new_builder = build_manager.build(value, source=source, export=export)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/manager.py:173, in BuildManager.build(self, **kwargs)
    169             raise ValueError("Cannot change container_source once set: '%s' %s.%s"
    170                              % (container.name, container.__class__.__module__,
    171                                 container.__class__.__name__))
    172 # NOTE: if exporting, then existing cached builder will be ignored and overridden with new build result
--> 173 result = self.__type_map.build(container, self, source=source, spec_ext=spec_ext, export=export)
    174 self.prebuilt(container, result)
    175 self.__active_prebuilt(result)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/manager.py:788, in TypeMap.build(self, **kwargs)
    786 if manager is None:
    787     manager = BuildManager(self)
--> 788 builder = obj_mapper.build(container, manager, builder=builder, source=source, spec_ext=spec_ext, export=export)
    790 # add additional attributes (namespace, data_type, object_id) to builder
    791 namespace, data_type = self.get_container_ns_dt(container)

File ~/development/hdmf/src/hdmf/utils.py:577, in docval.<locals>.dec.<locals>.func_call(*args, **kwargs)
    575 def func_call(*args, **kwargs):
    576     pargs = _check_args(args, kwargs)
--> 577     return func(args[0], **pargs)

File ~/development/hdmf/src/hdmf/build/objectmapper.py:741, in ObjectMapper.build(self, **kwargs)
    739     raise ValueError(msg)
    740 spec_dtype, spec_shape, spec_dims, spec = self.__check_dset_spec(self.spec, spec_ext)
--> 741 dimension_labels = self.__get_dimension_labels_from_spec(container.data, spec_shape, spec_dims)
    742 if isinstance(spec_dtype, RefSpec):
    743     self.logger.debug("Building %s '%s' as a dataset of references (source: %s)"
    744                       % (container.__class__.__name__, container.name, repr(source)))

File ~/development/hdmf/src/hdmf/build/objectmapper.py:841, in ObjectMapper.__get_dimension_labels_from_spec(self, data, spec_shape, spec_dims)
    839 if spec_shape is None or spec_dims is None:
    840     return None
--> 841 data_shape = get_data_shape(data)
    842 # if shape is a list of allowed shapes, find the index of the shape that matches the data
    843 if isinstance(spec_shape[0], list):

File ~/development/hdmf/src/hdmf/utils.py:833, in get_data_shape(data, strict_no_data_load)
    831 if hasattr(data, '__len__') and not isinstance(data, (str, bytes)):
    832     if not strict_no_data_load or isinstance(data, (list, tuple, set)):
--> 833         return __get_shape_helper(data)
    834 return None

File ~/development/hdmf/src/hdmf/utils.py:821, in get_data_shape.<locals>.__get_shape_helper(local_data)
    819         el = next(iter(local_data))
    820         if not isinstance(el, (str, bytes)):
--> 821             shape.extend(__get_shape_helper(el))
    822 return tuple(shape)

File ~/development/hdmf/src/hdmf/utils.py:819, in get_data_shape.<locals>.__get_shape_helper(local_data)
    817 shape.append(len(local_data))
    818 if len(local_data):
--> 819     el = next(iter(local_data))
    820     if not isinstance(el, (str, bytes)):
    821         shape.extend(__get_shape_helper(el))

File ~/development/pynwb/src/pynwb/core.py:96, in NWBData.__getitem__(self, args)
     94 if isinstance(self.data, (tuple, list)) and isinstance(args, (tuple, list)):
     95     return [self.data[i] for i in args]
---> 96 return self.data[args]

TypeError: 'SingleImageIterator' object is not subscriptable
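The failure mechanism can be reproduced without pynwb or hdmf. Below is a minimal sketch using my own stand-in classes (not the real `SingleImageIterator` or `NWBData`) showing why `next(iter(...))` on a container that forwards `__getitem__` to a non-subscriptable iterator raises this exact `TypeError`:

```python
class IteratorData:
    """Stand-in for SingleImageIterator: iterable and sized, but NOT subscriptable."""
    def __len__(self):
        return 1

    def __iter__(self):
        return iter([object()])


class Wrapper:
    """Stand-in for pynwb's NWBData: it forwards indexing to self.data."""
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, args):
        return self.data[args]  # breaks when self.data is an iterator


w = Wrapper(IteratorData())
# Wrapper defines __getitem__ but not __iter__, so iter(w) falls back to the
# legacy sequence protocol: next() calls w[0], which in turn calls self.data[0].
try:
    next(iter(w))
except TypeError as error:
    print(error)  # 'IteratorData' object is not subscriptable
```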

Additionally, even without an iterator, get_data_shape computes an incorrect shape for an ImageReferences. For example, it reports a shape of [2, height, width] for a list with two references, which conflicts with the spec for ImageReferences, where the expected shape is [None]:

from hdmf.utils import get_data_shape

order_of_images_shape = get_data_shape(order_of_images.data)
print(f"order of images shape: {order_of_images_shape}")

order of images shape: (2, 768, 1024)
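The `(2, 768, 1024)` result comes from the shape helper recursing into the first element of the list. A simplified re-implementation of that recursion (a sketch for illustration, not hdmf's exact code) makes this visible with plain NumPy arrays standing in for the two images:

```python
import numpy as np

def naive_shape(data):
    """Simplified sketch of hdmf's get_data_shape recursion."""
    shape = []
    if hasattr(data, "__len__") and not isinstance(data, (str, bytes)):
        shape.append(len(data))
        if len(data):
            first = next(iter(data))
            if not isinstance(first, (str, bytes)):
                shape.extend(naive_shape(first))
    return tuple(shape)

# Two "references", each backed by a 768x1024 image array:
references = [np.zeros((768, 1024)), np.zeros((768, 1024))]
print(naive_shape(references))  # (2, 768, 1024), but the spec expects 1-D [None]
```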

This mismatch triggers the following warning:

if len(data_shape) != len(spec_shape):
    msg = "Shape of data does not match shape in spec '%s'" % self.spec.path
    warnings.warn(msg, IncorrectDatasetShapeBuildWarning)
    return None
# check each dimension. None means any length is allowed

The warning fires because the computed data shape has three dimensions, while the spec shape of ImageReferences is [None], i.e. one dimension of any length.
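Concretely, plugging the shapes from the example into the dimension-count comparison shows which branch is taken (a plain restatement of the values reported here, not hdmf code):

```python
spec_shape = (None,)          # ImageReferences spec: 1-D, any number of references
data_shape = (2, 768, 1024)   # shape currently reported by get_data_shape
expected_shape = (2,)         # shape a list of two references should report

# len mismatch -> IncorrectDatasetShapeBuildWarning, and the build returns None
print(len(data_shape) != len(spec_shape))      # True
print(len(expected_shape) == len(spec_shape))  # True
```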

@rly and I were discussing a fix for this issue, which I will add as a PR soon.

@h-mayorquin h-mayorquin linked a pull request Jan 21, 2025 that will close this issue
@rly rly added the category: bug errors in the code or code behavior label Jan 21, 2025
@rly rly added this to the Future milestone Jan 21, 2025