# Image `cast_storage` very slow for arrays (e.g. numpy, tensors) #6782
This may be a solution that only changes the `elif pa.types.is_list(storage.type):` branch of `cast_storage`:

```python
# `pa` (pyarrow) and `encode_np_array` are already available in image.py
from .features import Array3DExtensionType

def get_shapes(arr):
    shape = ()
    while isinstance(arr, pa.ListArray):
        len_curr = len(arr)
        arr = arr.flatten()
        len_new = len(arr)
        shape = shape + (len_new // len_curr,)
    return shape

def get_dtypes(arr):
    dtype = arr.type  # fixed: was `storage.type`, which ignored the argument
    while hasattr(dtype, "value_type"):
        dtype = dtype.value_type
    return dtype

arrays = []
for i, is_null in enumerate(storage.is_null()):
    if not is_null.as_py():
        storage_part = storage.take([i])
        shape = get_shapes(storage_part)
        dtype = get_dtypes(storage_part)
        extension_type = Array3DExtensionType(shape=shape, dtype=str(dtype))
        array = pa.ExtensionArray.from_storage(extension_type, storage_part)
        arrays.append(array.to_numpy().squeeze(0))
    else:
        arrays.append(None)

bytes_array = pa.array(
    [encode_np_array(arr)["bytes"] if arr is not None else None for arr in arrays],
    type=pa.binary(),
)
path_array = pa.array([None] * len(storage), type=pa.string())
storage = pa.StructArray.from_arrays(
    [bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null()
)
```

(Edited to handle nulls.) Notably this doesn't change anything about the passing through of data or other things; it only touches this branch. Profile:

```
Fri Apr 5 17:55:51 2024    restats

         63818 function calls (61995 primitive calls) in 0.812 seconds

   Ordered by: cumulative time
   List reduced from 1051 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     47/1    0.000    0.000    0.810    0.810 {built-in method builtins.exec}
      2/1    0.000    0.000    0.810    0.810 <string>:1(<module>)
      2/1    0.000    0.000    0.809    0.809 arrow_dataset.py:594(wrapper)
      2/1    0.000    0.000    0.809    0.809 arrow_dataset.py:551(wrapper)
      2/1    0.000    0.000    0.809    0.809 arrow_dataset.py:2916(map)
        3    0.000    0.000    0.807    0.269 arrow_dataset.py:3277(_map_single)
        1    0.000    0.000    0.760    0.760 arrow_writer.py:589(finalize)
        1    0.000    0.000    0.760    0.760 arrow_writer.py:423(write_examples_on_file)
        1    0.000    0.000    0.759    0.759 arrow_writer.py:527(write_batch)
        1    0.001    0.001    0.754    0.754 arrow_writer.py:161(__arrow_array__)
      2/1    0.000    0.000    0.719    0.719 table.py:1800(wrapper)
        1    0.000    0.000    0.719    0.719 table.py:1950(cast_array_to_feature)
        1    0.006    0.006    0.718    0.718 image.py:209(cast_storage)
        1    0.000    0.000    0.451    0.451 image.py:361(encode_np_array)
        1    0.000    0.000    0.444    0.444 image.py:343(image_to_bytes)
        1    0.000    0.000    0.413    0.413 Image.py:2376(save)
        1    0.000    0.000    0.413    0.413 PngImagePlugin.py:1233(_save)
        1    0.000    0.000    0.413    0.413 ImageFile.py:517(_save)
        1    0.000    0.000    0.413    0.413 ImageFile.py:545(_encode_tile)
      397    0.409    0.001    0.409    0.001 {method 'encode' of 'ImagingEncoder' objects}
```
Also encountered this problem. Have been struggling with it for a long time...
This actually applies to all arrays (numpy or tensors like in torch), not only to those loaded from external files:

```python
import numpy as np
import datasets

ds = datasets.Dataset.from_dict(
    {"image": [np.random.randint(0, 255, (2048, 2048, 3), dtype=np.uint8)]},
    features=datasets.Features({"image": datasets.Image(decode=True)}),
)
ds.set_format("numpy")
ds = ds.map(load_from_cache_file=False)
```
Update: see comments below
### Describe the bug

Operations that save an image from a path are very slow.

I believe the reason for this is that the image data (`numpy`) is converted into `pyarrow` format but then back to Python using `.to_pylist()` before being converted to a numpy array again. `.to_pylist()` is already slow, but used on a multi-dimensional numpy array such as an image it takes a very long time.

From the trace below we can see that `__arrow_array__` takes a long time. It is currently also called in `get_inferred_type`; this should be removable (#6781) but doesn't change the underlying issue.

The conversion to `pyarrow` and back also leads to the `numpy` array having type `int64`, which causes a warning message because the image type expects `uint8`. However, originally the `numpy` image array was in `uint8`.
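The `uint8` → `int64` dtype loss is easy to reproduce in isolation: `.to_pylist()` hands back plain Python ints, so rebuilding the array re-infers the dtype. A minimal sketch with plain `numpy`, where `tolist()` stands in for the pyarrow round trip:

```python
import numpy as np

img = np.random.randint(0, 255, (4, 4, 3), dtype=np.uint8)

# Going through plain Python ints (as pyarrow's .to_pylist() does)
# loses the original dtype; numpy re-infers the platform default int.
restored = np.asarray(img.tolist())

print(img.dtype)                      # uint8
print(restored.dtype)                 # int64 on most platforms
print(np.array_equal(img, restored))  # True: values survive, the dtype doesn't
```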
### Steps to reproduce the bug

### Expected behavior

The `numpy` image data should be passed through, as it will be directly consumed by `pillow` to convert it to bytes.

As an example, one can replace `list_of_np_array_to_pyarrow_listarray(data)` in `__arrow_array__` with just `out = data` as a test. We then have to change `cast_storage` of the `Image` feature so it handles the passed-through data (and check the type beforehand), leading to the following:
This is of course only a test, as it passes through all `numpy` arrays irrespective of whether they should be an image. Also, I guess `cast_storage` is meant for casting `pyarrow` storage exclusively. Converting to a `pyarrow` array seems like a good solution as it also handles `pytorch` tensors etc.; maybe there is a more efficient way to create a PIL image from a `pyarrow` array? Not sure how this should be handled, but I would be happy to help if there is a good solution.
### Environment info

- `datasets` version: 2.18.1.dev0
- `huggingface_hub` version: 0.22.2
- `fsspec` version: 2024.3.1