[BUG] Deeplake dataset row access fails under multiprocessing #2977

Open
abhayv opened this issue Oct 25, 2024 · 2 comments
Labels: bug (Something isn't working)

abhayv commented Oct 25, 2024

Severity

P0 - Critical breaking issue or missing functionality

Current Behavior

Accessing rows of a Deep Lake dataset from a worker process (e.g. via concurrent.futures.ProcessPoolExecutor) results in an error.

Consider the following script, which creates a dummy Deep Lake dataset and then tries to access it under multiprocessing:

import concurrent.futures
from functools import partial

import deeplake
from deeplake import Dataset

DS_PATH = "/tmp/test_deeplake"


def create_deeplake_ds():
    ds = deeplake.empty(DS_PATH, overwrite=True)

    with ds:
        ds.create_tensor("dummy", htype="text")
        ds.dummy.append("dummy_test")


def worker(idx: int, ds: Dataset) -> None:
    print("Row", ds[idx])


if __name__ == "__main__":
    use_multi = True
    create_deeplake_ds()
    ds = deeplake.load(DS_PATH, read_only=True)

    if use_multi:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = list(executor.map(partial(worker, ds=ds), [0]))

    else:
        results = worker(0, ds=ds)

With deeplake 3.9.26, this produces the following error:

concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/site-packages/deeplake/core/dataset/dataset.py", line 1380, in __getattr__
    return self.__getitem__(key)
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/site-packages/deeplake/core/dataset/dataset.py", line 582, in __getitem__
    raise TensorDoesNotExistError(item)
deeplake.util.exceptions.TensorDoesNotExistError: "Tensor 'index_params' does not exist."

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/concurrent/futures/process.py", line 202, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/concurrent/futures/process.py", line 202, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/Users/abhay/deep_test/deep.py", line 19, in worker
    print("Row", ds[idx])
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/site-packages/deeplake/core/dataset/dataset.py", line 653, in __getitem__
    index_params=self.index_params,
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/site-packages/deeplake/core/dataset/dataset.py", line 1382, in __getattr__
    raise AttributeError(
AttributeError: '<class 'deeplake.core.dataset.dataset.Dataset'>' object has no attribute 'index_params'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/abhay/deep_test/deep.py", line 29, in <module>
    results = list(executor.map(partial(worker, ds=ds), [0]))
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/concurrent/futures/process.py", line 567, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/Users/abhay/miniconda3/envs/test/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
AttributeError: '<class 'deeplake.core.dataset.dataset.Dataset'>' object has no attribute 'index_params'
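
For what it's worth, the traceback suggests the Dataset object does not fully survive being pickled into the worker process (the deserialized copy is missing index_params). A possible workaround, sketched below and untested, is to pass only the dataset path to the workers and re-open it there with deeplake.load, so the Dataset object is never pickled across the process boundary (this assumes the dataset was already created, as in the script above):

import concurrent.futures

import deeplake

DS_PATH = "/tmp/test_deeplake"


def worker(idx: int) -> None:
    # Re-open the dataset inside the child process instead of receiving
    # a pickled Dataset object from the parent.
    ds = deeplake.load(DS_PATH, read_only=True)
    print("Row", ds[idx])


if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        list(executor.map(worker, [0]))

Re-loading per task adds overhead; for larger workloads, loading the dataset once per worker process (e.g. via the executor's initializer) would amortize the cost.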

Steps to Reproduce

See the script in Current Behavior above.

Expected/Desired Behavior

Either it should be documented that accessing a dataset under multiprocessing is not supported, or the access should not raise the error shown above.

Python Version

Python 3.10.0

OS

macOS Ventura 13.5

IDE

Terminal

Packages

deeplake==3.9.26

Additional Context

No response

Possible Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR (Thank you!)
abhayv added the bug (Something isn't working) label on Oct 25, 2024
davidbuniat (Member) commented

Thanks @abhayv for raising the issue! Adding @levongh to this thread to see what we can suggest for V3.9.26.

As you may know, we released V4 with an async-friendly API. Here is an async data loader example: https://docs.deeplake.ai/latest/guide/async-data-loader/.

We are hardening V4 and would love to get your thoughts and feedback.

davidbuniat (Member) commented

@abhayv we released deeplake==3.9.27, which addresses the issue. Let us know if the problem is fixed on your end.
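
To confirm which version is active before re-running the repro script (assuming the package exposes the standard __version__ attribute):

import deeplake

# Expect 3.9.27 or later if the upgrade took effect.
print(deeplake.__version__)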
