🐛[BUG]: aero_graph_net failed to load dataset #710

Open
willyawan16 opened this issue Nov 15, 2024 · 3 comments
Labels: ? - Needs Triage (Need team to review and classify), bug (Something isn't working)

Comments

@willyawan16

Version

0.8.0

On which installation method(s) does this occur?

No response

Describe the issue

Loading the dataset fails when I try to train aero_graph_net.
Is there any way to fix this?

It gets stuck in the Hydra instantiation, as shown in the error log.

Minimum reproducible example

Relevant log output

[18:55:46 - agnet - INFO] Loading the training dataset...
Error executing job with overrides: ['+experiment=ahmed/mgn', 'data.data_dir=./data/ahmed_body']
concurrent.futures.process._RemoteTraceback:
'''
Traceback (most recent call last):
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/process.py", line 392, in wait_result_broken_or_wakeup
    result_item = result_reader.recv()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 496, in rebuild_storage_fd
    fd = df.detach()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/multiprocessing/reduction.py", line 164, in recvfds
    raise RuntimeError('received %d items of ancdata' %
RuntimeError: received 0 items of ancdata
'''

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 92, in _call_target
    return _target_(*args, **kwargs)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/modulus/datapipes/gnn/ahmed_body_dataset.py", line 219, in __init__
    for (i, graph, coeff, normal, area) in executor.map(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/willy/modulus/modulus/examples/cfd/aero_graph_net/train.py", line 267, in <module>
    main()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/willy/modulus/modulus/examples/cfd/aero_graph_net/train.py", line 219, in main
    trainer = MGNTrainer(cfg)
  File "/home/willy/modulus/modulus/examples/cfd/aero_graph_net/train.py", line 54, in __init__
    self.dataset = instantiate(cfg.data.train)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 226, in instantiate
    return instantiate_node(
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 347, in instantiate_node
    return _call_target(_target_, partial, args, kwargs, full_key)
  File "/home/willy/anaconda3/envs/modulus/lib/python3.10/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 97, in _call_target
    raise InstantiationException(msg) from e
hydra.errors.InstantiationException: Error in call to target 'modulus.datapipes.gnn.ahmed_body_dataset.AhmedBodyDataset':
BrokenProcessPool('A process in the process pool was terminated abruptly while the future was running or pending.')
full_key: data.train

Environment details

@willyawan16 willyawan16 added ? - Needs Triage Need team to review and classify bug Something isn't working labels Nov 15, 2024
@Alexey-Kamenev
Collaborator

Alexey-Kamenev commented Nov 15, 2024

Can you please double-check that the path in data.data_dir=./data/ahmed_body is correct? Maybe try using an absolute path, just to check.

Also, try limiting the number of dataset pre-fetching workers: data.train.num_workers=1

Finally, see if the example works with a reduced dataset, for example, using only 2 train samples: data.train.num_samples=2
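A relative data.data_dir is resolved against the process working directory, and depending on the Hydra version and the hydra.job.chdir setting, Hydra may change that directory at launch. The sketch below is one way to check what a relative path actually resolves to before training. The helper name `resolve_data_dir` and its `base_dir` parameter are hypothetical and not part of Modulus or Hydra.

```python
from pathlib import Path
from typing import Optional


def resolve_data_dir(data_dir: str, base_dir: Optional[str] = None) -> Path:
    """Return data_dir as an absolute path, or raise if it is not a directory.

    Hypothetical helper for illustration only. base_dir defaults to the
    current working directory, which is what a relative path is resolved
    against when the training script runs.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = Path(data_dir).expanduser()
    if not path.is_absolute():
        path = base / path
    path = path.resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"data_dir does not exist: {path}")
    return path
```

Calling `resolve_data_dir("./data/ahmed_body")` from the directory where train.py is launched either returns the absolute path to pass as data.data_dir or fails early with the path that was actually tried.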

@willyawan16
Author

HYDRA_FULL_ERROR=1 python train.py +experiment=ahmed/mgn data.data_dir=/home/willy/modulus/modulus/examples/cfd/aero_graph_net/data/ahmed_body data.train.num_workers=1 data.val.num_workers=1 data.test.num_workers=1 data.train.num_samples=10 data.val.num_samples=5 data.test.num_samples=5

I changed my command to the one above, and it got past the dataset loading problem.
But why does the same error come back when I set num_samples higher than that?

@Alexey-Kamenev
Collaborator

So anything greater than 10 in data.train.num_samples causes that error to appear?
From the error itself, it looks like something happens during dataset pre-loading in one of the graph loading processes.
Unfortunately, I could not reproduce the issue on my side.
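One thing that may be worth ruling out: the first traceback ends in rebuild_storage_fd with "received 0 items of ancdata", which is commonly reported when PyTorch passes tensors between processes as file descriptors and the process runs into its open-file limit. That would be consistent with the error appearing only for larger num_samples, but it is an assumption and has not been confirmed in this thread. A minimal sketch using only the standard library to inspect the limit and raise the soft limit (the function names are hypothetical):

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_nofile_limit(target: int) -> int:
    """Raise the soft open-file limit towards target, capped at the hard limit.

    Returns the soft limit in effect afterwards. Never lowers the limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already at or above the requested value
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    print("open-file limits (soft, hard):", nofile_limits())
```

If the soft limit turns out to be low (1024 is a common default), raising it before the dataset is created, or running `ulimit -n` with a larger value in the shell, is a cheap experiment. Another option sometimes suggested for this error is calling `torch.multiprocessing.set_sharing_strategy("file_system")` early in train.py; whether either change helps here is untested.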

You can try adding some simple prints to the create_graph function to see if there is a particular file or place where the error occurs (and keep num_workers=1 to simplify the debugging).
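The print-based debugging described above could look roughly like the sketch below, which loads the files one at a time in the current process and prints each path before touching it. `load_sequentially` and `load_one` are hypothetical names; `load_one` stands in for whatever per-file loader is being debugged (for example create_graph, whose exact signature is not shown in this thread).

```python
import traceback


def load_sequentially(load_one, paths):
    """Load items one by one, printing progress so a bad file is easy to spot.

    Returns (loaded, failed), where failed is a list of (path, exception).
    """
    loaded, failed = [], []
    for i, path in enumerate(paths):
        # flush=True so the last path is visible even if the process dies hard
        print(f"[{i + 1}/{len(paths)}] loading {path}", flush=True)
        try:
            loaded.append(load_one(path))
        except Exception as exc:
            traceback.print_exc()
            failed.append((path, exc))
    return loaded, failed
```

Python exceptions end up in `failed`. If the process is killed outright (for instance by the out-of-memory killer), nothing is caught, but the last printed path still shows which file was being processed.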

Also, which environment does this issue happen in?
