[BUG]: MG Link_Prediction Tests Hanging #4704

Open · nv-rliu opened this issue Oct 8, 2024 · 1 comment · May be fixed by #4741
nv-rliu (Contributor) commented Oct 8, 2024

Version

24.10

Which installation method(s) does this occur on?

Conda

Describe the bug.

A handful of the MG Link Prediction tests are hanging and timing out on single-node 2-GPU and 8-GPU configurations in our nightlies.

The tests are:

  • test_cosine_mg
  • test_jaccard_mg
  • test_overlap_mg
  • test_sorensen_mg

The first few tests pass, but then the pytest run hangs until the job is killed by the timeout.

Minimum reproducible example

pytest -v --import-mode=append test_jaccard_mg.py
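
For context, here is a rough standalone sketch of what these MG tests exercise, which may help isolate the hang outside of pytest. It is not taken from the test suite: the karate.csv path is copied from the logs below, while the space-delimited src/dst/wgt column layout and the cugraph.dask entry point are assumptions; adjust for your environment.

```python
# Hypothetical standalone reproduction sketch: LocalCUDACluster + cuGraph comms +
# MG jaccard, mirroring what the hanging link-prediction tests do.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
import cugraph
import cugraph.dask as dcg
from cugraph.dask.comms import comms as Comms

cluster = LocalCUDACluster()           # one worker per visible GPU
client = Client(cluster)
Comms.initialize(p2p=True)             # comms layer used by cuGraph MG algorithms

# Path taken from the test logs; column names/delimiter are assumptions
ddf = dask_cudf.read_csv(
    "/datasets/karate.csv",
    delimiter=" ",
    names=["src", "dst", "wgt"],
    dtype=["int32", "int32", "float32"],
)

G = cugraph.Graph(directed=False)
G.from_dask_cudf_edgelist(ddf, source="src", destination="dst")

# The call the hanging tests exercise (jaccard here; cosine/overlap/sorensen are analogous)
result = dcg.jaccard(G).compute()
print(result.head())

Comms.destroy()
client.close()
cluster.close()
```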

Relevant log output

10/06/24-09:44:46.865339912_UTC>>>> NODE 0: ******** STARTING TESTS FROM: tests/link_prediction/test_jaccard_mg.py, using 2 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.3.3, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=1 min_time=0.000005 max_time=0 calibration_precision=10 warmup=False warmup_iterations=100000)
rapids_pytest_benchmark: 0.0.15
rootdir: /root/cugraph/python/cugraph
configfile: pytest.ini
plugins: benchmark-4.0.0, cov-5.0.0, rapids-pytest-benchmark-0.0.15
collecting ... collected 96 items

tests/link_prediction/test_jaccard_mg.py::test_dask_mg_jaccard[graph_file:/datasets/karate.csv-directed:False-has_vertex_pair:False-has_vertices:False-has_topk:False-is_weighted:False] 
Dask client/cluster created using LocalCUDACluster
PASSED
tests/link_prediction/test_jaccard_mg.py::test_dask_mg_all_pairs_jaccard[graph_file:/datasets/karate.csv-directed:False-has_vertex_pair:False-has_vertices:False-has_topk:False-is_weighted:False] 2024-10-06 02:54:51,390 - distributed.scheduler - WARNING - Worker failed to heartbeat for 589s; attempting restart: <WorkerState 'tcp://127.0.0.1:34653', name: 0, status: running, memory: 1, processing: 1>
2024-10-06 02:54:51,391 - distributed.scheduler - WARNING - Worker failed to heartbeat for 589s; attempting restart: <WorkerState 'tcp://127.0.0.1:45585', name: 1, status: running, memory: 1, processing: 1>
2024-10-06 02:54:55,395 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-10-06 02:54:55,395 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-10-06 02:54:55,805 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:45585' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-5bafbba2-f450-4ef0-9a27-30cbbeec400a'} (stimulus_id='handle-worker-cleanup-1728208495.8048048')
2024-10-06 02:54:55,844 - distributed.nanny - WARNING - Restarting worker
vertices  None
2024-10-06 02:54:55,882 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:34653' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'_make_plc_graph-7526511b-3437-4809-b535-b91dd8cd9a4c'} (stimulus_id='handle-worker-cleanup-1728208495.8818748')
2024-10-06 02:54:55,929 - distributed.nanny - WARNING - Restarting worker
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
2024-10-06 02:59:48,917 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-10-06 02:59:48,918 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1022, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1022, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
2024-10-06 03:00:17,190 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:45585 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:45585 after 30 s
2024-10-06 03:00:17,191 - distributed.scheduler - ERROR - broadcast to tcp://127.0.0.1:34653 failed: OSError: Timed out trying to connect to tcp://127.0.0.1:34653 after 30 s

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/opt/conda/lib/python3.10/threading.py:324: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 546, in connect
    stream = await self.client.connect(
  File "/opt/conda/lib/python3.10/site-packages/tornado/tcpclient.py", line 279, in connect
    af, addr, stream = await connector.start(connect_timeout=timeout)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 342, in connect
    comm = await wait_for(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 1926, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/opt/conda/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/pytest", line 10, in <module>
    sys.exit(console_main())
  File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 201, in console_main
    code = main()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 175, in main
    ret: ExitCode | int = config.hook.pytest_cmdline_main(config=config)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 330, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 318, in wrap_session
    config.hook.pytest_sessionfinish(
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 182, in _multicall
    return outcome.get_result()
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_result.py", line 100, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/logging.py", line 868, in pytest_sessionfinish
    return (yield)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/terminal.py", line 893, in pytest_sessionfinish
    result = yield
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/warnings.py", line 141, in pytest_sessionfinish
    return (yield)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 107, in pytest_sessionfinish
    session._setupstate.teardown_exact(None)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 557, in teardown_exact
    raise exceptions[0]
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 546, in teardown_exact
    fin()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1032, in finish
    raise exceptions[0]
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1021, in finish
    fin()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 907, in _teardown_yield_fixture
    next(it)
  File "/root/cugraph/python/cugraph/cugraph/tests/conftest.py", line 52, in dask_client
    stop_dask_client(dask_client, dask_cluster)
  File "/opt/conda/lib/python3.10/site-packages/cugraph/testing/mg_utils.py", line 182, in stop_dask_client
    Comms.destroy()
  File "/opt/conda/lib/python3.10/site-packages/cugraph/dask/comms/comms.py", line 214, in destroy
    __instance.destroy()
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
    self.client.run(
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3183, in run
    return self.sync(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 364, in sync
    return sync(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 440, in sync
    raise error
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 414, in f
    result = yield future
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3088, in _run
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/distributed/scheduler.py", line 6642, in send_message
    comm = await self.rpc.connect(addr)
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1535, in connect
    return connect_attempt.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1425, in _connect
    comm = await connect(
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://127.0.0.1:34653 after 30 s
Exception ignored in: <function Comms.__del__ at 0x15234c2544c0>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 135, in __del__
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3183, in run
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 364, in sync
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 431, in sync
  File "/opt/conda/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 227, in add_callback
AttributeError: 'NoneType' object has no attribute 'get_running_loop'
10/06/24-10:00:18.401427570_UTC>>>> ERROR: command timed out after 900 seconds
10/06/24-10:00:18.402550166_UTC>>>> NODE 0: pytest exited with code: 124, run-py-tests.sh overall exit code is: 124
10/06/24-10:00:18.480734601_UTC>>>> NODE 0: remaining python processes: [ 2532726 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
10/06/24-10:00:18.505469746_UTC>>>> NODE 0: remaining dask processes: [  ]

Environment details

Running inside the nightly MNMG test containers.

Other/Misc.

No response

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
nv-rliu added the "? - Needs Triage" and "bug" labels on Oct 8, 2024
nv-rliu added this to the 24.12 milestone on Oct 9, 2024
nv-rliu added the "cuGraph" label and removed the "? - Needs Triage" label on Oct 14, 2024
ChuckHastings (Collaborator) commented:

This has been isolated to an issue in the C++ implementation. I'm moving this to the core board for the moment.

ChuckHastings linked a pull request (#4741) on Nov 12, 2024 that will close this issue