UCX "Connection timed out" error #7227

Open
chrhansk opened this issue Nov 29, 2024 · 0 comments

I wanted to test the MPI setup on a cluster using the standard "hello world" example. The cluster runs Ubuntu 24.04 (noble) with a relatively recent version of MPICH:

libmpich-dev/noble,now 4.2.0-5build3 amd64 [installed,automatic]
libmpich12/noble,now 4.2.0-5build3 amd64 [installed,automatic]

I compiled the program using mpicc, but when I run it via UCX_LOG_LEVEL=info mpirun -n 2 ./hello, I get the following errors:

[1732873719.963681] [<hostname>:889102:0]     ucp_context.c:2137 UCX  INFO  Version 1.16.0 (loaded from /lib/x86_64-linux-gnu/libucp.so.0)
[1732873720.102385] [<hostname>:889102:0]        ib_iface.c:1138 UCX  ERROR ibv_create_cq(cqe=256) failed: Connection timed out
Abort(675407631) on node 0: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48301)..........: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(265).........: 
MPIR_init_comm_world(34)......: 
MPIR_Comm_commit(800).........: 
MPIR_Comm_commit_internal(585): 
MPID_Comm_commit_pre_hook(151): 
MPIDI_world_pre_init(640).....: 
MPIDI_UCX_init_world(260).....: 
init_worker(38)...............:  ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)
[1732873720.103858] [<hostname>:889102:0]           async.c:679  UCX  DIAG  async handler table is not empty during exit (contains 3 elems)
[1732873720.103864] [<hostname>:889102:0]          thread.c:433  UCX  DIAG  async thread still running (use count 3)
[1732873720.103864] [<hostname>:889102:0]          thread.c:433  UCX  DIAG  async thread still running (use count 3)
[1732873719.962201] [<hostname>:889101:0]     ucp_context.c:2137 UCX  INFO  Version 1.16.0 (loaded from /lib/x86_64-linux-gnu/libucp.so.0)
[1732873720.112981] [<hostname>:889101:0]        ib_iface.c:1138 UCX  ERROR ibv_create_cq(cqe=256) failed: Connection timed out
Abort(675407631) on node 0: Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(48301)..........: MPI_Init(argc=(nil), argv=(nil)) failed
MPII_Init_thread(265).........: 
MPIR_init_comm_world(34)......: 
MPIR_Comm_commit(800).........: 
MPIR_Comm_commit_internal(585): 
MPID_Comm_commit_pre_hook(151): 
MPIDI_world_pre_init(640).....: 
MPIDI_UCX_init_world(260).....: 
init_worker(38)...............:  ucx function returned with failed status(ucx_init.c 38 init_worker Input/output error)
[1732873720.115111] [<hostname>:889101:0]           async.c:679  UCX  DIAG  async handler table is not empty during exit (contains 3 elems)
[1732873720.115118] [<hostname>:889101:0]          thread.c:433  UCX  DIAG  async thread still running (use count 3)
[1732873720.115119] [<hostname>:889101:0]          thread.c:433  UCX  DIAG  async thread still running (use count 3)

Is there a way to troubleshoot / sidestep this problem?
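
Since the first error comes from ib_iface.c, one sidestep I could try is keeping UCX away from the InfiniBand transports altogether. The following is only a sketch under that assumption: run_without_ib is a hypothetical helper name, and the transport list tcp,self,sm is my guess at a sensible non-IB selection for the UCX_TLS environment variable, not a confirmed fix for this cluster.

```shell
# Hypothetical helper (not part of UCX or MPICH): run the given command with
# UCX limited to TCP, loopback (self), and shared-memory transports, so that
# UCX does not try to open an InfiniBand interface.
# Assumption: the failure is specific to the IB transport.
run_without_ib() {
    UCX_TLS=tcp,self,sm UCX_LOG_LEVEL=info "$@"
}

# Usage: run_without_ib mpirun -n 2 ./hello
```

If the program starts with this setting, that would at least suggest the problem lies with the InfiniBand device or its configuration rather than with MPICH itself.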
