MPICH hangs/crashes on Delta (with SS-11 network) when shm channels are enabled. #7184

JiakunYan opened this issue Oct 21, 2024 · 1 comment


@JiakunYan

This is more of a note, since I have already found a workaround for this issue. I am just posting it here in case you have a better understanding of the problem or a better solution.

I was running an HPX application on NCSA Delta (with the Slingshot-11 interconnect); the application is the same as in #7171. I was using the ofi netmod with the cluster-installed libfabric, running on two nodes with two processes per node.

In roughly half of the runs, the application crashed after MPICH delivered messages with the wrong data; in the other half, it hung. The problem typically occurred after only a few hundred to a few thousand messages had been sent.

I investigated the hanging case further with gdb and found a thread stuck in cxip_evtq_req_cancel that never returned while MPICH was trying to cancel an OFI receive. Below is the backtrace:

#0  0x00007f30af98ddfd in cxip_evtq_req_cancel () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
#1  0x00007f30af95e039 in cxip_rxc_cancel () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
#2  0x00007f30b129842f in fi_cancel (fid=0x220b1d0, context=0x64177a0) at /opt/cray/libfabric/1.15.2.0/include/rdma/fi_endpoint.h:210
#3  0x00007f30b12a51b1 in MPIDI_NM_mpi_cancel_recv (rreq=0x6417600, is_blocking=true) at ./src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:463
#4  0x00007f30b129c851 in MPIDI_anysrc_try_cancel_partner (rreq=0x6429500, is_cancelled=0x7f307c49b674) at ./src/mpid/ch4/src/mpidig_request.h:108
#5  0x00007f30b12a7862 in match_posted_rreq (rank=0, tag=524287, context_id=496, vci=0, req=0x7f307c49b6d0) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:183
#6  0x00007f30b12a80a6 in MPIDIG_send_target_msg_cb (am_hdr=0x7f30a6bca770, data=0x7f30a6bca7a0, in_data_sz=34, attr=1, req=0x0) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:338
#7  0x00007f30b10df000 in MPIDI_POSIX_progress_recv (vci=0, made_progress=0x7f307c49b890) at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:60
#8  0x00007f30b10df383 in MPIDI_POSIX_progress (vci=0, made_progress=0x7f307c49b890) at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:147
#9  0x00007f30b10df614 in MPIDI_SHM_progress (vci=0, made_progress=0x7f307c49b890) at ./src/mpid/ch4/shm/src/shm_progress.h:18
#10 0x00007f30b10e0f17 in MPIDI_progress_test (state=0x7f307c49b9b0) at ./src/mpid/ch4/src/ch4_progress.h:142
#11 0x00007f30b10e1d9b in MPID_Progress_test (state=0x7f307c49b9b0) at ./src/mpid/ch4/src/ch4_progress.h:242
#12 0x00007f30b10e2e98 in MPIR_Test_state (request_ptr=0x640baf0, flag=0x7f307c49baf8, status=0x7f307c49bb00, state=0x7f307c49b9b0) at src/mpi/request/request_impl.c:215
#13 0x00007f30b10e24b3 in MPID_Test (request_ptr=0x640baf0, flag=0x7f307c49baf8, status=0x7f307c49bb00) at ./src/mpid/ch4/src/ch4_wait.h:84
#14 0x00007f30b10e2f8b in MPIR_Test (request_ptr=0x640baf0, flag=0x7f307c49baf8, status=0x7f307c49bb00) at src/mpi/request/request_impl.c:236
#15 0x00007f30b06b31e7 in internal_Test (request=0x7f307c49bb20, flag=0x7f307c49baf8, status=0x7f307c49bb00) at src/binding/c/request/test.c:86
#16 0x00007f30b06b32cb in PMPI_Test (request=0x7f307c49bb20, flag=0x7f307c49baf8, status=0x7f307c49bb00) at src/binding/c/request/test.c:141
#17 0x00007f30af6fe38f in lcw::mpi::comp::manager_req_t::do_progress (this=0x24b3ac0) at /u/jiakuny/workspace/lcw/src/backend/mpi/comp_manager/manager_req.cpp:33
#18 lcw::mpi::comp::manager_req_t::do_progress (this=0x24b3ac0) at /u/jiakuny/workspace/lcw/src/backend/mpi/comp_manager/manager_req.cpp:17
#19 0x00007f30af6fc27a in lcw::backend_mpi_t::do_progress (this=<optimized out>, device=0x24d8280)
    at /sw/spack/deltas11-2023-03/apps/linux-rhel8-x86_64/gcc-8.5.0/gcc-11.4.0-yycklku/lib/gcc/x86_64-pc-linux-gnu/11.4.0/../../../../include/c++/11.4.0/bits/shared_ptr_base.h:1295
#20 0x00007f30b5498898 in hpx::parcelset::policies::lcw::parcelport::background_work (mode=<optimized out>, num_thread=47, this=0x22a2dc0)
    at /u/jiakuny/workspace/hpx-lcw/libs/full/parcelport_lcw/src/parcelport_lcw.cpp:214
#21 hpx::parcelset::policies::lcw::parcelport::background_work (this=0x22a2dc0, num_thread=47, mode=<optimized out>)
    at /u/jiakuny/workspace/hpx-lcw/libs/full/parcelport_lcw/src/parcelport_lcw.cpp:201
#22 0x00007f30b54b2909 in hpx::parcelset::parcelhandler::do_background_work (this=0x21b33a8, num_thread=<optimized out>, num_thread@entry=47, stop_buffering=stop_buffering@entry=false, 
    mode=mode@entry=hpx::parcelset::parcelport_background_mode::all)
    at /sw/spack/deltas11-2023-03/apps/linux-rhel8-x86_64/gcc-8.5.0/gcc-11.4.0-yycklku/lib/gcc/x86_64-pc-linux-gnu/11.4.0/../../../../include/c++/11.4.0/bits/shared_ptr_base.h:1295
#23 0x00007f30b556a881 in hpx::detail::network_background_callback (rt=0x21b2c00, num_thread=47)
    at /u/jiakuny/workspace/hpx-lcw/libs/full/runtime_distributed/src/runtime_distributed.cpp:167
#24 0x00007f30b4d0a33f in hpx::util::detail::basic_function<bool (unsigned long), true, false>::operator()(unsigned long) const (vs#0=<optimized out>, this=<optimized out>)
    at ../libs/core/functional/include/hpx/functional/detail/basic_function.hpp:233
#25 hpx::util::detail::deferred<hpx::function<bool (unsigned long), false>, hpx::util::pack_c<unsigned long, 0ul>, unsigned long>::operator()() (this=<optimized out>)
    at ../libs/core/functional/include/hpx/functional/deferred_call.hpp:89
#26 hpx::util::detail::callable_vtable<bool ()>::_invoke<hpx::util::detail::deferred<hpx::function<bool (unsigned long), false>, hpx::util::pack_c<unsigned long, 0ul>, unsigned long> >(void*) (f=<optimized out>) at ../libs/core/functional/include/hpx/functional/detail/vtable/callable_vtable.hpp:88
#27 0x00007f30b4d0610f in hpx::util::detail::basic_function<bool (), false, false>::operator()() const (this=<optimized out>)
    at ../libs/core/functional/include/hpx/functional/detail/basic_function.hpp:233
#28 operator() (__closure=0x5172000) at /u/jiakuny/workspace/hpx-lcw/libs/core/thread_pools/src/detail/background_thread.cpp:41
#29 hpx::util::detail::callable_vtable<std::pair<hpx::threads::thread_schedule_state, hpx::threads::thread_id>(hpx::threads::thread_restart_state)>::_invoke<hpx::threads::detail::create_background_thread(hpx::threads::policies::scheduler_base&, std::size_t, const hpx::threads::detail::scheduling_callbacks&, std::shared_ptr<bool>&, int64_t&)::<lambda(hpx::threads::thread_restart_state)> >(void *, hpx::threads::thread_restart_state &&) (f=0x5172000, vs#0=<optimized out>) at ../libs/core/functional/include/hpx/functional/detail/vtable/callable_vtable.hpp:88
#30 0x00007f30b4c4291f in hpx::util::detail::basic_function<std::pair<hpx::threads::thread_schedule_state, hpx::threads::thread_id> (hpx::threads::thread_restart_state), false, false>::operator()(hpx::threads::thread_restart_state) const (vs#0=<optimized out>, this=0x4fd10f8, this@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at ../libs/core/functional/include/hpx/functional/detail/basic_function.hpp:233
#31 hpx::threads::coroutines::detail::coroutine_impl::operator() (this=0x4fd1090) at ../libs/core/coroutines/src/detail/coroutine_impl.cpp:81
#32 0x00007f30b4c41d89 in hpx::threads::coroutines::detail::lx::trampoline<hpx::threads::coroutines::detail::coroutine_impl> (fun=<optimized out>)
    at ../libs/core/coroutines/include/hpx/coroutines/detail/context_linux_x86.hpp:179
#33 0x0000000000000000 in ?? ()
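For context on why a receive gets cancelled at all: the MPIDI_anysrc_try_cancel_partner frame suggests this is the MPI_ANY_SOURCE path, where ch4 posts the receive to both the shared-memory path and the OFI netmod and cancels the losing side once one of them matches. The code below is only a minimal sketch of that communication pattern, not the actual application and not a confirmed reproducer; the rank-to-node mapping, message size, and message count are assumptions.

/* Minimal sketch (assumption: 2 nodes x 2 ranks, block rank placement).
 * Each iteration posts an MPI_ANY_SOURCE receive that may be satisfied
 * either by the same-node peer (over shm) or by a remote peer (over OFI),
 * so the posted receive on the other path has to be cancelled internally.
 * The real application polls from multiple HPX threads; a single MPI_Test
 * loop keeps this sketch short. */
#include <mpi.h>
#include <stdio.h>

#define NMSG 5000
#define TAG  42

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* assumed even, split over 2 nodes */

    char sendbuf[64], recvbuf[64];
    snprintf(sendbuf, sizeof(sendbuf), "hello from rank %d", rank);

    for (int i = 0; i < NMSG; i++) {
        MPI_Request rreq, sreq;
        MPI_Irecv(recvbuf, (int) sizeof(recvbuf), MPI_CHAR, MPI_ANY_SOURCE, TAG,
                  MPI_COMM_WORLD, &rreq);

        /* Alternate between the same-node peer and a remote peer. */
        int peer = (i % 2 == 0) ? (rank ^ 1) : (rank + size / 2) % size;
        MPI_Isend(sendbuf, (int) sizeof(sendbuf), MPI_CHAR, peer, TAG,
                  MPI_COMM_WORLD, &sreq);

        int done = 0;
        while (!done)   /* polling MPI_Test drives progress, as in the backtrace */
            MPI_Test(&rreq, &done, MPI_STATUS_IGNORE);
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}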

The problem disappears if I do either of the following:

  • run the application with one process per node, or
  • recompile MPICH with the shared-memory channel disabled (--with-ch4-shmmods=none); a sample configure line is sketched below.
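For reference, the configure invocation for the second workaround would look roughly like the following; the libfabric path and install prefix here are placeholders rather than the exact values I used:

    ./configure --with-device=ch4:ofi \
                --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
                --with-ch4-shmmods=none \
                --prefix=$HOME/opt/mpich-no-shm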

This suggests to me that libfabric/cxi has trouble handling receive cancellation in multithreaded scenarios; a sketch of the cancellation semantics I would expect is included after this paragraph. Have you encountered similar issues before?
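For what it is worth, my reading of the fi_cancel documentation is that cancellation is asynchronous: the call itself should return promptly and the cancelled receive should later surface as an FI_ECANCELED error completion on the CQ. The hang above happens inside cxip_evtq_req_cancel itself, i.e. before any of that can occur. The fragment below is only an illustration of that documented pattern, not MPICH code; it assumes an endpoint, a CQ using the tagged format, and the receive's context have been set up elsewhere.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Illustration only: cancel a posted receive and wait for the provider to
 * confirm it.  `ep`, `cq`, and `context` are assumed to exist already. */
static int cancel_and_confirm(struct fid_ep *ep, struct fid_cq *cq, void *context)
{
    ssize_t ret = fi_cancel(&ep->fid, context);   /* should return promptly */
    if (ret && ret != -FI_ENOENT)
        return (int) ret;

    /* A cancelled operation is reported as an error completion with
     * err == FI_ECANCELED; drain the CQ until that entry appears. */
    for (;;) {
        struct fi_cq_tagged_entry wc;
        ssize_t n = fi_cq_read(cq, &wc, 1);
        if (n == 1)
            continue;                             /* unrelated completion */
        if (n == -FI_EAVAIL) {
            struct fi_cq_err_entry err = { 0 };
            fi_cq_readerr(cq, &err, 0);
            if (err.op_context == context && err.err == FI_ECANCELED)
                return 0;                         /* cancellation confirmed */
        } else if (n != -FI_EAGAIN) {
            return (int) n;                       /* unexpected CQ failure */
        }
    }
}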

yfguo (Contributor) commented Oct 23, 2024

Multi-threaded communication can probably trigger this more frequently.
