MPICH hangs/crashes on Delta (with SS-11 network) when shm channels are enabled. #7184

JiakunYan opened this issue Oct 21, 2024 · 1 comment


@JiakunYan

This is more of a note, since I have already found a workaround for this issue. I am just posting it here in case you have a better understanding of the problem or a better solution.

I was running an HPX application on NCSA Delta (with the Slingshot-11 interconnect); the application is the same as in #7171. I was using the ofi netmod with the cluster-installed libfabric, running on two nodes with two processes per node.

In roughly half of the runs, the application crashed after MPICH delivered messages with the wrong data; in the other half, it hung. The problem typically occurred after only a few hundred to a few thousand messages had been sent.

I investigated the hanging case further with gdb and found a thread stuck in cxip_evtq_req_cancel that never returned while MPICH was trying to cancel an OFI receive. Below is the backtrace:

#0  0x00007f30af98ddfd in cxip_evtq_req_cancel () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
#1  0x00007f30af95e039 in cxip_rxc_cancel () from /opt/cray/libfabric/1.15.2.0/lib64/libfabric.so.1
#2  0x00007f30b129842f in fi_cancel (fid=0x220b1d0, context=0x64177a0) at /opt/cray/libfabric/1.15.2.0/include/rdma/fi_endpoint.h:210
#3  0x00007f30b12a51b1 in MPIDI_NM_mpi_cancel_recv (rreq=0x6417600, is_blocking=true) at ./src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:463
#4  0x00007f30b129c851 in MPIDI_anysrc_try_cancel_partner (rreq=0x6429500, is_cancelled=0x7f307c49b674) at ./src/mpid/ch4/src/mpidig_request.h:108
#5  0x00007f30b12a7862 in match_posted_rreq (rank=0, tag=524287, context_id=496, vci=0, req=0x7f307c49b6d0) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:183
#6  0x00007f30b12a80a6 in MPIDIG_send_target_msg_cb (am_hdr=0x7f30a6bca770, data=0x7f30a6bca7a0, in_data_sz=34, attr=1, req=0x0) at src/mpid/ch4/src/mpidig_pt2pt_callbacks.c:338
#7  0x00007f30b10df000 in MPIDI_POSIX_progress_recv (vci=0, made_progress=0x7f307c49b890) at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:60
#8  0x00007f30b10df383 in MPIDI_POSIX_progress (vci=0, made_progress=0x7f307c49b890) at ./src/mpid/ch4/shm/src/../posix/posix_progress.h:147
#9  0x00007f30b10df614 in MPIDI_SHM_progress (vci=0, made_progress=0x7f307c49b890) at ./src/mpid/ch4/shm/src/shm_progress.h:18
#10 0x00007f30b10e0f17 in MPIDI_progress_test (state=0x7f307c49b9b0) at ./src/mpid/ch4/src/ch4_progress.h:142
#11 0x00007f30b10e1d9b in MPID_Progress_test (state=0x7f307c49b9b0) at ./src/mpid/ch4/src/ch4_progress.h:242
#12 0x00007f30b10e2e98 in MPIR_Test_state (request_ptr=0x640baf0, flag=0x7f307c49baf8, status=0x7f307c49bb00, state=0x7f307c49b9b0) at src/mpi/request/request_impl.c:215
#13 0x00007f30b10e24b3 in MPID_Test (request_ptr=0x640baf0, flag=0x7f307c49baf8, status=0x7f307c49bb00) at ./src/mpid/ch4/src/ch4_wait.h:84
#14 0x00007f30b10e2f8b in MPIR_Test (request_ptr=0x640baf0, flag=0x7f307c49baf8, status=0x7f307c49bb00) at src/mpi/request/request_impl.c:236
#15 0x00007f30b06b31e7 in internal_Test (request=0x7f307c49bb20, flag=0x7f307c49baf8, status=0x7f307c49bb00) at src/binding/c/request/test.c:86
#16 0x00007f30b06b32cb in PMPI_Test (request=0x7f307c49bb20, flag=0x7f307c49baf8, status=0x7f307c49bb00) at src/binding/c/request/test.c:141
#17 0x00007f30af6fe38f in lcw::mpi::comp::manager_req_t::do_progress (this=0x24b3ac0) at /u/jiakuny/workspace/lcw/src/backend/mpi/comp_manager/manager_req.cpp:33
#18 lcw::mpi::comp::manager_req_t::do_progress (this=0x24b3ac0) at /u/jiakuny/workspace/lcw/src/backend/mpi/comp_manager/manager_req.cpp:17
#19 0x00007f30af6fc27a in lcw::backend_mpi_t::do_progress (this=<optimized out>, device=0x24d8280)
    at /sw/spack/deltas11-2023-03/apps/linux-rhel8-x86_64/gcc-8.5.0/gcc-11.4.0-yycklku/lib/gcc/x86_64-pc-linux-gnu/11.4.0/../../../../include/c++/11.4.0/bits/shared_ptr_base.h:1295
#20 0x00007f30b5498898 in hpx::parcelset::policies::lcw::parcelport::background_work (mode=<optimized out>, num_thread=47, this=0x22a2dc0)
    at /u/jiakuny/workspace/hpx-lcw/libs/full/parcelport_lcw/src/parcelport_lcw.cpp:214
#21 hpx::parcelset::policies::lcw::parcelport::background_work (this=0x22a2dc0, num_thread=47, mode=<optimized out>)
    at /u/jiakuny/workspace/hpx-lcw/libs/full/parcelport_lcw/src/parcelport_lcw.cpp:201
#22 0x00007f30b54b2909 in hpx::parcelset::parcelhandler::do_background_work (this=0x21b33a8, num_thread=<optimized out>, num_thread@entry=47, stop_buffering=stop_buffering@entry=false, 
    mode=mode@entry=hpx::parcelset::parcelport_background_mode::all)
    at /sw/spack/deltas11-2023-03/apps/linux-rhel8-x86_64/gcc-8.5.0/gcc-11.4.0-yycklku/lib/gcc/x86_64-pc-linux-gnu/11.4.0/../../../../include/c++/11.4.0/bits/shared_ptr_base.h:1295
#23 0x00007f30b556a881 in hpx::detail::network_background_callback (rt=0x21b2c00, num_thread=47)
    at /u/jiakuny/workspace/hpx-lcw/libs/full/runtime_distributed/src/runtime_distributed.cpp:167
#24 0x00007f30b4d0a33f in hpx::util::detail::basic_function<bool (unsigned long), true, false>::operator()(unsigned long) const (vs#0=<optimized out>, this=<optimized out>)
    at ../libs/core/functional/include/hpx/functional/detail/basic_function.hpp:233
#25 hpx::util::detail::deferred<hpx::function<bool (unsigned long), false>, hpx::util::pack_c<unsigned long, 0ul>, unsigned long>::operator()() (this=<optimized out>)
    at ../libs/core/functional/include/hpx/functional/deferred_call.hpp:89
#26 hpx::util::detail::callable_vtable<bool ()>::_invoke<hpx::util::detail::deferred<hpx::function<bool (unsigned long), false>, hpx::util::pack_c<unsigned long, 0ul>, unsigned long> >(void*) (f=<optimized out>) at ../libs/core/functional/include/hpx/functional/detail/vtable/callable_vtable.hpp:88
#27 0x00007f30b4d0610f in hpx::util::detail::basic_function<bool (), false, false>::operator()() const (this=<optimized out>)
    at ../libs/core/functional/include/hpx/functional/detail/basic_function.hpp:233
#28 operator() (__closure=0x5172000) at /u/jiakuny/workspace/hpx-lcw/libs/core/thread_pools/src/detail/background_thread.cpp:41
#29 hpx::util::detail::callable_vtable<std::pair<hpx::threads::thread_schedule_state, hpx::threads::thread_id>(hpx::threads::thread_restart_state)>::_invoke<hpx::threads::detail::create_background_thread(hpx::threads::policies::scheduler_base&, std::size_t, const hpx::threads::detail::scheduling_callbacks&, std::shared_ptr<bool>&, int64_t&)::<lambda(hpx::threads::thread_restart_state)> >(void *, hpx::threads::thread_restart_state &&) (f=0x5172000, vs#0=<optimized out>) at ../libs/core/functional/include/hpx/functional/detail/vtable/callable_vtable.hpp:88
#30 0x00007f30b4c4291f in hpx::util::detail::basic_function<std::pair<hpx::threads::thread_schedule_state, hpx::threads::thread_id> (hpx::threads::thread_restart_state), false, false>::operator()(hpx::threads::thread_restart_state) const (vs#0=<optimized out>, this=0x4fd10f8, this@entry=<error reading variable: dwarf2_find_location_expression: Corrupted DWARF expression.>)
    at ../libs/core/functional/include/hpx/functional/detail/basic_function.hpp:233
#31 hpx::threads::coroutines::detail::coroutine_impl::operator() (this=0x4fd1090) at ../libs/core/coroutines/src/detail/coroutine_impl.cpp:81
#32 0x00007f30b4c41d89 in hpx::threads::coroutines::detail::lx::trampoline<hpx::threads::coroutines::detail::coroutine_impl> (fun=<optimized out>)
    at ../libs/core/coroutines/include/hpx/coroutines/detail/context_linux_x86.hpp:179
#33 0x0000000000000000 in ?? ()
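For context on why a receive gets cancelled at all: the MPIDI_anysrc_try_cancel_partner frame suggests this is the MPI_ANY_SOURCE path, where ch4 posts the receive to both the shared-memory path and the OFI netmod and cancels the losing side once one of them matches. The code below is only a minimal sketch of that communication pattern, not the actual application and not a confirmed reproducer; the rank-to-node mapping, message size, and message count are assumptions.

/* Minimal sketch (assumption: 2 nodes x 2 ranks, block rank placement).
 * Each iteration posts an MPI_ANY_SOURCE receive that may be satisfied
 * either by the same-node peer (over shm) or by a remote peer (over OFI),
 * so the posted receive on the other path has to be cancelled internally.
 * The real application polls from multiple HPX threads; a single MPI_Test
 * loop keeps this sketch short. */
#include <mpi.h>
#include <stdio.h>

#define NMSG 5000
#define TAG  42

int main(int argc, char **argv)
{
    int provided, rank, size;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* assumed even, split over 2 nodes */

    char sendbuf[64], recvbuf[64];
    snprintf(sendbuf, sizeof(sendbuf), "hello from rank %d", rank);

    for (int i = 0; i < NMSG; i++) {
        MPI_Request rreq, sreq;
        MPI_Irecv(recvbuf, (int) sizeof(recvbuf), MPI_CHAR, MPI_ANY_SOURCE, TAG,
                  MPI_COMM_WORLD, &rreq);

        /* Alternate between the same-node peer and a remote peer. */
        int peer = (i % 2 == 0) ? (rank ^ 1) : (rank + size / 2) % size;
        MPI_Isend(sendbuf, (int) sizeof(sendbuf), MPI_CHAR, peer, TAG,
                  MPI_COMM_WORLD, &sreq);

        int done = 0;
        while (!done)   /* polling MPI_Test drives progress, as in the backtrace */
            MPI_Test(&rreq, &done, MPI_STATUS_IGNORE);
        MPI_Wait(&sreq, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}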

The problem disappears if I do either of the following:

  • run the application with one process per node, or
  • recompile MPICH with the shared-memory channel disabled (--with-ch4-shmmods=none); a sample configure line is sketched below.
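For reference, the configure invocation for the second workaround would look roughly like the following; the libfabric path and install prefix here are placeholders rather than the exact values I used:

    ./configure --with-device=ch4:ofi \
                --with-libfabric=/opt/cray/libfabric/1.15.2.0 \
                --with-ch4-shmmods=none \
                --prefix=$HOME/opt/mpich-no-shm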

This suggests to me that libfabric/cxi has trouble handling receive cancellation in multithreaded scenarios; a sketch of the cancellation semantics I would expect is included after this paragraph. Have you encountered similar issues before?
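For what it is worth, my reading of the fi_cancel documentation is that cancellation is asynchronous: the call itself should return promptly and the cancelled receive should later surface as an FI_ECANCELED error completion on the CQ. The hang above happens inside cxip_evtq_req_cancel itself, i.e. before any of that can occur. The fragment below is only an illustration of that documented pattern, not MPICH code; it assumes an endpoint, a CQ using the tagged format, and the receive's context have been set up elsewhere.

#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_errno.h>

/* Illustration only: cancel a posted receive and wait for the provider to
 * confirm it.  `ep`, `cq`, and `context` are assumed to exist already. */
static int cancel_and_confirm(struct fid_ep *ep, struct fid_cq *cq, void *context)
{
    ssize_t ret = fi_cancel(&ep->fid, context);   /* should return promptly */
    if (ret && ret != -FI_ENOENT)
        return (int) ret;

    /* A cancelled operation is reported as an error completion with
     * err == FI_ECANCELED; drain the CQ until that entry appears. */
    for (;;) {
        struct fi_cq_tagged_entry wc;
        ssize_t n = fi_cq_read(cq, &wc, 1);
        if (n == 1)
            continue;                             /* unrelated completion */
        if (n == -FI_EAVAIL) {
            struct fi_cq_err_entry err = { 0 };
            fi_cq_readerr(cq, &err, 0);
            if (err.op_context == context && err.err == FI_ECANCELED)
                return 0;                         /* cancellation confirmed */
        } else if (n != -FI_EAGAIN) {
            return (int) n;                       /* unexpected CQ failure */
        }
    }
}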

yfguo (Contributor) commented Oct 23, 2024

Multi-threaded communication can probably trigger this more frequently.
