
Dragonfly crashes during migrations #4455

Closed
andydunstall opened this issue Jan 14, 2025 · 5 comments · Fixed by #4495 or #4508
Labels: bug (Something isn't working)

Comments

@andydunstall (Contributor) commented on Jan 14, 2025

Dragonfly crashes during migrations, with stack trace:

#0  __pthread_kill_implementation (threadid=281474836474368, signo=signo@entry=11, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x0000fffff7d97690 in __pthread_kill_internal (signo=11, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x0000fffff7d4cb3c in __GI_raise (sig=11) at ../sysdeps/posix/raise.c:26
#3  <signal handler called>
#4  0x0000aaaaaaf2e4fc in dfly::DashTable<dfly::CompactObj, dfly::CompactObj, dfly::detail::PrimeTablePolicy>::Iterator<false, true>::GetVersion<true> (this=<optimized out>)
    at /var/lib/dragonfly/dragonfly/dragonfly/src/core/dash.h:430
#5  dfly::RestoreStreamer::WriteBucket (this=this@entry=0x3bd5e0c1598, it=...) at /var/lib/dragonfly/dragonfly/dragonfly/src/server/journal/streamer.cc:300
#6  0x0000aaaaaaf2f178 in dfly::RestoreStreamer::OnDbChange (this=0x3bd5e0c1598, db_index=<optimized out>, req=...) at /var/lib/dragonfly/dragonfly/dragonfly/src/server/journal/streamer.cc:337
#7  0x0000aaaaaaee31cc in std::function<void (unsigned short, dfly::DbSlice::ChangeReq const&)>::operator()(unsigned short, dfly::DbSlice::ChangeReq const&) const (__args#1=..., __args#0=<optimized out>, 
    this=0x3bd5e0e5558) at /usr/include/c++/13/bits/std_function.h:591
#8  dfly::DbSlice::FlushChangeToEarlierCallbacks (this=this@entry=0x3bd5e0f0440, db_ind=db_ind@entry=0, it=..., upper_bound=3889649) at /var/lib/dragonfly/dragonfly/dragonfly/src/server/db_slice.cc:1196
#9  0x0000aaaaaaf2eaf8 in operator() (it=..., __closure=<optimized out>) at /usr/include/c++/13/bits/allocator.h:184
#10 dfly::DashTable<dfly::CompactObj, dfly::CompactObj, dfly::detail::PrimeTablePolicy>::TraverseBuckets<dfly::RestoreStreamer::Run()::<lambda(dfly::DashTable<dfly::CompactObj, dfly::CompactObj, dfly::detail::PrimeTablePolicy>::bucket_iterator)> > (cb=..., cursor=..., this=<optimized out>) at /var/lib/dragonfly/dragonfly/dragonfly/src/core/dash.h:1003
#11 dfly::RestoreStreamer::Run (this=0x3bd5e0c1e98) at /var/lib/dragonfly/dragonfly/dragonfly/src/server/journal/streamer.cc:211
#12 0x0000aaaaaae00224 in std::function<void (std::unique_ptr<dfly::cluster::OutgoingMigration::SliceSlotMigration, std::default_delete<dfly::cluster::OutgoingMigration::SliceSlotMigration> >&)>::operator()(std::unique_ptr<dfly::cluster::OutgoingMigration::SliceSlotMigration, std::default_delete<dfly::cluster::OutgoingMigration::SliceSlotMigration> >&) const (__args#0=..., this=<optimized out>)
    at /usr/include/c++/13/bits/std_function.h:591
#13 operator() (__closure=<synthetic pointer>, pb=<optimized out>) at /var/lib/dragonfly/dragonfly/dragonfly/src/server/cluster/outgoing_slot_migration.cc:137
#14 operator() (context=<optimized out>, __closure=<synthetic pointer>) at /var/lib/dragonfly/dragonfly/dragonfly/helio/util/proactor_pool.h:194
#15 std::__invoke_impl<void, util::ProactorPool::AwaitFiberOnAll<dfly::cluster::OutgoingMigration::OnAllShards(std::function<void(std::unique_ptr<SliceSlotMigration>&)>)::<lambda(util::fb2::ProactorBase*)> >(dfly::cluster::OutgoingMigration::OnAllShards(std::function<void(std::unique_ptr<SliceSlotMigration>&)>)::<lambda(util::fb2::ProactorBase*)>&&)::<lambda(util::ProactorPool::ProactorBase*)>, util::fb2::ProactorBase*> (
    __f=<synthetic pointer>) at /usr/include/c++/13/bits/invoke.h:61
#16 std::__invoke<util::ProactorPool::AwaitFiberOnAll<dfly::cluster::OutgoingMigration::OnAllShards(std::function<void(std::unique_ptr<SliceSlotMigration>&)>)::<lambda(util::fb2::ProactorBase*)> >(dfly::cluster::OutgoingMigration::OnAllShards(std::function<void(std::unique_ptr<SliceSlotMigration>&)>)::<lambda(util::fb2::ProactorBase*)>&&)::<lambda(util::ProactorPool::ProactorBase*)>, util::fb2::ProactorBase*> (
    __fn=<synthetic pointer>) at /usr/include/c++/13/bits/invoke.h:96
#17 std::__apply_impl<util::ProactorPool::AwaitFiberOnAll<dfly::cluster::OutgoingMigration::OnAllShards(std::function<void(std::unique_ptr<SliceSlotMigration>&)>)::<lambda(util::fb2::ProactorBase*)> >(dfly::cluster::OutgoingMigration::OnAllShards(std::function<void(std::unique_ptr<SliceSlotMigration>&)>)::<lambda(util::fb2::ProactorBase*)>&&)::<lambda(util::ProactorPool::ProactorBase*)>, std::tuple<util::fb2::ProactorBase*>, 0> (__t=<synthetic pointer>, __f=<synthetic pointer>) at /usr/include/c++/13/tuple:2302
#18 std::apply<util::ProactorPool::AwaitFiberOnAll<dfly::cluster::OutgoingMigration::OnAllShards(std::function<void(std::unique_ptr<SliceSlotMigration>&)>)::<lambda(util::fb2::ProactorBase*)> >(dfly::cluster::OutgoingMigration::OnAllShards(std::function<void(std::unique_ptr<SliceSlotMigration>&)>)::<lambda(util::fb2::ProactorBase*)>&&)::<lambda(util::ProactorPool::ProactorBase*)>, std::tuple<util::fb2::ProactorBase*> > (
    __t=<synthetic pointer>, __f=<synthetic pointer>) at /usr/include/c++/13/tuple:2313
#19 util::fb2::detail::WorkerFiberImpl<const util::ProactorPool::AwaitFiberOnAll<dfly::cluster::OutgoingMigration::OnAllShards(std::function<void(std::unique_ptr<SliceSlotMigration>&)>)::<lambda(util::fb2::ProactorBase*)> >(dfly::cluster::OutgoingMigration::OnAllShards(std::function<void(std::unique_ptr<SliceSlotMigration>&)>)::<lambda(util::fb2::ProactorBase*)>&&)::<lambda(util::ProactorPool::ProactorBase*)>, util::fb2::ProactorBase*&>::run_ (c=..., this=0xffffc4040e00) at /var/lib/dragonfly/dragonfly/dragonfly/helio/util/fibers/detail/fiber_interface.h:304

Running Dragonfly v1.26.1

The setup is 100 x 100GB shards, each populated to 75% memory. I'll add more information below...

@andydunstall added the bug label on Jan 14, 2025
@BorysTheDev (Contributor) commented:
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaabe8d0b8 16 google::LogMessage::Fail()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaabe8cfc0 144 google::LogMessage::SendToLog()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaabe8c7c0 80 google::LogMessage::Flush()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaabe905a0 32 google::LogMessageFatal::~LogMessageFatal()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaaac69344 176 __assert_fail
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaaac93b34 16 dfly::detail::Segment<>::Key()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaaade71f0 80 dfly::DashTable<>::Iterator<>::operator->()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab50ff6c 176 dfly::RestoreStreamer::WriteBucket()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab50f11c 176 dfly::RestoreStreamer::Run()::{lambda()#1}::operator()()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab511278 112 dfly::DashTable<>::TraverseBuckets<>()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab50f384 240 dfly::RestoreStreamer::Run()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e8650 48 dfly::cluster::OutgoingMigration::SliceSlotMigration::R>
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1de248 32 dfly::cluster::OutgoingMigration::SyncFb()::{lambda()#4>
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e44c8 32 std::__invoke_impl<>()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e3494 64 std::__invoke_r<>()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e2390 48 std::_Function_handler<>::_M_invoke()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e8cc8 48 std::function<>::operator()()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1dd8ac 48 dfly::cluster::OutgoingMigration::OnAllShards()::{lambd>
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e1208 64 util::ProactorPool::AwaitFiberOnAll<>()::{lambda()#1}::>
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e6540 32 std::__invoke_impl<>()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e61d0 64 std::__invoke<>()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e5cac 48 std::_apply_impl<>()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e5cf4 64 std::apply<>()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e5d84 112 util::fb2::detail::WorkerFiberImpl<>::run()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e5724 64 util::fb2::detail::WorkerFiberImpl<>::WorkerFiberImpl<>>
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ 0xaaaaab1e7810 80 std::__invoke_impl<>()
Jan 20 14:58:53 ip-10-5-15-59 dragonfly[1896]: @ ... and at least 4 more frames
Jan 20 15:01:06 ip-10-5-15-59 systemd[1]: dragonfly.service: Main process exited, code=dumped, status=6/ABRT
Jan 20 15:01:06 ip-10-5-15-59 systemd[1]: dragonfly.service: Failed with result 'core-dump'.

@adiholden (Collaborator) commented:
Assumed flow that can lead to this crash:

1. A cluster node receives a new cluster config after a migration has completed. This registers a flushslots change callback, which deletes entries from the table.
2. Another migration starts from this cluster node, running RestoreStreamer::Run.
3. Inside the callback passed to TraverseBuckets, FlushChangeToEarlierCallbacks is called first and deletes entries in this bucket; WriteBucket is then called and fails on the assert, because the iterator points to a dash-table entry that is no longer occupied (GetBusy on the slot is false).

How to fix: change the for loop in RestoreStreamer::WriteBucket so it accesses only valid entries (see the sketch below).
To test: create a test with two outgoing migrations from a cluster node, starting the second migration after the first has finished.
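
For illustration only, here is a minimal self-contained sketch of that guard, assuming a bucket is a fixed array of slots that an earlier change callback may have vacated. The Slot and Bucket types and this WriteBucket signature are stand-ins invented for the example, not Dragonfly's actual dash-table internals:

```cpp
#include <array>
#include <iostream>
#include <string>

// Stand-in types for the sketch; Dragonfly's real bucket/slot layout differs.
struct Slot {
  bool busy = false;  // stands in for the slot's GetBusy() flag
  std::string key, value;
};
using Bucket = std::array<Slot, 14>;  // arbitrary bucket width for the demo

// Serialize only slots that are still occupied. Dereferencing a vacated slot
// is what trips the GetBusy() assert in the crash above.
void WriteBucket(const Bucket& bucket) {
  for (const Slot& slot : bucket) {
    if (!slot.busy)
      continue;  // entry was deleted by FlushChangeToEarlierCallbacks; skip it
    std::cout << slot.key << " -> " << slot.value << '\n';
  }
}

int main() {
  Bucket b{};
  b[0] = {true, "k0", "v0"};
  b[3] = {false, "k3", "v3"};  // vacated mid-traversal
  b[7] = {true, "k7", "v7"};
  WriteBucket(b);  // prints only k0 and k7
}
```

The essential change is the occupancy check before dereferencing each slot, mirroring a skip of any slot whose GetBusy() returns false.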

@adiholden (Collaborator) commented:
I was able to reproduce this crash with my test

@adiholden (Collaborator) commented:
Reopening this issue as we still see crashes, now from another flow:
Jan 22 16:07:54 ip-10-5-32-145 dragonfly[1782]: *** SIGSEGV received at time=1737562074 on cpu 11 ***
Jan 22 16:07:54 ip-10-5-32-145 dragonfly[1782]: PC: @ 0xaaaaaaf351fc (unknown) dfly::RestoreStreamer::WriteBucket()
Jan 22 16:07:54 ip-10-5-32-145 dragonfly[1782]: @ 0xaaaaab6f2158 224 absl::lts_20240722::AbslFailureSignalHandler()
Jan 22 16:07:54 ip-10-5-32-145 dragonfly[1782]: @ 0xfffff7ffb8f8 4912 (unknown)
Jan 22 16:07:54 ip-10-5-32-145 dragonfly[1782]: @ 0xaaaaaaf35f4c 144 dfly::RestoreStreamer::OnDbChange()
Jan 22 16:07:54 ip-10-5-32-145 dragonfly[1782]: @ 0xaaaaaaeecd2c 176 dfly::DbSlice::FlushChangeToEarlierCallbacks()
Jan 22 16:07:54 ip-10-5-32-145 dragonfly[1782]: @ 0xaaaaaaf35aec 528 dfly::RestoreStreamer::Run()
Jan 22 16:07:54 ip-10-5-32-145 dragonfly[1782]: @ 0xaaaaaae07d34 128 boost::context::detail::fiber_entry<>()

@adiholden (Collaborator) commented:
I hope this will be resolved by #4508
