I've run into a deadlock that I haven't been able to reduce to a minimal example. It appears to be a very rare race condition, and the only way I've found to reproduce it is to repeatedly run a large set of convoluted unit tests (written for an application I'm working on) until one of the runs happens to trigger it. I often have to leave the tests running on repeat for 1-2 hours (potentially hundreds of reruns) before the deadlock appears. I still don't know exactly what conditions need to align to cause it, but luckily I do know what the stack trace is when it happens (ordered from the bottom of the stack to the top):
multicast_observer::add
subscriber::add
composite_subscription::add
composite_subscription_inner::add
composite_subscription_state::add
subscription::unsubscribe
subscription_state::unsubscribe
static_subscription::unsubscribe
multicast_observer::add::<lambda>
The deadlock happens because this mutex gets locked twice by the same thread (as shown in the stack trace above): once at [i] and again at [ii].
In most cases this won't happen, because the whole branch is guarded by the condition that the observer is subscribed, so that condition normally prevents frame [5] in the stack trace from being reached.
The race condition appears to be that, somewhere between frame [1] and frame [5], another thread changes the observer's state from subscribed to unsubscribed. As I mentioned at the start, I haven't found a minimal reproduction, but assuming another thread can flip the observer to unsubscribed at that point, the stack trace shows that what I've described is a genuine deadlock hazard.
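To make the hazard concrete, here is a heavily simplified, hypothetical sketch of the pattern as I understand it from the stack trace. The names are made up and this is not RxCpp's actual code; only the locking structure is meant to mirror the frames above.

```cpp
// Hypothetical, heavily simplified sketch -- not RxCpp's actual code.
// add() holds a non-recursive mutex while registering a teardown callback.
// If another thread has flipped the state to "unsubscribed" in the meantime,
// the registration path runs the teardown immediately, and the teardown tries
// to take the same mutex again on the same thread -> deadlock.
#include <atomic>
#include <functional>
#include <mutex>
#include <thread>

struct observer_sketch {
    std::mutex lock;                        // stands in for the mutex locked at [i] and [ii]
    std::atomic<bool> subscribed{true};
    std::function<void()> teardown;

    // Stands in for the composite_subscription add path: if the subscription
    // is already gone, the newly added teardown is invoked right away.
    void register_teardown(std::function<void()> t) {
        if (!subscribed) {
            t();                            // immediate unsubscribe of the newly added part
        } else {
            teardown = std::move(t);
        }
    }

    // Stands in for multicast_observer::add.
    void add() {
        std::unique_lock<std::mutex> guard(lock);      // first lock, as at [i]
        // <-- the racing thread flips `subscribed` to false somewhere in here
        register_teardown([this] {
            std::unique_lock<std::mutex> guard(lock);  // second lock on the same thread, as at [ii]
            // remove this observer, etc.
        });
    }
};

int main() {
    observer_sketch o;
    // The "racing" unsubscribe; joined before add() only to make the hang deterministic.
    std::thread other([&] { o.subscribed = false; });
    other.join();
    o.add();    // with subscribed == false this self-deadlocks (never returns)
}
```

In the real code the flip can apparently happen anywhere between frame [1] and frame [5], which is presumably why it only shows up once in hundreds of test runs for me.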
This race condition was happening for me on release v4.1.0, which I understand is a few years behind master, but the problematic code path still seems to exist: the lines I linked above are from the latest master.
A very easy way to fix this problem is to change this std::mutex to a std::recursive_mutex (and of course change the template parameter on the locking mechanisms that use it). I'm happy to provide a PR to fix this, but I don't know how to write a regression test that proves the fix.
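Sketched against the simplified example above (again, hypothetical names, not the actual RxCpp source), the change would look roughly like this:

```cpp
// Same sketch as above with the proposed change applied: the mutex becomes
// recursive, so the second lock taken by the teardown on the same thread
// simply succeeds instead of deadlocking.
#include <atomic>
#include <functional>
#include <mutex>

struct observer_sketch_fixed {
    std::recursive_mutex lock;              // was std::mutex
    std::atomic<bool> subscribed{true};
    std::function<void()> teardown;

    void register_teardown(std::function<void()> t) {
        if (!subscribed) t(); else teardown = std::move(t);
    }

    void add() {
        std::unique_lock<std::recursive_mutex> guard(lock);       // template parameter updated
        register_teardown([this] {
            std::unique_lock<std::recursive_mutex> guard(lock);   // re-entrant lock is now well-defined
        });
    }
};
```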