[mavlink] Use separate mutex for event buffer to prevent deadlock #22225
This prevents a deadlock when calling `mavlink status` from the MAVLink shell.
Solved Problem
The coredump presents the following deadlock situation: both the mavlink transmit (`mavlink_if0`) and receive (`mavlink_rcv_if0`) tasks are waiting on the `mavlink_module_mutex`, which is held by the `mavlink` task executing the `mavlink status` command launched from the `mavlink_shell` task.

Why is `mavlink_if0` not making progress?
The backtrace points to `modules__mavlink_unity.cpp:4841`, which relates to code in the `Mavlink::task_main()` function where the `mavlink_module_mutex` needs to be taken first, so the thread is waiting until the mutex becomes available.
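A simplified sketch of that blocking pattern in the main loop (reconstructed for illustration, not the verbatim PX4 source; the event-buffer helper name is hypothetical):

```cpp
// Sketch of the relevant part of Mavlink::task_main() (not verbatim PX4 code).
while (!should_exit()) {
	// ... update streams, gather data to send ...

	{
		// Before this fix, the shared event buffer was guarded by the global
		// module mutex, so the transmit loop blocks here for as long as the
		// thread running `mavlink status` holds mavlink_module_mutex.
		LockGuard lg{mavlink_module_mutex};
		// handle_event_buffer();  // hypothetical helper for the event handling
	}

	// ... send queued MAVLink messages (including the shell output pipe) ...
}
```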
Why is `mavlink_rcv_if0` not making progress?
The backtrace points to `modules__mavlink_unity.cpp:4841`, which relates to code in the `Mavlink::forward_message()` function where the `mavlink_module_mutex` needs to be taken first, so this thread is also waiting until the mutex becomes available.
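A similarly simplified sketch of the forwarding path (reconstructed, not the verbatim source):

```cpp
// Sketch of Mavlink::forward_message() (not verbatim PX4 code): the receive
// thread forwards a message to every other instance, and the instance list is
// protected by the same global mavlink_module_mutex.
void Mavlink::forward_message(const mavlink_message_t *msg, Mavlink *self)
{
	LockGuard lg{mavlink_module_mutex};  // blocks while `mavlink status` holds it

	for (Mavlink *inst : mavlink_module_instances) {
		if (inst != nullptr && inst != self) {
			// ... write the message to this instance's output channel ...
		}
	}
}
```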
Why is `mavlink_shell` not making progress?
The `mavlink_shell` task spawned a new thread to execute the `mavlink status` command that came in via the MAVLink shell. It is waiting inside `nx_waitpid()` for the exit semaphore of the mavlink thread to be posted.
Why is `mavlink` not making progress?
The `mavlink status` command calls the `Mavlink::get_status_all_instances()` function, which looks like this:
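(Reconstructed sketch of the function's structure, not the verbatim source; the instance iteration and the `display_status()` helper are assumptions.)

```cpp
// Sketch of Mavlink::get_status_all_instances() (not verbatim PX4 code).
int Mavlink::get_status_all_instances(bool show_streams_status)
{
	LockGuard lg{mavlink_module_mutex};  // taken first and held for the whole printout
	unsigned iterations = 0;

	for (Mavlink *inst : mavlink_module_instances) {
		if (inst != nullptr) {
			printf("\ninstance #%u:\n", iterations);
			inst->display_status();      // many printf() calls while the mutex is held
			iterations++;
		}
	}

	return (iterations == 0) ? 1 : 0;    // error if no instance was found
}
```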
We can see that the `mavlink_module_mutex` is taken first, and then a number of printf functions are called. This task is waiting inside a printf statement for the semaphore `0x2005e660` to be posted, because the printf output is redirected away from stdout to mavlink via pipe0.

We can inspect the pipe devices via the `i_private*` pointer: pipe1 contains the `mavlink status` command, with its buffer already read and no task waiting on it. Pipe0, however, is waiting on the write semaphore because its write buffer is full.
You can see that the buffer content `"st: :190, Total: 94 (lost: 47)\n\t sysid: 1, compid: 51, Total: 16 (lo"` correlates exactly with the stuck printf statement in the mavlink thread, in the function `MavlinkReceiver::print_detailed_rx_stats()`.
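The shape of that printf, reconstructed from the buffer content above (a sketch, not the verbatim source; the statistics container and field names are assumptions):

```cpp
// Sketch of the per-(sysid, compid) statistics loop in
// MavlinkReceiver::print_detailed_rx_stats() (reconstructed shape, not verbatim).
// Each printf() blocks on pipe0's write semaphore once the pipe buffer is full.
for (const auto &s : rx_message_stats) {  // hypothetical stats container
	printf("\t sysid: %u, compid: %u, Total: %u (lost: %u)\n",
	       s.system_id, s.component_id, s.total_received, s.total_lost);
}
```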
Circular dependencies
So we have the following dependency graph:

- `mavlink_if0` waits on the `mavlink_module_mutex` in its main loop, and therefore does not send any mavlink messages anymore.
- `mavlink_rcv_if0` also waits on the `mavlink_module_mutex`, and therefore does not process incoming messages anymore.
- `mavlink_shell` waits for the `mavlink status` command to finish (that's fine).
- `mavlink` waits for the output buffer to be read so that it can continue to printf its `mavlink status` output, but the buffer is never read because this task already took the `mavlink_module_mutex`, and therefore `mavlink_if0` never reads the output buffer.

=> Deadlock
Solution
Remove any blocking attempts to lock the `mavlink_module_mutex` in the transmit thread. It's not great that the receive thread also gets blocked while `mavlink status` is executing, but at least it won't deadlock anything.
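A sketch of the before/after locking in the transmit thread's event-buffer handling (the `mavlink_event_buffer_mutex` name comes from this PR; the surrounding code is illustrative):

```cpp
// Before (sketch): the transmit loop serialized event-buffer access with the
// same mutex that `mavlink status` holds while printing, which can deadlock.
{
	LockGuard lg{mavlink_module_mutex};
	// ... publish buffered events ...
}

// After (sketch): a dedicated mutex protects only the event buffer, so the
// transmit loop no longer blocks on mavlink_module_mutex.
{
	LockGuard lg{mavlink_event_buffer_mutex};
	// ... publish buffered events ...
}
```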
Changelog Entry
For release notes:
Alternatives
We could also increase the pipe buffer size. However, the status output is quite large and scales with the number of mavlink instances, so the buffer would have to be very large. Allocating such a large chunk of memory could then become problematic, and it's not guaranteed that it would really be large enough to prevent the deadlock.
Test coverage
The event buffer update in the main loop now takes `LockGuard lg{mavlink_event_buffer_mutex};` instead of `LockGuard lg{mavlink_module_mutex};`.