
[mavlink] Use separate mutex for event buffer to prevent deadlock #22225

Conversation


@niklaut niklaut commented Oct 16, 2023

This prevents a deadlock when calling mavlink status from the mavlink shell.

Solved Problem

The coredump shows the following deadlock: both the mavlink transmit (mavlink_if0) and receive (mavlink_rcv_if0) tasks are waiting on the mavlink_module_mutex, which is held by the mavlink task executing the mavlink status command launched from the mavlink_shell task.

(gdb) px4_tasks
      ╷                        ╷      ╷      ╷       ╷
      │                        │      │      │       │
  pid │ Task Name              │ Prio │ Base │ State │ Waiting For
══════╪════════════════════════╪══════╪══════╪═══════╪═══════════════════════════════════
 1267 │ mavlink_if0            │  100 │  100 │ w:sem │ 0x20020ca0 -2 waiting: 1x mavlink
 1268 │ mavlink_rcv_if0        │  175 │  175 │ w:sem │ 0x20020ca0 -2 waiting: 1x mavlink
 1886 │ mavlink_shell          │  100 │  100 │ w:sem │ 0x2005fe10 -1 waiting
 2664 │ mavlink                │  175 │  100 │ w:sem │ 0x2005e660 -1 waiting
      ╵                        ╵      ╵      ╵       ╵

Why is mavlink_if0 not making progress?

(gdb) px4_switch_task 1267
Switched to task 'mavlink_if0' (1267).
(gdb) bt
#0  arm_switchcontext () at platforms/nuttx/NuttX/nuttx/arch/arm/src/armv7-m/gnu/arm_switchcontext.S:79
#1  0x0800b4d8 in nxsem_wait (sem=sem@entry=0x20020ca0 <mavlink_module_mutex+4>) at platforms/nuttx/NuttX/nuttx/sched/semaphore/sem_wait.c:153
#2  0x0800c76a in pthread_sem_take (sem=sem@entry=0x20020ca0 <mavlink_module_mutex+4>, abs_timeout=<optimized out>, intr=<optimized out>) at platforms/nuttx/NuttX/nuttx/sched/pthread/pthread_initialize.c:72
#3  0x0815f7b0 in pthread_mutex_take (mutex=mutex@entry=0x20020c9c <mavlink_module_mutex>, abs_timeout=abs_timeout@entry=0x0, intr=intr@entry=true) at platforms/nuttx/NuttX/nuttx/sched/pthread/pthread_mutex.c:169
#4  0x0815f526 in pthread_mutex_timedlock (mutex=mutex@entry=0x20020c9c <mavlink_module_mutex>, abs_timeout=abs_timeout@entry=0x0) at platforms/nuttx/NuttX/nuttx/sched/pthread/pthread_mutextimedlock.c:190
#5  0x0815d8f4 in pthread_mutex_lock (mutex=mutex@entry=0x20020c9c <mavlink_module_mutex>) at platforms/nuttx/NuttX/nuttx/libs/libc/pthread/pthread_mutex_lock.c:89
#6  0x080e0a0a in LockGuard::LockGuard (mutex=..., this=0x2003673c) at src/include/containers/LockGuard.hpp:44
#7  Mavlink::task_main (this=this@entry=0x20036900, argc=12, argc@entry=14, argv=0x20035db0, argv@entry=0x20035da8) at build/px4_fmu-v5x_default/src/modules/mavlink/modules__mavlink_unity.cpp:4841
#8  0x080e0d40 in Mavlink::start_helper (argc=14, argv=0x20035da8) at build/px4_fmu-v5x_default/src/modules/mavlink/modules__mavlink_unity.cpp:5136
#9  0x08014e90 in nxtask_startup (entrypt=entrypt@entry=0x80e0d21 <Mavlink::start_helper(int, char**)>, argc=<optimized out>, argv=<optimized out>) at platforms/nuttx/NuttX/nuttx/libs/libc/sched/task_startup.c:151
#10 0x0800c2de in nxtask_start () at platforms/nuttx/NuttX/nuttx/sched/task/task_start.c:130

modules__mavlink_unity.cpp:4841 corresponds to the following code in the Mavlink::task_main() function, which takes the mavlink_module_mutex before handling new events, so the thread blocks until the mutex becomes available:

/* handle new events */
if (check_events()) {
    if (_event_sub.updated()) {
        LockGuard lg{mavlink_module_mutex};

        event_s orb_event;

        while (_event_sub.update(&orb_event)) {
            if (events::externalLogLevel(orb_event.log_levels) == events::LogLevel::Disabled) {
                ++event_sequence_offset; // skip this event

            } else {
                events::Event e;
                e.id = orb_event.id;
                e.timestamp_ms = orb_event.timestamp / 1000;
                e.sequence = orb_event.event_sequence - event_sequence_offset;
                e.log_levels = orb_event.log_levels;
                static_assert(sizeof(e.arguments) == sizeof(orb_event.arguments),
                          "uorb message event: arguments size mismatch");
                memcpy(e.arguments, orb_event.arguments, sizeof(orb_event.arguments));
                _event_buffer->insert_event(e);
            }
        }
    }
}

Why is mavlink_rcv_if0 not making progress?

(gdb) px4_switch_task 1268
Switched to task 'mavlink_rcv_if0' (1268).
(gdb) bt
#0  arm_switchcontext () at platforms/nuttx/NuttX/nuttx/arch/arm/src/armv7-m/gnu/arm_switchcontext.S:79
#1  0x0800b4d8 in nxsem_wait (sem=sem@entry=0x20020ca0 <mavlink_module_mutex+4>) at platforms/nuttx/NuttX/nuttx/sched/semaphore/sem_wait.c:153
#2  0x0800c76a in pthread_sem_take (sem=sem@entry=0x20020ca0 <mavlink_module_mutex+4>, abs_timeout=<optimized out>, intr=<optimized out>) at platforms/nuttx/NuttX/nuttx/sched/pthread/pthread_initialize.c:72
#3  0x0815f7b0 in pthread_mutex_take (mutex=mutex@entry=0x20020c9c <mavlink_module_mutex>, abs_timeout=abs_timeout@entry=0x0, intr=intr@entry=true) at platforms/nuttx/NuttX/nuttx/sched/pthread/pthread_mutex.c:169
#4  0x0815f526 in pthread_mutex_timedlock (mutex=mutex@entry=0x20020c9c <mavlink_module_mutex>, abs_timeout=abs_timeout@entry=0x0) at platforms/nuttx/NuttX/nuttx/sched/pthread/pthread_mutextimedlock.c:190
#5  0x0815d8f4 in pthread_mutex_lock (mutex=mutex@entry=0x20020c9c <mavlink_module_mutex>) at platforms/nuttx/NuttX/nuttx/libs/libc/pthread/pthread_mutex_lock.c:89
#6  0x080d8b6a in LockGuard::LockGuard (mutex=..., this=0x20039cac) at src/include/containers/LockGuard.hpp:44
#7  Mavlink::forward_message (msg=0x20039dcc, self=0x20036900) at build/px4_fmu-v5x_default/src/modules/mavlink/modules__mavlink_unity.cpp:2797
#8  0x080e75fa in MavlinkReceiver::run (this=0x20036918, this@entry=<error reading variable: value has been optimized out>) at build/px4_fmu-v5x_default/src/modules/mavlink/modules__mavlink_unity.cpp:12455
#9  0x080e77ce in MavlinkReceiver::start_trampoline (context=<error reading variable: value has been optimized out>) at build/px4_fmu-v5x_default/src/modules/mavlink/modules__mavlink_unity.cpp:12790
#10 0x08014ccc in pthread_startup (entry=<optimized out>, arg=<optimized out>) at platforms/nuttx/NuttX/nuttx/libs/libc/pthread/pthread_create.c:59

modules__mavlink_unity.cpp:2797 corresponds to the following code in the Mavlink::forward_message() function, which also takes the mavlink_module_mutex first, so this thread likewise blocks until the mutex becomes available:

LockGuard lg{mavlink_module_mutex};

for (Mavlink *inst : mavlink_module_instances) {
    if (inst && (inst != self) && (inst->_forwarding_on)) {
        // Pass message only if target component was seen before
        if (inst->_receiver.component_was_seen(target_system_id, target_component_id)) {
            inst->pass_message(msg);
        }
    }
}

Why is mavlink_shell not making progress?

(gdb) px4_switch_task 1886
Switched to task 'mavlink_shell' (1886).
(gdb) bt
#0  arm_switchcontext () at platforms/nuttx/NuttX/nuttx/arch/arm/src/armv7-m/gnu/arm_switchcontext.S:79
#1  0x0800b4d8 in nxsem_wait (sem=0x2005fe10) at platforms/nuttx/NuttX/nuttx/sched/semaphore/sem_wait.c:153
#2  0x0801ed94 in nx_waitpid (pid=<optimized out>, stat_loc=stat_loc@entry=0x2005f4ec, options=options@entry=4) at platforms/nuttx/NuttX/nuttx/sched/sched/sched_waitpid.c:129
#3  0x0801eda2 in waitpid (pid=pid@entry=2664, stat_loc=stat_loc@entry=0x2005f4ec, options=options@entry=4) at platforms/nuttx/NuttX/nuttx/sched/sched/sched_waitpid.c:593
#4  0x08010740 in nsh_builtin (vtbl=vtbl@entry=0x2005f6c0, cmd=0x2005f964 "mavlink", argv=argv@entry=0x2005f56c, redirfile=redirfile@entry=0x0, oflags=oflags@entry=0) at platforms/nuttx/NuttX/apps/nshlib/nsh_builtin.c:160
#5  0x0800daa0 in nsh_execute (vtbl=vtbl@entry=0x2005f6c0, argc=argc@entry=2, argv=argv@entry=0x2005f56c, redirfile=redirfile@entry=0x0, oflags=oflags@entry=0) at platforms/nuttx/NuttX/apps/nshlib/nsh_parse.c:528
#6  0x0800e746 in nsh_parse_command (vtbl=vtbl@entry=0x2005f6c0, cmdline=<optimized out>) at platforms/nuttx/NuttX/apps/nshlib/nsh_parse.c:2578
#7  0x0800e816 in nsh_parse (vtbl=vtbl@entry=0x2005f6c0, cmdline=cmdline@entry=0x2005f964 "mavlink") at platforms/nuttx/NuttX/apps/nshlib/nsh_parse.c:2662
#8  0x080120b6 in nsh_session (pstate=0x2005f6c0, login=<optimized out>, argc=<optimized out>, argv=<optimized out>) at platforms/nuttx/NuttX/apps/nshlib/nsh_session.c:176
#9  0x08010b8e in nsh_consolemain (argc=argc@entry=0, argv=argv@entry=0x0) at platforms/nuttx/NuttX/apps/nshlib/nsh_consolemain.c:100
#10 0x080ced92 in MavlinkShell::shell_start_thread (argc=<optimized out>, argv=<optimized out>) at build/px4_fmu-v5x_default/src/modules/mavlink/modules__mavlink_unity.cpp:12963
#11 0x08014e90 in nxtask_startup (entrypt=entrypt@entry=0x80ced81 <MavlinkShell::shell_start_thread(int, char**)>, argc=<optimized out>, argv=<optimized out>) at platforms/nuttx/NuttX/nuttx/libs/libc/sched/task_startup.c:151
#12 0x0800c2de in nxtask_start () at platforms/nuttx/NuttX/nuttx/sched/task/task_start.c:130

The mavlink_shell task spawned the mavlink task (pid 2664) to execute the mavlink status command that came in via the MAVLink shell. It is waiting inside nx_waitpid() for that task's exit semaphore to be posted:

ret = nxsem_wait(&group->tg_exitsem);

Why is mavlink not making progress?

(gdb) px4_switch_task 2664
Switched to task 'mavlink' (2664).
(gdb) bt
#0  arm_switchcontext () at platforms/nuttx/NuttX/nuttx/arch/arm/src/armv7-m/gnu/arm_switchcontext.S:79
#1  0x0800b4d8 in nxsem_wait (sem=sem@entry=0x2005e660) at platforms/nuttx/NuttX/nuttx/sched/semaphore/sem_wait.c:153
#2  0x0815f1f6 in pipecommon_write (filep=0x20060260, buffer=0x2005ffe7 "0)\n\n)\n\n", len=46) at platforms/nuttx/NuttX/nuttx/drivers/pipes/pipe_common.c:653
#3  0x08018f88 in nx_write (fd=<optimized out>, buf=buf@entry=0x2005ffbc, nbytes=nbytes@entry=46) at platforms/nuttx/NuttX/nuttx/fs/vfs/fs_write.c:138
#4  0x08018f98 in write (fd=<optimized out>, buf=buf@entry=0x2005ffbc, nbytes=nbytes@entry=46) at platforms/nuttx/NuttX/nuttx/fs/vfs/fs_write.c:202
#5  0x08015fd8 in lib_fflush (stream=stream@entry=0x2005ff94, bforce=bforce@entry=true) at platforms/nuttx/NuttX/nuttx/libs/libc/stdio/lib_libfflush.c:117
#6  0x08016b42 in fputc (c=c@entry=10, stream=0x2005ff94) at platforms/nuttx/NuttX/nuttx/libs/libc/stdio/lib_fputc.c:48
#7  0x08016152 in stdoutstream_putc (this=0x200610e0, ch=10) at platforms/nuttx/NuttX/nuttx/libs/libc/stdio/lib_stdoutstream.c:53
#8  0x08015142 in vsprintf_internal (arglist=0x0, numargs=0, ap=..., fmt=0x8188b79 "", stream=0x200610e0) at platforms/nuttx/NuttX/nuttx/libs/libc/stdio/lib_libvsprintf.c:208
#9  lib_vsprintf (stream=stream@entry=0x200610e0, fmt=fmt@entry=0x8188b48 "\t  sysid:%3u, compid:%3u, Total: %lu (lost: %lu)\n", ap=..., ap@entry=...) at platforms/nuttx/NuttX/nuttx/libs/libc/stdio/lib_libvsprintf.c:1291
#10 0x0801612e in vfprintf (stream=0x2005ff94, fmt=0x8188b48 "\t  sysid:%3u, compid:%3u, Total: %lu (lost: %lu)\n", fmt@entry=0x801613d <stdoutstream_flush> "!\300h\377\367(\277\070\265\004F\rF\341h(F", ap=...) at platforms/nuttx/NuttX/nuttx/libs/libc/stdio/lib_vfprintf.c:54
#11 0x0801f8e4 in printf (fmt=0x8188b48 "\t  sysid:%3u, compid:%3u, Total: %lu (lost: %lu)\n") at platforms/nuttx/NuttX/nuttx/libs/libc/stdio/lib_printf.c:44
#12 0x080d8e72 in MavlinkReceiver::print_detailed_rx_stats (this=0x20036918) at build/px4_fmu-v5x_default/src/modules/mavlink/modules__mavlink_unity.cpp:12706
#13 MavlinkReceiver::print_detailed_rx_stats (this=0x20036918) at build/px4_fmu-v5x_default/src/modules/mavlink/modules__mavlink_unity.cpp:12698
#14 0x080d902e in Mavlink::display_status (this=0x20036900) at build/px4_fmu-v5x_default/src/modules/mavlink/modules__mavlink_unity.cpp:5266
#15 0x080deb08 in Mavlink::get_status_all_instances (show_streams_status=<optimized out>) at build/px4_fmu-v5x_default/src/modules/mavlink/modules__mavlink_unity.cpp:2727
#16 0x08014e90 in nxtask_startup (entrypt=entrypt@entry=0x80deb19 <mavlink_main(int, char**)>, argc=<optimized out>, argv=<optimized out>) at platforms/nuttx/NuttX/nuttx/libs/libc/sched/task_startup.c:151
#17 0x0800c2de in nxtask_start () at platforms/nuttx/NuttX/nuttx/sched/task/task_start.c:130

The mavlink status command calls the Mavlink::get_status_all_instances() function, which looks like this:

int
Mavlink::get_status_all_instances(bool show_streams_status)
{
	LockGuard lg{mavlink_module_mutex};
	unsigned iterations = 0;

	for (Mavlink *inst : mavlink_module_instances) {
		if (inst != nullptr) {
			printf("\ninstance #%u:\n", iterations);

			if (show_streams_status) {
				inst->display_status_streams();

			} else {
				inst->display_status();
			}

			iterations++;
		}
	}

	/* return an error if there are no instances */
	return (iterations == 0);
}

We can see that the mavlink_module_mutex is taken first and then a series of printf calls is made. This task is waiting inside one of those printf calls for semaphore 0x2005e660 to be posted, because the printf output is redirected away from stdout to mavlink via pipe0:

(gdb) px4_files
                ╷            ╷       ╷
  struct inode* │ i_private* │ Name  │ Tasks
 ═══════════════╪════════════╪═══════╪══════════════════════════════════════════════════════
     0x2005e720 │ 0x2005e6e0 │ pipe1 │ mavlink, mavlink_shell
     0x2005e510 │ 0x2005e650 │ pipe0 │ mavlink, mavlink_if0, mavlink_rcv_if0, mavlink_shell
                ╵            ╵       ╵

We can inspect the pipe devices via their i_private* pointers. pipe1 contains the mavlink status command, with the buffer already read and no task waiting on it:

(gdb) p *(struct pipe_dev_s*)0x2005e6e0
$1 = {
  d_bfsem = 1 available,
  d_rdsem = 0 available,
  d_wrsem = 0 available,
  d_wrndx = 0,
  d_rdndx = 15,
  d_bufsize = 70,
  d_nwriters = 1 '\001',
  d_nreaders = 2 '\002',
  d_pipeno = 1 '\001',
  d_flags = 0 '\000',
  d_buffer = 0x2005e750 "mavlink status\n",
  d_fds = {0x0, 0x0}
}

pipe0, however, is waiting on the write semaphore because its write buffer is full:

(gdb) p *(struct pipe_dev_s*)0x2005e650
$2 = {
  d_bfsem = 1 available,
  d_rdsem = 0 available,
  d_wrsem = -1 waiting,
  d_wrndx = 4,
  d_rdndx = 5,
  d_bufsize = 70,
  d_nwriters = 4 '\004',
  d_nreaders = 1 '\001',
  d_pipeno = 0 '\000',
  d_flags = 0 '\000',
  d_buffer = 0x2005e690 "st: :190, Total: 94 (lost: 47)\n\t  sysid:  1, compid: 51, Total: 16 (loN\325@",
  d_fds = {0x0, 0x0}
}

You can see that the buffer content "st: :190, Total: 94 (lost: 47)\n\t  sysid:  1, compid: 51, Total: 16 (lo" correlates exactly with the stuck printf statement in the mavlink task, in MavlinkReceiver::print_detailed_rx_stats():

for (const auto &comp_stat : _component_states) {
    if (comp_stat.received_messages > 0) {
        printf("\t  sysid:%3" PRIu8 ", compid:%3" PRIu8 ", Total: %" PRIu32 " (lost: %" PRIu32 ")\n",
               comp_stat.system_id, comp_stat.component_id,
               comp_stat.received_messages, comp_stat.missed_messages);
    }
}

Circular dependencies

So we have the following dependency graph:

  • mavlink_if0 waits on mavlink_module_mutex in the main loop, and therefore does not send any mavlink messages anymore.
  • mavlink_rcv_if0 also waits on the mavlink_module_mutex and therefore does not process incoming messages anymore.
  • mavlink_shell waits for the mavlink status command to finish (that’s fine).
  • mavlink waits for its output buffer to be read so that it can continue printing the mavlink status output, but the buffer is never drained: mavlink already holds the mavlink_module_mutex, so mavlink_if0, which would read the buffer, is blocked.

=> Deadlock

Solution

Remove any blocking attempts to lock mavlink_module_mutex in the transmit thread.
It's not great that the receive thread is also getting blocked while mavlink status is executing, but at least it won't deadlock anything.

Changelog Entry

For release notes:

Prevent mavlink deadlock when calling `mavlink status` via mavlink shell.

Alternatives

We could also increase the pipe buffer size. However, the status output is quite large and scales with the number of mavlink instances, so the buffer would have to be very large. Allocating such a large chunk of memory could itself be problematic, and there is still no guarantee that it would be large enough to prevent the deadlock.

Test coverage

  • Was able to reproduce this locally reliably by placing the LockGuard lg{mavlink_module_mutex}; into the main loop.
  • Was unable to reproduce this when using LockGuard lg{mavlink_event_buffer_mutex}; instead.

@niklaut niklaut marked this pull request as ready for review October 17, 2023 11:10
This prevents the mavlink transmit loop from waiting on the module mutex,
so transmissions are no longer stopped when the mutex is already taken.

This can happen when calling `mavlink status` from the mavlink shell,
where `Mavlink::get_status_all_instances()` takes the mutex and then
prints the status via pipes to the mavlink transmit buffer.
If that pipe cannot be emptied a deadlock happens.

Since the MavlinkReceiver thread also waits on the module mutex, both
reception and transmission of Mavlink packets are then prevented,
disabling communications entirely.
@niklaut niklaut force-pushed the APX4-3439-refactor-mutex-use-in-mavlink-to-prevent-deadlock branch from 75ac1ec to e28ccb9 on October 17, 2023 11:18
@dagar dagar merged commit f45b960 into PX4:main Oct 17, 2023
@niklaut niklaut deleted the APX4-3439-refactor-mutex-use-in-mavlink-to-prevent-deadlock branch October 17, 2023 14:06