Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stm32h7: Prevent UART from waiting on TXDMA semaphore #22302

Merged
merged 1 commit into from
Nov 9, 2023

Conversation

niklaut
Copy link
Contributor

@niklaut niklaut commented Nov 2, 2023

Mavlink can deadlock when using TXDMA to transmit to TELEM1 with flow control CTS high constantly.

Solved Problem

(gdb) px4_tasks
       ╷                 ╷         ╷        ╷      ╷      ╷       ╷
       │                 │         │        │      │      │       │
   pid │ Task Name       │ CPU(ms) │ CPU(%) │ Prio │ Base │ State │ Waiting For
 ══════╪═════════════════╪═════════╪════════╪══════╪══════╪═══════╪═══════════════════════════════════════════
     0 │ Idle Task       │   10095 │   86.9 │    0 │    0 │ RUN   │
   915 │ mavlink_if0     │       0 │    0.0 │  100 │  100 │ w:sem │ 0x2400f4f8 -2 waiting: 1x mavlink_if1
   916 │ mavlink_rcv_if0 │       2 │    0.2 │  175 │  175 │ w:sem │ 0x200164d0 -1 waiting
  1105 │ mavlink_if1     │       0 │    0.0 │  100 │  100 │ w:sem │ 0x200191dc -1 waiting: 1x mavlink_rcv_if1
  1106 │ mavlink_rcv_if1 │       0 │    0.0 │  175 │  175 │ w:sem │ 0x240002b0 -1 waiting
  1158 │ mavlink_if2     │       0 │    0.0 │  100 │  100 │ w:sem │ 0x2400f4f8 -2 waiting: 1x mavlink_if1
  1159 │ mavlink_rcv_if2 │       2 │    0.2 │  175 │  175 │ w:sem │ 0x300012d0 -1 waiting
       ╵                 ╵         ╵        ╵      ╵      ╵       ╵

Both mavlink_if0 and mavlink_if2 are waiting on a semaphore 0x2400f4f8 held by mavlink_if1, which waits on a different semaphore 0x200191dc held by mavlink_rcv_if1.

Deadlock Timeline

  1. The mavlink_if1 task is the first to send a message to TELEM1, which has nothing connected. This calls up_dma_txavailable which sets up the very first DMA TX transfer for UART7.
(gdb) b up_dma_txavailable if dev == &g_uart7priv
Breakpoint 2 at 0x8020e10: file platforms/nuttx/NuttX/nuttx/arch/arm/src/stm32h7/stm32_serial.c, line 3382.
(gdb) px4_reset
Resetting target
(gdb) c
Continuing.

Breakpoint 2, up_dma_txavailable (dev=0x240001e0 <g_uart7priv>) at platforms/nuttx/NuttX/nuttx/arch/arm/src/stm32h7/stm32_serial.c:3382
3382	  nxsem_wait(&priv->txdmasem);
=> 0x08020e10 <up_dma_txavailable+0>:	10 b5	push	{r4, lr}

(gdb) bt
#0  up_dma_txavailable (dev=0x240001e0 <g_uart7priv>) at platforms/nuttx/NuttX/nuttx/arch/arm/src/stm32h7/stm32_serial.c:3382
#1  0x08022c46 in uart_write (filep=<optimized out>, buffer=0x2001baae "", buflen=0) at platforms/nuttx/NuttX/nuttx/drivers/serial/serial.c:1223
#2  0x08038bfa in nx_write (fd=<optimized out>, buf=buf@entry=0x2001ba54, nbytes=90) at platforms/nuttx/NuttX/nuttx/fs/vfs/fs_write.c:138
#3  0x08038c0a in write (fd=<optimized out>, buf=buf@entry=0x2001ba54, nbytes=<optimized out>) at platforms/nuttx/NuttX/nuttx/fs/vfs/fs_write.c:202
#4  0x080f0932 in Mavlink::send_finish (this=0x2001af00) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:3082
#5  0x080f0bd4 in mavlink_end_uart_send (length=90, chan=MAVLINK_COMM_1) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:2411
#6  _mav_finalize_message_chan_send (chan=<optimized out>, msgid=msgid@entry=148, packet=packet@entry=0x2001ac48 "\377", <incomplete sequence \356>, min_length=min_length@entry=60 '<', length=length@entry=78 'N', crc_extra=crc_extra@entry=178 '\262') at build/px4_fmu-v6x_default_restricted_export_controlled/mavlink/mavlink_helpers.h:384
#7  0x080f4f10 in mavlink_msg_autopilot_version_send_struct (autopilot_version=0x2001ac48, chan=<optimized out>) at build/px4_fmu-v6x_default_restricted_export_controlled/mavlink/common/mavlink_msg_autopilot_version.h:284
#8  Mavlink::send_autopilot_capabilities (this=this@entry=0x2001af00) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:3436
#9  0x08100060 in Mavlink::task_main (this=this@entry=0x2001af00, argc=11, argc@entry=13, argv=0x2001a250, argv@entry=0x2001a248) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:4593
#10 0x08100878 in Mavlink::start_helper (argc=13, argv=0x2001a248) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:5136
#11 0x080301f4 in nxtask_startup (entrypt=entrypt@entry=0x8100859 <Mavlink::start_helper(int, char**)>, argc=<optimized out>, argv=<optimized out>) at platforms/nuttx/NuttX/nuttx/libs/libc/sched/task_startup.c:151
#12 0x08024a1e in nxtask_start () at platforms/nuttx/NuttX/nuttx/sched/task/task_start.c:130

The associated DMA channel for the tx buffer is channel 4 of DMA2:

(gdb) p g_dmach
$2 = {{
    used = true,
    ctrl = 2 '\002',
    chan = 4 '\004',
    irq = 76 'L',
    shift = 0 '\000',
    base = 1073874032,
    callback = 0x8020e91 <up_dma_txcallback>,
    arg = 0x240001e0 <g_uart7priv>
  }, {
    used = true,
    ctrl = 2 '\002',
    chan = 5 '\005',
    irq = 84 'T',
    shift = 6 '\006',
    base = 1073874056,
    callback = 0x802107b <up_dma_rxcallback>,
    arg = 0x240001e0 <g_uart7priv>
  }}

The channel is configured correctly with 0x4a (=74) bytes to transfer initially.

(gdb) px4_pshow DMA2
DMA2.S4CR   = 00000000000100010000010001010101            // stream x configuration register
    EN        ...............................1 - 1        // Stream enable / flag stream ready when read low
    DMEIE     ..............................0. - 0        // Direct mode error interrupt enable
    TEIE      .............................1.. - 1        // Transfer error interrupt enable
    HTIE      ............................0... - 0        // Half transfer interrupt enable
    TCIE      ...........................1.... - 1        // Transfer complete interrupt enable
    PFCTRL    ..........................0..... - 0        // Peripheral flow controller
    DIR       ........................01...... - 1        // Data transfer direction
    CIRC      .......................0........ - 0        // Circular mode
    PINC      ......................0......... - 0        // Peripheral increment mode
    MINC      .....................1.......... - 1        // Memory increment mode
    PSIZE     ...................00........... - 0        // Peripheral data size
    MSIZE     .................00............. - 0        // Memory data size
    PINCOS    ................0............... - 0        // Peripheral increment offset size
    PL        ..............01................ - 1        // Priority level
    DBM       .............0.................. - 0        // Double buffer mode
    CT        ............0................... - 0        // Current target (only in double buffer mode)
    ACK       ...........1.................... - 1        // ACK
    PBURST    .........00..................... - 0        // Peripheral burst transfer configuration
    MBURST    .......00....................... - 0        // Memory burst transfer configuration
DMA2.S4NDTR = 00000000000000000000000001001010            // stream x number of data register
    NDT       ................0000000001001010 - 004a     // Number of data items to transfer
DMA2.S4PAR  = 01000000000000000111100000101000            // stream x peripheral address register
    PA        01000000000000000111100000101000 - 40007828 // Peripheral address
DMA2.S4M0AR = 00100100000000000010110111100000            // stream x memory 0 address register
    M0A       00100100000000000010110111100000 - 24002de0 // Memory 0 address
DMA2.S4M1AR = 00000000000000000000000000000000            // stream x memory 1 address register
    M1A       00000000000000000000000000000000 - 00000000 // Memory 1 address (used in case of Double buffer mode)
DMA2.S4FCR  = 00000000000000000000000000101111            // stream x FIFO control register
    FTH       ..............................11 - 3        // FIFO threshold selection
    DMDIS     .............................1.. - 1        // Direct mode disable
    FS        ..........................101... - 5        // FIFO status
    FEIE      ........................0....... - 0        // FIFO error interrupt enable
  1. The second call to up_dma_txavailable comes from the mavlink_rcv_if1 task, which tries to send a parameter CAL_MAG0_ID through TELEM1.
(gdb) c
Continuing.

Breakpoint 2, up_dma_txavailable (dev=0x240001e0 <g_uart7priv>) at platforms/nuttx/NuttX/nuttx/arch/arm/src/stm32h7/stm32_serial.c:3382
3382	  nxsem_wait(&priv->txdmasem);
=> 0x08020e10 <up_dma_txavailable+0>:	10 b5	push	{r4, lr}

(gdb) bt
#0  up_dma_txavailable (dev=0x240001e0 <g_uart7priv>) at platforms/nuttx/NuttX/nuttx/arch/arm/src/stm32h7/stm32_serial.c:3382
#1  0x08022c46 in uart_write (filep=<optimized out>, buffer=0x2001ba79 "\v5", buflen=0) at platforms/nuttx/NuttX/nuttx/drivers/serial/serial.c:1223
#2  0x08038bfa in nx_write (fd=<optimized out>, buf=buf@entry=0x2001ba54, nbytes=37) at platforms/nuttx/NuttX/nuttx/fs/vfs/fs_write.c:138
#3  0x08038c0a in write (fd=<optimized out>, buf=buf@entry=0x2001ba54, nbytes=<optimized out>) at platforms/nuttx/NuttX/nuttx/fs/vfs/fs_write.c:202
#4  0x080f0932 in Mavlink::send_finish (this=0x2001af00) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:3082
#5  0x080f0bd4 in mavlink_end_uart_send (length=37, chan=MAVLINK_COMM_1) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:2411
#6  _mav_finalize_message_chan_send (chan=<optimized out>, msgid=msgid@entry=22, packet=packet@entry=0x2001e1f8 "\f\003\003", min_length=min_length@entry=25 '\031', length=length@entry=25 '\031', crc_extra=crc_extra@entry=220 '\334') at build/px4_fmu-v6x_default_restricted_export_controlled/mavlink/mavlink_helpers.h:384
#7  0x080f36de in mavlink_msg_param_value_send_struct (chan=<optimized out>, param_value=param_value@entry=0x2001e1f8) at build/px4_fmu-v6x_default_restricted_export_controlled/mavlink/common/mavlink_msg_param_value.h:197
#8  0x080f4d1a in MavlinkParametersManager::send_param (this=0x2001b090, param=<optimized out>, component_id=-1) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:8909
#9  0x080f6a9e in MavlinkParametersManager::send_untransmitted (this=<optimized out>) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:8704
#10 MavlinkParametersManager::send_untransmitted (this=0x2001b090) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:8673
#11 0x081029f0 in MavlinkParametersManager::send (this=this@entry=0x2001b090) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:8652
#12 0x08107368 in MavlinkReceiver::run (this=0x2001af18, this@entry=<error reading variable: value has been optimized out>) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:12432
#13 0x081073c6 in MavlinkReceiver::start_trampoline (context=<error reading variable: value has been optimized out>) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:12715
#14 0x0803002e in pthread_startup (entry=<optimized out>, arg=<optimized out>) at platforms/nuttx/NuttX/nuttx/libs/libc/pthread/pthread_create.c:59
(gdb) frame 7
#7  0x080f36de in mavlink_msg_param_value_send_struct (chan=<optimized out>, param_value=param_value@entry=0x2001e1f8) at build/px4_fmu-v6x_default_restricted_export_controlled/mavlink/common/mavlink_msg_param_value.h:197
197	    _mav_finalize_message_chan_send(chan, MAVLINK_MSG_ID_PARAM_VALUE, (const char *)param_value, MAVLINK_MSG_ID_PARAM_VALUE_MIN_LEN, MAVLINK_MSG_ID_PARAM_VALUE_LEN, MAVLINK_MSG_ID_PARAM_VALUE_CRC);
(gdb) p *param_value
$4 = {
  param_value = 2.76599501e-40,
  param_count = 932,
  param_index = 71,
  param_id = "CAL_MAG0_ID\000\000\000\000",
  param_type = 6 '\006'
}

Note that both transmit calls do not fill the entire TX buffer, just 127B in total:

(gdb) p g_uart7priv
$1 = {
  dev = {
    open_count = 1 '\001',
    xmitwaiting = false,
    recvwaiting = false,
    closesem = available (1i),
    xmitsem = taken (0),
    recvsem = taken (0),
    pollsem = available (1i),
    xmit = {
      sem = taken (0i) waiters: 1x mavlink_rcv_if1,
      size = 3008,
      buffer = 0x24002de0 <g_uart7txbuffer> "\375N",
      tail -> head = 0 -> 127: 127 used, 2881 free,
      content = ýN\00\00\00\01\01\94\00\00ÿî\00\00\00\00\00\003387\0dQ04\00\00\0d\01\00\00\0d\01ÿ\00\00\0b5\00\00\00\8515\00\1b\07\02þ³M7ü\00\00\00þ³M7üº~²\15\7f\01¶\01\00\06\00\00\00\00783340Q\0d\001\006â\9eý\19\00\00\01\01\01\16\00\00\0c\03\03\00\a0\03C\00CAL_MAG0_ID\00\00\00\00\00\06H\14
    },
    dmatx = {
      buffer = 0x24002de0 <g_uart7txbuffer> "\375N",
      nbuffer = 0x0,
      length = 90,
      nlength = 0,
      nbytes = 0
    },
  },
  iflow = true,
  oflow = true,
  baud = 57600,
  txdma_channel = 592,
  txdma = 0x240009d0 <g_dmach+192>,
  txdmasem = taken (-1),
}
  1. Since the first DMA transfer is not finished (and will never be finished since TELEM1 CTS is high, since it’s unconnected), this puts the mavlink_rcv_if1 task to sleep BEFORE it can release the Mavlink::_send_mutex, which was locked when calling Mavlink::send_start() some time before.

Sleeps in up_dma_txavailable (dev=0x240001e0 <g_uart7priv>) at platforms/nuttx/NuttX/nuttx/arch/arm/src/stm32h7/stm32_serial.c:3382:

// Informs DMA that Tx data is available and is ready for transfer.
#ifdef SERIAL_HAVE_TXDMA
static void up_dma_txavailable(struct uart_dev_s *dev)
{
  struct up_dev_s *priv = (struct up_dev_s *)dev->priv;
  /* Only send when the DMA is idle */
  nxsem_wait(&priv->txdmasem); // <<<<<<<<<<<<<<<<<<<<<<<
  uart_xmitchars_dma(dev);
}
#endif

called from uart_write (filep=<optimized out>, buffer=0x20019019 "\v5", buflen=0) at platforms/nuttx/NuttX/nuttx/drivers/serial/serial.c:1223.

  if (dev->xmit.head != dev->xmit.tail)
    {
#ifdef CONFIG_SERIAL_TXDMA
      uart_dmatxavail(dev);
#endif
      uart_enabletxint(dev);
    }

  uart_givesem(&dev->xmit.sem);
  return nwritten;
  1. The mavlink_if1 now waits on the Mavlink::_send_mutex in Mavlink::send_start():
(gdb) px4_switch_task 1105
Switched to task 'mavlink_if1' (1105).
(gdb) bt
#0  arm_switchcontext () at platforms/nuttx/NuttX/nuttx/arch/arm/src/armv7-m/gnu/arm_switchcontext.S:79
#1  0x08023bea in nxsem_wait (sem=sem@entry=0x200191dc) at platforms/nuttx/NuttX/nuttx/sched/semaphore/sem_wait.c:155
#2  0x08024f22 in pthread_sem_take (sem=sem@entry=0x200191dc, abs_timeout=<optimized out>, intr=<optimized out>) at platforms/nuttx/NuttX/nuttx/sched/pthread/pthread_initialize.c:72
#3  0x0817b170 in pthread_mutex_take (mutex=mutex@entry=0x200191d8, abs_timeout=abs_timeout@entry=0x0, intr=intr@entry=true) at platforms/nuttx/NuttX/nuttx/sched/pthread/pthread_mutex.c:169
#4  0x0817aee6 in pthread_mutex_timedlock (mutex=mutex@entry=0x200191d8, abs_timeout=abs_timeout@entry=0x0) at platforms/nuttx/NuttX/nuttx/sched/pthread/pthread_mutextimedlock.c:190
#5  0x08179f1a in pthread_mutex_lock (mutex=mutex@entry=0x200191d8) at platforms/nuttx/NuttX/nuttx/libs/libc/pthread/pthread_mutex_lock.c:89
#6  0x080f066e in Mavlink::send_start (this=0x200184a0, length=42) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:3053
#7  0x080f0b9e in _mav_finalize_message_chan_send (chan=<optimized out>, msgid=msgid@entry=76, packet=packet@entry=0x20018124 "", min_length=min_length@entry=33 '!', length=30 '\036', length@entry=33 '!', crc_extra=crc_extra@entry=152 '\230') at build/px4_fmu-v6x_default_restricted_export_controlled/mavlink/mavlink_helpers.h:377
#8  0x080f0bf2 in mavlink_msg_command_long_send_struct (chan=chan@entry=MAVLINK_COMM_1, command_long=command_long@entry=0x20018124) at build/px4_fmu-v6x_default_restricted_export_controlled/mavlink/common/mavlink_msg_command_long.h:275
#9  0x080fc7e0 in MavlinkCommandSender::handle_vehicle_command (this=0x2000e110, command=..., channel=<optimized out>) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:142
#10 0x080fc87e in MavlinkCommandSender::handle_vehicle_command (this=<optimized out>, command=..., channel=<optimized out>) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:123
#11 0x080fcbaa in MavlinkStreamCommandLong::send (this=0x20019660) at src/modules/mavlink/streams/COMMAND_LONG.hpp:89
#12 0x080faac6 in MavlinkStream::update (this=this@entry=0x20019660, t=@0x20018280: 1327377) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:13202
#13 0x081004d8 in Mavlink::task_main (this=this@entry=0x200184a0, argc=11, argc@entry=13, argv=0x20017790, argv@entry=0x20017788) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:4798
#14 0x08100878 in Mavlink::start_helper (argc=13, argv=0x20017788) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:5136
#15 0x080301f4 in nxtask_startup (entrypt=entrypt@entry=0x8100859 <Mavlink::start_helper(int, char**)>, argc=<optimized out>, argv=<optimized out>) at platforms/nuttx/NuttX/nuttx/libs/libc/sched/task_startup.c:151
#16 0x08024a1e in nxtask_start () at platforms/nuttx/NuttX/nuttx/sched/task/task_start.c:130

In addition, in MavlinkCommandSender::handle_vehicle_command() the task locked the MavlinkCommandSender::_lock semaphore before being put to sleep, so it’s still holding it.

  1. Both the mavlink_if0 and mavlink_if2 tasks are now stuck waiting on the same MavlinkCommandSender::_lock semaphore to send their messages.
(gdb) px4_switch_task 915
Switched to task 'mavlink_if0' (915).
(gdb) bt
#0  arm_switchcontext () at platforms/nuttx/NuttX/nuttx/arch/arm/src/armv7-m/gnu/arm_switchcontext.S:79
#1  0x08023bea in nxsem_wait (sem=sem@entry=0x2400f4f8 <MavlinkCommandSender::_lock>) at platforms/nuttx/NuttX/nuttx/sched/semaphore/sem_wait.c:155
#2  0x08023c1a in sem_wait (sem=sem@entry=0x2400f4f8 <MavlinkCommandSender::_lock>) at platforms/nuttx/NuttX/nuttx/sched/semaphore/sem_wait.c:273
#3  0x080ee802 in MavlinkCommandSender::lock () at src/modules/mavlink/mavlink_command_sender.h:93
#4  0x080fc9f8 in MavlinkCommandSender::check_timeout (this=0x2000e110, channel=<optimized out>) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:196
#5  0x080fcbbe in MavlinkStreamCommandLong::send (this=0x20014120) at src/modules/mavlink/streams/COMMAND_LONG.hpp:98
#6  0x080faac6 in MavlinkStream::update (this=this@entry=0x20014120, t=@0x20012a40: 1330273) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:13202
#7  0x081004d8 in Mavlink::task_main (this=this@entry=0x20012cc0, argc=12, argc@entry=14, argv=0x20011f50, argv@entry=0x20011f48) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:4798
#8  0x08100878 in Mavlink::start_helper (argc=14, argv=0x20011f48) at build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:5136
#9  0x080301f4 in nxtask_startup (entrypt=entrypt@entry=0x8100859 <Mavlink::start_helper(int, char**)>, argc=<optimized out>, argv=<optimized out>) at platforms/nuttx/NuttX/nuttx/libs/libc/sched/task_startup.c:151
#10 0x08024a1e in nxtask_start () at platforms/nuttx/NuttX/nuttx/sched/task/task_start.c:130
(gdb) px4_switch_task 1158
Switched to task 'mavlink_if2' (1158).
(gdb) bt
#0  arm_switchcontext () at /Users/niklaut/dev/upstream/PX4_firmware_private/platforms/nuttx/NuttX/nuttx/arch/arm/src/armv7-m/gnu/arm_switchcontext.S:79
#1  0x08023bea in nxsem_wait (sem=sem@entry=0x2400f4f8 <MavlinkCommandSender::_lock>) at /Users/niklaut/dev/upstream/PX4_firmware_private/platforms/nuttx/NuttX/nuttx/sched/semaphore/sem_wait.c:155
#2  0x08023c1a in sem_wait (sem=sem@entry=0x2400f4f8 <MavlinkCommandSender::_lock>) at /Users/niklaut/dev/upstream/PX4_firmware_private/platforms/nuttx/NuttX/nuttx/sched/semaphore/sem_wait.c:273
#3  0x080ee802 in MavlinkCommandSender::lock () at /Users/niklaut/dev/upstream/PX4_firmware_private/src/modules/mavlink/mavlink_command_sender.h:93
#4  0x080fc9f8 in MavlinkCommandSender::check_timeout (this=0x2000e110, channel=<optimized out>) at /Users/niklaut/dev/upstream/PX4_firmware_private/build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:196
#5  0x080fcbbe in MavlinkStreamCommandLong::send (this=0x2001e920) at /Users/niklaut/dev/upstream/PX4_firmware_private/src/modules/mavlink/streams/COMMAND_LONG.hpp:98
#6  0x080faac6 in MavlinkStream::update (this=this@entry=0x2001e920, t=@0x2001d5c0: 1329294) at /Users/niklaut/dev/upstream/PX4_firmware_private/build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:13202
#7  0x081004d8 in Mavlink::task_main (this=this@entry=0x2001d780, argc=11, argc@entry=13, argv=0x2001cad0, argv@entry=0x2001cac8) at /Users/niklaut/dev/upstream/PX4_firmware_private/build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:4798
#8  0x08100878 in Mavlink::start_helper (argc=13, argv=0x2001cac8) at /Users/niklaut/dev/upstream/PX4_firmware_private/build/px4_fmu-v6x_default_restricted_export_controlled/src/modules/mavlink/modules__mavlink_unity.cpp:5136
#9  0x080301f4 in nxtask_startup (entrypt=entrypt@entry=0x8100859 <Mavlink::start_helper(int, char**)>, argc=<optimized out>, argv=<optimized out>) at /Users/niklaut/dev/upstream/PX4_firmware_private/platforms/nuttx/NuttX/nuttx/libs/libc/sched/task_startup.c:151
#10 0x08024a1e in nxtask_start () at /Users/niklaut/dev/upstream/PX4_firmware_private/platforms/nuttx/NuttX/nuttx/sched/task/task_start.c:130

=> DEADLOCK

Chain of semaphores: g_uart7priv->txdmasem blocks Mavlink::_send_mutex blocks MavlinkCommandSender::_lock.
As a result all three transmit tasks are blocked: mavlink_if0, mavlink_if1, mavlink_if2 as well as the mavlink_rcv_if1 task.

This doesn’t happen with interrupt-based transmit, because that doesn’t wait on a semaphore in uart_write and thus the Mavlink layers can manage the buffer transmission manually.

Hardware Flow Control

This happens only when the UART7 (TELEM1) CTS (transmit flow control) is pulled high externally, thus no transmission is possible.

(gdb) px4_gpios -ff UART7
      ╷           ╷   ╷   ╷    ╷           ╷
  Pin │ Config    │ I │ O │ AF │ Name      │ Function
 ═════╪═══════════╪═══╪═══╪════╪═══════════╪══════════════════
  E8  │ ALT+PU+VH │ ^ │   │  7 │ UART7_TX  │ UART7_TX_TELEM1
  E9  │ OUT+VH    │ _ │ _ │    │ UART7_RTS │ UART7_RTS_TELEM1
  E10 │ ALT       │ ^ │   │  7 │ UART7_CTS │ UART7_CTS_TELEM1
  F6  │ ALT+PU+VH │ ^ │   │  7 │ UART7_RX  │ UART7_RX_TELEM1
      ╵           ╵   ╵   ╵    ╵           ╵

Solution

Remove the TXDMA config from the UART7 and use interrupt based transmission on FMU v6x and v6c.

There seems to be some unlucky timing required to trigger exactly this behavior, but I wasn't able to figure out what specific configuration triggered this. However, on the FMU we had, the condition was sticky even across reflashes and recompilations including outputting tracing information. (CTS was left floating in our fork)

We implemented the alternative fix, which acquires the txdma semaphore non-blockingly.

Changelog Entry

For release notes:

Prevent a deadlock in mavlink transmission when using TX DMA on TELEM1.

Alternatives

We could also fix the usage of semaphores in the TXDMA driver, as it should return the number of bytes written to delegate the behavior to the upper layers like the interrupt-based transmit does. However, up_dma_txavailable() gets called from a number of locations and I don't fully unterstand all the implications of removing the semaphore. cc @davids5

=> We implemented this solution now.

Test coverage

Tested with and without the TXDMA config. One does deadlock, the other does not deadlock.

Context

Profiling shows a lack of Mavlink thread activity very soon after boot.

Drag this file into ui.perfetto.dev.

@niklaut niklaut marked this pull request as ready for review November 2, 2023 14:39
@davids5
Copy link
Member

davids5 commented Nov 2, 2023

@niklaut Why not just add the pull Downs in GPIO config the CTS pins?

@niklaut
Copy link
Contributor Author

niklaut commented Nov 2, 2023

Why not just add the pull Downs in GPIO config the CTS pins?

Because then we can deadlock if CTS is pulled high when connected. I mean it's not great if the TELEM1 communication partner becomes unresponsive and does not allow for any more transmission, but at least it doesn't block the other Mavlink instances from transmitting too.

I feel like the nxsem_wait call in up_dma_txavailable() shouldn't be there at all, but it should only test if the semaphore can be taken, otherwise it's up to the caller to try again later? At least that's my understanding of how it works right now with the interrupts based approach?

@niklaut niklaut force-pushed the fix/v6x_txdma_mavlink branch 2 times, most recently from 151fa0a to b82c975 Compare November 3, 2023 10:05
@niklaut niklaut changed the title boards: Adapt FMUv6x DMA config to mirror FMUv5x boards: Remove FMUv6x/c TXDMA config from UART7 Nov 3, 2023
@dagar dagar requested a review from davids5 November 3, 2023 15:10
@dagar dagar self-requested a review November 3, 2023 15:14
@dagar
Copy link
Member

dagar commented Nov 3, 2023

As a start I don't think it's worth having separate threads for Mavlink RX & TX, there's quite a bit of TX happening from the RX thread and even simple things like the seq aren't being updated properly.

I'd also prefer things like MavlinkCommandSender being completely isolated per mavlink instance, but that would take a bit longer to untangle.

@dagar
Copy link
Member

dagar commented Nov 3, 2023

What about TXDMA on STM32F7?

@niklaut
Copy link
Contributor Author

niklaut commented Nov 6, 2023

What about TXDMA on STM32F7?

It is already disabled in the defconfig, only RXDMA is used on FMUv5x. I suspect this problem occurred before and therefore it was disabled.

In general I would like to fix this issue properly by removing the need for waiting on a semaphore in the TXDMA code, but this bug is currently preventing communications and this quickfix is helping…

@davids5
Copy link
Member

davids5 commented Nov 6, 2023

@niklaut can you test the use of nxsem_trywait

int rv =  nxsem_trywait(&priv->txdmasem);
if (rv == OK) 
  {
    uart_xmitchars_dma(dev);
  }

And see if this solves the issue.

@niklaut
Copy link
Contributor Author

niklaut commented Nov 8, 2023

Yup, your fix works!

Why not just add the pull Downs in GPIO config the CTS pins?

So in PX4 upstream the CTS is already pulled down, but in our PX4 private fork, CTS is left floating. This explains why the problem was only reproducible on some days, I guess when the humidity was high enough or something 🙈

@davids5
Copy link
Member

davids5 commented Nov 8, 2023

@niklaut PR from NuttX is backported to px4_firmware_nuttx-10.3.0+ Please repoint this PR (and revert changes) or open a new one with NuttX pointing to px4_firmware_nuttx-10.3.0+ to bring it in to PX4

The UART7 TXDMA services TELEM1 with flow control. If CTS is high, the
transmitting thread will wait on a semaphore, which may block other
threads from acquiring necessary resources to make progress, for
example, preventing MAVLINK instances from transmitting.

This change in NuttX makes the TXDMA acquire the semaphore in a
non-blocking way, preventing this issue.
@niklaut niklaut force-pushed the fix/v6x_txdma_mavlink branch from b82c975 to 031ed0d Compare November 9, 2023 12:07
@niklaut niklaut changed the title boards: Remove FMUv6x/c TXDMA config from UART7 stm32h7: Prevent UART from waiting on TXDMA semaphore Nov 9, 2023
@niklaut
Copy link
Contributor Author

niklaut commented Nov 9, 2023

Updated the submodule and the PR description.

Copy link
Member

@davids5 davids5 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - let's see how CI feels

@davids5 davids5 merged commit 1463f9d into PX4:main Nov 9, 2023
81 of 85 checks passed
@niklaut niklaut deleted the fix/v6x_txdma_mavlink branch November 9, 2023 13:12
@julianoes
Copy link
Contributor

Amazing work @niklaut!

@tstastny
Copy link

@niklaut
Copy link
Contributor Author

niklaut commented Nov 13, 2023

I think because there were additional changes in NuttX that got pulled in on top of the px4_firmware_nuttx-10.3.0+ branch. cc @davids5 Should we remove these commits from the PX4-NuttX branch?

@davids5
Copy link
Member

davids5 commented Nov 13, 2023

@niklaut we should just merge this the board that is failing has been removed.

@niklaut
Copy link
Contributor Author

niklaut commented Nov 13, 2023

Ok, then if the failing board is not super relevant, I guess there's no pressure to act immediately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

5 participants