-
Notifications
You must be signed in to change notification settings - Fork 13.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stm32h7: Prevent UART from waiting on TXDMA semaphore #22302
Conversation
@niklaut Why not just add the pull Downs in GPIO config the CTS pins? |
Because then we can deadlock if CTS is pulled high when connected. I mean it's not great if the TELEM1 communication partner becomes unresponsive and does not allow for any more transmission, but at least it doesn't block the other Mavlink instances from transmitting too. I feel like the |
151fa0a
to
b82c975
Compare
As a start I don't think it's worth having separate threads for Mavlink RX & TX, there's quite a bit of TX happening from the RX thread and even simple things like the seq aren't being updated properly. I'd also prefer things like MavlinkCommandSender being completely isolated per mavlink instance, but that would take a bit longer to untangle. |
What about TXDMA on STM32F7? |
It is already disabled in the defconfig, only RXDMA is used on FMUv5x. I suspect this problem occurred before and therefore it was disabled. In general I would like to fix this issue properly by removing the need for waiting on a semaphore in the TXDMA code, but this bug is currently preventing communications and this quickfix is helping… |
@niklaut can you test the use of nxsem_trywait
And see if this solves the issue. |
Yup, your fix works!
So in PX4 upstream the CTS is already pulled down, but in our PX4 private fork, CTS is left floating. This explains why the problem was only reproducible on some days, I guess when the humidity was high enough or something 🙈 |
@niklaut PR from NuttX is backported to px4_firmware_nuttx-10.3.0+ Please repoint this PR (and revert changes) or open a new one with NuttX pointing to px4_firmware_nuttx-10.3.0+ to bring it in to PX4 |
The UART7 TXDMA services TELEM1 with flow control. If CTS is high, the transmitting thread will wait on a semaphore, which may block other threads from acquiring necessary resources to make progress, for example, preventing MAVLINK instances from transmitting. This change in NuttX makes the TXDMA acquire the semaphore in a non-blocking way, preventing this issue.
b82c975
to
031ed0d
Compare
Updated the submodule and the PR description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - let's see how CI feels
Amazing work @niklaut! |
@niklaut nxp board is failing since this PR https://github.com/PX4/PX4-Autopilot/actions/runs/6812057466/job/18523643113#step:7:757 |
I think because there were additional changes in NuttX that got pulled in on top of the |
Ok, then if the failing board is not super relevant, I guess there's no pressure to act immediately. |
Mavlink can deadlock when using TXDMA to transmit to TELEM1 with flow control CTS high constantly.
Solved Problem
Both
mavlink_if0
andmavlink_if2
are waiting on a semaphore0x2400f4f8
held bymavlink_if1
, which waits on a different semaphore0x200191dc
held bymavlink_rcv_if1
.Deadlock Timeline
mavlink_if1
task is the first to send a message to TELEM1, which has nothing connected. This callsup_dma_txavailable
which sets up the very first DMA TX transfer for UART7.The associated DMA channel for the tx buffer is channel 4 of DMA2:
The channel is configured correctly with 0x4a (=74) bytes to transfer initially.
up_dma_txavailable
comes from themavlink_rcv_if1
task, which tries to send a parameterCAL_MAG0_ID
through TELEM1.Note that both transmit calls do not fill the entire TX buffer, just 127B in total:
mavlink_rcv_if1
task to sleep BEFORE it can release theMavlink::_send_mutex
, which was locked when callingMavlink::send_start()
some time before.Sleeps in
up_dma_txavailable (dev=0x240001e0 <g_uart7priv>) at platforms/nuttx/NuttX/nuttx/arch/arm/src/stm32h7/stm32_serial.c:3382
:called from
uart_write (filep=<optimized out>, buffer=0x20019019 "\v5", buflen=0) at platforms/nuttx/NuttX/nuttx/drivers/serial/serial.c:1223
.mavlink_if1
now waits on theMavlink::_send_mutex
inMavlink::send_start()
:In addition, in
MavlinkCommandSender::handle_vehicle_command()
the task locked theMavlinkCommandSender::_lock
semaphore before being put to sleep, so it’s still holding it.=> DEADLOCK
Chain of semaphores:
g_uart7priv->txdmasem
blocksMavlink::_send_mutex
blocksMavlinkCommandSender::_lock
.As a result all three transmit tasks are blocked:
mavlink_if0
,mavlink_if1
,mavlink_if2
as well as themavlink_rcv_if1
task.This doesn’t happen with interrupt-based transmit, because that doesn’t wait on a semaphore in uart_write and thus the Mavlink layers can manage the buffer transmission manually.
Hardware Flow Control
This happens only when the UART7 (TELEM1) CTS (transmit flow control) is pulled high externally, thus no transmission is possible.
Solution
Remove the TXDMA config from the UART7 and use interrupt based transmission on FMU v6x and v6c.There seems to be some unlucky timing required to trigger exactly this behavior, but I wasn't able to figure out what specific configuration triggered this. However, on the FMU we had, the condition was sticky even across reflashes and recompilations including outputting tracing information.(CTS was left floating in our fork)We implemented the alternative fix, which acquires the txdma semaphore non-blockingly.
Changelog Entry
For release notes:
Alternatives
We could also fix the usage of semaphores in the TXDMA driver, as it should return the number of bytes written to delegate the behavior to the upper layers like the interrupt-based transmit does. However,
up_dma_txavailable()
gets called from a number of locations and I don't fully unterstand all the implications of removing the semaphore. cc @davids5=> We implemented this solution now.
Test coverage
Tested with and without the TXDMA config. One does deadlock, the other does not deadlock.
Context
Profiling shows a lack of Mavlink thread activity very soon after boot.
Drag this file into ui.perfetto.dev.