-
-
Notifications
You must be signed in to change notification settings - Fork 781
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase performance of SWD and JTAG bitbanging #1688
base: main
Are you sure you want to change the base?
Conversation
f59f76f
to
8d73f51
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking provisionally very good and we see a lot of promise here - we'll have to do a pile of testing to make sure this doesn't regress the setup-and-hold timings, but mostly the notes and comments in this review are about form not function. The whole volatile
vs register
thing in particular raises some interesting questions to explore.
static bool jtagtap_next_no_delay() | ||
{ | ||
gpio_set(TCK_PORT, TCK_PIN); | ||
__asm__("nop" ::: "memory"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please include a comment on what this is doing - we don't want to assume that the reader has seen the other uses of this and knows what's going on. Is there any particular reason for the inserting of a no-op rather than only a reordering barrier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Annotated for the readers. This routine is too fast in comparison to others, because it has nothing to do besides toggle the jTCK pin, and it certainly does not move data into (or out of) BMP. I may even want to add more nops or something similar to tone down this one.
src/platforms/common/jtagtap.c
Outdated
static bool jtagtap_next(bool tms, bool tdi); | ||
static void jtagtap_cycle(bool tms, bool tdi, size_t clock_cycles); | ||
static inline void platform_delay_busy(const uint32_t loops) __attribute__((always_inline)); | ||
static void jtagtap_tms_seq(uint32_t tms_states, size_t ticks) __attribute__((optimize(3))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll have to do some captures and testing with this attribute to make sure it's not shooting the timings in the foot - as you note in the PR description. We do note that this particular function signature still has the old ticks
nomenclature in it, so please switch that to clock_cycles
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
750 kHz of tdi_tdo_seq on native
was unremarkable, I hope performance improves -- please note, during capture/analysis, whether a given hardware relies on ancient FLITF Prefetch (2x64-bit lines) or runs from "0-wait-state" shadow SRAM.
Updated jtagtap_tms_seq()
declaration and definition.
/* Busy-looping delay for GPIO bitbanging operations. SUBS+BNE.N take 4 cycles. */ | ||
void platform_delay_busy(const uint32_t loops) | ||
{ | ||
/* Avoid using `volatile` variables which incur stack accesses */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cppreference page on C's volatile
does not mention anything about forced stack usage and we've never seen it imply any such - are we sure that is what the compiler's doing here? Either way, we would strongly prefer you to use a for loop not a do-while here. This is a for loop in disguise.
The disappearance of the loop when using continue;
without volatile
can be entirely explained not so much by DCE (it's the result, not the cause) but by there being no observable side effects thus allowing the compiler to assume that nothing of value is done by the loop. This is the reason for using volatile
in the jtagtap code. Perhaps we can use volatile register
to force the compiler to consider the operation to have observable side-effects but also keep the loop counter register-bound?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed volatile register
was the first thing I tried, only to be yelled at by the compiler to swap qualifiers... and identical codegen. I specifically wanted to avoid extraneous ldr/str of loop counters, so I monitored Puncover and objdump -CSd. Maybe Compiler Explorer can shed some light on why this codegens this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh how interesting - ok, that'll be an interesting session in CE then as yes, that might well provide some insight into how and why the compiler's choosing to do what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably qualifying a variable as volatile
forces the compiler to assume that it should be observable elsewhere, and hence needs to be pinned to SRAM (if only on the stack). This directly contradicts with the intent of register-allocated counters (which debuggers don't display due to "optimized away" but you can always dump the gp_regs). Something "dead store elimination".
And the for-loop form is longer and has an extra jump, I didn't subjectively like it; objectively it has to fit into Flash prefetch buffer which is a few (4) 64-bit lines on F103 (?) to not incur wait-states of 72MHz MCU against 24MHz flash on random jumps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From our perspective re the for loop part, the for loop form should code gen to the same or very near the same as the do-while (perhaps needs the condition expressed as --counter > 0
and --counter
in the update step removed to do it) while being a much clearer expression of your intent. Think this is very much going to be "take it to Compiler Explorer and tinker for an hour" to find the right thing to do and the right way to express this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your suggestion of --counter > 0
gets the codegen I want and the C idiomatic form you look to preserve, thanks.
Updated the three places accordingly.
m3/m4/m7 still get the SUBS+BNE.N, m0 (f072
) gets subs; cmp; bne.n
so its divisor timings will be off.
@@ -108,8 +121,9 @@ static uint32_t swdptap_seq_in_no_delay(const size_t clock_cycles) | |||
for (size_t cycle = 0; cycle < clock_cycles; ++cycle) { | |||
gpio_clear(SWCLK_PORT, SWCLK_PIN); | |||
value |= gpio_get(SWDIO_IN_PORT, SWDIO_IN_PIN) ? 1U << cycle : 0U; | |||
__asm__("" ::: "memory"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please insert a note here about this being a reordering barrier - same as in other places the trick is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment. Both swdptap_seq_in_no_delay()
and swdptap_seq_out_no_delay()
need careful consideration in code influencing bitbanged timings, including effect of gpio_set()/clear()
or gpio_set_val()
and gpio_get()
.
I would like to see this entire routine fit into ART cache of STM32F4, or at least not jump around thanks to all the gpio_*
bits inlined.
ecb692b
to
47a39ca
Compare
Thanks; the documented speed will certainly need updating, more so because of #1536. |
* Copy the snippet verbatim from 10 identical places * Ask GCC to please inline it back (so codegen does not change) compromising 116 bytes of flash for less jumping around
* Copy the snippet verbatim from 10 similar places, preserving the +1 in argument (turnaround/parity) for automatic 0 clock delay instead of checking against UINT32_MAX * Mark as static for GCC to inline it back (so codegen does not change) compromising 96 bytes of flash for less jumping around
* Use predecrement and check against zero, as before * Stop touching the stack with `volatile` variables, a register is sufficient * Use normal SysTick platform_delay() in jtagtap_reset(), which is both longer duration and does not depend on platform HCLK.
* Avoid a zero loops delay, which underflows in `platform_delay_busy()`, hanging the adapter
* For STM32F4, dial in 12 & 4 as measured by LA and MSO. This became faster thanks to dropping double writes and volatile on-stack counters. * For STM32F0, dial in slightly bigger 16 & 6 because armv6-m codegen is different. * For STM32F1, dial in 18 & 4: less than 22 thanks to no volatile on-stack counters, and similar busy delay impact (armv7-m). * For everything else, keep the old numbers in hope someone measures it later.
* Reference: jtagtap_tms_seq_no_delay(). Also add comments. * Use appropriate type for TMS state output.
* This allows dropping the 0-loops check from delay_busy
* Also update signature of jtagtap_tms_seq() (s/ticks/clock_cycles/)
* SWD line reset needs 50 clocks with high TMS/SWDIO, this was achieved by looping over jtagtap_next() which incurs function call delays; * Use jtagtap_cycle() instead which seems appropriate here, and discards TDO anyways; * The adiv5_swd.c:swd_line_reset_sequence() counterpart seems less fitting to duplicate in jtagtap. * Annotate any new nops in hot no_delay path.
...by redirecting to faster tdi_seq * Do not use nullability attributes yet
47a39ca
to
58a2425
Compare
Thanks for working on this! I tried it out a bit, and it seems to sometimes set At least the clock speeds seem more consistent across different routines. There are lots of remaining oddities about the timing that I see on the logic analyzer, but I'm still investigating those. |
The main improvement is making the parity read/write functions have a consistent pulse width compared to the non-parity ones, at least in the case that does delays. For the non-delay case, there seems to be increased asymmetry in the clock waveform, possibly due to replacing the NOP instructions with barriers. Some of these barriers are redundant, because the GPIO accesses are already volatile. Also, I only have a 24MHz logic analyzer, so my ability to analyze differences in the no-delay case are going to be confounded by inadequate sample rate. |
Detailed description
Tested on
blackpill-f411ce
platform connected to a stm32f103cb DUT. Firmware does not crash on exceptions from spurious ACKs. The 24MHz fx2la does not let me verify setup-and-hold timing compliancy, but the target silicon replies okay. I could retest with other targets accessible to me, possibly with low HCLKs. I was focusing ot stm32f4 platforms, but lm4flaunchpad-icdi
may be affected, too.LINERESET waveform improved from 2.988 MHz to ~8.000 MHz, that is
swd_seq_out_no_delay()
.BMDA
-r -f 8M
was 22.5 and 41 KiB/s in JTAG and SWD mode, becomes 28 and 44 KiB/s respectively (but that bottleneck is different).The PR aims to:
Important mentions:
GPIO_BSRR hack was introduced in f1ea5ed ; the reasoning behind it is not clear to me, but we can either set some default divider to target lower (2 MHz) frequency, and BMDA already asks for some 3 or 4 MHz. I wired everything on a breadboard with flying wires and a couple ground connections (no ribbon cables).
swdptap.c
hasoptimize(3)
since 4698a26 (but notjtagtap.c
). -O3 increases flash footprint and may alter the timings.Your checklist for this pull request
make PROBE_HOST=native
)make PROBE_HOST=hosted
)Closing issues
No known issues w.r.t. SWJ speed.