
Refactor IOStreamer and QSPI #700

Draft
purdeaandrei wants to merge 15 commits into main from f_refactor_iostreamer

Conversation

purdeaandrei
Contributor

This refactors the entire IOStreamer infrastructure and how it's used with QSPI.

Here's a rough sketch of what the structure looks like:
[image: rough sketch of the refactored IOStreamer/QSPI structure]

This PR is correct with respect to all of the pending bugs that the existing IOStreamer and QSPI controller have, that is:

  • DDR sampling latency is correct
  • Supports sample delay functionality (with the applet auto-calculating some reasonable defaults)
  • QSPI controller now always samples and drives on the correct edge (or relative to the correct edge)
  • There is no more risk of losing a sample when the skid buffer both receives and sends a sample on the same clock cycle
  • A combinatorial i_stream.ready will not block

In addition, here's what's new in this PR:

  • sample_delay_half_clocks is runtime-configurable, while max_sample_delay_half_clocks allows IOStreamer to allocate the right amount of resources
  • We have discussed how we can avoid having a shift register stage and a skid buffer stage for every sample delay step. I came to the conclusion that:
    • divisor must be known by IOStreamerTop in order to make optimizations like that, so I have integrated IOStretcher into IOStreamerTop.
    • divisor is a runtime setting, so we cannot minimize logic based on its value; instead I have added a min_divisor configuration option, and divisor must always be >= min_divisor. The min_divisor option can be used to optimize away skid buffer stages, and to optimize the shift register stages that delay samples from entering the skid buffer.
    • the size of the skid buffer can be made as small as ceil(max_latency / min_divisor), which can sometimes end up being as small as a single stage.
    • the shift registers that are used to delay the samples and meta from entering the skid buffer can be implemented more efficiently by using a counter. Unfortunately, if the maximum latency is large, then the counter will only count up to min_divisor, and beyond that the rest of the latency will be implemented with shift registers. However, if the maximum latency is small, the counter will suffice.
    • If we don't constrain divisor with a min_divisor configuration option, then I don't think that we can make any kind of optimization and we are forced to have max_latency shift register stages for i_en/meta and max_latency skid buffer stages as well.
  • when it comes to the 1-lane to 2-lane conversion (i.e. having two 1-lane transactions be merged into a single two-lane transaction) it's a bit complicated, and one of the following options must be chosen:
    • 1-lane transactions could always come in pairs, so the first would always belong to lane 0 and the second would always belong to lane 1. I have chosen not to implement this solution, because we need to have i_en individually for each lane. That means i_en can be set high for only one lane, so while the 2-to-1-lane conversion always produces pairs, the returning samples would no longer be paired. One way we could handle that would be to have i_en be an attribute outside of lanes, or to have i_en or-ed together and overridden to be the same for all lanes before 2-to-1 conversion, but this is undesirable.
    • Another option would be to always convert a 1-lane transaction to a 2-lane transaction where the second lane is an empty lane. This is undesirable, because QSPIController would then have to be aware of whether we're in the DDR + divisor=0 case, in which case samples return on lane [1], while in all other cases samples return on lane [0]. It seems very hackish from the perspective of the QSPIController: samples should always return on the same lane independent of speed configuration. We can't just return the DDR + divisor=0 lane[1] samples on lane[0], because that's not a generic solution; sometimes we care about samples taken during the rising edge, and sometimes for other protocols we may actually want to sample on both edges. So this option is undesirable.
    • What I have implemented is that IOClocker's IO2LaneTo1Lane converter adds extra information to meta that identifies which lane each transaction originally belonged to. Then the entire IOStreamerTop structure processes the transaction with the expanded meta field, and when samples come back to IOClockerDeframer, they have the corresponding fields tag and last.
      • tag tells the IO1LaneTo2Lane upconverter which lane each sample belongs to
      • last tells the IO1LaneTo2Lane upconverter whether it needs to wait for another sample before assembling the 2-lane transaction and sending it to QSPIDeframer.
      • Note that last is generated in IO2LaneTo1Lane by looking at lane 1's i_en. If lane[1].i_en is high, then lane[0].meta.last is low and lane[1].meta.last is high; otherwise lane[0].meta.last is high (see the sketch after this list).
      • This scheme results in:
        • QSPIController simply sees lane[0] as clock falling edge, and lane[1] as clock rising edge. It will simply set i_en high for the edge(s) for which it's interested in receiving a sample, and the sample will come back on the same lane number(s) as on which i_en was set high. This makes things generic for future protocols.
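To make the tag/last scheme above concrete, here is a hedged Amaranth-style sketch of the rule; the signal names are illustrative only and are not the PR's actual code:

```python
from amaranth import Module, Signal

m = Module()

lane0_i_en = Signal()  # QSPIController wants a sample relative to the falling edge (lane 0)
lane1_i_en = Signal()  # QSPIController wants a sample relative to the rising edge (lane 1)

lane0_tag  = Signal()  # carried in meta: original lane index of this 1-lane transfer
lane1_tag  = Signal()
lane0_last = Signal()  # carried in meta: marks the final 1-lane transfer of the pair
lane1_last = Signal()

m.d.comb += [
    lane0_tag.eq(0),
    lane1_tag.eq(1),
    # If lane 1 also requests a sample, its transfer closes the pair;
    # otherwise lane 0's transfer is already the last one.
    lane0_last.eq(~lane1_i_en),
    lane1_last.eq(lane1_i_en),
]
```

On the way back, IO1LaneTo2Lane can then route each sample to the lane given by tag and emit the assembled 2-lane transaction whenever a sample with last set arrives.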

Added support for additionally delaying when the sample is taken.
Normally, samples are taken at the time when the input signals that were
launched together with i_en change; this allows us to take samples later,
by a specified number of sync clock cycles. Additionally, for a ratio=2
IOStreamer we can also delay by half a clock cycle.
This changes how the SDR input sampling test works. Before, only positive
edges on clk_out would signify to the testbench that the input has been
sampled. The old behavior put a constraint on the payload: i_en must be
high only for every second payload. This was less than ideal for developing
stronger skid buffer tests.

This change slows down the clk_out signal, and now any edge (positive or
negative), means that we have sampled something. In the waves only the
shape of clk_out changes, the rest of the signals stay the same. (i.e.
the same values are sampled at the same times)

The DDR input sample test does not require a similar change.
There was a corner case in IOStreamer where a sample could be lost
if it arrives at the skid buffer on the same cycle as when another
sample is removed from the skid buffer.
This PR also adds many more testcases.
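To illustrate the corner case described above, here is a hypothetical Amaranth sketch of a small sample buffer, not the PR's actual skid buffer, whose input can never be back-pressured and which accepts a new sample on the same cycle an old one is popped:

```python
from amaranth import Elaboratable, Module, Mux, Signal

class SampleBuffer(Elaboratable):
    def __init__(self, width, depth):
        self.width   = width
        self.depth   = depth
        self.i_data  = Signal(width)
        self.i_valid = Signal()        # a new sample arrives this cycle (cannot be refused)
        self.o_data  = Signal(width)
        self.o_valid = Signal()
        self.o_ready = Signal()        # downstream pops the oldest sample this cycle

    def elaborate(self, platform):
        m = Module()
        entries = [Signal(self.width, name=f"entry{i}") for i in range(self.depth)]
        level   = Signal(range(self.depth + 1))

        pop  = self.o_valid & self.o_ready
        push = self.i_valid

        m.d.comb += [
            self.o_valid.eq(level != 0),
            self.o_data.eq(entries[0]),
        ]

        with m.If(pop):
            # Shift the remaining entries down. This also happens on a cycle
            # where a new sample arrives, so nothing is dropped.
            for i in range(self.depth - 1):
                m.d.sync += entries[i].eq(entries[i + 1])

        # Occupancy after the pop; an incoming sample lands at this position.
        # The buffer is assumed to be sized from the maximum latency, so it
        # never overflows.
        level_after = Mux(pop, level - 1, level)
        with m.If(push):
            with m.Switch(level_after):
                for i in range(self.depth):
                    with m.Case(i):
                        m.d.sync += entries[i].eq(self.i_data)
            m.d.sync += level.eq(level_after + 1)
        with m.Else():
            m.d.sync += level.eq(level_after)

        return m
```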
This better conveys the intent that we want to sample the
"old" value of the inputs in case they change at the same time
as clk_out, and it may work better if input_generator_tb changes
in the future.
This change allows o_stream to take at least one transfer, as long as
there are no sample transfers in flight. This will prevent deadlocks,
for example if i_stream.ready is driven combinatorially with something like:
```m.d.comb += ready.eq(valid)```
which is valid according to stream rules.

This commit also adds tests for this.
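A minimal sketch of the resulting condition, assuming a hypothetical samples_in_flight counter and glossing over skid buffer occupancy (these are not the PR's actual signal names):

```python
from amaranth import Module, Signal

m = Module()

i_stream_ready    = Signal()   # driven by the consumer of sampled data
o_stream_ready    = Signal()   # back-pressure towards the producer of output transfers
samples_in_flight = Signal(4)  # launched transfers still awaiting their samples

# Accept an output transfer either when the sample consumer is ready, or when
# nothing is in flight yet, so a combinatorial ready.eq(valid) consumer of
# i_stream cannot deadlock the streamer.
m.d.comb += o_stream_ready.eq(i_stream_ready | (samples_in_flight == 0))
```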
@purdeaandrei purdeaandrei force-pushed the f_refactor_iostreamer branch 2 times, most recently from 7250f61 to 9b22ff5 on September 22, 2024 07:35
There is a slight change in behavior: because the "bypass" feature has been removed,
CS clock pulses must now be at least 1 QSPI cycle long.

`sample_delay_half_clocks` is now runtime-selectable.

`max_sample_delay_half_clocks` is now a configuration option, for the IOStreamerTop to know how many resources to allocate.

`min_divisor` is now a configuration option which, if set to > 0, allows IOStreamer to optimize away some unneeded resources.

The bulk of the changes is about cutting up the functionality into smaller sub-components.
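As a rough sketch of the resource arithmetic involved (plain Python following the formulas stated earlier in this PR, not the gateware itself):

```python
import math

def skid_buffer_depth(max_latency, min_divisor):
    """Skid buffer depth needed to never lose a sample, per the PR description."""
    if min_divisor == 0:
        # divisor is unconstrained, so no optimization is possible:
        # one stage per unit of latency
        return max_latency
    return math.ceil(max_latency / min_divisor)

# For example, with max_latency = 4:
#   skid_buffer_depth(4, 0) == 4
#   skid_buffer_depth(4, 4) == 1   # can shrink to a single stage
```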
@purdeaandrei
Contributor Author

TLDR: I think I was wrong about the following statement, and I want to rework the sample_delay-related resource optimizations to make it possible to use a smaller skid buffer and fewer delayer shift register stages at slow clock speeds.

If we don't constrain divisor with a min_divisor configuration option, then I don't think that we can make any kind of optimization and we are forced to have max_latency shift register stages for i_en/meta and max_latency skid buffer stages as well.

I re-thought the statement above and I think I was wrong. An optimization that is a lot closer to what Cathrine originally described on Discord is possible, without needing a non-zero min_divisor. I think there's a way to constrain sample_delay_half_clocks differently: it could be a signal of size divisor_width, and if there is a rule that sample_delay_half_clocks <= divisor * 2 at all times, then we can get away with 3 skid buffer stages (that is: maximum Buffer latency + 1), and we could use only a counter for the SampleDelayer (or even re-use the existing counter in IOStretcher), plus some shift registers for the buffer latency.

This can then optionally be further extended if we add a max_additional_sample_delay_half_clocks setting. The additional clocks would be expensive (just as expensive as the current implementation in this PR), because they would need max_additional_sample_delay_half_clocks // 2 + max_additional_sample_delay_half_clocks % 2 shift register stages, and ceil((max_additional_sample_delay_half_clocks // 2 + max_additional_sample_delay_half_clocks % 2) / min_divisor) additional skid buffer elements.
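Transcribing those cost formulas into a small illustrative helper (assuming min_divisor > 0 so the ceil() is well defined; this counts only the extra cost of the extension):

```python
import math

def additional_delay_cost(max_additional_half_clocks, min_divisor):
    shift_stages = (max_additional_half_clocks // 2
                    + max_additional_half_clocks % 2)   # i.e. ceil(half_clocks / 2)
    extra_skid   = math.ceil(shift_stages / min_divisor)
    return shift_stages, extra_skid

# e.g. additional_delay_cost(2, 1) == (1, 1)
```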

Now, assuming min_divisor is 0 (which it would be for most hardware that has a reconfigurable divisor), I'm not sure which is better; for example:

  • hardcoded max_sample_delay_half_clocks = 4, with ICE40 ddr buffers, would result in 5 shift registers, and skid buffer depth of 4.
  • hardcoded max_additional_sample_delay_half_clocks = 2, with ICE40 ddr buffers, would result in (re-)using a 16-bit counter plus some logic to look at the counter, 4 shift registers, and a skid buffer depth of 4
    • so similar resource usage to my original implementation, and it would have the same amount of supported sample delay at divisor==0; however it would have the advantage of supporting much larger sample delays at larger divisors.
      i.e. is the added complexity worth the additional large sample delay support that only works at large divisors?

I think I'm leaning towards it being worth it.
Imagine we're driving not an ideal transmission line, but a high-impedance, high-capacitance bus, where the response also comes back on a high-impedance, high-capacitance line. Then we're forced to drive it at a slow clock speed, where divisor is a tad larger than the one-way propagation time measured in f_sync cycles. If we then sample on the following opposite edge, we haven't seen the return signal yet, so we'd want a sample delay of at least divisor*2 half-clock-cycles, and with the revised idea we basically get that for free for arbitrarily slow buses, without additional resources. The current implementation in this PR would make it very expensive, depending on how slow a bus we would want to support.
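A hedged numeric illustration of that scenario (the numbers are made up; only the relationship follows the paragraph above):

```python
one_way_propagation_sync_cycles = 6              # slow, capacitive bus
divisor = one_way_propagation_sync_cycles + 1    # "a tad larger" than the propagation time
required_sample_delay_half_clocks = 2 * divisor  # == 14; the response is not visible on the
                                                 # opposite edge, only about a bit period later

# Since the required delay scales with divisor, the revised idea covers arbitrarily
# slow buses without extra shift register or skid buffer stages.
```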

Next time I'm looking at this I think I'll try giving this idea an implementation.

@whitequark
Member

Just to be clear, I still plan to re-do the infrastructure myself at some point. As long as you're fine with duplicated work, that's going to be OK.

@purdeaandrei
Contributor Author

That wasn't clear; anyway, thanks for clarifying. I'll set this to draft; feel free to reuse any of my code if you find it useful.

@purdeaandrei purdeaandrei marked this pull request as draft September 22, 2024 18:34
@whitequark
Member

The reason I'm saying that is because I'd like the implementation to be suitable to upstream to amaranth-stdio. Yours is very complex and has a lot of idiosyncrasies, which would be fine for a Glasgow applet but isn't going to cut it for a generic and reusable module that I expect everyone to use. I've scrolled through it and I don't feel that it is suitable to be moved to amaranth-stdio; I do appreciate that you've put an enormous amount of work into it and I would like to discuss the mechanism with you to make sure that whatever eventually ends up in amaranth-stdio is of high quality.
