
Refactor IOStreamer and QSPI #700

Draft
purdeaandrei wants to merge 15 commits into main from f_refactor_iostreamer

Conversation

purdeaandrei
Contributor

This refactors the entire IOStreamer infrastructure and how it's used with QSPI.

Here's a rough sketch of what the structure looks like:
[image: rough sketch of the refactored IOStreamer/QSPI structure]

This PR is correct with respect to all of the pending bugs that the existing IOStreamer and QSPI controller have, that is:

  • DDR sampling latency is correct
  • Supports sample delay functionality (with the applet auto-calculating some reasonable defaults)
  • QSPI controller now always samples and drives on the correct edge (or relative to the correct edge)
  • There is no more risk of losing a sample when the skid buffer both receives and sends a sample on the same clock cycle
  • A combinatorial i_stream.ready will not block

In addition, here's what's new in this PR:

  • sample_delay_half_clocks is runtime-configurable, while max_sample_delay_half_clocks allows IOStreamer to allocate the right amount of resources
  • We have discussed how we can avoid having a shift register stage and a skid buffer stage for every sample delay step. I came to the conclusion that:
    • divisor must be known by IOStreamerTop in order to make optimizations like that, so I have integrated IOStretcher into IOStreamerTop.
    • divisor is a runtime setting, so we cannot minimize logic based on its value; instead I have added a min_divisor configuration option, and divisor must always be >= min_divisor. The min_divisor option can be used to optimize away skid buffer stages, and to optimize the shift register stages that delay samples from entering the skid buffer.
    • the size of the skid buffer can be made as small as ceil(max_latency / min_divisor), which can sometimes end up being as small as a single stage.
    • the shift registers that are used to delay the samples and meta from entering the skid buffer can be implemented more efficiently by using a counter. Unfortunately, if the maximum latency is large, then the counter will only count up to min_divisor, and beyond that the rest of the latency will be implemented with shift registers. However, if the maximum latency is small, the counter will suffice.
    • If we don't constrain divisor with a min_divisor configuration option, then I don't think that we can make any kind of optimization and we are forced to have max_latency shift register stages for i_en/meta and max_latency skid buffer stages as well.
  • when it comes to the 1-lane to 2-lane conversion (i.e. having two 1-lane transactions be merged into a single two-lane transaction) it's a bit complicated, and one of the following options must be chosen:
    • 1-lane transactions could always come in pairs, so the first would always belong to lane 0 and the second would always belong to lane 1. I have chosen not to implement this solution, because we need to have i_en individually for each lane. That means i_en can be set high for only one lane, so while the 2-to-1-lane conversion always produces pairs, the returning samples would no longer be paired. One way we could handle that would be to have i_en be an attribute outside of lanes, or to have i_en or-ed together and overridden to be the same for all lanes before 2-to-1 conversion, but this is undesirable.
    • Another option would be to always convert a 1-lane transaction to a 2-lane transaction where the second lane is an empty lane. This is undesirable, because QSPIController would then have to be aware of whether we're in the DDR + divisor=0 case, in which case samples return on lane [1], while in all other cases samples return on lane [0]. It seems very hackish from the perspective of the QSPIController: samples should always return on the same lane independent of speed configuration. We can't just return the DDR + divisor=0 lane[1] samples on lane[0], because that's not a generic solution; sometimes we care about samples taken during the rising edge, and sometimes for other protocols we may actually want to sample on both edges. So this option is undesirable.
    • What I have implemented is that IOClocker's IO2LaneTo1Lane converter adds extra information to meta that identifies which lane each transaction originally belonged to. Then the entire IOStreamerTop structure processes the transaction with the expanded meta field, and when samples come back to IOClockerDeframer, they have the corresponding fields tag and last.
      • tag tells the IO1LaneTo2Lane upconverter which lane each sample belongs to
      • last tells the IO1LaneTo2Lane upconverter whether it needs to wait for another sample before assembling the 2-lane transaction and sending it to QSPIDeframer.
      • Note that last is generated in IO2LaneTo1Lane by looking at lane 1's i_en. If lane[1].i_en is high, then lane[0].meta.last is low and lane[1].meta.last is high; otherwise lane[0].meta.last is high (see the sketch after this list).
      • This scheme results in:
        • QSPIController simply sees lane[0] as clock falling edge, and lane[1] as clock rising edge. It will simply set i_en high for the edge(s) for which it's interested in receiving a sample, and the sample will come back on the same lane number(s) as on which i_en was set high. This makes things generic for future protocols.
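To make the tag/last scheme above concrete, here is a hedged Amaranth-style sketch of the rule; the signal names are illustrative only and are not the PR's actual code:

```python
from amaranth import Module, Signal

m = Module()

lane0_i_en = Signal()  # QSPIController wants a sample relative to the falling edge (lane 0)
lane1_i_en = Signal()  # QSPIController wants a sample relative to the rising edge (lane 1)

lane0_tag  = Signal()  # carried in meta: original lane index of this 1-lane transfer
lane1_tag  = Signal()
lane0_last = Signal()  # carried in meta: marks the final 1-lane transfer of the pair
lane1_last = Signal()

m.d.comb += [
    lane0_tag.eq(0),
    lane1_tag.eq(1),
    # If lane 1 also requests a sample, its transfer closes the pair;
    # otherwise lane 0's transfer is already the last one.
    lane0_last.eq(~lane1_i_en),
    lane1_last.eq(lane1_i_en),
]
```

On the way back, IO1LaneTo2Lane can then route each sample to the lane given by tag and emit the assembled 2-lane transaction whenever a sample with last set arrives.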

Added support for additionally delaying when the sample is taken.
Normally, samples are taken at the time when the input signals that were
launched together with i_en change; this allows us to take samples later,
by a specified number of sync clock cycles. Additionally, for a ratio=2
IOStreamer we can also delay by half a clock cycle.
This changes how the SDR input sampling test works. Before, only positive
edges on clk_out would signify to the testbench that the input has been
sampled. The old behavior put a constraint on the payload: i_en must be
high only for every second payload. This was less than ideal for developing
stronger skid buffer tests.

This change slows down the clk_out signal, and now any edge (positive or
negative), means that we have sampled something. In the waves only the
shape of clk_out changes, the rest of the signals stay the same. (i.e.
the same values are sampled at the same times)

The DDR input sample test does not require a similar change.
There was a corner case in IOStreamer where a sample could be lost
if it arrives at the skid buffer on the same cycle as when another
sample is removed from the skid buffer.
This PR also adds many more testcases.
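To illustrate the corner case described above, here is a hypothetical Amaranth sketch of a small sample buffer, not the PR's actual skid buffer, whose input can never be back-pressured and which accepts a new sample on the same cycle an old one is popped:

```python
from amaranth import Elaboratable, Module, Mux, Signal

class SampleBuffer(Elaboratable):
    def __init__(self, width, depth):
        self.width   = width
        self.depth   = depth
        self.i_data  = Signal(width)
        self.i_valid = Signal()        # a new sample arrives this cycle (cannot be refused)
        self.o_data  = Signal(width)
        self.o_valid = Signal()
        self.o_ready = Signal()        # downstream pops the oldest sample this cycle

    def elaborate(self, platform):
        m = Module()
        entries = [Signal(self.width, name=f"entry{i}") for i in range(self.depth)]
        level   = Signal(range(self.depth + 1))

        pop  = self.o_valid & self.o_ready
        push = self.i_valid

        m.d.comb += [
            self.o_valid.eq(level != 0),
            self.o_data.eq(entries[0]),
        ]

        with m.If(pop):
            # Shift the remaining entries down. This also happens on a cycle
            # where a new sample arrives, so nothing is dropped.
            for i in range(self.depth - 1):
                m.d.sync += entries[i].eq(entries[i + 1])

        # Occupancy after the pop; an incoming sample lands at this position.
        # The buffer is assumed to be sized from the maximum latency, so it
        # never overflows.
        level_after = Mux(pop, level - 1, level)
        with m.If(push):
            with m.Switch(level_after):
                for i in range(self.depth):
                    with m.Case(i):
                        m.d.sync += entries[i].eq(self.i_data)
            m.d.sync += level.eq(level_after + 1)
        with m.Else():
            m.d.sync += level.eq(level_after)

        return m
```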
This better conveys the intent that we want to sample the
"old" value of the inputs in case they change at the same time
as clk_out, and it may work better if input_generator_tb changes
in the future.
This change allows o_stream to take at least one transfer, as long as
there are no sample transfers in flight. This will prevent deadlocks,
for example if i_stream.ready is driven combinatorially with something like:
```m.d.comb += ready.eq(valid)```
which is valid according to stream rules.

This commit also adds tests for this.
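A minimal sketch of the resulting condition, assuming a hypothetical samples_in_flight counter and glossing over skid buffer occupancy (these are not the PR's actual signal names):

```python
from amaranth import Module, Signal

m = Module()

i_stream_ready    = Signal()   # driven by the consumer of sampled data
o_stream_ready    = Signal()   # back-pressure towards the producer of output transfers
samples_in_flight = Signal(4)  # launched transfers still awaiting their samples

# Accept an output transfer either when the sample consumer is ready, or when
# nothing is in flight yet, so a combinatorial ready.eq(valid) consumer of
# i_stream cannot deadlock the streamer.
m.d.comb += o_stream_ready.eq(i_stream_ready | (samples_in_flight == 0))
```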
@purdeaandrei purdeaandrei force-pushed the f_refactor_iostreamer branch 2 times, most recently from 7250f61 to 9b22ff5 on September 22, 2024 07:35
There is a slight change in behavior: because the "bypass" feature has been removed,
CS clock pulses must now be at least 1 QSPI cycle long.

`sample_delay_half_clocks` is now runtime-selectable.

`max_sample_delay_half_clocks` is now a configuration option, for the IOStreamerTop to know how many resources to allocate.

`min_divisor` is now a configuration option which, if set to > 0, allows IOStreamer to optimize away some unneeded resources.

The bulk of the changes is about cutting up the functionality into smaller sub-components.
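As a rough sketch of the resource arithmetic involved (plain Python following the formulas stated earlier in this PR, not the gateware itself):

```python
import math

def skid_buffer_depth(max_latency, min_divisor):
    """Skid buffer depth needed to never lose a sample, per the PR description."""
    if min_divisor == 0:
        # divisor is unconstrained, so no optimization is possible:
        # one stage per unit of latency
        return max_latency
    return math.ceil(max_latency / min_divisor)

# For example, with max_latency = 4:
#   skid_buffer_depth(4, 0) == 4
#   skid_buffer_depth(4, 4) == 1   # can shrink to a single stage
```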
@purdeaandrei
Contributor Author

TLDR: I think I was wrong about the following statement, and I want to rework the sample_delay-related resource optimizations to make it possible to use a smaller skid buffer and fewer delayer shift register stages at slow clock speeds.

If we don't constrain divisor with a min_divisor configuration option, then I don't think that we can make any kind of optimization and we are forced to have max_latency shift register stages for i_en/meta and max_latency skid buffer stages as well.

I re-thought the statement above and I think I was wrong. An optimization that is a lot closer to what Cathrine originally described on Discord is possible, without needing a non-zero min_divisor. I think there's a way to constrain sample_delay_half_clocks differently: it could be a signal of size divisor_width, and if there is a rule that sample_delay_half_clocks <= divisor * 2 at all times, then we can get away with 3 skid buffer stages (that is: maximum Buffer latency + 1), and we could use only a counter for the SampleDelayer (or even re-use the existing counter in IOStretcher), plus some shift registers for the buffer latency.

This can then optionally be further extended if we add a max_additional_sample_delay_half_clocks setting. The additional clocks would be expensive (just as expensive as the current implementation in this PR), because they would need max_additional_sample_delay_half_clocks // 2 + max_additional_sample_delay_half_clocks % 2 shift register stages, and ceil((max_additional_sample_delay_half_clocks // 2 + max_additional_sample_delay_half_clocks % 2) / min_divisor) additional skid buffer elements.
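Transcribing those cost formulas into a small illustrative helper (assuming min_divisor > 0 so the ceil() is well defined; this counts only the extra cost of the extension):

```python
import math

def additional_delay_cost(max_additional_half_clocks, min_divisor):
    shift_stages = (max_additional_half_clocks // 2
                    + max_additional_half_clocks % 2)   # i.e. ceil(half_clocks / 2)
    extra_skid   = math.ceil(shift_stages / min_divisor)
    return shift_stages, extra_skid

# e.g. additional_delay_cost(2, 1) == (1, 1)
```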

Now, assuming min_divisor is 0 (which it would be for most hardware that has a reconfigurable divisor), I'm not sure which is better; for example:

  • hardcoded max_sample_delay_half_clocks = 4, with ICE40 ddr buffers, would result in 5 shift registers, and skid buffer depth of 4.
  • hardcoded max_additional_sample_delay_half_clocks = 2, with ICE40 ddr buffers, would result in (re-)using a 16-bit counter plus some logic to look at the counter, 4 shift registers, and a skid buffer depth of 4
    • so similar resource usage to my original implementation, and it would have the same amount of supported sample delay at divisor==0; however it would have the advantage of supporting much larger sample delays at larger divisors.
      i.e. is the added complexity worth the additional large sample delay support that only works at large divisors?

I think I'm leaning towards it being worth it.
Imagine we're driving not an ideal transmission line, but a high-impedance, high-capacitance bus, where the response also comes back on a high-impedance, high-capacitance line. Then we're forced to drive it at a slow clock speed, where divisor is a tad larger than the one-way propagation time measured in f_sync cycles. If we then sample on the following opposite edge, we haven't seen the return signal yet, so we'd want a sample delay of at least divisor*2 half-clock-cycles, and with the revised idea we basically get that for free for arbitrarily slow buses, without additional resources. The current implementation in this PR would make it very expensive, depending on how slow a bus we would want to support.
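A hedged numeric illustration of that scenario (the numbers are made up; only the relationship follows the paragraph above):

```python
one_way_propagation_sync_cycles = 6              # slow, capacitive bus
divisor = one_way_propagation_sync_cycles + 1    # "a tad larger" than the propagation time
required_sample_delay_half_clocks = 2 * divisor  # == 14; the response is not visible on the
                                                 # opposite edge, only about a bit period later

# Since the required delay scales with divisor, the revised idea covers arbitrarily
# slow buses without extra shift register or skid buffer stages.
```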

Next time I'm looking at this I think I'll try giving this idea an implementation.

@whitequark
Member

Just to be clear, I still plan to re-do the infrastructure myself at some point. As long as you're fine with duplicated work, that's going to be OK.

@purdeaandrei
Contributor Author

That wasn't clear; anyway, thanks for clarifying. I'll set this to draft; feel free to reuse any of my code if you find it useful.

@purdeaandrei purdeaandrei marked this pull request as draft September 22, 2024 18:34
@whitequark
Member

The reason I'm saying that is because I'd like the implementation to be suitable to upstream to amaranth-stdio. Yours is very complex and has a lot of idiosyncrasies, which would be fine for a Glasgow applet but isn't going to cut it for a generic and reusable module that I expect everyone to use. I've scrolled through it and I don't feel that it is suitable to be moved to amaranth-stdio; I do appreciate that you've put an enormous amount of work into it and I would like to discuss the mechanism with you to make sure that whatever eventually ends up in amaranth-stdio is of high quality.
