[Arc] Automatic module partition #7650
base: main
Conversation
Hey @SpriteOvO, thanks for working on a very challenging and interesting corner of the optimization space 🥳! One of the challenges I see, and you also mentioned, is that different users may have wildly different partitioning requirements. Arcilator probably wants to break the actual simulation workload into parallel tasks, while FPGA partitioning may want to find isolated clusters with relaxed communication. It's very challenging to build a generic pass that performs well for all needs.

You mentioned that your main goal is to find parallelization opportunities for Arcilator. That is awesome 😍 🥳! Arcilator could really benefit from some coarse-grained multi-threading.

A lot of the analysis and bookkeeping you have to do comes from the presence of modules and the potential combinational paths that pass through parts of modules. Have you considered doing your partitioning directly in the Arc dialect? (You can always generalize later if feasible.) After conversion to Arc, the circuit becomes a graph of the computations necessary to advance the design by one cycle. At this stage, state and combinational paths are clearly visible, and there is no module hierarchy or hidden combinational path anymore. This would allow you to focus entirely on the partitioning itself: finding independent regions of computation, maybe grouping them into a new op.
Drop-in comment: I haven't looked over this PR, but when I had a partitioning pass in the MSFT dialect (it would pull things out from deep in the module hierarchy and do cross-module movements, but was obnoxiously complex and wasn't actively used), I had an attribute to specify the partition. I required the designer to specify the partition, but I could easily see different passes doing some target-specific heuristics to label the ops and then running the design partition pass.
Thanks @SpriteOvO and @CircuitCoder for doing this work! It's really important; T1 is counting on this PR to remove the Verilator emulation backend.
We should forbid inout at the partition boundary; the only usage of inout is for chip/FPGA IO, which will be emulated via a tri-state IO.
Unlike the overhead of Verilog semantics, which are event driven, and thus Verilator uses the …
My personal view: just pre-analyse all combinational paths and forbid cutting through them. Cutting a combinational path delivers bad simulation performance and high complexity in the algorithm.
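To make the suggested pre-analysis concrete, here is a small hedged sketch (in Python, with hypothetical names — nothing here is CIRCT API): mark every edge that lies on a register-to-register combinational path as forbidden for cutting, so the partitioner may only cut nets driven directly by sequential elements.

```python
from collections import defaultdict

def forbidden_cut_edges(edges, is_seq):
    """Return the edges that lie on a combinational path between
    sequential elements. Cutting such an edge would route a
    combinational signal across the partition boundary mid-cycle,
    so only edges driven directly by a register remain cuttable."""
    succ = defaultdict(list)
    pred = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    def comb_reach(frontier, nbrs):
        # Combinational nodes reachable from the frontier without
        # passing through another sequential element.
        seen = set()
        while frontier:
            n = frontier.pop()
            for m in nbrs[n]:
                if m not in seen and not is_seq[m]:
                    seen.add(m)
                    frontier.append(m)
        return seen

    seqs = [n for n in is_seq if is_seq[n]]
    fed_by_seq = comb_reach(list(seqs), succ)  # downstream of a register
    feeds_seq = comb_reach(list(seqs), pred)   # upstream of a register

    return {(u, v) for u, v in edges
            if not is_seq[u] and u in fed_by_seq
            and (is_seq[v] or v in feeds_seq)}
```

For a chain `regA -> g1 -> g2 -> regB`, the edges `g1 -> g2` and `g2 -> regB` are forbidden, while the register-output edge `regA -> g1` remains a legal cut point.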
I apologize for the super delayed response; I fell ill for the last couple of weeks.

Partitioning against Arc seems to have a lot of merit. After discussing with @SpriteOvO and @sequencer, we decided to try to migrate this algorithm to the Arc dialect, which @SpriteOvO is already working on; it should result in a PoC after this weekend.

My personal reasoning is as follows. If I understand correctly, partitioning during the Arc pass mainly involves partitioning against states, or more precisely, against the new value to be written into each state at the end of each cycle. The entire computation subtree would be placed into one half, and some operations would be copied into multiple halves.

Our original motivation for partitioning at the HW dialect level was that: (a) we can partition against comb operations, which can be simpler (the graph representation is more intuitive to construct) and reflects hardware utilization well if the result is synthesized onto an FPGA; (b) we can use module boundaries (which carry some semantic information) to guide the partitioning. It turns out, however, that we overlooked cases where a simple unique partition of operations does not work (a combinational path may have to pass through the boundary more than once [2]), thus breaking the correctness of lockstep simulation. So even if partitioning is done against operations, some operations would have to be duplicated. This is actually more difficult than first partitioning against states and then performing some sort of common subexpression extraction (see [1]).

My personal concerns with partitioning against states are:
After finishing the PoC, we intend to use rocket-chip with the default configuration as a test.

[1] We currently implement a simple sub-expression extraction algorithm:

[2] Essentially it boils down to this pattern:
If A and B are split apart, then no matter where we place C, there would be a comb path that traverses the boundary at least twice.
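A small sketch of the counting argument, under an assumed reconstruction of the elided diagram (the pattern here is hypothetical): if C is a shared block with a combinational feed-through used by both A and B, then whichever side C is placed on, the feed-through path of the other side enters and leaves C, crossing the boundary twice.

```python
def boundary_crossings(path, part):
    """Count how many times a combinational path crosses the
    partition boundary, given a node -> partition assignment."""
    return sum(part[a] != part[b] for a, b in zip(path, path[1:]))

# Assumed pattern: both A and B have a comb path routed through the
# shared block C and back into themselves.
paths = [["A", "C", "A"], ["B", "C", "B"]]
for c_side in (0, 1):                       # try placing C on either side
    part = {"A": 0, "B": 1, "C": c_side}
    worst = max(boundary_crossings(p, part) for p in paths)
    assert worst >= 2                       # some path always crosses twice
```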
@SpriteOvO has just pushed an update containing an initial PoC implementation of Arc-based, state-directed partitioning. In brief, the partitioning operates on … Hence, a multithreaded driver should call the model in the following way: for each thread, …
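As a hedged sketch of the calling convention described above (a hypothetical driver, not arcilator's actual API): each thread evaluates its own chunk against the primary storage while its writes go to shadow state, then all threads rendezvous before the shadows are flushed for the next cycle.

```python
import threading

def run_lockstep(chunk_evals, sync, cycles):
    """Drive a partitioned model in lockstep. Each thread runs one
    chunk's eval function per cycle; the barrier action `sync` runs
    exactly once per cycle, after all threads have arrived, to flush
    shadow states into the primary storage."""
    barrier = threading.Barrier(len(chunk_evals), action=sync)

    def worker(eval_chunk):
        for cycle in range(cycles):
            eval_chunk(cycle)   # compute phase: writes go to shadow state
            barrier.wait()      # sync phase: flush shadows, advance cycle

    threads = [threading.Thread(target=worker, args=(f,))
               for f in chunk_evals]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The `Barrier(n, action=...)` form guarantees the flush happens in exactly one thread while all others are parked, which is the data-race-free window the lockstep scheme relies on.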
Right now the partitioning happens at the end of the hw -> arc pipeline, contained in two passes. The first one (PartitionPass) is placed before state allocation; it marks state writes with assigned chunks and creates shadow states. The second one (PartitionClonePass) is placed after state allocation but before clock -> func; it splits the original …

Changes to other passes introduced in this commit
Implementation detail

Partition planning involves finding out the dependencies between states, and the computation workload required to update each state. Based on this information, a partition plan is generated, and each …

Shadow states are created for each HW state, and are allocated in the chunks' temporary storage. Inside an …

Between these two passes, states are allocated, so by then the size of each storage is fixed. The …
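Since the planning algorithm itself is still missing, here is one hypothetical baseline it could start from (a sketch, not the PR's algorithm): treat each state's update cone as an indivisible unit with an estimated cost, and balance cones across chunks greedily, heaviest first (classic LPT scheduling); operations shared between cones simply get duplicated, which matches the duplication the pass already performs.

```python
def plan_partition(state_cost, num_chunks):
    """Greedy partition plan: assign each state's update cone to the
    currently lightest chunk, processing heavier cones first.
    `state_cost` maps a state name to the estimated cost of its
    update cone. Returns the state -> chunk plan and chunk loads."""
    loads = [0] * num_chunks
    plan = {}
    for state, cost in sorted(state_cost.items(), key=lambda kv: -kv[1]):
        chunk = min(range(num_chunks), key=loads.__getitem__)
        plan[state] = chunk
        loads[chunk] += cost
    return plan, loads
```

A real planner would additionally weigh state-to-state dependencies so that tightly coupled cones land in the same chunk, but the load-balancing skeleton stays the same.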
Unfinished work

There is some unfinished (but mostly isolated) work to be done. This list will be updated to reflect the implementation. Most notably, the partition planning algorithm is missing.
Questions

We have a few questions regarding the inner workings of other parts of arcilator, answers to which may lead to a better implementation of the partitioning procedure:
This is some really cool work that you are doing 🥳! Very exciting 🙌

One thing I'm wondering is how this interacts with the LowerState improvements and bugfixes landing as part of #7703. That PR gets rid of the LegalizeStateUpdate pass (which I think you can't use for your data race hazards anyway), and it also gets rid of …

Do you think we could somehow create an …?

Another thing we need to figure out is how to deal with memory, because that is going to be a major source of headaches. I like your idea of aggregating memory writes into a list of changes to be applied later, and then flushing those into simulation memory in the sync phase. That would be very neat!

Your implementation currently requires quite a few attributes on operations. These attributes are discardable, however, so in theory everything should still work even when you delete them. Is there a way we could encode the partitioning using something like task ops, where you don't need to put attributes on ops, and we could have a single lowering pass that takes all …?

Having a representation of tasks in the IR would also allow you to group together states and memories that are only used inside a thread. The AllocateState pass already does this for you if all state allocation is nested inside a block, and you should be able to trigger this manually using an explicit …
This should in theory allocate all state in the task into one contiguous block, which is then allocated in the global storage (which should be fine, because it's just one region of memory that only this thread touches):
Thanks for the insightful reply, @fabianschuiki! 😻

I briefly looked into #7703, and it seems it can simplify the implementation a lot. All attributes introduced in this PR can actually be dropped if we rebase upon #7703 and use the …

I haven't come up with a good solution to this problem. One way would be to allow …

@SpriteOvO and I will try to rebase this PR onto #7703 and see if the idea works.

Task dependencies
Assuming that we place writes to the same state into the same task, I think we might be able to infer a finer dependency relation between tasks. If all operations are placed into tasks, then one task needs to be scheduled after another if and only if it reads a state written by the other task earlier in program order. We don't even need extra structures (operands, attributes) in the IR. Let's use a fancy notation to represent the transitive closure of this relation: …

This relation also gives us a simple procedure for merging tasks: A and B can be merged if and only if there does not exist a C such that …
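A sketch of this inference (hypothetical helper names; writing `before[t]` for the set of tasks that must run before `t`, i.e. the transitive closure alluded to above): the closure is computed by saturation, and the merge check looks for a third task strictly between the two candidates, which is the natural reading of the elided condition.

```python
def transitive_closure(direct):
    """direct[t] = tasks t directly depends on (t reads a state they
    write earlier in program order). Returns the saturated relation:
    before[t] = all tasks that must be scheduled before t."""
    before = {t: set(deps) for t, deps in direct.items()}
    changed = True
    while changed:
        changed = False
        for t, deps in before.items():
            extra = set().union(*(before.get(d, set()) for d in deps)) - deps
            if extra:
                deps |= extra
                changed = True
    return before

def can_merge(a, b, before):
    """A and B may be fused iff no third task sits strictly between
    them in the dependency order (in either direction)."""
    def between(x, y):
        return any(c not in (x, y)
                   and x in before.get(c, set())   # x runs before c
                   and c in before.get(y, set())   # c runs before y
                   for c in before)
    return not (between(a, b) or between(b, a))
```

On a chain A -> B -> C (C depends on B, B on A), A and B or B and C can be merged, but A and C cannot, because B would be squeezed between them.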
Therefore we might be able to get an alternative bottom-up partitioning scheme by combining atomic …

Nevertheless, for the top-down partitioning method, this kind of dependency can still be used to reflect the semantics of …

State grouping

After some thought, it seems to me that …

The downside is that these storages would all be allocated flat, and the user of the model cannot control the allocation. It may be desirable for dynamically scheduled simulations to reuse unused storage space between tasks for better cache locality.
This is a draft PR, which mostly serves the purpose of gathering feedback. The implementation is incomplete and very hacky in places. See the "Unresolved issues" section below.
This PR introduces a new pass (--hw-partition) into CIRCT to partition HW top-level modules into two or more parts. Each part contains a portion of the logic and state of the original circuit. Additional ports are created for signals that cross partition boundaries. The pass tries to preserve the original modules and hierarchy.

Motivation
The primary motivation is multithreaded simulation with Arcilator. To be more concrete, the typical targeted use case of this pass is high-performance multithreaded simulation using Arcilator, with HW modules compiled from FIRRTL, on a single CPU, with (mostly) static scheduling. Time stepping happens in lockstep.
However, doing this at the HW level may also benefit other use cases, e.g. large-scale verification with multiple FPGAs.
Unresolved Issues
inout?

Acknowledgement
Thanks to @CircuitCoder for mentoring this work.