
[Arc] Automatic module partition #7650

Draft · wants to merge 2 commits into main

Conversation

@SpriteOvO (Member)

This is a draft PR that mostly serves the purpose of gathering feedback. The implementation is incomplete and very hacky in places. See the "Unresolved Issues" section below.

This PR introduces a new pass (--hw-partition) into CIRCT to partition HW top-level modules into two or more parts. Each part contains a portion of the logic and state of the original circuit. Additional ports are created for signals that cross partition boundaries. The pass tries to preserve the original modules and hierarchy.

Motivation

The primary motivation is multithreaded simulation with Arcilator. More concretely, the typical targeted use case of this pass is high-performance multithreaded simulation using Arcilator, with HW modules compiled from FIRRTL, on a single CPU, with (mostly) static scheduling. Time stepping happens in lock-step.

However, doing this at the HW level may also benefit other use cases, e.g. large-scale verification with multiple FPGAs.

Unresolved Issues

  • How do we deal with inout?
  • How do we utilize operation duplication, i.e. having a single operation appear in multiple parts to reduce communication? Currently no operation is duplicated. We could also approach this from the other end: partition only by states first, duplicate the entire combinational subtree, and then optimize.
  • What graph partitioning optimization goal should we use? Most graph partitioning algorithms/libraries only optimize for cut size (in our case, the total width of cross-module ports, i.e. the amount of communication), while the size of each component only serves as a constraint. METIS supports multi-constraint partitioning, which may help here.
  • The handling of combinational paths is incorrect. It currently edits existing modules in place; when there are multiple nested instances of the same module, this approach fails.

Acknowledgement

Thanks to @CircuitCoder for mentoring this work.

@fabianschuiki (Contributor)

Hey @SpriteOvO, thanks for working on a very challenging and interesting corner of the optimization space 🥳!

One of the challenges I see, and you also mentioned, is that different users may have wildly different partitioning requirements. Arcilator probably wants to break the actual simulation workload into parallel tasks, while FPGA partitioning may want to find isolated clusters that have relaxed communication. It's very challenging to build a generic pass that performs well on all needs.

You mentioned that your main goal was to find parallelization opportunities for Arcilator. That is awesome 😍 🥳! Arcilator could really benefit from some coarse-grained multi-threading. A lot of the analysis and bookkeeping you have to do comes from the presence of modules and the potential combinational paths that pass through module parts. Have you considered doing your partitioning directly in the Arc dialect? (You can always generalize later if feasible.) After conversion to Arc, the circuit becomes a graph of the necessary computations to advance the design by one cycle. At this stage, state and combinational paths are clearly visible, and there is no module hierarchy or hidden combinational path anymore. This would allow you to focus entirely on the partitioning itself: finding independent regions of computation, maybe grouping them into a new op (maybe some arc.tasklet?), and then updating the LLVM lowering to allow execution of these tasks on separate threads.

@teqdruid (Contributor)

One of the challenges I see, and you also mentioned, is that different users may have wildly different partitioning requirements. Arcilator probably wants to break the actual simulation workload into parallel tasks, while FPGA partitioning may want to find isolated clusters that have relaxed communication. It's very challenging to build a generic pass that performs well on all needs

Drop-in comment: I haven't looked over this PR, but when I had a partitioning pass in the MSFT dialect (it would pull things out from deep in the module hierarchy and do cross-module movements, but it was obnoxiously complex and wasn't actively used), I used an attribute to specify the partition. I required the designer to specify the partition, but I could easily see different passes doing some target-specific heuristics to label the ops and then running the design partitioning pass.

@sequencer (Contributor)

Thanks @SpriteOvO and @CircuitCoder for doing this work! It's really important; T1 is counting on this PR to remove the Verilator emulation backend.
Here are my two cents on the problems, but please follow @fabianschuiki's comments to get it upstreamed.

How do we deal with inout?

We should forbid inout at partition boundaries. The only usage of inout is for chip/FPGA IO, which will be emulated via tri-state IO.

How do we utilize operation duplication, i.e. a single operation appears in multiple parts to reduce communication? Currently no operation is duplicated. We can also approach this from the other end: partition by only by states first, duplicate the entire combinatory subtree, and then do optimization.

  • All states (reg, mem) cannot be duplicated; they live in the data cache.
  • If modules are partitioned, the same .text can run on different CPUs/FPGAs. The benefit comes from saving LLC, but communication always exists.

What graph partitioning optimization goal should we use? Most graph partitioning algorithm / library only optimize for cut-size (or total width of cross module ports in our case, or amount of communication), while the size of each component serves as constraints.

Unlike Arc, Verilog has event-driven semantics with the corresponding overhead, which is why Verilator uses its mtask-based approach to simulation. Arc simulation is a computation over a connection graph. I personally prefer directly partitioning it into multiple small, non-overlapping graphs; this suits both FPGA and in-cluster CPU simulation.

The implementation of combinational path is incorrect. Currently it directly edits existing modules. When there are multiple nesting instances of a same module, this approach will fail.

I would personally prefer to pre-analyze all combinational paths and forbid cutting through them; cutting a combinational path yields bad simulation performance and high algorithmic complexity.

@CircuitCoder (Contributor)

I apologize for the very delayed response; I was ill for the last couple of weeks.

Partitioning on Arc seems to have a lot of merit. After discussing with @SpriteOvO and @sequencer, we decided to try migrating this algorithm to the Arc dialect, which @SpriteOvO is already working on; a PoC should be ready after this weekend. My personal reasoning is as follows:

If I understand correctly, partitioning at the Arc level mainly involves partitioning against states, or more precisely, against the new values to be written into states at the end of each cycle. Each state's entire computation subtree would be assigned to one half, and some operations would be copied into multiple halves.

Our original motivation for partitioning at the HW dialect level was that: (a) we can partition against comb operations, which can be simpler (the graph representation is more intuitive to construct) and maps well to hardware utilization if the result is synthesized onto an FPGA; (b) we can use module boundaries (which carry some semantic information) to guide partitioning. It turns out, however, that we overlooked cases where a simple disjoint partition of operations would not work (a combinational path has to cross the boundary more than once [2]), thus breaking the correctness of lock-step simulation. So even if partitioning is done against operations, some operations would have to be duplicated. This is actually more difficult than first partitioning against states and then doing some sort of common subexpression extraction (see [1]).

My personal concerns with partitioning against states are:

  • How well does the graph partitioning algorithm perform? The information we give the algorithm is now coarser. The graph is constructed using the following procedure:
    • Nodes are states; edge weights are the bit-width of the depended-on state.
    • The above information alone would push METIS to put all the hot registers into one half, which results in severe computation imbalance. Therefore we add two node weight constraints: weight 1 = sum of the bit-widths of all operations within the computation subtree, and weight 2 = bit-width of the state itself. These weights reflect computation balance and state allocation balance, respectively. For CPU simulation, we can ignore the constraint on weight 2 for now. (A sketch of the corresponding METIS call follows this list.)
  • How much computation is duplicated?
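For concreteness, a multi-constraint METIS invocation matching the graph described above might look roughly like the sketch below. The CSR layout, weight arrays, and all names other than METIS_PartGraphKway are placeholders rather than actual code from this work:

// Hypothetical sketch: hand the state graph to METIS for multi-constraint
// k-way partitioning. vwgt holds 2 weights per node (subtree op bits, state
// bits); adjwgt holds the bit-width of the depended-on state per edge.
#include <metis.h>
#include <vector>

std::vector<idx_t> partitionStates(const std::vector<idx_t> &xadj,   // CSR row offsets
                                   const std::vector<idx_t> &adjncy, // CSR column indices
                                   const std::vector<idx_t> &adjwgt, // edge weights
                                   const std::vector<idx_t> &vwgt,   // 2 node weights per state
                                   idx_t numChunks) {
  idx_t nvtxs = xadj.size() - 1; // number of states
  idx_t ncon = 2;                // two balancing constraints: computation and storage
  idx_t objval;                  // resulting edge cut, i.e. total communication width
  std::vector<idx_t> part(nvtxs);
  METIS_PartGraphKway(&nvtxs, &ncon,
                      const_cast<idx_t *>(xadj.data()),
                      const_cast<idx_t *>(adjncy.data()),
                      const_cast<idx_t *>(vwgt.data()),
                      /*vsize=*/nullptr,
                      const_cast<idx_t *>(adjwgt.data()),
                      &numChunks, /*tpwgts=*/nullptr, /*ubvec=*/nullptr,
                      /*options=*/nullptr, &objval, part.data());
  return part; // part[i] = chunk assigned to state i
}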

After finishing the PoC, we intend to use rocket-chip with its default configuration as a test case.


[1] We currently implement a simple sub-expression extraction algorithm:
If an SSA value is used in multiple halves and only depends on states in one half, then the computation of that SSA value is moved into that half, placed after the end-of-cycle state update.
This may hurt load balancing, and should perhaps only be enabled if the computation subtree is large enough or duplicated enough times.

[2] Essentially it boils down to this pattern:

reg A -- op C -- A
reg B -/      \- B

If A and B are split apart, then no matter where we place C, there would be a comb path that traverses the boundary at least twice.
To avoid this during graph construction, we would have to analyze regs pairwise to figure out if they need to be placed together, which is just too costly.

@CircuitCoder (Contributor)

@SpriteOvO has just pushed an update, containing an initial PoC implementation of Arc-based state-directed partitioning.

In brief, the partitioning operates on arc.model, and is done by assigning each "real" state write (outputs and hardware states) to one of the partitioned arc.models (called "chunks"). The original state WAR legalization pass is not sufficient, as there are also races between chunks, so it is disabled when partitioning is enabled. Instead, the simulation entry point is split into two functions: the first (top_<chunk>_eval) does all computations and writes state updates into a temporary shadow state; the second (top_<chunk>_sync_eval) copies the shadow state into the persistent state.

Hence, a multithreaded driver should call the model in the following way:

For each thread,

<cycle start> -> Sync -> (the leader thread) updates IO
              -> Sync -> call top_<threadId>_eval
              -> Sync -> call top_<threadId>_sync_eval
              -> <cycle end, loop to cycle start>
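For illustration, a two-chunk driver following this protocol might look roughly like the sketch below. The entry-point signatures (a shared global-storage pointer plus one private storage pointer per chunk) are an assumption based on the description above, not the actual generated interface:

// Hypothetical driver sketch; entry-point names and signatures are assumptions.
#include <barrier>
#include <thread>
#include <vector>

extern "C" {
void top_0_eval(void *global, void *local);
void top_0_sync_eval(void *global, void *local);
void top_1_eval(void *global, void *local);
void top_1_sync_eval(void *global, void *local);
}

using EvalFn = void (*)(void *, void *);
static EvalFn evalFns[] = {top_0_eval, top_1_eval};
static EvalFn syncFns[] = {top_0_sync_eval, top_1_sync_eval};

// locals.size() must match the number of chunks (2 in this sketch).
void runSimulation(void *global, std::vector<void *> locals, unsigned numCycles) {
  const unsigned numThreads = static_cast<unsigned>(locals.size());
  std::barrier sync(numThreads);

  auto worker = [&](unsigned tid) {
    for (unsigned cycle = 0; cycle < numCycles; ++cycle) {
      sync.arrive_and_wait();
      if (tid == 0) { /* leader thread updates IO here */ }
      sync.arrive_and_wait();
      evalFns[tid](global, locals[tid]); // eval: write updates into shadow state
      sync.arrive_and_wait();
      syncFns[tid](global, locals[tid]); // sync: commit shadow -> persistent state
    }
  };

  std::vector<std::jthread> threads;
  for (unsigned tid = 0; tid < numThreads; ++tid)
    threads.emplace_back(worker, tid);
  // std::jthread joins automatically when the vector goes out of scope.
}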

Right now the partitioning happens at the end of the hw -> arc pipeline, contained in two passes: the first one (PartitionPass) is placed before state allocation. This pass marks state writes with assigned chunks and creates shadow states. The second one (PartitionClonePass) is placed after state allocation but before clock -> func. This pass splits the original arc.model into 2 * numChunks new arc.models, two for each chunk: top_<chunk> and top_<chunk>_sync.

Changes to other passes introduced in this commit

  • Introduced a new block-like op, arc.sync, which resides inside the top-level block of an arc.model or inside an arc.clock_tree. Its semantics are that the operations inside it should be executed during the "sync" part of the chunk. This is used to facilitate race prevention (and also RAW legalization within a chunk itself).

  • During the arc-lower-state pass, we added an option to create an additional storage for temporaries. It is used for the "old clock" state created in this pass, and also for the shadow states created during partitioning. Thus each arc.model has two storages: one for global data (I/O, hardware states), shared by all chunks, and one for temporaries used only within the chunk.

    This will ultimately get transformed into arguments of the model entry point. Now users of partitioned models would need to pass two pointers into the model, one shared by all threads, and one unique for each thread.

    Additionally, the "store old clock" operation is put in an arc.sync block if separate temporary storage is enabled; however, adding a separate option may be a better choice.

  • The original WAR legalization is completely skipped if partitioning is requested.

  • arc-lower-clocks-to-funcs is modified to support multiple storages.

  • All AllocState ops are now created with a partition-role attribute to assist the partitioning pass.

Implementation detail

The PartitionPass does two things: Partition planning and shadow state creation.

Partition planning involves finding out the dependencies between states, and the computation workload required for updating each state. Based on this information, a partition plan is generated, and each arc.state_write op is marked with a chunk attribute if it is assigned to a chunk (typically this is the case for writes to global states).

Shadow states are created for each HW state, and are allocated in the chunks' temporary storage. Inside an arc.clock_tree, all state writes are redirected to the shadow storage. An arc.sync block is created inside the arc.clock_tree for each chunk, containing the operations for copying from the shadow state back to the global state.

Between these two passes, states are allocated, and now the size for each storage is fixed.

The PartitionClonePass is easier to explain: it clones the model twice for each chunk and removes operations whose chunk attribute does not equal the current chunk (a rough sketch of this filtering step follows the list below). Then, the two models are processed separately:

  • For the main model, all arc.sync ops are simply dropped.
  • For the sync model, all state writes besides those inside an arc.sync op are removed. arc.passthrough and arc.initial are also removed. Then, all arc.sync ops are moved to the end of the parent block and unwrapped. The following CSE pass is expected to clean up the other unused ops.
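For reference, the attribute-based filtering step could look roughly like this sketch. It is illustrative only, with an assumed integer "chunk" attribute, and is not the actual implementation:

// Illustrative only: prune ops assigned to a different chunk. Assumes the
// assignment is an IntegerAttr named "chunk" and that only result-less ops
// (e.g. state writes) carry it, so erasing them is safe.
#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Operation.h"
#include "llvm/ADT/SmallVector.h"
#include <cstdint>

static void pruneOtherChunks(mlir::Operation *clonedModel, uint64_t thisChunk) {
  llvm::SmallVector<mlir::Operation *> toErase;
  clonedModel->walk([&](mlir::Operation *op) {
    if (auto chunk = op->getAttrOfType<mlir::IntegerAttr>("chunk"))
      if (chunk.getValue().getZExtValue() != thisChunk)
        toErase.push_back(op);
  });
  // Erase outside the walk so the traversal isn't invalidated.
  for (mlir::Operation *op : toErase)
    op->erase();
}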

Unfinished works

There is some unfinished (but mostly isolated) work to be done. This list will be updated to reflect the implementation. Most notably, the partition planning algorithm is missing.

  • Partition planning is not implemented. Right now a random chunk is selected for each state. We plan to add an optional dependency to arcilator: METIS. When it's enabled, partitioning will use its graph partitioning algorithm; when it's disabled, we plan to sequentially split states into chunks of similar storage size.

  • Temporary state storage allocation is very inefficient. Right now it just reuses the original lower-state allocation, which allocates ~$\sum_{c \in chunks} temp_c$. Ideally, we would want ~$\max_{c \in chunks} temp_c$.

    Also, we should sort the global state storage based on which chunk writes each state, to reduce cacheline contention.

  • Partitioning with memories is untested and most certainly broken. For memory writes, we should create a shadow (addr, data) pair for each write port (see the sketch after this list). Besides that, memory support should be relatively easy to implement.

  • Arcs right now have to be fully inlined for weight statistics to work properly. See questions below.

  • Shadow states should only be created when needed (i.e. when the state is accessed outside the writing chunk). See the questions below for RAW within a chunk.

  • Once we have figured out the algorithm, the partition-role attribute on state writes should use an enum, not a string.
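To illustrate the deferred memory-write idea from the memories item above, a shadow write port could conceptually look like this; the struct layout and names are placeholders, not part of the implementation:

// Hypothetical shadow buffer for one memory write port: the eval phase records
// the (addr, data) pair instead of writing memory directly; the sync phase
// applies the recorded write to the real memory.
#include <cstdint>

struct ShadowWritePort {
  bool pending = false;
  uint64_t addr = 0;
  uint64_t data = 0; // assume the word fits in 64 bits for this sketch

  // eval phase: record the write instead of touching the memory.
  void record(uint64_t a, uint64_t d) { pending = true; addr = a; data = d; }

  // sync phase: flush the recorded write once all chunks finished eval.
  void flush(uint64_t *memory) {
    if (pending)
      memory[addr] = data;
    pending = false;
  }
};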

Questions

We have a few questions regarding the inner workings of other parts of arcilator, the answers to which may lead to a better implementation of the partitioning procedure:

  • Before the inlining pass, is each arc guaranteed to generate exactly one output value? If this assumption holds, then determining the computation workload for each value would be much easier, and we could move the partition planning pass before inlining.
  • Is it possible to simply move all state writes to the end of the block to resolve RAW hazards? The original legalization pass also creates shadow states, so it seems I'm missing something; a hint at a counterexample would help.

@SpriteOvO SpriteOvO changed the title [HW] Automatic HW module partition [Arc] Automatic HW module partition Oct 24, 2024
@SpriteOvO SpriteOvO changed the title [Arc] Automatic HW module partition [Arc] Automatic module partition Oct 24, 2024
@fabianschuiki (Contributor)

fabianschuiki commented Oct 24, 2024

This is some really cool work that you are doing 🥳! Very exciting 🙌

One thing I'm wondering is how this interacts with the LowerState improvements and bugfixes landing as part of #7703. That PR gets rid of the LegalizeStateUpdate pass (which I think you can't use for your data race hazards anyway), and it also gets rid of arc.clock_tree and arc.passthrough entirely. This means that the model will get a lot simpler and it might be easier to work with.

Do you think we could somehow create an arc.task op or similar to encode the eval -> sync -> eval -> sync -> eval -> ... chain of the different threads you were describing? The data hazards you describe at the thread/task boundaries are challenging, and it would be cool if the synchronization that is necessary would somehow be implicitly encoded or communicated in the IR. If we manage to build up a task structure in the IR, that might be a way to allow for many threads to crunch through the tasks in parallel. 🤔

Another thing we need to figure out is how to deal with memory, because that is going to be a major source of headache. I like your idea of aggregating memory writes into a list of changes to be applied later, and then flushing those into simulation memory in the sync phase. That would be very neat!

Your implementation currently requires quite a few attributes on operations. These attributes are discardable, however, so in theory everything should still work even when you delete them. Is there a way we could encode the partitioning using something like task ops, so that you don't need to put attributes on ops and we could have a single lowering pass that takes all arc.tasks and pulls them out into separate functions that can be scheduled in a thread pool?

Having a representation of tasks in the IR would also allow you to group together states and memories that are only used inside a thread. The AllocateState pass already does this for you if all state allocation is nested inside a block, and you should be able to trigger this manually using an explicit arc.alloc_storage (not sure if this is fully implemented though). For example:

arc.task {
  %localState = arc.alloc_storage %arg0
  %0 = arc.alloc_state %localState
  %1 = arc.alloc_state %localState
  ...
}

should in theory allocate all state in the task into one contiguous block, which is then allocated in the global storage (which should be fine because it's just one region of memory that only this thread touches):

arc.task {
  %localState = arc.storage.get %arg0 ...  // get pointer to the subregion of memory where all local state is
  ...
}

@CircuitCoder (Contributor)

Thanks for the insightful reply, @fabianschuiki! 😻

I briefly looked into #7703, and it seems that it can simplify the implementation a lot. All attributes introduced in this PR could actually be dropped if we rebased onto #7703 and used the arc.task abstraction. However, it made me realize that register-as-clock use cases may cause problems with our current cloning algorithm. If I understand correctly, after the LowerState pass in #7703, all state reads/writes obey program order in MLIR. A clock that combinationally depends on a register will get lowered into an scf.if whose condition depends on at least one arc.state_read, which in turn observes a prior arc.state_write (the register value for the new phase). The writes within the scf.if therefore cannot arbitrarily be made parallel with the prior write to the depended-on register.

I haven't come up with a good solution to this problem. One way would be to allow arc.tasks to arbitrarily depend on each other (see below), but our original targeted use cases for partitioning include static scheduling, which means it's better to avoid dependencies. Our current idea is to clone the computation subtree of the written value into the following read, which means some duplicated computation. Our rationale is that in real hardware designs, gated clocks or other clocks that depend on registers should not have deep combinational trees. This is somewhat equivalent to how we deal with other computations used in multiple chunks: they get duplicated (or don't get DCE'd).

@SpriteOvO and I will try to rebase this PR onto #7703, and see if the idea works.

Task dependencies

Do you think we could somehow create an arc.task op or similar to encode the eval -> sync -> eval -> sync -> eval -> ... chain of the different threads you were describing?

Assuming we place all writes to the same state into the same task, I think we can infer a finer dependency relation between tasks. If all operations are placed into tasks, then one task needs to be scheduled after another if and only if it reads a state written by a task earlier in program order. We don't even need extra structures (operands, attributes) in the IR. Let's use some notation for the transitive closure of this relation: A <_T B means that A needs to be scheduled before B.

This relation also gives us a simple procedure for merging tasks. A and B can be merged if and only if there does not exist a C such that A <_T C and C <_T B. The merge procedure would be (assuming B follows A in program order; a small sketch of this check follows the list):

  • Move all tasks between A and B: if C is between A and B in program order, move C before A if C <_T B, or move C after B otherwise.
  • Concatenate A and B.
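To make this concrete, here is an illustrative sketch (assumed data structures, not an actual implementation) of computing <_T from per-task read/write sets and checking the merge condition:

// Illustrative sketch: infer task ordering from state reads/writes and check
// whether two tasks may be merged. Task/state IDs and set types are assumptions.
#include <cstddef>
#include <set>
#include <vector>

struct Task {
  std::set<unsigned> readsStates;
  std::set<unsigned> writesStates;
};

// before[a][b] == true means task a must be scheduled before task b (a <_T b).
std::vector<std::vector<bool>> computeBefore(const std::vector<Task> &tasks) {
  size_t n = tasks.size();
  std::vector<std::vector<bool>> before(n, std::vector<bool>(n, false));
  // Direct dependencies: b reads a state that a (earlier in program order) writes.
  for (size_t a = 0; a < n; ++a)
    for (size_t b = a + 1; b < n; ++b)
      for (unsigned s : tasks[b].readsStates)
        if (tasks[a].writesStates.count(s)) { before[a][b] = true; break; }
  // Transitive closure (Floyd-Warshall style).
  for (size_t k = 0; k < n; ++k)
    for (size_t i = 0; i < n; ++i)
      for (size_t j = 0; j < n; ++j)
        if (before[i][k] && before[k][j]) before[i][j] = true;
  return before;
}

// A and B can be merged iff there is no C with A <_T C and C <_T B.
bool canMerge(const std::vector<std::vector<bool>> &before, size_t A, size_t B) {
  for (size_t C = 0; C < before.size(); ++C)
    if (before[A][C] && before[C][B]) return false;
  return true;
}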

Therefore we might be able to get an alternative bottom-up partitioning scheme by combining atomic arc.tasks, each of which contains the writes for only one state. It might better fit dynamic scheduling by producing some number of "constant-sized tasks" instead of the constant number of "tasks with similar size" given by a top-down graph partitioning algorithm. Using this bottom-up method, we might be able to relax the clock-depends-on-register cloning requirement and instead allow dependencies between tasks with minimal loss of performance.

Nevertheless, for the top-down partitioning method, this kind of dependency can still be used to reflect the semantics of arc.sync, so arc.sync can just be replaced by it.

State grouping

After some thought, it seems to me that arc.alloc_storage used together with arc.task is a better choice than our method of using a second storage object. The benefit is that we can specify (in the IR) which task uses which storage.

The downside is that these storages would all be allocated flat, and the user of the model cannot control the allocation. It may be desirable for dynamically scheduled simulations to reuse unused storage spaces between tasks for better cache locality.
