[RISCV][WIP] Let RA do the CSR saves. #90819
Conversation
@llvm/pr-subscribers-backend-risc-v Author: Mikhail Gudim (mgudim) Changes: We turn the problem of saving and restoring callee-saved registers efficiently into a register allocation problem. This has the advantage that the register allocator can essentially do shrink-wrapping on a per-register basis. Currently, the shrink-wrapping pass saves all CSRs in the same place, which may be suboptimal. Also, improvements to register allocation / coalescing will translate to improvements in shrink-wrapping. In finalizeLowering() we copy all callee-saved registers from a physical register to a virtual one; in all return blocks we do the reverse. Full diff: https://github.com/llvm/llvm-project/pull/90819.diff 6 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp b/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
index cb41577c5d9435..b725bfb56389bc 100644
--- a/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
@@ -1026,12 +1026,51 @@ RISCVFrameLowering::getFrameIndexReference(const MachineFunction &MF, int FI,
return Offset;
}
-void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
- BitVector &SavedRegs,
- RegScavenger *RS) const {
- TargetFrameLowering::determineCalleeSaves(MF, SavedRegs, RS);
- // Unconditionally spill RA and FP only if the function uses a frame
- // pointer.
+void RISCVFrameLowering::determineMustCalleeSaves(MachineFunction &MF,
+ BitVector &SavedRegs) const {
+ const TargetRegisterInfo &TRI = *MF.getSubtarget().getRegisterInfo();
+
+ // Resize before the early returns. Some backends expect that
+ // SavedRegs.size() == TRI.getNumRegs() after this call even if there are no
+ // saved registers.
+ SavedRegs.resize(TRI.getNumRegs());
+
+ // When interprocedural register allocation is enabled caller saved registers
+ // are preferred over callee saved registers.
+ if (MF.getTarget().Options.EnableIPRA &&
+ isSafeForNoCSROpt(MF.getFunction()) &&
+ isProfitableForNoCSROpt(MF.getFunction()))
+ return;
+
+ // Get the callee saved register list...
+ const MCPhysReg *CSRegs = MF.getRegInfo().getCalleeSavedRegs();
+
+ // Early exit if there are no callee saved registers.
+ if (!CSRegs || CSRegs[0] == 0)
+ return;
+
+ // In Naked functions we aren't going to save any registers.
+ if (MF.getFunction().hasFnAttribute(Attribute::Naked))
+ return;
+
+ // Noreturn+nounwind functions never restore CSR, so no saves are needed.
+ // Purely noreturn functions may still return through throws, so those must
+ // save CSR for caller exception handlers.
+ //
+ // If the function uses longjmp to break out of its current path of
+ // execution we do not need the CSR spills either: setjmp stores all CSRs
+ // it was called with into the jmp_buf, which longjmp then restores.
+ if (MF.getFunction().hasFnAttribute(Attribute::NoReturn) &&
+ MF.getFunction().hasFnAttribute(Attribute::NoUnwind) &&
+ !MF.getFunction().hasFnAttribute(Attribute::UWTable) &&
+ enableCalleeSaveSkip(MF))
+ return;
+
+ // Functions which call __builtin_unwind_init get all their registers saved.
+ if (MF.callsUnwindInit()) {
+ SavedRegs.set();
+ return;
+ }
if (hasFP(MF)) {
SavedRegs.set(RISCV::X1);
SavedRegs.set(RISCV::X8);
@@ -1041,6 +1080,18 @@ void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
SavedRegs.set(RISCVABI::getBPReg());
}
+void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
+ BitVector &SavedRegs,
+ RegScavenger *RS) const {
+ const auto &ST = MF.getSubtarget<RISCVSubtarget>();
+ const Function &F = MF.getFunction();
+ determineMustCalleeSaves(MF, SavedRegs);
+ if (ST.doCSRSavesInRA() && F.doesNotThrow())
+ return;
+
+ TargetFrameLowering::determineCalleeSaves(MF, SavedRegs, RS);
+}
+
std::pair<int64_t, Align>
RISCVFrameLowering::assignRVVStackObjectOffsets(MachineFunction &MF) const {
MachineFrameInfo &MFI = MF.getFrameInfo();
diff --git a/llvm/lib/Target/RISCV/RISCVFrameLowering.h b/llvm/lib/Target/RISCV/RISCVFrameLowering.h
index 28ab4aff3b9d51..d7b9df8bd68515 100644
--- a/llvm/lib/Target/RISCV/RISCVFrameLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVFrameLowering.h
@@ -31,6 +31,7 @@ class RISCVFrameLowering : public TargetFrameLowering {
StackOffset getFrameIndexReference(const MachineFunction &MF, int FI,
Register &FrameReg) const override;
+ void determineMustCalleeSaves(MachineFunction &MF, BitVector &SavedRegs) const;
void determineCalleeSaves(MachineFunction &MF, BitVector &SavedRegs,
RegScavenger *RS) const override;
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 19ef1f2f18ec1a..7978dac4aa7944 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -21314,6 +21314,83 @@ unsigned RISCVTargetLowering::getCustomCtpopCost(EVT VT,
return isCtpopFast(VT) ? 0 : 1;
}
+void RISCVTargetLowering::finalizeLowering(MachineFunction &MF) const {
+ const Function &F = MF.getFunction();
+ if (!Subtarget.doCSRSavesInRA() || !F.doesNotThrow()) {
+ TargetLoweringBase::finalizeLowering(MF);
+ return;
+ }
+
+ MachineRegisterInfo &MRI = MF.getRegInfo();
+ const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
+ const RISCVRegisterInfo &TRI = *Subtarget.getRegisterInfo();
+ const RISCVFrameLowering &TFI = *Subtarget.getFrameLowering();
+
+ SmallVector<MachineBasicBlock *, 4> RestoreMBBs;
+ SmallVector<MachineBasicBlock *, 4> SaveMBBs;
+ SaveMBBs.push_back(&MF.front());
+ for (MachineBasicBlock &MBB : MF) {
+ if (MBB.isReturnBlock())
+ RestoreMBBs.push_back(&MBB);
+ }
+
+ BitVector MustCalleeSavedRegs;
+ TFI.determineMustCalleeSaves(MF, MustCalleeSavedRegs);
+ const MCPhysReg * CSRegs = MF.getRegInfo().getCalleeSavedRegs();
+ SmallVector<MCPhysReg, 4> EligibleRegs;
+ for (int i = 0; CSRegs[i]; ++i) {
+ if (!MustCalleeSavedRegs[i])
+ EligibleRegs.push_back(CSRegs[i]);
+ }
+
+ dbgs() << "EligibleRegs: " << EligibleRegs.size() << "\n";
+ SmallVector<Register, 4> VRegs;
+ for (MachineBasicBlock *SaveMBB : SaveMBBs) {
+ for (MCPhysReg Reg : EligibleRegs) {
+ SaveMBB->addLiveIn(Reg);
+ // TODO: should we use Maximal register class instead?
+ Register VReg = MRI.createVirtualRegister(TRI.getMinimalPhysRegClass(Reg));
+ VRegs.push_back(VReg);
+ BuildMI(
+ *SaveMBB,
+ SaveMBB->begin(),
+ SaveMBB->findDebugLoc(SaveMBB->begin()),
+ TII.get(TargetOpcode::COPY),
+ VReg
+ )
+ .addReg(Reg);
+ }
+ }
+
+ for (MachineBasicBlock *RestoreMBB : RestoreMBBs) {
+ MachineInstr &ReturnMI = RestoreMBB->back();
+ assert(ReturnMI.isReturn() && "Expected return instruction!");
+ auto VRegI = VRegs.begin();
+ for (MCPhysReg Reg : EligibleRegs) {
+ Register VReg = *VRegI;
+ BuildMI(
+ *RestoreMBB,
+ ReturnMI.getIterator(),
+ ReturnMI.getDebugLoc(),
+ TII.get(TargetOpcode::COPY),
+ Reg
+ )
+ .addReg(VReg);
+ ReturnMI.addOperand(
+ MF,
+ MachineOperand::CreateReg(
+ Reg,
+ /*isDef=*/false,
+ /*isImplicit=*/true
+ )
+ );
+ VRegI++;
+ }
+ }
+
+ TargetLoweringBase::finalizeLowering(MF);
+}
+
bool RISCVTargetLowering::fallBackToDAGISel(const Instruction &Inst) const {
// GISel support is in progress or complete for these opcodes.
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.h b/llvm/lib/Target/RISCV/RISCVISelLowering.h
index 78f99e70c083a7..ea1079af2ead05 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -853,6 +853,8 @@ class RISCVTargetLowering : public TargetLowering {
bool fallBackToDAGISel(const Instruction &Inst) const override;
+ void finalizeLowering(MachineFunction &MF) const override;
+
bool lowerInterleavedLoad(LoadInst *LI,
ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices,
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
index d3236bb07d56d5..15476fc2d3c583 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
@@ -65,6 +65,11 @@ static cl::opt<unsigned> RISCVMinimumJumpTableEntries(
"riscv-min-jump-table-entries", cl::Hidden,
cl::desc("Set minimum number of entries to use a jump table on RISCV"));
+static cl::opt<bool> RISCVEnableSaveCSRByRA(
+ "riscv-enable-save-csr-in-ra",
+ cl::desc("Let register alloctor do csr saves/restores"),
+ cl::init(false), cl::Hidden);
+
void RISCVSubtarget::anchor() {}
RISCVSubtarget &
@@ -130,6 +135,10 @@ bool RISCVSubtarget::useConstantPoolForLargeInts() const {
return !RISCVDisableUsingConstantPoolForLargeInts;
}
+bool RISCVSubtarget::doCSRSavesInRA() const {
+ return RISCVEnableSaveCSRByRA;
+}
+
unsigned RISCVSubtarget::getMaxBuildIntsCost() const {
// Loading integer from constant pool needs two instructions (the reason why
// the minimum cost is 2): an address calculation instruction and a load
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.h b/llvm/lib/Target/RISCV/RISCVSubtarget.h
index c880c9e921e0ea..f3d8a70c0df14e 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.h
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.h
@@ -270,6 +270,8 @@ class RISCVSubtarget : public RISCVGenSubtargetInfo {
bool useConstantPoolForLargeInts() const;
+ bool doCSRSavesInRA() const;
+
// Maximum cost used for building integers, integers will be put into constant
// pool if exceeded.
unsigned getMaxBuildIntsCost() const;
You can test this locally with the following command:

git-clang-format --diff 5192cb772ad58af4b557539791ff8de60ab450a3 10e60c750ec233404e86941d35c56b4f3c346d29 --extensions h,cpp -- llvm/lib/Target/RISCV/RISCVCFIInserter.cpp llvm/include/llvm/CodeGen/ReachingDefAnalysis.h llvm/include/llvm/CodeGen/TargetFrameLowering.h llvm/include/llvm/CodeGen/TargetSubtargetInfo.h llvm/lib/CodeGen/MachineLICM.cpp llvm/lib/CodeGen/PrologEpilogInserter.cpp llvm/lib/CodeGen/ReachingDefAnalysis.cpp llvm/lib/CodeGen/RegAllocEvictionAdvisor.cpp llvm/lib/CodeGen/TargetSubtargetInfo.cpp llvm/lib/Target/RISCV/RISCV.h llvm/lib/Target/RISCV/RISCVFrameLowering.cpp llvm/lib/Target/RISCV/RISCVFrameLowering.h llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVISelLowering.h llvm/lib/Target/RISCV/RISCVInstrInfo.h llvm/lib/Target/RISCV/RISCVMachineFunctionInfo.cpp llvm/lib/Target/RISCV/RISCVMachineFunctionInfo.h llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp llvm/lib/Target/RISCV/RISCVSubtarget.cpp llvm/lib/Target/RISCV/RISCVSubtarget.h llvm/lib/Target/RISCV/RISCVTargetMachine.cpp

View the diff from clang-format here:

diff --git a/llvm/include/llvm/CodeGen/TargetFrameLowering.h b/llvm/include/llvm/CodeGen/TargetFrameLowering.h
index 3d0a5151cb..d63f3992cd 100644
--- a/llvm/include/llvm/CodeGen/TargetFrameLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetFrameLowering.h
@@ -34,7 +34,7 @@ namespace llvm {
WasmLocal = 3,
NoAlloc = 255
};
-}
+ }
/// Information about stack frame layout on the target. It holds the direction
/// of stack growth, the known stack alignment on entry to each function, and
diff --git a/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp b/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
index 24017c50df..b1a4bb050b 100644
--- a/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
@@ -199,7 +199,8 @@ void RISCVCFIInstrInserter::calculateCFAInfo(MachineFunction &MF) {
updateSuccCFAInfo(MBBVector[MF.front().getNumber()]);
LLVM_DEBUG(dbgs() << "Calculated CFI info for " << MF.getName() << "\n";
- for (MachineBasicBlock &MBB : MF) {
+ for (MachineBasicBlock &MBB
+ : MF) {
dbgs() << "BasicBlock: " << MBB.getNumber() << " "
<< MBB.getName() << "\n";
dbgs() << "IncomingCSRLocations:\n";
Currently, shrink wrapping saves / restores all registers at the same point. We noticed that this is not optimal in several places in SPEC. GCC can do this on a per-register basis, and there is an option to turn this off. By using this flag in GCC we determined that this can make up to a 2% difference in dynamic instruction count on the gcc benchmark, with smaller impacts on other benchmarks. Instead of trying to improve shrink wrapping in LLVM, I decided to try this (crazy) idea since it was pretty easy to implement this prototype. I disabled this for functions which can unwind, because this is just an experiment to see if it is viable. Thus all C++ benchmarks that use exceptions were not affected. I got a 1.7% improvement on gcc and a 1.7% degradation on xz in dynamic instruction count. Do you think it's worth pursuing this idea further? To fix the degradations, we need to improve the register allocator / coalescing, which may have more impact than just improving shrink wrapping.
Will this still work with -msave-restore and cm.push/pop?
I found some old post about improving shrink wrapping: https://lists.llvm.org/pipermail/llvm-dev/2017-August/116137.html. Not sure how much of it was implemented. CC @francisvm
dbgs() << "EligibleRegs: " << EligibleRegs.size() << "\n";

please wrap this with LLVM_DEBUG.
Does this update any existing tests?
It's gated by a command line flag that isn't enabled by default.
Not much of it was implemented in the end. It turned out that for us, the compile-time impact, the unwinding impact, and the various regressions were not worth the speedup in the few benchmarks. I wish I had the old data to justify this, but that was a while ago. The unwinder impact was more significant than I initially thought at the time: Apple platforms rely heavily on compact unwinding, so missing some frames and falling back to eh_frame/debug_frame CFI is a huge penalty, and in some memory/permission-constrained environments it is very, very hard or impossible for tools that only look at prologues/epilogues to quickly unwind (some crash reporters, for example). That made enabling it by default a hard task, and we never pursued an attribute/flag-based approach.
I think so. All it does is make a callee-saved register a spill, so this should work. I'll run SPEC with these options to check. If we need to optimize for size, then we can just turn this off with a flag.
At this point I just want to understand if I should continue with this approach or not. If "yes", I'll investigate the degradation on xz, figure out what goes wrong if the function unwinds, and add tests.
I think you should continue. I'm also curious how this interacts with -fno-omit-frame-pointer, which makes the X8 register reserved, but it's still in the callee save list.
I support continuing to investigate how far we can get with this approach; I think this can be a good workaround to enhance "shrink wrapping". Thanks for your great idea, and I will do some evaluations too.
Hi all, just wanted to give an update on this: I looked at the xz degradation. Half of it is due to the fact that since we now have some long live ranges, early machine LICM doesn't hoist as well as before. This is easy to fix though. If the register pressure for some class exceeds the limit, then there will be a spill, and if there are live ranges that are live through the entire loop without any uses inside the loop, then these ranges will be spilled first. So basically, I fixed this by ignoring such live-through-no-uses-inside-loop live ranges in

The other problem is that the RA seems to be doing something wrong and I got some extra spills. I hope that can be fixed too.

The biggest problem is that we need to emit CFI directives that tell the runtime how to restore ALL callee-saved registers. With my approach I don't know what the best way to do this is. I don't know this CFI stuff very well, so I am hoping someone can give advice / suggestions. I can only think of having a pass at the very end of the pipeline which tracks what happens to callee-saved registers and emits the directives accordingly. BTW, there is something which looks relevant in

@topperc @preames @mshockwave @wangpc-pp @asb any suggestions how to do this?
How will GCC handle these CFIs?
Hi @mgudim
The approach is not crazy at all! FWIW, I believe the whole CSR distinction in LLVM is wrong and the proper way to handle them would be to issue parallel copies in the entry block and exit blocks right during ISel (at the very beginning). As you saw, this change in representation creates regressions because the CSR distinction is baked left and right into the backend. So we can fix them as we find them, but there'll be a long tail. One remark: since you're not using parallel copies, you're creating artificial interferences, and that can have a cascading effect. (See below.) Even if you were to use parallel copies (in LLVM that would be bundles of copies), that wouldn't work out-of-the-box either, since this construct is not a first-class citizen. The TL;DR is you're testing the RA heuristics in ways that have not been exercised, and some things will get better, some things will get worse. Example of issue with CSR using sequential copies:
If for some reason
instead of
Hi Quentin, Thank you for the comments. After I looked into some degradations, I see that you are exactly right!
Interesting, you mean we should put all the copies into one instruction bundle?
In particular in the register allocator I see that CSRs are treated differently: since there is some cost to use CSR for the first time, there is some logic trying to minimize these costs. In this approach, I think this logic is working against us. Here's another problem that I saw: when trying to allocate for the live range of
v1 = COPY csr1
some_reg = COPY csr1
csr1 = COPY csr1  # no-op

I have tried some hacks which basically disable eviction if it breaks hints, and they make the problem go away, but I am not sure what the proper solution to this should be; maybe we can discuss it during your office hour.
Yes, but that will not be enough because RA needs to be taught how to break the bundle if some of the copies need to be spilled.
Not surprised and that's what I was referring to when I was saying that CSRs are treated differently left and right.
Sadly, I don't think there's a good solution here. RA's decisions are currently too narrow-scoped to reason about this kind of thing. The eviction chain heuristic is supposed to help with these cases, but this is a heuristic! (Plus it may be disabled by default, I don't remember.) Long story short, if we want to go down that path, I think a more serious overhaul of the RA infrastructure is needed.
I think something went wrong on the last push.
Yes, thanks! I am working on cleaning up the code and posting the branch so others can try it. Then I'll break it up into commits and try to merge them one by one.
I've updated the branch with the latest prototype. To try the branch, add the following flags to clang: Looking forward to your feedback. If there are no fundamental issues moving forward, I can start upstreaming this work in small commits.
We turn the problem of saving and restoring callee-saved registers efficiently into a register allocation problem. This has the advantage that the register allocator can essentially do shrink-wrapping on a per-register basis. Currently, the shrink-wrapping pass saves all CSRs in the same place, which may be suboptimal. Also, improvements to register allocation / coalescing will translate to improvements in shrink-wrapping. In finalizeLowering() we copy all callee-saved registers from a physical register to a virtual one. In all return blocks we do the reverse.
What is the thought process of taking this approach over modifying the ShrinkWrap pass to communicate to the PrologEpilogInserter where to put each callee-saved register?