
[RISCV][WIP] Let RA do the CSR saves. #90819

Open

wants to merge 2 commits into base: main from save_csr_in_ra
Conversation

mgudim
Contributor

@mgudim mgudim commented May 2, 2024

We turn the problem of saving and restoring callee-saved registers efficiently into a register allocation problem. This has the advantage that the register allocator can essentially do shrink-wrapping on a per-register basis. Currently, the shrink-wrapping pass saves all CSRs in the same place, which may be suboptimal. Also, improvements to register allocation / coalescing will translate into improvements in shrink-wrapping.

In finalizeLowering() we copy all callee-saved registers from a physical register to a virtual one. In all return blocks we do the reverse.

@llvmbot
Collaborator

llvmbot commented May 2, 2024

@llvm/pr-subscribers-backend-risc-v

Author: Mikhail Gudim (mgudim)

Changes

We turn the problem of saving and restoring callee-saved registers efficiently into a register allocation problem. This has the advantage that the register allocator can essentially do shrink-wrapping on a per-register basis. Currently, the shrink-wrapping pass saves all CSRs in the same place, which may be suboptimal. Also, improvements to register allocation / coalescing will translate into improvements in shrink-wrapping.

In finalizeLowering() we copy all callee-saved registers from a physical register to a virtual one. In all return blocks we do the reverse.


Full diff: https://github.com/llvm/llvm-project/pull/90819.diff

6 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVFrameLowering.cpp (+57-6)
  • (modified) llvm/lib/Target/RISCV/RISCVFrameLowering.h (+1)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+77)
  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.h (+2)
  • (modified) llvm/lib/Target/RISCV/RISCVSubtarget.cpp (+9)
  • (modified) llvm/lib/Target/RISCV/RISCVSubtarget.h (+2)
diff --git a/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp b/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
index cb41577c5d9435..b725bfb56389bc 100644
--- a/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
@@ -1026,12 +1026,51 @@ RISCVFrameLowering::getFrameIndexReference(const MachineFunction &MF, int FI,
   return Offset;
 }
 
-void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
-                                              BitVector &SavedRegs,
-                                              RegScavenger *RS) const {
-  TargetFrameLowering::determineCalleeSaves(MF, SavedRegs, RS);
-  // Unconditionally spill RA and FP only if the function uses a frame
-  // pointer.
+void RISCVFrameLowering::determineMustCalleeSaves(MachineFunction &MF,
+                                              BitVector &SavedRegs) const {
+  const TargetRegisterInfo &TRI = *MF.getSubtarget().getRegisterInfo();
+
+  // Resize before the early returns. Some backends expect that
+  // SavedRegs.size() == TRI.getNumRegs() after this call even if there are no
+  // saved registers.
+  SavedRegs.resize(TRI.getNumRegs());
+
+  // When interprocedural register allocation is enabled caller saved registers
+  // are preferred over callee saved registers.
+  if (MF.getTarget().Options.EnableIPRA &&
+      isSafeForNoCSROpt(MF.getFunction()) &&
+      isProfitableForNoCSROpt(MF.getFunction()))
+    return;
+
+  // Get the callee saved register list...
+  const MCPhysReg *CSRegs = MF.getRegInfo().getCalleeSavedRegs();
+
+  // Early exit if there are no callee saved registers.
+  if (!CSRegs || CSRegs[0] == 0)
+    return;
+
+  // In Naked functions we aren't going to save any registers.
+  if (MF.getFunction().hasFnAttribute(Attribute::Naked))
+    return;
+
+  // Noreturn+nounwind functions never restore CSR, so no saves are needed.
+  // Purely noreturn functions may still return through throws, so those must
+  // save CSR for caller exception handlers.
+  //
+  // If the function uses longjmp to break out of its current path of
+  // execution we do not need the CSR spills either: setjmp stores all CSRs
+  // it was called with into the jmp_buf, which longjmp then restores.
+  if (MF.getFunction().hasFnAttribute(Attribute::NoReturn) &&
+        MF.getFunction().hasFnAttribute(Attribute::NoUnwind) &&
+        !MF.getFunction().hasFnAttribute(Attribute::UWTable) &&
+        enableCalleeSaveSkip(MF))
+    return;
+
+  // Functions which call __builtin_unwind_init get all their registers saved.
+  if (MF.callsUnwindInit()) {
+    SavedRegs.set();
+    return;
+  }
   if (hasFP(MF)) {
     SavedRegs.set(RISCV::X1);
     SavedRegs.set(RISCV::X8);
@@ -1041,6 +1080,18 @@ void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
     SavedRegs.set(RISCVABI::getBPReg());
 }
 
+void RISCVFrameLowering::determineCalleeSaves(MachineFunction &MF,
+                                              BitVector &SavedRegs,
+                                              RegScavenger *RS) const {
+  const auto &ST = MF.getSubtarget<RISCVSubtarget>();
+  const Function &F = MF.getFunction();
+  determineMustCalleeSaves(MF, SavedRegs);
+  if (ST.doCSRSavesInRA() && F.doesNotThrow())
+    return;
+
+  TargetFrameLowering::determineCalleeSaves(MF, SavedRegs, RS);
+}
+
 std::pair<int64_t, Align>
 RISCVFrameLowering::assignRVVStackObjectOffsets(MachineFunction &MF) const {
   MachineFrameInfo &MFI = MF.getFrameInfo();
diff --git a/llvm/lib/Target/RISCV/RISCVFrameLowering.h b/llvm/lib/Target/RISCV/RISCVFrameLowering.h
index 28ab4aff3b9d51..d7b9df8bd68515 100644
--- a/llvm/lib/Target/RISCV/RISCVFrameLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVFrameLowering.h
@@ -31,6 +31,7 @@ class RISCVFrameLowering : public TargetFrameLowering {
   StackOffset getFrameIndexReference(const MachineFunction &MF, int FI,
                                      Register &FrameReg) const override;
 
+  void determineMustCalleeSaves(MachineFunction &MF, BitVector &SavedRegs) const;
   void determineCalleeSaves(MachineFunction &MF, BitVector &SavedRegs,
                             RegScavenger *RS) const override;
 
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 19ef1f2f18ec1a..7978dac4aa7944 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -21314,6 +21314,83 @@ unsigned RISCVTargetLowering::getCustomCtpopCost(EVT VT,
   return isCtpopFast(VT) ? 0 : 1;
 }
 
+void RISCVTargetLowering::finalizeLowering(MachineFunction &MF) const {
+  const Function &F = MF.getFunction();
+  if (!Subtarget.doCSRSavesInRA() || !F.doesNotThrow()) { 
+    TargetLoweringBase::finalizeLowering(MF);
+    return;
+  }
+
+  MachineRegisterInfo &MRI = MF.getRegInfo();
+  const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
+  const RISCVRegisterInfo &TRI = *Subtarget.getRegisterInfo();
+  const RISCVFrameLowering &TFI = *Subtarget.getFrameLowering();
+
+  SmallVector<MachineBasicBlock *, 4> RestoreMBBs;
+  SmallVector<MachineBasicBlock *, 4> SaveMBBs;
+  SaveMBBs.push_back(&MF.front());
+  for (MachineBasicBlock &MBB : MF) {
+    if (MBB.isReturnBlock())
+      RestoreMBBs.push_back(&MBB);
+  }
+
+  BitVector MustCalleeSavedRegs;
+  TFI.determineMustCalleeSaves(MF, MustCalleeSavedRegs);
+  const MCPhysReg * CSRegs = MF.getRegInfo().getCalleeSavedRegs();
+  SmallVector<MCPhysReg, 4> EligibleRegs;
+  for (int i = 0; CSRegs[i]; ++i) {
+    if (!MustCalleeSavedRegs[i])
+      EligibleRegs.push_back(CSRegs[i]);
+  }
+
+  dbgs() << "EligibleRegs: " << EligibleRegs.size() << "\n";
+  SmallVector<Register, 4> VRegs;
+  for (MachineBasicBlock *SaveMBB : SaveMBBs) {
+    for (MCPhysReg Reg : EligibleRegs) {
+      SaveMBB->addLiveIn(Reg);
+      // TODO: should we use Maximal register class instead?
+      Register VReg = MRI.createVirtualRegister(TRI.getMinimalPhysRegClass(Reg));
+      VRegs.push_back(VReg);
+      BuildMI(
+        *SaveMBB,
+        SaveMBB->begin(),
+        SaveMBB->findDebugLoc(SaveMBB->begin()),
+        TII.get(TargetOpcode::COPY),
+        VReg
+      )
+      .addReg(Reg);
+    }
+  }
+
+  for (MachineBasicBlock *RestoreMBB : RestoreMBBs) {
+    MachineInstr &ReturnMI = RestoreMBB->back();
+    assert(ReturnMI.isReturn() && "Expected return instruction!");
+    auto VRegI = VRegs.begin();
+    for (MCPhysReg Reg : EligibleRegs) {
+      Register VReg = *VRegI;
+      BuildMI(
+        *RestoreMBB,
+        ReturnMI.getIterator(),
+        ReturnMI.getDebugLoc(),
+        TII.get(TargetOpcode::COPY),
+        Reg
+      )
+      .addReg(VReg);
+      ReturnMI.addOperand(
+        MF,
+        MachineOperand::CreateReg(
+          Reg,
+          /*isDef=*/false,
+          /*isImplicit=*/true
+        )
+      );
+      VRegI++;
+    }
+  }
+
+  TargetLoweringBase::finalizeLowering(MF);
+}
+
 bool RISCVTargetLowering::fallBackToDAGISel(const Instruction &Inst) const {
 
   // GISel support is in progress or complete for these opcodes.
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.h b/llvm/lib/Target/RISCV/RISCVISelLowering.h
index 78f99e70c083a7..ea1079af2ead05 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.h
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.h
@@ -853,6 +853,8 @@ class RISCVTargetLowering : public TargetLowering {
 
   bool fallBackToDAGISel(const Instruction &Inst) const override;
 
+  void finalizeLowering(MachineFunction &MF) const override;
+
   bool lowerInterleavedLoad(LoadInst *LI,
                             ArrayRef<ShuffleVectorInst *> Shuffles,
                             ArrayRef<unsigned> Indices,
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
index d3236bb07d56d5..15476fc2d3c583 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.cpp
@@ -65,6 +65,11 @@ static cl::opt<unsigned> RISCVMinimumJumpTableEntries(
     "riscv-min-jump-table-entries", cl::Hidden,
     cl::desc("Set minimum number of entries to use a jump table on RISCV"));
 
+static cl::opt<bool> RISCVEnableSaveCSRByRA(
+    "riscv-enable-save-csr-in-ra",
+    cl::desc("Let register alloctor do csr saves/restores"),
+    cl::init(false), cl::Hidden);
+
 void RISCVSubtarget::anchor() {}
 
 RISCVSubtarget &
@@ -130,6 +135,10 @@ bool RISCVSubtarget::useConstantPoolForLargeInts() const {
   return !RISCVDisableUsingConstantPoolForLargeInts;
 }
 
+bool RISCVSubtarget::doCSRSavesInRA() const {
+  return RISCVEnableSaveCSRByRA;
+}
+
 unsigned RISCVSubtarget::getMaxBuildIntsCost() const {
   // Loading integer from constant pool needs two instructions (the reason why
   // the minimum cost is 2): an address calculation instruction and a load
diff --git a/llvm/lib/Target/RISCV/RISCVSubtarget.h b/llvm/lib/Target/RISCV/RISCVSubtarget.h
index c880c9e921e0ea..f3d8a70c0df14e 100644
--- a/llvm/lib/Target/RISCV/RISCVSubtarget.h
+++ b/llvm/lib/Target/RISCV/RISCVSubtarget.h
@@ -270,6 +270,8 @@ class RISCVSubtarget : public RISCVGenSubtargetInfo {
 
   bool useConstantPoolForLargeInts() const;
 
+  bool doCSRSavesInRA() const;
+
   // Maximum cost used for building integers, integers will be put into constant
   // pool if exceeded.
   unsigned getMaxBuildIntsCost() const;


github-actions bot commented May 2, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff 5192cb772ad58af4b557539791ff8de60ab450a3 10e60c750ec233404e86941d35c56b4f3c346d29 --extensions h,cpp -- llvm/lib/Target/RISCV/RISCVCFIInserter.cpp llvm/include/llvm/CodeGen/ReachingDefAnalysis.h llvm/include/llvm/CodeGen/TargetFrameLowering.h llvm/include/llvm/CodeGen/TargetSubtargetInfo.h llvm/lib/CodeGen/MachineLICM.cpp llvm/lib/CodeGen/PrologEpilogInserter.cpp llvm/lib/CodeGen/ReachingDefAnalysis.cpp llvm/lib/CodeGen/RegAllocEvictionAdvisor.cpp llvm/lib/CodeGen/TargetSubtargetInfo.cpp llvm/lib/Target/RISCV/RISCV.h llvm/lib/Target/RISCV/RISCVFrameLowering.cpp llvm/lib/Target/RISCV/RISCVFrameLowering.h llvm/lib/Target/RISCV/RISCVISelLowering.cpp llvm/lib/Target/RISCV/RISCVISelLowering.h llvm/lib/Target/RISCV/RISCVInstrInfo.h llvm/lib/Target/RISCV/RISCVMachineFunctionInfo.cpp llvm/lib/Target/RISCV/RISCVMachineFunctionInfo.h llvm/lib/Target/RISCV/RISCVRegisterInfo.cpp llvm/lib/Target/RISCV/RISCVSubtarget.cpp llvm/lib/Target/RISCV/RISCVSubtarget.h llvm/lib/Target/RISCV/RISCVTargetMachine.cpp
View the diff from clang-format here.
diff --git a/llvm/include/llvm/CodeGen/TargetFrameLowering.h b/llvm/include/llvm/CodeGen/TargetFrameLowering.h
index 3d0a5151cb..d63f3992cd 100644
--- a/llvm/include/llvm/CodeGen/TargetFrameLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetFrameLowering.h
@@ -34,7 +34,7 @@ namespace llvm {
     WasmLocal = 3,
     NoAlloc = 255
   };
-}
+  }
 
 /// Information about stack frame layout on the target.  It holds the direction
 /// of stack growth, the known stack alignment on entry to each function, and
diff --git a/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp b/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
index 24017c50df..b1a4bb050b 100644
--- a/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
+++ b/llvm/lib/Target/RISCV/RISCVCFIInserter.cpp
@@ -199,7 +199,8 @@ void RISCVCFIInstrInserter::calculateCFAInfo(MachineFunction &MF) {
   updateSuccCFAInfo(MBBVector[MF.front().getNumber()]);
 
   LLVM_DEBUG(dbgs() << "Calculated CFI info for " << MF.getName() << "\n";
-             for (MachineBasicBlock &MBB : MF) {
+             for (MachineBasicBlock &MBB
+                  : MF) {
                dbgs() << "BasicBlock: " << MBB.getNumber() << " "
                       << MBB.getName() << "\n";
                dbgs() << "IncomingCSRLocations:\n";

@mgudim
Contributor Author

mgudim commented May 2, 2024

Currently the shrink-wrapping pass saves / restores all registers at the same point. We noticed that this is not optimal in several places in SPEC. GCC can do this on a per-register basis, and there is an option to turn this off. By using this flag in GCC we determined that this can make up to a 2% difference in dynamic instruction count on the gcc benchmark, with some smaller impacts in other benchmarks.

Instead of trying to improve shrink-wrapping in llvm, I decided to try this (crazy) idea since it was pretty easy to implement this prototype. I disabled this for functions which can unwind, because this is just an experiment to see if this is viable. Thus all C++ benchmarks that have exceptions were not affected. I got a 1.7% improvement on gcc and a 1.7% degradation on xz in dynamic instruction count.

Do you think it's worth pursuing this idea further? To fix the degradations, we need to improve register allocator / coalescing which may have more impact than just improving shrink wrapping.

@topperc
Collaborator

topperc commented May 7, 2024

Will this still work with -msave-restore and cm.push/pop?

@topperc
Collaborator

topperc commented May 7, 2024

I found some old post about improving shrink wrapping: https://lists.llvm.org/pipermail/llvm-dev/2017-August/116137.html. Not sure how much of it was implemented.

CC @francisvm

EligibleRegs.push_back(CSRegs[i]);
}

dbgs() << "EligibleRegs: " << EligibleRegs.size() << "\n";
Member
please wrap this with LLVM_DEBUG.

@mshockwave
Member

Does this update any existing tests?

@topperc
Collaborator

topperc commented May 7, 2024

Does this update any existing tests?

It's gated by a command line flag that isn't enabled by default.

@francisvm
Collaborator

I fought some old post about improving shrink wrapping. https://lists.llvm.org/pipermail/llvm-dev/2017-August/116137.html not sure how much of it was implemented.

CC @francisvm

Not much of it was implemented in the end. It turned out that for us, the compile-time impact, the unwinding impact and the various regressions were not worth the speedup in the few benchmarks. I wish I had the old data to justify this, but that was a while ago. The unwinder impact was more significant than I initially thought at that time, since Apple platforms rely heavily on compact unwinding, therefore missing some frames and falling back to eh/debug_frame CFI is a huge penalty, and in some memory/permission-constrained environments, very very hard, or impossible for those that only look at prologues/epilogues to quickly unwind (some crash reporters, for example). That made enabling it by default a hard task, and we never pursued an attribute/flag-based approach.

@mgudim
Contributor Author

mgudim commented May 9, 2024

Will this still work with -msave-restore and cm.push/pop?

I think so. All it does is turn the callee-saved register saves into ordinary spills, so this should work. I'll run SPEC with these options to check. If we need to optimize for size, then we can just turn this off with a flag.

@mgudim
Contributor Author

mgudim commented May 9, 2024

At this point I just want to understand if I should continue with this approach or not. If "yes", I'll investigate the degradation on xz, figure out what goes wrong if function unwinds and add tests.

@wangpc-pp wangpc-pp removed the request for review from pcwang-thead May 20, 2024 04:15
@topperc
Collaborator

topperc commented May 21, 2024

At this point I just want to understand if I should continue with this approach or not. If "yes", I'll investigate the degradation on xz, figure out what goes wrong if function unwinds and add tests.

I think you should continue.

I'm also curious how this interacts with -fno-omit-frame-pointer, which makes the X8 register reserved even though it's still in the callee-save list.

@wangpc-pp
Contributor

I support continuing to investigate how far we can get with this approach; I think this can be a good workaround to enhance "shrink wrapping". Thanks for your great idea, and I will do some evaluations too.

@mgudim
Contributor Author

mgudim commented Jun 7, 2024

Hi all, just wanted to give an update on this:

I looked at the xz degradation:

Half of it is due to the fact that, since we now have some long live ranges, early machine LICM doesn't hoist as well as before. This is easy to fix, though. If the register pressure for some class exceeds the limit, there will be a spill, and live ranges that are live through the entire loop without any uses inside it will be spilled first. So basically, I fixed this by ignoring such live-through, no-uses-inside-the-loop live ranges in InitRegPressure.

The other problem is that RA seems to be doing something wrong and I got some extra spills. I hope that can be fixed too.

The biggest problem is that we need to emit CFI directives that tell the runtime how to restore ALL callee-saved registers. With my approach I don't know what the best way to do this is. I don't know this CFI stuff very well, so I am hoping someone can offer advice / suggestions. I can only think of having a pass at the very end of the pipeline which tracks what happens to callee-saved registers and emits the directives accordingly. BTW, there is something which looks relevant in CFIFixup and CFIInstrInserter, but I haven't looked closely yet.

@topperc @preames @mshockwave @wangpc-pp @asb any suggestions how to do this?

@wangpc-pp
Contributor

How will GCC handle these CFIs?

@qcolombet
Collaborator

Hi @mgudim

At this point I just want to understand if I should continue with this approach or not. If "yes", I'll investigate the degradation on xz, figure out what goes wrong if function unwinds and add tests.

The approach is not crazy at all!
In fact, I believe LLVM's current approach is the crazy one :).
I believe you will have a bunch of regressions as it stands, though. Up to you to pursue.
Since you're doing that only for one target and not all of them, you may have an easier time tracking and fixing the regressions (as opposed to changing RA for all targets at once).

FWIW, I believe the whole CSR distinction in LLVM is wrong and the proper way to handle them would be to issue parallel copies in the entry block and exit blocks right during ISel (at the very beginning).
In a sense what you are doing is reproducing exactly that, although you do not use parallel copies.

As you saw, this change in representation creates regressions because the CSR distinction is baked left and right into the backend. So we can fix them as we find them, but there'll be a long tail.

One remark: since you're not using parallel copies, you're creating artificial interferences, and that can have a cascading effect. (See below.) Even if you were to use parallel copies (in LLVM that would be bundles of copies), that wouldn't work out-of-the-box either, since this construct is not a first-class citizen.

The TL;DR is you're testing the RA heuristics in ways that have not been exercised, and some things will get better, some things will get worse.

Example of issue with CSR using sequential copies

v1 = COPY csr1
v2 = COPY csr2
v3 = COPY csr3

If for some reason csr3 gets used for something else, you are at the mercy of the RA ordering for the coalescing. I.e., worst case, you may end up with:

some_reg = COPY csr1
csr1 = COPY csr2
csr2 = COPY csr3

instead of

csr1 = COPY csr1 # no-op
csr2 = COPY csr2 # no-op
some_reg = COPY csr3

@mgudim
Contributor Author

mgudim commented Jun 25, 2024

@qcolombet

Hi Quentin,

Thank you for the comments. After I looked into some degradations, I see that you are exactly right!

proper way to handle them would be to issue parallel copies

Interesting, you mean we should put all the copies into one instruction bundle?

As you saw, this change in representation creates regressions because the CSR distinction is baked left and right into the backend. So we can fix them as we find them, but there'll be a long tail.

In particular, in the register allocator I see that CSRs are treated differently: since there is some cost to using a CSR for the first time, there is some logic trying to minimize these costs. In this approach, I think this logic is working against us.

Here's another problem that I saw: when trying to allocate for the live range of x1 (the return-address register on RISC-V), the allocator thinks it's a good idea to evict a previously allocated CSR's live range (even if it means breaking a hint). This eventually leads to many unnecessary spills. Looks like this is exactly the situation you described here:

Example of issue with CSR using sequential copies

v1 = COPY csr1
v2 = COPY csr2
v3 = COPY csr3
If for some reason csr3 gets used for something else, you are at the mercy of the RA ordering for the coalescing. I.e., worst case, you may end up with:

some_reg = COPY csr1
csr1 = COPY csr2
csr2 = COPY csr3
instead of

csr1 = COPY csr1 # no-op
csr2 = COPY csr2 # no-op
some_reg = COPY csr3

I have tried some hacks which basically disable eviction if it breaks hints and they make the problem go away, but I am not sure what the proper solution to this should be, maybe we can discuss it during your office hour.

@qcolombet
Collaborator

@qcolombet

Hi Quentin,

Thank you for the comments. After I looked into some degradations, I see that you are exactly right!

proper way to handle them would be to issue parallel copies

Interesting, you mean we should put all the copies into one instruction bundle?

Yes, but that will not be enough because RA needs to be taught how to break the bundle if some of the copies need to be spilled.

As you saw, this change in representation creates regressions because the CSR distinction is baked left and right into the backend. So we can fix them as we find them, but there'll be a long tail.

In particular, in the register allocator I see that CSRs are treated differently: since there is some cost to using a CSR for the first time, there is some logic trying to minimize these costs. In this approach, I think this logic is working against us.

Not surprised and that's what I was referring to when I was saying that CSRs are treated differently left and right.

Here's another problem that I saw: when trying to allocate for the live range of x1 (the return-address register on RISC-V), the allocator thinks it's a good idea to evict a previously allocated CSR's live range (even if it means breaking a hint). This eventually leads to many unnecessary spills. Looks like this is exactly the situation you described here:

Example of issue with CSR using sequential copies

v1 = COPY csr1
v2 = COPY csr2
v3 = COPY csr3

If for some reason csr3 gets used for something else, you are at the mercy of the RA ordering for the coalescing. I.e., worst case, you may end up with:

some_reg = COPY csr1
csr1 = COPY csr2
csr2 = COPY csr3

instead of

csr1 = COPY csr1 # no-op
csr2 = COPY csr2 # no-op
some_reg = COPY csr3

I have tried some hacks which basically disable eviction if it breaks hints and they make the problem go away, but I am not sure what the proper solution to this should be, maybe we can discuss it during your office hour.

Sadly, I don't think there's a good solution here. RA's decisions are currently too narrow-scoped to reason about these kinds of things. The eviction-chain heuristic is supposed to help with these cases, but it is a heuristic! (Plus it may be disabled by default, I don't remember.)

Long story short, if we want to go down that path, I think that a more serious overhaul of the RA infrastructure is needed.

@michaelmaitland
Contributor

I think something went wrong on the last push.

@mgudim
Contributor Author

mgudim commented Oct 30, 2024

I think something went wrong on the last push.

Yes, thanks! I am working on cleaning up the code and posting the branch so others can try it. Then I'll break it up into commits and try to merge them one by one.

@mgudim mgudim closed this Nov 1, 2024
@mgudim mgudim deleted the save_csr_in_ra branch November 1, 2024 12:28
@mgudim mgudim restored the save_csr_in_ra branch November 6, 2024 17:26
@mgudim mgudim deleted the save_csr_in_ra branch November 6, 2024 17:26
@mgudim mgudim restored the save_csr_in_ra branch November 6, 2024 17:32
@mgudim mgudim reopened this Nov 6, 2024
@mgudim
Contributor Author

mgudim commented Nov 6, 2024

I've updated the branch with the latest prototype. To try the branch, add the following flags to clang:
-mllvm -riscv-enable-save-csr-in-ra=true
-mllvm -min-weight-ratio-needed-to-evict-hint=7.5 (this controls a new heuristic I added to the register allocator; I arrived at 7.5 by experiment, so other values could work better in your case)

Looking forward to your feedback. If there are no principal issues moving forward I can start upstreaming this work in small commits.

@michaelmaitland
Contributor

What is the thought process behind taking this approach over modifying the ShrinkWrap pass to communicate to the PrologEpilogInserter where to put each callee-saved register?
