[CostModel][AArch64] Make extractelement, with fmul user, free whenever possible #111479

Open

wants to merge 1 commit into base: main
Conversation

sushgokh
Contributor

@sushgokh sushgokh commented Oct 8, 2024

…er possible

In case of Neon, if there exists an extractelement from a lane != 0 such that

  1. the extractelement does not necessitate a move from a vector register to a GPR,
  2. the extractelement result feeds into an fmul, and
  3. the other operand of the fmul is a scalar, or an extractelement from lane 0 (or a lane equivalent to 0),

then the extractelement can be merged with the fmul in the backend and incurs no cost.

e.g.

define double @foo(<2 x double> %a) { 
  %1 = extractelement <2 x double> %a, i32 0 
  %2 = extractelement <2 x double> %a, i32 1
  %res = fmul double %1, %2    
  ret double %res
}

%2 and %res can be merged in the backend to generate:
fmul d0, d0, v0.d[1]

The change was tested with SPEC FP (C/C++) on Neoverse-v2.
Compile-time impact: none.
Performance impact: a 1.3-1.7% uplift on the lbm benchmark with -flto, depending on the configuration.

@llvmbot
Collaborator

llvmbot commented Oct 8, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Sushant Gokhale (sushgokh)

Changes

…er possible

In case of Neon, if there exists an extractelement from a lane != 0 such that

  1. the extractelement does not necessitate a move from a vector register to a GPR,
  2. the extractelement result feeds into an fmul, and
  3. the other operand of the fmul is a scalar, or an extractelement from lane 0 (or a lane equivalent to 0),

then the extractelement can be merged with the fmul in the backend and incurs no cost.

e.g.

define double @foo(<2 x double> %a) {
  %1 = extractelement <2 x double> %a, i32 0
  %2 = extractelement <2 x double> %a, i32 1
  %res = fmul double %1, %2
  ret double %res
}

%2 and %res can be merged in the backend to generate:
fmul d0, d0, v0.d[1]

The change was tested with SPEC FP (C/C++) on Neoverse-v2.
Compile-time impact: none.
Performance impact: a 1.3-1.7% uplift on the lbm benchmark with -flto, depending on the configuration.


Patch is 31.42 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111479.diff

9 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+32)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+10)
  • (modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+14-3)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+17)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+159-4)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+19-2)
  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+14-2)
  • (modified) llvm/test/Analysis/CostModel/AArch64/extract_float.ll (+16-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/consecutive-access.ll (+22-48)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 89a85bc8a90864..7a84c31679a888 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1392,6 +1392,19 @@ class TargetTransformInfo {
                                      unsigned Index = -1, Value *Op0 = nullptr,
                                      Value *Op1 = nullptr) const;
 
+  /// \return The expected cost of vector Insert and Extract.
+  /// Use -1 to indicate that there is no information on the index value.
+  /// This is used when the instruction is not available; a typical use
+  /// case is to provision the cost of vectorization/scalarization in
+  /// vectorizer passes.
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) const;
+
   /// \return The expected cost of vector Insert and Extract.
   /// This is used when instruction is available, and implementation
   /// asserts 'I' is not nullptr.
@@ -2062,6 +2075,14 @@ class TargetTransformInfo::Concept {
                                              TTI::TargetCostKind CostKind,
                                              unsigned Index, Value *Op0,
                                              Value *Op1) = 0;
+
+  virtual InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) = 0;
   virtual InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                              TTI::TargetCostKind CostKind,
                                              unsigned Index) = 0;
@@ -2726,6 +2747,17 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
                                      Value *Op1) override {
     return Impl.getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);
   }
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) override {
+    return Impl.getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1,
+                                   Scalar, ScalarAndIdxToUser,
+                                   UserToScalarAndIdx);
+  }
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) override {
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 50040dc8f6165b..deeecc462e19c7 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -683,6 +683,16 @@ class TargetTransformInfoImplBase {
     return 1;
   }
 
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) const {
+    return 1;
+  }
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) const {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index c36a346c1b2e05..48bd91c5d8c6e5 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -1277,12 +1277,23 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     return 1;
   }
 
-  InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
-                                     TTI::TargetCostKind CostKind,
-                                     unsigned Index, Value *Op0, Value *Op1) {
+  virtual InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
+                                             TTI::TargetCostKind CostKind,
+                                             unsigned Index, Value *Op0,
+                                             Value *Op1) {
     return getRegUsageForType(Val->getScalarType());
   }
 
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) {
+    return getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);
+  }
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) {
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index b5195f764cbd1c..7c588e717c4bba 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1037,6 +1037,23 @@ InstructionCost TargetTransformInfo::getVectorInstrCost(
   return Cost;
 }
 
+InstructionCost TargetTransformInfo::getVectorInstrCost(
+    unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+    Value *Op0, Value *Op1, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) const {
+  // FIXME: Assert that Opcode is either InsertElement or ExtractElement.
+  // This is mentioned in the interface description and respected by all
+  // callers, but never asserted upon.
+  InstructionCost Cost = TTIImpl->getVectorInstrCost(
+      Opcode, Val, CostKind, Index, Op0, Op1, Scalar, ScalarAndIdxToUser,
+      UserToScalarAndIdx);
+  assert(Cost >= 0 && "TTI should not produce negative costs!");
+  return Cost;
+}
+
 InstructionCost
 TargetTransformInfo::getVectorInstrCost(const Instruction &I, Type *Val,
                                         TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 80d5168ae961ab..f86e7a3715b4e4 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -16,14 +16,20 @@
 #include "llvm/CodeGen/BasicTTIImpl.h"
 #include "llvm/CodeGen/CostTable.h"
 #include "llvm/CodeGen/TargetLowering.h"
+#include "llvm/IR/DerivedTypes.h"
+#include "llvm/IR/InstrTypes.h"
+#include "llvm/IR/Instruction.h"
+#include "llvm/IR/Instructions.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/Intrinsics.h"
 #include "llvm/IR/IntrinsicsAArch64.h"
 #include "llvm/IR/PatternMatch.h"
+#include "llvm/Support/Casting.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Transforms/InstCombine/InstCombiner.h"
 #include "llvm/Transforms/Vectorize/LoopVectorizationLegality.h"
 #include <algorithm>
+#include <cassert>
 #include <optional>
 using namespace llvm;
 using namespace llvm::PatternMatch;
@@ -3145,12 +3151,20 @@ InstructionCost AArch64TTIImpl::getCFInstrCost(unsigned Opcode,
   return 0;
 }
 
-InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(const Instruction *I,
-                                                         Type *Val,
-                                                         unsigned Index,
-                                                         bool HasRealUse) {
+InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(
+    std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+    unsigned Index, bool HasRealUse, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
   assert(Val->isVectorTy() && "This must be a vector type");
 
+  const Instruction *I =
+      (std::holds_alternative<const Instruction *>(InstOrOpcode)
+           ? get<const Instruction *>(InstOrOpcode)
+           : nullptr);
+
   if (Index != -1U) {
     // Legalize the type.
     std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Val);
@@ -3194,6 +3208,134 @@ InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(const Instruction *I,
     // compile-time considerations.
   }
 
+  // In case of Neon, if there exists extractelement from lane != 0 such that
+  // 1. extractelement does not necessitate a move from vector_reg -> GPR.
+  // 2. extractelement result feeds into fmul.
+  // 3. Other operand of fmul is a scalar or extractelement from lane 0 or lane
+  // equivalent to 0.
+  // then the extractelement can be merged with fmul in the backend and it
+  // incurs no cost.
+  // e.g.
+  // define double @foo(<2 x double> %a) {
+  //   %1 = extractelement <2 x double> %a, i32 0
+  //   %2 = extractelement <2 x double> %a, i32 1
+  //   %res = fmul double %1, %2
+  //   ret double %res
+  // }
+  // %2 and %res can be merged in the backend to generate fmul v0, v0, v1.d[1]
+  auto ExtractCanFuseWithFmul = [&]() {
+    // We bail out if the extract is from lane 0.
+    if (Index == 0)
+      return false;
+
+    // Check if the scalar element type of the vector operand of ExtractElement
+    // instruction is one of the allowed types.
+    auto IsAllowedScalarTy = [&](const Type *T) {
+      return T->isFloatTy() || T->isDoubleTy() ||
+             (T->isHalfTy() && ST->hasFullFP16());
+    };
+
+    // Check if the extractelement user is scalar fmul.
+    auto IsUserFMulScalarTy = [](const Value *EEUser) {
+      // Check if the user is scalar fmul.
+      const BinaryOperator *BO = dyn_cast_if_present<BinaryOperator>(EEUser);
+      return BO && BO->getOpcode() == BinaryOperator::FMul &&
+             !BO->getType()->isVectorTy();
+    };
+
+    // InstCombine combines fmul with fadd/fsub. Hence, extractelement fusion
+    // with fmul does not happen.
+    auto IsFMulUserFAddFSub = [](const Value *FMul) {
+      return any_of(FMul->users(), [](const User *U) {
+        const BinaryOperator *BO = dyn_cast_if_present<BinaryOperator>(U);
+        return (BO && (BO->getOpcode() == BinaryOperator::FAdd ||
+                       BO->getOpcode() == BinaryOperator::FSub));
+      });
+    };
+
+    // Check if the type constraints on input vector type and result scalar type
+    // of extractelement instruction are satisfied.
+    auto TypeConstraintsOnEESatisfied =
+        [&IsAllowedScalarTy](const Type *VectorTy, const Type *ScalarTy) {
+          return isa<FixedVectorType>(VectorTy) && IsAllowedScalarTy(ScalarTy);
+        };
+
+    // Check if the extract index is from lane 0 or lane equivalent to 0 for a
+    // certain scalar type and a certain vector register width.
+    auto IsExtractLaneEquivalentToZero = [&](const unsigned &Idx,
+                                             const unsigned &EltSz) {
+      auto RegWidth =
+          getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector)
+              .getFixedValue();
+      return (Idx == 0 || (Idx * EltSz) % RegWidth == 0);
+    };
+
+    if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+      if (!TypeConstraintsOnEESatisfied(Val, Val->getScalarType()))
+        return false;
+      const auto &ScalarIdxPair = std::make_pair(Scalar, Index);
+      return ScalarAndIdxToUser.find(ScalarIdxPair) !=
+                 ScalarAndIdxToUser.end() &&
+             all_of(ScalarAndIdxToUser.at(ScalarIdxPair), [&](Value *U) {
+               if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+                 return false;
+               // 1. Check if the other operand is extract from lane 0 or lane
+               // equivalent to 0.
+               // 2. In case of SLP, if the other operand is not extract from
+               // same tree, we bail out since we can not analyze that extract.
+               return UserToScalarAndIdx.at(U).size() == 2 &&
+                      all_of(UserToScalarAndIdx.at(U), [&](auto &P) {
+                        if (ScalarIdxPair == P)
+                          return true; // Skip.
+                        return IsExtractLaneEquivalentToZero(
+                            P.second, Val->getScalarSizeInBits());
+                      });
+             });
+    } else {
+      const ExtractElementInst *EE = cast<ExtractElementInst>(I);
+
+      const ConstantInt *IdxOp = dyn_cast<ConstantInt>(EE->getIndexOperand());
+      if (!IdxOp)
+        return false;
+
+      if (!TypeConstraintsOnEESatisfied(EE->getVectorOperand()->getType(),
+                                        EE->getType()))
+        return false;
+
+      return !EE->users().empty() && all_of(EE->users(), [&](const User *U) {
+        if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+          return false;
+
+        // Check if the other operand of extractelement is also extractelement
+        // from lane equivalent to 0.
+        const BinaryOperator *BO = cast<BinaryOperator>(U);
+        const ExtractElementInst *OtherEE = dyn_cast<ExtractElementInst>(
+            BO->getOperand(0) == EE ? BO->getOperand(1) : BO->getOperand(0));
+        if (OtherEE) {
+          const ConstantInt *IdxOp =
+              dyn_cast<ConstantInt>(OtherEE->getIndexOperand());
+          if (!IdxOp)
+            return false;
+          return IsExtractLaneEquivalentToZero(
+              cast<ConstantInt>(OtherEE->getIndexOperand())
+                  ->getValue()
+                  .getZExtValue(),
+              OtherEE->getType()->getScalarSizeInBits());
+        }
+        return true;
+      });
+    }
+    return false;
+  };
+
+  if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+    const unsigned &Opcode = get<const unsigned>(InstOrOpcode);
+    if (Opcode == Instruction::ExtractElement && ExtractCanFuseWithFmul())
+      return 0;
+  } else if (I && I->getOpcode() == Instruction::ExtractElement &&
+             ExtractCanFuseWithFmul())
+    return 0;
+
   // All other insert/extracts cost this much.
   return ST->getVectorInsertExtractBaseCost();
 }
@@ -3207,6 +3349,19 @@ InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
   return getVectorInstrCostHelper(nullptr, Val, Index, HasRealUse);
 }
 
+InstructionCost AArch64TTIImpl::getVectorInstrCost(
+    unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+    Value *Op0, Value *Op1, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
+  bool HasRealUse =
+      Opcode == Instruction::InsertElement && Op0 && !isa<UndefValue>(Op0);
+  return getVectorInstrCostHelper(Opcode, Val, Index, HasRealUse, Scalar,
+                                  ScalarAndIdxToUser, UserToScalarAndIdx);
+}
+
 InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
                                                    Type *Val,
                                                    TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index 28e45207596ecd..12952ba41708b1 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -24,6 +24,7 @@
 #include "llvm/CodeGen/BasicTTIImpl.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/Intrinsics.h"
+#include <climits>
 #include <cstdint>
 #include <optional>
 
@@ -66,8 +67,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   // 'Val' and 'Index' are forwarded from 'getVectorInstrCost'; 'HasRealUse'
   // indicates whether the vector instruction is available in the input IR or
   // just imaginary in vectorizer passes.
-  InstructionCost getVectorInstrCostHelper(const Instruction *I, Type *Val,
-                                           unsigned Index, bool HasRealUse);
+  InstructionCost getVectorInstrCostHelper(
+      std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+      unsigned Index, bool HasRealUse, Value *Scalar = nullptr,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser =
+              DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>(),
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx = DenseMap<
+              Value *, SmallVector<std::pair<Value *, unsigned>, 4>>());
 
 public:
   explicit AArch64TTIImpl(const AArch64TargetMachine *TM, const Function &F)
@@ -185,6 +193,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index, Value *Op0, Value *Op1);
+
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx);
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 401597af35bdac..90a070fc3b3c48 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11633,6 +11633,17 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   std::optional<DenseMap<Value *, unsigned>> ValueToExtUses;
   DenseMap<const TreeEntry *, DenseSet<Value *>> ExtractsCount;
   SmallPtrSet<Value *, 4> ScalarOpsFromCasts;
+  // Keep track of {Scalar, Index} -> User and User -> {Scalar, Index}.
+  // On AArch64, this helps in fusing a mov instruction, associated with
+  // extractelement, with fmul in the backend so that extractelement is free.
+  DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+      ScalarAndIdxToUser;
+  DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+      UserToScalarAndIdx;
+  for (ExternalUser &EU : ExternalUses) {
+    UserToScalarAndIdx[EU.User].push_back({EU.Scalar, EU.Lane});
+    ScalarAndIdxToUser[{EU.Scalar, EU.Lane}].push_back(EU.User);
+  }
   for (ExternalUser &EU : ExternalUses) {
     // Uses by ephemeral values are free (because the ephemeral value will be
     // removed prior to code generation, and so the extraction will be
@@ -11739,8 +11750,9 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
       ExtraCost = TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
                                                 VecTy, EU.Lane);
     } else {
-      ExtraCost = TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
-                                          CostKind, EU.Lane);
+   ...
[truncated]

@llvmbot
Collaborator

llvmbot commented Oct 8, 2024

@llvm/pr-subscribers-llvm-analysis

Author: Sushant Gokhale (sushgokh)

+InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(
+    std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+    unsigned Index, bool HasRealUse, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
   assert(Val->isVectorTy() && "This must be a vector type");
 
+  const Instruction *I =
+      (std::holds_alternative<const Instruction *>(InstOrOpcode)
+           ? get<const Instruction *>(InstOrOpcode)
+           : nullptr);
+
   if (Index != -1U) {
     // Legalize the type.
     std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Val);
@@ -3194,6 +3208,134 @@ InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(const Instruction *I,
     // compile-time considerations.
   }
 
+  // In case of Neon, if there exists extractelement from lane != 0 such that
+  // 1. extractelement does not necessitate a move from vector_reg -> GPR.
+  // 2. extractelement result feeds into fmul.
+  // 3. Other operand of fmul is a scalar or extractelement from lane 0 or lane
+  // equivalent to 0.
+  // then the extractelement can be merged with fmul in the backend and it
+  // incurs no cost.
+  // e.g.
+  // define double @foo(<2 x double> %a) {
+  //   %1 = extractelement <2 x double> %a, i32 0
+  //   %2 = extractelement <2 x double> %a, i32 1
+  //   %res = fmul double %1, %2
+  //   ret double %res
+  // }
+  // %2 and %res can be merged in the backend to generate fmul d0, d0, v0.d[1]
+  auto ExtractCanFuseWithFmul = [&]() {
+    // We bail out if the extract is from lane 0.
+    if (Index == 0)
+      return false;
+
+    // Check if the scalar element type of the vector operand of ExtractElement
+    // instruction is one of the allowed types.
+    auto IsAllowedScalarTy = [&](const Type *T) {
+      return T->isFloatTy() || T->isDoubleTy() ||
+             (T->isHalfTy() && ST->hasFullFP16());
+    };
+
+    // Check if the extractelement user is scalar fmul.
+    auto IsUserFMulScalarTy = [](const Value *EEUser) {
+      // Check if the user is scalar fmul.
+      const BinaryOperator *BO = dyn_cast_if_present<BinaryOperator>(EEUser);
+      return BO && BO->getOpcode() == BinaryOperator::FMul &&
+             !BO->getType()->isVectorTy();
+    };
+
+  // If the fmul's user is a fadd/fsub, the backend folds them into
+  // fmadd/fmsub, which has no indexed-element form, so the extractelement
+    auto IsFMulUserFAddFSub = [](const Value *FMul) {
+      return any_of(FMul->users(), [](const User *U) {
+        const BinaryOperator *BO = dyn_cast_if_present<BinaryOperator>(U);
+        return (BO && (BO->getOpcode() == BinaryOperator::FAdd ||
+                       BO->getOpcode() == BinaryOperator::FSub));
+      });
+    };
+
+    // Check if the type constraints on input vector type and result scalar type
+    // of extractelement instruction are satisfied.
+    auto TypeConstraintsOnEESatisfied =
+        [&IsAllowedScalarTy](const Type *VectorTy, const Type *ScalarTy) {
+          return isa<FixedVectorType>(VectorTy) && IsAllowedScalarTy(ScalarTy);
+        };
+
+    // Check if the extract index is from lane 0 or lane equivalent to 0 for a
+    // certain scalar type and a certain vector register width.
+    auto IsExtractLaneEquivalentToZero = [&](const unsigned &Idx,
+                                             const unsigned &EltSz) {
+      auto RegWidth =
+          getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector)
+              .getFixedValue();
+      return (Idx == 0 || (Idx * EltSz) % RegWidth == 0);
+    };
+
+    if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+      if (!TypeConstraintsOnEESatisfied(Val, Val->getScalarType()))
+        return false;
+      const auto &ScalarIdxPair = std::make_pair(Scalar, Index);
+      return ScalarAndIdxToUser.find(ScalarIdxPair) !=
+                 ScalarAndIdxToUser.end() &&
+             all_of(ScalarAndIdxToUser.at(ScalarIdxPair), [&](Value *U) {
+               if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+                 return false;
+               // 1. Check if the other operand is extract from lane 0 or lane
+               // equivalent to 0.
+               // 2. In case of SLP, if the other operand is not extract from
+               // same tree, we bail out since we can not analyze that extract.
+               return UserToScalarAndIdx.at(U).size() == 2 &&
+                      all_of(UserToScalarAndIdx.at(U), [&](auto &P) {
+                        if (ScalarIdxPair == P)
+                          return true; // Skip.
+                        return IsExtractLaneEquivalentToZero(
+                            P.second, Val->getScalarSizeInBits());
+                      });
+             });
+    } else {
+      const ExtractElementInst *EE = cast<ExtractElementInst>(I);
+
+      const ConstantInt *IdxOp = dyn_cast<ConstantInt>(EE->getIndexOperand());
+      if (!IdxOp)
+        return false;
+
+      if (!TypeConstraintsOnEESatisfied(EE->getVectorOperand()->getType(),
+                                        EE->getType()))
+        return false;
+
+      return !EE->users().empty() && all_of(EE->users(), [&](const User *U) {
+        if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+          return false;
+
+        // Check if the other operand of extractelement is also extractelement
+        // from lane equivalent to 0.
+        const BinaryOperator *BO = cast<BinaryOperator>(U);
+        const ExtractElementInst *OtherEE = dyn_cast<ExtractElementInst>(
+            BO->getOperand(0) == EE ? BO->getOperand(1) : BO->getOperand(0));
+        if (OtherEE) {
+          const ConstantInt *IdxOp =
+              dyn_cast<ConstantInt>(OtherEE->getIndexOperand());
+          if (!IdxOp)
+            return false;
+          return IsExtractLaneEquivalentToZero(
+              cast<ConstantInt>(OtherEE->getIndexOperand())
+                  ->getValue()
+                  .getZExtValue(),
+              OtherEE->getType()->getScalarSizeInBits());
+        }
+        return true;
+      });
+    }
+    return false;
+  };
+
+  if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+    const unsigned &Opcode = get<const unsigned>(InstOrOpcode);
+    if (Opcode == Instruction::ExtractElement && ExtractCanFuseWithFmul())
+      return 0;
+  } else if (I && I->getOpcode() == Instruction::ExtractElement &&
+             ExtractCanFuseWithFmul())
+    return 0;
+
   // All other insert/extracts cost this much.
   return ST->getVectorInsertExtractBaseCost();
 }
@@ -3207,6 +3349,19 @@ InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
   return getVectorInstrCostHelper(nullptr, Val, Index, HasRealUse);
 }
 
+InstructionCost AArch64TTIImpl::getVectorInstrCost(
+    unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+    Value *Op0, Value *Op1, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
+  bool HasRealUse =
+      Opcode == Instruction::InsertElement && Op0 && !isa<UndefValue>(Op0);
+  return getVectorInstrCostHelper(Opcode, Val, Index, HasRealUse, Scalar,
+                                  ScalarAndIdxToUser, UserToScalarAndIdx);
+}
+
 InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
                                                    Type *Val,
                                                    TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index 28e45207596ecd..12952ba41708b1 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -24,6 +24,7 @@
 #include "llvm/CodeGen/BasicTTIImpl.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/Intrinsics.h"
+#include <climits>
 #include <cstdint>
 #include <optional>
 
@@ -66,8 +67,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   // 'Val' and 'Index' are forwarded from 'getVectorInstrCost'; 'HasRealUse'
   // indicates whether the vector instruction is available in the input IR or
   // just imaginary in vectorizer passes.
-  InstructionCost getVectorInstrCostHelper(const Instruction *I, Type *Val,
-                                           unsigned Index, bool HasRealUse);
+  InstructionCost getVectorInstrCostHelper(
+      std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+      unsigned Index, bool HasRealUse, Value *Scalar = nullptr,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser =
+              DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>(),
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx = DenseMap<
+              Value *, SmallVector<std::pair<Value *, unsigned>, 4>>());
 
 public:
   explicit AArch64TTIImpl(const AArch64TargetMachine *TM, const Function &F)
@@ -185,6 +193,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index, Value *Op0, Value *Op1);
+
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx);
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 401597af35bdac..90a070fc3b3c48 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11633,6 +11633,17 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   std::optional<DenseMap<Value *, unsigned>> ValueToExtUses;
   DenseMap<const TreeEntry *, DenseSet<Value *>> ExtractsCount;
   SmallPtrSet<Value *, 4> ScalarOpsFromCasts;
+  // Keep track of {Scalar, Index} -> User and User -> {Scalar, Index}.
+  // On AArch64, this helps in fusing a mov instruction, associated with
+  // extractelement, with fmul in the backend so that extractelement is free.
+  DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+      ScalarAndIdxToUser;
+  DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+      UserToScalarAndIdx;
+  for (ExternalUser &EU : ExternalUses) {
+    UserToScalarAndIdx[EU.User].push_back({EU.Scalar, EU.Lane});
+    ScalarAndIdxToUser[{EU.Scalar, EU.Lane}].push_back(EU.User);
+  }
   for (ExternalUser &EU : ExternalUses) {
     // Uses by ephemeral values are free (because the ephemeral value will be
     // removed prior to code generation, and so the extraction will be
@@ -11739,8 +11750,9 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
       ExtraCost = TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
                                                 VecTy, EU.Lane);
     } else {
-      ExtraCost = TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
-                                          CostKind, EU.Lane);
+   ...
[truncated]

@llvmbot
Collaborator

llvmbot commented Oct 8, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Sushant Gokhale (sushgokh)

Changes

Patch is 31.42 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111479.diff

9 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+32)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+10)
  • (modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+14-3)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+17)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+159-4)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+19-2)
  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+14-2)
  • (modified) llvm/test/Analysis/CostModel/AArch64/extract_float.ll (+16-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/consecutive-access.ll (+22-48)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 89a85bc8a90864..7a84c31679a888 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1392,6 +1392,19 @@ class TargetTransformInfo {
                                      unsigned Index = -1, Value *Op0 = nullptr,
                                      Value *Op1 = nullptr) const;
 
+  /// \return The expected cost of vector Insert and Extract.
+  /// Use -1 to indicate that there is no information on the index value.
+  /// This is used when the instruction is not available; a typical use
+  /// case is to provision the cost of vectorization/scalarization in
+  /// vectorizer passes.
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) const;
+
   /// \return The expected cost of vector Insert and Extract.
   /// This is used when instruction is available, and implementation
   /// asserts 'I' is not nullptr.
@@ -2062,6 +2075,14 @@ class TargetTransformInfo::Concept {
                                              TTI::TargetCostKind CostKind,
                                              unsigned Index, Value *Op0,
                                              Value *Op1) = 0;
+
+  virtual InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) = 0;
   virtual InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                              TTI::TargetCostKind CostKind,
                                              unsigned Index) = 0;
@@ -2726,6 +2747,17 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
                                      Value *Op1) override {
     return Impl.getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);
   }
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) override {
+    return Impl.getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1,
+                                   Scalar, ScalarAndIdxToUser,
+                                   UserToScalarAndIdx);
+  }
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) override {
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 50040dc8f6165b..deeecc462e19c7 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -683,6 +683,16 @@ class TargetTransformInfoImplBase {
     return 1;
   }
 
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) const {
+    return 1;
+  }
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) const {
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx);
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 401597af35bdac..90a070fc3b3c48 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11633,6 +11633,17 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   std::optional<DenseMap<Value *, unsigned>> ValueToExtUses;
   DenseMap<const TreeEntry *, DenseSet<Value *>> ExtractsCount;
   SmallPtrSet<Value *, 4> ScalarOpsFromCasts;
+  // Keep track of {Scalar, Index} -> User and User -> {Scalar, Index}.
+  // On AArch64, this helps in fusing a mov instruction, associated with
+  // extractelement, with fmul in the backend so that extractelement is free.
+  DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+      ScalarAndIdxToUser;
+  DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+      UserToScalarAndIdx;
+  for (ExternalUser &EU : ExternalUses) {
+    UserToScalarAndIdx[EU.User].push_back({EU.Scalar, EU.Lane});
+    ScalarAndIdxToUser[{EU.Scalar, EU.Lane}].push_back(EU.User);
+  }
   for (ExternalUser &EU : ExternalUses) {
     // Uses by ephemeral values are free (because the ephemeral value will be
     // removed prior to code generation, and so the extraction will be
@@ -11739,8 +11750,9 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
       ExtraCost = TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
                                                 VecTy, EU.Lane);
     } else {
-      ExtraCost = TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
-                                          CostKind, EU.Lane);
+   ...
[truncated]

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 2471f47 to 85ca8b2 Compare October 9, 2024 07:57

github-actions bot commented Oct 9, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch 2 times, most recently from c88143d to 08e2879 Compare October 10, 2024 05:59
@sushgokh
Contributor Author

ping

Collaborator

@davemgreen davemgreen left a comment

Hi - I think what this patch is trying to do makes a lot of sense. I guess there might be a reluctance due to the way it has to do it. There is quite a lot of new code. In the past we have tried to have the cost model functions able to test generic (non-existent) instructions. Sometimes they need to know more info (especially about uses and operands), so operands get added and sometimes a CxtI instruction to get the uses from. Am I right that that approach of passing a CxtI instruction doesn't work very well in this case because there is no instruction yet? The extract is new, and the mul is the new user of it. It would be nice if the interface was fairly generic if we can make it so, not specific to the SLP vectorizer.

unsigned Index,
bool HasRealUse) {
InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(
std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
Collaborator

Can you remove variant? I don't think it is necessary if we always pass the Opcode and then treat the instruction like we do elsewhere as an optional argument.

Contributor Author

  1. Sometimes the Instruction itself is passed to this API to get more information, so const Instruction *I has to be one of the arguments.
  2. The thinking behind using a single variable for the Instruction and Opcode (via std::variant) is that the two are related, and we never pass both unless it's required in future.

Collaborator

I would change this to avoid all the variant code:

getVectorInstrCostHelper(unsigned Opcode, Type *Val, unsigned Index, bool HasRealUse, const Instruction *CxtI)

If you have a CxtI, you can easily get the Opcode from it when calling the function. Then it just needs to check if (CxtI) for any checks that use the instruction. It feels like a lot of the added code that has to handle Opcode and I can then fold together into a single version.

Contributor Author

You really can't merge the code for the instruction and the opcode, at least for this patch, because the change won't be reflected in the cost-model tests.

  1. When Opcode != null, less information is available and we bail out early in general.
  2. When I != null, more analysis needs to be done.

I will keep separate instruction and opcode parameters as you say.


// InstCombine combines fmul with fadd/fsub. Hence, extractelement fusion
// with fmul does not happen.
auto IsFMulUserFAddFSub = [](const Value *FMul) {
Collaborator

Why is this needed? Can a fma not do the same trick?

Contributor Author

Yes, you are right.
What this check does is bail out of the optimisation if there is a chance of generating an fma.
I was getting a regression either in one of the regression tests or in the SPEC17 benchmarks before this check (I don't remember which one exactly).

Collaborator

Do you have more details? This patch feels quite complex for what it does, and anything we can do to simplify it would help.

We would usually want to avoid things that are O(n) in the number of uses, more so if it is O(n^2).

Contributor Author

Let's say the situation goes like this:

define double @foo(<2 x double> %a, double %b)
{
        %1 = extractelement <2 x double> %a, i32 0
        %2 = extractelement <2 x double> %a, i32 1
        %3 = fmul double %1, %2
        %4 = fadd double %3, %b
        ret double %4
}

Codegen with ./llc -mtriple=aarch64 -fp-contract=fast test.ll is

foo:                                    // @foo
        .cfi_startproc
// %bb.0:
        mov     d2, v0.d[1]
        fmadd   d0, d0, d2, d1
        ret

So, even if the requirements for fmul and extractelement fusion are met, if the fmul has an fadd/fsub user, we need to bail out.
Hence the patch checks whether the fmul has fadd/fsub users.
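This bail-out can be sketched in isolation. The `Op` tags and helper names below are hypothetical stand-ins for LLVM opcodes, and this is an illustrative model of the check being discussed, not the patch's exact code: the extract is only treated as free when it feeds an fmul that is not itself consumed by an fadd/fsub, since in that case the backend would rather contract fmul+fadd into fmadd.

```cpp
#include <cassert>
#include <vector>

// Hypothetical opcode tags standing in for LLVM instruction opcodes.
enum class Op { FAdd, FSub, FMul, Other };

// True when any user of the fmul is an fadd/fsub, i.e. when the backend
// is likely to contract the fmul into an fmadd/fmsub instead of fusing
// the lane extract into the fmul.
bool fmulUserIsFAddFSub(const std::vector<Op> &FMulUsers) {
  for (Op U : FMulUsers)
    if (U == Op::FAdd || U == Op::FSub)
      return true;
  return false;
}

// The extract is costed as free only if it feeds an fmul with no
// fadd/fsub user.
bool extractIsFree(bool FeedsFMul, const std::vector<Op> &FMulUsers) {
  return FeedsFMul && !fmulUserIsFAddFSub(FMulUsers);
}
```

With the extra by-element patterns from #114976 in place, a check like this becomes unnecessary, which is why it was later dropped from the patch.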

Collaborator

Oh I see. That might be a problem with the tablegen patterns. I believe it works for other indices and when using -O3 (it takes a different route, using the machine combiner). I'll see if I can put together a fix.

Contributor Author

Fixing just the tablegen might not be sufficient. This might need fixing in instcombine (I believe it generates the fma intrinsic).
Additionally, I think fma is better than fmul/extract fusion (I think I had measured it on Neoverse-v2).

Collaborator

There are some extra patterns in #114976 that might or might not help. Can you give more details on the instcombine/fma issue? I think it might be OK because we are only looking at mul here, and DAG will create the fma's depending on the optimization level (but that is hopefully fine with the new patterns).

Contributor Author

Thanks for the patch #114976. This fixes the concern I had cited in terms of the additional code. I will now just remove the code.

@sushgokh
Contributor Author

sushgokh commented Oct 24, 2024

Hi - I think what this patch is trying to do makes a lot of sense. I guess there might be a reluctance due to the way it has to do it. There is quite a lot of new code. In the past we have tried to have the cost model functions able to test generic (non-existent) instructions. Sometimes they need to know more info (especially about uses and operands), so operands get added and sometimes a CxtI instruction to get the uses from. Am I right that that approach of passing a CxtI instruction doesn't work very well in this case because there is no instruction yet? The extract is new, and the mul is the new user of it. It would be nice if the interface was fairly generic if we can make it so, not specific to the SLP vectorizer.

Currently, the way SLP vectorization works is:

  1. If the extract's user is outside the SLP tree, you know the exact user instruction.
  2. If the extract's user is going to be inside the SLP tree, we don't have information about the user; User = nullptr in such cases. SLP updates the users at a later stage, post vectorization.

So, currently, the patch is just catering to out-of-tree users (and I realised this after uploading the patch, due to my limited knowledge). I will amend this patch for in-tree users.

Now, why not have an API which uses something like CxtI?
If the user is out-of-tree, I can directly pass the user as a single argument. But for an in-tree user, we have no information (User = nullptr). In such cases we need to do more analysis, and for the same reason I need to pass an array holding all the extract information as an API argument.
I hope this answers your question.
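The two-way bookkeeping this discussion refers to mirrors the SLPVectorizer hunk earlier in the diff. As a self-contained sketch (with `std::map`/`std::vector` standing in for LLVM's `DenseMap`/`SmallVector`, and strings as stand-in `Value`s — names mirror the patch but the types are simplified):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for SLP's ExternalUser record: a scalar that lives
// in vector lane `Lane` and is consumed by `User` outside the tree.
struct ExternalUser {
  std::string Scalar;
  std::string User;
  unsigned Lane;
};

using ScalarAndIdx = std::pair<std::string, unsigned>;

struct UseMaps {
  // {Scalar, Lane} -> users of that extracted element.
  std::map<ScalarAndIdx, std::vector<std::string>> ScalarAndIdxToUser;
  // User -> all {Scalar, Lane} pairs it consumes.
  std::map<std::string, std::vector<ScalarAndIdx>> UserToScalarAndIdx;
};

// Build both directions of the mapping in one pass over the external uses,
// as the patch does before the cost loop.
UseMaps buildUseMaps(const std::vector<ExternalUser> &ExternalUses) {
  UseMaps M;
  for (const ExternalUser &EU : ExternalUses) {
    M.UserToScalarAndIdx[EU.User].push_back({EU.Scalar, EU.Lane});
    M.ScalarAndIdxToUser[{EU.Scalar, EU.Lane}].push_back(EU.User);
  }
  return M;
}
```

With both directions available, the AArch64 cost hook can ask "which lanes feed this user?" even when the user is an in-tree instruction that does not exist yet.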

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 08e2879 to 72e6397 Compare November 4, 2024 05:29
Collaborator

@davemgreen davemgreen left a comment

When you say:

I will amend this patch for in-tree users.

Do you mean to do that here?

@sushgokh
Contributor Author

sushgokh commented Nov 4, 2024

When you say:

I will amend this patch for in-tree users.

Do you mean to do that here?

Yes, amended the current patch

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 72e6397 to f4f9396 Compare November 5, 2024 05:18
Collaborator

@davemgreen davemgreen left a comment

Can you look at fixing what is wrong with the testing?



Comment on lines +1403 to +1420
Value *Scalar,
ArrayRef<std::tuple<Value *, User *, int>> ScalarUserAndIdx) const;
Collaborator

Can you try to explain these parameters in the description (hopefully in a way that doesn't sound tightly coupled with the SLP vectorizer). I don't know of a better way of passing the information this needs, if it needs to know that the user of the imaginary extract has operands which are also imaginary extracts. I'm wondering if there is a better/higher level way, or maybe we don't need to be quite so precise and just say that the extract to a mul is free. Maybe we just need to work on reducing the cost of an fp extract to 1. I wasn't able to look at this patch as much as I would have liked as all the tests are failing.

Contributor Author

I don't know of a better way of passing the information this needs,

The extract information passed to the backend is not substantial. I tried using a DenseMap to record it, but that unnecessarily complicates the API declaration. Also, since the information passed is not much, finding it in the array does not impose compile-time overhead.
But yes, I will try to describe this information in more detail, in a non-SLP-specific manner.

From the cases, it's quite clear that we can't simply say that "extract to fmul is free", because it depends on how the extract is used elsewhere.

Maybe we just need to work on reducing the cost of an fp extract to 1.

Not sure on this. I suspect reducing the cost to 1 might cause excessive extracts (I had given it a try but it causes excessive churn). This patch goes even further and reduces the cost to 0, and fortunately it does not impose compile-time overhead.

I wasn't able to look at this patch as much as I would have liked as all the tests are failing.

I haven't rebased it. Sorry, will rebase it.

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch 2 times, most recently from 656f8cb to 242576a Compare November 6, 2024 11:29
…er possible

In case of Neon, if there exists extractelement from lane != 0 such that
  1. extractelement does not necessitate a move from vector_reg -> GPR.
  2. extractelement result feeds into fmul.
  3. Other operand of fmul is a scalar or extractelement from lane 0 or lane equivalent to 0.
  then the extractelement can be merged with fmul in the backend and it incurs no cost.
  e.g.
  define double @foo(<2 x double> %a) {
    %1 = extractelement <2 x double> %a, i32 0
    %2 = extractelement <2 x double> %a, i32 1
    %res = fmul double %1, %2
    ret double %res
  }
  %2 and %res can be merged in the backend to generate:
  fmul    d0, d0, v0.d[1]

The change was tested with SPEC FP(C/C++) on Neoverse-v2.
Compile time impact: None
Performance impact: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.
@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 242576a to 5205ae1 Compare November 6, 2024 19:21
@sushgokh
Contributor Author

sushgokh commented Nov 6, 2024

Can you look at fixing what is wrong with the testing?

A check was missing. Fixed it.
