[CostModel][AArch64] Make extractelement, with fmul user, free whenever possible #111479

Open

wants to merge 1 commit into base: main
Conversation

sushgokh
Contributor

@sushgokh sushgokh commented Oct 8, 2024

…er possible

In case of Neon, if there exists an extractelement from a lane != 0 such that

  1. the extractelement does not necessitate a move from a vector register to a GPR,
  2. the extractelement result feeds into an fmul, and
  3. the other operand of the fmul is a scalar, or an extractelement from lane 0 (or a lane equivalent to 0),

then the extractelement can be merged with the fmul in the backend and incurs no cost.

e.g.

define double @foo(<2 x double> %a) { 
  %1 = extractelement <2 x double> %a, i32 0 
  %2 = extractelement <2 x double> %a, i32 1
  %res = fmul double %1, %2    
  ret double %res
}

%2 and %res can be merged in the backend to generate:
fmul d0, d0, v0.d[1]

The change was tested with SPEC FP (C/C++) on Neoverse-v2.
Compile-time impact: none.
Performance impact: a 1.3-1.7% uplift on the lbm benchmark with -flto, depending on the configuration.

@llvmbot
Collaborator

llvmbot commented Oct 8, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Sushant Gokhale (sushgokh)

Changes

…er possible

In case of Neon, if there exists an extractelement from a lane != 0 such that

  1. the extractelement does not necessitate a move from a vector register to a GPR,
  2. the extractelement result feeds into an fmul, and
  3. the other operand of the fmul is a scalar, or an extractelement from lane 0 (or a lane equivalent to 0),

then the extractelement can be merged with the fmul in the backend and incurs no cost.

e.g.

define double @foo(<2 x double> %a) {
  %1 = extractelement <2 x double> %a, i32 0
  %2 = extractelement <2 x double> %a, i32 1
  %res = fmul double %1, %2
  ret double %res
}

%2 and %res can be merged in the backend to generate:
fmul d0, d0, v0.d[1]

The change was tested with SPEC FP (C/C++) on Neoverse-v2.
Compile-time impact: none.
Performance impact: a 1.3-1.7% uplift on the lbm benchmark with -flto, depending on the configuration.


Patch is 31.42 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111479.diff

9 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+32)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+10)
  • (modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+14-3)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+17)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+159-4)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+19-2)
  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+14-2)
  • (modified) llvm/test/Analysis/CostModel/AArch64/extract_float.ll (+16-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/consecutive-access.ll (+22-48)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 89a85bc8a90864..7a84c31679a888 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1392,6 +1392,19 @@ class TargetTransformInfo {
                                      unsigned Index = -1, Value *Op0 = nullptr,
                                      Value *Op1 = nullptr) const;
 
+  /// \return The expected cost of vector Insert and Extract.
+  /// Use -1 to indicate that there is no information on the index value.
+  /// This is used when the instruction is not available; a typical use
+  /// case is to provision the cost of vectorization/scalarization in
+  /// vectorizer passes.
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) const;
+
   /// \return The expected cost of vector Insert and Extract.
   /// This is used when instruction is available, and implementation
   /// asserts 'I' is not nullptr.
@@ -2062,6 +2075,14 @@ class TargetTransformInfo::Concept {
                                              TTI::TargetCostKind CostKind,
                                              unsigned Index, Value *Op0,
                                              Value *Op1) = 0;
+
+  virtual InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) = 0;
   virtual InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                              TTI::TargetCostKind CostKind,
                                              unsigned Index) = 0;
@@ -2726,6 +2747,17 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
                                      Value *Op1) override {
     return Impl.getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);
   }
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) override {
+    return Impl.getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1,
+                                   Scalar, ScalarAndIdxToUser,
+                                   UserToScalarAndIdx);
+  }
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) override {
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 50040dc8f6165b..deeecc462e19c7 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -683,6 +683,16 @@ class TargetTransformInfoImplBase {
     return 1;
   }
 
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) const {
+    return 1;
+  }
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) const {
diff --git a/llvm/include/llvm/CodeGen/BasicTTIImpl.h b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
index c36a346c1b2e05..48bd91c5d8c6e5 100644
--- a/llvm/include/llvm/CodeGen/BasicTTIImpl.h
+++ b/llvm/include/llvm/CodeGen/BasicTTIImpl.h
@@ -1277,12 +1277,23 @@ class BasicTTIImplBase : public TargetTransformInfoImplCRTPBase<T> {
     return 1;
   }
 
-  InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
-                                     TTI::TargetCostKind CostKind,
-                                     unsigned Index, Value *Op0, Value *Op1) {
+  virtual InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
+                                             TTI::TargetCostKind CostKind,
+                                             unsigned Index, Value *Op0,
+                                             Value *Op1) {
     return getRegUsageForType(Val->getScalarType());
   }
 
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) {
+    return getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);
+  }
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) {
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index b5195f764cbd1c..7c588e717c4bba 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1037,6 +1037,23 @@ InstructionCost TargetTransformInfo::getVectorInstrCost(
   return Cost;
 }
 
+InstructionCost TargetTransformInfo::getVectorInstrCost(
+    unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+    Value *Op0, Value *Op1, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) const {
+  // FIXME: Assert that Opcode is either InsertElement or ExtractElement.
+  // This is mentioned in the interface description and respected by all
+  // callers, but never asserted upon.
+  InstructionCost Cost = TTIImpl->getVectorInstrCost(
+      Opcode, Val, CostKind, Index, Op0, Op1, Scalar, ScalarAndIdxToUser,
+      UserToScalarAndIdx);
+  assert(Cost >= 0 && "TTI should not produce negative costs!");
+  return Cost;
+}
+
 InstructionCost
 TargetTransformInfo::getVectorInstrCost(const Instruction &I, Type *Val,
                                         TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index 80d5168ae961ab..f86e7a3715b4e4 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -16,14 +16,20 @@
 #include "llvm/CodeGen/BasicTTIImpl.h"
 #include "llvm/CodeGen/CostTable.h"
 #include "llvm/CodeGen/TargetLowering.h"
+#include "llvm/IR/DerivedTypes.h"
+#include "llvm/IR/InstrTypes.h"
+#include "llvm/IR/Instruction.h"
+#include "llvm/IR/Instructions.h"
 #include "llvm/IR/IntrinsicInst.h"
 #include "llvm/IR/Intrinsics.h"
 #include "llvm/IR/IntrinsicsAArch64.h"
 #include "llvm/IR/PatternMatch.h"
+#include "llvm/Support/Casting.h"
 #include "llvm/Support/Debug.h"
 #include "llvm/Transforms/InstCombine/InstCombiner.h"
 #include "llvm/Transforms/Vectorize/LoopVectorizationLegality.h"
 #include <algorithm>
+#include <cassert>
 #include <optional>
 using namespace llvm;
 using namespace llvm::PatternMatch;
@@ -3145,12 +3151,20 @@ InstructionCost AArch64TTIImpl::getCFInstrCost(unsigned Opcode,
   return 0;
 }
 
-InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(const Instruction *I,
-                                                         Type *Val,
-                                                         unsigned Index,
-                                                         bool HasRealUse) {
+InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(
+    std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+    unsigned Index, bool HasRealUse, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
   assert(Val->isVectorTy() && "This must be a vector type");
 
+  const Instruction *I =
+      (std::holds_alternative<const Instruction *>(InstOrOpcode)
+           ? get<const Instruction *>(InstOrOpcode)
+           : nullptr);
+
   if (Index != -1U) {
     // Legalize the type.
     std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Val);
@@ -3194,6 +3208,134 @@ InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(const Instruction *I,
     // compile-time considerations.
   }
 
+  // In case of Neon, if there exists extractelement from lane != 0 such that
+  // 1. extractelement does not necessitate a move from vector_reg -> GPR.
+  // 2. extractelement result feeds into fmul.
+  // 3. Other operand of fmul is a scalar or extractelement from lane 0 or lane
+  // equivalent to 0.
+  // then the extractelement can be merged with fmul in the backend and it
+  // incurs no cost.
+  // e.g.
+  // define double @foo(<2 x double> %a) {
+  //   %1 = extractelement <2 x double> %a, i32 0
+  //   %2 = extractelement <2 x double> %a, i32 1
+  //   %res = fmul double %1, %2
+  //   ret double %res
+  // }
+  // %2 and %res can be merged in the backend to generate fmul v0, v0, v1.d[1]
+  auto ExtractCanFuseWithFmul = [&]() {
+    // We bail out if the extract is from lane 0.
+    if (Index == 0)
+      return false;
+
+    // Check if the scalar element type of the vector operand of ExtractElement
+    // instruction is one of the allowed types.
+    auto IsAllowedScalarTy = [&](const Type *T) {
+      return T->isFloatTy() || T->isDoubleTy() ||
+             (T->isHalfTy() && ST->hasFullFP16());
+    };
+
+    // Check if the extractelement user is scalar fmul.
+    auto IsUserFMulScalarTy = [](const Value *EEUser) {
+      // Check if the user is scalar fmul.
+      const BinaryOperator *BO = dyn_cast_if_present<BinaryOperator>(EEUser);
+      return BO && BO->getOpcode() == BinaryOperator::FMul &&
+             !BO->getType()->isVectorTy();
+    };
+
+    // InstCombine combines fmul with fadd/fsub. Hence, extractelement fusion
+    // with fmul does not happen.
+    auto IsFMulUserFAddFSub = [](const Value *FMul) {
+      return any_of(FMul->users(), [](const User *U) {
+        const BinaryOperator *BO = dyn_cast_if_present<BinaryOperator>(U);
+        return (BO && (BO->getOpcode() == BinaryOperator::FAdd ||
+                       BO->getOpcode() == BinaryOperator::FSub));
+      });
+    };
+
+    // Check if the type constraints on input vector type and result scalar type
+    // of extractelement instruction are satisfied.
+    auto TypeConstraintsOnEESatisfied =
+        [&IsAllowedScalarTy](const Type *VectorTy, const Type *ScalarTy) {
+          return isa<FixedVectorType>(VectorTy) && IsAllowedScalarTy(ScalarTy);
+        };
+
+    // Check if the extract index is from lane 0 or lane equivalent to 0 for a
+    // certain scalar type and a certain vector register width.
+    auto IsExtractLaneEquivalentToZero = [&](const unsigned &Idx,
+                                             const unsigned &EltSz) {
+      auto RegWidth =
+          getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector)
+              .getFixedValue();
+      return (Idx == 0 || (Idx * EltSz) % RegWidth == 0);
+    };
+
+    if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+      if (!TypeConstraintsOnEESatisfied(Val, Val->getScalarType()))
+        return false;
+      const auto &ScalarIdxPair = std::make_pair(Scalar, Index);
+      return ScalarAndIdxToUser.find(ScalarIdxPair) !=
+                 ScalarAndIdxToUser.end() &&
+             all_of(ScalarAndIdxToUser.at(ScalarIdxPair), [&](Value *U) {
+               if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+                 return false;
+               // 1. Check if the other operand is extract from lane 0 or lane
+               // equivalent to 0.
+               // 2. In case of SLP, if the other operand is not extract from
+               // same tree, we bail out since we can not analyze that extract.
+               return UserToScalarAndIdx.at(U).size() == 2 &&
+                      all_of(UserToScalarAndIdx.at(U), [&](auto &P) {
+                        if (ScalarIdxPair == P)
+                          return true; // Skip.
+                        return IsExtractLaneEquivalentToZero(
+                            P.second, Val->getScalarSizeInBits());
+                      });
+             });
+    } else {
+      const ExtractElementInst *EE = cast<ExtractElementInst>(I);
+
+      const ConstantInt *IdxOp = dyn_cast<ConstantInt>(EE->getIndexOperand());
+      if (!IdxOp)
+        return false;
+
+      if (!TypeConstraintsOnEESatisfied(EE->getVectorOperand()->getType(),
+                                        EE->getType()))
+        return false;
+
+      return !EE->users().empty() && all_of(EE->users(), [&](const User *U) {
+        if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+          return false;
+
+        // Check if the other operand of extractelement is also extractelement
+        // from lane equivalent to 0.
+        const BinaryOperator *BO = cast<BinaryOperator>(U);
+        const ExtractElementInst *OtherEE = dyn_cast<ExtractElementInst>(
+            BO->getOperand(0) == EE ? BO->getOperand(1) : BO->getOperand(0));
+        if (OtherEE) {
+          const ConstantInt *IdxOp =
+              dyn_cast<ConstantInt>(OtherEE->getIndexOperand());
+          if (!IdxOp)
+            return false;
+          return IsExtractLaneEquivalentToZero(
+              cast<ConstantInt>(OtherEE->getIndexOperand())
+                  ->getValue()
+                  .getZExtValue(),
+              OtherEE->getType()->getScalarSizeInBits());
+        }
+        return true;
+      });
+    }
+    return false;
+  };
+
+  if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+    const unsigned &Opcode = get<const unsigned>(InstOrOpcode);
+    if (Opcode == Instruction::ExtractElement && ExtractCanFuseWithFmul())
+      return 0;
+  } else if (I && I->getOpcode() == Instruction::ExtractElement &&
+             ExtractCanFuseWithFmul())
+    return 0;
+
   // All other insert/extracts cost this much.
   return ST->getVectorInsertExtractBaseCost();
 }
@@ -3207,6 +3349,19 @@ InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
   return getVectorInstrCostHelper(nullptr, Val, Index, HasRealUse);
 }
 
+InstructionCost AArch64TTIImpl::getVectorInstrCost(
+    unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+    Value *Op0, Value *Op1, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
+  bool HasRealUse =
+      Opcode == Instruction::InsertElement && Op0 && !isa<UndefValue>(Op0);
+  return getVectorInstrCostHelper(Opcode, Val, Index, HasRealUse, Scalar,
+                                  ScalarAndIdxToUser, UserToScalarAndIdx);
+}
+
 InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
                                                    Type *Val,
                                                    TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index 28e45207596ecd..12952ba41708b1 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -24,6 +24,7 @@
 #include "llvm/CodeGen/BasicTTIImpl.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/Intrinsics.h"
+#include <climits>
 #include <cstdint>
 #include <optional>
 
@@ -66,8 +67,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   // 'Val' and 'Index' are forwarded from 'getVectorInstrCost'; 'HasRealUse'
   // indicates whether the vector instruction is available in the input IR or
   // just imaginary in vectorizer passes.
-  InstructionCost getVectorInstrCostHelper(const Instruction *I, Type *Val,
-                                           unsigned Index, bool HasRealUse);
+  InstructionCost getVectorInstrCostHelper(
+      std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+      unsigned Index, bool HasRealUse, Value *Scalar = nullptr,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser =
+              DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>(),
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx = DenseMap<
+              Value *, SmallVector<std::pair<Value *, unsigned>, 4>>());
 
 public:
   explicit AArch64TTIImpl(const AArch64TargetMachine *TM, const Function &F)
@@ -185,6 +193,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index, Value *Op0, Value *Op1);
+
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx);
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 401597af35bdac..90a070fc3b3c48 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11633,6 +11633,17 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   std::optional<DenseMap<Value *, unsigned>> ValueToExtUses;
   DenseMap<const TreeEntry *, DenseSet<Value *>> ExtractsCount;
   SmallPtrSet<Value *, 4> ScalarOpsFromCasts;
+  // Keep track of {Scalar, Index} -> User and User -> {Scalar, Index}.
+  // On AArch64, this helps in fusing a mov instruction, associated with
+  // extractelement, with fmul in the backend so that extractelement is free.
+  DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+      ScalarAndIdxToUser;
+  DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+      UserToScalarAndIdx;
+  for (ExternalUser &EU : ExternalUses) {
+    UserToScalarAndIdx[EU.User].push_back({EU.Scalar, EU.Lane});
+    ScalarAndIdxToUser[{EU.Scalar, EU.Lane}].push_back(EU.User);
+  }
   for (ExternalUser &EU : ExternalUses) {
     // Uses by ephemeral values are free (because the ephemeral value will be
     // removed prior to code generation, and so the extraction will be
@@ -11739,8 +11750,9 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
       ExtraCost = TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
                                                 VecTy, EU.Lane);
     } else {
-      ExtraCost = TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
-                                          CostKind, EU.Lane);
+   ...
[truncated]

@llvmbot
Collaborator

llvmbot commented Oct 8, 2024

@llvm/pr-subscribers-llvm-analysis

Author: Sushant Gokhale (sushgokh)

+InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(
+    std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+    unsigned Index, bool HasRealUse, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
   assert(Val->isVectorTy() && "This must be a vector type");
 
+  const Instruction *I =
+      (std::holds_alternative<const Instruction *>(InstOrOpcode)
+           ? get<const Instruction *>(InstOrOpcode)
+           : nullptr);
+
   if (Index != -1U) {
     // Legalize the type.
     std::pair<InstructionCost, MVT> LT = getTypeLegalizationCost(Val);
@@ -3194,6 +3208,134 @@ InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(const Instruction *I,
     // compile-time considerations.
   }
 
+  // In case of Neon, if there exists extractelement from lane != 0 such that
+  // 1. extractelement does not necessitate a move from vector_reg -> GPR.
+  // 2. extractelement result feeds into fmul.
+  // 3. Other operand of fmul is a scalar or extractelement from lane 0 or lane
+  // equivalent to 0.
+  // then the extractelement can be merged with fmul in the backend and it
+  // incurs no cost.
+  // e.g.
+  // define double @foo(<2 x double> %a) {
+  //   %1 = extractelement <2 x double> %a, i32 0
+  //   %2 = extractelement <2 x double> %a, i32 1
+  //   %res = fmul double %1, %2
+  //   ret double %res
+  // }
+  // %2 and %res can be merged in the backend to generate fmul d0, d0, v0.d[1]
+  auto ExtractCanFuseWithFmul = [&]() {
+    // We bail out if the extract is from lane 0.
+    if (Index == 0)
+      return false;
+
+    // Check if the scalar element type of the vector operand of ExtractElement
+    // instruction is one of the allowed types.
+    auto IsAllowedScalarTy = [&](const Type *T) {
+      return T->isFloatTy() || T->isDoubleTy() ||
+             (T->isHalfTy() && ST->hasFullFP16());
+    };
+
+    // Check if the extractelement user is scalar fmul.
+    auto IsUserFMulScalarTy = [](const Value *EEUser) {
+      // Check if the user is scalar fmul.
+      const BinaryOperator *BO = dyn_cast_if_present<BinaryOperator>(EEUser);
+      return BO && BO->getOpcode() == BinaryOperator::FMul &&
+             !BO->getType()->isVectorTy();
+    };
+
+  // If the fmul's user is a fadd/fsub, the backend folds them into
+  // fmadd/fmsub, which has no indexed-element form, so the extractelement
+    auto IsFMulUserFAddFSub = [](const Value *FMul) {
+      return any_of(FMul->users(), [](const User *U) {
+        const BinaryOperator *BO = dyn_cast_if_present<BinaryOperator>(U);
+        return (BO && (BO->getOpcode() == BinaryOperator::FAdd ||
+                       BO->getOpcode() == BinaryOperator::FSub));
+      });
+    };
+
+    // Check if the type constraints on input vector type and result scalar type
+    // of extractelement instruction are satisfied.
+    auto TypeConstraintsOnEESatisfied =
+        [&IsAllowedScalarTy](const Type *VectorTy, const Type *ScalarTy) {
+          return isa<FixedVectorType>(VectorTy) && IsAllowedScalarTy(ScalarTy);
+        };
+
+    // Check if the extract index is from lane 0 or lane equivalent to 0 for a
+    // certain scalar type and a certain vector register width.
+    auto IsExtractLaneEquivalentToZero = [&](const unsigned &Idx,
+                                             const unsigned &EltSz) {
+      auto RegWidth =
+          getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector)
+              .getFixedValue();
+      return (Idx == 0 || (Idx * EltSz) % RegWidth == 0);
+    };
+
+    if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+      if (!TypeConstraintsOnEESatisfied(Val, Val->getScalarType()))
+        return false;
+      const auto &ScalarIdxPair = std::make_pair(Scalar, Index);
+      return ScalarAndIdxToUser.find(ScalarIdxPair) !=
+                 ScalarAndIdxToUser.end() &&
+             all_of(ScalarAndIdxToUser.at(ScalarIdxPair), [&](Value *U) {
+               if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+                 return false;
+               // 1. Check if the other operand is extract from lane 0 or lane
+               // equivalent to 0.
+               // 2. In case of SLP, if the other operand is not extract from
+               // same tree, we bail out since we can not analyze that extract.
+               return UserToScalarAndIdx.at(U).size() == 2 &&
+                      all_of(UserToScalarAndIdx.at(U), [&](auto &P) {
+                        if (ScalarIdxPair == P)
+                          return true; // Skip.
+                        return IsExtractLaneEquivalentToZero(
+                            P.second, Val->getScalarSizeInBits());
+                      });
+             });
+    } else {
+      const ExtractElementInst *EE = cast<ExtractElementInst>(I);
+
+      const ConstantInt *IdxOp = dyn_cast<ConstantInt>(EE->getIndexOperand());
+      if (!IdxOp)
+        return false;
+
+      if (!TypeConstraintsOnEESatisfied(EE->getVectorOperand()->getType(),
+                                        EE->getType()))
+        return false;
+
+      return !EE->users().empty() && all_of(EE->users(), [&](const User *U) {
+        if (!IsUserFMulScalarTy(U) || IsFMulUserFAddFSub(U))
+          return false;
+
+        // Check if the other operand of extractelement is also extractelement
+        // from lane equivalent to 0.
+        const BinaryOperator *BO = cast<BinaryOperator>(U);
+        const ExtractElementInst *OtherEE = dyn_cast<ExtractElementInst>(
+            BO->getOperand(0) == EE ? BO->getOperand(1) : BO->getOperand(0));
+        if (OtherEE) {
+          const ConstantInt *IdxOp =
+              dyn_cast<ConstantInt>(OtherEE->getIndexOperand());
+          if (!IdxOp)
+            return false;
+          return IsExtractLaneEquivalentToZero(
+              cast<ConstantInt>(OtherEE->getIndexOperand())
+                  ->getValue()
+                  .getZExtValue(),
+              OtherEE->getType()->getScalarSizeInBits());
+        }
+        return true;
+      });
+    }
+    return false;
+  };
+
+  if (std::holds_alternative<const unsigned>(InstOrOpcode)) {
+    const unsigned &Opcode = get<const unsigned>(InstOrOpcode);
+    if (Opcode == Instruction::ExtractElement && ExtractCanFuseWithFmul())
+      return 0;
+  } else if (I && I->getOpcode() == Instruction::ExtractElement &&
+             ExtractCanFuseWithFmul())
+    return 0;
+
   // All other insert/extracts cost this much.
   return ST->getVectorInsertExtractBaseCost();
 }
@@ -3207,6 +3349,19 @@ InstructionCost AArch64TTIImpl::getVectorInstrCost(unsigned Opcode, Type *Val,
   return getVectorInstrCostHelper(nullptr, Val, Index, HasRealUse);
 }
 
+InstructionCost AArch64TTIImpl::getVectorInstrCost(
+    unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+    Value *Op0, Value *Op1, Value *Scalar,
+    const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+        &ScalarAndIdxToUser,
+    const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+        &UserToScalarAndIdx) {
+  bool HasRealUse =
+      Opcode == Instruction::InsertElement && Op0 && !isa<UndefValue>(Op0);
+  return getVectorInstrCostHelper(Opcode, Val, Index, HasRealUse, Scalar,
+                                  ScalarAndIdxToUser, UserToScalarAndIdx);
+}
+
 InstructionCost AArch64TTIImpl::getVectorInstrCost(const Instruction &I,
                                                    Type *Val,
                                                    TTI::TargetCostKind CostKind,
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
index 28e45207596ecd..12952ba41708b1 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
@@ -24,6 +24,7 @@
 #include "llvm/CodeGen/BasicTTIImpl.h"
 #include "llvm/IR/Function.h"
 #include "llvm/IR/Intrinsics.h"
+#include <climits>
 #include <cstdint>
 #include <optional>
 
@@ -66,8 +67,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   // 'Val' and 'Index' are forwarded from 'getVectorInstrCost'; 'HasRealUse'
   // indicates whether the vector instruction is available in the input IR or
   // just imaginary in vectorizer passes.
-  InstructionCost getVectorInstrCostHelper(const Instruction *I, Type *Val,
-                                           unsigned Index, bool HasRealUse);
+  InstructionCost getVectorInstrCostHelper(
+      std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
+      unsigned Index, bool HasRealUse, Value *Scalar = nullptr,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser =
+              DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>(),
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx = DenseMap<
+              Value *, SmallVector<std::pair<Value *, unsigned>, 4>>());
 
 public:
   explicit AArch64TTIImpl(const AArch64TargetMachine *TM, const Function &F)
@@ -185,6 +193,15 @@ class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   InstructionCost getVectorInstrCost(unsigned Opcode, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index, Value *Op0, Value *Op1);
+
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx);
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 401597af35bdac..90a070fc3b3c48 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11633,6 +11633,17 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   std::optional<DenseMap<Value *, unsigned>> ValueToExtUses;
   DenseMap<const TreeEntry *, DenseSet<Value *>> ExtractsCount;
   SmallPtrSet<Value *, 4> ScalarOpsFromCasts;
+  // Keep track of {Scalar, Index} -> User and User -> {Scalar, Index}.
+  // On AArch64, this helps in fusing a mov instruction, associated with
+  // extractelement, with fmul in the backend so that extractelement is free.
+  DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+      ScalarAndIdxToUser;
+  DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+      UserToScalarAndIdx;
+  for (ExternalUser &EU : ExternalUses) {
+    UserToScalarAndIdx[EU.User].push_back({EU.Scalar, EU.Lane});
+    ScalarAndIdxToUser[{EU.Scalar, EU.Lane}].push_back(EU.User);
+  }
   for (ExternalUser &EU : ExternalUses) {
     // Uses by ephemeral values are free (because the ephemeral value will be
     // removed prior to code generation, and so the extraction will be
@@ -11739,8 +11750,9 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
       ExtraCost = TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
                                                 VecTy, EU.Lane);
     } else {
-      ExtraCost = TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
-                                          CostKind, EU.Lane);
+   ...
[truncated]

@llvmbot
Collaborator

llvmbot commented Oct 8, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Sushant Gokhale (sushgokh)

Changes

Patch is 31.42 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111479.diff

9 Files Affected:

  • (modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+32)
  • (modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+10)
  • (modified) llvm/include/llvm/CodeGen/BasicTTIImpl.h (+14-3)
  • (modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+17)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+159-4)
  • (modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h (+19-2)
  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+14-2)
  • (modified) llvm/test/Analysis/CostModel/AArch64/extract_float.ll (+16-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/consecutive-access.ll (+22-48)
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 89a85bc8a90864..7a84c31679a888 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1392,6 +1392,19 @@ class TargetTransformInfo {
                                      unsigned Index = -1, Value *Op0 = nullptr,
                                      Value *Op1 = nullptr) const;
 
+  /// \return The expected cost of vector Insert and Extract.
+  /// Use -1 to indicate that there is no information on the index value.
+  /// This is used when the instruction is not available; a typical use
+  /// case is to provision the cost of vectorization/scalarization in
+  /// vectorizer passes.
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) const;
+
   /// \return The expected cost of vector Insert and Extract.
   /// This is used when instruction is available, and implementation
   /// asserts 'I' is not nullptr.
@@ -2062,6 +2075,14 @@ class TargetTransformInfo::Concept {
                                              TTI::TargetCostKind CostKind,
                                              unsigned Index, Value *Op0,
                                              Value *Op1) = 0;
+
+  virtual InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) = 0;
   virtual InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                              TTI::TargetCostKind CostKind,
                                              unsigned Index) = 0;
@@ -2726,6 +2747,17 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
                                      Value *Op1) override {
     return Impl.getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1);
   }
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) override {
+    return Impl.getVectorInstrCost(Opcode, Val, CostKind, Index, Op0, Op1,
+                                   Scalar, ScalarAndIdxToUser,
+                                   UserToScalarAndIdx);
+  }
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) override {
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 50040dc8f6165b..deeecc462e19c7 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -683,6 +683,16 @@ class TargetTransformInfoImplBase {
     return 1;
   }
 
+  InstructionCost getVectorInstrCost(
+      unsigned Opcode, Type *Val, TTI::TargetCostKind CostKind, unsigned Index,
+      Value *Op0, Value *Op1, Value *Scalar,
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx) const {
+    return 1;
+  }
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index) const {
+      const DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+          &ScalarAndIdxToUser,
+      const DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+          &UserToScalarAndIdx);
+
   InstructionCost getVectorInstrCost(const Instruction &I, Type *Val,
                                      TTI::TargetCostKind CostKind,
                                      unsigned Index);
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 401597af35bdac..90a070fc3b3c48 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -11633,6 +11633,17 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
   std::optional<DenseMap<Value *, unsigned>> ValueToExtUses;
   DenseMap<const TreeEntry *, DenseSet<Value *>> ExtractsCount;
   SmallPtrSet<Value *, 4> ScalarOpsFromCasts;
+  // Keep track of {Scalar, Index} -> User and User -> {Scalar, Index}.
+  // On AArch64, this helps in fusing a mov instruction, associated with
+  // extractelement, with fmul in the backend so that extractelement is free.
+  DenseMap<std::pair<Value *, unsigned>, SmallVector<Value *, 4>>
+      ScalarAndIdxToUser;
+  DenseMap<Value *, SmallVector<std::pair<Value *, unsigned>, 4>>
+      UserToScalarAndIdx;
+  for (ExternalUser &EU : ExternalUses) {
+    UserToScalarAndIdx[EU.User].push_back({EU.Scalar, EU.Lane});
+    ScalarAndIdxToUser[{EU.Scalar, EU.Lane}].push_back(EU.User);
+  }
   for (ExternalUser &EU : ExternalUses) {
     // Uses by ephemeral values are free (because the ephemeral value will be
     // removed prior to code generation, and so the extraction will be
@@ -11739,8 +11750,9 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
       ExtraCost = TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
                                                 VecTy, EU.Lane);
     } else {
-      ExtraCost = TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
-                                          CostKind, EU.Lane);
+   ...
[truncated]

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 2471f47 to 85ca8b2 Compare October 9, 2024 07:57

github-actions bot commented Oct 9, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch 2 times, most recently from c88143d to 08e2879 Compare October 10, 2024 05:59
@sushgokh
Contributor Author

ping

Collaborator

@davemgreen davemgreen left a comment

Hi - I think what this patch is trying to do makes a lot of sense. I guess there might be a reluctance due to the way it has to do it. There is quite a lot of new code. In the past we have tried to have the cost model functions able to test generic (non-existent) instructions. Sometimes they need to know more info (especially about uses and operands), so operands get added and sometimes a CxtI instruction to get the uses from. Am I right that that approach of passing a CxtI instruction doesn't work very well in this case because there is no instruction yet? The extract is new, and the mul is the new user of it. It would be nice if the interface was fairly generic if we can make it so, not specific to the SLP vectorizer.

unsigned Index,
bool HasRealUse) {
InstructionCost AArch64TTIImpl::getVectorInstrCostHelper(
std::variant<const Instruction *, const unsigned> InstOrOpcode, Type *Val,
Collaborator

Can you remove variant? I don't think it is necessary if we always pass the Opcode and then treat the instruction like we do elsewhere as an optional argument.

Contributor Author

  1. Sometimes the Instruction itself is passed to this API to get more information, so const Instruction *I has to be one of the arguments.
  2. The thinking behind using a single variable for the Instruction and Opcode (via std::variant) is that the two are related, and we never pass both unless it's required in future.

Collaborator

I would change this to avoid all the variant code:

getVectorInstrCostHelper(unsigned Opcode, Type *Val, unsigned Index, bool HasRealUse, const Instruction *CxtI)

If you have a CxtI, you can easily get the Opcode from it when calling the function. Then it just needs to check if (CxtI) for any checks that use the instruction. It feels like a lot of the added code that has to handle Opcode and I can then fold together into a single version.

Contributor Author

You really can't merge the code for the instruction and the opcode, at least for this patch, because the change won't be reflected in the cost-model tests.

  1. When Opcode != null, less information is available and we bail out early in general.
  2. When I != null, more analysis needs to be done.

I will keep separate instruction and opcode parameters as you say.


// InstCombine combines fmul with fadd/fsub. Hence, extractelement fusion
// with fmul does not happen.
auto IsFMulUserFAddFSub = [](const Value *FMul) {
Collaborator

Why is this needed? Can a fma not do the same trick?

Contributor Author

Yes, you are right.
What this check does is bail out of the optimisation if there is a chance of generating an fma.
I was getting a regression either in one of the regression tests or in the SPEC17 benchmarks before this check (I don't remember which one exactly).

Collaborator

Do you have more details? This patch feels quite complex for what it does, and anything we can do to simplify it would help.

We would usually want to avoid things that are O(n) in the number of uses, more so if it is O(n^2).

Contributor Author

Let's say the situation goes like this:

define double @foo(<2 x double> %a, double %b)
{
        %1 = extractelement <2 x double> %a, i32 0
        %2 = extractelement <2 x double> %a, i32 1
        %3 = fmul double %1, %2
        %4 = fadd double %3, %b
        ret double %4
}

Codegen with ./llc -mtriple=aarch64 -fp-contract=fast test.ll is

foo:                                    // @foo
        .cfi_startproc
// %bb.0:
        mov     d2, v0.d[1]
        fmadd   d0, d0, d2, d1
        ret

So, even if the requirements for fmul and extractelement fusion are met, if the fmul has an fadd/fsub user, we need to bail out.
Hence the patch checks whether the fmul has fadd/fsub users.
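This bail-out can be sketched in isolation. The `Op` tags and helper names below are hypothetical stand-ins for LLVM opcodes, and this is an illustrative model of the check being discussed, not the patch's exact code: the extract is only treated as free when it feeds an fmul that is not itself consumed by an fadd/fsub, since in that case the backend would rather contract fmul+fadd into fmadd.

```cpp
#include <cassert>
#include <vector>

// Hypothetical opcode tags standing in for LLVM instruction opcodes.
enum class Op { FAdd, FSub, FMul, Other };

// True when any user of the fmul is an fadd/fsub, i.e. when the backend
// is likely to contract the fmul into an fmadd/fmsub instead of fusing
// the lane extract into the fmul.
bool fmulUserIsFAddFSub(const std::vector<Op> &FMulUsers) {
  for (Op U : FMulUsers)
    if (U == Op::FAdd || U == Op::FSub)
      return true;
  return false;
}

// The extract is costed as free only if it feeds an fmul with no
// fadd/fsub user.
bool extractIsFree(bool FeedsFMul, const std::vector<Op> &FMulUsers) {
  return FeedsFMul && !fmulUserIsFAddFSub(FMulUsers);
}
```

With the extra by-element patterns from #114976 in place, a check like this becomes unnecessary, which is why it was later dropped from the patch.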

Collaborator

Oh I see. That might be a problem with the tablegen patterns. I believe it works for other indices and when using -O3 (it takes a different route, using the machine combiner). I'll see if I can put together a fix.

Contributor Author

Fixing just the tablegen might not be sufficient. This might need fixing in instcombine (I believe it generates the fma intrinsic).
Additionally, I think fma is better than fmul/extract fusion (I think I had measured it on Neoverse-v2).

Collaborator

There are some extra patterns in #114976 that might or might not help. Can you give more details on the instcombine/fma issue? I think it might be OK because we are only looking at mul here, and DAG will create the fma's depending on the optimization level (but that is hopefully fine with the new patterns).

Contributor Author

Thanks for the patch #114976. This fixes the concern I had cited in terms of the additional code. I will now just remove the code.

@sushgokh
Contributor Author

sushgokh commented Oct 24, 2024

Hi - I think what this patch is trying to do makes a lot of sense. I guess there might be a reluctance due to the way it has to do it. There is quite a lot of new code. In the past we have tried to have the cost model functions able to test generic (non-existent) instructions. Sometimes they need to know more info (especially about uses and operands), so operands get added and sometimes a CxtI instruction to get the uses from. Am I right that that approach of passing a CxtI instruction doesn't work very well in this case because there is no instruction yet? The extract is new, and the mul is the new user of it. It would be nice if the interface was fairly generic if we can make it so, not specific to the SLP vectorizer.

Currently, the way SLP vectorization works is:

  1. If the extract's user is outside the SLP tree, you know the exact user instruction.
  2. If the extract's user is going to be inside the SLP tree, we don't have information about the user; User = nullptr in such cases. SLP updates the users at a later stage, post vectorization.

So, currently, the patch is just catering to out-of-tree users (and I realised this after uploading the patch, due to my limited knowledge). I will amend this patch for in-tree users.

Now, why not have an API which uses something like CxtI?
If the user is out-of-tree, I can directly pass the user as a single argument. But for an in-tree user, we have no information (User = nullptr). In such cases we need to do more analysis, and for the same reason I need to pass an array holding all the extract information as an API argument.
I hope this answers your question.
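The two-way bookkeeping this discussion refers to mirrors the SLPVectorizer hunk earlier in the diff. As a self-contained sketch (with `std::map`/`std::vector` standing in for LLVM's `DenseMap`/`SmallVector`, and strings as stand-in `Value`s — names mirror the patch but the types are simplified):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for SLP's ExternalUser record: a scalar that lives
// in vector lane `Lane` and is consumed by `User` outside the tree.
struct ExternalUser {
  std::string Scalar;
  std::string User;
  unsigned Lane;
};

using ScalarAndIdx = std::pair<std::string, unsigned>;

struct UseMaps {
  // {Scalar, Lane} -> users of that extracted element.
  std::map<ScalarAndIdx, std::vector<std::string>> ScalarAndIdxToUser;
  // User -> all {Scalar, Lane} pairs it consumes.
  std::map<std::string, std::vector<ScalarAndIdx>> UserToScalarAndIdx;
};

// Build both directions of the mapping in one pass over the external uses,
// as the patch does before the cost loop.
UseMaps buildUseMaps(const std::vector<ExternalUser> &ExternalUses) {
  UseMaps M;
  for (const ExternalUser &EU : ExternalUses) {
    M.UserToScalarAndIdx[EU.User].push_back({EU.Scalar, EU.Lane});
    M.ScalarAndIdxToUser[{EU.Scalar, EU.Lane}].push_back(EU.User);
  }
  return M;
}
```

With both directions available, the AArch64 cost hook can ask "which lanes feed this user?" even when the user is an in-tree instruction that does not exist yet.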

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 08e2879 to 72e6397 Compare November 4, 2024 05:29
Collaborator

@davemgreen davemgreen left a comment

When you say:

I will amend this patch for in-tree users.

Do you mean to do that here?

@sushgokh
Contributor Author

sushgokh commented Nov 4, 2024

When you say:

I will amend this patch for in-tree users.

Do you mean to do that here?

Yes, amended the current patch

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 72e6397 to f4f9396 Compare November 5, 2024 05:18
Collaborator

@davemgreen davemgreen left a comment

Can you look at fixing what is wrong with the testing?



Comment on lines +1403 to +1420
Value *Scalar,
ArrayRef<std::tuple<Value *, User *, int>> ScalarUserAndIdx) const;
Collaborator

Can you try to explain these parameters in the description (hopefully in a way that doesn't sound tightly coupled with the SLP vectorizer). I don't know of a better way of passing the information this needs, if it needs to know that the user of the imaginary extract has operands which are also imaginary extracts. I'm wondering if there is a better/higher level way, or maybe we don't need to be quite so precise and just say that the extract to a mul is free. Maybe we just need to work on reducing the cost of an fp extract to 1. I wasn't able to look at this patch as much as I would have liked as all the tests are failing.

Contributor Author

I don't know of a better way of passing the information this needs,

The extract information passed to the backend is not substantial. I tried using a DenseMap to record it, but that unnecessarily complicates the API declaration. Also, since the information passed is not much, finding it in the array does not impose compile-time overhead.
But yes, I will try to describe this information in more detail, in a non-SLP-specific manner.

From the cases, it's quite clear that we can't simply say that "extract to fmul is free", because it depends on how the extract is used elsewhere.

Maybe we just need to work on reducing the cost of an fp extract to 1.

Not sure on this. I suspect reducing the cost to 1 might cause excessive extracts (I had given it a try but it causes excessive churn). This patch goes even further and reduces the cost to 0, and fortunately it does not impose compile-time overhead.

I wasn't able to look at this patch as much as I would have liked as all the tests are failing.

I haven't rebased it. Sorry, will rebase it.

@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch 2 times, most recently from 656f8cb to 242576a Compare November 6, 2024 11:29
…er possible

In case of Neon, if there exists extractelement from lane != 0 such that
  1. extractelement does not necessitate a move from vector_reg -> GPR.
  2. extractelement result feeds into fmul.
  3. Other operand of fmul is a scalar or extractelement from lane 0 or lane equivalent to 0.
  then the extractelement can be merged with fmul in the backend and it incurs no cost.
  e.g.
  define double @foo(<2 x double> %a) {
    %1 = extractelement <2 x double> %a, i32 0
    %2 = extractelement <2 x double> %a, i32 1
    %res = fmul double %1, %2
    ret double %res
  }
  %2 and %res can be merged in the backend to generate:
  fmul    d0, d0, v0.d[1]

The change was tested with SPEC FP(C/C++) on Neoverse-v2.
Compile time impact: None
Performance impact: Observing 1.3-1.7% uplift on lbm benchmark with -flto depending upon the config.
@sushgokh sushgokh force-pushed the extractelement-costing-with-fmul-user branch from 242576a to 5205ae1 Compare November 6, 2024 19:21
@sushgokh
Contributor Author

sushgokh commented Nov 6, 2024

Can you look at fixing what is wrong with the testing?

A check was missing. Fixed it.
