
[NVPTX] Add patterns for fma.relu.{f16|bf16} #114977

Open. Wants to merge 1 commit into main from the fma-relu branch.

Conversation

@hdelan (Contributor) commented Nov 5, 2024

Add patterns to lower fma(a, b, c) > 0 ? fma(a, b, c) : 0 to fma.rn.relu for f16 and bf16 types.
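For readers skimming the PR, the select idiom being lowered can be sketched in plain Python (illustrative only: Python floats are f64 and a*b + c below is not actually fused, so this models the selection logic rather than f16/bf16 rounding):

```python
# Sketch of the operation the new patterns lower to fma.rn.relu:
#   relu(fma(a, b, c))  ==  fma(a, b, c) > 0 ? fma(a, b, c) : 0
def fma_relu(a: float, b: float, c: float) -> float:
    t = a * b + c          # stand-in for a fused multiply-add
    return t if t > 0 else 0.0

print(fma_relu(2.0, 3.0, -1.0))  # 5.0
print(fma_relu(2.0, 3.0, -7.0))  # 0.0 (negative result clamped by the relu)
```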

@llvmbot (Collaborator) commented Nov 5, 2024

@llvm/pr-subscribers-backend-nvptx

Author: Hugh Delaney (hdelan)

Changes

Add patterns to lower fma(a, b, c) > 0 ? fma(a, b, c) : 0 for f16 and bf16 types.


Full diff: https://github.com/llvm/llvm-project/pull/114977.diff

2 Files Affected:

  • (modified) llvm/lib/Target/NVPTX/NVPTXInstrInfo.td (+16)
  • (added) llvm/test/CodeGen/NVPTX/fma-relu.ll (+77)
diff --git a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
index 5f6cba397c5352..52312fa9afbd7e 100644
--- a/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
+++ b/llvm/lib/Target/NVPTX/NVPTXInstrInfo.td
@@ -3917,3 +3917,19 @@ def atomic_thread_fence_seq_cst_cta :
 def atomic_thread_fence_acq_rel_cta :
   NVPTXInst<(outs), (ins), "fence.acq_rel.cta;", []>,
   Requires<[hasPTX<60>, hasSM<70>]>;
+
+def fpimm0 : FPImmLeaf<fAny, [{
+  return Imm.isExactlyValue(+0.0);
+}]>;
+
+def FMARELU :
+  NVPTXInst<(outs Int16Regs:$dst), (ins Int16Regs:$a, Int16Regs:$b, Int16Regs:$c),
+            "fma.rn.relu \t$dst, $a, $b, $c;", []>;
+
+def : Pat<(f16 (fmaxnum (fma Int16Regs:$a, Int16Regs:$b, Int16Regs:$c), fpimm0)),
+  (FMARELU Int16Regs:$a, Int16Regs:$b, Int16Regs:$c)>,
+  Requires<[useFP16Math, allowFMA, allowUnsafeFPMath, hasPTX<60>, hasSM<70>]>;
+
+def : Pat<(bf16 (fmaxnum (fma Int16Regs:$a, Int16Regs:$b, Int16Regs:$c), fpimm0)),
+  (FMARELU Int16Regs:$a, Int16Regs:$b, Int16Regs:$c)>,
+  Requires<[hasBF16Math, allowFMA, allowUnsafeFPMath, hasPTX<60>, hasSM<70>]>;
diff --git a/llvm/test/CodeGen/NVPTX/fma-relu.ll b/llvm/test/CodeGen/NVPTX/fma-relu.ll
new file mode 100644
index 00000000000000..6c340ef9d53015
--- /dev/null
+++ b/llvm/test/CodeGen/NVPTX/fma-relu.ll
@@ -0,0 +1,77 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc < %s -march=nvptx64 --enable-unsafe-fp-math -mcpu=sm_80 -mattr=ptx70 -verify-machineinstrs -fp-contract=fast -nvptx-fma-level=2 | FileCheck %s
+; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_80 -mattr=ptx70 -verify-machineinstrs -fp-contract=fast -nvptx-fma-level=2 | %ptxas-verify -arch=sm_80 %}
+
+define half @fma_f16(half %a, half %b, half %c) {
+; CHECK-LABEL: fma_f16(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b16 %rs<5>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.b16 %rs1, [fma_f16_param_0];
+; CHECK-NEXT:    ld.param.b16 %rs2, [fma_f16_param_1];
+; CHECK-NEXT:    ld.param.b16 %rs3, [fma_f16_param_2];
+; CHECK-NEXT:    fma.rn.relu %rs4, %rs1, %rs2, %rs3;
+; CHECK-NEXT:    st.param.b16 [func_retval0], %rs4;
+; CHECK-NEXT:    ret;
+  %1 = call half @llvm.fma.f16(half %a, half %b, half %c)
+  %2 = fcmp ogt half %1, 0.0
+  %3 = select i1 %2, half %1, half 0.0
+  ret half %3
+}
+
+define half @fma_f16_expanded(half %a, half %b, half %c) {
+; CHECK-LABEL: fma_f16_expanded(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b16 %rs<5>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.b16 %rs1, [fma_f16_expanded_param_0];
+; CHECK-NEXT:    ld.param.b16 %rs2, [fma_f16_expanded_param_1];
+; CHECK-NEXT:    ld.param.b16 %rs3, [fma_f16_expanded_param_2];
+; CHECK-NEXT:    fma.rn.relu %rs4, %rs1, %rs2, %rs3;
+; CHECK-NEXT:    st.param.b16 [func_retval0], %rs4;
+; CHECK-NEXT:    ret;
+  %1 = fmul half %a, %b
+  %2 = fadd half %1, %c
+  %3 = fcmp ogt half %2, 0.0
+  %4 = select i1 %3, half %2, half 0.0
+  ret half %4
+}
+
+define bfloat @fma_bf16(bfloat %a, bfloat %b, bfloat %c) {
+; CHECK-LABEL: fma_bf16(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b16 %rs<5>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.b16 %rs1, [fma_bf16_param_0];
+; CHECK-NEXT:    ld.param.b16 %rs2, [fma_bf16_param_1];
+; CHECK-NEXT:    ld.param.b16 %rs3, [fma_bf16_param_2];
+; CHECK-NEXT:    fma.rn.relu %rs4, %rs1, %rs2, %rs3;
+; CHECK-NEXT:    st.param.b16 [func_retval0], %rs4;
+; CHECK-NEXT:    ret;
+  %1 = call bfloat @llvm.fma.bf16(bfloat %a, bfloat %b, bfloat %c)
+  %2 = fcmp ogt bfloat %1, 0.0
+  %3 = select i1 %2, bfloat %1, bfloat 0.0
+  ret bfloat %3
+}
+
+define bfloat @fma_bf16_expanded(bfloat %a, bfloat %b, bfloat %c) {
+; CHECK-LABEL: fma_bf16_expanded(
+; CHECK:       {
+; CHECK-NEXT:    .reg .b16 %rs<5>;
+; CHECK-EMPTY:
+; CHECK-NEXT:  // %bb.0:
+; CHECK-NEXT:    ld.param.b16 %rs1, [fma_bf16_expanded_param_0];
+; CHECK-NEXT:    ld.param.b16 %rs2, [fma_bf16_expanded_param_1];
+; CHECK-NEXT:    ld.param.b16 %rs3, [fma_bf16_expanded_param_2];
+; CHECK-NEXT:    fma.rn.relu %rs4, %rs1, %rs2, %rs3;
+; CHECK-NEXT:    st.param.b16 [func_retval0], %rs4;
+; CHECK-NEXT:    ret;
+  %1 = fmul bfloat %a, %b
+  %2 = fadd bfloat %1, %c
+  %3 = fcmp ogt bfloat %2, 0.0
+  %4 = select i1 %3, bfloat %2, bfloat 0.0
+  ret bfloat %4
+}

@hdelan changed the title from "Add patterns for fma.relu.{f16|bf16}" to "[NVPTX] Add patterns for fma.relu.{f16|bf16}" on Nov 5, 2024
@hdelan (Contributor Author) commented Nov 5, 2024

Ping @ldrumm @frasercrmck

@justinfargnoli (Contributor) left a comment

Overall, LGTM!

Please wait for @AlexMaclean's review though as he's more familiar with NVPTXInstrInfo.td than I am.


def FMARELU_F16 :
NVPTXInst<(outs Int16Regs:$dst), (ins Int16Regs:$a, Int16Regs:$b, Int16Regs:$c),
"fma.rn.relu.f16 \t$dst, $a, $b, $c;", []>;
Contributor:

Do we need the Requires<...> on the instruction too?

hdelan (Contributor Author):

It's only being used by the anonymous patterns below, which have the necessary Requires. I don't think we need to introduce extra noise by repeating them here.

Member:

I think applying constraint to the instruction itself is the right thing to do. We do not want them to be emitted unintentionally, even if we do not do it now.

I do not know whether the constraint propagates to the pattern, but I think it may, so applying it here should do the job. It's easy enough to test by running the tests while targeting an older GPU.

hdelan (Contributor Author):

I've added the PTX and arch requirements on the instruction, and the pattern Requires just on the pattern.

@AlexMaclean (Member) commented:

Suppose the fma has more uses in addition to the fmaxnum. If this optimization kicks in, it may increase register pressure and won't be a clear win in terms of performance. I'm not sure this will be a problem, but to be conservative it may be better to implement this as a DAG combine and verify the fma has a single use.
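The concern can be modeled abstractly: if the fma value feeds both the max and another user, folding it into fma.relu still leaves the other user needing the plain fma result, so both values stay live. The toy Python below mirrors the suggested single-use guard (the Node class and names are hypothetical, not the SelectionDAG API):

```python
class Node:
    """Toy dataflow node with a use count, standing in for a DAG node."""
    def __init__(self, op: str, *operands: "Node"):
        self.op = op
        self.operands = list(operands)
        self.uses = 0
        for o in operands:
            o.uses += 1

def try_fold_fma_relu(max_node: Node):
    # Fold max(fma(a, b, c), 0) -> fma_relu(a, b, c) only when the fma
    # has no other users; otherwise the plain fma stays live anyway and
    # the fold would just add register pressure.
    fma = max_node.operands[0]
    if fma.op != "fma" or fma.uses != 1:
        return None
    return Node("fma_relu", *fma.operands)

a, b, c, zero = Node("reg"), Node("reg"), Node("reg"), Node("zero")
single = Node("max", Node("fma", a, b, c), zero)
print(try_fold_fma_relu(single).op)       # fma_relu (fma has one use)

shared = Node("fma", a, b, c)
extra_user = Node("add", shared, c)        # second user keeps the fma live
print(try_fold_fma_relu(Node("max", shared, zero)))  # None (fold refused)
```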

Comment on lines +35 to +304
%1 = fmul half %a, %b
%2 = fadd half %1, %c
Member:

It would be good to add a couple more test runs:

  • one without mul/add -> fma contraction, to make sure we do not use fma.rn.relu unintentionally.
  • one targeting older GPUs to make sure we do not emit fma.rn.relu there.

hdelan (Contributor Author):

I've added more tests to cover these cases.

Contributor:

Is it worth also having a test case that uses llvm.maxnum? I believe that if the IR was given the right fast-math flags, InstCombine would transform this select into an llvm.maxnum anyway.

Speaking of, should we also have tests with fast-math flags? My feeling is that we should see fast-math flags in the IR as if this was really coming from a frontend with -ffast-math (or equivalent). IIRC the NVPTX backend relies on the unsafe-fp-math function attribute being set, which enables these fast-math optimizations. I think we should have a test with fast-math flags, fast-math function attributes, and the default llc flags (no --enable-unsafe-fp-math, no -nvptx-fma-level). We should still generate fma.relu in that case, right? This, imo, should be "the" canonical test of this optimization; using various llc flags like this is a less standardised approach.
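On the llvm.maxnum point: the fcmp ogt/select idiom and IEEE-754 maxNum (which llvm.maxnum follows) agree on ordinary inputs but can diverge on NaN, which is why the canonicalization depends on fast-math flags. A rough Python model (my own sketch, not LLVM's exact semantics):

```python
import math

nan = float("nan")

def select_gt(x: float, y: float) -> float:
    # Models: %c = fcmp ogt x, y ; select %c, x, y
    # 'ogt' is false whenever either operand is NaN.
    return x if x > y else y

def maxnum(x: float, y: float) -> float:
    # Models IEEE-754 maxNum / llvm.maxnum: a single quiet NaN
    # operand is dropped in favor of the numeric operand.
    if math.isnan(x):
        return y
    if math.isnan(y):
        return x
    return max(x, y)

print(select_gt(3.0, 0.0), maxnum(3.0, 0.0))  # 3.0 3.0 (agree on numbers)
print(select_gt(1.0, nan))                    # nan (select falls through to y)
print(maxnum(1.0, nan))                       # 1.0 (maxNum ignores the NaN)
```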


(An outdated comment thread on llvm/lib/Target/NVPTX/NVPTXInstrInfo.td was resolved.)

def FMARELU_F16 :
NVPTXInst<(outs Int16Regs:$dst), (ins Int16Regs:$a, Int16Regs:$b, Int16Regs:$c),
"fma.rn.relu.f16 \t$dst, $a, $b, $c;", []>,
Member:

Next question: what do we want to do about .ftz?
We handle it for regular FMA instructions and it's probably needed here, too.
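For context on .ftz: PTX's flush-to-zero modifier replaces subnormal operands and results with sign-preserving zero. A rough numeric model in Python, using the fp16 smallest-normal threshold 2^-14 (illustrative only; this is my sketch of the behavior, not a PTX reference):

```python
import math

FP16_MIN_NORMAL = 2.0 ** -14  # smallest positive normal fp16 value

def ftz(x: float) -> float:
    # Rough model of flush-to-zero: subnormal magnitudes become
    # sign-preserving zero; zeros and normals pass through unchanged.
    if x != 0.0 and abs(x) < FP16_MIN_NORMAL:
        return math.copysign(0.0, x)
    return x

print(ftz(2.0 ** -20))  # 0.0 (subnormal in fp16, flushed)
print(ftz(2.0 ** -10))  # 0.0009765625 (normal, kept)
```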

; RUN: %if ptxas %{ llc < %s -march=nvptx64 -mcpu=sm_80 -mattr=ptx70 -verify-machineinstrs -fp-contract=fast -nvptx-fma-level=2 | %ptxas-verify -arch=sm_80 %}
; RUN: llc < %s -march=nvptx64 --enable-unsafe-fp-math -mcpu=sm_80 -mattr=ptx70 -verify-machineinstrs -fp-contract=fast -nvptx-fma-level=0 | FileCheck %s --check-prefixes=CHECK-NO-FMA
; RUN: llc < %s -march=nvptx64 --enable-unsafe-fp-math -mcpu=sm_70 -mattr=ptx70 -verify-machineinstrs -fp-contract=fast -nvptx-fma-level=2 | FileCheck %s --check-prefixes=CHECK-NO-ARCH
; RUN: llc < %s -march=nvptx64 --enable-unsafe-fp-math -mcpu=sm_70 -mattr=ptx70 -verify-machineinstrs -fp-contract=fast -nvptx-fma-level=2 | FileCheck %s --check-prefixes=CHECK-NO-PTX
Contributor:

This RUN line is the same as the one above.

Also maybe CHECK-SM80 and CHECK-SM70 are better check names? CHECK-NO-ARCH and CHECK-NO-PTX don't really explain to me what they're checking or why.


@hdelan force-pushed the fma-relu branch 4 times, most recently from 4759e15 to 9456007, on November 6, 2024 11:20
@ldrumm (Contributor) left a comment

Looks good!

@@ -0,0 +1,349 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
Contributor:

The tests look reasonable to me in their behaviour, but I'd prefer they use -stop-after=finalize-isel so we can test just the isel in isolation.

@@ -3917,3 +3917,40 @@ def atomic_thread_fence_seq_cst_cta :
def atomic_thread_fence_acq_rel_cta :
NVPTXInst<(outs), (ins), "fence.acq_rel.cta;", []>,
Requires<[hasPTX<60>, hasSM<70>]>;

def fpimm0 : FPImmLeaf<fAny, [{
Contributor:

It's not clear from the name that this is strictly positive zero. Maybe def positive_zero_fp, or a better equivalent if you can think of one.

Add patterns to lower fma(a, b, c) > 0 ? fma(a, b, c) : 0 for f16 and
bf16 types.