
Releases/gcc 12 #65

Open · wants to merge 2,603 commits into master

Conversation

jacopobrusini

Support for Apple Silicon!!!

@jwakely (Contributor) commented Feb 21, 2024

This is an unofficial mirror that has nothing to do with the GCC project, so submitting pull requests here is a waste of time.

Also, I have no idea what this pull request is trying to do but it would never be accepted even if it was submitted to the right place.

GCC Administrator and others added 28 commits August 11, 2024 00:19
For the pattern below, the RA may still allocate r162 as a v/k register and
try to reload the address with leaq __libc_tsd_CTYPE_B@gottpoff(%rip), %rsi,
which results in a linker error.

(set (reg:DI 162)
     (mem/u/c:DI
       (const:DI (unspec:DI
		 [(symbol_ref:DI ("a") [flags 0x60]  <var_decl 0x7f621f6e1c60 a>)]
		 UNSPEC_GOTNTPOFF))

Quote from H.J. on why the linker issues an error:
>What do these do:
>
>        leaq    __libc_tsd_CTYPE_B@gottpoff(%rip), %rax
>        vmovq   (%rax), %xmm0
>
>From x86-64 TLS psABI:
>
>The assembler generates for the x@gottpoff(%rip) expressions a
>R_X86_64_GOTTPOFF relocation for the symbol x which requests the linker to
>generate a GOT entry with a R_X86_64_TPOFF64 relocation. The offset of
>the GOT entry relative to the end of the instruction is then used in
>the instruction. The R_X86_64_TPOFF64 relocation is processed at
>program startup time by the dynamic linker by looking up the symbol x
>in the modules loaded at that point. The offset is written in the GOT
>entry and later loaded by the addq instruction.
>
>The above code sequence looks wrong to me.
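
A minimal sketch of the kind of TLS access that goes through @gottpoff (an
illustration of the initial-exec access pattern only, not the actual
gcc.target/i386/pr116043.c testcase):

/* Hypothetical reduction: with -fPIC, an initial-exec TLS variable is
   reached through a GOT slot addressed as sym@gottpoff(%rip); the fix
   keeps reload from putting that address into a v/k register.  */
extern __thread int a __attribute__ ((tls_model ("initial-exec")));

int
f (void)
{
  return a;
}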

gcc/ChangeLog:

	PR target/116043
	* config/i386/constraints.md (Bk): Refine to
	define_special_memory_constraint.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr116043.c: New test.

(cherry picked from commit bc1fda0)
Not sure how this happened, but: svsudot is supposed to be expanded
as USDOT with the operands swapped.  However, a thinko in the
expansion of svsudot meant that the arguments weren't in fact
swapped; the attempted swap was just a no-op.  And the testcases
blithely accepted that.

gcc/
	PR target/114607
	* config/aarch64/aarch64-sve-builtins-base.cc
	(svusdot_impl::expand): Fix botched attempt to swap the operands
	for svsudot.

gcc/testsuite/
	PR target/114607
	* gcc.target/aarch64/sve/acle/asm/sudot_s32.c: New test.

(cherry picked from commit 2c1c248)
aarch64-sve.md had a pattern that combined:

	cmpeq	pb.T, pa/z, zc.T, #0
	mov	zd.T, pb/z, #1

into:

	cnot	zd.T, pa/m, zc.T

But this is only valid if pa.T is a ptrue.  In other cases, the
original would set inactive elements of zd.T to 0, whereas the
combined form would copy elements from zc.T.

gcc/
	PR target/114603
	* config/aarch64/aarch64-sve.md (@aarch64_pred_cnot<mode>): Replace
	with...
	(@aarch64_ptrue_cnot<mode>): ...this, requiring operand 1 to be
	a ptrue.
	(*cnot<mode>): Require operand 1 to be a ptrue.
	* config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand):
	Use aarch64_ptrue_cnot<mode> for _x operations that are predicated
	with a ptrue.  Represent other _x operations as fully-defined _m
	operations.

gcc/testsuite/
	PR target/114603
	* gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.

(cherry picked from commit 67cbb1c)
The test was too optimistic, alas.  We used to vectorize shifts by
clamping the shift counts below the bit width of the types (e.g. at 15
for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is
well defined (because of promotion to 32-bit int) and must yield 0,
not 1 (as before the fix).

Unfortunately, in the gimple model of vector units, such large shift
counts wouldn't be well-defined, so we won't vectorize such shifts any
more, unless we can tell they're in range or undefined.

So the test that expected the vectorization we no longer performed
needs to be adjusted.  Instead of nobbling the test, Richard Earnshaw
suggested annotating the test with the expected ranges so as to enable
the optimization, and Christophe Lyon suggested a further
simplification.
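
A small self-contained illustration of the C semantics in question (not the
mve-vshr.c testcase itself):

#include <stdint.h>
#include <stdio.h>

int
main (void)
{
  uint16_t x = 32768, n = 16;
  /* Both operands promote to 32-bit int, so the shift is well defined
     and yields 0.  */
  printf ("%d\n", x >> n);
  /* Clamping the count at 15, as the old vectorized code effectively did,
     would give 1 instead.  */
  printf ("%d\n", x >> 15);
  return 0;
}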

Co-Authored-By: Richard Earnshaw <[email protected]>

for  gcc/testsuite/ChangeLog

	PR tree-optimization/113281
	* gcc.target/arm/simd/mve-vshr.c: Add expected ranges.

(cherry picked from commit 54d2339)
When none of -mprefer-vector-width, avx256_optimal/avx128_optimal, or
avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC sets
ix86_{move_max,store_max} to the maximum available vector length, except
for the AVX case:

	      if (TARGET_AVX512F_P (opts->x_ix86_isa_flags)
		  && TARGET_EVEX512_P (opts->x_ix86_isa_flags2))
		opts->x_ix86_move_max = PVW_AVX512;
	      else
		opts->x_ix86_move_max = PVW_AVX128;

So for -mavx2, the vectorizer will choose 256-bit vectors for vectorization,
but 128-bit moves are used for the struct copy, so there could be a potential
STLF (store-to-load forwarding) issue due to this mismatch.

The patch fixes that.
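
A hedged sketch of the kind of mismatch described (illustrative only, not one
of the listed testcases): with -O2 -mavx2 the loop below can be vectorized
with 256-bit accesses, while the struct copy used to be expanded by pieces
with 128-bit moves, so the wider loads can span two of the narrower stores
and defeat store-to-load forwarding.

struct s { double d[8]; };

double
sum_copy (struct s *p)
{
  struct s tmp = *p;            /* copy expanded by pieces (move_max/store_max) */
  double acc = 0.0;
  for (int i = 0; i < 8; i++)   /* loop vectorized with 256-bit vectors */
    acc += tmp.d[i];
  return acc;
}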

gcc/ChangeLog:

	* config/i386/i386-options.cc (ix86_option_override_internal):
	Set ix86_{move_max,store_max} to PVW_AVX256 when TARGET_AVX
	instead of PVW_AVX128.

gcc/testsuite/ChangeLog:
	* gcc.target/i386/pieces-memcpy-10.c: Add -mprefer-vector-width=128.
	* gcc.target/i386/pieces-memcpy-6.c: Ditto.
	* gcc.target/i386/pieces-memset-38.c: Ditto.
	* gcc.target/i386/pieces-memset-40.c: Ditto.
	* gcc.target/i386/pieces-memset-41.c: Ditto.
	* gcc.target/i386/pieces-memset-42.c: Ditto.
	* gcc.target/i386/pieces-memset-43.c: Ditto.
	* gcc.target/i386/pieces-strcpy-2.c: Ditto.
	* gcc.target/i386/pieces-memcpy-22.c: New test.
	* gcc.target/i386/pieces-memset-51.c: New test.
	* gcc.target/i386/pieces-strcpy-3.c: New test.

(cherry picked from commit aea3742)
gcc/testsuite/ChangeLog:

	* gcc.target/i386/pieces-memcpy-10.c: Use -mmove-max=256 and
	-mstore-max=256.
	* gcc.target/i386/pieces-memcpy-6.c: Ditto.
	* gcc.target/i386/pieces-memset-38.c: Ditto.
	* gcc.target/i386/pieces-memset-40.c: Ditto.
	* gcc.target/i386/pieces-memset-41.c: Ditto.
	* gcc.target/i386/pieces-memset-42.c: Ditto.
	* gcc.target/i386/pieces-memset-43.c: Ditto.
	* gcc.target/i386/pieces-strcpy-2.c: Ditto.

(cherry picked from commit ea9c508)
GCC Administrator and others added 30 commits January 7, 2025 00:21
This patch adds support for a new fusion in znver5 documented in the
optimization manual:

   The Zen5 microarchitecture adds support to fuse reg-reg MOV Instructions
   with certain ALU instructions. The following conditions need to be met for
   fusion to happen:
     - The MOV should be reg-reg mov with Opcode 0x89 or 0x8B
     - The MOV is followed by an ALU instruction where the MOV and ALU destination register match.
     - The ALU instruction may source only registers or immediate data. There cannot be any memory source.
     - The ALU instruction sources either the source or dest of MOV instruction.
     - If ALU instruction has 2 reg sources, they should be different.
     - The following ALU instructions can fuse with an older qualified MOV instruction:
       ADD ADC AND XOR OP SUB SBB INC DEC NOT SAL / SHL SHR SAR
       (I assume OP is OR)

I also increased the issue rate from 4 to 6.  Theoretically znver5 can do more,
but with our model we can't really use it.
Increasing the issue rate to 8 leads to an infinite loop in the scheduler.

Finally, I also enabled fuse_alu_and_branch since it is supported by
znver5 (I think by earlier zens too).

The new fusion pattern moves quite a few instructions around in common code:
@@ -2210,13 +2210,13 @@
        .cfi_offset 3, -32
        leaq    63(%rsi), %rbx
        movq    %rbx, %rbp
+       shrq    $6, %rbp
+       salq    $3, %rbp
        subq    $16, %rsp
        .cfi_def_cfa_offset 48
        movq    %rdi, %r12
-       shrq    $6, %rbp
-       movq    %rsi, 8(%rsp)
-       salq    $3, %rbp
        movq    %rbp, %rdi
+       movq    %rsi, 8(%rsp)
        call    _Znwm
        movq    8(%rsp), %rsi
        movl    $0, 8(%r12)
@@ -2224,8 +2224,8 @@
        movq    %rax, (%r12)
        movq    %rbp, 32(%r12)
        testq   %rsi, %rsi
-       movq    %rsi, %rdx
        cmovns  %rsi, %rbx
+       movq    %rsi, %rdx
        sarq    $63, %rdx
        shrq    $58, %rdx
        sarq    $6, %rbx
which should help decoder bandwidth and perhaps also the caches, though I was
not able to measure an off-noise effect on SPEC.

gcc/ChangeLog:

	* config/i386/i386.h (TARGET_FUSE_MOV_AND_ALU): New tune.
	* config/i386/x86-tune-sched.cc (ix86_issue_rate): Update for znver5.
	(ix86_adjust_cost): Add TODO about znver5 memory latency.
	(ix86_fuse_mov_alu_p): New.
	(ix86_macro_fusion_pair_p): Use it.
	* config/i386/x86-tune.def (X86_TUNE_FUSE_ALU_AND_BRANCH): Add ZNVER5.
	(X86_TUNE_FUSE_MOV_AND_ALU): New tune.

(cherry picked from commit e2125a6)
Zen5 has 6 instead of 4 ALUs and the integer multiplication can now execute in
3 of them.  FP units can do 2 additions and 2 multiplications with latency 2
and 3.  This patch updates the reassociation width accordingly.  This has the
potential of increasing register pressure, but unlike when benchmarking znver1
tuning, I did not notice it actually causing problems on SPEC, so this patch
bumps the reassociation width up to 6 for everything except integer vectors,
where there are 4 units with a typical latency of 1.

Bootstrapped/regtested x86_64-linux, committed.

gcc/ChangeLog:

	* config/i386/i386.cc (ix86_reassociation_width): Update for Znver5.
	* config/i386/x86-tune-costs.h (znver5_costs): Update reassociation
	widths.

(cherry picked from commit f0ab3de)
gcc/ChangeLog:

	* doc/cpp.texi (Common Predefined Macros): Fix syntax.
The following makes analysis and transform agree on constraints.

	PR tree-optimization/115646
	* tree-call-cdce.cc (check_pow): Check for bit_sz values
	as allowed by transform.

	* gcc.dg/pr115646.c: New testcase.

(cherry picked from commit 453b1d2)
The following avoids associating a reduction path as that might
get STMT_VINFO_REDUC_IDX out-of-sync with the SLP operand order.
This is a latent issue with SLP reductions but now easily exposed
as we're doing single-lane SLP reductions.

Once we have achieved SLP-only we can move and update this metadata.

	PR tree-optimization/115669
	* tree-vect-slp.cc (vect_build_slp_tree_2): Do not reassociate
	chains that participate in a reduction.

	* gcc.dg/vect/pr115669.c: New testcase.

(cherry picked from commit 7886830)
The following fixes an issue with CCP's likely_value when faced with
a vector CTOR containing undef SSA names and constants.  This should
be classified as CONSTANT and not UNDEFINED.
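
A hedged sketch of the shape in question (using GNU vector extensions; not
necessarily the pr116057.c testcase):

typedef int v4si __attribute__ ((vector_size (16)));

v4si
build (void)
{
  int u;                    /* intentionally uninitialized (undef SSA name) */
  v4si v = { u, 1, 2, 3 };  /* CTOR mixing an undef element with constants */
  return v;
}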

	PR tree-optimization/116057
	* tree-ssa-ccp.cc (likely_value): Also walk CTORs in stmt
	operands to look for constants.

	* gcc.dg/torture/pr116057.c: New testcase.

(cherry picked from commit 1ea5515)
	PR fortran/106692

gcc/fortran/ChangeLog:

	* trans-expr.cc (gfc_conv_expr_op): Inhibit excessive optimization
	of Cray pointers by treating them as volatile in comparisons.

gcc/testsuite/ChangeLog:

	* gfortran.dg/cray_pointers_13.f90: New test.

(cherry picked from commit c7754a2)
There is nothing exciting in this patch.  I measured latencies and also
compared them with the newly released optimization guide.  There are no
dramatic changes compared to zen4.  One interesting new bit is that addss is
faster and can be 2 cycles when fed by another addss.

I also increased the large insn bound since the decoders no longer seem to
require instructions to be 8 bytes or less.

gcc/ChangeLog:

	* config/i386/x86-tune-costs.h (znver5_cost): Update instruction
	costs.

(cherry picked from commit 4292297)
The following addresses a long-standing issue with not preserving
accesses to non-volatile objects through volatile-qualified
pointers in the case that the object gets expanded to a register.  The
fix is to treat accesses to an object through a volatile-qualified
access as forcing that object to memory.  This issue became more
exposed recently, so it regressed further since GCC 11.
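
A classic illustration of the issue (a hedged sketch, not necessarily the
pr69482-1.c testcase): the object itself is not volatile, but it is accessed
through a volatile-qualified pointer, so those accesses must be kept.

int
f (void)
{
  int x = 0;
  *(volatile int *) &x = 1;       /* volatile-qualified store; forces x to memory */
  return *(volatile int *) &x;    /* volatile-qualified load must not be removed */
}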

	PR middle-end/69482
	* cfgexpand.cc (discover_nonconstant_array_refs_r): Volatile
	qualified accesses also force objects to memory.

	* gcc.target/i386/pr69482-1.c: New testcase.
	* gcc.target/i386/pr69482-2.c: Likewise.

(cherry picked from commit a5a8242)
Loop distribution does different analysis with -g0/-g because it counts
a debug stmt starting a BB against a limit, which will eventually
lead to different IVOPTs choices.  I've fixed a possible IVOPTs
issue on the way even though it doesn't make a difference here.

	PR tree-optimization/116290
	* tree-loop-distribution.cc (determine_reduction_stmt_1): PHIs
	have no debug variants.  Start with first non-debug real stmt.
	* tree-ssa-loop-ivopts.cc (find_givs_in_bb): Do not analyze
	debug stmts.

	* gcc.dg/pr116290.c: New testcase.

(cherry picked from commit 5667400)
The following reverts a bogus fix done for PR101009 and instead makes
sure we get into the same_access_functions () case when computing
the distance vector for g[1] and g[1] where the constants ended up
having different types.  The generic code doesn't seem to handle
loop invariant dependences.  The special case gets us both
( 0 ) and ( 1 ) as distance vectors while formerly we got ( 1 ),
which the PR101009 fix changed to ( 0 ) with bad effects on other
cases as shown in this PR.
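
A hedged sketch of the access shape being discussed (illustrative only; the
pr116768.c testcase differs): both references in the loop are to the same
loop-invariant element g[1], giving a loop-invariant dependence whose
subscript constants may carry different types internally.

int g[2];

void
f (int n)
{
  for (int i = 0; i < n; i++)
    g[1] = g[1] + i;
}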

	PR tree-optimization/116768
	* tree-data-ref.cc (build_classic_dist_vector_1): Revert
	PR101009 change.
	* tree-chrec.cc (eq_evolutions_p): Make sure (sizetype)1
	and (int)1 compare equal.

	* gcc.dg/torture/pr116768.c: New testcase.

(cherry picked from commit 5b5a36b)

Transforming -fma (-a, b, -c) to fma (a, b, c) is only valid when
not rounding towards -inf or +inf as the sign of the multiplication
changes.
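
A hedged numeric sketch (not the GCC testcase; compile with -frounding-math so
the dynamic rounding mode must be honored): under FE_UPWARD the two
expressions can differ in the last bit, because negating the product flips the
direction in which the single rounding of the fma moves.

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int
main (void)
{
  volatile double a = 1.0 + 0x1p-52, b = 1.0 + 0x1p-52, c = 0.0;
  fesetround (FE_UPWARD);
  double x = -fma (-a, b, -c);   /* rounds -a*b - c toward +inf, then negates */
  double y = fma (a, b, c);      /* rounds a*b + c toward +inf */
  printf ("%a\n%a\n", x, y);     /* the results differ in the last bit */
  return 0;
}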

	PR middle-end/116891
	* match.pd ((negate (IFN_FNMS@3 @0 @1 @2)) -> (IFN_FMA @0 @1 @2)):
	Only enable for !HONOR_SIGN_DEPENDENT_ROUNDING.

(cherry picked from commit c53bd48)
On Mon, Oct 14, 2024 at 08:53:29AM +0200, Jakub Jelinek wrote:
> >     PR middle-end/116891
> >     * match.pd ((negate (IFN_FNMS@3 @0 @1 @2)) -> (IFN_FMA @0 @1 @2)):
> >     Only enable for !HONOR_SIGN_DEPENDENT_ROUNDING.
>
> Guess it would be nice to have a testcase which FAILs without the patch and
> PASSes with it, but it can be added later.

I've added such a testcase now, and additionally found the fix only fixed
one of the 4 problematic similar cases.

Here is a patch which fixes the others too and adds the testcases.
fma-pr116891.c FAILed without your patch, FAILs with your patch too (but
only due to the bar/baz/qux checks) and PASSes with the patch.

2024-10-15  Jakub Jelinek  <[email protected]>

	PR middle-end/116891
	* match.pd ((negate (fmas@3 @0 @1 @2)) -> (IFN_FNMS @0 @1 @2)):
	Only enable for !HONOR_SIGN_DEPENDENT_ROUNDING.
	((negate (IFN_FMS@3 @0 @1 @2)) -> (IFN_FNMA @0 @1 @2)): Likewise.
	((negate (IFN_FNMA@3 @0 @1 @2)) -> (IFN_FMS @0 @1 @2)): Likewise.

	* gcc.dg/pr116891.c: New test.
	* gcc.target/i386/fma-pr116891.c: New test.

(cherry picked from commit 4366f0c)

For vector types we have to make sure the comparison result is a vector
type and the resulting compare operation is supported.  As the resulting
compare is never an equality compare I didn't bother to check for the
cbranch case.

	PR tree-optimization/117104
	* match.pd ((cmp:c (minmax:c @0 @1) @0) -> (out @0 @1)): Properly
	guard the vector case.

	* gcc.dg/pr117104.c: New testcase.

(cherry picked from commit f54d42e)
The diagnostics code fails to handle non-constant domain max.

	PR tree-optimization/117254
	* gimple-ssa-warn-access.cc (maybe_warn_nonstring_arg):
	Check the array domain max is constant before using it.

	* gcc.dg/pr117254.c: New testcase.

(cherry picked from commit d464a52)
STMT_VINFO_SLP_VECT_ONLY isn't properly computed as the union over all
group members, and when the group is later split due to duplicates,
not all sub-groups inherit the flag.

	PR tree-optimization/117307
	* tree-vect-data-refs.cc (vect_analyze_data_ref_accesses):
	Properly compute STMT_VINFO_SLP_VECT_ONLY.  Set it on all
	parts of a split group.

	* gcc.dg/vect/pr117307.c: New testcase.

(cherry picked from commit 1972230)
When we decompose a complex load that is only used as its real and imaginary
parts, we fail to honor the IL constraint that a BIT_FIELD_REF
of register type should be outermost in a ref.  The following
simply avoids the transform when the complex load has such a
BIT_FIELD_REF.
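
A hedged sketch of the general transform (the reproducer in pr117417.c is
different): a complex load whose only uses are its real and imaginary parts
can be split by forwprop into two scalar loads, which is the transform the fix
now skips when the load sits under a BIT_FIELD_REF.

float
f (_Complex float *p)
{
  _Complex float c = *p;            /* complex load */
  return __real__ c + __imag__ c;   /* only used as real and imaginary parts */
}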

	PR tree-optimization/117417
	* tree-ssa-forwprop.cc (pass_forwprop::execute): Avoid
	decomposing BIT_FIELD_REF complex load.

	* gcc.dg/torture/pr117417.c: New testcase.

(cherry picked from commit d976daa)
This patch removes the (unnecessary) CPP_PRAGMA_EOL case from
cp_parser_cache_defarg, which currently has the result that any pragmas
in the NSDMI cause an error.

	PR c++/118147

gcc/cp/ChangeLog:

	* parser.cc (cp_parser_cache_defarg): Don't error when
	CPP_PRAGMA_EOL.

gcc/testsuite/ChangeLog:

	* g++.dg/cpp0x/nsdmi-defer7.C: New test.

Signed-off-by: Nathaniel Shead <[email protected]>
(cherry picked from commit f3ccc57)
We are initializing both the call graph node count and
the entry block count of the function with the head_count value
from the profile.

The count propagation algorithm may refine the entry block count
and we may end up with a case where the call graph node count
is set to zero but the entry block count is non-zero. That becomes
a problem because we have this code in execute_fixup_cfg:

 profile_count num = node->count;
 profile_count den = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count;
 bool scale = num.initialized_p () && !(num == den);

Here if num is 0 but den is not 0, scale becomes true and we
lose the counts in

if (scale)
  bb->count = bb->count.apply_scale (num, den);

This is what happened in the issue reported in PR116743
(a 10% regression in MySQL HAMMERDB tests).
Commit 3d9e676 made an improvement in
AutoFDO count propagation, which caused a mismatch between
the call graph node count (zero) and the entry block count (non-zero)
and subsequent loss of counts as described above.

The fix is to update the call graph node count once we've done count propagation.

Tested on x86_64-pc-linux-gnu.

gcc/ChangeLog:
	PR gcov-profile/116743
	* auto-profile.cc (afdo_annotate_cfg): Fix mismatch between the call graph node count
	and the entry block count.

(cherry picked from commit e683c6b)