mirrored from git://gcc.gnu.org/git/gcc.git
Releases/gcc 12 #65
Open
jacopobrusini
wants to merge
2,603
commits into
master
from
releases/gcc-12
+278,448
−138,821
This is an unofficial mirror that has nothing to do with the GCC project, so submitting pull requests here is a waste of time. Also, I have no idea what this pull request is trying to do but it would never be accepted even if it was submitted to the right place.
atahanozbayram
approved these changes
Apr 2, 2024
For the pattern below, RA may still allocate r162 as a v/k register and try to reload the address with leaq __libc_tsd_CTYPE_B@gottpoff(%rip), %rsi, which results in a linker error.

    (set (reg:DI 162)
         (mem/u/c:DI
           (const:DI (unspec:DI
             [(symbol_ref:DI ("a") [flags 0x60] <var_decl 0x7f621f6e1c60 a>)]
             UNSPEC_GOTNTPOFF))

Quote from H.J on why the linker issues an error:

>What do these do:
>
>	leaq	__libc_tsd_CTYPE_B@gottpoff(%rip), %rax
>	vmovq	(%rax), %xmm0
>
>From x86-64 TLS psABI:
>
>The assembler generates for the x@gottpoff(%rip) expressions a
>R_X86_64_GOTTPOFF relocation for the symbol x which requests the linker to
>generate a GOT entry with a R_X86_64_TPOFF64 relocation. The offset of
>the GOT entry relative to the end of the instruction is then used in
>the instruction. The R_X86_64_TPOFF64 relocation is processed at
>program startup time by the dynamic linker by looking up the symbol x
>in the modules loaded at that point. The offset is written in the GOT
>entry and later loaded by the addq instruction.
>
>The above code sequence looks wrong to me.

gcc/ChangeLog:

	PR target/116043
	* config/i386/constraints.md (Bk): Refine to define_special_memory_constraint.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pr116043.c: New test.

(cherry picked from commit bc1fda0)
Not sure how this happened, but: svsudot is supposed to be expanded as USDOT with the operands swapped. However, a thinko in the expansion of svsudot meant that the arguments weren't in fact swapped; the attempted swap was just a no-op. And the testcases blithely accepted that. gcc/ PR target/114607 * config/aarch64/aarch64-sve-builtins-base.cc (svusdot_impl::expand): Fix botched attempt to swap the operands for svsudot. gcc/testsuite/ PR target/114607 * gcc.target/aarch64/sve/acle/asm/sudot_s32.c: New test. (cherry picked from commit 2c1c248)
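To make the operand-order issue concrete, here is a minimal scalar sketch of one 32-bit lane of the dot product. It is only a model under stated assumptions: the helper names (`usdot_lane`, `sudot_lane_swapped`, `sudot_lane_unswapped`) are hypothetical and are not GCC or ACLE functions, and the "unswapped" variant merely imitates what happens when the signed and unsigned byte operands end up in each other's slots.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of one 32-bit lane of USDOT: four unsigned bytes
   multiplied by four signed bytes, accumulated into acc. */
static int32_t usdot_lane(int32_t acc, const uint8_t u[4], const int8_t s[4])
{
    for (int j = 0; j < 4; j++)
        acc += (int32_t)u[j] * (int32_t)s[j];
    return acc;
}

/* Intended behaviour: sudot takes (signed, unsigned) and is expressed
   as USDOT with the two multiplicands swapped. */
static int32_t sudot_lane_swapped(int32_t acc, const int8_t s[4],
                                  const uint8_t u[4])
{
    return usdot_lane(acc, u, s);
}

/* Model of a no-op swap: the signed bytes land in the unsigned slot and
   the unsigned bytes in the signed slot, so both are reinterpreted. */
static int32_t sudot_lane_unswapped(int32_t acc, const int8_t s[4],
                                    const uint8_t u[4])
{
    uint8_t s_as_u[4];
    int8_t u_as_s[4];
    memcpy(s_as_u, s, sizeof s_as_u);
    memcpy(u_as_s, u, sizeof u_as_s);
    return usdot_lane(acc, s_as_u, u_as_s);
}
```

With a signed input of -1 and an unsigned input of 200 in the first byte, the swapped form accumulates -200, while the unswapped model reads the bytes as 255 and -56 and produces a very different value. Inputs that stay within 0..127 would not distinguish the two, which may be how such a bug slips past a test.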
aarch64-sve.md had a pattern that combined:

    cmpeq pb.T, pa/z, zc.T, #0
    mov zd.T, pb/z, #1

into:

    cnot zd.T, pa/m, zc.T

But this is only valid if pa.T is a ptrue. In other cases, the original would set inactive elements of zd.T to 0, whereas the combined form would copy elements from zc.T.

gcc/
	PR target/114603
	* config/aarch64/aarch64-sve.md (@aarch64_pred_cnot<mode>): Replace with...
	(@aarch64_ptrue_cnot<mode>): ...this, requiring operand 1 to be a ptrue.
	(*cnot<mode>): Require operand 1 to be a ptrue.
	* config/aarch64/aarch64-sve-builtins-base.cc (svcnot_impl::expand): Use aarch64_ptrue_cnot<mode> for _x operations that are predicated with a ptrue. Represent other _x operations as fully-defined _m operations.

gcc/testsuite/
	PR target/114603
	* gcc.target/aarch64/sve/acle/general/cnot_1.c: New test.

(cherry picked from commit 67cbb1c)
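The difference between the two forms can be sketched with a small lane-by-lane model in C. This is an illustration only: `LANES`, `cmpeq_mov` and `cnot_merge` are hypothetical names, the predicate is modelled as a `bool` array, and the merging behaviour follows the description in the commit message (inactive lanes take their value from zc).

```c
#include <stdbool.h>
#include <stdint.h>

#define LANES 4

/* Model of the original pair:
     cmpeq pb.T, pa/z, zc.T, #0   pb = active lane and zc == 0
     mov   zd.T, pb/z, #1         zd = pb ? 1 : 0
   Inactive lanes of zd end up as 0. */
static void cmpeq_mov(int32_t zd[LANES], const bool pa[LANES],
                      const int32_t zc[LANES])
{
    for (int i = 0; i < LANES; i++) {
        bool pb = pa[i] && zc[i] == 0;
        zd[i] = pb ? 1 : 0;
    }
}

/* Model of the combined form, cnot zd.T, pa/m, zc.T, as the commit
   message describes it: active lanes get the logical NOT of zc,
   inactive lanes are copied from zc. */
static void cnot_merge(int32_t zd[LANES], const bool pa[LANES],
                       const int32_t zc[LANES])
{
    for (int i = 0; i < LANES; i++)
        zd[i] = pa[i] ? (zc[i] == 0) : zc[i];
}
```

With an all-true predicate the two functions agree on every lane. With a partial predicate, a nonzero zc element in an inactive lane yields 0 from the original pair but is copied through by the merging form, which is why the combination needs the ptrue requirement.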
The test was too optimistic, alas. We used to vectorize shifts by clamping the shift counts below the bit width of the types (e.g. at 15 for 16-bit vector elements), but (uint16_t)32768 >> (uint16_t)16 is well defined (because of promotion to 32-bit int) and must yield 0, not 1 (as before the fix). Unfortunately, in the gimple model of vector units, such large shift counts wouldn't be well-defined, so we won't vectorize such shifts any more, unless we can tell they're in range or undefined. So the test that expected the vectorization we no longer performed needs to be adjusted. Instead of nobbling the test, Richard Earnshaw suggested annotating the test with the expected ranges so as to enable the optimization, and Christophe Lyon suggested a further simplification. Co-Authored-By: Richard Earnshaw <[email protected]> for gcc/testsuite/ChangeLog PR tree-optimization/113281 * gcc.target/arm/simd/mve-vshr.c: Add expected ranges. (cherry picked from commit 54d2339)
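The promotion argument can be checked with a short C sketch. It assumes a target where `int` is 32 bits wide, and the function names (`shift_promoted`, `shift_clamped`) are made up for illustration; the second one only imitates the old clamping strategy described above, not the vectorizer itself.

```c
#include <stdint.h>

/* Plain C semantics: x is promoted to int before the shift, so any
   count below the width of int (assumed 32 here) is well defined. */
static uint16_t shift_promoted(uint16_t x, uint16_t n)
{
    return (uint16_t)(x >> n);
}

/* Imitation of the former strategy: clamp the count to 15 for 16-bit
   elements so the shift fits a 16-bit vector lane. */
static uint16_t shift_clamped(uint16_t x, uint16_t n)
{
    uint16_t c = n > 15 ? 15 : n;
    return (uint16_t)(x >> c);
}
```

For x = 32768 and n = 16 the promoted shift gives 0, while the clamped version shifts by 15 and gives 1, which is the wrong-code case the fix addresses.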
When none of mprefer-vector-width, avx256_optimal/avx128_optimal, avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC sets ix86_{move_max,store_max} to the maximum available vector length, except in the AVX case:

    if (TARGET_AVX512F_P (opts->x_ix86_isa_flags)
        && TARGET_EVEX512_P (opts->x_ix86_isa_flags2))
      opts->x_ix86_move_max = PVW_AVX512;
    else
      opts->x_ix86_move_max = PVW_AVX128;

So for -mavx2, the vectorizer will choose 256-bit for vectorization, but 128-bit is used for struct copy, and there could be a potential STLF (store-to-load forwarding) issue due to this "misalign". The patch fixes that.

gcc/ChangeLog:

	* config/i386/i386-options.cc (ix86_option_override_internal): Set ix86_{move_max,store_max} to PVW_AVX256 when TARGET_AVX instead of PVW_AVX128.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/pieces-memcpy-10.c: Add -mprefer-vector-width=128.
	* gcc.target/i386/pieces-memcpy-6.c: Ditto.
	* gcc.target/i386/pieces-memset-38.c: Ditto.
	* gcc.target/i386/pieces-memset-40.c: Ditto.
	* gcc.target/i386/pieces-memset-41.c: Ditto.
	* gcc.target/i386/pieces-memset-42.c: Ditto.
	* gcc.target/i386/pieces-memset-43.c: Ditto.
	* gcc.target/i386/pieces-strcpy-2.c: Ditto.
	* gcc.target/i386/pieces-memcpy-22.c: New test.
	* gcc.target/i386/pieces-memset-51.c: New test.
	* gcc.target/i386/pieces-strcpy-3.c: New test.

(cherry picked from commit aea3742)
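The default selection described here can be summarised by a tiny decision function. This is a simplified model, not the code in i386-options.cc: the enum merely mirrors the PVW_* names from the commit message, and `default_move_max` together with its `patched` flag is a hypothetical helper that contrasts the behaviour before and after the change.

```c
#include <stdbool.h>

/* Simplified stand-in for the preferred-vector-width values named in
   the commit message. */
enum vector_width { PVW_NONE, PVW_AVX128, PVW_AVX256, PVW_AVX512 };

/* Model of the default move_max choice when no tuning option or
   explicit width is given.  `patched` selects the behaviour after the
   change (256-bit when AVX is available). */
static enum vector_width default_move_max(bool avx512f, bool evex512,
                                          bool avx, bool patched)
{
    if (avx512f && evex512)
        return PVW_AVX512;
    if (patched && avx)
        return PVW_AVX256;
    return PVW_AVX128;
}
```

In this model an AVX-only configuration moves from a 128-bit default to a 256-bit one, matching the width the vectorizer would pick for -mavx2.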
gcc/testsuite/ChangeLog: * gcc.target/i386/pieces-memcpy-10.c: Use -mmove-max=256 and -mstore-max=256. * gcc.target/i386/pieces-memcpy-6.c: Ditto. * gcc.target/i386/pieces-memset-38.c: Ditto. * gcc.target/i386/pieces-memset-40.c: Ditto. * gcc.target/i386/pieces-memset-41.c: Ditto. * gcc.target/i386/pieces-memset-42.c: Ditto. * gcc.target/i386/pieces-memset-43.c: Ditto. * gcc.target/i386/pieces-strcpy-2.c: Ditto. (cherry picked from commit ea9c508)
This patch adds support for the new fusion in znver5 documented in the optimization manual:

    The Zen5 microarchitecture adds support to fuse reg-reg MOV Instructions with certain ALU instructions. The following conditions need to be met for fusion to happen:
    - The MOV should be reg-reg mov with Opcode 0x89 or 0x8B
    - The MOV is followed by an ALU instruction where the MOV and ALU destination register match.
    - The ALU instruction may source only registers or immediate data. There cannot be any memory source.
    - The ALU instruction sources either the source or dest of MOV instruction.
    - If ALU instruction has 2 reg sources, they should be different.
    - The following ALU instructions can fuse with an older qualified MOV instruction:
      ADD ADC AND XOR OP SUB SBB INC DEC NOT SAL / SHL SHR SAR

(I assume OP is OR)

I also increased the issue rate from 4 to 6. Theoretically znver5 can do more, but with our model we can't really use it. Increasing the issue rate to 8 leads to an infinite loop in the scheduler.

Finally, I also enabled fuse_alu_and_branch since it is supported by znver5 (I think by earlier zens too).

The new fusion pattern moves quite a few instructions around in common code:

    @@ -2210,13 +2210,13 @@
     	.cfi_offset 3, -32
     	leaq	63(%rsi), %rbx
     	movq	%rbx, %rbp
    +	shrq	$6, %rbp
    +	salq	$3, %rbp
     	subq	$16, %rsp
     	.cfi_def_cfa_offset 48
     	movq	%rdi, %r12
    -	shrq	$6, %rbp
    -	movq	%rsi, 8(%rsp)
    -	salq	$3, %rbp
     	movq	%rbp, %rdi
    +	movq	%rsi, 8(%rsp)
     	call	_Znwm
     	movq	8(%rsp), %rsi
     	movl	$0, 8(%r12)
    @@ -2224,8 +2224,8 @@
     	movq	%rax, (%r12)
     	movq	%rbp, 32(%r12)
     	testq	%rsi, %rsi
    -	movq	%rsi, %rdx
     	cmovns	%rsi, %rbx
    +	movq	%rsi, %rdx
     	sarq	$63, %rdx
     	shrq	$58, %rdx
     	sarq	$6, %rbx

which should help decoder bandwidth and perhaps also cache, though I was not able to measure an off-noise effect on SPEC.

gcc/ChangeLog:

	* config/i386/i386.h (TARGET_FUSE_MOV_AND_ALU): New tune.
	* config/i386/x86-tune-sched.cc (ix86_issue_rate): Update for znver5.
	(ix86_adjust_cost): Add TODO about znver5 memory latency.
	(ix86_fuse_mov_alu_p): New.
	(ix86_macro_fusion_pair_p): Use it.
	* config/i386/x86-tune.def (X86_TUNE_FUSE_ALU_AND_BRANCH): Add ZNVER5.
	(X86_TUNE_FUSE_MOV_AND_ALU): New tune.

(cherry picked from commit e2125a6)
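The fusion conditions quoted from the optimization manual can be written down as a small predicate. The sketch below is a toy model and not the ix86_fuse_mov_alu_p implementation: the instruction structs, the register numbering, and every function name are assumptions made for illustration, and OP is read as OR as the commit author assumes.

```c
#include <stdbool.h>

/* Hypothetical register ids, used only by this sketch. */
enum reg { REG_RBX = 3, REG_RBP = 5, REG_RSI = 6 };

/* ALU operations; OP_IMUL stands in for anything not on the list. */
enum alu_op {
    OP_ADD, OP_ADC, OP_AND, OP_XOR, OP_OR, OP_SUB, OP_SBB,
    OP_INC, OP_DEC, OP_NOT, OP_SHL, OP_SHR, OP_SAR, OP_IMUL
};

struct mov_insn {
    bool reg_to_reg;      /* opcode 0x89 / 0x8B register form */
    int dest;
    int src;
};

struct alu_insn {
    enum alu_op op;
    int dest;
    int nsrc_regs;        /* number of register sources (0..2) */
    int src_reg[2];
    bool has_mem_source;
};

static bool op_on_fusion_list(enum alu_op op)
{
    return op != OP_IMUL;
}

/* Returns true when the MOV followed by the ALU instruction meets all
   of the listed conditions. */
static bool mov_alu_fusible(const struct mov_insn *mov,
                            const struct alu_insn *alu)
{
    if (!mov->reg_to_reg || !op_on_fusion_list(alu->op))
        return false;
    if (alu->dest != mov->dest || alu->has_mem_source)
        return false;

    bool uses_mov_reg = false;
    for (int i = 0; i < alu->nsrc_regs; i++)
        if (alu->src_reg[i] == mov->src || alu->src_reg[i] == mov->dest)
            uses_mov_reg = true;
    if (!uses_mov_reg)
        return false;

    /* Two register sources must be different registers. */
    if (alu->nsrc_regs == 2 && alu->src_reg[0] == alu->src_reg[1])
        return false;
    return true;
}
```

Under this model the pair `movq %rbx, %rbp` followed by `shrq $6, %rbp` from the diff above qualifies, which is consistent with the scheduler now keeping those two instructions adjacent.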
Zen5 has 6 instead of 4 ALUs and the integer multiplication can now execute in 3 of them. FP units can do 2 additions and 2 multiplications with latency 2 and 3. This patch updates reassociation width accordingly. This has the potential of increasing register pressure but, unlike when benchmarking znver1 tuning, I did not notice this actually causing problems on SPEC, so this patch bumps up reassociation width to 6 for everything except for integer vectors, where there are 4 units with typical latency of 1. Bootstrapped/regtested x86_64-linux, committed. gcc/ChangeLog: * config/i386/i386.cc (ix86_reassociation_width): Update for Znver5. * config/i386/x86-tune-costs.h (znver5_costs): Update reassociation widths. (cherry picked from commit f0ab3de)
gcc/ChangeLog: * doc/cpp.texi (Common Predefined Macros): Fix syntax.
The following makes analysis and transform agree on constraints. PR tree-optimization/115646 * tree-call-cdce.cc (check_pow): Check for bit_sz values as allowed by transform. * gcc.dg/pr115646.c: New testcase. (cherry picked from commit 453b1d2)
The following avoids associating a reduction path as that might get STMT_VINFO_REDUC_IDX out-of-sync with the SLP operand order. This is a latent issue with SLP reductions but now easily exposed as we're doing single-lane SLP reductions. When we achieved SLP only we can move and update this meta-data. PR tree-optimization/115669 * tree-vect-slp.cc (vect_build_slp_tree_2): Do not reassociate chains that participate in a reduction. * gcc.dg/vect/pr115669.c: New testcase. (cherry picked from commit 7886830)
The following fixes an issue with CCPs likely_value when faced with a vector CTOR containing undef SSA names and constants. This should be classified as CONSTANT and not UNDEFINED. PR tree-optimization/116057 * tree-ssa-ccp.cc (likely_value): Also walk CTORs in stmt operands to look for constants. * gcc.dg/torture/pr116057.c: New testcase. (cherry picked from commit 1ea5515)
PR fortran/106692 gcc/fortran/ChangeLog: * trans-expr.cc (gfc_conv_expr_op): Inhibit excessive optimization of Cray pointers by treating them as volatile in comparisons. gcc/testsuite/ChangeLog: * gfortran.dg/cray_pointers_13.f90: New test. (cherry picked from commit c7754a2)
There is nothing exciting in this patch. I measured latencies and also compared them with the newly released optimization guide. There are no dramatic changes compared to zen4. One interesting new bit is that addss is faster and can be 2 cycles when fed by another addss. I also increased the large insn bound since decoders no longer seem to require instructions to be 8 bytes or less. gcc/ChangeLog: * config/i386/x86-tune-costs.h (znver5_cost): Update instruction costs. (cherry picked from commit 4292297)
The following addresses a long standing issue with not preserving accesses to non-volatile objects through volatile qualified pointers in the case that object gets expanded to a register. The fix is to treat accesses to an object with a volatile qualified access as forcing that object to memory. This issue got more exposed recently so it regressed more since GCC 11. PR middle-end/69482 * cfgexpand.cc (discover_nonconstant_array_refs_r): Volatile qualified accesses also force objects to memory. * gcc.target/i386/pr69482-1.c: New testcase. * gcc.target/i386/pr69482-2.c: Likewise. (cherry picked from commit a5a8242)
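The kind of access in question looks roughly like the following C sketch. It is not one of the testcases added by the patch; `store_through_volatile` is a hypothetical function that only shows the source-level shape: a local, non-volatile object that could otherwise live in a register, accessed through a volatile-qualified lvalue.

```c
/* The local `flag` is not declared volatile, so it is a candidate for
   being expanded to a register.  Every access below goes through a
   volatile-qualified pointer, and the intent described above is that
   such accesses are preserved by forcing the object to memory. */
static int store_through_volatile(void)
{
    int flag = 0;
    volatile int *vp = &flag;

    *vp = 1;        /* first volatile store */
    *vp = 2;        /* second volatile store, not merged with the first */
    return *vp;     /* volatile load */
}
```

The returned value is 2 either way; what the fix is about is whether the individual stores and the load survive in the generated code, which can be inspected in the assembly output rather than from the return value.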
Loop distribution does different analysis with -g0/-g due to counting a debug stmt starting a BB against a limit which will eventually lead to different IVOPTs choices. I've fixed a possible IVOPTs issue on the way even though it doesn't make a difference here. PR tree-optimization/116290 * tree-loop-distribution.cc (determine_reduction_stmt_1): PHIs have no debug variants. Start with first non-debug real stmt. * tree-ssa-loop-ivopts.cc (find_givs_in_bb): Do not analyze debug stmts. * gcc.dg/pr116290.c: New testcase. (cherry picked from commit 5667400)
The following reverts a bogus fix done for PR101009 and instead makes sure we get into the same_access_functions () case when computing the distance vector for g[1] and g[1] where the constants ended up having different types. The generic code doesn't seem to handle loop invariant dependences. The special case gets us both ( 0 ) and ( 1 ) as distance vectors while formerly we got ( 1 ), which the PR101009 fix changed to ( 0 ) with bad effects on other cases as shown in this PR. PR tree-optimization/116768 * tree-data-ref.cc (build_classic_dist_vector_1): Revert PR101009 change. * tree-chrec.cc (eq_evolutions_p): Make sure (sizetype)1 and (int)1 compare equal. * gcc.dg/torture/pr116768.c: New testcase. (cherry picked from commit 5b5a36b)
@2) Transforming -fma (-a, b, -c) to fma (a, b, c) is only valid when not rounding towards -inf or +inf as the sign of the multiplication changes. PR middle-end/116891 * match.pd ((negate (IFN_FNMS@3 @0 @1 @2)) -> (IFN_FMA @0 @1 @2)): Only enable for !HONOR_SIGN_DEPENDENT_ROUNDING. (cherry picked from commit c53bd48)
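The reason directed rounding breaks this identity can be spelled out in a few lines. The notation is introduced here for illustration and is not from the patch: $x$ is the exact value $ab + c$, and $\mathrm{RU}$, $\mathrm{RD}$ denote rounding toward $+\infty$ and $-\infty$ respectively.

```latex
\begin{align*}
  x &= a b + c \quad\text{(exact)} \\
  \operatorname{fma}(a, b, c) &= \mathrm{RU}(x)
      && \text{when rounding toward } +\infty \\
  \operatorname{fma}(-a, b, -c) &= \mathrm{RU}(-x) = -\mathrm{RD}(x) \\
  -\operatorname{fma}(-a, b, -c) &= \mathrm{RD}(x) \\
  \mathrm{RD}(x) &\neq \mathrm{RU}(x)
      && \text{whenever } x \text{ is not exactly representable}
\end{align*}
```

The same argument applies with the roles of $\mathrm{RU}$ and $\mathrm{RD}$ exchanged when rounding toward $-\infty$. Under round-to-nearest or round-toward-zero the rounding is symmetric in sign, so negating before or after the operation gives the same result.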
On Mon, Oct 14, 2024 at 08:53:29AM +0200, Jakub Jelinek wrote: > > PR middle-end/116891 > > * match.pd ((negate (IFN_FNMS@3 @0 @1 @2)) -> (IFN_FMA @0 @1 @2)): > > Only enable for !HONOR_SIGN_DEPENDENT_ROUNDING. > > Guess it would be nice to have a testcase which FAILs without the patch and > PASSes with it, but it can be added later. I've added such a testcase now, and additionally found the fix only fixed one of the 4 problematic similar cases. Here is a patch which fixes the others too and adds the testcases. fma-pr116891.c FAILed without your patch, FAILs with your patch too (but only due to the bar/baz/qux checks) and PASSes with the patch. 2024-10-15 Jakub Jelinek <[email protected]> PR middle-end/116891 * match.pd ((negate (fmas@3 @0 @1 @2)) -> (IFN_FNMS @0 @1 @2)): Only enable for !HONOR_SIGN_DEPENDENT_ROUNDING. ((negate (IFN_FMS@3 @0 @1 @2)) -> (IFN_FNMA @0 @1 @2)): Likewise. ((negate (IFN_FNMA@3 @0 @1 @2)) -> (IFN_FMS @0 @1 @2)): Likewise. * gcc.dg/pr116891.c: New test. * gcc.target/i386/fma-pr116891.c: New test. (cherry picked from commit 4366f0c)
…ication For vector types we have to make sure the comparison result is a vector type and the resulting compare operation is supported. As the resulting compare is never an equality compare I didn't bother to check for the cbranch case. PR tree-optimization/117104 * match.pd ((cmp:c (minmax:c @0 @1) @0) -> (out @0 @1)): Properly guard the vector case. * gcc.dg/pr117104.c: New testcase. (cherry picked from commit f54d42e)
The diagnostics code fails to handle non-constant domain max. PR tree-optimization/117254 * gimple-ssa-warn-access.cc (maybe_warn_nonstring_arg): Check the array domain max is constant before using it. * gcc.dg/pr117254.c: New testcase. (cherry picked from commit d464a52)
STMT_VINFO_SLP_VECT_ONLY isn't properly computed as union of all group members and when the group is later split due to duplicates not all sub-groups inherit the flag. PR tree-optimization/117307 * tree-vect-data-refs.cc (vect_analyze_data_ref_accesses): Properly compute STMT_VINFO_SLP_VECT_ONLY. Set it on all parts of a split group. * gcc.dg/vect/pr117307.c: New testcase. (cherry picked from commit 1972230)
When we decompose a complex load only used as real and imaginary parts we fail to honor IL constraints which are that a BIT_FIELD_REF of register type should be outermost in a ref. The following simply avoids the transform when the complex load has such a BIT_FIELD_REF. PR tree-optimization/117417 * tree-ssa-forwprop.cc (pass_forwprop::execute): Avoid decomposing BIT_FIELD_REF complex load. * gcc.dg/torture/pr117417.c: New testcase. (cherry picked from commit d976daa)
This patch removes the (unnecessary) CPP_PRAGMA_EOL case from cp_parser_cache_defarg, which currently has the result that any pragmas in the NSDMI cause an error. PR c++/118147 gcc/cp/ChangeLog: * parser.cc (cp_parser_cache_defarg): Don't error when CPP_PRAGMA_EOL. gcc/testsuite/ChangeLog: * g++.dg/cpp0x/nsdmi-defer7.C: New test. Signed-off-by: Nathaniel Shead <[email protected]> (cherry picked from commit f3ccc57)
We are initializing both the call graph node count and the entry block count of the function with the head_count value from the profile.

Count propagation algorithm may refine the entry block count and we may end up with a case where the call graph node count is set to zero but the entry block count is non-zero. That becomes a problem because we have this code in execute_fixup_cfg:

    profile_count num = node->count;
    profile_count den = ENTRY_BLOCK_PTR_FOR_FN (cfun)->count;
    bool scale = num.initialized_p () && !(num == den);

Here if num is 0 but den is not 0, scale becomes true and we lose the counts in

    if (scale)
      bb->count = bb->count.apply_scale (num, den);

This is what happened in the issue reported in PR116743 (a 10% regression in MySQL HAMMERDB tests). 3d9e676 made an improvement in AutoFDO count propagation, which caused a mismatch between the call graph node count (zero) and the entry block count (non-zero) and subsequent loss of counts as described above.

The fix is to update the call graph node count once we've done count propagation.

Tested on x86_64-pc-linux-gnu.

gcc/ChangeLog:

	PR gcov-profile/116743
	* auto-profile.cc (afdo_annotate_cfg): Fix mismatch between the call graph node count and the entry block count.

(cherry picked from commit e683c6b)
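The loss of counts can be reproduced with a small integer model of the scaling step. This is a sketch under simplifying assumptions: `struct count`, `apply_scale` and `fixup_counts` are hypothetical stand-ins, and GCC's real profile_count type carries quality information and handles edge cases differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for a profile count. */
struct count {
    bool initialized;
    uint64_t value;
};

/* Scale c by num/den; leave c unchanged when den is zero. */
static uint64_t apply_scale(uint64_t c, uint64_t num, uint64_t den)
{
    return den == 0 ? c : c * num / den;
}

/* Model of the fixup step: when the call graph node count differs from
   the entry block count, every basic block count is rescaled. */
static void fixup_counts(uint64_t bb[], int n, struct count node,
                         uint64_t entry)
{
    bool scale = node.initialized && node.value != entry;
    if (!scale)
        return;
    for (int i = 0; i < n; i++)
        bb[i] = apply_scale(bb[i], node.value, entry);
}
```

With a node count of 0 and an entry count of 100, every block count is multiplied by 0/100 and the profile is wiped out. Once the node count is updated to match the refined entry count, as the fix does, the scale test is false and the counts are left alone.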
Support for Apple Silicon!!!