Add RISC-V RVV implementation #898

Closed
wants to merge 3 commits into from

WoWaster

Add RVV support. The current implementation targets both RVV 0.7.1 and RVV 1.0, which is bad for two reasons:

  • we need some macro workarounds because newer compilers add the __riscv prefix to intrinsics
  • we need more instructions because some useful ones are unavailable in RVV 0.7.1
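
The prefix workaround from the first bullet could look roughly like the sketch below. It is not the code from this PR: `DEMO_NEW_NAMING`, `DEMO_OP`, and the `demo_*` functions are hypothetical stand-ins so that the snippet compiles on any host. Real code would paste the compiler's intrinsic names instead, and would pick the branch from whatever the toolchain actually exposes.

```c
#include <stdint.h>

/* DEMO_NEW_NAMING is a hypothetical switch, not an xxHash or compiler macro.
 * Defining it selects the prefixed spelling used by newer toolchains. */
#ifdef DEMO_NEW_NAMING
#  define DEMO_OP(name) demo_riscv_##name   /* stands in for __riscv_<name> */
#else
#  define DEMO_OP(name) demo_##name         /* stands in for the bare <name> */
#endif

/* Stand-in for a single intrinsic. Only the spelling that matches the
 * selected naming scheme exists, just as with a real toolchain. */
#ifdef DEMO_NEW_NAMING
static uint64_t demo_riscv_vadd(uint64_t a, uint64_t b) { return a + b; }
#else
static uint64_t demo_vadd(uint64_t a, uint64_t b) { return a + b; }
#endif

/* Kernel code is written once against DEMO_OP and never names a prefix. */
static uint64_t demo_accumulate(const uint64_t *v, int n)
{
    uint64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = DEMO_OP(vadd)(acc, v[i]);
    return acc;
}
```

Calling `demo_accumulate` on `{1, 2, 3}` sums the three values under either naming scheme. The cost of this approach is that every intrinsic used by the kernel has to go through the wrapper macro.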

RVV 1.0 can only be tested with QEMU because there are no boards with an up-to-date vector extension. QEMU also cannot be used for performance testing because it executes all vector ops on scalar registers.

I've tested RVV 0.7.1 on a Lichee Pi4A using cross-compilation with the Xuantie GCC toolchain. The default xxhsum -b results were uninformative for me, so here are the full benchmarks.
With -DXXH_VECTOR=XXH_SCALAR:

➜ ./xxhsum --benchmark-all
xxhsum 0.8.2 by Yann Collet
compiled as 64-bit riscv little endian with GCC 10.2.0
Sample of 100 KB...
 1#XXH32                         :     102400 ->    17771 it/s ( 1735.5 MB/s)
 2#XXH32 unaligned               :     102400 ->     9218 it/s (  900.2 MB/s)
 3#XXH64                         :     102400 ->    29864 it/s ( 2916.4 MB/s)
 4#XXH64 unaligned               :     102400 ->    11421 it/s ( 1115.3 MB/s)
 5#XXH3_64b                      :     102400 ->     6608 it/s (  645.3 MB/s)
 6#XXH3_64b unaligned            :     102400 ->     6613 it/s (  645.8 MB/s)
 7#XXH3_64b w/seed               :     102400 ->     6613 it/s (  645.8 MB/s)
 8#XXH3_64b w/seed unaligned     :     102400 ->     6615 it/s (  646.0 MB/s)
 9#XXH3_64b w/secret             :     102400 ->     2671 it/s (  260.8 MB/s)
10#XXH3_64b w/secret unaligned   :     102400 ->     2669 it/s (  260.7 MB/s)
11#XXH128                        :     102400 ->     6597 it/s (  644.2 MB/s)
12#XXH128 unaligned              :     102400 ->     6607 it/s (  645.2 MB/s)
13#XXH128 w/seed                 :     102400 ->     6605 it/s (  645.0 MB/s)
14#XXH128 w/seed unaligned       :     102400 ->     6611 it/s (  645.6 MB/s)
15#XXH128 w/secret               :     102400 ->     2668 it/s (  260.5 MB/s)
16#XXH128 w/secret unaligned     :     102400 ->     2668 it/s (  260.6 MB/s)
17#XXH32_stream                  :     102400 ->     7556 it/s (  737.8 MB/s)
18#XXH32_stream unaligned        :     102400 ->     7646 it/s (  746.7 MB/s)
19#XXH64_stream                  :     102400 ->    10274 it/s ( 1003.3 MB/s)
20#XXH64_stream unaligned        :     102400 ->    10234 it/s (  999.4 MB/s)
21#XXH3_stream                   :     102400 ->     2732 it/s (  266.8 MB/s)
22#XXH3_stream unaligned         :     102400 ->     2724 it/s (  266.0 MB/s)
23#XXH3_stream w/seed            :     102400 ->     2728 it/s (  266.4 MB/s)
24#XXH3_stream w/seed unaligned  :     102400 ->     2720 it/s (  265.6 MB/s)
25#XXH128_stream                 :     102400 ->     2730 it/s (  266.6 MB/s)
26#XXH128_stream unaligned       :     102400 ->     2724 it/s (  266.0 MB/s)
27#XXH128_stream w/seed          :     102400 ->     2726 it/s (  266.2 MB/s)
28#XXH128_stream w/seed unaligne :     102400 ->     2719 it/s (  265.5 MB/s)

With -DXXH_VECTOR=XXH_RVV:

➜ ./xxhsum --benchmark-all -i10
xxhsum 0.8.2 by Yann Collet
compiled as 64-bit riscv little endian with GCC 10.2.0
Sample of 100 KB...
 1#XXH32                         :     102400 ->    17779 it/s ( 1736.2 MB/s)
 2#XXH32 unaligned               :     102400 ->     8977 it/s (  876.7 MB/s)
 3#XXH64                         :     102400 ->    32211 it/s ( 3145.6 MB/s)
 4#XXH64 unaligned               :     102400 ->     9707 it/s (  947.9 MB/s)
 5#XXH3_64b                      :     102400 ->     4840 it/s (  472.6 MB/s)
 6#XXH3_64b unaligned            :     102400 ->     4676 it/s (  456.6 MB/s)
 7#XXH3_64b w/seed               :     102400 ->     4717 it/s (  460.6 MB/s)
 8#XXH3_64b w/seed unaligned     :     102400 ->     4620 it/s (  451.2 MB/s)
 9#XXH3_64b w/secret             :     102400 ->     4521 it/s (  441.5 MB/s)
10#XXH3_64b w/secret unaligned   :     102400 ->     4420 it/s (  431.7 MB/s)
11#XXH128                        :     102400 ->     4759 it/s (  464.7 MB/s)
12#XXH128 unaligned              :     102400 ->     4651 it/s (  454.2 MB/s)
13#XXH128 w/seed                 :     102400 ->     4671 it/s (  456.2 MB/s)
14#XXH128 w/seed unaligned       :     102400 ->     4564 it/s (  445.7 MB/s)
15#XXH128 w/secret               :     102400 ->     4428 it/s (  432.4 MB/s)
16#XXH128 w/secret unaligned     :     102400 ->     4335 it/s (  423.3 MB/s)
17#XXH32_stream                  :     102400 ->     7542 it/s (  736.5 MB/s)
18#XXH32_stream unaligned        :     102400 ->     7630 it/s (  745.2 MB/s)
19#XXH64_stream                  :     102400 ->    10239 it/s (  999.9 MB/s)
20#XXH64_stream unaligned        :     102400 ->    10217 it/s (  997.7 MB/s)
21#XXH3_stream                   :     102400 ->     4910 it/s (  479.5 MB/s)
22#XXH3_stream unaligned         :     102400 ->     4737 it/s (  462.6 MB/s)
23#XXH3_stream w/seed            :     102400 ->     4908 it/s (  479.3 MB/s)
24#XXH3_stream w/seed unaligned  :     102400 ->     4730 it/s (  461.9 MB/s)
25#XXH128_stream                 :     102400 ->     4906 it/s (  479.1 MB/s)
26#XXH128_stream unaligned       :     102400 ->     4732 it/s (  462.1 MB/s)
27#XXH128_stream w/seed          :     102400 ->     4903 it/s (  478.8 MB/s)
28#XXH128_stream w/seed unaligne :     102400 ->     4726 it/s (  461.5 MB/s)
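
To make the two runs easier to compare, the ratio of RVV to scalar throughput can be computed per row. The sketch below does that for four rows copied from the output above. The `demo_*` names are hypothetical, and the comparison is only approximate because the second run was invoked with `-i10` and the first was not.

```c
/* One benchmark row: throughput in MB/s for the scalar and RVV builds. */
struct demo_row {
    const char *name;
    double scalar_mbps;
    double rvv_mbps;
};

/* Values copied from the two xxhsum runs above (rows 5, 9, 21 and 19). */
static const struct demo_row demo_rows[] = {
    { "XXH3_64b",           645.3, 472.6 },
    { "XXH3_64b w/secret",  260.8, 441.5 },
    { "XXH3_stream",        266.8, 479.5 },
    { "XXH64_stream",      1003.3, 999.9 },
};

/* Ratio above 1.0 means the RVV build was faster for that row. */
static double demo_ratio(const struct demo_row *r)
{
    return r->rvv_mbps / r->scalar_mbps;
}
```

On these numbers the RVV build is slower for one-shot XXH3_64b (ratio below 1), clearly faster for the w/secret and streaming XXH3 variants, and essentially unchanged for XXH64_stream, which does not use the vector path.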

As soon as I get a board with RVV 1.0, I'll try to adjust the code and benchmark it, but for now let this remain a draft.

@Cyan4973
Owner

Cyan4973 commented Dec 8, 2024

This PR has remained blocked in draft mode and has not seen much progress for a long period of time.
I presume that means it's abandoned? In which case, the logical next step would be to close it.

@WoWaster
Author

I think this PR can be closed. This branch is definitely abandoned. I have a new version in another branch that targets RVV 1.0 only. It showed a 4x speed-up on a Banana Pi BPI-F3 with a 256-bit wide vector unit. IMO the code still has some room for refinement, and unfortunately I'm currently unable to find time to finish it.

@Cyan4973 Cyan4973 closed this Dec 15, 2024