feat(xxhash3): Support LASX instruction set and refactor LSX implement #996

24bit-xjkp · 2025-01-12T11:39:22Z

Use __lsx_vmul_d directly instead of emulating a 64-bit multiply with two 32-bit multiplies.
Add LASX support.
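
For context, XXH3's scramble step multiplies each 64-bit accumulator lane by a 32-bit prime. The portable sketch below is not the actual LSX code. It contrasts the two approaches. The first does one direct 64-bit multiply per lane, which is what a native 64-bit lane multiply such as __lsx_vmul_d provides. The second splits the lane into 32-bit halves and recombines two 32x32->64 products. The function names and the SKETCH_PRIME32_1 macro are hypothetical. The constant 0x9E3779B1 is xxHash's XXH_PRIME32_1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for xxHash's XXH_PRIME32_1 constant. */
#define SKETCH_PRIME32_1 0x9E3779B1U

/* Direct approach: one 64-bit multiply per lane (wraps modulo 2^64). */
static uint64_t scramble_lane_direct(uint64_t acc, uint64_t key)
{
    acc ^= acc >> 47;               /* xorshift */
    acc ^= key;                     /* mix in the secret */
    return acc * SKETCH_PRIME32_1;
}

/* Emulated approach: two 32x32->64 multiplies recombined.
 * acc = hi * 2^32 + lo, so acc * p = lo * p + ((hi * p) << 32) mod 2^64. */
static uint64_t scramble_lane_emulated(uint64_t acc, uint64_t key)
{
    acc ^= acc >> 47;
    acc ^= key;
    uint64_t lo = (uint32_t)acc;    /* low 32 bits */
    uint64_t hi = acc >> 32;        /* high 32 bits */
    uint64_t prod_lo = lo * SKETCH_PRIME32_1;
    uint64_t prod_hi = hi * SKETCH_PRIME32_1;
    return prod_lo + (prod_hi << 32);
}

/* Self-check: both paths must agree for any input. */
static void check_lane(uint64_t acc, uint64_t key)
{
    assert(scramble_lane_direct(acc, key) == scramble_lane_emulated(acc, key));
}
```

Both paths produce the same result modulo 2^64. The direct form needs fewer operations when the hardware can multiply 64-bit lanes natively.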


24bit-xjkp · 2025-01-12T11:40:12Z

LSX:

xxhsum 0.8.3 by Yann Collet
compiled as 64-bit loongarch little endian with GCC 15.0.0 20240714 (experimental)
Sample of 100 KB...
 1#XXH32                         :     102400 ->    49781 it/s ( 4861.5 MB/s)
 2#XXH32 unaligned               :     102400 ->    49787 it/s ( 4862.0 MB/s)
 3#XXH64                         :     102400 ->   121442 it/s (11859.6 MB/s)
 4#XXH64 unaligned               :     102400 ->   117311 it/s (11456.1 MB/s)
 5#XXH3_64b                      :     102400 ->   207791 it/s (20292.1 MB/s)
 6#XXH3_64b unaligned            :     102400 ->   190468 it/s (18600.4 MB/s)
 7#XXH3_64b w/seed               :     102400 ->   208711 it/s (20381.9 MB/s)
 8#XXH3_64b w/seed unaligned     :     102400 ->   192219 it/s (18771.4 MB/s)
 9#XXH3_64b w/secret             :     102400 ->   158258 it/s (15454.9 MB/s)
10#XXH3_64b w/secret unaligned   :     102400 ->   146024 it/s (14260.1 MB/s)
11#XXH128                        :     102400 ->   207481 it/s (20261.8 MB/s)
12#XXH128 unaligned              :     102400 ->   190216 it/s (18575.8 MB/s)
13#XXH128 w/seed                 :     102400 ->   207458 it/s (20259.6 MB/s)
14#XXH128 w/seed unaligned       :     102400 ->   191485 it/s (18699.7 MB/s)
15#XXH128 w/secret               :     102400 ->   154919 it/s (15128.8 MB/s)
16#XXH128 w/secret unaligned     :     102400 ->   147794 it/s (14433.0 MB/s)
17#XXH32_stream                  :     102400 ->    60791 it/s ( 5936.6 MB/s)
18#XXH32_stream unaligned        :     102400 ->    59631 it/s ( 5823.3 MB/s)
19#XXH64_stream                  :     102400 ->   121414 it/s (11856.8 MB/s)
20#XXH64_stream unaligned        :     102400 ->   119730 it/s (11692.4 MB/s)
21#XXH3_stream                   :     102400 ->   199493 it/s (19481.7 MB/s)
22#XXH3_stream unaligned         :     102400 ->   193645 it/s (18910.7 MB/s)
23#XXH3_stream w/seed            :     102400 ->   199307 it/s (19463.6 MB/s)
24#XXH3_stream w/seed unaligned  :     102400 ->   193333 it/s (18880.2 MB/s)
25#XXH128_stream                 :     102400 ->   199387 it/s (19471.3 MB/s)
26#XXH128_stream unaligned       :     102400 ->   193626 it/s (18908.8 MB/s)
27#XXH128_stream w/seed          :     102400 ->   199291 it/s (19462.0 MB/s)
28#XXH128_stream w/seed unaligne :     102400 ->   193416 it/s (18888.3 MB/s)

LASX:

xxhsum 0.8.3 by Yann Collet
compiled as 64-bit loongarch little endian with GCC 15.0.0 20240714 (experimental)
Sample of 100 KB...
 1#XXH32                         :     102400 ->    48038 it/s ( 4691.2 MB/s)
 2#XXH32 unaligned               :     102400 ->    49793 it/s ( 4862.6 MB/s)
 3#XXH64                         :     102400 ->   121443 it/s (11859.6 MB/s)
 4#XXH64 unaligned               :     102400 ->   117315 it/s (11456.6 MB/s)
 5#XXH3_64b                      :     102400 ->   277294 it/s (27079.5 MB/s)
 6#XXH3_64b unaligned            :     102400 ->   277269 it/s (27077.1 MB/s)
 7#XXH3_64b w/seed               :     102400 ->   276368 it/s (26989.1 MB/s)
 8#XXH3_64b w/seed unaligned     :     102400 ->   276325 it/s (26984.9 MB/s)
 9#XXH3_64b w/secret             :     102400 ->   265865 it/s (25963.3 MB/s)
10#XXH3_64b w/secret unaligned   :     102400 ->   265864 it/s (25963.3 MB/s)
11#XXH128                        :     102400 ->   276331 it/s (26985.5 MB/s)
12#XXH128 unaligned              :     102400 ->   276379 it/s (26990.1 MB/s)
13#XXH128 w/seed                 :     102400 ->   274741 it/s (26830.2 MB/s)
14#XXH128 w/seed unaligned       :     102400 ->   274736 it/s (26829.7 MB/s)
15#XXH128 w/secret               :     102400 ->   264432 it/s (25823.4 MB/s)
16#XXH128 w/secret unaligned     :     102400 ->   264332 it/s (25813.7 MB/s)
17#XXH32_stream                  :     102400 ->    60791 it/s ( 5936.6 MB/s)
18#XXH32_stream unaligned        :     102400 ->    59625 it/s ( 5822.8 MB/s)
19#XXH64_stream                  :     102400 ->   121413 it/s (11856.7 MB/s)
20#XXH64_stream unaligned        :     102400 ->   119726 it/s (11692.0 MB/s)
21#XXH3_stream                   :     102400 ->   274586 it/s (26815.0 MB/s)
22#XXH3_stream unaligned         :     102400 ->   274407 it/s (26797.6 MB/s)
23#XXH3_stream w/seed            :     102400 ->   273574 it/s (26716.3 MB/s)
24#XXH3_stream w/seed unaligned  :     102400 ->   273352 it/s (26694.6 MB/s)
25#XXH128_stream                 :     102400 ->   275256 it/s (26880.4 MB/s)
26#XXH128_stream unaligned       :     102400 ->   275257 it/s (26880.5 MB/s)
27#XXH128_stream w/seed          :     102400 ->   274371 it/s (26794.1 MB/s)
28#XXH128_stream w/seed unaligne :     102400 ->   274349 it/s (26791.9 MB/s)

24bit-xjkp · 2025-01-12T11:41:49Z

Benchmarks for the previous LSX implementation are in #981.

Cyan4973 · 2025-01-13T00:18:04Z

xxhash.h

@@ -1125,6 +1125,7 @@ XXH_PUBLIC_API XXH_PUREF XXH64_hash_t XXH64_hashFromCanonical(XXH_NOESCAPE const
 #  define XXH_VSX    5 /*!< VSX and ZVector for POWER8/z13 (64-bit) */
 #  define XXH_SVE    6 /*!< SVE for some ARMv8-A and ARMv9-A */
 #  define XXH_LSX    7 /*!< LSX (128-bit SIMD) for LoongArch64 */
+#  define XXH_LASX    8 /*!< LASX (256-bit SIMD) for LoongArch64 */


minor nit: alignment

Cyan4973 · 2025-01-13T00:31:25Z

Great work @24bit-xjkp !

I mostly have some comments to make sure the context is well understood.

LSX and LASX are both vector extensions specific to LoongArch. LSX is 128-bit, while LASX is 256-bit.
The LSX implementation is improved in this PR, but since the modification only impacts the scrambling stage, performance change is expected to be minor.
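
XXH3 keeps 8 accumulators of 64 bits each. That makes the 128-bit vs 256-bit difference concrete: a 128-bit LSX register holds 2 lanes, so 4 registers cover the accumulators, while a 256-bit LASX register holds 4 lanes, so 2 registers suffice. The scalar model below is a hypothetical sketch, not the xxHash code. It runs the scramble step in groups of `lanes_per_vec` lanes and counts the simulated vector iterations. Grouping changes the iteration count but not the result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ACC_NB 8            /* XXH3 keeps 8 x 64-bit accumulators */
#define SKETCH_PRIME32_1 0x9E3779B1U

/* Scramble all accumulators, `lanes_per_vec` lanes per simulated register:
 * 2 for a 128-bit register (LSX), 4 for a 256-bit register (LASX).
 * Returns the number of simulated vector iterations. */
static size_t scramble_acc_model(uint64_t acc[SKETCH_ACC_NB],
                                 const uint64_t key[SKETCH_ACC_NB],
                                 size_t lanes_per_vec)
{
    assert(lanes_per_vec != 0 && SKETCH_ACC_NB % lanes_per_vec == 0);
    size_t iterations = 0;
    for (size_t i = 0; i < SKETCH_ACC_NB; i += lanes_per_vec) {
        /* The inner loop stands in for one vector instruction sequence. */
        for (size_t j = i; j < i + lanes_per_vec; j++) {
            uint64_t a = acc[j];
            a ^= a >> 47;
            a ^= key[j];
            acc[j] = a * SKETCH_PRIME32_1;
        }
        iterations++;
    }
    return iterations;
}
```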

Also, a request:

compiled as 64-bit loongarch little endian with GCC 15.0.0 20240714 (experimental)

It would be great if the welcome message indicated which mode (scalar/lsx/lasx) was compiled. This would help identify with certainty which mode is being benchmarked.

The display logic is mostly there : https://github.com/Cyan4973/xxHash/blob/dev/cli/xsum_arch.h#L165

24bit-xjkp · 2025-01-13T03:27:48Z

I have added the SIMD extension in use to the welcome message.

The LSX implementation is improved in this PR, but since the modification only impacts the scrambling stage, performance change is expected to be minor.

Yes, I only fixed the unnecessary use of 32-bit multiplies, since LSX and LASX can execute a 64-bit multiply as fast as a 32-bit one. This also slightly reduces the binary size.

… LoongArch
1. Display the mode in use, as follows:
   "loongarch64 + lasx" -> LoongArch64 platform with LoongArch Advanced SIMD Extension (LASX)
   "loongarch64 + lsx" -> LoongArch64 platform with LoongArch SIMD Extension (LSX)
   "loongarch64" -> LoongArch64 platform, scalar implementation
2. Align the defines in xxhash.h.
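
The selection order above can be sketched as follows. This is a hypothetical helper, not the actual code in cli/xsum_arch.h. It assumes the compiler predefines __loongarch_asx under -mlasx and __loongarch_sx under -mlsx. LASX is checked first because a LASX-enabled build typically has LSX enabled as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: LASX first, then LSX, then the scalar fallback. */
static const char *loongarch_mode_string(bool has_lasx, bool has_lsx)
{
    if (has_lasx) return "loongarch64 + lasx";
    if (has_lsx)  return "loongarch64 + lsx";
    return "loongarch64";
}

/* Compile-time detection (assumed predefined macros, see lead-in). */
#if defined(__loongarch_asx)
#  define SKETCH_HAS_LASX true
#else
#  define SKETCH_HAS_LASX false
#endif
#if defined(__loongarch_sx)
#  define SKETCH_HAS_LSX true
#else
#  define SKETCH_HAS_LSX false
#endif

/* Mode string for the current build; only meaningful on LoongArch64. */
static const char *current_mode_string(void)
{
    return loongarch_mode_string(SKETCH_HAS_LASX, SKETCH_HAS_LSX);
}
```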

24bit-xjkp · 2025-01-17T13:36:50Z

Shall we merge this PR?

Cyan4973 · 2025-01-17T17:57:03Z

Yes!

feat(xxhash3): Support LASX instruction set and refactor LSX implement

63e083c


Cyan4973 self-assigned this Jan 13, 2025

Cyan4973 reviewed Jan 13, 2025

24bit-xjkp force-pushed the dev branch from 0588fd9 to 7d6bd4e January 13, 2025 03:30

Cyan4973 approved these changes Jan 13, 2025

24bit-xjkp mentioned this pull request Jan 13, 2025

Call for contributors: Add LoongArch SIMD support to xxHash loongson-community/discussions#72

Closed

Cyan4973 merged commit 51fa4ef into Cyan4973:dev Jan 17, 2025
63 checks passed
