This repository contains benchmarks for common vertical and horizontal operations that
leverage SIMD, comparing different implementations of the same algorithms
using packed_simd2, core_simd and Rust arrays.
Things implemented:
- sum of values
- sum of nullable values where nulls are represented as Vec<bool>
- sum of nullable values where nulls are represented as Bitmap
- min of values
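
To make the two null representations concrete, here is a minimal sketch in plain Rust. It is not the code benchmarked in this repository: the function names sum_with_bools and sum_with_bitmap are hypothetical, and the conventions are assumptions (a true flag or a set bit means "valid", bits are read least-significant first within each byte, and an all-null input yields None).

```rust
/// Sum of values whose validity is stored as one `bool` per value.
/// Assumption: `true` means the value is valid (not null).
fn sum_with_bools(values: &[f32], validity: &[bool]) -> Option<f32> {
    assert_eq!(values.len(), validity.len());
    let mut sum = 0.0f32;
    let mut any_valid = false;
    for (value, &is_valid) in values.iter().zip(validity) {
        if is_valid {
            sum += *value;
            any_valid = true;
        }
    }
    if any_valid { Some(sum) } else { None }
}

/// Sum of values whose validity is packed into a bitmap, one bit per value.
/// Assumption: bit `i % 8` of byte `i / 8` (least-significant bit first)
/// is set when value `i` is valid.
fn sum_with_bitmap(values: &[f32], bitmap: &[u8]) -> Option<f32> {
    assert!(bitmap.len() * 8 >= values.len());
    let mut sum = 0.0f32;
    let mut any_valid = false;
    for (i, value) in values.iter().enumerate() {
        let is_valid = (bitmap[i / 8] >> (i % 8)) & 1 == 1;
        if is_valid {
            sum += *value;
            any_valid = true;
        }
    }
    if any_valid { Some(sum) } else { None }
}
```

For example, with values [1.0, 2.0, 3.0, 4.0], the flags [true, false, true, true] and the single bitmap byte 0b0000_1101 describe the same nulls. The bitmap uses one bit per value where Vec<bool> uses one byte.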
Algorithms implemented:
- core_simd: vertical sum over lanes with a reduce at the end using core_simd
- packed_simd: vertical sum over lanes with a reduce at the end using packed_simd
- nonsimd: vertical sum over lanes with a reduce at the end using Rust arrays
- naive: sum using Rust iterators
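
The difference between the nonsimd and naive variants can be sketched in stable Rust as follows. This is an illustration rather than the repository's implementation: the function names and the lane count of 16 are assumptions. The idea is to keep one independent accumulator per lane (the vertical sum) and add the lanes together only once at the end (the horizontal reduce).

```rust
/// Number of lanes in the accumulator. 16 is an assumed value;
/// 16 f32 values fill one 512-bit register.
const LANES: usize = 16;

/// Vertical sum over lanes with a reduce at the end, using a plain array.
fn nonsimd_sum(values: &[f32]) -> f32 {
    let chunks = values.chunks_exact(LANES);
    // Elements left over when the length is not a multiple of LANES.
    let remainder = chunks.remainder();

    // Vertical step: lane `i` accumulates element `i` of every chunk.
    let mut acc = [0.0f32; LANES];
    for chunk in chunks {
        for i in 0..LANES {
            acc[i] += chunk[i];
        }
    }

    // Horizontal step: reduce the lanes to a scalar, then add the tail.
    let lanes_total: f32 = acc.iter().sum();
    let tail_total: f32 = remainder.iter().sum();
    lanes_total + tail_total
}

/// Sum using Rust iterators: one accumulator, strictly left to right.
fn naive_sum(values: &[f32]) -> f32 {
    values.iter().sum()
}
```

Because floating-point addition is not associative, the compiler generally cannot reorder the single dependency chain in naive_sum into lane-wise additions, whereas the per-lane accumulators in nonsimd_sum are independent and can be vectorized. For the same reason, the two functions may differ in the last bits of the result on arbitrary inputs.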
Command:
RUSTFLAGS="-C target-cpu=native" cargo bench -- "2\^20"
core_simd_sum 2^20 f32 [156.96 us 158.06 us 159.40 us]
packed_simd_sum 2^20 f32 [184.17 us 184.47 us 184.85 us]
nonsimd_sum 2^20 f32 [175.05 us 176.26 us 177.95 us]
naive_sum 2^20 f32 [1.6636 ms 1.6700 ms 1.6778 ms]
core_simd_sum null 2^20 f32 [2.3610 ms 2.3713 ms 2.3831 ms]
packed_simd_sum null 2^20 f32 [1.5737 ms 1.5869 ms 1.6022 ms]
nonsimd_sum null 2^20 f32 [1.8009 ms 1.8133 ms 1.8276 ms]
naive_sum null 2^20 f32 [1.6418 ms 1.6520 ms 1.6660 ms]
core_simd_sum bitmap 2^20 f32 [174.24 us 175.10 us 176.21 us]
nonsimd_sum bitmap 2^20 f32 [541.78 us 545.16 us 549.09 us]
naive_sum bitmap 2^20 f32 [1.6740 ms 1.6922 ms 1.7149 ms]
core_simd_min 2^20 f32 [376.98 us 378.40 us 379.72 us]
packed_simd_min 2^20 f32 [181.77 us 182.95 us 185.05 us]
nonsimd_min 2^20 f32 [185.89 us 186.35 us 186.83 us]
naive_min 2^20 f32 [2.0208 ms 2.0274 ms 2.0341 ms]
Command:
cargo bench -- "2\^20"
core_simd_sum 2^20 f32 [184.95 us 185.86 us 186.97 us]
packed_simd_sum 2^20 f32 [184.97 us 186.85 us 189.59 us]
nonsimd_sum 2^20 f32 [191.35 us 192.67 us 194.46 us]
naive_sum 2^20 f32 [1.6385 ms 1.6426 ms 1.6466 ms]
core_simd_sum null 2^20 f32 [882.21 us 889.56 us 897.74 us]
packed_simd_sum null 2^20 f32 [824.37 us 835.77 us 849.63 us]
nonsimd_sum null 2^20 f32 [695.79 us 707.87 us 721.98 us]
naive_sum null 2^20 f32 [1.6418 ms 1.6520 ms 1.6660 ms]
core_simd_sum bitmap 2^20 f32 [929.95 us 936.31 us 943.64 us]
nonsimd_sum bitmap 2^20 f32 [454.78 us 462.08 us 471.82 us]
naive_sum bitmap 2^20 f32 [1.7633 ms 1.7736 ms 1.7855 ms]
core_simd_min 2^20 f32 [286.86 us 289.22 us 292.03 us]
packed_simd_min 2^20 f32 [230.50 us 234.12 us 238.86 us]
nonsimd_min 2^20 f32 [245.75 us 249.19 us 254.00 us]
naive_min 2^20 f32 [2.8560 ms 2.8721 ms 2.8885 ms]
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz
Stepping: 4
CPU MHz: 2095.077
BogoMIPS: 4190.15
Virtualization: VT-x
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti tpr_shadow vnmi ept vpid fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec xsaves md_clear
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.