Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using Vector capabilities of the CPU for sha256 in ssz merkelization of lists #213

Open
4 tasks
g11tech opened this issue Nov 22, 2021 · 7 comments
Open
4 tasks
Assignees

Comments

@g11tech
Copy link
Contributor

g11tech commented Nov 22, 2021

In discussion with @potuz, it was discovered that there is scope for using capabilities of SIMD enabled processors, use case: ssz merkalization of the lists for which @potuz has reported 10x improvment.

goos: linux
goarch: amd64
cpu: AMD Ryzen 5 3600 6-Core Processor
BenchmarkHashBalanceShani-12                  160       7629704 ns/op
BenchmarkHashBalanceShaniPrysm-12              15      74012328 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz
BenchmarkHashBalanceAVX-4               68      26677965 ns/op
BenchmarkHashBalancePrysm-4              7     165434686 ns/op
PASS

goos: linux
goarch: amd64
cpu: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
BenchmarkHashBalanceAVX2-4             121       9711482 ns/op
BenchmarkHashBalancePrysm-4             10     103716714 ns/op
PASS

Reference Links:
https://github.com/potuz/mammon/blob/main/ssz/sha256_avx2.asm#L635-L659
https://github.com/potuz/mammon/blob/main/ssz/hasher.hpp#L27

Based on this, digged through to realize that assembly script has support for SIMD vector processing: https://v8.dev/features/simd
There are two was this can be done:

  1. Via compiler flags for auto optimization of vector loops for single digest
  2. Via using assembly script wrapper functions to vectorize the computation for parallelizing multiple digest processings (the approach followed by @potuz in his reference implementation, more optimal wherever multi digest & SIMD compatible workload available)

Task:

  • Investigate and get familiar SIMD support directives in assembly script
  • Investigate and develop if possible, loop parallelization for SIMD
  • Investigate and develop multiple digest feeds
  • Integrate the multi digest support in ssz merkelization of lists
@g11tech g11tech self-assigned this Nov 22, 2021
@dapplion
Copy link
Contributor

Oh damn! 🚀 How can check if my host supports SIMD?

@potuz
Copy link

potuz commented Nov 23, 2021

Oh damn! rocket How can check if my host supports SIMD?

cpuid gives you this. A C++ call is here https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L43-L55

@dapplion
Copy link
Contributor

Oh damn! rocket How can check if my host supports SIMD?

cpuid gives you this. A C++ call is here https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L43-L55

Thank you!

$ cpuid
CPU 0:
...
   feature information (1/edx):
...
      SSE extensions                         = true
      SSE2 extensions                        = true
   feature information (1/ecx):
...
      SSE4.1 extensions                       = true
      SSE4.2 extensions                       = true

😍 how common is support on modern CPUs?

@potuz
Copy link

potuz commented Nov 23, 2021

Oh damn! rocket How can check if my host supports SIMD?

cpuid gives you this. A C++ call is here https://github.com/potuz/mammon/blob/main/ssz/hasher.cpp#L43-L55

Thank you!

$ cpuid
CPU 0:
...
   feature information (1/edx):
...
      SSE extensions                         = true
      SSE2 extensions                        = true
   feature information (1/ecx):
...
      SSE4.1 extensions                       = true
      SSE4.2 extensions                       = true

heart_eyes how common is support on modern CPUs?

I'm making the case to implement in prysm expecting at least SSE3 which has been the standard since 2004/5 I don't expect a single CPU out there without SSE3 actually staking. In practical terms, I don't think there's a single one without AVX. This is Intel speaking. I haven't looked yet into ARM assembly.

@dapplion
Copy link
Contributor

dapplion commented Nov 23, 2021

That's huge then! Would love to see this in Lodestar.

I did some comparisons with Lighthouse on our hashing throughput and somehow Lodestar is x5 slower when bench-marking hashing a full state but when bench-marking hashing a single 64 bytes value performance is the same. Would be worth to research forward to get the most of this improvement @g11tech

ChainSafe/lodestar#2206

@potuz
Copy link

potuz commented Nov 23, 2021

That's huge then! Would love to see this in Lodestar.

I did some comparisons with Lighthouse on our hashing throughput and somehow Lodestar is x5 slower when bench-marking hashing a full state but when bench-marking hashing a single 64 bytes value performance is the same. Would be worth to research forward to get the most of this improvement

ChainSafe/lodestar#2206

This requires both changes in the assembly to return buffers with all roots at the same time, and changes in the hashing logic to call the whole block at the same time instead of pairwise leaves. I put out a stupid implementation in the design document, surely it can be improved, but this is already giving those x10 benches against production prysm on large lists:
https://hackmd.io/@potuz/BJyrx9DOF

@potuz
Copy link

potuz commented Jan 3, 2022

We'll most probably be using https://github.com/prysmaticlabs/hashtree, It's on very early stages of development, but I'll be happy to see some benchmarks from Lodestar if you could test it. If you decide to use it I'll be happy to provide bindings or whatever you need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants