feat: add new parallel implementation for permute_expression_pair
#189
base: main
I still need to analyze the algorithm in depth. So far it looks good. I will give it another look later today.
I also left some comments. It would be nice to see at least some numbers on the performance and memory-consumption changes, so that we know whether this is feasible to merge.
if input_ranges.is_empty() {
    input_ranges.push((coeff, 0..count));
} else {
    let prev_end = input_ranges.last().unwrap().1.end;
Since we already check for the empty case above, we can skip the check when unwrapping:
- let prev_end = input_ranges.last().unwrap().1.end;
+ let prev_end = unsafe { input_ranges.last().unwrap_unchecked().1.end };
},
)
.reduce_with(|r1, mut r2| {
    let r1_end = r1.last().unwrap().1.end;
Not sure how we know we will never panic here.
Each `r1` is the result of the previous `fold` step. As long as the `fold` is over a nonempty iterator, the output should be nonempty. So `r1` is nonempty unless `input_uniques` is empty, I believe.
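The fold-then-reduce pattern discussed in this thread can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the PR's code: slice chunks stand in for rayon's parallel splits, `u64` stands in for field elements, and the names `Ranges`, `fold_ranges`, `merge_ranges`, and `chunked_ranges` are hypothetical. Using `map_or(0, ...)` instead of `unwrap()` means an empty partial result is handled without panicking and without `unsafe`.

```rust
use std::ops::Range;

/// (value, range of rows assigned to that value); hypothetical alias.
type Ranges = Vec<(u64, Range<usize>)>;

/// Fold step: turn (value, count) pairs into contiguous ranges starting at 0.
fn fold_ranges(uniques: &[(u64, usize)]) -> Ranges {
    let mut out: Ranges = Vec::with_capacity(uniques.len());
    for &(value, count) in uniques {
        // An empty `out` simply starts at 0, so no unwrap is needed.
        let start = out.last().map_or(0, |(_, r)| r.end);
        out.push((value, start..start + count));
    }
    out
}

/// Reduce step: append `r2` after `r1`, shifting `r2` by the end of `r1`.
fn merge_ranges(mut r1: Ranges, r2: Ranges) -> Ranges {
    let offset = r1.last().map_or(0, |(_, r)| r.end);
    r1.extend(
        r2.into_iter()
            .map(|(v, r)| (v, r.start + offset..r.end + offset)),
    );
    r1
}

/// Sequential stand-in for a parallel fold followed by a reduce.
fn chunked_ranges(uniques: &[(u64, usize)], chunk_size: usize) -> Ranges {
    uniques
        .chunks(chunk_size.max(1))
        .map(fold_ranges)
        .fold(Vec::new(), merge_ranges)
}
```

Because the merge only shifts offsets, splitting the input into chunks should give the same ranges as folding it in one pass, which is the property a parallel version relies on.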
// didn't want to bother with Sync rng or anything so just do this part sequentially
let blinding: Vec<(C::Scalar, C::Scalar)> = (usable_rows..params.n() as usize)
    .into_iter()
    .map(|_| (C::Scalar::random(&mut rng), C::Scalar::random(&mut rng)))
    .collect();
Hmm, we can maybe file an issue if this turns out to be critical.
I'm planning to write a polynomial lib for
Sure, what kind of machine do you usually bench on? Not sure my Macbook will be a good standard heh.
Sounds nice! Right now I don't have to use polynomial stuff much outside of just
I bench on my laptop too. It has 16 CPUs, so enough to see if it's indeed better performance-wise.
We have done some end-to-end benchmarking on AWS servers with between 16 and 128 cores. We haven't really studied the bench results at high core counts, so I think for this PR benchmarks with 8 or 16 cores should be good enough.
@jonathanpwang Did you collect some numbers on your Mac?
ping @jonathanpwang
For now, the benchmarks on the `bench_lookup` I added are:

Using `permute_expression_pair_par`:
```
Benchmarking bench-lookup/14: Warming up for 3.0000 s
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 20.425916ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 20.426375ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 20.261ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.694833ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 20.660875ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.887375ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 20.239875ms
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.8s.
Benchmarking bench-lookup/14: Collecting 10 samples in estimated 6.8417 s (10 iterations)
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 20.548208ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.5575ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 20.678708ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 20.956375ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 20.183791ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.00175ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.986916ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.358875ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 21.128708ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 18.92425ms
bench-lookup/14 time:   [678.46 ms 686.72 ms 694.16 ms]
                change: [-0.7211% +1.2598% +3.1373%] (p = 0.25 > 0.05)
                No change in performance detected.
Benchmarking bench-lookup/15: Warming up for 3.0000 s
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 40.454916ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.66425ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.290333ms
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 12.5s.
Benchmarking bench-lookup/15: Collecting 10 samples in estimated 12.503 s (10 iterations)
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 40.871916ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 39.03175ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 44.727416ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 42.948333ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 40.489958ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 43.823041ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 39.592ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 40.593375ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.861708ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 45.023333ms
bench-lookup/15 time:   [1.2285 s 1.2341 s 1.2393 s]
                change: [-6.0038% -4.4123% -2.7223%] (p = 0.00 < 0.05)
                Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  1 (10.00%) low mild
  1 (10.00%) high mild
Benchmarking bench-lookup/16: Warming up for 3.0000 s
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 92.282041ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 90.784875ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 95.368958ms
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 23.9s.
Benchmarking bench-lookup/16: Collecting 10 samples in estimated 23.937 s (10 iterations)
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 93.599166ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 95.992583ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 91.913625ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 89.482625ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 90.111875ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 86.671916ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 96.854666ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 102.468125ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 97.830583ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 87.925708ms
bench-lookup/16 time:   [2.3417 s 2.3644 s 2.3901 s]
                change: [+2.2283% +4.1485% +6.0108%] (p = 0.00 < 0.05)
                Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking bench-lookup/17: Warming up for 3.0000 s
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 214.102916ms
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 45.0s.
Benchmarking bench-lookup/17: Collecting 10 samples in estimated 45.000 s (10 iterations)
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 199.65025ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 208.088875ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 208.299666ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 199.684416ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 199.761666ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 193.034458ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 202.182375ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 200.825375ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 226.314541ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 211.914291ms
bench-lookup/17 time:   [4.3987 s 4.4299 s 4.4605 s]
                change: [+0.7989% +1.9668% +3.0962%] (p = 0.01 < 0.05)
                Change within noise threshold.
Benchmarking bench-lookup/18: Warming up for 3.0000 s
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 423.016291ms
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 85.7s.
Benchmarking bench-lookup/18: Collecting 10 samples in estimated 85.748 s (10 iterations)
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 451.549291ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 469.336ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 429.5375ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 430.579041ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 435.976541ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 416.241875ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 423.361041ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 436.833625ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 456.685458ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 454.897541ms
bench-lookup/18 time:   [8.5101 s 8.5631 s 8.6175 s]
                change: [+0.5067% +1.6515% +2.8407%] (p = 0.02 < 0.05)
                Change within noise threshold.
```

Using `permute_expression_pair_seq`:
```
Benchmarking bench-lookup/14: Warming up for 3.0000 s
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.35325ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.101125ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 18.721708ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.291333ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 18.860208ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 18.553916ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 18.965375ms
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.9s.
Benchmarking bench-lookup/14: Collecting 10 samples in estimated 6.8584 s (10 iterations)
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.117458ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 18.80025ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.169875ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.03325ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 18.636166ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.170166ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.000416ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 18.565958ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 18.970333ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 19.007291ms
bench-lookup/14 time:   [679.11 ms 687.62 ms 696.42 ms]
                change: [-1.5707% +0.1313% +1.8844%] (p = 0.89 > 0.05)
                No change in performance detected.
Benchmarking bench-lookup/15: Warming up for 3.0000 s
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.634625ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 40.915958ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 40.774625ms
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 12.6s.
Benchmarking bench-lookup/15: Collecting 10 samples in estimated 12.569 s (10 iterations)
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.548583ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.074333ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.807125ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.106458ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.222541ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.021458ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.411666ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.024416ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.636541ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 41.869166ms
bench-lookup/15 time:   [1.2474 s 1.2597 s 1.2711 s]
                change: [+0.9856% +2.0736% +3.1604%] (p = 0.00 < 0.05)
                Change within noise threshold.
Benchmarking bench-lookup/16: Warming up for 3.0000 s
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 90.202208ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 89.318083ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 88.717125ms
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 23.8s.
Benchmarking bench-lookup/16: Collecting 10 samples in estimated 23.789 s (10 iterations)
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 89.929041ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 88.316333ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 89.588083ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 88.630916ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 88.872ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 88.961416ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 88.796833ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 89.80725ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 89.426916ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 89.039833ms
bench-lookup/16 time:   [2.3537 s 2.3854 s 2.4165 s]
                change: [-0.7038% +0.8909% +2.5108%] (p = 0.33 > 0.05)
                No change in performance detected.
Benchmarking bench-lookup/17: Warming up for 3.0000 s
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 187.758583ms
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 43.2s.
Benchmarking bench-lookup/17: Collecting 10 samples in estimated 43.177 s (10 iterations)
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 187.717833ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 185.975333ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 187.108625ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 187.965416ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 188.279458ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 188.287166ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 188.590291ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 186.355708ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 186.600458ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 188.600875ms
bench-lookup/17 time:   [4.3965 s 4.4531 s 4.5112 s]
                change: [-0.8127% +0.5248% +2.0814%] (p = 0.52 > 0.05)
                No change in performance detected.
Benchmarking bench-lookup/18: Warming up for 3.0000 s
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 405.103375ms
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 86.2s.
Benchmarking bench-lookup/18: Collecting 10 samples in estimated 86.153 s (10 iterations)
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 406.387125ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 402.833916ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 403.729208ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 404.397583ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 404.277833ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 409.32825ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 404.037583ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 403.349875ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 405.868083ms
[halo2_proofs/src/plonk/lookup/prover.rs:418] start.elapsed() = 403.907458ms
bench-lookup/18 time:   [8.5238 s 8.5804 s 8.6488 s]
                change: [-0.7212% +0.2021% +1.2400%] (p = 0.70 > 0.05)
                No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
```
Sorry, was busy. It was also a bit hard because there was no existing lookup benchmark using a real prover. I originally wanted to test just the [...] I added a very basic lookup bench, which is not very comprehensive since the lookups are very uniform. It uses IPA, but I don't think that matters for benchmarking the lookup permutation. For now, the benchmarks on the `bench_lookup` I added are posted above. On my laptop (M2 Max), the parallelized version doesn't seem to be any better, but I think this is because the circuit I used is too simple. I'd like to run the benchmarks on the zkevm keccak circuit, but I need to resolve some versioning issues for that first. I will post an update once I have those benchmarks.
@jonathanpwang @CPerezz We have some pretty large circuits that we could benchmark on if that helps. Would need to update this branch with #192 once merged for it to work -- then can send some numbers yonder
A new algorithm to get (A', S') that is fully multi-threaded: this is a different algorithm than the original `permute_expression_pair_seq`.
I observed that the previous computation was still single-threaded at some places, which becomes a bottleneck for larger circuits. This is because the way A', S' need to be permuted is rather esoteric and not so parallel friendly.
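To illustrate why this permutation is awkward to parallelize, here is a minimal sequential sketch of the usual way (A', S') is built for a lookup argument: A' is the sorted input, S' places a table value on the row where each distinct input value first appears, and the remaining table values fill the rows holding repeats. This is an illustration under stated assumptions, not the code in this PR: `u64` stands in for field elements, blinding rows are ignored, and the name `permute_pair_sketch` is hypothetical.

```rust
use std::collections::BTreeMap;

/// Returns (A', S'), or None if some input value does not occur in the table.
fn permute_pair_sketch(input: &[u64], table: &[u64]) -> Option<(Vec<u64>, Vec<u64>)> {
    assert_eq!(input.len(), table.len());

    // A' is the input column sorted, so equal values sit next to each other.
    let mut permuted_input = input.to_vec();
    permuted_input.sort_unstable();

    // Multiset of table values that have not been placed into S' yet.
    let mut leftover: BTreeMap<u64, usize> = BTreeMap::new();
    for &t in table {
        *leftover.entry(t).or_insert(0) += 1;
    }

    let mut permuted_table = vec![0u64; table.len()];
    let mut repeated_rows = Vec::new();
    for (row, &a) in permuted_input.iter().enumerate() {
        if row == 0 || a != permuted_input[row - 1] {
            // First occurrence of `a`: S' must hold the same value on this row.
            let count = leftover.get_mut(&a)?;
            *count -= 1;
            permuted_table[row] = a;
        } else {
            // Repeat of the previous row: any unused table value may go here.
            repeated_rows.push(row);
        }
    }

    // Fill the repeated rows with whatever table values are still unused.
    for (&t, &count) in &leftover {
        for _ in 0..count {
            let row = repeated_rows.pop()?;
            permuted_table[row] = t;
        }
    }
    Some((permuted_input, permuted_table))
}
```

The first loop depends on the previous row and on shared mutable counts, and the second loop consumes a shared list of free rows, which is the kind of sequential dependency a parallel version has to restructure.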
My implementation isn't optimized by any means: I just aggressively use rayon and fold on `BTreeMap` (in Axiom's repo I used `HashMap`, so I'm not sure of the performance difference with `BTreeMap`).
Also, I'm not sure why indexing by `Range` wasn't implemented on `Polynomial` before, even though it was for `RangeTo`. I could also add it for `RangeFrom` if there's interest.
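As a rough sketch of what range indexing on a polynomial-like type can look like, the example below implements `Index` for `Range`, `RangeTo`, and `RangeFrom` on a wrapper around a coefficient vector. The `Poly` struct here is a hypothetical stand-in with `u64` values, not the repository's `Polynomial` type; only the `std::ops` traits are real.

```rust
use std::ops::{Index, Range, RangeFrom, RangeTo};

/// Hypothetical stand-in for a polynomial: a plain vector of values.
struct Poly {
    values: Vec<u64>,
}

// `poly[a..b]`
impl Index<Range<usize>> for Poly {
    type Output = [u64];
    fn index(&self, range: Range<usize>) -> &[u64] {
        &self.values[range]
    }
}

// `poly[..b]`
impl Index<RangeTo<usize>> for Poly {
    type Output = [u64];
    fn index(&self, range: RangeTo<usize>) -> &[u64] {
        &self.values[range]
    }
}

// `poly[a..]`
impl Index<RangeFrom<usize>> for Poly {
    type Output = [u64];
    fn index(&self, range: RangeFrom<usize>) -> &[u64] {
        &self.values[range]
    }
}
```

Each impl simply forwards to slice indexing on the inner vector, so out-of-bounds ranges panic the same way they would on a `Vec`.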