Skip to content
This repository has been archived by the owner on Jul 5, 2024. It is now read-only.

Memory requirements for generating a full block proof #1768

Open
ed255 opened this issue Feb 15, 2024 · 4 comments
Open

Memory requirements for generating a full block proof #1768

ed255 opened this issue Feb 15, 2024 · 4 comments
Labels
T-performance Type: performance improvements T-tech-debt Type: tech-debt generated or cleaned up

Comments

@ed255
Copy link
Member

ed255 commented Feb 15, 2024

I am worried about the current architecture and the memory consumption it entails. I did a lower bound estimate and found that SuperCircuit with k=26 (That's 32 chunks for 30M gas) needs 21 TB of memory. See results from #1763

I think we need to discuss ways to reduce the memory consumption. Here are a few ideas:

  • From the circuit side
    • Reduce the expression degree. Currently it's 10. We could easily bring it down to 9 (to get ~50% mem) or 7 (to get ~25% mem)
    • Reuse advice columns in subcircuits. Currently each subcircuit uses its own advice columns, but for a target worst case gas amount not all rows are used in all subcircuits. We could lay out circuits vertically. If we do this in the circuit side, we may need to refactor the code to use virtual columns instead of read columns, and synthesize subcircuits in a virtual circuit instead of the real one; so perhaps it's easy to implement this via halo2.
  • From halo2 side
    • Support vertical stacking layouting: treat columns in regions as virtual and support combining several into one by stacking multiple regions with different virtual advice columns using the same real advice columns.
    • Implement Scroll's memory optimization with chunked FFT Feat/scroll mem reduction halo2#274
@ed255 ed255 added T-performance Type: performance improvements T-tech-debt Type: tech-debt generated or cleaned up labels Feb 15, 2024
@rrtoledo
Copy link
Contributor

@ed255 do you know which FFT algo we are using?

With ‎Cooley–Tukey FFT algorithm, we pad the evaluations to the closest power of 2 before doing iFFTs (as the FFTs on power of 2s are much cheaper).

We should be using iFFTs directly for the commitment but also for polynomial multiplications and so the end polynomial degree can be much higher (unless you already posted that in the stats PR).

So memory-wise, we may have higher gain than what you said.

@hero78119
Copy link
Member

Notice the benchmark are taking on k = 26 with 32 chunks.

What do you think about another straw-man idea by trading smaller k with larger chunks ?
Ideally if we reduce k in a magnitude, memory consumption during proof generation should be cut also nearly half.
Under current aggregate proof scheme, if only one prover generating all chunk proof, each chunk proof after generated can be discard.
If we adopt multiple prover, the overall latency should also being improved thanks to parallelism

@ed255
Copy link
Member Author

ed255 commented Feb 23, 2024

@ed255 do you know which FFT algo we are using?

With ‎Cooley–Tukey FFT algorithm, we pad the evaluations to the closest power of 2 before doing iFFTs (as the FFTs on power of 2s are much cheaper).

In halo2 all polynomials are stored in vectors, and these vectors are always preallocated with sizes power of 2, so I would say the padding happens implicitly (it's just elements in the vector that are not assigned and have 0 by default).

We should be using iFFTs directly for the commitment but also for polynomial multiplications and so the end polynomial degree can be much higher (unless you already posted that in the stats PR).

The stats PR already considers the polynomials in the extended domain (which depends on the max expression degree). Is this what you mean? I believe the biggest source of memory consumption comes from the polynomials in the extended domain.

So memory-wise, we may have higher gain than what you said.

On a related note, the numbers of the stats utility are theoretical. In practice the memory usage of the process may be higher due to:

  • allocations that we didn't consider, for example coming from iterators? (but maybe we could study those if they appear and try to fix them)
  • things we may have missed?

Notice the benchmark are taking on k = 26 with 32 chunks.

What do you think about another straw-man idea by trading smaller k with larger chunks ? Ideally if we reduce k in a magnitude, memory consumption during proof generation should be cut also nearly half. Under current aggregate proof scheme, if only one prover generating all chunk proof, each chunk proof after generated can be discard. If we adopt multiple prover, the overall latency should also being improved thanks to parallelism.

Yes! I think that's something we could easily do now. On one hand we have to dimensions: memory and compute; and by changing the k (and thus the number of chunks) we have different memory and compute values (which may be a tradeoff; I assume at some point we can trade memory by compute and vice-versa). As you say, the good thing about compute is that we can parallelize or distribute the work (due to aggregation) and reduce memory, increase compute but not increase time (because we add more machines).

On the other hand, I think it would be great to find the sweet spot of the aggregation configuration:

  • If the circuit is too small, the aggregation overhead is too high overall
  • If the circuit is too big, the aggregation overhead is small (but maybe the aggregation proof takes longer?)
  • moreover, we can play with the k, and somehow with the number of advice columns.
  • If we have an aggregation tree, how many children should a node have?

@rrtoledo
Copy link
Contributor

The stats PR already considers the polynomials in the extended domain (which depends on the max expression degree). Is this what you mean? I believe the biggest source of memory consumption comes from the polynomials in the extended domain.

I was not sure it was the extended domain, but you clarified this. I agree with you!

On a related note, the numbers of the stats utility are theoretical. In practice the memory usage of the process may be higher due to:
allocations that we didn't consider, for example coming from iterators? (but maybe we could study those if they appear and try to fix them)
things we may have missed?
On top of my head, the usual biggest costs are iFFTs, commitment and computing the quotient polynomial. However, because of the nb of columns we have, I wouldn't be surprised is the permutation argument (even if we split the poly) is very costly too.

Also, one straightforward thing that we can do for the time being (before thinking of merging columns and so on) is check if the FFT algo we use is sparse-friendly.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-performance Type: performance improvements T-tech-debt Type: tech-debt generated or cleaned up
Projects
None yet
Development

No branches or pull requests

3 participants