Use randomness more efficiently #335
Conversation
The on-the-fly derivation of column randomizers is, as best as I can tell, correct and secure, and the interface to get them looks exactly spot-on to me.
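For readers without the diff in front of them, here is a minimal sketch of the kind of on-the-fly derivation meant here. The function name, the choice of ChaCha as the stream source, the XOR-based domain separation, and `NUM_RANDOMIZERS` are all my assumptions, not necessarily what the PR implements:

```rust
use rand::{RngCore, SeedableRng};
use rand_chacha::ChaCha20Rng;

/// Number of randomizer coefficients per column; stands in for the ε above.
/// The real value depends on the protocol's zero-knowledge requirements.
const NUM_RANDOMIZERS: usize = 4;

/// Hypothetical sketch: derive one column's randomizers from a master seed,
/// so that the randomizers never need to be stored alongside the trace.
fn column_randomizers(master_seed: [u8; 32], column_index: u64) -> Vec<u64> {
    // Domain-separate per column by mixing the column index into the seed.
    let mut seed = master_seed;
    for (s, b) in seed[..8].iter_mut().zip(column_index.to_le_bytes()) {
        *s ^= b;
    }
    let mut rng = ChaCha20Rng::from_seed(seed);
    (0..NUM_RANDOMIZERS).map(|_| rng.next_u64()).collect()
}
```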
That said, the current implementation misses out on performance benefits.
The problem is that there are still points in time where the entire randomized trace lives in RAM (yes, in the monomial coefficient basis, but that changes nothing). As far as I can tell, the peak RAM cost does not drop; if anything, it increases.

Before this PR:
- store the entire randomized trace throughout ($2 \cdot N \cdot \mathsf{w}$)
- allocate memory for a duplicate of the entire randomized trace ($2 \cdot N \cdot \mathsf{w}$)
- run JIT LDE for quotient calculation using that duplicate matrix and no other memory, by repeatedly interpolating and evaluating on a new coset before evaluating the AIR ($0$)

Total memory cost: $4 \cdot N \cdot \mathsf{w}$

With this PR:
- store the entire unrandomized trace throughout, along with seeds ($N \cdot \mathsf{w} + \epsilon \cdot \mathsf{w}$)
- allocate memory for a duplicate of the entire randomized trace in monomial coefficient form ($2 \cdot N \cdot \mathsf{w}$)
- run JIT LDE for quotient calculation using a new matrix derived from the randomized trace ($2 \cdot N \cdot \mathsf{w}$)

Total memory cost: $5 \cdot N \cdot \mathsf{w} + \epsilon \cdot \mathsf{w}$

So I expect the memory cost to increase, not decrease.
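To put rough numbers on this comparison, here is a throwaway calculation; $N$, $\mathsf{w}$, and the field-element size are made up for illustration, not measured:

```rust
/// Illustrative parameters only; real values depend on the program being proven.
const N: usize = 1 << 20; // unrandomized trace length
const W: usize = 380; // number of trace columns, "w" above
const FIELD_ELEMENT_BYTES: usize = 8;

fn gib(field_elements: usize) -> f64 {
    (field_elements * FIELD_ELEMENT_BYTES) as f64 / (1usize << 30) as f64
}

fn main() {
    let before = 4 * N * W; // 2·N·w stored + 2·N·w duplicate
    let after = 5 * N * W; // N·w stored + 2·N·w duplicate + 2·N·w in JIT LDE (ε omitted)
    println!("before this PR: {:.2} GiB", gib(before)); // ~11.88 GiB
    println!("with this PR:   {:.2} GiB", gib(after)); // ~14.84 GiB
}
```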
Here are a few workflows you might want to consider instead; select one depending on your level of ambition.
Workflow 1:
- store the entire unrandomized trace throughout, along with a master seed ($N \cdot \mathsf{w} + \epsilon$)
- allocate a new matrix containing the entire randomized trace in monomial coefficient basis ($2 \cdot N \cdot \mathsf{w}$)
- use it in JIT LDE to compute quotients ($0$)

Total memory cost: $3 \cdot N \cdot \mathsf{w} + \epsilon$
Workflow 2:
- store the entire unrandomized trace throughout, along with a master seed ($N \cdot \mathsf{w} + \epsilon$)
- allocate a new matrix containing the unrandomized trace in monomial coefficient basis ($N \cdot \mathsf{w}$)
- use it in JIT LDE to compute quotients ($0$), noting that:
  - you need to manually add in the terms originating from the randomizers on every coset (see the sketch after Workflow 3)
  - you need to tweak `segmentify` so that it works on the unrandomized instead of the randomized domain

Total memory cost: $2 \cdot N \cdot \mathsf{w} + \epsilon$
Workflow 3:
- same as Workflow 2, but the trace stored throughout is the same matrix that is used in the course of JIT LDE

Total memory cost: $N \cdot \mathsf{w} + \epsilon$
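Regarding the extra step in Workflow 2: since the randomized interpolant has the shape $f(X) + Z_H(X) \cdot r(X)$, where $f$ interpolates the unrandomized column over the trace domain $H$, $Z_H$ is the zerofier of $H$, and $r$ is the small randomizer polynomial, the randomizer contribution can be added coset by coset without ever materializing the randomized trace. A minimal sketch, generic over the field type; all names here are mine:

```rust
use std::ops::{Add, Mul};

/// Sketch: evaluate one randomized column on a coset without storing the
/// randomized trace. Inputs are the evaluations of f, Z_H, and r on the
/// coset; the output is f + Z_H · r, evaluated pointwise.
fn randomized_column_on_coset<F>(
    f_on_coset: &[F],
    zerofier_on_coset: &[F],
    randomizer_on_coset: &[F],
) -> Vec<F>
where
    F: Copy + Add<Output = F> + Mul<Output = F>,
{
    f_on_coset
        .iter()
        .zip(zerofier_on_coset)
        .zip(randomizer_on_coset)
        .map(|((&f, &z), &r)| f + z * r)
        .collect()
}

fn main() {
    // Toy demo over plain integers; a real prover would use field elements.
    let f = [1u64, 2, 3];
    let z = [5u64, 5, 5];
    let r = [7u64, 0, 1];
    assert_eq!(randomized_column_on_coset(&f, &z, &r), vec![36, 2, 8]);
}
```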
My measurements of proving the program:

with explicit randomizers (old)
without explicit randomizers (new)

That said, it is worth some additional effort to bring the memory consumption down further.
Force-pushed from 954bb27 to 30968d2.
With the most recent changes, we now have basically “Workflow 2” (new new).

The total has dropped by 0.7 GB (~20%).
There are still some memory savings up for grabs, although I am not sure how they compare to the memory costs of other steps in the pipeline. More comments inline.
Force-pushed from f1f7375 to 2a875a9.
Force-pushed from 2a875a9 to d948934.
With the most recent changes, we now have basically “Workflow 3” (new new new).

The total has dropped by another 0.7 GB (~28%), all of which seems to come from savings in one place. Additionally, my benchmarks indicate that runtime performance has improved a tad.
Overall: lgtm. Some minor comments and change requests inline.
No need to bounce back though; I trust that your response to the requested changes will be excellent.
One final thing: do tell, please: what's the magnitude of “a tad”?
changelog: ignore
In particular, the memory-vs-compute-time trade-off can now be tuned via the constant `RANDOMIZED_TRACE_LEN_TO_WORKING_DOMAIN_LEN_RATIO`, which is hardcoded for now.

Co-authored-by: Alan <[email protected]>
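The constant's name is from this PR, but the following sketch of the trade-off it might control is my own illustration: a larger ratio means a shorter working domain, so the coset-sized scratch matrix is smaller (less RAM), at the price of visiting more cosets (more compute).

```rust
/// Value and surrounding code are illustrative only.
const RANDOMIZED_TRACE_LEN_TO_WORKING_DOMAIN_LEN_RATIO: usize = 2;

/// Sketch: derive the working domain length and the number of cosets that
/// must be visited to cover the whole quotient domain.
fn quotient_cosets(randomized_trace_len: usize, quotient_domain_len: usize) -> (usize, usize) {
    let working_domain_len =
        randomized_trace_len / RANDOMIZED_TRACE_LEN_TO_WORKING_DOMAIN_LEN_RATIO;
    let num_cosets = quotient_domain_len / working_domain_len;
    (working_domain_len, num_cosets)
}

fn main() {
    // Made-up sizes: randomized trace of length 2^21, quotient domain 2^23.
    let (len, cosets) = quotient_cosets(1 << 21, 1 << 23);
    println!("working domain: {len}, cosets to visit: {cosets}"); // 1048576, 8
}
```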
The runtime performance gain of the memory-efficient path that comes with this PR is 15%. 🎉

Profile at branch's base
Profile at branch's tip

A substantial amount of the savings comes from witness generation. This makes sense, as the tables held in memory are now half as large, and we now generate only exactly as much randomness as ZK requires. Note that the large shift away from the category “hash” is due to now-correct annotations.
Force-pushed from d948934 to 3ffa68f.
The runtime performance gain of the caching path that comes with this PR is 10%. 🎉

Profile at branch's base:
Prove Fibonacci 10000: time [6.5011 s 6.5121 s 6.5221 s]

Profile at branch's tip:
Prove Fibonacci 10000: time [5.8572 s 5.8721 s 5.8835 s], change [-10.083% -9.8281% -9.5672%] (p = 0.00 < 0.05). Performance has improved.
Transform codewords into polynomials in monomial coefficient form in place, reducing peak RAM consumption. Also, clear caches that have become unusable to decrease RAM usage even further.
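A sketch of the two memory savers described in this commit message; the struct and field names are hypothetical stand-ins for the real table structures, and `intt_in_place` stands in for whatever inverse-NTT routine the codebase uses:

```rust
use std::collections::HashMap;

struct MasterTable {
    /// One buffer per column: before the transform it holds the codeword
    /// (evaluations); afterwards, the monomial coefficients.
    columns: Vec<Vec<u64>>,
    /// Cached low-degree extensions that become unusable once the columns
    /// switch representation.
    lde_cache: HashMap<usize, Vec<u64>>,
}

impl MasterTable {
    fn to_monomial_coefficient_form(&mut self, intt_in_place: impl Fn(&mut [u64])) {
        for column in &mut self.columns {
            // In-place inverse NTT: coefficients overwrite evaluations, so
            // no duplicate matrix is ever allocated.
            intt_in_place(column.as_mut_slice());
        }
        // Replace the cache with an empty map; dropping the old one returns
        // its memory immediately.
        self.lde_cache = HashMap::new();
    }
}
```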
Force-pushed from 3ffa68f to 4adeda7.
Remove the “randomized trace table,” which stored a copy of the entire trace interleaved with randomness.