-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
git: sync rewards calculated incorrectly because of totalActiveBalance being off-by-one in EpochData storage #12310
Comments
mmmh - Ok - I think I can guess what the issue is... Btw, thanks for noticing... I am about to remake the archivial node of caplin so that It can be synced only in a few hours so this helps. |
Can I have some context in what this is? an ongoing epoch that then finished and go processed in the database? Anyway - I think that it is unlikely that |
Hey could you try branch |
The context for this is this:
Finalization was true in both cases, but in the first case it was finalized just 1-2 minutes ago, the second case it was finalized (and antiquated) already more than an hour ago. |
I currently don't have a btrfs setup, so I can't easily make a copy-on-write backup of my DB, and therefore I can't really try changes that result in a DB that is half-half. My worry is the following: once I apply this change and start running with it, I can't even write a workaround, because I have to somehow know at query time if I'm querying the "old and broken" part or the "new and good" part. Alternatively, is your change fully isolated to |
Also, now that we are talking about this code, I have a followup question: what happens if the epoch (or after the fix epoch+1) slot is missed on the beacon chain? Does this code still run? Because it seems important to do this bookkeeping at the right moment and I was just wondering about the different error cases. |
yes - all good... the slot has to be processed even if empty. |
Hmm, is there a command to delete caplin processed states from the DB by slot/epoch when erigon is stopped? Then I would be able to test your code, by remembering where caplin was before I tried and just deleting later the epochs/slots that was created by the patched code... |
Mh - I can be the one testing it... no worries |
for now - use your patch... I have a synced node and will do testing shortly |
I fixed it in #12314 . I will not include this in the next alpha tho. I recommend you to use your patch for the time being and I will leave this ticket open accordingly |
Sounds like a plan, quick idea: why don't we include my patch until then in git? The hack is fully local to an RPC code, and easy to remove it at the same time when you include #12314 later. We could even already include the removal in #12314 so you don't forget. Should I prepare a PR with appropriate comment saying it's a temporary hack? |
Ohh, and don't forget that I haven't investigated that maybe the other fields of the EpochData are off-by-one epoch too, which can cause trouble in the future. |
no - it is just that we are not resetting a cache. all other fields are fine |
System information
OS & Version: Linux
Commit hash: b66cc69
Erigon Command (with flags/config): full caplin sync
The problem
Sync rewards are calculated slightly incorrectly:
The calculated reward is off by 32, because every slot is off by 1, and we have 32 slots in the epoch.
Here is the script called
syncwrong.sh
if anyone wants to reproduce the issue:Workaround
I investigated for a while, and first of all, I just wanted to say that if anybody wants a quick fix, there is this patch (that I use now):
Having this patch in, I was able to confirm that the calculation is fixed. Now, the patch itself doesn't make sense, why would we get the epochData from the next epoch to do a calculation this epoch?
Investigation
I added some debug print statements, and my conclusion right now, is that unfortunately the
epochData.TotalActiveBalance
field is off-by-one epoch in the database. This happens, becausecl/antiquary/beacon_states_collector.go:storeEpochData
writes to the antique DB beforecl/phase1/forkchoice/on_block.go:OnBlock
actually deals with the epoch transition. I'm not 100% sure of this, but kind of confident.What is sure, that the epoch data is screwed in the DB already with current 3.0.0-alpha caplin:
I modified the validator_inclusion endpoint locally to only show total_active_balance instead of the original lighthouse specific debug values. And then I waited for 317928 to be just finished, and ran the above commands.
Reference data: https://beaconcha.in/epoch/317928 (correct number is 34,334,260 ETH).
As one can see, the first answer from caplin is correct (from forkchoiceStore), but the second one is incorrect (from antique db store), and also it matches my theory that it's off by one: https://beaconcha.in/epoch/317927
I'm happy to work on this and propose pull requests, but I need quite a lot of help, because I have a lot of questions:
OnBlock
already updated the data?EpochData
are also off-by-one or not?Can someone from the erigon team kindly help out here? @Giulio2002 can you please take this on or route me in the team, if you are not the appropriate person? Thanks!
The text was updated successfully, but these errors were encountered: