-
Notifications
You must be signed in to change notification settings - Fork 470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ledger: Clear Merkle Trie on Commit Error #5568
ledger: Clear Merkle Trie on Commit Error #5568
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5568 +/- ##
==========================================
+ Coverage 54.92% 54.96% +0.03%
==========================================
Files 463 464 +1
Lines 64570 64619 +49
==========================================
+ Hits 35468 35520 +52
+ Misses 26720 26718 -2
+ Partials 2382 2381 -1
... and 6 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look good, please remerge master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting the trie to nil on a tracker commit error causes lazy execution of MakeTrie to reset the trie back to the prior disk-committed trie root on subsequent commit.
Co-authored-by: Bob Broderick <[email protected]>
09ebb4e
Co-authored-by: Bob Broderick <[email protected]>
What
This commit clears the Catchpoint Tracker's "Balances Trie" in cases where the commitRound function encountered an error.
Additionally: At the end of Server
initialize
,s.Stop
is added to the Exit Handlers for logrus. Logging at the Fatal level will prompt logrus to exit the program viaos.exit
, and by registering s.Stop, we first invoke a graceful shutdown of the service and node. Whentracker.go
commitRound
fails, if the underlying error matches a sqlite3ErrIoErr
, we invoke Fatal to start a graceful shutdown of the node.Why
The ledger/tracker manages several individual trackers who contribute to a commit to disk. All trackers first
prepareCommit
, and thencommitRound
. These disk updates are all done within a single transaction context, so the write-to-disk operation is atomic/transactional.Additionally, if
prepareCommit
orcommitRound
fails for any tracker, all trackers are prompted tohandleUnorderedCommitOrError
, which should rewind any affect from the attempted commit.However, in the case of
catchpointTracker
,commitRound
calls intoaccountsUpdateBalances
, which makes updates to the underlyingbalancesTrie
of the catchpointTracker. When an error is encountered after the Merkle Trie updates, the trie remains modified. If a future commit to disk is successful, incorrect merkle trie data may be committed, which may cause errors on the node going forward (see testing below)For example, accountsUpdateBalances directly modifies the catchpoint tracker's trie, and if/when there is a failure of commitRound, handleUnorderedCommitOrError does not reverse the changes.
How (to fix)
Trackers are currently controlled by the ledger tracker, which uses a common interface to control the flow of the independent data recorders:
If at any point in the tracker's commit process there is an error, or the round is found to be out-of-order, the handling function
handleUnorderedCommitOrError
is called.This PR splits that control function into individual functions:
This matches the situations in which the tracker directs the individual trackers to handleUnorderedCommitOrError.
This was done to avoid clearing out the trie excessively, as only
commitRound
changes the rootHash.All implementations of the tracker interface had to get these new functions, but the catchpointTracker is the only one using these calls, so it was the only one in need of actual refactor.
When the catchpoint tracker
handleCommitError
, the original logic is followed, and additionally, the balances Trie is set tonil
. This prevents bad data from being used by the tracker. The catchpoint tracker is still able to handle new commits, and will reload its merkle trie from disk.Testing
Unit Test
A new unit test is included to check that:
commitRound
with relevant updateshandleUnorderedCommit
andhandlePrepareCommitError
are callednil
ifhandleCommitError
is calledI also checked that by removing the
balancesTrie = nil
, this test fails.Manual Repro
In order to reproduce the divergence of Root Hashes, and to observe the fix, I set up a local 3-node network, each using catchpoints, and all set to log their Catchpoint Root Hash on every commit.
I then added to the catchpoint
commitRound
a 1-in-3 chance of throwing a random error, after updating the trie.After launching the network and putting a pingpong load on Node1, I observed that the three nodes had the same Root Hash until one of them hit the random failure. At that point, the node which hit the failure had a divergence of Root Hash. After a short time, all three nodes hit errors and similarly diverged.
After stopping and restarting the network, I observed that two of the three nodes were unable to come up fully, each reporting errors from the Merkle Trie attempting to load the cache from disk, in different ways. It would appear that by having a failed commit, successful commits would push incorrect roots to persistence, effectively corrupting their catchpoint systems.
Adding the fix
I added a one-line version of this fix to clear the trie on when calling
handleUnorderedCommitorError
, and launched a new network. This time, I observed that even when nodes hit errors in commitRound, their root hashes remaind the same