Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ledger: Clear Merkle Trie on Commit Error #5568

Merged
merged 25 commits into from
Aug 11, 2023

Conversation

AlgoAxel
Copy link
Contributor

@AlgoAxel AlgoAxel commented Jul 13, 2023

What

This commit clears the Catchpoint Tracker's "Balances Trie" in cases where the commitRound function encountered an error.

Additionally: At the end of Server initialize, s.Stop is added to the Exit Handlers for logrus. Logging at the Fatal level will prompt logrus to exit the program via os.exit, and by registering s.Stop, we first invoke a graceful shutdown of the service and node. When tracker.go commitRound fails, if the underlying error matches a sqlite3 ErrIoErr, we invoke Fatal to start a graceful shutdown of the node.

Why

The ledger/tracker manages several individual trackers who contribute to a commit to disk. All trackers first prepareCommit, and then commitRound. These disk updates are all done within a single transaction context, so the write-to-disk operation is atomic/transactional.
Additionally, if prepareCommit or commitRound fails for any tracker, all trackers are prompted to handleUnorderedCommitOrError, which should rewind any affect from the attempted commit.

However, in the case of catchpointTracker, commitRound calls into accountsUpdateBalances, which makes updates to the underlying balancesTrie of the catchpointTracker. When an error is encountered after the Merkle Trie updates, the trie remains modified. If a future commit to disk is successful, incorrect merkle trie data may be committed, which may cause errors on the node going forward (see testing below)

For example, accountsUpdateBalances directly modifies the catchpoint tracker's trie, and if/when there is a failure of commitRound, handleUnorderedCommitOrError does not reverse the changes.

How (to fix)

Trackers are currently controlled by the ledger tracker, which uses a common interface to control the flow of the independent data recorders:

  • PrepareCommit
  • CommitRound
  • PostCommit
    If at any point in the tracker's commit process there is an error, or the round is found to be out-of-order, the handling function handleUnorderedCommitOrError is called.

This PR splits that control function into individual functions:

  • handlePrepareCommitError
  • handleUnorderedCommit
  • handleCommitError

This matches the situations in which the tracker directs the individual trackers to handleUnorderedCommitOrError.
This was done to avoid clearing out the trie excessively, as only commitRound changes the rootHash.

All implementations of the tracker interface had to get these new functions, but the catchpointTracker is the only one using these calls, so it was the only one in need of actual refactor.

When the catchpoint tracker handleCommitError, the original logic is followed, and additionally, the balances Trie is set to nil. This prevents bad data from being used by the tracker. The catchpoint tracker is still able to handle new commits, and will reload its merkle trie from disk.

Testing

Unit Test

A new unit test is included to check that:

  • Merkle trie roots change when the catchpointTracker runs commitRound with relevant updates
  • The root hash remains the same when handleUnorderedCommit and handlePrepareCommitError are called
  • The trie is set to nil if handleCommitError is called
  • Reloading from disk brings you back to the original root
  • Committing again immediately after handling error uses the lazy initializer and returns you to the same new-root from the first time you committed the data.

I also checked that by removing the balancesTrie = nil, this test fails.

Manual Repro

In order to reproduce the divergence of Root Hashes, and to observe the fix, I set up a local 3-node network, each using catchpoints, and all set to log their Catchpoint Root Hash on every commit.

I then added to the catchpoint commitRound a 1-in-3 chance of throwing a random error, after updating the trie.
After launching the network and putting a pingpong load on Node1, I observed that the three nodes had the same Root Hash until one of them hit the random failure. At that point, the node which hit the failure had a divergence of Root Hash. After a short time, all three nodes hit errors and similarly diverged.

After stopping and restarting the network, I observed that two of the three nodes were unable to come up fully, each reporting errors from the Merkle Trie attempting to load the cache from disk, in different ways. It would appear that by having a failed commit, successful commits would push incorrect roots to persistence, effectively corrupting their catchpoint systems.

Adding the fix

I added a one-line version of this fix to clear the trie on when calling handleUnorderedCommitorError, and launched a new network. This time, I observed that even when nodes hit errors in commitRound, their root hashes remaind the same

@AlgoAxel AlgoAxel marked this pull request as draft July 13, 2023 19:58
@AlgoAxel AlgoAxel changed the title Clear Merkle Trie on Commit Error Fix: Clear Merkle Trie on Commit Error Jul 13, 2023
@codecov
Copy link

codecov bot commented Jul 13, 2023

Codecov Report

Merging #5568 (8fca9ce) into master (867aa17) will increase coverage by 0.03%.
Report is 6 commits behind head on master.
The diff coverage is 57.69%.

@@            Coverage Diff             @@
##           master    #5568      +/-   ##
==========================================
+ Coverage   54.92%   54.96%   +0.03%     
==========================================
  Files         463      464       +1     
  Lines       64570    64619      +49     
==========================================
+ Hits        35468    35520      +52     
+ Misses      26720    26718       -2     
+ Partials     2382     2381       -1     
Files Changed Coverage Δ
daemon/algod/server.go 4.28% <0.00%> (-0.03%) ⬇️
ledger/store/trackerdb/interface.go 0.00% <0.00%> (ø)
...edger/store/trackerdb/sqlitedriver/sqlitedriver.go 5.92% <31.03%> (+5.92%) ⬆️
ledger/acctonline.go 79.44% <66.66%> (+0.26%) ⬆️
ledger/acctupdates.go 71.54% <66.66%> (-0.29%) ⬇️
ledger/bulletin.go 93.75% <66.66%> (-1.91%) ⬇️
ledger/metrics.go 97.36% <66.66%> (-2.64%) ⬇️
ledger/notifier.go 86.79% <66.66%> (-1.45%) ⬇️
ledger/spverificationtracker.go 95.71% <66.66%> (-0.67%) ⬇️
ledger/txtail.go 78.14% <66.66%> (-0.32%) ⬇️
... and 4 more

... and 6 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@AlgoAxel AlgoAxel marked this pull request as ready for review July 17, 2023 19:35
ledger/tracker.go Outdated Show resolved Hide resolved
ledger/tracker.go Show resolved Hide resolved
logging/log_test.go Show resolved Hide resolved
logging/log.go Outdated Show resolved Hide resolved
ledger/tracker.go Show resolved Hide resolved
algorandskiy
algorandskiy previously approved these changes Aug 9, 2023
Copy link
Contributor

@algorandskiy algorandskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good, please remerge master

algorandskiy
algorandskiy previously approved these changes Aug 10, 2023
bbroder-algo
bbroder-algo previously approved these changes Aug 11, 2023
Copy link
Contributor

@bbroder-algo bbroder-algo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the trie to nil on a tracker commit error causes lazy execution of MakeTrie to reset the trie back to the prior disk-committed trie root on subsequent commit.

@algorandskiy algorandskiy changed the title Fix: Clear Merkle Trie on Commit Error ledger: Clear Merkle Trie on Commit Error Aug 11, 2023
@algorandskiy algorandskiy merged commit 4ff2bf3 into algorand:master Aug 11, 2023
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants