Catchup should be canceled and started again if the rest of the network is in a new epoch #12321

marcelo-gonzalez opened this issue Oct 25, 2024

Description

When we get to the first block in an epoch and realize that we need to sync state for a shard we don't currently track, we save catchup info on disk, and the client reads that to decide which shards to sync. If this hasn't finished by the time the next epoch comes around, we won't apply the first block of that epoch until it's done. But if we start catchup for epoch T (or are in the middle of it) while the rest of the network moves on to epoch T+1, we'll keep trying to sync state for epoch T even though the nodes serving state parts may have already deleted their snapshots for epoch T in favor of epoch T+1.
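The title describes the fix: if catchup for epoch T is still in progress when the chain head moves into epoch T+1, the stale catchup should be canceled and restarted against the new epoch, since that is the only epoch peers will still have snapshots for. Below is a minimal sketch of that check, with hypothetical type and function names (these are not the actual nearcore APIs):

```rust
// All names here are placeholders for illustration, not real nearcore types.
type EpochId = [u8; 32];
type ShardId = u64;

struct ChainHead {
    epoch_id: EpochId,
}

struct CatchupInfo {
    /// Epoch whose state this catchup is trying to sync.
    epoch_id: EpochId,
    /// Shards we still need to download state for.
    pending_shards: Vec<ShardId>,
}

/// Placeholder: figure out which shards the node must catch up for in `epoch_id`.
fn shards_to_catch_up(_epoch_id: &EpochId) -> Vec<ShardId> {
    Vec::new()
}

/// If the head has moved into a newer epoch than the one we're catching up
/// for, peers have likely dropped their snapshots for the old epoch, so the
/// in-flight catchup can never finish. Cancel it and restart against the
/// head's epoch instead of retrying forever.
fn maybe_restart_catchup(catchup: &mut Option<CatchupInfo>, head: &ChainHead) {
    let stale = matches!(catchup, Some(info) if info.epoch_id != head.epoch_id);
    if stale {
        *catchup = Some(CatchupInfo {
            epoch_id: head.epoch_id,
            pending_shards: shards_to_catch_up(&head.epoch_id),
        });
    }
}

fn main() {
    // Node is still catching up for an old epoch (all zeros) while the
    // network's head is already in a newer epoch (all ones).
    let head = ChainHead { epoch_id: [1; 32] };
    let mut catchup = Some(CatchupInfo {
        epoch_id: [0; 32],
        pending_shards: vec![0, 3],
    });
    maybe_restart_catchup(&mut catchup, &head);
    assert_eq!(catchup.unwrap().epoch_id, [1; 32]);
}
```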

This is conceivable on a real network, but probably not very likely. It's still a bug, though. In tests with a short epoch length, if a node is stopped at epoch T-1 and then restarted when the other nodes are almost finished with epoch T, it will receive the latest block when one of the peers it connects to sends it its head block. Since that block is in a known epoch (T), the node will save it as an orphan and request its prev hash, and that block's prev hash, and so on, until it gets back to the next block it was going to apply before it was stopped.

Then when that node reaches the last block of epoch T, the other nodes are already well into epoch T+1. If the node was going to track a new shard in epoch T+1 (because it will produce chunks, or because it sets tracked_shadow_validator in its config), it will have saved catchup state sync info to the DB when it applied the first block of T. It then can't apply the first block of epoch T+1 because it isn't caught up yet, so it will try to state sync for epoch T. But since all the other nodes are on epoch T+1, nobody has that snapshot anymore.

If centralized state sync is available, this will eventually work when the node falls back to it, but in a world where we get rid of centralized state sync, this will just fail completely.

This has actually been observed in nayduck tests, and is what happened in this nayduck run against a PR: https://nayduck.nearone.org/#/test/146847
