Catchup should be canceled and started again if the rest of the network is in a new epoch #12321

marcelo-gonzalez opened this issue Oct 25, 2024

Description

When we get to the first block in an epoch and realize that we need to sync state for a shard we don't currently track, we save catchup info on disk, and the client reads that to decide which shards to sync. If this hasn't finished by the time the next epoch comes around, we won't apply the first block of that epoch until it's done. But if we start catchup for epoch T (or are in the middle of it) while the rest of the network moves on to epoch T+1, we'll keep trying to sync state for epoch T even though the nodes serving state parts may have already deleted their snapshots for epoch T in favor of epoch T+1.
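The title describes the fix: if catchup for epoch T is still in progress when the chain head moves into epoch T+1, the stale catchup should be canceled and restarted against the new epoch, since that is the only epoch peers will still have snapshots for. Below is a minimal sketch of that check, with hypothetical type and function names (these are not the actual nearcore APIs):

```rust
// All names here are placeholders for illustration, not real nearcore types.
type EpochId = [u8; 32];
type ShardId = u64;

struct ChainHead {
    epoch_id: EpochId,
}

struct CatchupInfo {
    /// Epoch whose state this catchup is trying to sync.
    epoch_id: EpochId,
    /// Shards we still need to download state for.
    pending_shards: Vec<ShardId>,
}

/// Placeholder: figure out which shards the node must catch up for in `epoch_id`.
fn shards_to_catch_up(_epoch_id: &EpochId) -> Vec<ShardId> {
    Vec::new()
}

/// If the head has moved into a newer epoch than the one we're catching up
/// for, peers have likely dropped their snapshots for the old epoch, so the
/// in-flight catchup can never finish. Cancel it and restart against the
/// head's epoch instead of retrying forever.
fn maybe_restart_catchup(catchup: &mut Option<CatchupInfo>, head: &ChainHead) {
    let stale = matches!(catchup, Some(info) if info.epoch_id != head.epoch_id);
    if stale {
        *catchup = Some(CatchupInfo {
            epoch_id: head.epoch_id,
            pending_shards: shards_to_catch_up(&head.epoch_id),
        });
    }
}

fn main() {
    // Node is still catching up for an old epoch (all zeros) while the
    // network's head is already in a newer epoch (all ones).
    let head = ChainHead { epoch_id: [1; 32] };
    let mut catchup = Some(CatchupInfo {
        epoch_id: [0; 32],
        pending_shards: vec![0, 3],
    });
    maybe_restart_catchup(&mut catchup, &head);
    assert_eq!(catchup.unwrap().epoch_id, [1; 32]);
}
```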

This is conceivable on a real network, but probably not very likely. It's still a bug, though. In tests with a short epoch length, if a node is stopped at epoch T-1 and then restarted when the other nodes are almost finished with epoch T, it will receive the latest block when one of the peers it connects to sends it its head block. Since that block is in a known epoch (T), the node will save it as an orphan and request its prev hash, and that block's prev hash, and so on, until it gets back to the next block it was going to apply before it was stopped.

Then when that node reaches the last block of epoch T, the other nodes are already well into epoch T+1. If the node was going to track a new shard in epoch T+1 (because it will produce chunks, or because it sets tracked_shadow_validator in its config), it will have saved catchup state sync info to the DB when it applied the first block of T. It then can't apply the first block of epoch T+1 because it isn't caught up yet, so it will try to state sync for epoch T. But since all the other nodes are on epoch T+1, nobody has that snapshot anymore.

If centralized state sync is available, this will eventually work when the node falls back to it, but in a world where we get rid of centralized state sync, this will just fail completely.

This has actually been observed in nayduck tests, and is what happened in this nayduck run against a PR: https://nayduck.nearone.org/#/test/146847
