-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[rest_api][aptos_vm] Prevent running move code on too stale of a state #15588
base: main
Are you sure you want to change the base?
Conversation
⏱️ 16m total CI duration on this PR
|
No strong opinion myself, but I wonder how we compare this to a wider window, lets say 3 or 6 month, closer to when we could drop replayability guarantees? |
@vgao1996 - replay guarantee is that transaction should execute the same with the state as defined at that time. There shouldn't be invariant violations when simulating on stale state, but results could be very much useless if there has been framework upgrade / feature flag change, binary rollout that removes deprecated feature etc. so 1 day here is set as long enough to not cause issues on temporarily stale nodes, and be short enough to be shorter than the release cycle. For our nodes/fullnodes (i.e. https://fullnode.mainnet.aptoslabs.com/), maybe we should be even more aggressive - like 10 minutes, to avoid serving wrong data to users, and pick a different node (though load balances in api gateway should do that already) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so basically you can't run something that's more than 24 hours old?
You can't run simulation etc, if full node's state is more than 24h stale |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks solid! Curious to see if we'll run into any edge cases, but I can't really imagine any 🤔
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
3582826
to
1d9cdab
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Running code on very stale state (i.e. before node was able to state-sync on startup), leads to confusing outcomes, and also excercises paths that should otherwise never happen. For example - new prologue functions have been introduced, and VM expects them to exist, but genesis framework doesn't have them.
1d9cdab
to
87dd64c
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
✅ Forge suite
|
✅ Forge suite
|
✅ Forge suite
|
Description
Running code on very stale state (i.e. before node was able to state-sync on startup), leads to confusing outcomes, and also excercises paths that should otherwise never happen.
For example - new prologue functions have been introduced, and VM expects them to exist, but genesis framework doesn't have them.
There are two places that call AptosVM to execute code - /view and /transaction/simulate, and gate both of them.
By default I set 1 day as the limit - which is long enough to not cause any issues if node temporarily goes out of sync, while short enough to not cross more than one release.
Alternative is to wait for a different signal - like first state sync completed, etc, but then it is tricky if node is suspended for extended periods of time.
How Has This Been Tested?
Key Areas to Review
Type of Change
Which Components or Systems Does This Change Impact?
Checklist