[rest_api][aptos_vm] Prevent running move code on too stale of a state #15588

Open
wants to merge 1 commit into
base: main

Conversation

igor-aptos
Contributor

Description

Running code on very stale state (i.e. before the node has been able to state-sync on startup) leads to confusing outcomes, and also exercises paths that should otherwise never happen.

For example, new prologue functions have been introduced and the VM expects them to exist, but the genesis framework doesn't have them.

There are two places that call AptosVM to execute code, /view and /transaction/simulate, and this PR gates both of them.

By default I set 1 day as the limit, which is long enough to not cause any issues if a node temporarily goes out of sync, while short enough to not cross more than one release.

An alternative is to wait for a different signal, like the first state sync completing, but that gets tricky if the node is suspended for extended periods of time.
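
The gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `check_state_freshness`, `StaleStateError`, and `DEFAULT_MAX_STALENESS` are hypothetical, and it assumes the latest ledger timestamp and the current time are both available as microseconds since the Unix epoch.

```rust
use std::time::Duration;

/// Hypothetical error returned when the node's latest state is too old.
#[derive(Debug, PartialEq)]
pub struct StaleStateError {
    pub staleness: Duration,
    pub limit: Duration,
}

/// The default limit mentioned in the description: 1 day.
pub const DEFAULT_MAX_STALENESS: Duration = Duration::from_secs(24 * 60 * 60);

/// Returns Ok(()) if the latest ledger timestamp is within `limit` of the
/// current time. Both timestamps are microseconds since the Unix epoch
/// (an assumption of this sketch).
pub fn check_state_freshness(
    ledger_timestamp_usecs: u64,
    now_usecs: u64,
    limit: Duration,
) -> Result<(), StaleStateError> {
    // saturating_sub: a ledger timestamp slightly ahead of the local clock
    // is treated as fresh instead of underflowing.
    let staleness = Duration::from_micros(now_usecs.saturating_sub(ledger_timestamp_usecs));
    if staleness > limit {
        Err(StaleStateError { staleness, limit })
    } else {
        Ok(())
    }
}
```

A handler for /view or /transaction/simulate would call such a check before invoking the VM and return an error response if it fails.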

How Has This Been Tested?

Key Areas to Review

Type of Change

  • New feature
  • Bug fix
  • Breaking change
  • Performance improvement
  • Refactoring
  • Dependency update
  • Documentation update
  • Tests

Which Components or Systems Does This Change Impact?

  • Validator Node
  • Full Node (API, Indexer, etc.)
  • Move/Aptos Virtual Machine
  • Aptos Framework
  • Aptos CLI/SDK
  • Developer Infrastructure
  • Move Compiler
  • Other (specify)

Checklist

  • I have read and followed the CONTRIBUTING doc
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I identified and added all stakeholders and component owners affected by this change as reviewers
  • I tested both happy and unhappy path of the functionality
  • I have made corresponding changes to the documentation


trunk-io bot commented Dec 12, 2024

⏱️ 16m total CI duration on this PR
Job Cumulative Duration Recent Runs
rust-move-tests 12m 🟩
rust-cargo-deny 2m 🟩
check-dynamic-deps 39s 🟩
general-lints 28s 🟩
semgrep/ci 23s 🟩
file_change_determinator 10s 🟩
permission-check 3s 🟩
permission-check 2s 🟩


@vgao1996
Contributor

By default I set 1 day as the limit - which is long enough to not cause any issues if node temporarily goes out of sync, while short enough to not cross more than one release.

No strong opinion myself, but I wonder how this compares to a wider window, let's say 3 or 6 months, closer to when we could drop replayability guarantees?

@igor-aptos
Contributor Author

@vgao1996 - the replay guarantee is that a transaction should execute the same given the state as defined at that time.

There shouldn't be invariant violations when simulating on stale state, but the results could be useless if there has been a framework upgrade, a feature flag change, a binary rollout that removes a deprecated feature, etc.

So 1 day here is long enough to not cause issues on temporarily stale nodes, and short enough to be shorter than the release cycle.

For our nodes/fullnodes (i.e. https://fullnode.mainnet.aptoslabs.com/), maybe we should be even more aggressive, like 10 minutes, to avoid serving wrong data to users and to pick a different node instead (though load balancers in the API gateway should do that already).
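
One way to support both the 1-day default and a more aggressive limit for public fullnodes is to make the threshold configurable. The sketch below is illustrative only: `ApiStalenessConfig`, `max_state_staleness`, and `aggressive()` are hypothetical names, not fields of the actual node config.

```rust
use std::time::Duration;

/// Hypothetical API config knob; the field name is illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct ApiStalenessConfig {
    /// Maximum age of the latest synced state before VM-backed
    /// endpoints refuse to run.
    pub max_state_staleness: Duration,
}

impl Default for ApiStalenessConfig {
    fn default() -> Self {
        // The 1-day default discussed in this thread.
        Self { max_state_staleness: Duration::from_secs(24 * 60 * 60) }
    }
}

impl ApiStalenessConfig {
    /// The more aggressive 10-minute setting suggested for public fullnodes.
    pub fn aggressive() -> Self {
        Self { max_state_staleness: Duration::from_secs(10 * 60) }
    }

    /// True if state of the given age may still be used to run Move code.
    pub fn allows(&self, staleness: Duration) -> bool {
        staleness <= self.max_state_staleness
    }
}
```

With these values, a node that is one hour behind would still serve simulations under the default limit but would refuse them under the aggressive one.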

@igor-aptos
Contributor Author

@JoshLind , @banool - can I get your reviews here, not on the motivation, but on the implementation, as you've touched the API code the most?

Contributor

@gregnazario gregnazario left a comment


Okay, so basically you can't run something that's more than 24 hours old?

@igor-aptos
Contributor Author

You can't run simulation etc. if the full node's state is more than 24h stale.

Contributor

@JoshLind JoshLind left a comment


Looks solid! Curious to see if we'll run into any edge cases, but I can't really imagine any 🤔

@igor-aptos igor-aptos enabled auto-merge (squash) January 29, 2025 23:47


@igor-aptos igor-aptos force-pushed the igor/prevent_api_running_move_code_on_stale_state branch from 3582826 to 1d9cdab Compare January 30, 2025 00:29

@igor-aptos igor-aptos force-pushed the igor/prevent_api_running_move_code_on_stale_state branch from 1d9cdab to 87dd64c Compare January 31, 2025 16:53


✅ Forge suite compat success on 60f7ca8827f5d64a148c3b163dc4126b0879279b ==> 87dd64cc33076a43b7a0608423a727b3233897f2

Compatibility test results for 60f7ca8827f5d64a148c3b163dc4126b0879279b ==> 87dd64cc33076a43b7a0608423a727b3233897f2 (PR)
1. Check liveness of validators at old version: 60f7ca8827f5d64a148c3b163dc4126b0879279b
compatibility::simple-validator-upgrade::liveness-check : committed: 12388.11 txn/s, latency: 2520.35 ms, (p50: 2600 ms, p70: 2700, p90: 3000 ms, p99: 3600 ms), latency samples: 409240
2. Upgrading first Validator to new version: 87dd64cc33076a43b7a0608423a727b3233897f2
compatibility::simple-validator-upgrade::single-validator-upgrading : committed: 4450.90 txn/s, latency: 6986.25 ms, (p50: 7900 ms, p70: 8400, p90: 8600 ms, p99: 8900 ms), latency samples: 94320
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 4438.32 txn/s, latency: 7678.83 ms, (p50: 8600 ms, p70: 8700, p90: 9000 ms, p99: 9000 ms), latency samples: 156740
3. Upgrading rest of first batch to new version: 87dd64cc33076a43b7a0608423a727b3233897f2
compatibility::simple-validator-upgrade::half-validator-upgrading : committed: 4219.02 txn/s, latency: 7384.65 ms, (p50: 8200 ms, p70: 8700, p90: 9200 ms, p99: 9300 ms), latency samples: 91180
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 4262.30 txn/s, latency: 7983.00 ms, (p50: 9000 ms, p70: 9000, p90: 9300 ms, p99: 9300 ms), latency samples: 149860
4. upgrading second batch to new version: 87dd64cc33076a43b7a0608423a727b3233897f2
compatibility::simple-validator-upgrade::rest-validator-upgrading : committed: 7704.02 txn/s, latency: 3983.94 ms, (p50: 4600 ms, p70: 4800, p90: 5000 ms, p99: 5200 ms), latency samples: 142500
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 4057.09 txn/s, submitted: 4057.26 txn/s, expired: 0.18 txn/s, latency: 4555.14 ms, (p50: 5000 ms, p70: 5000, p90: 5100 ms, p99: 5200 ms), latency samples: 251349
5. check swarm health
Compatibility test for 60f7ca8827f5d64a148c3b163dc4126b0879279b ==> 87dd64cc33076a43b7a0608423a727b3233897f2 passed
Test Ok


✅ Forge suite realistic_env_max_load success on 87dd64cc33076a43b7a0608423a727b3233897f2

two traffics test: inner traffic : committed: 14442.49 txn/s, latency: 2745.06 ms, (p50: 2700 ms, p70: 2700, p90: 3000 ms, p99: 3900 ms), latency samples: 5491400
two traffics test : committed: 99.98 txn/s, latency: 1485.73 ms, (p50: 1400 ms, p70: 1500, p90: 1600 ms, p99: 3000 ms), latency samples: 1780
Latency breakdown for phase 0: ["MempoolToBlockCreation: max: 1.548, avg: 1.399", "ConsensusProposalToOrdered: max: 0.310, avg: 0.295", "ConsensusOrderedToCommit: max: 0.446, avg: 0.418", "ConsensusProposalToCommit: max: 0.738, avg: 0.713"]
Max non-epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.95s no progress at version 17651 (avg 0.20s) [limit 15].
Max epoch-change gap was: 0 rounds at version 0 (avg 0.00) [limit 4], 0.62s no progress at version 2774853 (avg 0.62s) [limit 16].
Test Ok


✅ Forge suite framework_upgrade success on 60f7ca8827f5d64a148c3b163dc4126b0879279b ==> 87dd64cc33076a43b7a0608423a727b3233897f2

Compatibility test results for 60f7ca8827f5d64a148c3b163dc4126b0879279b ==> 87dd64cc33076a43b7a0608423a727b3233897f2 (PR)
Upgrade the nodes to version: 87dd64cc33076a43b7a0608423a727b3233897f2
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1439.41 txn/s, submitted: 1443.65 txn/s, failed submission: 4.24 txn/s, expired: 4.24 txn/s, latency: 2020.81 ms, (p50: 1800 ms, p70: 2100, p90: 3000 ms, p99: 6300 ms), latency samples: 129000
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1123.52 txn/s, submitted: 1127.04 txn/s, failed submission: 3.52 txn/s, expired: 3.52 txn/s, latency: 2500.39 ms, (p50: 1500 ms, p70: 2100, p90: 3200 ms, p99: 13600 ms), latency samples: 102200
5. check swarm health
Compatibility test for 60f7ca8827f5d64a148c3b163dc4126b0879279b ==> 87dd64cc33076a43b7a0608423a727b3233897f2 passed
Upgrade the remaining nodes to version: 87dd64cc33076a43b7a0608423a727b3233897f2
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 1513.41 txn/s, submitted: 1518.18 txn/s, failed submission: 4.78 txn/s, expired: 4.78 txn/s, latency: 2399.63 ms, (p50: 1500 ms, p70: 2100, p90: 3700 ms, p99: 12700 ms), latency samples: 133081
Test Ok
