fix(snapshots): store BlockInfo in Resharding(CatchingUp) status #12651

Merged: 7 commits into near:master on Dec 20, 2024

Conversation

@marcelo-gonzalez (Contributor) commented Dec 19, 2024

#12589 made a change required for taking snapshots of child shards that are catching up after resharding: if we want to take a state snapshot of a child shard's flat storage whose status is Resharding(CatchingUp), we set that status to Ready in the snapshot. Doing so requires figuring out the correct BlockInfo from just the block hash, so we read it from the block headers column. But that column is not kept in state snapshots (previously overlooked in test loop, since everything was copied there). So fix it by storing a BlockInfo in the Resharding(CatchingUp) status, so we no longer need to look it up.
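As a rough illustration of the fix (a minimal sketch with simplified stand-in types, not nearcore's actual CryptoHash/BlockInfo/status definitions): carrying the BlockInfo inside the CatchingUp status means the snapshot code can mark the flat storage Ready without touching the block headers column at all.

```rust
// Simplified stand-ins for nearcore types (hypothetical; the real definitions differ).
#[derive(Clone, Debug, PartialEq)]
pub struct CryptoHash(pub [u8; 32]);

#[derive(Clone, Debug, PartialEq)]
pub struct BlockInfo {
    pub hash: CryptoHash,
    pub height: u64,
    pub prev_hash: CryptoHash,
}

// Before the fix, the status carried only the block hash, so turning
// Resharding(CatchingUp) into Ready required resolving the hash through the
// block headers column -- a column that state snapshots do not keep.
#[allow(dead_code)]
pub enum StatusBefore {
    CatchingUp(CryptoHash),
}

// After the fix, the full BlockInfo travels with the status itself.
pub enum StatusAfter {
    CatchingUp(BlockInfo),
}

// Marking the snapshot's flat storage Ready now needs no DB lookup:
pub fn flat_head_for_snapshot(status: &StatusAfter) -> BlockInfo {
    match status {
        StatusAfter::CatchingUp(block_info) => block_info.clone(),
    }
}
```

The trade-off is a slightly larger status value in the DB in exchange for removing a cross-column dependency that snapshots cannot satisfy.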

@marcelo-gonzalez marcelo-gonzalez requested a review from a team as a code owner December 19, 2024 09:58
@marcelo-gonzalez (Contributor, Author)

It could be worth merging a cleaned-up version of this, but a failing test that passes after this PR can be obtained by applying this diff:

diff --git a/chain/chain/src/flat_storage_resharder.rs b/chain/chain/src/flat_storage_resharder.rs
index 83bd481f3..a51751088 100644
--- a/chain/chain/src/flat_storage_resharder.rs
+++ b/chain/chain/src/flat_storage_resharder.rs
@@ -71,6 +71,7 @@ pub struct FlatStorageResharder {
     runtime: Arc<dyn RuntimeAdapter>,
     /// The current active resharding event.
     resharding_event: Arc<Mutex<Option<FlatStorageReshardingEventStatus>>>,
+    resharding_hash: Arc<Mutex<Option<CryptoHash>>>,
     /// Sender responsible to convey requests to the dedicated resharding actor.
     sender: ReshardingSender,
     /// Controls cancellation of background processing.
@@ -102,6 +103,7 @@ impl FlatStorageResharder {
         Self {
             runtime,
             resharding_event,
+            resharding_hash: Arc::new(Mutex::new(None)),
             sender,
             controller,
             resharding_config,
@@ -233,6 +235,7 @@ impl FlatStorageResharder {
     }
 
     fn set_resharding_event(&self, event: FlatStorageReshardingEventStatus) {
+        *self.resharding_hash.lock().unwrap() = Some(event.resharding_hash());
         *self.resharding_event.lock().unwrap() = Some(event);
     }
 
@@ -336,13 +339,13 @@ impl FlatStorageResharder {
             }
         };
 
-        #[cfg(feature = "test_features")]
-        {
-            if self.adv_should_delay_task(&resharding_hash, chain_store) {
-                info!(target: "resharding", "flat storage shard split task has been artificially postponed!");
-                return FlatStorageReshardingTaskResult::Postponed;
-            }
-        }
+        // #[cfg(feature = "test_features")]
+        // {
+        //     if self.adv_should_delay_task(&resharding_hash, chain_store) {
+        //         info!(target: "resharding", "flat storage shard split task has been artificially postponed!");
+        //         return FlatStorageReshardingTaskResult::Postponed;
+        //     }
+        // }
 
         // We know that the resharding block has become final so let's start the real work.
         let (parent_shard, split_params) = self
@@ -609,7 +612,15 @@ impl FlatStorageResharder {
         if self.controller.is_cancelled() {
             return FlatStorageReshardingTaskResult::Cancelled;
         }
-        info!(target: "resharding", ?shard_uid, "flat storage shard catchup task started");
+        #[cfg(feature = "test_features")]
+        {
+            let resharding_hash = self
+            .resharding_hash.lock().unwrap().unwrap();
+            if self.adv_should_delay_task(&resharding_hash, chain_store) {
+                info!(target: "resharding", "flat storage catchup task has been artificially postponed!");
+                return FlatStorageReshardingTaskResult::Postponed;
+            }
+        }
         let metrics = FlatStorageReshardingShardCatchUpMetrics::new(&shard_uid);
         // Apply deltas and then create the flat storage.
         let apply_result = self.shard_catchup_apply_deltas(shard_uid, chain_store, &metrics);

on top of the first "drop non-kept snapshot columns in tests" commit in this PR, and then running test_loop::tests::resharding_v3::test_resharding_v3_shard_shuffling_slower_post_processing_tasks. We need that hack for now to make the test hit the right order of events.

codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 82.50% with 14 lines in your changes missing coverage. Please review.

Project coverage is 70.49%. Comparing base (e692d20) to head (cfea0e8).
Report is 5 commits behind head on master.

Files with missing lines                                Patch %   Lines
chain/chain/src/flat_storage_resharder.rs               85.36%    3 missing, 3 partials ⚠️
...chain/src/stateless_validation/chunk_validation.rs   66.66%    0 missing, 4 partials ⚠️
core/store/src/flat/manager.rs                           0.00%    2 missing ⚠️
chain/chain/src/resharding/manager.rs                   83.33%    0 missing, 1 partial ⚠️
core/store/src/db/testdb.rs                             83.33%    1 missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #12651   +/-   ##
=======================================
  Coverage   70.49%   70.49%           
=======================================
  Files         845      845           
  Lines      172247   172265   +18     
  Branches   172247   172265   +18     
=======================================
+ Hits       121426   121439   +13     
- Misses      45725    45726    +1     
- Partials     5096     5100    +4     
Flag Coverage Δ
backward-compatibility 0.16% <0.00%> (-0.01%) ⬇️
db-migration 0.16% <0.00%> (-0.01%) ⬇️
genesis-check 1.36% <0.00%> (-0.01%) ⬇️
linux 69.34% <70.00%> (-0.02%) ⬇️
linux-nightly 70.10% <82.50%> (+0.01%) ⬆️
pytests 1.66% <0.00%> (-0.01%) ⬇️
sanity-checks 1.47% <0.00%> (-0.01%) ⬇️
unittests 70.32% <82.50%> (+<0.01%) ⬆️
upgradability 0.20% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.


@Longarithm (Member) left a comment


Please consider putting BlockInfo into the CatchingUp status. It is probably easier dependency-wise and implementation-wise.

@marcelo-gonzalez (Contributor, Author)

Please consider putting BlockInfo into the CatchingUp status. It is probably easier dependency-wise and implementation-wise.

Yeah, that's probably better. I think when I implemented it the first time I just forgot that it's fine to change the DB structures, since there hasn't been a release in between. PTAL

@marcelo-gonzalez marcelo-gonzalez changed the title fix(snapshots): read block headers from the main DB fix(snapshots): store BlockInfo in Resharding(CatchingUp) status Dec 20, 2024
@Longarithm Longarithm added this pull request to the merge queue Dec 20, 2024
Merged via the queue into near:master with commit f06f798 Dec 20, 2024
28 checks passed