fix(resharding): fix Chain::get_shards_to_state_sync() #12617
base: master
Conversation
This function was left unimplemented for the resharding v3 case, so here we implement it. For each shard ID, if there hasn't been a resharding, we say that we want to state sync it if we'll be tracking it next epoch and don't currently track it. If there has been a resharding, we state sync it if the previous condition is true and we aren't currently generating the shard itself as part of resharding.

Since this shards-to-state-sync calculation is made when we preprocess the first block of an epoch, there are 3 relevant epochs here: prev_epoch, current_epoch, next_epoch. When we process the last block of prev_epoch, start_resharding() will be called and will initiate the child shard splitting if we were tracking the parent of the split shards during prev_epoch (this is true even if we don't track either child in current_epoch). So if we aren't tracking a child shard during current_epoch, we still don't want to state sync it if we were tracking the parent previously.
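As a rough sketch of that rule (all names here are illustrative placeholders, not the actual nearcore API):

// A minimal sketch of the decision just described; placeholder names only.
fn wants_state_sync(
    tracks_shard_now: bool,
    will_track_shard_next_epoch: bool,
    resharded: bool,
    tracked_parent_in_prev_epoch: bool,
) -> bool {
    // Base condition: we'll track the shard next epoch but don't track it now.
    let base = will_track_shard_next_epoch && !tracks_shard_now;
    if !resharded {
        return base;
    }
    // After a resharding, skip state sync if we tracked the parent in
    // prev_epoch: start_resharding() already began generating the children.
    base && !tracked_parent_in_prev_epoch
}

fn main() {
    // The tricky case from the description: not tracking the child now, will
    // track it next epoch, but we tracked the parent in prev_epoch.
    assert!(!wants_state_sync(false, true, true, true));
}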
A few things:
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff            @@
##           master   #12617     +/-   ##
=========================================
  Coverage   70.48%   70.48%
=========================================
  Files         845      845
  Lines      172251   172270     +19
  Branches   172251   172270     +19
=========================================
+ Hits       121411   121428     +17
+ Misses      45723    45720      -3
- Partials     5117     5122      +5

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Requesting changes because I think it can be simplified, let me know if I'm missing something.
@@ -793,6 +793,7 @@ impl Chain {
     fn get_state_sync_info(
         &self,
         me: &Option<AccountId>,
+        prev_prev_hash: &CryptoHash,
mini nit: I slightly dislike mixing parameter types (hash & block); would it make any sense to pass hash, prev_hash and prev_prev_hash as separate arguments?
Looking below, the other signature is:
header: &BlockHeader,
prev_hash: CryptoHash,
prev_prev_hash: CryptoHash,
And maybe specify in the comment that header is the epoch's first block.
ya agreed, just changed them all to hashes and epoch IDs
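For illustration, the reworked signature might look something like this (a sketch only; the parameter names and import paths are assumptions, not the PR's actual code):

use near_primitives::hash::CryptoHash;
use near_primitives::types::{AccountId, EpochId};

// Hypothetical shape after switching everything to hashes and epoch IDs.
fn get_state_sync_info(
    me: &Option<AccountId>,
    epoch_id: &EpochId,        // epoch of the first block (assumed)
    block_hash: &CryptoHash,   // the epoch's first block (assumed)
    prev_hash: &CryptoHash,
    prev_prev_hash: &CryptoHash,
) { /* ... */ }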
@@ -2424,56 +2426,69 @@ impl Chain {
         shard_tracker: &ShardTracker,
         me: &Option<AccountId>,
         parent_hash: &CryptoHash,
<unrelated rant>
Why do we have two names - prev and parent - for the same thing? 😢 </unrelated rant>
let will_care_about_shard =
    shard_tracker.will_care_about_shard(me.as_ref(), parent_hash, shard_id, true);
let does_care_about_shard =
    shard_tracker.care_about_shard(me.as_ref(), parent_hash, shard_id, true);
) -> Result<bool, Error> { |
I agree that the old condition is no longer relevant, but the logic here seems too complicated.
In V2 we needed to get a second copy of the parent, even if the node tracks it, to perform the preprocessing:
will_care_about_shard && (will_shard_layout_change || !does_care_about_shard)
In V3 this should not be needed, so it feels like the following should be sufficient:
will_care_about_shard && !does_care_about_shard
Please keep in mind that will_care_about_shard is resharding aware. We should aim to keep the resharding specific part of this logic in one place if possible.
In V3 this should not be needed so it feels like the following should be sufficient:
will_care_about_shard && !does_care_about_shard
So, relative to the current PR, something like:
diff --git a/chain/chain/src/chain.rs b/chain/chain/src/chain.rs
index 931f711f8..ac6b3d650 100644
--- a/chain/chain/src/chain.rs
+++ b/chain/chain/src/chain.rs
@@ -2465,32 +2465,7 @@ impl Chain {
         if shard_tracker.care_about_shard(me.as_ref(), parent_hash, shard_id, true) {
             return Ok(false);
         }
-        // Now we need to state sync it unless this is a post-resharding child shard whose parent we were tracking in the
-        // previous epoch, in which case we don't need to state sync because we'll generate the child when we do the resharding
-
-        let shard_layout = epoch_manager.get_shard_layout(epoch_id)?;
-        let prev_epoch_id = epoch_manager.get_prev_epoch_id_from_prev_block(parent_hash)?;
-        let prev_shard_layout = epoch_manager.get_shard_layout(&prev_epoch_id)?;
-
-        let resharded = shard_layout != prev_shard_layout;
-        if !resharded {
-            return Ok(true);
-        }
-        let Some(parent_shard_id) = shard_layout.try_get_parent_shard_id(shard_id)? else {
-            return Ok(true);
-        };
-        let was_split = parent_shard_id != shard_id;
-        if !was_split {
-            return Ok(true);
-        }
-
-        // Note that here passing `prev_prev_hash` to care_about_shard() will have us check whether we were tracking it in
-        // the previous epoch, because the current block is the first block of an epoch, so prev_prev_hash is the "parent_hash"
-        // of the last block of the previous epoch. TODO: consider refactoring these ShardTracker functions to accept an epoch_id
-        // to make this less tricky.
-        let splitting_child =
-            shard_tracker.care_about_shard(me.as_ref(), prev_prev_hash, parent_shard_id, true);
-        Ok(!splitting_child)
+        Ok(true)
     }

     /// Check if any block with missing chunk is ready to be processed and start processing these blocks
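Applied, that hunk would leave the per-shard check as simple as this (a reconstruction; the helper name should_state_sync_shard, the exact signature, and the import paths are assumptions, while the two ShardTracker calls are the ones quoted above):

use near_chain_primitives::Error;
use near_epoch_manager::shard_tracker::ShardTracker;
use near_primitives::hash::CryptoHash;
use near_primitives::types::{AccountId, ShardId};

// Sync a shard iff we will track it next epoch and don't track it now,
// with no extra resharding-specific cases.
fn should_state_sync_shard(
    shard_tracker: &ShardTracker,
    me: &Option<AccountId>,
    parent_hash: &CryptoHash,
    shard_id: ShardId,
) -> Result<bool, Error> {
    if !shard_tracker.will_care_about_shard(me.as_ref(), parent_hash, shard_id, true) {
        return Ok(false);
    }
    if shard_tracker.care_about_shard(me.as_ref(), parent_hash, shard_id, true) {
        return Ok(false);
    }
    Ok(true)
}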
Now that the other PR is merged, we can try the modification to test_resharding_v3_shard_shuffling_slower_post_processing_tasks that I mentioned caused a crash:
diff --git a/integration-tests/src/test_loop/tests/resharding_v3.rs b/integration-tests/src/test_loop/tests/resharding_v3.rs
index c04d89ba6..457db7723 100644
--- a/integration-tests/src/test_loop/tests/resharding_v3.rs
+++ b/integration-tests/src/test_loop/tests/resharding_v3.rs
@@ -985,6 +985,7 @@ fn test_resharding_v3_shard_shuffling_slower_post_processing_tasks() {
         .track_all_shards(false)
         .all_chunks_expected(false)
         .delay_flat_state_resharding(2)
+        .epoch_length(10)
         .build();
     test_resharding_v3_base(params);
 }
Now unfortunately this passes because something about the order of the test loop task processing for state sync and resharding catchup has changed between when I wrote that and the current head of master. Without trying to figure it out, we can just hack it to run in the same order as before:
diff --git a/chain/chain/src/flat_storage_resharder.rs b/chain/chain/src/flat_storage_resharder.rs
index 83bd481f3..a401c51ce 100644
--- a/chain/chain/src/flat_storage_resharder.rs
+++ b/chain/chain/src/flat_storage_resharder.rs
@@ -609,6 +609,16 @@ impl FlatStorageResharder {
         if self.controller.is_cancelled() {
             return FlatStorageReshardingTaskResult::Cancelled;
         }
+        let head = chain_store.head().unwrap();
+        let Some(sync_hash) = chain_store.get_current_epoch_sync_hash(&head.epoch_id).unwrap() else {
+            return FlatStorageReshardingTaskResult::Postponed;
+        };
+        let key = borsh::to_vec(&near_primitives::state_sync::StatePartKey(sync_hash, shard_uid.shard_id(), 0)).unwrap();
+        let part = chain_store.store().get(near_store::DBCol::StateParts, &key).unwrap();
+        if part.is_none() {
+            return FlatStorageReshardingTaskResult::Postponed;
+        }
+
         info!(target: "resharding", ?shard_uid, "flat storage shard catchup task started");
         let metrics = FlatStorageReshardingShardCatchUpMetrics::new(&shard_uid);
         // Apply deltas and then create the flat storage.
Then (with the above epoch length change), run it and you get:
thread 'test_loop::tests::resharding_v3::test_resharding_v3_shard_shuffling_slower_post_processing_tasks' panicked at chain/chain/src/resharding/resharding_actor.rs:93:17:
impossible to recover from a flat storage shard catchup failure!
with this error message at the end of the logs:
5.661s ERROR resharding: flat storage shard catchup delta application failed! shard_uid=s7.v3 err=Other("unexpected resharding catchup flat storage status for s7.v3: Ready(FlatStorageReadyStatus { flat_head: BlockInfo { hash: Gposh9D271WMUy3wt4Re1pXtFViNEdiMEgWyjm8eoYHH, height: 24, prev_hash: 7d7uMfUA8dXHjGcixM88gVVnt7jaak8N5y7gfv7C58af } })")
What's happening here is the problem with account6 that I mentioned in the other comment. It's not tracking the child in the current epoch, will track it in the next, but is performing the split because it tracked the parent in the previous epoch.
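Spelled out (with values assumed from that description; illustrative only, not nearcore code):

fn main() {
    // account6's schedule across the three relevant epochs:
    let tracked_parent_in_prev_epoch = true; // so it performs the split itself
    let tracks_child_now = false;            // current_epoch
    let will_track_child_next_epoch = true;  // next_epoch

    // The simplified condition would request state sync for the child...
    let wants_state_sync = will_track_child_next_epoch && !tracks_child_now;
    // ...while the resharding catchup path is generating the same child, and
    // the two collide in the flat storage error shown above.
    assert!(wants_state_sync && tracked_parent_in_prev_epoch);
}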
Please keep in mind that will_care_about_shard is resharding aware. We should aim to keep the resharding specific part of this logic in one place if possible.
Yeah it's a bit messy... maybe we should split something out of this?
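For instance, the resharding-specific part removed in the diff above could live in its own helper, roughly like this (a sketch reconstructed from those removed lines; the helper name, signature, and import paths are assumptions):

use near_chain_primitives::Error;
use near_epoch_manager::shard_tracker::ShardTracker;
use near_epoch_manager::EpochManagerAdapter;
use near_primitives::hash::CryptoHash;
use near_primitives::types::{AccountId, EpochId, ShardId};

// Returns true if this node is already generating `shard_id` locally as a
// resharding child, in which case state sync would be redundant.
fn is_generating_shard_via_resharding(
    epoch_manager: &dyn EpochManagerAdapter,
    shard_tracker: &ShardTracker,
    me: &Option<AccountId>,
    parent_hash: &CryptoHash,
    prev_prev_hash: &CryptoHash,
    epoch_id: &EpochId,
    shard_id: ShardId,
) -> Result<bool, Error> {
    let shard_layout = epoch_manager.get_shard_layout(epoch_id)?;
    let prev_epoch_id = epoch_manager.get_prev_epoch_id_from_prev_block(parent_hash)?;
    let prev_shard_layout = epoch_manager.get_shard_layout(&prev_epoch_id)?;
    if shard_layout == prev_shard_layout {
        return Ok(false); // no resharding between the epochs
    }
    let Some(parent_shard_id) = shard_layout.try_get_parent_shard_id(shard_id)? else {
        return Ok(false); // shard has no parent in the new layout
    };
    if parent_shard_id == shard_id {
        return Ok(false); // shard was not split
    }
    // We generate the child iff we tracked the parent in the previous epoch;
    // prev_prev_hash is the "parent hash" of the last block of that epoch.
    Ok(shard_tracker.care_about_shard(me.as_ref(), prev_prev_hash, parent_shard_id, true))
}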
Just for clarity, I think you're talking about the following scenario:
That is an interesting case to think about, and the same applies even without resharding, actually. The two options we have are:
What is your intuition about how those compare? Approach 2 is what you've implemented here. What would it take to make 1) work? Are there any drawbacks to this approach that I overlooked? Which approach would have simpler and cleaner code?