Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: in remap, when since is empty, suspend instead of panic'ing #31226

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 24 additions & 0 deletions src/storage/src/source/reclock/compat.rs
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,30 @@ where
// Allow manually simulating the scenario where the since of the remap
// shard has advanced too far.
fail_point!("invalid_remap_as_of");

if since.is_empty() {
// This can happen when, say, a source is being dropped but we on
// the cluster are busy and notice that only later. In those cases
// it can happen that we still try to render an ingestion that is
// not valid anymore and where the shards it uses are not valid to
// use anymore.
//
// This is a rare race condition and something that is expected to
// happen every now and then. It's not a bug in the current way of
// how things work.
Comment on lines +122 to +130
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storage has the DroppedIds protocol response that I assumed to exist to let the storage controller wait until all replicas have acknowledged the dropping of an object. Is that not true? Or does this race only come up in the context of 0dt upgrades?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DroppedIds only relate to state of the storage controller, read holds are dropped right away when the collection is dropped without waiting. Right now we drop things eagerly so that a struggling, slow, broken replica does not hold up cleaning up those things, finalizing shards ,etc.

tracing::info!(
source_id = %id,
%worker_id,
"since of remap shard is the empty antichain, suspending...");

// Suspending will make it so we don't make progress and downstream
// operators don't think we're making progress. And this dataflow
// will eventually be removed once the news about removal make it
// from the controller to the cluster.
std::future::pending::<()>().await;
unreachable!("pending future never returns");
}

assert!(
PartialOrder::less_equal(since, &as_of),
"invalid as_of: as_of({as_of:?}) < since({since:?}), \
Expand Down