storage: in remap, when since is empty, suspend instead of panic'ing #31226
base: main
Conversation
What would be a good way to test this? Should I try just creating sources, using them, and then deleting them?
The case we observed in the incident is the source being dropped right after creating it; I think that's the most promising path. It also helps when the cluster is really busy, so it doesn't get around to rendering the source in time but only after the source has already been dropped.
As the comment describes, this is a race condition that is expected to happen and it's better to suspend rather than bring down the whole cluster, which causes pain for customers/the oncall.
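For readers following along, here is a minimal, self-contained sketch of the behavior change the PR title describes, with illustrative names only (this is not the actual Materialize code): when the remap shard's `since` frontier is already empty, the source has been dropped, so the dataflow should suspend rather than panic.

```rust
/// Stand-in for a frontier of timestamps. An empty frontier here models the
/// remap shard's `since` having advanced to the empty antichain, i.e. the
/// shard was finalized because the source has already been dropped.
type Frontier = Vec<u64>;

/// Hypothetical decision type: either render with a valid as-of, or suspend.
enum RenderDecision {
    Render { as_of: Frontier },
    Suspend,
}

/// Before the change, the empty-`since` case hit a panic and took down the
/// whole cluster process; with the change, only the dataflow of the
/// already-dropped source is suspended.
fn decide(since: &Frontier) -> RenderDecision {
    if since.is_empty() {
        RenderDecision::Suspend
    } else {
        RenderDecision::Render { as_of: since.clone() }
    }
}

fn main() {
    // Source dropped right after creation: the remap shard is finalized and
    // its `since` is empty, so we suspend instead of panicking.
    match decide(&Vec::new()) {
        RenderDecision::Suspend => println!("suspending dataflow"),
        RenderDecision::Render { as_of } => println!("rendering as of {as_of:?}"),
    }
}
```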
Force-pushed from df33a47 to 9442ff2
I have a simple testdrive test with which I'm trying to reproduce this, but I ran into another panic instead:
I'll experiment around a bit with making it reproduce more reliably and open a separate PR.
Inspired by MaterializeInc#31226, but found another panic instead:

thread 'timely:work-0' panicked at src/storage/src/render/sources.rs:223:18:
resuming an already finished ingestion
 5: core::panicking::panic_fmt
 6: core::option::expect_failed
 7: mz_storage::render::sources::render_source::<timely::dataflow::scopes::child::Child<timely::worker::Worker<timely_communication::allocator::generic::Generic>, ()>, mz_storage_types::sources::kafka::KafkaSourceConnection>
 8: mz_storage::render::build_ingestion_dataflow::<timely_communication::allocator::generic::Generic>::{closure#0}::{closure#0}
 9: <timely::worker::Worker<timely_communication::allocator::generic::Generic>>::dataflow_core::<(), (), mz_storage::render::build_ingestion_dataflow<timely_communication::allocator::generic::Generic>::{closure#0}, alloc::boxed::Box<()>>
10: mz_storage::render::build_ingestion_dataflow::<timely_communication::allocator::generic::Generic>
11: <mz_storage::storage_state::Worker<timely_communication::allocator::generic::Generic>>::run_client
12: <mz_storage::server::Config as mz_cluster::types::AsRunnableWorker<mz_storage_client::client::StorageCommand, mz_storage_client::client::StorageResponse>>::build_and_run::<timely_communication::allocator::generic::Generic>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
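For context on the frames above: the `core::option::expect_failed` frame indicates the panic comes from an `Option::expect` call inside `render_source`. A tiny illustration of how that kind of panic message is produced (the variable name here is a made-up placeholder, not the Materialize code):

```rust
fn main() {
    // `Option::expect` panics with the provided message when the value is
    // `None`; that is what the `core::option::expect_failed` frame in the
    // backtrace above corresponds to.
    let resume_upper: Option<u64> = None;
    let _ = resume_upper.expect("resuming an already finished ingestion");
}
```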
@def- This fixes the panic of incident-360
Motivation
Tips for reviewer
Checklist
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.