-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CNDB-11070: Limited backport of CASSANDRA-19534 #1393
base: cc-main-migration-release
Are you sure you want to change the base?
Conversation
…g at NTR/async stage dequeuing. Add timeout at boundary of MutationVerbHandler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This should give us a lot of the benefits of the original ticket without all of the painful refactoring that happened.
I think this is a good path forward. Should add some tests.
…ated ClientMetrics public.
… that driver schema queries aren't flaky on slow CI environments
Basic test coverage added, and further ad-hoc testing shows that this interacts well with Stargate query handling (in that Message creation happens early in Stargate handling, so Message creation time is still a reasonable approximation of entry in the system). This is ready for review. Please take a look if interested @JeremiahDJordan @jtgrabowski @sbtourist |
Quality Gate passedIssues Measures |
❌ Build ds-cassandra-pr-gate/PR-1393 rejected by Butler11 new test failure(s) in 7 builds Found 11 new test failures
Found 14 known test failures |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left few questions, please take a look.
@@ -176,6 +180,17 @@ private Response handleRequest(QueryState queryState, long queryStartNanoTime, Q | |||
{ | |||
try | |||
{ | |||
if (isExecutingAsync()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider avoiding this mutable state in Message. The following code could be a part of lambda passed to supplyAsync above.
@@ -43,6 +46,13 @@ private void failed() | |||
|
|||
public void doVerb(Message<Mutation> message) | |||
{ | |||
if (approxTime.now() > message.expiresAtNanos()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should have something simlar for other verbs? Maybe reads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reads handle this much more extensively through Monitorable
implementation (https://issues.apache.org/jira/browse/CASSANDRA-7392). CASSANDRA-19534 added this as a very minimal load-shedding point for expired mutations, but reads already have much more extensive infrastructure to handle this.
protected abstract CompletableFuture<Response> maybeExecuteAsync(QueryState queryState, long queryStartNanoTime, boolean traceRequest); | ||
|
||
public final CompletableFuture<Response> execute(QueryState queryState, long queryStartNanoTime) | ||
{ | ||
long elapsedTimeSinceCreation = elapsedTimeSinceCreation(TimeUnit.NANOSECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So async queue time includes sync queue time? Could you please make the relation clear in the comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will do.
What is the issue
Unbounded queue length at the native transport stage can caused large backlogs of requests that the system processes, even though clients may no longer expect a response.
What does this PR fix and why was it fixed
This PR implements a limited backport of CNDB-11070, introducing the notion of a native request timeout that can shed messages with excessive queue times at the NTR stage as well as async read/write stages, if enabled. Cross-node message timeouts are also now respected earlier in the mutation verb handler.
Checklist before you submit for review
NoSpamLogger
for log lines that may appear frequently in the logs