Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CNDB-11070: Limited backport of CASSANDRA-19534 #1393

Open
wants to merge 7 commits into
base: cc-main-migration-release
Choose a base branch
from

Conversation

jkni
Copy link

@jkni jkni commented Oct 29, 2024

What is the issue

Unbounded queue length at the native transport stage can caused large backlogs of requests that the system processes, even though clients may no longer expect a response.

What does this PR fix and why was it fixed

This PR implements a limited backport of CNDB-11070, introducing the notion of a native request timeout that can shed messages with excessive queue times at the NTR stage as well as async read/write stages, if enabled. Cross-node message timeouts are also now respected earlier in the mutation verb handler.

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

…g at NTR/async stage dequeuing. Add timeout at boundary of MutationVerbHandler.
@jkni jkni marked this pull request as ready for review October 29, 2024 21:47
Copy link
Member

@JeremiahDJordan JeremiahDJordan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! This should give us a lot of the benefits of the original ticket without all of the painful refactoring that happened.

I think this is a good path forward. Should add some tests.

@jkni jkni changed the title DRAFT: CNDB-11070: Limited backport of CASSANDRA-19534 CNDB-11070: Limited backport of CASSANDRA-19534 Nov 5, 2024
@jkni
Copy link
Author

jkni commented Nov 5, 2024

Basic test coverage added, and further ad-hoc testing shows that this interacts well with Stargate query handling (in that Message creation happens early in Stargate handling, so Message creation time is still a reasonable approximation of entry in the system). This is ready for review. Please take a look if interested @JeremiahDJordan @jtgrabowski @sbtourist

Copy link

sonarcloud bot commented Nov 5, 2024

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-1393 rejected by Butler


11 new test failure(s) in 7 builds
See build details here


Found 11 new test failures

Test Explanation Branch history Upstream history
t.TestConstants.test_cql_reserved_keywords failing 🔴🔴🔴🔴🔴🔴🔴
...Completion.test_complete_in_select_limit_clause failing 🔴🔴🔴🔴🔴🔴🔴
t.TestCqlshUnicode.test_unicode_desc regression 🔴🔴🔵🔵🔵🔴🔵
t.TestCqlshUnicode.test_unicode_identifier failing 🔴🔴🔴🔴🔴🔴🔴
...nTest.testMultipleCompactionsDifferentWs_Static failing 🔴🔴🔴🔴🔴🔴🔴
...i.s.c.VectorSiftSmallTest.testMultiSegmentBuild failing 🔴🔴🔴🔴🔴🔴🔴
...int,int>>>,wide=false,scenario=COMPACTED_QUERY] flaky 🔵🔴🔵🔵
...t,wide=false,scenario=MEMTABLE_QUERY] flaky 🔵🔴🔵🔵
...titionsTest.testServiceTopPartitionsSingleTable flaky 🔵🔴🔵🔵
o.a.c.t.TransportTest.testAsyncTransport flaky 🔴🔵🔴🔴🔴🔵🔴
o.a.c.u.b.BinLogTest.testTruncationReleasesLogS... flaky 🔵🔴🔵🔴🔴🔵🔵

Found 14 known test failures

Copy link

@jtgrabowski jtgrabowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left few questions, please take a look.

@@ -176,6 +180,17 @@ private Response handleRequest(QueryState queryState, long queryStartNanoTime, Q
{
try
{
if (isExecutingAsync())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider avoiding this mutable state in Message. The following code could be a part of lambda passed to supplyAsync above.

@@ -43,6 +46,13 @@ private void failed()

public void doVerb(Message<Mutation> message)
{
if (approxTime.now() > message.expiresAtNanos())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have something simlar for other verbs? Maybe reads?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reads handle this much more extensively through Monitorable implementation (https://issues.apache.org/jira/browse/CASSANDRA-7392). CASSANDRA-19534 added this as a very minimal load-shedding point for expired mutations, but reads already have much more extensive infrastructure to handle this.

protected abstract CompletableFuture<Response> maybeExecuteAsync(QueryState queryState, long queryStartNanoTime, boolean traceRequest);

public final CompletableFuture<Response> execute(QueryState queryState, long queryStartNanoTime)
{
long elapsedTimeSinceCreation = elapsedTimeSinceCreation(TimeUnit.NANOSECONDS);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So async queue time includes sync queue time? Could you please make the relation clear in the comments?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants