CNDB-11655: Limit the number of clauses before optimizing the Plan #1409

Open
wants to merge 1 commit into
base: main


@pkolaczk pkolaczk commented Nov 7, 2024

What is the issue

Plan#optimize can take a very long time when given plans with thousands of intersected clauses, which can result from using ngram analyzers. Related issue: https://github.com/riptano/cndb/issues/10731.

What does this PR fix and why was it fixed

Fixes https://github.com/riptano/cndb/issues/11655

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

@eolivelli eolivelli left a comment

What about adding some unit tests to cover this change?


pkolaczk commented Nov 7, 2024

That would be quite tricky, as the only thing this changes is the time it takes to run the code.
It is a performance issue, not a correctness one.
I could create a very large plan and add a timeout to fail the test, but that's a bit risky. It may cause flakiness in CI, where performance is unpredictable.

@eolivelli

Are you saying that the resulting plan is probably the same?

Side question:
Do we have tests for planning n-gram queries and ANN search? Is it worth adding them as follow-up work?

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-1409 rejected by Butler


2 new test failure(s) in 2 builds

Found 2 new test failures

| Test | Explanation | Branch history | Upstream history |
| --- | --- | --- | --- |
| o.a.c.i.s.d.v.VectorCompressionTest.testAda002 | regression | 🔴🔵 | 🔵🔵🔵🔵🔵🔵🔵 |
| ...i.s.d.v.VectorCompressionTest.testOpenAiV3Large | regression | 🔴🔵 | 🔵🔵🔵🔵🔵🔵🔵 |

Found 6 known test failures


pkolaczk commented Nov 7, 2024

Yes, the resulting plan is likely the same or very similar. Generally, in either case the intersection clause limit is applied and at most INTERSECTION_CLAUSE_LIMIT of the best (most selective) indexes are chosen. The difference is that we no longer estimate the costs of all queries with more than INTERSECTION_CLAUSE_LIMIT clauses, which saves a lot of unnecessary work. This path is also very well covered by many existing tests: test coverage is 100% for the changed code.
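
As a rough illustration of the idea (not the actual Plan code), here is a minimal Java sketch that trims a clause list to the most selective entries before any expensive cost estimation would run. It assumes each clause carries a cheap selectivity estimate; the `Clause` class, the `limitClauses` helper, and the limit value of 3 are hypothetical names and values chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ClauseLimitSketch
{
    // Hypothetical value for illustration; the real limit is defined in the project.
    static final int INTERSECTION_CLAUSE_LIMIT = 3;

    // Hypothetical stand-in for an index clause. The selectivity is the estimated
    // fraction of rows matched, so a lower value means a more selective clause.
    static final class Clause
    {
        final String name;
        final double selectivity;

        Clause(String name, double selectivity)
        {
            this.name = name;
            this.selectivity = selectivity;
        }
    }

    // Keeps at most `limit` clauses, preferring the most selective ones.
    // Costs O(n log n) in the number of clauses and is meant to run before
    // any per-plan cost estimation.
    static List<Clause> limitClauses(List<Clause> clauses, int limit)
    {
        if (clauses.size() <= limit)
            return clauses;

        List<Clause> sorted = new ArrayList<>(clauses);
        sorted.sort(Comparator.comparingDouble((Clause c) -> c.selectivity));
        return new ArrayList<>(sorted.subList(0, limit));
    }
}
```

With thousands of ngram clauses as input, only the retained clauses would then be passed on to the optimizer, so the expensive step sees a bounded input regardless of query size.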

As for adding unit tests, we do have LuceneAnalyzerTest, which covers ngrams but not ngrams and vector search in the same query. I can add a test.
