Fix Duplicated filters within (filter(TableScan)) plan for unparser #13422
* Eliminate duplicated filter within (filter(TableScan)) plan * Updates * fix * add test * fix
@@ -318,7 +318,9 @@ pub(crate) fn try_transform_to_simple_table_scan_with_filters(
plan_stack.push(alias.input.as_ref());
}
LogicalPlan::Filter(filter) => {
filters.push(filter.predicate.clone());
if !filters.contains(&filter.predicate) {
Can we change them to HashSet, including table_scan_filters? contains is an O(n) operation.
Yeah, my previous implementation actually used a HashSet. However, it doesn't preserve the filter order, so I didn't end up using it.
Another thing I could do is maintain a temporary HashSet to keep track of the existing filters, while the result is still constructed as a Vec. Each filter would be checked against the HashSet before being pushed to the Vec. Would this approach be better than just using Vec::contains?
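A minimal sketch of that HashSet-plus-Vec approach, not the code from this PR. It uses plain strings as stand-ins for the filter predicates and assumes the real predicate type implements Eq and Hash. The helper name dedup_preserving_order is hypothetical.

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Hypothetical helper: keep the first occurrence of each item, in input order.
/// The HashSet is only used for membership checks; the Vec holds the result.
fn dedup_preserving_order<T: Clone + Eq + Hash>(items: &[T]) -> Vec<T> {
    let mut seen: HashSet<T> = HashSet::new();
    let mut out: Vec<T> = Vec::new();
    for item in items {
        // `HashSet::insert` returns false if the value was already present,
        // so the check and the bookkeeping happen in one average O(1) step.
        if seen.insert(item.clone()) {
            out.push(item.clone());
        }
    }
    out
}

fn main() {
    // Strings standing in for filter predicates (an assumption for illustration).
    let filters = [
        "orders.o_comment NOT LIKE '%special%requests%'",
        "orders.o_orderkey > 10",
        "orders.o_comment NOT LIKE '%special%requests%'",
    ];
    let unique = dedup_preserving_order(&filters);
    println!("{:?}", unique);
}
```

The trade-off is one extra clone per distinct filter in exchange for avoiding the linear scan of Vec::contains. For the handful of filters in a typical plan the difference is likely small.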
How about IndexSet?
IndexSet would be better so that the order of the filters is preserved
Thank you @Sevenannn -- this is a great idea. Thank you @jayzhan211 for the review
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look.
Which issue does this PR close?
N/A
Rationale for this change
Duplicated filters appear when rewriting plans that have aggregates whose lhs / rhs contains both a Filter and a TableScan carrying the same filter.
For query
The logical plan is
The rewritten query will be:
SELECT customer.c_custkey, count(orders.o_orderkey) FROM customer LEFT JOIN orders ON ((customer.c_custkey = orders.o_custkey) AND (orders.o_comment NOT LIKE '%special%requests%' AND orders.o_comment NOT LIKE '%special%requests%')) GROUP BY customer.c_custkey
Under the current approach, the filter
orders.o_comment NOT LIKE Utf8("%special%requests%")
will occur twice in the final query. This does not affect the correctness of the query result, but it adds performance overhead by including duplicated conditions.
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
No