Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add config to reject Iceberg queries without partition filter #17263

Merged
merged 1 commit into from
Oct 13, 2023

Conversation

rice668
Copy link
Contributor

@rice668 rice668 commented Apr 27, 2023

Description

For iceberg partitioned tables, we should reject the table scan produced by the planner when the query does not have partition field.

Additional context and related issues

Fixes #17239

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Iceberg
* Add an option to require all scans on partitioned tables to have a filter on a partitioned column. This can be enabled by setting the 
  ``iceberg.query-partition-filter-required`` catalog config property or the ``query_partition_filter_required`` catalog session property to ``true``.

@cla-bot cla-bot bot added the cla-signed label Apr 27, 2023
@github-actions github-actions bot added the iceberg Iceberg connector label Apr 27, 2023
@rice668 rice668 force-pushed the issue-17239 branch 2 times, most recently from 6391560 to 8088b27 Compare May 4, 2023 05:44
@rice668
Copy link
Contributor Author

rice668 commented May 4, 2023

Thank you for @Praveen2112 @marton-bod review, I have made some changes to the code, please help review again when you have time.

@rice668
Copy link
Contributor Author

rice668 commented Jun 30, 2023

@marton-bod Thanks! I have updated the code, Could you please take a look again ?

@rice668 rice668 force-pushed the issue-17239 branch 2 times, most recently from f2e0ada to 4de5bb4 Compare July 3, 2023 03:30
return;
}
if (isQueryPartitionFilterRequired(session)
&& table.getEnforcedPredicate().equals(TupleDomain.all())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zhangminglei Can you explain why this
table.getEnforcedPredicate().equals(TupleDomain.all()) is needed? Doesn't this mean that we only enter the block if the table has no enforced predicates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right! @marton-bod , If a table has no enforced predicates, like select * from T where udf(partition_field) = someValue then enter the block.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for the difference, Marton, is that some filters show up as TupleDomains and some show up as Predicates depending on how they can be represented.

WHERE x = 42 for example, is a tuple domain and will show up in the enforced constraint
WHERE x % 2 = 0 will be a Predicate because it cannot be represented as a discrete set or range

but if x is a partition column, either should be fine for this feature

Copy link
Contributor

@marton-bod marton-bod Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! Just curious to hear what would happen in the following example:

CREATE TABLE tbl (a int, b int) partitioning = ARRAY['a'] AS SELECT * FROM ...;
SELECT * FROM tbl WHERE b % 2 = 0;

Would we fail the read query because it has no predicate defined for the partition column a? Or would we skip the check becase there is an enforced predicate for b

@marton-bod
Copy link
Contributor

@findepi @alexjo2144 @findinpath Can you please take a look at this PR so it can make some progress? It generally looks good to me, but there are some places where your expertise and iceberg connector knowledge would be needed. Thanks a lot!

@@ -69,6 +72,286 @@ public void tearDown()
deleteRecursively(metastoreDir.toPath(), ALLOW_INSECURE);
}

@Test
public void testLackOfPartitionFilterNotAllowed()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave one typical test case here in the smoke test and move the rest to BaseIcebergConnectorTest

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. Yes, smoke tests are designed to be quick and simple

@alexjo2144
Copy link
Member

I saw in the Hive connector version of this we have a set of schemas where the filters are required. Any thoughts on if we should have that configurable here as well?

@marton-bod
Copy link
Contributor

Thanks so much for your review @alexjo2144 !

Any thoughts on if we should have that configurable here as well?

I also suggested to implement that flag as well. Whether we do it in this PR or a follow-up I'm OK with both. cc @zhangminglei

@marton-bod
Copy link
Contributor

@zhangminglei just a friendly bump to see if you could take a look at @alexjo2144's comments. Our team would love to move this feature forward to completion

@marton-bod marton-bod force-pushed the issue-17239 branch 2 times, most recently from 1907b6f to 5fd7f03 Compare August 31, 2023 15:50
@marton-bod marton-bod requested a review from ebyhr September 3, 2023 21:47
@marton-bod
Copy link
Contributor

Hi @ebyhr , I think all the PR comments have been addressed now - can you please check if anything else is required, and if not, merge this in? Thanks a lot!

@marton-bod
Copy link
Contributor

@ebyhr I've rebased this PR to the latest master, so it's again ready for review. Thanks!

@marton-bod
Copy link
Contributor

@ebyhr Can you please take a final look? I appreciate that all maintainers are probably busy, but it's been waiting for review/merge for a while now. Thanks a lot!

@rice668
Copy link
Contributor Author

rice668 commented Sep 26, 2023

@ebyhr Could you please take a look again ? Thank you very much.

@marton-bod
Copy link
Contributor

Hi @raunaqmorarka! I think @ebyhr is busy, so could I ask you please to help us with this? We have rebased multiple times and responded to all the review comments so I don't think there is anything outstanding here. Thank you so much!

docs/src/main/sphinx/connector/iceberg.md Outdated Show resolved Hide resolved
return Optional.empty();
}

Set<IcebergColumnHandle> icebergColumnHandleSet = constraint.getPredicateColumns()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should include the columns from enforced and unenforced tuple domains. See the implementation in DeltaLakeMetadata#applyFilter

Copy link
Member

@raunaqmorarka raunaqmorarka Oct 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand constraintColumns in HiveTableHandle includes only constraint predicate columns. Either way should be okay as long as we clearly document in IcebergTableHandle what is included in constraintColumns

@cla-bot
Copy link

cla-bot bot commented Oct 10, 2023

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: Marton Bod.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@marton-bod
Copy link
Contributor

@raunaqmorarka I highly appreciate your review. I have addressed your comments. To make it easier to check, I kept the new changes in a separate commit, but once you approve the PR I will squash the commits

Copy link
Member

@raunaqmorarka raunaqmorarka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments
please squash the commits

@raunaqmorarka raunaqmorarka changed the title Reject the table scan produced by the planner when the query does not… Add config to reject Iceberg queries without partition filter Oct 10, 2023
@marton-bod
Copy link
Contributor

@raunaqmorarka Thanks a lot again for you review, I truly appreciate it!

Copy link
Contributor

@findinpath findinpath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only nits

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

Optionally reject queries with non-partition predicates on Iceberg partitioned tables
10 participants