-
Notifications
You must be signed in to change notification settings - Fork 604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Include partition key in the Iceberg data file location #25169
base: dev
Are you sure you want to change the base?
Include partition key in the Iceberg data file location #25169
Conversation
0ccb8df
to
ee8d61d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Including a partition key derived path component in the Iceberg data files remote location.
can we please state why its being done?
9d2f03f
to
030044a
Compare
Retry command for Build#62390please wait until all jobs are finished before running the slash command
|
CI test resultstest results on build#62390
test results on build#62405
|
Including a partition key derived path component in the Iceberg data files remote location. Signed-off-by: Michał Maślanka <[email protected]>
030044a
to
bb37c49
Compare
Retry command for Build#62405please wait until all jobs are finished before running the slash command
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe makes sense to add a simple test (ducktape or otherwise) that lists the files and ensures that they look sane? Doesn't have to be in this PR but it'd be nice as a reviewer just so we know it looks like what we expect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to what Andrew said, you can try mixing together Datalake3rdPartyMaintenanceTest
with DatalakeE2ETests.test_iceberg_files_location
to ensure that the paths are as expected and remain so post rewrite by 3rd party engine. I.e. ensure that our interpretation of paths matches 3rd party engines*.
* This would be best effort anyway. 3rd party query engines can be configured not to use partition key in the path, or they must even custom implementations and it is not practical to try and achieve 100% compatibility with all of them. At most, we can try to be consistent with one of them.
@@ -36,7 +36,7 @@ ss::future<checked<remote_path, translation_task::errc>> execute_single_upload( | |||
retry_chain_node& parent_rcn, | |||
lazy_abort_source& lazy_as) { | |||
auto file_remote_path = remote_path{ | |||
file.table_location / remote_path_prefix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove remote_path_prefix
completely with this commit?
The prefix contains n/t/p
and we were discussing removing it next time we touch this code and introduce partition key path. Once we run an optimize task with say trino/spark, they will rewrite the paths and remove that information.
Including a partition key derived path component in the Iceberg data files remote location. The partition key being part of the path allow users to better navigate the Iceberg data files in the object store. Additionally it will allow per partition access control.
Backports Required
Release Notes