Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include partition key in the Iceberg data file location #25169

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

mmaslankaprv
Copy link
Member

@mmaslankaprv mmaslankaprv commented Feb 25, 2025

Including a partition key derived path component in the Iceberg data files remote location. The partition key being part of the path allow users to better navigate the Iceberg data files in the object store. Additionally it will allow per partition access control.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

  • none

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including a partition key derived path component in the Iceberg data files remote location.

can we please state why its being done?

@mmaslankaprv mmaslankaprv force-pushed the dl-paritition-key-location branch 2 times, most recently from 9d2f03f to 030044a Compare February 27, 2025 06:59
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Feb 27, 2025

Retry command for Build#62390

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_partition_spec_evo@{"cloud_storage_type":1,"query_engine":"trino","use_partition_spec":true}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_partition_spec_evo@{"cloud_storage_type":1,"query_engine":"spark","use_partition_spec":true}

@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Feb 27, 2025

CI test results

test results on build#62390
test_id test_kind job_url test_status passed
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/62390#01954693-295b-4df0-9b2e-77fd34e611b7 FLAKY 1/2
rptest.tests.datalake.schema_evolution_test.SchemaEvolutionE2ETests.test_partition_spec_evo.cloud_storage_type=CloudStorageType.S3.query_engine=QueryEngineType.SPARK.use_partition_spec=True ducktape https://buildkite.com/redpanda/redpanda/builds/62390#01954693-295b-4df0-9b2e-77fd34e611b7 FAIL 0/20
rptest.tests.datalake.schema_evolution_test.SchemaEvolutionE2ETests.test_partition_spec_evo.cloud_storage_type=CloudStorageType.S3.query_engine=QueryEngineType.TRINO.use_partition_spec=True ducktape https://buildkite.com/redpanda/redpanda/builds/62390#01954693-295d-4126-8c90-f1816a871e76 FAIL 0/20
test results on build#62405
test_id test_kind job_url test_status passed
rptest.tests.archival_test.ArchivalTest.test_all_partitions_leadership_transfer.cloud_storage_type=CloudStorageType.S3 ducktape https://buildkite.com/redpanda/redpanda/builds/62405#01954796-3a9c-42f7-a483-6fcf4025e434 FLAKY 1/2
rptest.tests.compaction_recovery_test.CompactionRecoveryUpgradeTest.test_index_recovery_after_upgrade ducktape https://buildkite.com/redpanda/redpanda/builds/62405#019547a9-75fc-4fab-a82c-3dafda6ae21d FLAKY 1/2
rptest.tests.datalake.schema_evolution_test.SchemaEvolutionE2ETests.test_partition_spec_evo.cloud_storage_type=CloudStorageType.S3.query_engine=QueryEngineType.SPARK.use_partition_spec=True ducktape https://buildkite.com/redpanda/redpanda/builds/62405#019547a9-75fe-4431-ac8e-31d42906299a FAIL 0/20
rptest.tests.datalake.schema_evolution_test.SchemaEvolutionE2ETests.test_partition_spec_evo.cloud_storage_type=CloudStorageType.S3.query_engine=QueryEngineType.TRINO.use_partition_spec=True ducktape https://buildkite.com/redpanda/redpanda/builds/62405#019547a9-75fc-4fab-a82c-3dafda6ae21d FAIL 0/20
rptest.tests.log_segment_ms_test.SegmentMsTest.test_segment_rolling.use_alter_cfg=False.server_cfg=None.topic_cfg=90000 ducktape https://buildkite.com/redpanda/redpanda/builds/62405#019547a9-75fe-4431-ac8e-31d42906299a FLAKY 1/2
rptest.tests.partition_movement_test.SIPartitionMovementTest.test_cross_shard.num_to_upgrade=2.cloud_storage_type=CloudStorageType.ABS ducktape https://buildkite.com/redpanda/redpanda/builds/62405#01954796-3a9d-46cf-bbd5-6feed7db1776 FLAKY 1/2
rptest.tests.read_replica_e2e_test.TestReadReplicaService.test_simple_end_to_end.partition_count=10.cloud_storage_type_and_url_style=.CloudStorageType.S3.1.path ducktape https://buildkite.com/redpanda/redpanda/builds/62405#019547a9-75fe-4895-997a-c4e1325b9f5b FLAKY 1/2

Including a partition key derived path component in the Iceberg data
files remote location.

Signed-off-by: Michał Maślanka <[email protected]>
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Feb 27, 2025

Retry command for Build#62405

please wait until all jobs are finished before running the slash command


/ci-repeat 1
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_partition_spec_evo@{"cloud_storage_type":1,"query_engine":"spark","use_partition_spec":true}
tests/rptest/tests/datalake/schema_evolution_test.py::SchemaEvolutionE2ETests.test_partition_spec_evo@{"cloud_storage_type":1,"query_engine":"trino","use_partition_spec":true}

Copy link
Contributor

@andrwng andrwng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe makes sense to add a simple test (ducktape or otherwise) that lists the files and ensures that they look sane? Doesn't have to be in this PR but it'd be nice as a reviewer just so we know it looks like what we expect

Copy link
Contributor

@nvartolomei nvartolomei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to what Andrew said, you can try mixing together Datalake3rdPartyMaintenanceTest with DatalakeE2ETests.test_iceberg_files_location to ensure that the paths are as expected and remain so post rewrite by 3rd party engine. I.e. ensure that our interpretation of paths matches 3rd party engines*.

* This would be best effort anyway. 3rd party query engines can be configured not to use partition key in the path, or they must even custom implementations and it is not practical to try and achieve 100% compatibility with all of them. At most, we can try to be consistent with one of them.

@@ -36,7 +36,7 @@ ss::future<checked<remote_path, translation_task::errc>> execute_single_upload(
retry_chain_node& parent_rcn,
lazy_abort_source& lazy_as) {
auto file_remote_path = remote_path{
file.table_location / remote_path_prefix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove remote_path_prefix completely with this commit?

The prefix contains n/t/p and we were discussing removing it next time we touch this code and introduce partition key path. Once we run an optimize task with say trino/spark, they will rewrite the paths and remove that information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants