
Add DeltaLake support without Deletion Vectors for Databricks 14.3 [databricks] #12048

Open

razajafri wants to merge 32 commits into base: branch-25.02

Conversation

@razajafri (Collaborator) commented Jan 30, 2025

This PR adds Delta Lake support for Databricks 14.3. In addition, it adds a way to set the Spark configuration for deletion vectors to turn them on or off, which will help us test this feature in the future when we add deletion vector support.

The expected behavior in this PR is:

  • When spark.databricks.delta.properties.defaults.enableDeletionVectors = false, all Delta Lake tests should pass.
  • When spark.databricks.delta.properties.defaults.enableDeletionVectors = true, some Delta Lake tests will xfail because of the fallback to CPU.
    The xfailed tests will be fixed when the follow-on PR (with deletion vector read support) is merged.

This PR does not add deletion vector support itself; it is a stepping stone toward adding that support.
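For illustration, the on/off switch described above can be expressed as a pair of Spark conf dicts for the integration tests (a minimal sketch; the list name and contents are assumptions, not necessarily this PR's exact code):

```python
# Minimal sketch (names are assumptions) of the two deletion-vector modes the
# description refers to; each dict would be merged into the Spark conf used by
# a Delta Lake test.
deletion_vector_conf_values = [
    {"spark.databricks.delta.properties.defaults.enableDeletionVectors": "false"},
    {"spark.databricks.delta.properties.defaults.enableDeletionVectors": "true"},
]
```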

Contributes to #10661.

Signed-off-by: Raza Jafri <[email protected]>
@@ -81,6 +81,24 @@ trait GpuDeltaCatalogBase extends StagingTableCatalog {
new GpuStagedDeltaTableV2(ident, schema, partitions, properties, operation)
}

protected def getWriter(sourceQuery: Option[DataFrame],
@razajafri:
This method is overridden in Databricks 14.3's version of GpuDeltaCatalog

@@ -0,0 +1,61 @@
/*
@razajafri:
This file is the same in every version of Databricks except for Databricks 14.3

@@ -0,0 +1,73 @@
/*
@razajafri:
This file is the same in every version of Databricks except for 14.3

@razajafri:

build

@razajafri changed the title from "Add DeltaLake DeletionVector Scan support for Databricks 14.3 [databricks]" to "Add DeltaLake support for Databricks 14.3 [databricks]" on Jan 30, 2025
@razajafri changed the title from "Add DeltaLake support for Databricks 14.3 [databricks]" to "Add DeltaLake support without Deletion Vectors for Databricks 14.3 [databricks]" on Jan 30, 2025
@@ -0,0 +1,61 @@
/*
* Copyright (c) 2022-2025, NVIDIA CORPORATION.
Collaborator:

This is a new file, right?

Suggested change:
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.
+ * Copyright (c) 2025, NVIDIA CORPORATION.

@razajafri:

Same comment as above

@mythrocks left a comment:

Some minor changes requested.

@sameerz added the task label (Work required that improves the product but is not user facing) on Jan 31, 2025
@ignore_order(local=True)
@pytest.mark.skipif(not is_databricks104_or_later(), reason="Dynamic File Pruning is only supported in Databricks 10.4+")
@pytest.mark.parametrize('s_index', list(range(len(_statements))), ids=idfn)
@pytest.mark.parametrize('aqe_enabled', ['false', 'true'])
def test_delta_dfp_reuse_broadcast_exchange(spark_tmp_table_factory, s_index, aqe_enabled):
@pytest.mark.parametrize("deletion_vector_conf", deletion_vector_conf, ids=idfn)
Collaborator:

Do we want to raise an issue for this failure (ShuffleExchangeExec falling back to CPU), and link it here with deletion_vector_values_with_reasons?
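(Illustration: the suggestion above could take roughly this shape; the structure of deletion_vector_values_with_reasons and the issue reference are assumptions, not this PR's code.)

```python
# Hypothetical shape for deletion_vector_values_with_reasons: pair each
# deletion-vector conf with an xfail reason where GPU support is missing.
# The issue reference below is a placeholder, not a real issue number.
import pytest

DV_KEY = "spark.databricks.delta.properties.defaults.enableDeletionVectors"

deletion_vector_values_with_reasons = [
    pytest.param({DV_KEY: "false"}, id="dv_off"),
    pytest.param(
        {DV_KEY: "true"},
        id="dv_on",
        marks=pytest.mark.xfail(
            reason="ShuffleExchangeExec falls back to CPU; see issue <placeholder>")),
]
```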

@mythrocks:

Just for the record, I've gone over the tests listed in #11541 (in my comment). It looks like these are being conditionally xfailed in the right way.

@mythrocks:

I thought I'd alert everybody to a potential problem I'm seeing while test-driving this branch to examine the tightBounds stats bug in #12027.

I am testing this as a user would, with the following write operation:

spark.range(1, 19).toDF("id").write.mode("overwrite").format("delta").save(myOutputDir)

On DB 14.3, here's what I'm seeing:

  1. With spark.databricks.delta.properties.defaults.enableDeletionVectors=true, the output contains tightBounds. This indicates that spark-rapids is falling back to CPU correctly.
  2. With spark.databricks.delta.properties.defaults.enableDeletionVectors=false, the output does not contain tightBounds. This indicates that the write stays on GPU, for a non-deletion-vector write.
  3. With spark.databricks.delta.properties.defaults.enableDeletionVectors unset, it looks like we're not falling back to CPU. The stat goes missing.

I'm going to examine the overrides code and the meta to see why.
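(Illustration: one way to check for the stat described above is to scan the table's _delta_log for a tightBounds field in each add action's stats; a sketch, assuming the table is on a local filesystem path.)

```python
# Sketch: scan a Delta table's _delta_log and report whether any "add" action
# carries a tightBounds field in its per-file stats. Assumes a local path.
import glob
import json

def stats_have_tight_bounds(table_dir: str) -> bool:
    for log_file in sorted(glob.glob(f"{table_dir}/_delta_log/*.json")):
        with open(log_file) as f:
            for line in f:
                add = json.loads(line).get("add")
                if add and add.get("stats"):
                    if "tightBounds" in json.loads(add["stats"]):
                        return True
    return False
```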

revans2 previously approved these changes Feb 3, 2025

@revans2 left a comment:

I think it all looks okay now. I would love to see one test for each operator to verify we fall back to the CPU when deletion vectors are enabled, but that is minor.

@razajafri:

> I think it all looks okay now. I would love to see one test for each operator to verify we fall back to the CPU when deletion vectors are enabled, but that is minor.

Thank you for your patience and review. I missed that comment from our conversation on Friday. Let me add that test.
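(Illustration: the requested per-operator check might look roughly like this; assert_gpu_fallback_write and the ExecutedCommandExec fallback class are assumptions drawn from spark-rapids integration-test conventions, not from this PR.)

```python
# Sketch of a fallback test with deletion vectors enabled. The helper name
# and the CPU fallback class are assumptions, not this PR's actual code.
from asserts import assert_gpu_fallback_write

dv_on_conf = {
    "spark.databricks.delta.properties.defaults.enableDeletionVectors": "true"}

def test_delta_write_falls_back_with_dv_enabled(spark_tmp_path):
    data_path = spark_tmp_path + "/DELTA_DATA"
    assert_gpu_fallback_write(
        lambda spark, path: spark.range(100).write.format("delta").save(path),
        lambda spark, path: spark.read.format("delta").load(path),
        data_path,
        "ExecutedCommandExec",  # CPU plan node the write is expected to hit
        conf=dv_on_conf)
```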

@razajafri:

build

@razajafri:

CI failed with:

    Databricks part2 result : FAILURE

@razajafri:

build

Co-authored-by: Gera Shegalov <[email protected]>
@razajafri:

build

@razajafri:

build
