Add DeltaLake support without Deletion Vectors for Databricks 14.3 [databricks] #12048
base: branch-25.02
Conversation
Signed-off-by: Raza Jafri <[email protected]>
@@ -81,6 +81,24 @@ trait GpuDeltaCatalogBase extends StagingTableCatalog {
    new GpuStagedDeltaTableV2(ident, schema, partitions, properties, operation)
  }

  protected def getWriter(sourceQuery: Option[DataFrame],
This method is overridden in Databricks 14.3's version of GpuDeltaCatalog
@@ -0,0 +1,61 @@
/*
This file is the same in every version of Databricks except for Databricks 14.3
@@ -0,0 +1,73 @@
/*
This file is the same in every version of Databricks except for 14.3
build
...-spark350db143/src/main/scala/com/nvidia/spark/rapids/delta/DeltaSpark350DB143Provider.scala
@@ -0,0 +1,61 @@
/*
 * Copyright (c) 2022-2025, NVIDIA CORPORATION.
This is a new file, right?
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.
+ * Copyright (c) 2025, NVIDIA CORPORATION.
Same comment as above
...ark332db/src/main/scala/com/databricks/sql/transaction/tahoe/rapids/GpuDeltaDataSource.scala
...350db143/src/main/scala/com/databricks/sql/transaction/tahoe/rapids/GpuDeltaDataSource.scala
delta-lake/delta-spark350db143/src/main/scala/com/nvidia/spark/rapids/delta/DeltaProbe.scala
Some minor changes requested.
table properties
@ignore_order(local=True)
@pytest.mark.skipif(not is_databricks104_or_later(), reason="Dynamic File Pruning is only supported in Databricks 10.4+")
@pytest.mark.parametrize('s_index', list(range(len(_statements))), ids=idfn)
@pytest.mark.parametrize('aqe_enabled', ['false', 'true'])
def test_delta_dfp_reuse_broadcast_exchange(spark_tmp_table_factory, s_index, aqe_enabled):
@pytest.mark.parametrize("deletion_vector_conf", deletion_vector_conf, ids=idfn)
Do we want to raise an issue for this failure (`ShuffleExchangeExec` falling back to the CPU), and link it here with `deletion_vector_values_with_reasons`?
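(For context, a minimal sketch of what a `deletion_vector_values_with_reasons`-style helper could look like; the helper name comes from this thread, but its shape, the ids, and the conf-key layout are assumptions for illustration, not the PR's actual code.)

```python
import pytest

# Hypothetical sketch: pair each deletion-vector setting with an optional
# xfail reason, so a test xfails only when the feature is enabled and a
# known fallback (e.g. ShuffleExchangeExec on the CPU) is expected.
def deletion_vector_values_with_reasons(enabled_xfail_reason=None):
    dv_key = "spark.databricks.delta.properties.defaults.enableDeletionVectors"
    params = [pytest.param({dv_key: "false"}, id="dv_off")]
    if enabled_xfail_reason:
        params.append(pytest.param(
            {dv_key: "true"}, id="dv_on",
            marks=pytest.mark.xfail(reason=enabled_xfail_reason)))
    else:
        params.append(pytest.param({dv_key: "true"}, id="dv_on"))
    return params
```

A test could then pass the tracking-issue URL as the reason, e.g. `@pytest.mark.parametrize("deletion_vector_conf", deletion_vector_values_with_reasons("https://github.com/NVIDIA/spark-rapids/issues/..."))`, which is what linking the issue here would enable.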
Just for the record, I've gone over the tests listed in #11541 (in my comment). It looks like these are being conditionally xfailed, in the right way.
I thought I'd alert everybody to a potential problem I'm seeing when test-driving this branch. I am testing this as a user would, with the following write operation: `spark.range(1, 19).toDF("id").write.mode("overwrite").format("delta").save(myOutputDir)`. On DB 14.3, here's what I'm seeing:
I'm going to examine the overrides code and the meta to see why.
…he delta_lake_utils.py. Refactored the deletion vector values from delta_lake_utils.py
I think it all looks okay now. I would love to see one test for each operator to verify we fall back to the CPU when deletion vectors are enabled, but that is minor.
Thank you for your patience and review. I missed that comment from our conversation on Friday. Let me add that test.
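(For illustration, a rough sketch of what such a per-operator fallback test might look like, modeled on the spark-rapids integration-test helpers such as `assert_gpu_fallback_write`; the fallback class name, mark, and conf used here are assumptions, not the PR's actual test.)

```python
from asserts import assert_gpu_fallback_write
from marks import delta_lake

# Assumed conf enabling deletion vectors by default for new Delta tables.
dv_enabled_conf = {
    "spark.databricks.delta.properties.defaults.enableDeletionVectors": "true"}

@delta_lake
def test_delta_write_fallback_when_deletion_vectors_enabled(spark_tmp_path):
    data_path = spark_tmp_path + "/DELTA_DATA"
    # Expect the write command to stay on the CPU when deletion vectors are
    # enabled; "ExecutedCommandExec" is an assumed fallback class here.
    assert_gpu_fallback_write(
        lambda spark, path: spark.range(100).write.format("delta").save(path),
        lambda spark, path: spark.read.format("delta").load(path),
        data_path,
        "ExecutedCommandExec",
        conf=dv_enabled_conf)
```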
build
CI failed with |
build
Co-authored-by: Gera Shegalov <[email protected]>
build |
build |
This PR adds Delta Lake support for Databricks 14.3. In addition, it adds a way to set the Spark configuration for deletion vectors to turn them on or off, which will help us test this feature in the future when we add deletion vector support.
The expected behavior in this PR is:

- In `spark.databricks.delta.properties.defaults.enableDeletionVectors = false` mode, all Delta Lake tests should pass.
- In `spark.databricks.delta.properties.defaults.enableDeletionVectors = true` mode, some Delta Lake tests will xfail because of fallback to the CPU.

The xfailed tests will be fixed when the follow-on PR (with deletion vector read support) is merged.
This PR doesn't add deletion vector support at all; it is a stepping stone toward adding it.
contributes to #10661
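(A minimal PySpark sketch of the toggle described above; the property name comes from this description, while the session setup and output path are placeholders.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "false": all Delta Lake tests should pass on the GPU.
# "true": some operations currently fall back to the CPU, and the
#         corresponding tests are xfailed until read support lands.
spark.conf.set(
    "spark.databricks.delta.properties.defaults.enableDeletionVectors", "false")

# The same write used elsewhere in this thread to exercise the Delta path.
spark.range(1, 19).toDF("id") \
    .write.mode("overwrite").format("delta").save("/tmp/delta_dv_example")
```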