
Add DeltaLake support without Deletion Vectors for Databricks 14.3 [databricks] #12048

Open

razajafri wants to merge 32 commits into base: branch-25.02

Conversation

@razajafri (Collaborator) commented Jan 30, 2025

This PR adds Delta Lake support for Databricks 14.3. In addition, it adds a way to set the Spark configuration for deletion vectors to turn them on or off, which will help us test this feature in the future when we add deletion vector support.

The expected behavior in this PR is:

  • When spark.databricks.delta.properties.defaults.enableDeletionVectors = false, all Delta Lake tests should pass.
  • When spark.databricks.delta.properties.defaults.enableDeletionVectors = true, some Delta Lake tests will xfail because of the fallback to CPU.
    The xfailed tests will be fixed when the follow-on PR (with deletion vector read support) is merged.

This PR does not add deletion vector support itself; it is a stepping stone toward adding that support.
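For illustration, the on/off switch described above can be expressed as a pair of Spark conf dicts for the integration tests (a minimal sketch; the list name and contents are assumptions, not necessarily this PR's exact code):

```python
# Minimal sketch (names are assumptions) of the two deletion-vector modes the
# description refers to; each dict would be merged into the Spark conf used by
# a Delta Lake test.
deletion_vector_conf_values = [
    {"spark.databricks.delta.properties.defaults.enableDeletionVectors": "false"},
    {"spark.databricks.delta.properties.defaults.enableDeletionVectors": "true"},
]
```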

Contributes to #10661.

Signed-off-by: Raza Jafri <[email protected]>
@@ -81,6 +81,24 @@ trait GpuDeltaCatalogBase extends StagingTableCatalog {
new GpuStagedDeltaTableV2(ident, schema, partitions, properties, operation)
}

protected def getWriter(sourceQuery: Option[DataFrame],
@razajafri:
This method is overridden in Databricks 14.3's version of GpuDeltaCatalog

@@ -0,0 +1,61 @@
/*
@razajafri:
This file is the same in every version of Databricks except for Databricks 14.3

@@ -0,0 +1,73 @@
/*
@razajafri:
This file is the same in every version of Databricks except for 14.3

@razajafri:

build

@razajafri changed the title from "Add DeltaLake DeletionVector Scan support for Databricks 14.3 [databricks]" to "Add DeltaLake support for Databricks 14.3 [databricks]" on Jan 30, 2025
@razajafri changed the title from "Add DeltaLake support for Databricks 14.3 [databricks]" to "Add DeltaLake support without Deletion Vectors for Databricks 14.3 [databricks]" on Jan 30, 2025
@@ -0,0 +1,61 @@
/*
* Copyright (c) 2022-2025, NVIDIA CORPORATION.
Collaborator:

This is a new file, right?

Suggested change:
- * Copyright (c) 2022-2025, NVIDIA CORPORATION.
+ * Copyright (c) 2025, NVIDIA CORPORATION.

@razajafri:

Same comment as above

@mythrocks left a comment:

Some minor changes requested.

@sameerz added the task label (Work required that improves the product but is not user facing) on Jan 31, 2025
@ignore_order(local=True)
@pytest.mark.skipif(not is_databricks104_or_later(), reason="Dynamic File Pruning is only supported in Databricks 10.4+")
@pytest.mark.parametrize('s_index', list(range(len(_statements))), ids=idfn)
@pytest.mark.parametrize('aqe_enabled', ['false', 'true'])
def test_delta_dfp_reuse_broadcast_exchange(spark_tmp_table_factory, s_index, aqe_enabled):
@pytest.mark.parametrize("deletion_vector_conf", deletion_vector_conf, ids=idfn)
Collaborator:

Do we want to raise an issue for this failure (ShuffleExchangeExec falling back to CPU), and link it here with deletion_vector_values_with_reasons?
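(Illustration: the suggestion above could take roughly this shape; the structure of deletion_vector_values_with_reasons and the issue reference are assumptions, not this PR's code.)

```python
# Hypothetical shape for deletion_vector_values_with_reasons: pair each
# deletion-vector conf with an xfail reason where GPU support is missing.
# The issue reference below is a placeholder, not a real issue number.
import pytest

DV_KEY = "spark.databricks.delta.properties.defaults.enableDeletionVectors"

deletion_vector_values_with_reasons = [
    pytest.param({DV_KEY: "false"}, id="dv_off"),
    pytest.param(
        {DV_KEY: "true"},
        id="dv_on",
        marks=pytest.mark.xfail(
            reason="ShuffleExchangeExec falls back to CPU; see issue <placeholder>")),
]
```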

@mythrocks:

Just for the record, I've gone over the tests listed in #11541 (in my comment). It looks like these are being conditionally xfailed in the right way.

@mythrocks:

I thought I'd alert everybody to a potential problem I'm seeing while test-driving this branch to examine the tightBounds stats bug in #12027.

I am testing this as a user would, with the following write operation:

spark.range(1, 19).toDF("id").write.mode("overwrite").format("delta").save(myOutputDir)

On DB 14.3, here's what I'm seeing:

  1. With spark.databricks.delta.properties.defaults.enableDeletionVectors=true, the output contains tightBounds. This indicates that spark-rapids is falling back to CPU correctly.
  2. With spark.databricks.delta.properties.defaults.enableDeletionVectors=false, the output does not contain tightBounds. This indicates that the write stays on GPU, for a non-deletion-vector write.
  3. With spark.databricks.delta.properties.defaults.enableDeletionVectors unset, it looks like we're not falling back to CPU. The stat goes missing.

I'm going to examine the overrides code and the meta to see why.
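(Illustration: one way to check for the stat described above is to scan the table's _delta_log for a tightBounds field in each add action's stats; a sketch, assuming the table is on a local filesystem path.)

```python
# Sketch: scan a Delta table's _delta_log and report whether any "add" action
# carries a tightBounds field in its per-file stats. Assumes a local path.
import glob
import json

def stats_have_tight_bounds(table_dir: str) -> bool:
    for log_file in sorted(glob.glob(f"{table_dir}/_delta_log/*.json")):
        with open(log_file) as f:
            for line in f:
                add = json.loads(line).get("add")
                if add and add.get("stats"):
                    if "tightBounds" in json.loads(add["stats"]):
                        return True
    return False
```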

revans2 previously approved these changes Feb 3, 2025

@revans2 left a comment:

I think it all looks okay now. I would love to see one test for each operator to verify we fall back to the CPU when deletion vectors are enabled, but that is minor.

@razajafri:

> I think it all looks okay now. I would love to see one test for each operator to verify we fall back to the CPU when deletion vectors are enabled, but that is minor.

Thank you for your patience and review. I missed that comment from our conversation on Friday. Let me add that test.
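(Illustration: the requested per-operator check might look roughly like this; assert_gpu_fallback_write and the ExecutedCommandExec fallback class are assumptions drawn from spark-rapids integration-test conventions, not from this PR.)

```python
# Sketch of a fallback test with deletion vectors enabled. The helper name
# and the CPU fallback class are assumptions, not this PR's actual code.
from asserts import assert_gpu_fallback_write

dv_on_conf = {
    "spark.databricks.delta.properties.defaults.enableDeletionVectors": "true"}

def test_delta_write_falls_back_with_dv_enabled(spark_tmp_path):
    data_path = spark_tmp_path + "/DELTA_DATA"
    assert_gpu_fallback_write(
        lambda spark, path: spark.range(100).write.format("delta").save(path),
        lambda spark, path: spark.read.format("delta").load(path),
        data_path,
        "ExecutedCommandExec",  # CPU plan node the write is expected to hit
        conf=dv_on_conf)
```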

@razajafri:

build

@razajafri:

CI failed with:

    Databricks part2 result : FAILURE

@razajafri:

build

Co-authored-by: Gera Shegalov <[email protected]>
@razajafri:

build

@razajafri:

build
