Relax decimal metadata checks for mismatched precision/scale [databricks] #12060

Open. Wants to merge 4 commits into base: branch-25.02
Conversation

@nartal1 nartal1 commented Feb 4, 2025

This PR fixes #11433 .

This PR makes the GPU Parquet reader more flexible when handling files whose decimal columns have a different precision/scale than Spark’s requested schema. Previously, the plugin failed early (“Parquet column cannot be converted”) if the file declared, for example, DECIMAL(20, 0) but Spark asked for DECIMAL(10, 0) or DECIMAL(5, 1). Now we defer these mismatches and resolve them with a decimal cast that applies half-up rounding or overflow handling as needed, matching standard Spark behavior.

In this PR, we make castDecimalToDecimal a public function. In evolveSchemaCasts, we pass the from and to DecimalTypes so the decimals can be cast to the requested type. castDecimalToDecimal handles both widening and narrowing of the scale/precision.
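
As a rough illustration (plain Python with the standard decimal module, not the plugin's Scala code), these are the two narrowing cases the cast has to resolve:

    from decimal import Decimal, ROUND_HALF_UP

    # Narrowing the scale: a DECIMAL(7, 4) value read as DECIMAL(5, 2) is rounded
    # half-up to two fractional digits.
    assert Decimal("123.4567").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP) == Decimal("123.46")

    # Narrowing the precision: 12345678901 has 11 digits, so it does not fit in
    # DECIMAL(10, 0); the cast has to flag this as overflow (Spark returns null for
    # an overflowing decimal cast when ANSI mode is off).
    value = Decimal("12345678901")
    assert len(value.as_tuple().digits) > 10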

Updated the existing integration tests.
In Spark-UT we disable the test because the Apache Spark vectorized path throws an error whereas the spark-rapids implementation produces the correct results.
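
As a rough sketch of how this behavior can be exercised (this is not the exact test added in the PR; it assumes the existing integration-test helpers DecimalGen, gen_df, with_cpu_session and assert_gpu_and_cpu_are_equal_collect from integration_tests/src/main/python):

    from asserts import assert_gpu_and_cpu_are_equal_collect
    from data_gen import DecimalGen, gen_df
    from spark_session import with_cpu_session
    from pyspark.sql.types import StructType, StructField, DecimalType

    def test_parquet_decimal_precision_scale_mismatch(spark_tmp_path):
        data_path = spark_tmp_path + '/PARQUET_DATA'
        # Write DECIMAL(20, 0) data on the CPU ...
        with_cpu_session(
            lambda spark: gen_df(spark, [('a', DecimalGen(20, 0))]).write.parquet(data_path))
        # ... then read it back requesting the narrower DECIMAL(10, 0). With this
        # change the GPU reader no longer rejects the mismatch up front; the real
        # tests also branch on Spark versions where the CPU reader still throws
        # "Parquet column cannot be converted".
        read_schema = StructType([StructField('a', DecimalType(10, 0))])
        assert_gpu_and_cpu_are_equal_collect(
            lambda spark: spark.read.schema(read_schema).parquet(data_path))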

@nartal1 nartal1 added the bug Something isn't working label Feb 4, 2025
@nartal1 nartal1 self-assigned this Feb 4, 2025

nartal1 commented Feb 4, 2025

build


@revans2 revans2 left a comment


I have a nit for the one test I would like to see expanded, but the indentation is a blocker because it changes when the test runs.

conf={},
error_message="Parquet column cannot be converted"
)
else:
assert_gpu_and_cpu_are_equal_collect(

The indentation is off. This test only runs when is_before_spark_400 returns true.

@@ -1500,6 +1500,8 @@ def test_parquet_check_schema_compatibility_nested_types(spark_tmp_path):
(DecimalGen(7, 4), DecimalGen(5, 2)),
(DecimalGen(10, 7), DecimalGen(5, 2)),
(DecimalGen(20, 17), DecimalGen(5, 2)),
# Narrowing precision
(DecimalGen(20, 0), DecimalGen(10, 0)),
# Increasing precision and decreasing scale

nit: So I mapped out all of the tests here. I looked at any change in the data type, crossed with whether the scale increased, stayed the same, or decreased, and whether the whole part (i.e. the precision minus the scale) increased, stayed the same, or decreased.

| data type | 32->32 | 64->64 | 128->128 | 32->64 | 32->128 | 64->128 | 128->64 | 128->32 | 128->64 | 64->32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| scale same / whole same | N/A noop | N/A noop | N/A noop | N/A impossible | N/A impossible | N/A impossible | N/A impossible | N/A impossible | N/A impossible | N/A impossible |
| scale same / whole increase | | | | | | | N/A impossible | N/A impossible | N/A impossible | N/A impossible |
| scale same / whole decrease | | | | N/A impossible | N/A impossible | N/A impossible | (20,0)->(10,0) | | | |
| scale increase / whole same | (5,2)->(7,4), (5,2)->(6,3) | (10,2)->(12,4) | (20,2)->(22,4) | (5,2)->(10,7) | (5,2)->(20,17) | (10,2)->(20,12) | N/A impossible | N/A impossible | N/A impossible | N/A impossible |
| scale increase / whole increase | | | | (5,2)->(12,5) | (5,2)->(22,10) | | N/A impossible | N/A impossible | N/A impossible | N/A impossible |
| scale increase / whole decrease | (5,2)->(6,4) | (10,4)->(12,7) | | | | | | | | |
| scale decrease / whole same | (7,4)->(5,2) | | | N/A impossible | N/A impossible | N/A impossible | | (20,17)->(5,2) | | (10,7)->(5,2) |
| scale decrease / whole increase | (5,4)->(7,2) | (10,6)->(12,4) | (20,7)->(22,5) | | | | | | | |
| scale decrease / whole decrease | | | | N/A impossible | N/A impossible | N/A impossible | | | | |

I don't expect all of the boxes to be filled. I don't think we need exhaustive tests, but I noticed that

    (DecimalGen(5, 2), DecimalGen(6, 3)),

does not actually increase the precision by a larger amount than the scale (the scale increased by 1 and so did the precision, so the whole part stayed the same, just like for (5,2)->(7,4)).

Could we get one or two tests for when the scale stays the same and the whole part increases, and similarly for when the scale decreases and so does the whole part?

I don't think this is going to improve the coverage massively.
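
For illustration, hypothetical pairs along these lines (not taken from the PR as posted) would cover those two boxes:

    # scale stays the same (2 -> 2), whole part increases (3 -> 5):
    (DecimalGen(5, 2), DecimalGen(7, 2)),
    # scale decreases (6 -> 4) and the whole part decreases too (4 -> 3):
    (DecimalGen(10, 6), DecimalGen(7, 4)),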

Signed-off-by: Niranjan Artal <[email protected]>
@nartal1 nartal1 changed the title from "Relax decimal metadata checks for mismatched precision/scale" to "Relax decimal metadata checks for mismatched precision/scale [databricks]" on Feb 4, 2025

nartal1 commented Feb 4, 2025

build


nartal1 commented Feb 4, 2025

Thanks @revans2 for the review. I have addressed the review comments. PTAL.


nartal1 commented Feb 4, 2025

Filed issue for Spark-4.0 build failure - #12062

Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

[BUG] Spark UT framework: SPARK-34212 Parquet should read decimals correctly