Relax decimal metadata checks for mismatched precision/scale [databricks] #12060

Open. Wants to merge 4 commits into base: branch-25.02
Conversation

@nartal1 nartal1 commented Feb 4, 2025

This PR fixes #11433 .

This PR makes the GPU Parquet reader more flexible when handling files whose decimal columns have a different precision/scale than Spark’s requested schema. Previously, the plugin failed early (“Parquet column cannot be converted”) if the file declared, for example, DECIMAL(20, 0) but Spark asked for DECIMAL(10, 0) or DECIMAL(5, 1). Now we defer these mismatches and resolve them with a decimal cast that applies half-up rounding or overflow handling as needed, matching standard Spark behavior.

In this PR, we make castDecimalToDecimal a public function. In evolveSchemaCasts, we pass the from and to DecimalTypes so the decimals can be cast to the requested type. castDecimalToDecimal handles both widening and narrowing of the scale/precision.
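
As a rough illustration (plain Python with the standard decimal module, not the plugin's Scala code), these are the two narrowing cases the cast has to resolve:

    from decimal import Decimal, ROUND_HALF_UP

    # Narrowing the scale: a DECIMAL(7, 4) value read as DECIMAL(5, 2) is rounded
    # half-up to two fractional digits.
    assert Decimal("123.4567").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP) == Decimal("123.46")

    # Narrowing the precision: 12345678901 has 11 digits, so it does not fit in
    # DECIMAL(10, 0); the cast has to flag this as overflow (Spark returns null for
    # an overflowing decimal cast when ANSI mode is off).
    value = Decimal("12345678901")
    assert len(value.as_tuple().digits) > 10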

Updated the existing integration tests.
In Spark-UT we disable the test because the Apache Spark vectorized path throws an error whereas the spark-rapids implementation produces the correct results.
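
As a rough sketch of how this behavior can be exercised (this is not the exact test added in the PR; it assumes the existing integration-test helpers DecimalGen, gen_df, with_cpu_session and assert_gpu_and_cpu_are_equal_collect from integration_tests/src/main/python):

    from asserts import assert_gpu_and_cpu_are_equal_collect
    from data_gen import DecimalGen, gen_df
    from spark_session import with_cpu_session
    from pyspark.sql.types import StructType, StructField, DecimalType

    def test_parquet_decimal_precision_scale_mismatch(spark_tmp_path):
        data_path = spark_tmp_path + '/PARQUET_DATA'
        # Write DECIMAL(20, 0) data on the CPU ...
        with_cpu_session(
            lambda spark: gen_df(spark, [('a', DecimalGen(20, 0))]).write.parquet(data_path))
        # ... then read it back requesting the narrower DECIMAL(10, 0). With this
        # change the GPU reader no longer rejects the mismatch up front; the real
        # tests also branch on Spark versions where the CPU reader still throws
        # "Parquet column cannot be converted".
        read_schema = StructType([StructField('a', DecimalType(10, 0))])
        assert_gpu_and_cpu_are_equal_collect(
            lambda spark: spark.read.schema(read_schema).parquet(data_path))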

@nartal1 nartal1 added the bug Something isn't working label Feb 4, 2025
@nartal1 nartal1 self-assigned this Feb 4, 2025

nartal1 commented Feb 4, 2025

build


@revans2 revans2 left a comment


I have a nit for the one test I would like to see expanded, but the indentation is a blocker because it changes when the test runs.

conf={},
error_message="Parquet column cannot be converted"
)
else:
assert_gpu_and_cpu_are_equal_collect(

The indentation is off. This test only runs when is_before_spark_400 returns true.

@@ -1500,6 +1500,8 @@ def test_parquet_check_schema_compatibility_nested_types(spark_tmp_path):
(DecimalGen(7, 4), DecimalGen(5, 2)),
(DecimalGen(10, 7), DecimalGen(5, 2)),
(DecimalGen(20, 17), DecimalGen(5, 2)),
# Narrowing precision
(DecimalGen(20, 0), DecimalGen(10, 0)),
# Increasing precision and decreasing scale

nit: So I mapped out all of the tests here. I looked at any change in the data type, crossed with whether the scale increased, stayed the same, or decreased, and whether the whole part (i.e. the precision minus the scale) increased, stayed the same, or decreased.

| data type | 32->32 | 64->64 | 128->128 | 32->64 | 32->128 | 64->128 | 128->64 | 128->32 | 128->64 | 64->32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| scale same / whole same | N/A noop | N/A noop | N/A noop | N/A impossible | N/A impossible | N/A impossible | N/A impossible | N/A impossible | N/A impossible | N/A impossible |
| scale same / whole increase | | | | | | | N/A impossible | N/A impossible | N/A impossible | N/A impossible |
| scale same / whole decrease | | | | N/A impossible | N/A impossible | N/A impossible | (20,0)->(10,0) | | | |
| scale increase / whole same | (5,2)->(7,4), (5,2)->(6,3) | (10,2)->(12,4) | (20,2)->(22,4) | (5,2)->(10,7) | (5,2)->(20,17) | (10,2)->(20,12) | N/A impossible | N/A impossible | N/A impossible | N/A impossible |
| scale increase / whole increase | | | | (5,2)->(12,5) | (5,2)->(22,10) | | N/A impossible | N/A impossible | N/A impossible | N/A impossible |
| scale increase / whole decrease | (5,2)->(6,4) | (10,4)->(12,7) | | | | | | | | |
| scale decrease / whole same | (7,4)->(5,2) | | | N/A impossible | N/A impossible | N/A impossible | | (20,17)->(5,2) | | (10,7)->(5,2) |
| scale decrease / whole increase | (5,4)->(7,2) | (10,6)->(12,4) | (20,7)->(22,5) | | | | | | | |
| scale decrease / whole decrease | | | | N/A impossible | N/A impossible | N/A impossible | | | | |

I don't expect all of the boxes to be filled. I don't think we need exhaustive tests, but I noticed that

    (DecimalGen(5, 2), DecimalGen(6, 3)),

does not actually increase the precision by a larger amount than the scale (the scale increased by 1 and so did the precision, so the whole part stayed the same, just like for (5,2)->(7,4)).

Could we get one or two tests for when the scale stays the same and the whole part increases, and similarly for when the scale decreases and so does the whole part?

I don't think this is going to improve the coverage massively.
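
For illustration, hypothetical pairs along these lines (not taken from the PR as posted) would cover those two boxes:

    # scale stays the same (2 -> 2), whole part increases (3 -> 5):
    (DecimalGen(5, 2), DecimalGen(7, 2)),
    # scale decreases (6 -> 4) and the whole part decreases too (4 -> 3):
    (DecimalGen(10, 6), DecimalGen(7, 4)),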

Signed-off-by: Niranjan Artal <[email protected]>
@nartal1 nartal1 changed the title from "Relax decimal metadata checks for mismatched precision/scale" to "Relax decimal metadata checks for mismatched precision/scale [databricks]" on Feb 4, 2025

nartal1 commented Feb 4, 2025

build


nartal1 commented Feb 4, 2025

Thanks @revans2 for the review. I have addressed the review comments. PTAL.


nartal1 commented Feb 4, 2025

Filed issue for Spark-4.0 build failure - #12062

Labels
bug Something isn't working
Development

Successfully merging this pull request may close these issues.

[BUG] Spark UT framework: SPARK-34212 Parquet should read decimals correctly