Relax decimal metadata checks for mismatched precision/scale [databricks] #12060
base: branch-25.02
Conversation
Signed-off-by: Niranjan Artal <[email protected]>
build
I have a nit about the one test I would like to see expanded, but the indentation is a blocker because it changes when the test actually runs.
conf={},
error_message="Parquet column cannot be converted"
)
else:
assert_gpu_and_cpu_are_equal_collect(
The indentation is off. This test only runs when is_before_spark_400 returns true.
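For illustration only, a minimal sketch of the indentation concern. This is not the actual test: read_df, the branch contents, and the expected outcome in each branch are assumptions, and only the helper names come from the diff and comment above.

```python
# Illustrative sketch, not the real test. read_df and the branch contents are
# placeholders; the helpers are the spark-rapids integration-test utilities
# referenced in the quoted diff.
from asserts import assert_gpu_and_cpu_error, assert_gpu_and_cpu_are_equal_collect
from spark_session import is_before_spark_400

read_df = lambda spark: spark.read.parquet("/tmp/some_decimal_data")

if is_before_spark_400():
    assert_gpu_and_cpu_error(
        lambda spark: read_df(spark).collect(),
        conf={},
        error_message="Parquet column cannot be converted")
else:
    # Indented under 'else:', so it runs only when the condition is false.
    # Dedent it one level and it runs unconditionally instead; the
    # indentation decides when this assert runs.
    assert_gpu_and_cpu_are_equal_collect(read_df, conf={})
```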
@@ -1500,6 +1500,8 @@ def test_parquet_check_schema_compatibility_nested_types(spark_tmp_path):
 (DecimalGen(7, 4), DecimalGen(5, 2)),
 (DecimalGen(10, 7), DecimalGen(5, 2)),
 (DecimalGen(20, 17), DecimalGen(5, 2)),
+# Narrowing precision
+(DecimalGen(20, 0), DecimalGen(10, 0)),
 # Increasing precision and decreasing scale
nit: I mapped out all of the tests here. For each pair I looked at the change in the underlying data type, crossed with whether the scale increased, stayed the same, or decreased, and whether the whole part (i.e. the precision minus the scale) increased, stayed the same, or decreased.
data type | 32->32 | 64->64 | 128->128 | 32->64 | 32->128 | 64->128 | 128->64 | 128->32 | 128->64 | 64->32 |
---|---|---|---|---|---|---|---|---|---|---|
scale same/whole same | N/A noop | N/A noop | N/A noop | N/A impossible | N/A impossible | N/A impossible | N/A impossible | N/A impossible | N/A impossible | N/A impossible |
scale same/whole increase | N/A impossible | N/A impossible | N/A impossible | N/A impossible | | | | | | |
scale same/whole decrease | N/A impossible | N/A impossible | N/A impossible | (20,0)->(10,0) | | | | | | |
scale increase/whole same | (5,2)->(7,4), (5,2)->(6,3) | (10,2)->(12,4) | (20,2)->(22,4) | (5,2)->(10,7) | (5,2)->(20,17) | (10,2)->(20,12) | N/A impossible | N/A impossible | N/A impossible | N/A impossible |
scale increase/whole increase | (5,2)->(12,5) | (5,2)->(22,10) | N/A impossible | N/A impossible | N/A impossible | N/A impossible | | | | |
scale increase/whole decrease | (5,2)->(6,4) | (10,4)->(12,7) | | | | | | | | |
scale decrease/whole same | (7,4)->(5,2) | N/A impossible | N/A impossible | N/A impossible | (20,17)->(5,2) | (10,7)->(5,2) | | | | |
scale decrease/whole increase | (5,4)->(7,2) | (10,6)->(12,4) | (20,7)->(22,5) | | | | | | | |
scale decrease/whole decrease | N/A impossible | N/A impossible | N/A impossible | | | | | | | |
I don't expect all of the boxes to be filled, and I don't think we need exhaustive tests, but I noticed that (DecimalGen(5, 2), DecimalGen(6, 3)) does not actually increase the precision by more than the scale: the scale increased by 1 and so did the precision, so the whole part stayed the same, just like for (5,2)->(7,4).
Could we get one or two tests for when the scale stays the same and the whole part increases, and similarly for when the scale decreases and so does the whole part?
I don't think this is going to improve the coverage massively.
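To make the two axes concrete, here is a small illustrative sketch (not part of the plugin or its tests; the classify helper is invented for this comment) that buckets a precision/scale change the same way the table does:

```python
# Illustrative only: bucket a decimal type change by the two axes used in
# the table above, the scale change and the whole-part change.
def classify(from_prec, from_scale, to_prec, to_scale):
    def direction(old, new):
        return "increase" if new > old else "decrease" if new < old else "same"
    # The "whole part" is the digits left of the decimal point: precision - scale.
    return (direction(from_scale, to_scale),
            direction(from_prec - from_scale, to_prec - to_scale))

# (5,2)->(6,3): scale and precision both grow by 1, so the whole part stays
# the same and it lands in the same bucket as (5,2)->(7,4).
assert classify(5, 2, 6, 3) == ("increase", "same")
# (20,0)->(10,0): same scale, smaller whole part (the narrowing case above).
assert classify(20, 0, 10, 0) == ("same", "decrease")
```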
Signed-off-by: Niranjan Artal <[email protected]>
build
Thanks @revans2 for the review. I have addressed the review comments. PTAL.
Filed an issue for the Spark 4.0 build failure: #12062
This PR fixes #11433.
This PR makes the GPU Parquet reader more flexible when handling files whose decimal columns have a different precision/scale than Spark's requested schema. Previously, the plugin would fail early ("Parquet column cannot be converted") if the file declared, for example, DECIMAL(20, 0) but Spark asked for DECIMAL(10, 0) or DECIMAL(5, 1). Now we defer these mismatches and resolve them later, with half-up rounding or overflow handling as needed, to match standard Spark behavior.
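As a rough illustration of the scenario (a hypothetical local reproduction; the path and values are made up, and the exact outcome depends on the Spark version and reader in use):

```python
# Hypothetical reproduction of the precision/scale mismatch; the path and
# values are made up for illustration.
from decimal import Decimal
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DecimalType

spark = SparkSession.builder.getOrCreate()

# Write a file whose column is declared as DECIMAL(20, 0).
wide = StructType([StructField("d", DecimalType(20, 0))])
spark.createDataFrame([(Decimal(12345),)], wide) \
    .write.mode("overwrite").parquet("/tmp/decimal_mismatch")

# Ask for DECIMAL(10, 0) on read. The plugin used to reject this up front
# with "Parquet column cannot be converted"; with this change the mismatch
# is deferred and resolved by casting instead.
narrow = StructType([StructField("d", DecimalType(10, 0))])
spark.read.schema(narrow).parquet("/tmp/decimal_mismatch").show()
```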
In this PR, we make castDecimalToDecimal a public function. In evolveSchemaCasts, we pass the from and to DecimalTypes to cast the decimals to the required form; castDecimalToDecimal handles both widening and narrowing of scale/precision. The existing integration tests have been updated.
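Since the mismatch is now resolved by a decimal-to-decimal cast, the CPU behavior it aims to mirror can be sketched with an ordinary Spark CAST (a sketch assuming a local SparkSession and non-ANSI mode; these queries are not taken from the PR):

```python
# Sketch of the CPU cast semantics the deferred handling tries to match.
# Assumes non-ANSI mode; with ANSI enabled the overflow case raises instead.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.ansi.enabled", "false")

# Narrowing the scale rounds half-up: 1.255 becomes 1.26 as DECIMAL(5, 2).
spark.sql("SELECT CAST(CAST('1.255' AS DECIMAL(10, 3)) AS DECIMAL(5, 2)) AS d").show()

# A value whose whole part does not fit the target precision overflows to NULL.
spark.sql("SELECT CAST(CAST('123456.7' AS DECIMAL(10, 1)) AS DECIMAL(5, 2)) AS d").show()
```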
In Spark-UT we are disabling the test, as the Apache Spark vectorized path throws an error whereas the spark-rapids implementation produces the correct results.