fix: enable full decimal to decimal support #1385
Conversation
Use a regex to match the Arrow invalid argument error.
@@ -872,6 +872,13 @@ fn cast_array(
     let array = array_with_timezone(array, cast_options.timezone.clone(), Some(to_type))?;
     let from_type = array.data_type().clone();

+    let native_cast_options: CastOptions = CastOptions {
+        safe: !matches!(cast_options.eval_mode, EvalMode::Ansi), // take safe mode from cast_options passed
+        format_options: FormatOptions::new()
I think one can use a default value defined for FormatOptions here.
The default CAST_OPTIONS, which is replaced by this native_cast_options, had these two set to:

static TIMESTAMP_FORMAT: Option<&str> = Some("%Y-%m-%d %H:%M:%S%.f");
timestamp_format: TIMESTAMP_FORMAT,
timestamp_tz_format: TIMESTAMP_FORMAT,

If we change it to the default, the FormatOptions::default() implementation sets these to:

timestamp_format: None,
timestamp_tz_format: None,

Hence I kept it as it is defined inside the default CAST_OPTIONS for Comet.
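For reference, a minimal self-contained sketch of the construction being discussed, assuming arrow-rs's CastOptions/FormatOptions API; the EvalMode enum below is a simplified stand-in for Comet's real one, not its actual definition:

```rust
use arrow::compute::CastOptions;
use arrow::util::display::FormatOptions;

// Simplified stand-in for Comet's EvalMode (an assumption for this sketch).
#[allow(dead_code)]
enum EvalMode {
    Legacy,
    Ansi,
    Try,
}

// Spark-compatible timestamp format kept from Comet's default CAST_OPTIONS.
static TIMESTAMP_FORMAT: Option<&str> = Some("%Y-%m-%d %H:%M:%S%.f");

fn native_cast_options(eval_mode: &EvalMode) -> CastOptions<'static> {
    CastOptions {
        // In ANSI mode, arrow must error on overflow instead of silently
        // producing NULL, so safe is disabled.
        safe: !matches!(eval_mode, EvalMode::Ansi),
        // Keep the timestamp formats rather than FormatOptions::default(),
        // which would set both of them to None (see the comment above).
        format_options: FormatOptions::new()
            .with_timestamp_format(TIMESTAMP_FORMAT)
            .with_timestamp_tz_format(TIMESTAMP_FORMAT),
    }
}

fn main() {
    let opts = native_cast_options(&EvalMode::Ansi);
    assert!(!opts.safe); // ANSI mode: overflow becomes an error, not NULL
}
```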
Fair enough. (The format options are used only to make the cast of timestamp to string compatible with Spark, and are not needed anywhere else), but I guess it is a good idea to be consistent everywhere.
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@             Coverage Diff              @@
##              main    #1385       +/-   ##
=============================================
- Coverage     56.12%   39.50%   -16.63%
- Complexity      976     2085     +1109
=============================================
  Files           119      265      +146
  Lines        11743    61597    +49854
  Branches      2251    13092    +10841
=============================================
+ Hits          6591    24335    +17744
- Misses        4012    32697    +28685
- Partials      1140     4565     +3425

☔ View full report in Codecov by Sentry.
Mostly looks good, thank you @himadripal, just minor comments.
// for comet decimal conversion throws ArrowError(string) from arrow - across spark versions the message dont match.
if (sparkMessage.contains("cannot be represented as")) {
  assert(
    sparkException.getMessage
      .replace(".WITH_SUGGESTION] ", "]")
      .startsWith(cometMessage))
} else if (CometSparkSessionExtensions.isSpark34Plus) {
  // for Spark 3.4 we expect to reproduce the error message exactly
  assert(cometMessage == sparkMessage)
  cometMessage.contains("cannot be represented as") || cometMessage.contains(
    "too large to store"))
} else {
There are message modifications below per Spark version. Would you mind updating them instead of creating another if branch?
… double to decimal and a few other paths still use Spark, hence generate the Spark error message.
// for comet decimal conversion throws ArrowError(string) from arrow - across spark versions the message dont match.
if (sparkMessage.contains("cannot be represented as")) {
  cometMessage.contains("cannot be represented as") || cometMessage.contains(
    "too large to store")
} else {
I think we still need to remove this new if block and update the test cases below. This new block may still pass with cometMessage.contains("cannot be represented as"), which seems to be an indication of Spark cast instead of native cast.
When I removed the branch from the top, the test failed for double to decimal conversion with the allow-incompatible flag; I think that is still using Spark cast. Hence I had to put it back.
Another way to remove the if block is to convert the error message to make it similar to Spark's. This is the place where we define Spark error messages: https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/error.rs#L36. You can check how these are used.
Or you can move the message check below and switch the expected messages based on the fromType/toType.
One way or the other; otherwise we cannot confidently say that the test is passing due to the "cannot be represented as" message or the "too large to store" message.
> Another way to remove the if block is to convert the error message to make it similar to Spark's. This is the place that we are defining Spark error messages: https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/error.rs#L36. You can check how these are used.

This one I checked. The problem here is that from native execution we get back Arrow(ArrowError), which only has a string; precision and scale information is not present. Also, to construct an error message from a string, we need to check for a specific string in the message. We can try to change the ArrowError to have parameters, but that will be a big change.
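To illustrate the trade-off being discussed, a rough sketch of the message-rewriting idea. The exact shape of arrow's overflow message and the Spark-style wording are assumptions here, and the sketch makes the limitation visible: the scale cannot be recovered from the string alone.

```rust
use regex::Regex;

// Sketch only: rewrite arrow's decimal-overflow message into Spark-like
// wording. Assumes arrow emits something like
// "1234567 is too large to store in a Decimal128 of precision 5",
// which carries the precision but not the scale.
fn to_spark_like_message(arrow_msg: &str) -> Option<String> {
    let re = Regex::new(
        r"(?P<value>\S+) is too large to store in a Decimal\d+ of precision (?P<p>\d+)",
    )
    .ok()?;
    let caps = re.captures(arrow_msg)?;
    // The Spark error class name below is illustrative; the scale slot stays
    // unknown because the ArrowError string does not contain it.
    Some(format!(
        "[NUMERIC_VALUE_OUT_OF_RANGE] {} cannot be represented as Decimal({}, ?)",
        &caps["value"], &caps["p"]
    ))
}
```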
> Or you can move the message check below and switch the expected messages based on the fromType/toType.

I'll try this.
> One or the other way. Otherwise we cannot confidently say that the test is passing due to the "cannot be represented as" message or the "too large to store" message.

I'm contemplating creating a different test/check function for the decimal to decimal test.
Just a thought. For the former approach, can we get precision info similar to https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/conversion_funcs/cast.rs#L932?
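As a sketch of that thought: the cast site knows the target decimal type, so it could build the Spark-style message directly from the target precision and scale instead of parsing arrow's error string. The helper name and message wording below are illustrative, not Comet's actual API.

```rust
use arrow::datatypes::DataType;

// Hypothetical helper: construct a Spark-like overflow message from the
// target decimal type, which the cast site already has in hand. This avoids
// the lossy round-trip through the ArrowError string discussed above.
fn spark_overflow_message(value: &str, to_type: &DataType) -> Option<String> {
    match to_type {
        DataType::Decimal128(precision, scale) | DataType::Decimal256(precision, scale) => {
            Some(format!(
                "[NUMERIC_VALUE_OUT_OF_RANGE] {value} cannot be represented as Decimal({precision}, {scale})."
            ))
        }
        _ => None,
    }
}
```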
@@ -69,7 +69,8 @@ object GenerateDocs {
       w.write("|-|-|-|\n".getBytes)
       for (fromType <- CometCast.supportedTypes) {
         for (toType <- CometCast.supportedTypes) {
-          if (Cast.canCast(fromType, toType) && fromType != toType) {
+          if (Cast.canCast(fromType, toType) && (fromType != toType || fromType.typeName
@andygrove please check - I added this exception for decimal
Completes #375
Which issue does this PR close?
Closes #.
Rationale for this change
What changes are included in this PR?
How are these changes tested?