fix: enable full decimal to decimal support #1385

Status: Open. Wants to merge 5 commits into main.

Conversation

@himadripal (Contributor) commented on Feb 11, 2025:

Completes #375

  • enable decimal-to-decimal casts
  • remove the hard-coded CastOptions passed to native execution
  • fix castTest to match Arrow's invalid argument error against Spark's number-out-of-range error

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Use a regex to match Arrow's invalid argument error.
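The check itself lives in the Scala test suite; as a rough sketch of the idea in Rust (the message text and pattern below are illustrative, not the PR's actual regex), the point is to match the shape of Arrow's invalid-argument overflow message rather than the exact value:

    use regex::Regex;

    fn main() {
        // Match Arrow's overflow wording without pinning the exact value or
        // precision, so the assertion survives across different inputs.
        let re = Regex::new(r"too large to store in a Decimal\d+ of precision \d+")
            .unwrap();
        assert!(re.is_match(
            "Invalid argument error: 12345 is too large to store in a Decimal128 of precision 4"
        ));
    }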
@@ -872,6 +872,13 @@ fn cast_array(
     let array = array_with_timezone(array, cast_options.timezone.clone(), Some(to_type))?;
     let from_type = array.data_type().clone();

+    let native_cast_options: CastOptions = CastOptions {
+        safe: !matches!(cast_options.eval_mode, EvalMode::Ansi), // take safe mode from cast_options passed
+        format_options: FormatOptions::new()
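For context, here is a self-contained sketch of what the truncated expression above plausibly expands to, assuming arrow-rs's CastOptions/FormatOptions API and a local stand-in for Comet's EvalMode enum:

    use arrow::compute::CastOptions;
    use arrow::util::display::FormatOptions;

    // Stand-in for Comet's EvalMode, declared here only so the sketch
    // compiles on its own.
    enum EvalMode {
        Legacy,
        Ansi,
        Try,
    }

    // In ANSI mode Spark must raise on overflow, so the Arrow cast is made
    // unsafe (safe: false) to surface an error instead of producing NULL;
    // otherwise the safe cast yields NULL for out-of-range values.
    fn native_cast_options(eval_mode: &EvalMode) -> CastOptions<'static> {
        CastOptions {
            safe: !matches!(eval_mode, EvalMode::Ansi),
            format_options: FormatOptions::new()
                .with_timestamp_format(Some("%Y-%m-%d %H:%M:%S%.f"))
                .with_timestamp_tz_format(Some("%Y-%m-%d %H:%M:%S%.f")),
        }
    }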
Contributor:
I think one can use the default value defined for FormatOptions here.

@himadripal (author) commented Feb 11, 2025:
The default CAST_OPTIONS that this native_cast_options replaces had these two fields set:

    static TIMESTAMP_FORMAT: Option<&str> = Some("%Y-%m-%d %H:%M:%S%.f");

    timestamp_format: TIMESTAMP_FORMAT,
    timestamp_tz_format: TIMESTAMP_FORMAT,

If we change it to the default, the FormatOptions::default() implementation sets these to:

            timestamp_format: None,
            timestamp_tz_format: None,

Hence I kept it as it is defined inside Comet's default CAST_OPTIONS.
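In other words, a minimal contrast, assuming arrow-rs's FormatOptions API:

    use arrow::util::display::FormatOptions;

    fn main() {
        // FormatOptions::default() leaves both timestamp formats as None,
        // so switching to it would drop the Spark-compatible layout.
        let _defaults = FormatOptions::default();

        // What Comet's CAST_OPTIONS pins instead:
        let _spark_compatible = FormatOptions::new()
            .with_timestamp_format(Some("%Y-%m-%d %H:%M:%S%.f"))
            .with_timestamp_tz_format(Some("%Y-%m-%d %H:%M:%S%.f"));
    }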

Contributor:
Fair enough. (The format options are only used to make the cast of timestamp to string compatible with Spark, and are not needed anywhere else.) But I guess it is a good idea to be consistent everywhere.

@codecov-commenter commented Feb 11, 2025:

Codecov Report

Attention: Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.

Project coverage is 39.50%. Comparing base (f09f8af) to head (5cce0c4).
Report is 45 commits behind head on main.

Files with missing lines | Patch % | Lines
...src/main/scala/org/apache/comet/GenerateDocs.scala | 0.00% | 2 Missing ⚠️
...scala/org/apache/comet/expressions/CometCast.scala | 50.00% | 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #1385       +/-   ##
=============================================
- Coverage     56.12%   39.50%   -16.63%     
- Complexity      976     2085     +1109     
=============================================
  Files           119      265      +146     
  Lines         11743    61597    +49854     
  Branches       2251    13092    +10841     
=============================================
+ Hits           6591    24335    +17744     
- Misses         4012    32697    +28685     
- Partials       1140     4565     +3425     


@kazuyukitanimura changed the title from "enable full decimal to decimal support" to "fix: enable full decimal to decimal support" on Feb 14, 2025.
@kazuyukitanimura (Contributor) reviewed:
Mostly looks good, thank you @himadripal, just minor comments.

Comment on lines 1132 to 1137 (diff excerpt; old and new lines interleaved as rendered):

    // for comet decimal conversion throws ArrowError(string) from arrow - across spark versions the message dont match.
    if (sparkMessage.contains("cannot be represented as")) {
      assert(
        sparkException.getMessage
          .replace(".WITH_SUGGESTION] ", "]")
          .startsWith(cometMessage))
    } else if (CometSparkSessionExtensions.isSpark34Plus) {
      // for Spark 3.4 we expect to reproduce the error message exactly
      assert(cometMessage == sparkMessage)
      cometMessage.contains("cannot be represented as") || cometMessage.contains(
        "too large to store"))
    } else {
Contributor:
There are message modifications below per Spark version. Would you mind updating them instead of creating another if branch?

… double to decimal and a few other paths still use Spark, hence generate Spark error messages.
Comment on lines +1132 to 1136:

    // for comet decimal conversion throws ArrowError(string) from arrow - across spark versions the message dont match.
    if (sparkMessage.contains("cannot be represented as")) {
      cometMessage.contains("cannot be represented as") || cometMessage.contains(
        "too large to store")
    } else {
Contributor:
I think we still need to remove this new if block and update the test cases below. This new block may still pass via cometMessage.contains("cannot be represented as"), which seems to be an indication of a Spark cast instead of a native cast.

@himadripal (author) commented Feb 21, 2025:
When I removed this branch, the test failed for double-to-decimal conversion with the allow-incompatible flag; I think that path still uses Spark cast. Hence I had to put it back.

Contributor:
Another way to remove the if block is to convert the error message to make it similar to Spark's. This is where we define Spark error messages: https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/error.rs#L36. You can check how these are used.

Or you can move the message check below and switch the expected messages based on the fromType/toType.

One way or the other; otherwise we cannot confidently say whether the test is passing due to the "cannot be represented as" message or the "too large to store" message.

@himadripal (author) commented Feb 25, 2025:

> Another way to remove the if block is to convert the error message to make it similar to Spark's. This is where we define Spark error messages: https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/error.rs#L36. You can check how these are used.

This one I checked. The problem is that from native execution we get back Arrow(ArrowError), which only carries a string; the precision and scale information is not present. And to construct a Spark error message from a string, we would need to check for specific substrings in the message.
We could change ArrowError to carry parameters, but that would be a big change.
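A sketch of what that string inspection would look like (hypothetical helper; the rewrite target wording is approximate, and the real Spark-style messages are defined in native/spark-expr/src/error.rs):

    use arrow::error::ArrowError;

    // Hypothetical helper: map Arrow's string-only error onto Spark-like
    // wording. Precision and scale cannot be recovered here, which is the
    // limitation described above.
    fn spark_like_message(err: &ArrowError) -> Option<String> {
        match err {
            ArrowError::InvalidArgumentError(msg) if msg.contains("too large to store") => {
                // Crude substring rewrite toward Spark's phrasing.
                Some(msg.replace("is too large to store in", "cannot be represented as"))
            }
            _ => None,
        }
    }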

@himadripal (author):

> Or you can move the message check below and switch the expected messages based on the fromType/toType.

I'll try this.

@himadripal (author):

> One way or the other; otherwise we cannot confidently say whether the test is passing due to the "cannot be represented as" message or the "too large to store" message.

I'm contemplating creating a separate test/check function for the decimal-to-decimal tests.

Contributor:
Just a thought: for the former approach, can we get precision info similar to https://github.com/apache/datafusion-comet/blob/main/native/spark-expr/src/conversion_funcs/cast.rs#L932?
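That would look roughly like the following (hypothetical function; the point is that at the cast site the target precision and scale are in scope, so a Spark-shaped message can be built directly):

    use arrow::error::ArrowError;

    // Hypothetical: raise the error with Spark's "cannot be represented as"
    // wording at the point where precision and scale are known, instead of
    // letting Arrow's generic overflow message bubble up.
    fn decimal_out_of_range(value: &str, precision: u8, scale: i8) -> ArrowError {
        ArrowError::InvalidArgumentError(format!(
            "{value} cannot be represented as Decimal({precision}, {scale})"
        ))
    }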

@@ -69,7 +69,8 @@ object GenerateDocs {
       w.write("|-|-|-|\n".getBytes)
       for (fromType <- CometCast.supportedTypes) {
         for (toType <- CometCast.supportedTypes) {
-          if (Cast.canCast(fromType, toType) && fromType != toType) {
+          if (Cast.canCast(fromType, toType) && (fromType != toType || fromType.typeName
@himadripal (author):
@andygrove please check: I added this exception for decimal.
