Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFusion should support casting strings such as "4e7" to decimal #10315

Open
andygrove opened this issue Apr 30, 2024 · 4 comments · May be fixed by apache/arrow-rs#7179 or apache/arrow-rs#7191
Open
Labels
enhancement New feature or request

Comments

@andygrove
Copy link
Member

andygrove commented Apr 30, 2024

Is your feature request related to a problem or challenge?

DataFusion supports casting the string 4e7 to float but not to decimal. This is inconsistent with Postgres (and Apache Spark).

Postgres

postgres=# select cast('4e7' as float);
  float8  
----------
 40000000
(1 row)

postgres=# select cast('4e7' as decimal(10,2));
   numeric   
-------------
 40000000.00

Apache Spark

scala> spark.sql("select cast('4e7' as float)").show
+------------------+
|CAST(4e7 AS FLOAT)|
+------------------+
|             4.0E7|
+------------------+


scala> spark.sql("select cast('4e7' as decimal(10,2))").show
+--------------------------+
|CAST(4e7 AS DECIMAL(10,2))|
+--------------------------+
|               40000000.00|
+--------------------------+

DataFusion

DataFusion CLI v37.0.0
❯ select cast('4e7' as float);
+-------------+
| Utf8("4e7") |
+-------------+
| 40000000.0  |
+-------------+
1 row in set. Query took 0.010 seconds.

❯ select cast('4e7' as decimal(10,2));
Arrow error: Cast error: Cannot cast string '4e7' to value of Decimal128(38, 10) type

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

@andygrove andygrove added enhancement New feature or request good first issue Good for newcomers and removed good first issue Good for newcomers labels Apr 30, 2024
@Omega359
Copy link
Contributor

Omega359 commented May 1, 2024

I suspect this issue is actually in arrow_cast - the error message seems to come from https://github.com/apache/arrow-rs/blob/ada986c7ec8f8fe4f94235c8aaeba4995392ee72/arrow-cast/src/cast.rs#L2753

@tustvold
Copy link
Contributor

tustvold commented May 1, 2024

I believe this will be supported in the next arrow release

@himadripal
Copy link
Contributor

himadripal commented Dec 18, 2024

I'm working on this issue and need help here -
I added test case to debug, at the moment string to decimal conversion is using parse_string_to_decimal_native, I did not find support for e notation in this, but there is a function parse_decimal that does support e-notation.I changed the code to use parse_decimal and it works fine for cases with 4e+7 and looks like there is a bug in the parse_decimal when i provide 4e7 without + or -, for example this test case -

#[test]
fn test_cast_with_options_utf8_to_decimal() {
    let array = StringArray::from(vec!["4e7"]);
    let result = cast_with_options(
        &array,
        &DataType::Decimal128(10,2),
        &CastOptions {
            safe: false,
            format_options: FormatOptions::default(),
        },
    ).unwrap();
    let output_array = result.as_any().downcast_ref::<Decimal128Array>();
    assert_eq!(output_array.unwrap().value(0) , 40000000);
}
returns 4000000000  so it expands till the precision value.

should we replace parse_string_to_decimal_native with parse_decimal and fix this bug there?

@andygrove
Copy link
Member Author

At first glance, parse_decimal looks more mature and more efficient than parse_string_to_decimal_native.

For example, parse_string_to_decimal_native has some expensive operations that are avoided in parse_decimal:

let value_str = value_str.trim();
let parts: Vec<&str> = value_str.split('.').collect();

It does seem to make sense to do what you suggest and use parse_decimal and fix the bug there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants