Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Velox doesn't support legacy date format behavior in Spark SQL #10354

Open
5 of 17 tasks
NEUpanning opened this issue Jul 1, 2024 · 16 comments
Open
5 of 17 tasks

Velox doesn't support legacy date format behavior in Spark SQL #10354

NEUpanning opened this issue Jul 1, 2024 · 16 comments
Labels
bug Something isn't working spark-functions triage Newly created issue that needs attention.

Comments

@NEUpanning
Copy link
Contributor

NEUpanning commented Jul 1, 2024

Bug description

For backward capability of parsing/formatting of timestamp/date strings expression behavior and adoption of new behaviors. Spark adds a setting spark.sql.legacy.timeParserPolicy to indicate whether using legacy behavior. In legacy behavior, Spark does parse/format date using SimpleDateFormat that has different standard and behavior with Velox Joda. For example, SimpleDateFormat doesn't performs strict checking for expression's input and Velox doesn't support this when using unix_timestamp function, there is a related gluten issue : link. There are more result mismatch cases include gluten#7109 gluten#7069

The main difference between Spark SimpleDateFormat and Velox Joda

SimpleDateFormat Letter vs Joda (Velox) Letter

Field (SimpleDateFormat) Field (Joda) Desc Velox Support
  C century of era (>=0)
F Day of week in month Unsupported
L M Month in year (standalone form)
M Month in year (context sensitive) Unsupported
W Week in month #10511
  Y year of era (>=0)
Y x Week year Unsupported
Z Z Time zone Unsupported
u e day of week

Range of field value

SimpleDateFormat in non-lenient mode:

/*
     * <pre>
     *                            Greatest       Least
     * Field name        Minimum   Minimum     Maximum     Maximum
     * ----------        -------   -------     -------     -------
     * ERA                     0         0           1           1
     * YEAR                    1         1   292269054   292278994
     * MONTH                   0         0          11          11
     * WEEK_OF_YEAR            1         1          52*         53
     * WEEK_OF_MONTH           0         0           4*          6
     * DAY_OF_MONTH            1         1          28*         31
     * DAY_OF_YEAR             1         1         365*        366
     * DAY_OF_WEEK             1         1           7           7
     * DAY_OF_WEEK_IN_MONTH    1         1           4*          6
     * AM_PM                   0         0           1           1
     * HOUR                    0         0          11          11
     * HOUR_OF_DAY             0         0          23          23
     * MINUTE                  0         0          59          59
     * SECOND                  0         0          59          59
     * MILLISECOND             0         0         999         999
     * ZONE_OFFSET        -13:00    -13:00       14:00       14:00
     * DST_OFFSET           0:00      0:00        0:20        2:00
     * </pre>
     * *: depends on the Gregorian change date
     */

Behavior for invalid format

select from_unixtime(time, "AA") from test
spark result:null
gluten result:Error

Calendar supporting

SimpleDateFormat supports both the Julian and Gregorian calendar systems with the support of a single discontinuity. But Velox Joda only supports Gregorian Calendar.

Behavior when not cosuming entire string

spark result : result value
gluten result : Error
gluten issue

Obey the number of pattern letters for parsing

SimpleDateFormat obeys the number of pattern letters for parsing when the next pattern is numeric field, but Velox always obeys the number of pattern letters.

Road map

@NEUpanning NEUpanning added bug Something isn't working triage Newly created issue that needs attention. labels Jul 1, 2024
@NEUpanning
Copy link
Contributor Author

Maybe we should add support for legacy date format behavior. @pedroerp @rui-mo What do you think? Thanks!

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Jul 1, 2024

@NEUpanning, is Spark's legacy policy widely used in your production environment?

@NEUpanning
Copy link
Contributor Author

NEUpanning commented Jul 1, 2024

@PHILO-HE Yes. In Spark upgrades, we need to use legacy policy for backward compatibility rather than pushing users to accept the new behavior. Here is part of legacy policies we are using :

spark.sql.storeAssignmentPolicy Legacy
spark.sql.legacy.createArrayOneBasedIndex false
spark.sql.legacy.setCommandRejectsSparkCoreConfs false
spark.sql.legacy.ctePrecedencePolicy LEGACY
spark.sql.legacy.createEmptyCollectionUsingStringType true
spark.sql.legacy.exponentLiteralAsDecimal.enabled true
spark.sql.legacy.parser.havingWithoutGroupByAsWhere true
spark.sql.legacy.timeParserPolicy LEGACY
spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue true
spark.sql.legacy.fromDayTimeString.enabled true
spark.sql.legacy.allowHashOnMapType true
spark.sql.legacy.allowNegativeScaleOfDecimal true
spark.sql.legacy.followThreeValuedLogicInArrayExists false
spark.sql.legacy.json.allowEmptyString.enabled true
spark.sql.legacy.setopsPrecedence.enabled true
spark.sql.legacy.literal.pickMinimumPrecision false
spark.sql.legacy.sizeOfNull true
spark.sql.legacy.typeCoercion.datetimeToString.enabled true
spark.sql.legacy.doLooseUpcast true
spark.sql.legacy.addMonthsAdjustLastDay true
spark.sql.legacy.statisticalAggregate true

@pedroerp
Copy link
Contributor

pedroerp commented Jul 1, 2024

@NEUpanning: we have recently added support for different date/timestamp parsing semantics; are these not enough to follow the Spark parsing semantic you described? If not, could you highlight some of the differences?

https://github.com/facebookincubator/velox/blob/main/velox/type/TimestampConversion.h#L48

Cc: @mbasmanova

@NEUpanning
Copy link
Contributor Author

@pedroerp Thanks for your reply. UnixTimestampParseFunction is using DateTimeFormatter to parse date : source code link. DateTimeFormatter doesn't support different date/timestamp parsing semantics like TimestampConversion.

@NEUpanning
Copy link
Contributor Author

@kecookier

@pedroerp
Copy link
Contributor

@NEUpanning DateTimeFormatter has different parsing "types". My point was that, even if none of these types support the exact parsing semantics you need today, they could be extended to do so:

https://github.com/facebookincubator/velox/blob/main/velox/functions/lib/DateTimeFormatter.h#L26

@NEUpanning
Copy link
Contributor Author

@pedroerp I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat to support legacy date parsing behavior in unix_timestamp function. And i think we need to add a flag to indicate which DateTimeFormatterType to use in Spark unix_timestamp function. How do you think?

@pedroerp
Copy link
Contributor

Sounds reasonable. Do you have any examples on how the parsing compares with Joda parsing?

Cc: @mbasmanova who recently enhanced this part of the code.

@NEUpanning
Copy link
Contributor Author

NEUpanning commented Jul 16, 2024

@pedroerp Here is an example I wrote from scratch.

code :

import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;import java.util.Date;

public class Test {
    static SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");
    static DateTimeFormatter jodaFormatter = DateTimeFormat.forPattern("yyyy-MM-dd");
    public static void invalidMonth() {
        // Invalid month (13)
        String dateString = "2023-13-01";

        // SimpleDateFormat can parse the invalid date
        try {
            Date date = simpleDateFormat.parse(dateString);
            System.out.println("SimpleDateFormat parsed date: " + date);
        } catch (ParseException e) {
            System.out.println("SimpleDateFormat: " + e.getMessage());
        }

        // Joda-Time will throw exception
        try {
            DateTime dateTime = jodaFormatter.parseDateTime(dateString);
            System.out.println("Joda-Time parsed date: " + dateTime);
        } catch (IllegalArgumentException e) {
            System.out.println("Joda-Time: " + e.getMessage());
        }
    }

    public static void dateWithExtraCharacters() {
        // Date with extra Characters
        String dateString = "2023-12-01 00:00:00";

        // SimpleDateFormat does not use all characters up to the end of the string
        try {
            Date date = simpleDateFormat.parse(dateString);
            System.out.println("SimpleDateFormat parsed date: " + date);
        } catch (ParseException e) {
            System.out.println("SimpleDateFormat: " + e.getMessage());
        }

        // Joda-Time consumes all characters
        DateTimeFormatter jodaFormatter = DateTimeFormat.forPattern("yyyy-MM-dd");
        try {
            DateTime dateTime = jodaFormatter.parseDateTime(dateString);
            System.out.println("Joda-Time parsed date: " + dateTime);
        } catch (IllegalArgumentException e) {
            System.out.println("Joda-Time: " + e.getMessage());
        }
    }

    public static void main(String[] args) throws ParseException {
        invalidMonth();
        dateWithExtraCharacters();
    }
}

output :

SimpleDateFormat parsed date: Mon Jan 01 00:00:00 CST 2024
Joda-Time: Cannot parse "2023-13-01": Value 13 for monthOfYear must be in the range [1,12]
SimpleDateFormat parsed date: Fri Dec 01 00:00:00 CST 2023
Joda-Time: Invalid format: "2023-12-01 00:00:00" is malformed at " 00:00:00"

@NEUpanning
Copy link
Contributor Author

@pedroerp I'm implementing new DateTimeFormatterType and i see DateTimeFormatSpecifier doesn't support Week in month pattern used by java.text.SimpleDateFormat. Should i implement it in another PR or in the PR that addresses this issue?

@mbasmanova
Copy link
Contributor

Should i implement it in another PR

@NEUpanning A separate PR would be nice. Smaller PRs are easier to review. Thanks.

@ccat3z
Copy link
Contributor

ccat3z commented Sep 4, 2024

Sounds reasonable. Do you have any examples on how the parsing compares with Joda parsing?

Cc: @mbasmanova who recently enhanced this part of the code.

Here is the difference between joda and SimpleDateFormat:

Field (SimpleDateFormat) Field (Joda) Desc Velox Support
  C century of era (>=0)  
F   Day of week in month Unsupported
L M Month in year (standalone form)  
M Month in year (context sensitive) Unsupported
W   Week in month #10511
  Y year of era (>=0)  
Y x Week year Unsupported
Z Z Time zone Unsupported
u e day of week

@pedroerp I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat to support legacy date parsing behavior in unix_timestamp function. And i think we need to add a flag to indicate which DateTimeFormatterType to use in Spark unix_timestamp function. How do you think?

I'm in favor of this approach too, because there are conflicting parts between the two formats. cc @rui-mo, @NEUpanning

@rui-mo
Copy link
Collaborator

rui-mo commented Sep 4, 2024

I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat

@ccat3z Thanks for providing the details. Sounds good to me generally. What are the functions that are impacted by this date format issue, if you could list them?

@ccat3z
Copy link
Contributor

ccat3z commented Sep 4, 2024

I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat

@ccat3z Thanks for providing the details. Sounds good to me generally. What are the functions that are impacted by this date format issue, if you could list them?

from_unixtime, to_unix_timestamp

@NEUpanning
Copy link
Contributor Author

We encountered more result mismatch issues due to Velox joda Dateformatter not aligning with Spark SimpleDateFormat. I did some research on Spark SimpleDateFormat. And I updated the issue description to include the main difference between Spark SimpleDateFormat and Velox Joda and the tasks to resolve this issue.
cc @rui-mo @mbasmanova @kecookier @pedroerp

facebook-github-bot pushed a commit that referenced this issue Sep 24, 2024
Summary:
Introduce new DateTimeFormatterType called 'LENIENT_SIMPLE' and 'STRICT_SIMPLE' that are used when Spark legacy time parser policy is enabled for java.text.SimpleDateFormat in lenient and non-lenient mode. The implementation of 'LENIENT_SIMPLE' and 'STRICT_SIMPLE' is just copy from Joda in this PR and further PR will change the behavior to align with Spark.
Spark functions using strict mode(lenient=false): 'from_unixtime', 'unix_timestamp', 'make_date', 'to_unix_timestamp', 'date_format'.
Spark functions using lenient mode: cast timestamp to string.
'casting timestamp to string' will use LENIENT_SIMPLE only after the behavior of LENIENT_SIMPLE is aligned with Spark since it does not use Joda DateFormatter to do cast.

Relates #10354

Pull Request resolved: #10966

Reviewed By: xiaoxmeng

Differential Revision: D63261575

Pulled By: Yuhta

fbshipit-source-id: 20ebdc1ad38a43d7064e5c232c9d52d361b7f474
facebook-github-bot pushed a commit that referenced this issue Oct 26, 2024
Summary:
`java.text.SimpleDateFormat` supports using 'week of month' to parse/format
date. The specifier of 'week of month' is 'W'. Now DateTimeFormatter supports 3
group of fields specifying the day within the year. They are following
combinations:

```
year + week + dayOfWeek
year + dayOfYear
year + month + day
```
This PR introduces a new combination that is
`year + month + weekOfMonth + dayOfWeek` and adds support for "week of month"
in SimpleDateTimeFormatter.

Relates issue : #10354

Pull Request resolved: #11103

Reviewed By: Yuhta, amitkdutta

Differential Revision: D64920551

Pulled By: pedroerp

fbshipit-source-id: 3db9b6d33783aac0ee41791aeac96142e63fb22a
facebook-github-bot pushed a commit that referenced this issue Nov 7, 2024
… used (#11131)

Summary:
For Spark functions 'unix_timestamp', 'from_unixtime' and 'get_timestamp',
returns null for invalid datetime format when legacy date formatter is used.

Relating to #10354.

Pull Request resolved: #11131

Reviewed By: xiaoxmeng

Differential Revision: D65517767

Pulled By: mbasmanova

fbshipit-source-id: 863d3955caa64317c306458ff917e13e2593bc8f
facebook-github-bot pushed a commit that referenced this issue Nov 14, 2024
…1386)

Summary:
The Spark legacy datetime formatter allows parsing date from incomplete text,
seeing [code link](https://github.com/openjdk/jdk8/blob/master/jdk/src/share/classes/java/text/DateFormat.java#L351). This PR enables partial date parsing when the `LENIENT_SIMPLE`
or `STRICT_SIMPLE` datetime formatter is used.

Relates issues: #10354, [gluten#6227](apache/incubator-gluten#6227)

Pull Request resolved: #11386

Reviewed By: pedroerp

Differential Revision: D65948039

Pulled By: Yuhta

fbshipit-source-id: 0d17084f723ebeaded7278178982b5a10d9f9fed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working spark-functions triage Newly created issue that needs attention.
Projects
None yet
Development

No branches or pull requests

6 participants