Velox doesn't support legacy date format behavior in Spark SQL #10354

NEUpanning · 2024-07-01T04:10:13Z

Bug description

For backward capability of parsing/formatting of timestamp/date strings expression behavior and adoption of new behaviors. Spark adds a setting spark.sql.legacy.timeParserPolicy to indicate whether using legacy behavior. In legacy behavior, Spark does parse/format date using SimpleDateFormat that has different standard and behavior with Velox Joda. For example, SimpleDateFormat doesn't performs strict checking for expression's input and Velox doesn't support this when using unix_timestamp function, there is a related gluten issue : link. There are more result mismatch cases include gluten#7109 gluten#7069

The main difference between Spark SimpleDateFormat and Velox Joda

SimpleDateFormat Letter vs Joda (Velox) Letter

Field (SimpleDateFormat)	Field (Joda)	Desc	Velox Support
	C	century of era (>=0)
F		Day of week in month	Unsupported
L	M	Month in year (standalone form)
M		Month in year (context sensitive)	Unsupported
W		Week in month	#10511
	Y	year of era (>=0)
Y	x	Week year	Unsupported
Z	Z	Time zone	Unsupported
u	e	day of week

Range of field value

SimpleDateFormat in non-lenient mode:

/*
     * <pre>
     *                            Greatest       Least
     * Field name        Minimum   Minimum     Maximum     Maximum
     * ----------        -------   -------     -------     -------
     * ERA                     0         0           1           1
     * YEAR                    1         1   292269054   292278994
     * MONTH                   0         0          11          11
     * WEEK_OF_YEAR            1         1          52*         53
     * WEEK_OF_MONTH           0         0           4*          6
     * DAY_OF_MONTH            1         1          28*         31
     * DAY_OF_YEAR             1         1         365*        366
     * DAY_OF_WEEK             1         1           7           7
     * DAY_OF_WEEK_IN_MONTH    1         1           4*          6
     * AM_PM                   0         0           1           1
     * HOUR                    0         0          11          11
     * HOUR_OF_DAY             0         0          23          23
     * MINUTE                  0         0          59          59
     * SECOND                  0         0          59          59
     * MILLISECOND             0         0         999         999
     * ZONE_OFFSET        -13:00    -13:00       14:00       14:00
     * DST_OFFSET           0:00      0:00        0:20        2:00
     * </pre>
     * *: depends on the Gregorian change date
     */

Behavior for invalid format

select from_unixtime(time, "AA") from test
spark result:null
gluten result:Error

Calendar supporting

SimpleDateFormat supports both the Julian and Gregorian calendar systems with the support of a single discontinuity. But Velox Joda only supports Gregorian Calendar.

Behavior when not cosuming entire string

spark result : result value
gluten result : Error
gluten issue

Obey the number of pattern letters for parsing

SimpleDateFormat obeys the number of pattern letters for parsing when the next pattern is numeric field, but Velox always obeys the number of pattern letters.

Road map

The text was updated successfully, but these errors were encountered:

NEUpanning · 2024-07-01T04:15:19Z

Maybe we should add support for legacy date format behavior. @pedroerp @rui-mo What do you think? Thanks!

PHILO-HE · 2024-07-01T07:42:16Z

@NEUpanning, is Spark's legacy policy widely used in your production environment?

NEUpanning · 2024-07-01T08:13:46Z

@PHILO-HE Yes. In Spark upgrades, we need to use legacy policy for backward compatibility rather than pushing users to accept the new behavior. Here is part of legacy policies we are using :

spark.sql.storeAssignmentPolicy Legacy
spark.sql.legacy.createArrayOneBasedIndex false
spark.sql.legacy.setCommandRejectsSparkCoreConfs false
spark.sql.legacy.ctePrecedencePolicy LEGACY
spark.sql.legacy.createEmptyCollectionUsingStringType true
spark.sql.legacy.exponentLiteralAsDecimal.enabled true
spark.sql.legacy.parser.havingWithoutGroupByAsWhere true
spark.sql.legacy.timeParserPolicy LEGACY
spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue true
spark.sql.legacy.fromDayTimeString.enabled true
spark.sql.legacy.allowHashOnMapType true
spark.sql.legacy.allowNegativeScaleOfDecimal true
spark.sql.legacy.followThreeValuedLogicInArrayExists false
spark.sql.legacy.json.allowEmptyString.enabled true
spark.sql.legacy.setopsPrecedence.enabled true
spark.sql.legacy.literal.pickMinimumPrecision false
spark.sql.legacy.sizeOfNull true
spark.sql.legacy.typeCoercion.datetimeToString.enabled true
spark.sql.legacy.doLooseUpcast true
spark.sql.legacy.addMonthsAdjustLastDay true
spark.sql.legacy.statisticalAggregate true

pedroerp · 2024-07-01T15:36:55Z

@NEUpanning: we have recently added support for different date/timestamp parsing semantics; are these not enough to follow the Spark parsing semantic you described? If not, could you highlight some of the differences?

https://github.com/facebookincubator/velox/blob/main/velox/type/TimestampConversion.h#L48

Cc: @mbasmanova

NEUpanning · 2024-07-02T02:26:18Z

@pedroerp Thanks for your reply. UnixTimestampParseFunction is using DateTimeFormatter to parse date : source code link. DateTimeFormatter doesn't support different date/timestamp parsing semantics like TimestampConversion.

NEUpanning · 2024-07-05T11:50:13Z

@kecookier

pedroerp · 2024-07-15T16:56:58Z

@NEUpanning DateTimeFormatter has different parsing "types". My point was that, even if none of these types support the exact parsing semantics you need today, they could be extended to do so:

https://github.com/facebookincubator/velox/blob/main/velox/functions/lib/DateTimeFormatter.h#L26

NEUpanning · 2024-07-16T03:41:20Z

@pedroerp I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat to support legacy date parsing behavior in unix_timestamp function. And i think we need to add a flag to indicate which DateTimeFormatterType to use in Spark unix_timestamp function. How do you think?

pedroerp · 2024-07-16T03:47:48Z

Sounds reasonable. Do you have any examples on how the parsing compares with Joda parsing?

Cc: @mbasmanova who recently enhanced this part of the code.

NEUpanning · 2024-07-16T04:19:09Z

@pedroerp Here is an example I wrote from scratch.

code :

import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.joda.time.DateTime;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;import java.util.Date;

public class Test {
    static SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");
    static DateTimeFormatter jodaFormatter = DateTimeFormat.forPattern("yyyy-MM-dd");
    public static void invalidMonth() {
        // Invalid month (13)
        String dateString = "2023-13-01";

        // SimpleDateFormat can parse the invalid date
        try {
            Date date = simpleDateFormat.parse(dateString);
            System.out.println("SimpleDateFormat parsed date: " + date);
        } catch (ParseException e) {
            System.out.println("SimpleDateFormat: " + e.getMessage());
        }

        // Joda-Time will throw exception
        try {
            DateTime dateTime = jodaFormatter.parseDateTime(dateString);
            System.out.println("Joda-Time parsed date: " + dateTime);
        } catch (IllegalArgumentException e) {
            System.out.println("Joda-Time: " + e.getMessage());
        }
    }

    public static void dateWithExtraCharacters() {
        // Date with extra Characters
        String dateString = "2023-12-01 00:00:00";

        // SimpleDateFormat does not use all characters up to the end of the string
        try {
            Date date = simpleDateFormat.parse(dateString);
            System.out.println("SimpleDateFormat parsed date: " + date);
        } catch (ParseException e) {
            System.out.println("SimpleDateFormat: " + e.getMessage());
        }

        // Joda-Time consumes all characters
        DateTimeFormatter jodaFormatter = DateTimeFormat.forPattern("yyyy-MM-dd");
        try {
            DateTime dateTime = jodaFormatter.parseDateTime(dateString);
            System.out.println("Joda-Time parsed date: " + dateTime);
        } catch (IllegalArgumentException e) {
            System.out.println("Joda-Time: " + e.getMessage());
        }
    }

    public static void main(String[] args) throws ParseException {
        invalidMonth();
        dateWithExtraCharacters();
    }
}

output :

SimpleDateFormat parsed date: Mon Jan 01 00:00:00 CST 2024
Joda-Time: Cannot parse "2023-13-01": Value 13 for monthOfYear must be in the range [1,12]
SimpleDateFormat parsed date: Fri Dec 01 00:00:00 CST 2023
Joda-Time: Invalid format: "2023-12-01 00:00:00" is malformed at " 00:00:00"

NEUpanning · 2024-07-18T12:40:47Z

@pedroerp I'm implementing new DateTimeFormatterType and i see DateTimeFormatSpecifier doesn't support Week in month pattern used by java.text.SimpleDateFormat. Should i implement it in another PR or in the PR that addresses this issue?

mbasmanova · 2024-07-18T12:48:36Z

Should i implement it in another PR

@NEUpanning A separate PR would be nice. Smaller PRs are easier to review. Thanks.

ccat3z · 2024-09-04T02:52:22Z

Sounds reasonable. Do you have any examples on how the parsing compares with Joda parsing?

Cc: @mbasmanova who recently enhanced this part of the code.

Here is the difference between joda and SimpleDateFormat:

Field (SimpleDateFormat)	Field (Joda)	Desc	Velox Support
	C	century of era (>=0)
F		Day of week in month	Unsupported
L	M	Month in year (standalone form)
M		Month in year (context sensitive)	Unsupported
W		Week in month	#10511
	Y	year of era (>=0)
Y	x	Week year	Unsupported
Z	Z	Time zone	Unsupported
u	e	day of week

@pedroerp I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat to support legacy date parsing behavior in unix_timestamp function. And i think we need to add a flag to indicate which DateTimeFormatterType to use in Spark unix_timestamp function. How do you think?

I'm in favor of this approach too, because there are conflicting parts between the two formats. cc @rui-mo, @NEUpanning

rui-mo · 2024-09-04T03:39:22Z

I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat

@ccat3z Thanks for providing the details. Sounds good to me generally. What are the functions that are impacted by this date format issue, if you could list them?

ccat3z · 2024-09-04T04:00:58Z

I'd like to add a new DateTimeFormatterType called SIMPLE for Java's java.text.SimpleDateFormat

@ccat3z Thanks for providing the details. Sounds good to me generally. What are the functions that are impacted by this date format issue, if you could list them?

from_unixtime, to_unix_timestamp

NEUpanning · 2024-09-09T11:15:54Z

We encountered more result mismatch issues due to Velox joda Dateformatter not aligning with Spark SimpleDateFormat. I did some research on Spark SimpleDateFormat. And I updated the issue description to include the main difference between Spark SimpleDateFormat and Velox Joda and the tasks to resolve this issue.
cc @rui-mo @mbasmanova @kecookier @pedroerp

Summary: Introduce new DateTimeFormatterType called 'LENIENT_SIMPLE' and 'STRICT_SIMPLE' that are used when Spark legacy time parser policy is enabled for java.text.SimpleDateFormat in lenient and non-lenient mode. The implementation of 'LENIENT_SIMPLE' and 'STRICT_SIMPLE' is just copy from Joda in this PR and further PR will change the behavior to align with Spark. Spark functions using strict mode(lenient=false): 'from_unixtime', 'unix_timestamp', 'make_date', 'to_unix_timestamp', 'date_format'. Spark functions using lenient mode: cast timestamp to string. 'casting timestamp to string' will use LENIENT_SIMPLE only after the behavior of LENIENT_SIMPLE is aligned with Spark since it does not use Joda DateFormatter to do cast. Relates #10354 Pull Request resolved: #10966 Reviewed By: xiaoxmeng Differential Revision: D63261575 Pulled By: Yuhta fbshipit-source-id: 20ebdc1ad38a43d7064e5c232c9d52d361b7f474

Summary: `java.text.SimpleDateFormat` supports using 'week of month' to parse/format date. The specifier of 'week of month' is 'W'. Now DateTimeFormatter supports 3 group of fields specifying the day within the year. They are following combinations: ``` year + week + dayOfWeek year + dayOfYear year + month + day ``` This PR introduces a new combination that is `year + month + weekOfMonth + dayOfWeek` and adds support for "week of month" in SimpleDateTimeFormatter. Relates issue : #10354 Pull Request resolved: #11103 Reviewed By: Yuhta, amitkdutta Differential Revision: D64920551 Pulled By: pedroerp fbshipit-source-id: 3db9b6d33783aac0ee41791aeac96142e63fb22a

… used (#11131) Summary: For Spark functions 'unix_timestamp', 'from_unixtime' and 'get_timestamp', returns null for invalid datetime format when legacy date formatter is used. Relating to #10354. Pull Request resolved: #11131 Reviewed By: xiaoxmeng Differential Revision: D65517767 Pulled By: mbasmanova fbshipit-source-id: 863d3955caa64317c306458ff917e13e2593bc8f

…1386) Summary: The Spark legacy datetime formatter allows parsing date from incomplete text, seeing [code link](https://github.com/openjdk/jdk8/blob/master/jdk/src/share/classes/java/text/DateFormat.java#L351). This PR enables partial date parsing when the `LENIENT_SIMPLE` or `STRICT_SIMPLE` datetime formatter is used. Relates issues: #10354, [gluten#6227](apache/incubator-gluten#6227) Pull Request resolved: #11386 Reviewed By: pedroerp Differential Revision: D65948039 Pulled By: Yuhta fbshipit-source-id: 0d17084f723ebeaded7278178982b5a10d9f9fed

NEUpanning added bug Something isn't working triage Newly created issue that needs attention. labels Jul 1, 2024

mbasmanova added the spark-functions label Jul 18, 2024

NEUpanning mentioned this issue Sep 11, 2024

Introduce simple date time formatter #10966

Closed

NEUpanning mentioned this issue Oct 30, 2024

Allow partial date parsing when simple datetime formatter is used #11386

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Velox doesn't support legacy date format behavior in Spark SQL #10354

Velox doesn't support legacy date format behavior in Spark SQL #10354

NEUpanning commented Jul 1, 2024 •

edited

Loading

NEUpanning commented Jul 1, 2024

PHILO-HE commented Jul 1, 2024

NEUpanning commented Jul 1, 2024 •

edited

Loading

pedroerp commented Jul 1, 2024

NEUpanning commented Jul 2, 2024

NEUpanning commented Jul 5, 2024

pedroerp commented Jul 15, 2024

NEUpanning commented Jul 16, 2024

pedroerp commented Jul 16, 2024

NEUpanning commented Jul 16, 2024 •

edited

Loading

NEUpanning commented Jul 18, 2024

mbasmanova commented Jul 18, 2024

ccat3z commented Sep 4, 2024 •

edited

Loading

rui-mo commented Sep 4, 2024

ccat3z commented Sep 4, 2024

NEUpanning commented Sep 9, 2024

Velox doesn't support legacy date format behavior in Spark SQL #10354

Velox doesn't support legacy date format behavior in Spark SQL #10354

Comments

NEUpanning commented Jul 1, 2024 • edited Loading

Bug description

The main difference between Spark SimpleDateFormat and Velox Joda

SimpleDateFormat Letter vs Joda (Velox) Letter

Range of field value

Behavior for invalid format

Calendar supporting

Behavior when not cosuming entire string

Obey the number of pattern letters for parsing

Road map

NEUpanning commented Jul 1, 2024

PHILO-HE commented Jul 1, 2024

NEUpanning commented Jul 1, 2024 • edited Loading

pedroerp commented Jul 1, 2024

NEUpanning commented Jul 2, 2024

NEUpanning commented Jul 5, 2024

pedroerp commented Jul 15, 2024

NEUpanning commented Jul 16, 2024

pedroerp commented Jul 16, 2024

NEUpanning commented Jul 16, 2024 • edited Loading

NEUpanning commented Jul 18, 2024

mbasmanova commented Jul 18, 2024

ccat3z commented Sep 4, 2024 • edited Loading

rui-mo commented Sep 4, 2024

ccat3z commented Sep 4, 2024

NEUpanning commented Sep 9, 2024

NEUpanning commented Jul 1, 2024 •

edited

Loading

NEUpanning commented Jul 1, 2024 •

edited

Loading

NEUpanning commented Jul 16, 2024 •

edited

Loading

ccat3z commented Sep 4, 2024 •

edited

Loading