Introduce simple date time formatter (#10966)
Summary:
Introduce two new DateTimeFormatterType values, 'LENIENT_SIMPLE' and 'STRICT_SIMPLE', used when the Spark legacy time parser policy is enabled. They correspond to java.text.SimpleDateFormat in lenient and non-lenient mode, respectively. In this PR the implementation of both is copied from the Joda formatter; follow-up PRs will change the behavior to align with Spark.
Spark functions using strict mode (lenient=false): 'from_unixtime', 'unix_timestamp', 'make_date', 'to_unix_timestamp', 'date_format'.
Spark functions using lenient mode: cast timestamp to string.
'Cast timestamp to string' will use LENIENT_SIMPLE only after the behavior of LENIENT_SIMPLE is aligned with Spark, since the cast does not use the Joda date formatter.

Relates #10354

Pull Request resolved: #10966

Reviewed By: xiaoxmeng

Differential Revision: D63261575

Pulled By: Yuhta

fbshipit-source-id: 20ebdc1ad38a43d7064e5c232c9d52d361b7f474
NEUpanning authored and facebook-github-bot committed Sep 24, 2024
1 parent 83d6609 commit 35b79eb
Showing 7 changed files with 212 additions and 13 deletions.
9 changes: 9 additions & 0 deletions velox/core/QueryConfig.h
@@ -287,6 +287,11 @@ class QueryConfig {
/// The current spark partition id.
static constexpr const char* kSparkPartitionId = "spark.partition_id";

/// If true, simple date formatter is used for time formatting and parsing.
/// Joda date formatter is used by default.
static constexpr const char* kSparkLegacyDateFormatter =
"spark.legacy_date_formatter";

/// The number of local parallel table writer operators per task.
static constexpr const char* kTaskWriterCount = "task_writer_count";

@@ -741,6 +746,10 @@ class QueryConfig {
return value;
}

bool sparkLegacyDateFormatter() const {
return get<bool>(kSparkLegacyDateFormatter, false);
}

bool exprTrackCpuUsage() const {
return get<bool>(kExprTrackCpuUsage, false);
}
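
The getter added in this hunk reads a boolean setting and falls back to `false` when the key is absent. A minimal sketch of that lookup pattern follows. It assumes a hypothetical `MiniConfig` class backed by a string map; it is not the actual `QueryConfig` implementation, and the `"true"` string encoding is an assumption made for illustration.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a query config; not the Velox QueryConfig class.
class MiniConfig {
 public:
  static constexpr const char* kSparkLegacyDateFormatter =
      "spark.legacy_date_formatter";

  void set(const std::string& key, const std::string& value) {
    values_[key] = value;
  }

  // Returns the stored boolean, or 'defaultValue' when the key is absent.
  // Assumption: booleans are stored as the string "true" or anything else.
  bool getBool(const std::string& key, bool defaultValue) const {
    auto it = values_.find(key);
    if (it == values_.end()) {
      return defaultValue;
    }
    return it->second == "true";
  }

  // Mirrors the shape of the getter in the diff: default is false (Joda).
  bool sparkLegacyDateFormatter() const {
    return getBool(kSparkLegacyDateFormatter, false);
  }

 private:
  std::unordered_map<std::string, std::string> values_;
};
```

With this shape, callers that never set the key keep the Joda behavior, which matches the default described in the documentation change.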
7 changes: 7 additions & 0 deletions velox/docs/configs.rst
@@ -713,6 +713,13 @@ Spark-specific Configuration
- integer
-
- The current task's Spark partition ID. It's set by the query engine (Spark) prior to task execution.
* - spark.legacy_date_formatter
- bool
- false
- If true, `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_ date formatter is used for time formatting and parsing. Joda date formatter is used by default.
- Joda date formatter performs strict checking of its input and uses different pattern strings.
- For example, the timestamp 2015-07-22 10:00:00 cannot be parsed by Joda with the pattern yyyy-MM-dd because the pattern does not consume the whole input.
- Another example is the 'W' pattern (week in month), which Joda does not support. For more differences, see :issue:`10354`.
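
The "does not consume whole input" difference can be illustrated with a small sketch. The helper names below are hypothetical and the code only checks the digit and dash positions of a `yyyy-MM-dd` prefix; it is not how either Velox formatter is implemented. It contrasts a strict policy that rejects trailing characters with a prefix policy that ignores them, in the spirit of `java.text.DateFormat.parse(String)`, which is documented as possibly not using the entire input string.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Matches a "yyyy-MM-dd" shaped prefix of 'input'. Returns the number of
// characters consumed, or 0 if the prefix does not match.
// Illustrative only: checks digit/dash positions, not calendar validity.
std::size_t parseDatePrefix(const std::string& input) {
  static const std::string kShape = "dddd-dd-dd"; // 'd' marks a digit.
  if (input.size() < kShape.size()) {
    return 0;
  }
  for (std::size_t i = 0; i < kShape.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(input[i]);
    const bool ok = kShape[i] == 'd' ? std::isdigit(c) != 0 : c == '-';
    if (!ok) {
      return 0;
    }
  }
  return kShape.size();
}

// Strict policy: the whole input must be consumed by the pattern.
bool parsesStrictly(const std::string& input) {
  const std::size_t consumed = parseDatePrefix(input);
  return consumed != 0 && consumed == input.size();
}

// Prefix policy: trailing characters after a valid date are ignored.
bool parsesAsPrefix(const std::string& input) {
  return parseDatePrefix(input) != 0;
}
```

Under these assumptions, `parsesStrictly("2015-07-22 10:00:00")` is false while `parsesAsPrefix("2015-07-22 10:00:00")` is true, which is the kind of behavioral gap the configuration entry describes.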

Tracing
--------
9 changes: 7 additions & 2 deletions velox/docs/functions/spark/datetime.rst
@@ -82,7 +82,9 @@ These functions support TIMESTAMP and DATE input types.
Adjusts ``unixTime`` (elapsed seconds since UNIX epoch) to configured session timezone, then
converts it to a formatted time string according to ``format``. Only supports BIGINT type for
``unixTime``.
``unixTime``. Whether the `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
date formatter in strict (non-lenient) mode, which aligns with Spark legacy date parser behavior, or the
`Joda <https://www.joda.org/joda-time/>`_ date formatter is used depends on the ``spark.legacy_date_formatter`` configuration.
`Valid patterns for date format
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. Throws exception for
invalid ``format``. This function will convert input to milliseconds, and integer overflow is
@@ -285,7 +287,10 @@ These functions support TIMESTAMP and DATE input types.

.. spark:function:: unix_timestamp() -> integer
Returns the current UNIX timestamp in seconds.
Returns the current UNIX timestamp in seconds. Whether the
`Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_ date formatter in strict (non-lenient) mode,
which aligns with Spark legacy date parser behavior, or the `Joda <https://www.joda.org/joda-time/>`_ date formatter
is used depends on the ``spark.legacy_date_formatter`` configuration.

.. spark:function:: unix_timestamp(string) -> integer
:noindex:
125 changes: 125 additions & 0 deletions velox/functions/lib/DateTimeFormatter.cpp
@@ -1697,4 +1697,129 @@ std::shared_ptr<DateTimeFormatter> buildJodaDateTimeFormatter(
return builder.setType(DateTimeFormatterType::JODA).build();
}

std::shared_ptr<DateTimeFormatter> buildSimpleDateTimeFormatter(
const std::string_view& format,
bool lenient) {
VELOX_USER_CHECK(!format.empty(), "Format pattern should not be empty.");

DateTimeFormatterBuilder builder(format.size());
const char* cur = format.data();
const char* end = cur + format.size();

while (cur < end) {
const char* startTokenPtr = cur;

// Literals must be quoted with single quotes ('). Unquoted letters are
// interpreted as pattern letters. An unmatched single quote raises a user
// error.
if (*startTokenPtr == '\'') {
// Two consecutive single quotes produce one literal single quote.
if (cur + 1 < end && *(cur + 1) == '\'') {
builder.appendLiteral("'");
cur += 2;
} else {
// Append literal characters from the start until the next closing
// literal sequence single quote.
int64_t count = numLiteralChars(startTokenPtr + 1, end);
VELOX_USER_CHECK_NE(count, -1, "No closing single quote for literal");
for (int64_t i = 1; i <= count; i++) {
builder.appendLiteral(startTokenPtr + i, 1);
if (*(startTokenPtr + i) == '\'') {
i += 1;
}
}
cur += count + 2;
}
} else {
// Append format specifier according to pattern letters. If pattern letter
// is not supported, a user error will be thrown.
int count = 1;
++cur;
while (cur < end && *startTokenPtr == *cur) {
++count;
++cur;
}
switch (*startTokenPtr) {
case 'a':
builder.appendHalfDayOfDay();
break;
case 'C':
builder.appendCenturyOfEra(count);
break;
case 'd':
builder.appendDayOfMonth(count);
break;
case 'D':
builder.appendDayOfYear(count);
break;
case 'e':
builder.appendDayOfWeek1Based(count);
break;
case 'E':
builder.appendDayOfWeekText(count);
break;
case 'G':
builder.appendEra();
break;
case 'h':
builder.appendClockHourOfHalfDay(count);
break;
case 'H':
builder.appendHourOfDay(count);
break;
case 'K':
builder.appendHourOfHalfDay(count);
break;
case 'k':
builder.appendClockHourOfDay(count);
break;
case 'm':
builder.appendMinuteOfHour(count);
break;
case 'M':
if (count <= 2) {
builder.appendMonthOfYear(count);
} else {
builder.appendMonthOfYearText(count);
}
break;
case 's':
builder.appendSecondOfMinute(count);
break;
case 'S':
builder.appendFractionOfSecond(count);
break;
case 'w':
builder.appendWeekOfWeekYear(count);
break;
case 'x':
builder.appendWeekYear(count);
break;
case 'y':
builder.appendYear(count);
break;
case 'Y':
builder.appendYearOfEra(count);
break;
case 'z':
builder.appendTimeZone(count);
break;
case 'Z':
builder.appendTimeZoneOffsetId(count);
break;
default:
if (isalpha(static_cast<unsigned char>(*startTokenPtr))) {
VELOX_UNSUPPORTED("Specifier {} is not supported.", *startTokenPtr);
} else {
builder.appendLiteral(startTokenPtr, cur - startTokenPtr);
}
break;
}
}
}
DateTimeFormatterType type = lenient ? DateTimeFormatterType::LENIENT_SIMPLE
: DateTimeFormatterType::STRICT_SIMPLE;
return builder.setType(type).build();
}

} // namespace facebook::velox::functions
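
The builder loop above splits a pattern into quoted literals and runs of identical pattern letters. The sketch below shows that tokenization step in isolation. `PatternToken` and `tokenizePattern` are hypothetical names introduced for illustration, the quote handling is a simplified reading of the loop rather than a copy of `numLiteralChars`, and no mapping from letters to specifiers is attempted.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// One token of a date pattern: either a literal run or a pattern letter
// repeated 'count' times (e.g. "yyyy" -> letter 'y', count 4).
struct PatternToken {
  bool isLiteral;
  std::string text; // Literal text, or the single pattern letter.
  int count;        // Repeat count for pattern letters; 0 for literals.
};

std::vector<PatternToken> tokenizePattern(const std::string& pattern) {
  std::vector<PatternToken> tokens;
  std::size_t i = 0;
  while (i < pattern.size()) {
    const char c = pattern[i];
    if (c == '\'') {
      if (i + 1 < pattern.size() && pattern[i + 1] == '\'') {
        // Two consecutive quotes stand for one literal quote.
        tokens.push_back({true, "'", 0});
        i += 2;
        continue;
      }
      // Quoted literal: copy until the closing quote; '' inside is a quote.
      std::string literal;
      std::size_t j = i + 1;
      bool closed = false;
      while (j < pattern.size()) {
        if (pattern[j] == '\'') {
          if (j + 1 < pattern.size() && pattern[j + 1] == '\'') {
            literal += '\'';
            j += 2;
            continue;
          }
          closed = true;
          break;
        }
        literal += pattern[j++];
      }
      if (!closed) {
        throw std::invalid_argument("No closing single quote for literal");
      }
      tokens.push_back({true, literal, 0});
      i = j + 1; // Skip the closing quote.
    } else {
      // Count the run of identical characters starting at i.
      std::size_t j = i + 1;
      while (j < pattern.size() && pattern[j] == c) {
        ++j;
      }
      const bool isLetter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
      if (isLetter) {
        tokens.push_back({false, std::string(1, c), static_cast<int>(j - i)});
      } else {
        // Unquoted non-letters (such as '-' or ':') are kept as literals.
        tokens.push_back({true, pattern.substr(i, j - i), 0});
      }
      i = j;
    }
  }
  return tokens;
}
```

For the pattern `yyyy-MM-dd'T'HH` this yields seven tokens: `y` x4, `-`, `M` x2, `-`, `d` x2, the literal `T`, and `H` x2. A real builder would then switch on each letter, as the diff does, and reject unsupported letters.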
18 changes: 17 additions & 1 deletion velox/functions/lib/DateTimeFormatter.h
@@ -23,7 +23,19 @@

namespace facebook::velox::functions {

enum class DateTimeFormatterType { JODA, MYSQL, UNKNOWN };
enum class DateTimeFormatterType {
JODA,
MYSQL,
// Corresponding to java.text.SimpleDateFormat in lenient mode. It is
// intended for Spark 'cast timestamp to string'.
// TODO: this is currently no different from STRICT_SIMPLE.
LENIENT_SIMPLE,
// Corresponding to java.text.SimpleDateFormat in strict (lenient=false) mode.
// It is used by the 'date_format', 'from_unixtime', 'unix_timestamp' and
// 'to_unix_timestamp' Spark functions.
STRICT_SIMPLE,
UNKNOWN
};

enum class DateTimeFormatSpecifier : uint8_t {
// Era, e.g: "AD"
@@ -209,6 +221,10 @@ std::shared_ptr<DateTimeFormatter> buildMysqlDateTimeFormatter(
std::shared_ptr<DateTimeFormatter> buildJodaDateTimeFormatter(
const std::string_view& format);

std::shared_ptr<DateTimeFormatter> buildSimpleDateTimeFormatter(
const std::string_view& format,
bool lenient);

} // namespace facebook::velox::functions

template <>
55 changes: 46 additions & 9 deletions velox/functions/sparksql/DateTimeFunctions.h
@@ -25,6 +25,22 @@

namespace facebook::velox::functions::sparksql {

namespace detail {
// Inline because this function is defined in a header.
inline std::shared_ptr<DateTimeFormatter> getDateTimeFormatter(
const std::string_view& format,
DateTimeFormatterType type) {
switch (type) {
case DateTimeFormatterType::STRICT_SIMPLE:
return buildSimpleDateTimeFormatter(format, /*lenient=*/false);
case DateTimeFormatterType::LENIENT_SIMPLE:
return buildSimpleDateTimeFormatter(format, /*lenient=*/true);
default:
return buildJodaDateTimeFormatter(format);
}
}
} // namespace detail

template <typename T>
struct YearFunction : public InitSessionTimezone<T> {
VELOX_DEFINE_FUNCTION_TYPES(T);
@@ -156,7 +172,10 @@ struct UnixTimestampParseFunction {
const std::vector<TypePtr>& /*inputTypes*/,
const core::QueryConfig& config,
const arg_type<Varchar>* /*input*/) {
format_ = buildJodaDateTimeFormatter(kDefaultFormat_);
format_ = detail::getDateTimeFormatter(
kDefaultFormat_,
config.sparkLegacyDateFormatter() ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA);
setTimezone(config);
}

@@ -205,10 +224,13 @@ struct UnixTimestampParseWithFormatFunction
const core::QueryConfig& config,
const arg_type<Varchar>* /*input*/,
const arg_type<Varchar>* format) {
legacyFormatter_ = config.sparkLegacyDateFormatter();
if (format != nullptr) {
try {
this->format_ = buildJodaDateTimeFormatter(
std::string_view(format->data(), format->size()));
this->format_ = detail::getDateTimeFormatter(
std::string_view(format->data(), format->size()),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA);
} catch (const VeloxUserError&) {
invalidFormat_ = true;
}
@@ -228,8 +250,10 @@ struct UnixTimestampParseWithFormatFunction
// Format error returns null.
try {
if (!isConstFormat_) {
this->format_ = buildJodaDateTimeFormatter(
std::string_view(format.data(), format.size()));
this->format_ = detail::getDateTimeFormatter(
std::string_view(format.data(), format.size()),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA);
}
} catch (const VeloxUserError&) {
return false;
@@ -248,6 +272,7 @@ struct UnixTimestampParseWithFormatFunction
private:
bool isConstFormat_{false};
bool invalidFormat_{false};
bool legacyFormatter_{false};
};
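
The hunks above follow a common pattern: build the formatter once when the format argument is constant, rebuild it per row otherwise, and turn an invalid format into a NULL result rather than an exception. A minimal sketch of that pattern follows. `ToyFormatter` and `ToyParseFunction` are hypothetical types, the "empty pattern is invalid" rule is the only validation modeled, and where `isConstFormat_` is set in the real function is an assumption since that code is not shown in the diff.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical formatter: construction fails for an empty pattern.
struct ToyFormatter {
  explicit ToyFormatter(const std::string& p) : pattern(p) {
    if (p.empty()) {
      throw std::invalid_argument("Format pattern should not be empty.");
    }
  }
  std::string pattern;
};

class ToyParseFunction {
 public:
  // 'constantFormat' is non-null only when the format argument is constant.
  void initialize(const std::string* constantFormat) {
    if (constantFormat == nullptr) {
      return;
    }
    isConstFormat_ = true;
    if (!build(*constantFormat)) {
      invalidFormat_ = true;
    }
  }

  // Returns false (SQL NULL) when the format is invalid; otherwise writes the
  // pattern that would be used for this row into 'result'.
  bool call(const std::string& format, std::string& result) {
    if (invalidFormat_) {
      return false;
    }
    if (!isConstFormat_ && !build(format)) {
      return false;
    }
    result = formatter_->pattern;
    return true;
  }

  // Number of formatter build attempts, to make the caching visible.
  int builds() const {
    return builds_;
  }

 private:
  // Builds a formatter, translating construction errors into 'false'.
  bool build(const std::string& format) {
    ++builds_;
    try {
      formatter_ = std::make_shared<ToyFormatter>(format);
      return true;
    } catch (const std::invalid_argument&) {
      return false;
    }
  }

  std::shared_ptr<ToyFormatter> formatter_;
  bool isConstFormat_{false};
  bool invalidFormat_{false};
  int builds_{0};
};
```

With a constant format, `builds()` stays at 1 no matter how many rows are processed; with a per-row format it grows with each call, which is why the constant case is worth special-casing.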

// Formats unix time in seconds as a string.
@@ -260,6 +285,7 @@ struct FromUnixtimeFunction {
const core::QueryConfig& config,
const arg_type<int64_t>* /*unixtime*/,
const arg_type<Varchar>* format) {
legacyFormatter_ = config.sparkLegacyDateFormatter();
sessionTimeZone_ = getTimeZoneFromConfig(config);
if (format != nullptr) {
setFormatter(*format);
@@ -284,15 +310,18 @@ struct FromUnixtimeFunction {

private:
FOLLY_ALWAYS_INLINE void setFormatter(const arg_type<Varchar>& format) {
formatter_ = buildJodaDateTimeFormatter(
std::string_view(format.data(), format.size()));
formatter_ = detail::getDateTimeFormatter(
std::string_view(format.data(), format.size()),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA);
maxResultSize_ = formatter_->maxResultSize(sessionTimeZone_);
}

const tz::TimeZone* sessionTimeZone_{nullptr};
std::shared_ptr<DateTimeFormatter> formatter_;
uint32_t maxResultSize_;
bool isConstantTimeFormat_{false};
bool legacyFormatter_{false};
};

template <typename T>
@@ -366,12 +395,16 @@ struct GetTimestampFunction {
const core::QueryConfig& config,
const arg_type<Varchar>* /*input*/,
const arg_type<Varchar>* format) {
legacyFormatter_ = config.sparkLegacyDateFormatter();
auto sessionTimezoneName = config.sessionTimezone();
if (!sessionTimezoneName.empty()) {
sessionTimeZone_ = tz::locateZone(sessionTimezoneName);
}
if (format != nullptr) {
formatter_ = buildJodaDateTimeFormatter(std::string_view(*format));
formatter_ = detail::getDateTimeFormatter(
std::string_view(*format),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA);
isConstantTimeFormat_ = true;
}
}
@@ -381,7 +414,10 @@ struct GetTimestampFunction {
const arg_type<Varchar>& input,
const arg_type<Varchar>& format) {
if (!isConstantTimeFormat_) {
formatter_ = buildJodaDateTimeFormatter(std::string_view(format));
formatter_ = detail::getDateTimeFormatter(
std::string_view(format),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA);
}
auto dateTimeResult = formatter_->parse(std::string_view(input));
// Null as result for parsing error.
@@ -404,6 +440,7 @@ struct GetTimestampFunction {
std::shared_ptr<DateTimeFormatter> formatter_{nullptr};
bool isConstantTimeFormat_{false};
const tz::TimeZone* sessionTimeZone_{tz::locateZone(0)}; // default to GMT.
bool legacyFormatter_{false};
};
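
Each call site in this file repeats the same conditional to turn the legacy flag into a formatter type. One possible way to express that mapping once is sketched below. `FormatterKind`, `formatterKindFor`, and `toString` are hypothetical names and are not part of this change; the enum is a local stand-in for `DateTimeFormatterType`.

```cpp
#include <string>

// Local stand-in for the formatter kinds named in this change; the real enum
// is DateTimeFormatterType in velox/functions/lib/DateTimeFormatter.h.
enum class FormatterKind { kJoda, kStrictSimple, kLenientSimple };

// Hypothetical helper: maps the legacy flag to the formatter kind that the
// functions in this file request (strict simple when legacy, else Joda).
inline FormatterKind formatterKindFor(bool legacyDateFormatter) {
  return legacyDateFormatter ? FormatterKind::kStrictSimple
                             : FormatterKind::kJoda;
}

// Human-readable name, handy for logging or tests.
inline std::string toString(FormatterKind kind) {
  switch (kind) {
    case FormatterKind::kJoda:
      return "JODA";
    case FormatterKind::kStrictSimple:
      return "STRICT_SIMPLE";
    case FormatterKind::kLenientSimple:
      return "LENIENT_SIMPLE";
  }
  return "UNKNOWN";
}
```

A helper of this shape would keep the flag-to-type decision in one place, so a later switch of some functions to the lenient variant would touch a single function instead of every call site.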

template <typename T>
2 changes: 1 addition & 1 deletion velox/functions/sparksql/Split.h
@@ -165,6 +165,6 @@ struct Split {
result.add_item().setNoCopy(StringView(start + pos, end - pos));
}

mutable detail::ReCache cache_;
mutable facebook::velox::functions::detail::ReCache cache_;
};
} // namespace facebook::velox::functions::sparksql
