Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return null for invalid datetime format when legacy date formatter is used #11131

Closed
22 changes: 13 additions & 9 deletions velox/docs/functions/spark/datetime.rst
Original file line number Diff line number Diff line change
Expand Up @@ -82,14 +82,14 @@ These functions support TIMESTAMP and DATE input types.

Adjusts ``unixTime`` (elapsed seconds since UNIX epoch) to configured session timezone, then
converts it to a formatted time string according to ``format``. Only supports BIGINT type for
``unixTime``. Using `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`
``unixTime``. Using `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
date formatter in lenient mode that is align with Spark legacy date parser behavior or
`Joda <https://www.joda.org/joda-time/>` date formatter depends on ``spark.legacy_date_formatter`` configuration.
`Joda <https://www.joda.org/joda-time/>`_ date formatter depends on ``spark.legacy_date_formatter`` configuration.
`Valid patterns for date format
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. Throws exception for
invalid ``format``. This function will convert input to milliseconds, and integer overflow is
allowed in the conversion, which aligns with Spark. See the below third example where INT64_MAX
is used, -1000 milliseconds are produced by INT64_MAX * 1000 due to integer overflow. ::
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. When `Simple` date formatter is used,
null is returned for invalid ``format``; otherwise, exception is thrown. This function will convert input to
milliseconds, and integer overflow is allowed in the conversion, which aligns with Spark. See the below third
example where INT64_MAX is used, -1000 milliseconds are produced by INT64_MAX * 1000 due to integer overflow. ::

SELECT from_unixtime(100, 'yyyy-MM-dd HH:mm:ss'); -- '1970-01-01 00:01:40'
SELECT from_unixtime(3600, 'yyyy'); -- '1970'
Expand All @@ -113,7 +113,11 @@ These functions support TIMESTAMP and DATE input types.
The format follows Spark's
`Datetime patterns
<https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_.
Returns NULL for parsing error or NULL input. Throws exception for invalid format. ::
Using `Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_
date formatter in lenient mode that is align with Spark legacy date parser behavior or
`Joda <https://www.joda.org/joda-time/>`_ date formatter depends on ``spark.legacy_date_formatter`` configuration.
Returns NULL for parsing error or NULL input. When `Simple` date formatter is used, null is returned for invalid
``dateFormat``; otherwise, exception is thrown. ::

SELECT get_timestamp('1970-01-01', 'yyyy-MM-dd); -- timestamp `1970-01-01`
SELECT get_timestamp('1970-01-01', 'yyyy-MM'); -- NULL (parsing error)
Expand Down Expand Up @@ -288,8 +292,8 @@ These functions support TIMESTAMP and DATE input types.
.. spark:function:: unix_timestamp() -> integer

Returns the current UNIX timestamp in seconds. Using
`Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>` date formatter in lenient mode
that is align with Spark legacy date parser behavior or `Joda <https://www.joda.org/joda-time/>` date formatter
`Simple <https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html>`_ date formatter in lenient mode
that is align with Spark legacy date parser behavior or `Joda <https://www.joda.org/joda-time/>`_ date formatter
depends on the ``spark.legacy_date_formatter`` configuration.

.. spark:function:: unix_timestamp(string) -> integer
Expand Down
95 changes: 66 additions & 29 deletions velox/functions/sparksql/DateTimeFunctions.h
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,32 @@ Expected<std::shared_ptr<DateTimeFormatter>> getDateTimeFormatter(
std::string_view(format.data(), format.size()));
}
}

// Creates datetime formatter based on the provided format string. When legacy
// formatter is used, returns nullptr for invalid format; otherwise, throws a
// user error.
//
// @param format The format string to be used for initializing the formatter.
// @param legacyFormatter Whether legacy formatter is used.
// @return A shared pointer to a DateTimeFormatter. If the formatter
// initialization fails and the legacy formatter is used, nullptr is returned.
inline std::shared_ptr<DateTimeFormatter> initializeFormatter(
const std::string_view format,
bool legacyFormatter) {
auto formatter = detail::getDateTimeFormatter(
std::string_view(format),
legacyFormatter ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA);
// When legacy formatter is used, returns nullptr for invalid format;
// otherwise, throws a user error.
if (formatter.hasError()) {
if (legacyFormatter) {
return nullptr;
}
VELOX_USER_FAIL(formatter.error().message());
}
return formatter.value();
}
} // namespace detail

template <typename T>
Expand Down Expand Up @@ -245,43 +271,51 @@ struct FromUnixtimeFunction {
legacyFormatter_ = config.sparkLegacyDateFormatter();
sessionTimeZone_ = getTimeZoneFromConfig(config);
if (format != nullptr) {
setFormatter(*format);
auto formatter = detail::initializeFormatter(
std::string_view(*format), legacyFormatter_);
if (formatter) {
formatter_ = formatter;
maxResultSize_ = formatter_->maxResultSize(sessionTimeZone_);
} else {
invalidFormat_ = true;
}
isConstantTimeFormat_ = true;
}
}

FOLLY_ALWAYS_INLINE void call(
FOLLY_ALWAYS_INLINE bool call(
out_type<Varchar>& result,
const arg_type<int64_t>& second,
const arg_type<Varchar>& format) {
if (invalidFormat_) {
return false;
}
if (!isConstantTimeFormat_) {
setFormatter(format);
auto formatter = detail::initializeFormatter(
std::string_view(format), legacyFormatter_);
if (formatter) {
formatter_ = formatter;
maxResultSize_ = formatter_->maxResultSize(sessionTimeZone_);
} else {
return false;
}
}
const Timestamp timestamp{second, 0};
result.reserve(maxResultSize_);
int32_t resultSize;
resultSize = formatter_->format(
timestamp, sessionTimeZone_, maxResultSize_, result.data(), true);
result.resize(resultSize);
return true;
}

private:
FOLLY_ALWAYS_INLINE void setFormatter(const arg_type<Varchar>& format) {
formatter_ = detail::getDateTimeFormatter(
std::string_view(format.data(), format.size()),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA)
.thenOrThrow(folly::identity, [&](const Status& status) {
VELOX_USER_FAIL("{}", status.message());
});
maxResultSize_ = formatter_->maxResultSize(sessionTimeZone_);
}

const tz::TimeZone* sessionTimeZone_{nullptr};
std::shared_ptr<DateTimeFormatter> formatter_;
uint32_t maxResultSize_;
bool isConstantTimeFormat_{false};
bool legacyFormatter_{false};
bool invalidFormat_{false};
};

template <typename T>
Expand Down Expand Up @@ -361,29 +395,31 @@ struct GetTimestampFunction {
sessionTimeZone_ = tz::locateZone(sessionTimezoneName);
}
if (format != nullptr) {
formatter_ = detail::getDateTimeFormatter(
std::string_view(*format),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA)
.thenOrThrow(folly::identity, [&](const Status& status) {
VELOX_USER_FAIL("{}", status.message());
});
isConstantTimeFormat_ = true;
auto formatter = detail::initializeFormatter(
std::string_view(*format), legacyFormatter_);
if (formatter) {
formatter_ = formatter;
} else {
invalidFormat_ = true;
}
rui-mo marked this conversation as resolved.
Show resolved Hide resolved
}
}

FOLLY_ALWAYS_INLINE bool call(
out_type<Timestamp>& result,
const arg_type<Varchar>& input,
const arg_type<Varchar>& format) {
if (invalidFormat_) {
return false;
}
if (!isConstantTimeFormat_) {
formatter_ = detail::getDateTimeFormatter(
std::string_view(format),
legacyFormatter_ ? DateTimeFormatterType::STRICT_SIMPLE
: DateTimeFormatterType::JODA)
.thenOrThrow(folly::identity, [&](const Status& status) {
VELOX_USER_FAIL("{}", status.message());
});
auto formatter = detail::initializeFormatter(
std::string_view(format), legacyFormatter_);
if (formatter) {
formatter_ = formatter;
} else {
return false;
}
rui-mo marked this conversation as resolved.
Show resolved Hide resolved
}
auto dateTimeResult = formatter_->parse(std::string_view(input));
// Null as result for parsing error.
Expand All @@ -406,6 +442,7 @@ struct GetTimestampFunction {
bool isConstantTimeFormat_{false};
const tz::TimeZone* sessionTimeZone_{tz::locateZone(0)}; // default to GMT.
bool legacyFormatter_{false};
bool invalidFormat_{false};
};

template <typename T>
Expand Down
25 changes: 25 additions & 0 deletions velox/functions/sparksql/tests/DateTimeFunctionsTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,12 @@ class DateTimeFunctionsTest : public SparkFunctionBaseTest {
});
}

void enableLegacyFormatter() {
queryCtx_->testingOverrideConfigUnsafe({
{core::QueryConfig::kSparkLegacyDateFormatter, "true"},
});
}

template <typename TOutput, typename TValue>
std::optional<TOutput> evaluateDateFuncOnce(
const std::string& expr,
Expand Down Expand Up @@ -262,6 +268,9 @@ TEST_F(DateTimeFunctionsTest, unixTimestamp) {
EXPECT_EQ(std::nullopt, unixTimestamp("00:00:00"));
EXPECT_EQ(std::nullopt, unixTimestamp(""));
EXPECT_EQ(std::nullopt, unixTimestamp("malformed input"));
enableLegacyFormatter();
EXPECT_EQ(std::nullopt, unixTimestamp(""));
EXPECT_EQ(std::nullopt, unixTimestamp("malformed input"));
}

TEST_F(DateTimeFunctionsTest, unixTimestampCurrent) {
Expand Down Expand Up @@ -803,6 +812,14 @@ TEST_F(DateTimeFunctionsTest, getTimestamp) {
VELOX_ASSERT_THROW(
getTimestamp("2023-07-13 21:34", "yyyy-MM-dd HH:II"),
"Specifier I is not supported");

// Returns null for invalid datetime format when legacy date formatter is
// used.
enableLegacyFormatter();
// Empty format.
EXPECT_EQ(getTimestamp("0", ""), std::nullopt);
// Unsupported specifier.
EXPECT_EQ(getTimestamp("0", "l"), std::nullopt);
}

TEST_F(DateTimeFunctionsTest, hour) {
Expand Down Expand Up @@ -926,6 +943,14 @@ TEST_F(DateTimeFunctionsTest, fromUnixtime) {
fromUnixTime(0, "FF/MM/dd"), "Specifier F is not supported");
VELOX_ASSERT_THROW(
fromUnixTime(0, "yyyy-MM-dd HH:II"), "Specifier I is not supported");

// Returns null for invalid datetime format when legacy date formatter is
// used.
enableLegacyFormatter();
// Empty format.
EXPECT_EQ(fromUnixTime(0, ""), std::nullopt);
// Unsupported specifier.
EXPECT_EQ(fromUnixTime(0, "l"), std::nullopt);
}

TEST_F(DateTimeFunctionsTest, makeYMInterval) {
Expand Down
Loading