Skip to content

Commit

Permalink
fix: Remove StringWriter that modifies input in place (#12219)
Browse files Browse the repository at this point in the history
Summary:

StringWriter supports two flavors, one that writes out new strings to a result Vector, and one that
modifies existing strings in place. The latter is only used in two places, providing fast paths for
upper/lower and replace/replace_first. 

I found a related bug with fuzzer that occurs in replace/replace_first that occurs when the StringViews in the first argument's Vector point to overlapping ranges of the same string buffers.
In this case the changes to one row are accidentally applied to multiple rows resulting in
incorrect results.

Since we explicitly allow operations that produce a substring of the original string to do so with
a no-copy implementation, StringViews with overlapping ranges can occur.  E.g. SimpleFunctions
that apply this no-copy optimization on a string argument which could be constant, with other
arguments that determine the range of the original string to take that are non-constant (like 
trim).

I discussed this offline with a few folks and since the above is allowed and this in-place 
optimization is so rarely used, the consensus was to treat the string buffers in a FlatVector as
immutable.

Therefore in this change, I remove the flavor of StringWriter that modifies strings in place. This
fixes the bug in replace/replace_first. upper/lower was using it correctly because the optimization
was only applied to ASCII strings, the function takes a single argument, and the modification is 
idempotent (the value of a byte in the string don't depend on any other bytes in the string or any 
other arguments, and can be reapplied without consequences). Given how precarious this
optimization is (if any of those conditions changed it would result in difficult to detect bugs), and
allowing upper/lower to mutate the string in place would invite others to do so in the future
(potentially leading to more bugs like in replace/replace_first) I think losing this fast path is worth
the added safety.

I also updated the documentation to clarify that the string buffers in a FlatVector should be
treated as immutable.

Differential Revision: D68924324
  • Loading branch information
Kevin Wilfong authored and facebook-github-bot committed Jan 30, 2025
1 parent 49292da commit 7639307
Show file tree
Hide file tree
Showing 23 changed files with 122 additions and 288 deletions.
6 changes: 6 additions & 0 deletions velox/docs/develop/vectors.rst
Original file line number Diff line number Diff line change
Expand Up @@ -256,6 +256,12 @@ After applying substr(s, 2) function string in position 1 became short enough
to fit inside the StringView, hence, it no longer contains a pointer to a
position in the string buffer.

Allowing these zero-copy implementations of functions that simply change the
starting position/length of a string, means that we may end up with StringViews
pointing to overlapping ranges within `stringBuffers_`. For this reason the
Buffers in `stringBuffers_` should be treated as immutable to prevent
modifications from unintentionally cascading.

Flat vectors of type TIMESTAMP are represented by FlatVector<Timestamp>.
Timestamp struct consists of two 64-bit integers: seconds and nanoseconds. Each
entry uses 16 bytes.
Expand Down
2 changes: 1 addition & 1 deletion velox/docs/develop/view-and-writer-types.rst
Original file line number Diff line number Diff line change
Expand Up @@ -310,7 +310,7 @@ When a given Ti is primitive, the following is valid.
Assignable to std::optional<T> allows writing null or value to the primitive. Returned by complex writers when writing nullable
primitives.

**StringWriter<>**
**StringWriter**

- void **reserve** (size_t newCapacity) : Reserve a space for the output string with size of at least newCapacity.
- void **resize** (size_t newCapacity) : Set the size of the string.
Expand Down
2 changes: 1 addition & 1 deletion velox/docs/functions/presto/json.rst
Original file line number Diff line number Diff line change
Expand Up @@ -211,4 +211,4 @@ are similar. To create a JSON-typed vector, one can use
``BaseVector::create(JSON(), size, pool)`` that creates a flat vector of
StringViews, i.e. FlatVector<StringView>. Reading and writing to a JSON-typed
vector are the same as those for VARCHAR vectors, e.g., via
VectorReader<StringView> and StringWriter<>.
VectorReader<StringView> and StringWriter.
2 changes: 1 addition & 1 deletion velox/expression/CastExpr-inl.h
Original file line number Diff line number Diff line change
Expand Up @@ -341,7 +341,7 @@ void CastExpr::applyCastKernel(
if constexpr (
ToKind == TypeKind::VARCHAR || ToKind == TypeKind::VARBINARY) {
// Write the result output to the output vector
auto writer = exec::StringWriter<>(result, row);
auto writer = exec::StringWriter(result, row);
writer.copy_from(output);
writer.finalize();
} else {
Expand Down
4 changes: 2 additions & 2 deletions velox/expression/CastExpr.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -175,7 +175,7 @@ VectorPtr CastExpr::castFromDate(
try {
// TODO Optimize to avoid creating an intermediate string.
auto output = DATE()->toString(inputFlatVector->valueAt(row));
auto writer = exec::StringWriter<>(resultFlatVector, row);
auto writer = exec::StringWriter(resultFlatVector, row);
writer.resize(output.size());
::memcpy(writer.data(), output.data(), output.size());
writer.finalize();
Expand Down Expand Up @@ -298,7 +298,7 @@ VectorPtr CastExpr::castFromIntervalDayTime(
// TODO Optimize to avoid creating an intermediate string.
auto output =
INTERVAL_DAY_TIME()->valueToString(inputFlatVector->valueAt(row));
auto writer = exec::StringWriter<>(resultFlatVector, row);
auto writer = exec::StringWriter(resultFlatVector, row);
writer.resize(output.size());
::memcpy(writer.data(), output.data(), output.size());
writer.finalize();
Expand Down
54 changes: 1 addition & 53 deletions velox/expression/StringWriter.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,11 +23,7 @@
#include "velox/vector/FlatVector.h"

namespace facebook::velox::exec {
template <bool reuseInput = false>
class StringWriter;

template <>
class StringWriter<false /*reuseInput*/> : public UDFOutputString {
class StringWriter : public UDFOutputString {
public:
// Used to initialize top-level strings and allow zero-copy writes.
StringWriter(FlatVector<StringView>* vector, int32_t offset)
Expand Down Expand Up @@ -156,52 +152,4 @@ class StringWriter<false /*reuseInput*/> : public UDFOutputString {
template <typename A, typename B>
friend struct VectorWriter;
};

// A string writer with UDFOutputString semantics that utilizes a pre-allocated
// input string for the output allocation, if inPlace is true in the constructor
// the string will be initialized with the input string value.
template <>
class StringWriter<true /*reuseInput*/> : public UDFOutputString {
public:
StringWriter() : vector_(nullptr), offset_(-1) {}

StringWriter(
FlatVector<StringView>* vector,
int32_t offset,
const StringView& stringToReuse,
bool inPlace = false)
: vector_(vector), offset_(offset), stringToReuse_(stringToReuse) {
setData(const_cast<char*>(stringToReuse_.data()));
setCapacity(stringToReuse_.size());

if (inPlace) {
// The string should be intialized with the input value
setSize(stringToReuse_.size());
}
}

void reserve(size_t newCapacity) override {
VELOX_CHECK(
newCapacity <= capacity() && "String writer max capacity extended");
}

/// Not called by the UDF Implementation. Should be called at the end to
/// finalize the allocation and the string writing
void finalize() {
VELOX_DCHECK(size() == 0 || data());
vector_->setNoCopy(offset_, StringView(data(), size()));
}

private:
/// The output vector that this string is being written to
FlatVector<StringView>* vector_;

/// The offset the string writes to within vector_
int32_t offset_;

/// The input string that is reused, held locally to assert the validity of
/// the data pointer throughout the proxy lifetime. More specifically when
/// the string is inlined.
StringView stringToReuse_;
};
} // namespace facebook::velox::exec
5 changes: 2 additions & 3 deletions velox/expression/UdfTypeResolver.h
Original file line number Diff line number Diff line change
Expand Up @@ -21,7 +21,6 @@

namespace facebook::velox::exec {

template <bool reuseInput>
class StringWriter;

template <bool nullable, typename V>
Expand Down Expand Up @@ -120,14 +119,14 @@ template <>
struct resolver<Varchar> {
using in_type = StringView;
using null_free_in_type = in_type;
using out_type = StringWriter<false>;
using out_type = StringWriter;
};

template <>
struct resolver<Varbinary> {
using in_type = StringView;
using null_free_in_type = in_type;
using out_type = StringWriter<false>;
using out_type = StringWriter;
};

template <>
Expand Down
2 changes: 1 addition & 1 deletion velox/expression/VectorWriters.h
Original file line number Diff line number Diff line change
Expand Up @@ -353,7 +353,7 @@ struct VectorWriter<
std::enable_if_t<std::is_same_v<T, Varchar> | std::is_same_v<T, Varbinary>>>
: public VectorWriterBase {
using vector_t = typename TypeToFlatVector<T>::type;
using exec_out_t = StringWriter<>;
using exec_out_t = StringWriter;

void init(vector_t& vector, bool uniqueAndMutable = false) {
proxy_.vector_ = &vector;
Expand Down
18 changes: 9 additions & 9 deletions velox/expression/tests/StringWriterTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ class StringWriterTest : public functions::test::FunctionBaseTest {};

TEST_F(StringWriterTest, append) {
auto vector = makeFlatVector<StringView>(2);
auto writer = exec::StringWriter<>(vector.get(), 0);
auto writer = exec::StringWriter(vector.get(), 0);
writer.append("1 "_sv);
writer.append(std::string_view("2 "));
writer.append("3 "_sv);
Expand All @@ -42,7 +42,7 @@ TEST_F(StringWriterTest, append) {

TEST_F(StringWriterTest, plusOperator) {
auto vector = makeFlatVector<StringView>(1);
auto writer = exec::StringWriter<>(vector.get(), 0);
auto writer = exec::StringWriter(vector.get(), 0);
writer += "1 "_sv;
writer += "2 ";
writer += std::string_view("3 ");
Expand All @@ -57,19 +57,19 @@ TEST_F(StringWriterTest, plusOperator) {
TEST_F(StringWriterTest, assignment) {
auto vector = makeFlatVector<StringView>(4);

auto writer0 = exec::StringWriter<>(vector.get(), 0);
auto writer0 = exec::StringWriter(vector.get(), 0);
writer0 = "string0"_sv;
writer0.finalize();

auto writer1 = exec::StringWriter<>(vector.get(), 1);
auto writer1 = exec::StringWriter(vector.get(), 1);
writer1 = std::string("string1");
writer1.finalize();

auto writer2 = exec::StringWriter<>(vector.get(), 2);
auto writer2 = exec::StringWriter(vector.get(), 2);
writer2 = std::string_view("string2");
writer2.finalize();

auto writer3 = exec::StringWriter<>(vector.get(), 3);
auto writer3 = exec::StringWriter(vector.get(), 3);
writer3 = folly::StringPiece("string3");
writer3.finalize();

Expand All @@ -81,7 +81,7 @@ TEST_F(StringWriterTest, assignment) {

TEST_F(StringWriterTest, copyFromStringView) {
auto vector = makeFlatVector<StringView>(1);
auto writer = exec::StringWriter<>(vector.get(), 0);
auto writer = exec::StringWriter(vector.get(), 0);
writer.copy_from("1 2 3 4 5 "_sv);
writer.finalize();

Expand All @@ -90,7 +90,7 @@ TEST_F(StringWriterTest, copyFromStringView) {

TEST_F(StringWriterTest, copyFromStdString) {
auto vector = makeFlatVector<StringView>(1);
auto writer = exec::StringWriter<>(vector.get(), 0);
auto writer = exec::StringWriter(vector.get(), 0);
writer.copy_from(std::string("1 2 3 4 5 "));
writer.finalize();

Expand All @@ -99,7 +99,7 @@ TEST_F(StringWriterTest, copyFromStdString) {

TEST_F(StringWriterTest, copyFromCString) {
auto vector = makeFlatVector<StringView>(4);
auto writer = exec::StringWriter<>(vector.get(), 0);
auto writer = exec::StringWriter(vector.get(), 0);
writer.copy_from("1 2 3 4 5 ");
writer.finalize();

Expand Down
2 changes: 1 addition & 1 deletion velox/expression/tests/VariadicViewTest.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -377,7 +377,7 @@ const auto callNullablePrefix = "callNullable "_sv;
const auto callAsciiPrefix = "callAscii "_sv;

void writeInputToOutput(
StringWriter<>& out,
StringWriter& out,
const VariadicView<true, Varchar>* inputs) {
for (const auto& input : *inputs) {
out += input.has_value() ? input.value() : null;
Expand Down
4 changes: 2 additions & 2 deletions velox/functions/lib/Re2Functions.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1397,11 +1397,11 @@ class RegexpReplaceWithLambdaFunction : public exec::VectorFunction {
// Sections being replaced should not overlap.
struct Replacer {
const StringView& original;
exec::StringWriter<false>& writer;
exec::StringWriter& writer;
char* result;
size_t start = 0;

Replacer(const StringView& _original, exec::StringWriter<false>& _writer)
Replacer(const StringView& _original, exec::StringWriter& _writer)
: original{_original}, writer{_writer}, result{writer.data()} {}

void replace(size_t offset, size_t size, const StringView& replacement) {
Expand Down
4 changes: 2 additions & 2 deletions velox/functions/lib/ToHex.h
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ namespace facebook::velox::functions {
struct ToHexUtil {
FOLLY_ALWAYS_INLINE static void toHex(
StringView input,
exec::StringWriter<false>& result) {
exec::StringWriter& result) {
// Lookup table to translate unsigned char to its hexadecimal format.
static const char* const kHexTable =
"000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F"
Expand All @@ -49,7 +49,7 @@ struct ToHexUtil {

FOLLY_ALWAYS_INLINE static void toHex(
uint64_t input,
exec::StringWriter<false>& result) {
exec::StringWriter& result) {
static const char* const kHexTable = "0123456789ABCDEF";
if (input == 0) {
result = "0";
Expand Down
16 changes: 0 additions & 16 deletions velox/functions/lib/string/StringImpl.h
Original file line number Diff line number Diff line change
Expand Up @@ -21,8 +21,6 @@
#include <stdlib.h>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>
#include <string_view>
#include <vector>
#include "folly/CPortability.h"
Expand Down Expand Up @@ -65,20 +63,6 @@ FOLLY_ALWAYS_INLINE bool lower(TOutString& output, const TInString& input) {
return true;
}

/// Inplace ascii lower
template <typename T>
FOLLY_ALWAYS_INLINE bool lowerAsciiInPlace(T& str) {
lowerAscii(str.data(), str.data(), str.size());
return true;
}

/// Inplace ascii upper
template <typename T>
FOLLY_ALWAYS_INLINE bool upperAsciiInPlace(T& str) {
upperAscii(str.data(), str.data(), str.size());
return true;
}

/// Apply a set of appenders on an output string, an appender is a lambda
/// that takes an output string and append a string to it. This can be used by
/// code-gen to reduce copying in concat by evaluating nested expressions
Expand Down
6 changes: 3 additions & 3 deletions velox/functions/prestosql/FromUtf8.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -95,7 +95,7 @@ class FromUtf8Function : public exec::VectorFunction {

if (constantReplacement) {
rows.applyToSelected([&](auto row) {
exec::StringWriter<false> writer(flatResult, row);
exec::StringWriter writer(flatResult, row);
auto value = decodedInput.valueAt<StringView>(row);
if (row < firstInvalidRow) {
writer.append(value);
Expand All @@ -109,7 +109,7 @@ class FromUtf8Function : public exec::VectorFunction {
context.applyToSelectedNoThrow(rows, [&](auto row) {
auto replacement =
getReplacementCharacter(args[1]->type(), decodedReplacement, row);
exec::StringWriter<false> writer(flatResult, row);
exec::StringWriter writer(flatResult, row);
auto value = decodedInput.valueAt<StringView>(row);
if (row < firstInvalidRow) {
writer.append(value);
Expand Down Expand Up @@ -261,7 +261,7 @@ class FromUtf8Function : public exec::VectorFunction {
void fixInvalidUtf8(
StringView input,
const std::string& replacement,
exec::StringWriter<false>& fixedWriter) const {
exec::StringWriter& fixedWriter) const {
if (input.empty()) {
fixedWriter.setEmpty();
return;
Expand Down
4 changes: 2 additions & 2 deletions velox/functions/prestosql/Reverse.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ class ReverseFunction : public exec::VectorFunction {
const FlatVector<StringView>* input,
FlatVector<StringView>* result) {
rows.applyToSelected([&](int row) {
auto proxy = exec::StringWriter<>(result, row);
auto proxy = exec::StringWriter(result, row);
stringImpl::reverse<isAscii>(proxy, input->valueAt(row).getString());
proxy.finalize();
});
Expand Down Expand Up @@ -100,7 +100,7 @@ class ReverseFunction : public exec::VectorFunction {
if (originalArg->isConstantEncoding()) {
auto value = originalArg->as<ConstantVector<StringView>>()->valueAt(0);

auto proxy = exec::StringWriter<>(flatResult, rows.begin());
auto proxy = exec::StringWriter(flatResult, rows.begin());
if (isAscii) {
stringImpl::reverse<true>(proxy, value.str());
} else {
Expand Down
Loading

0 comments on commit 7639307

Please sign in to comment.