GH-38432: [C++][Parquet] Try to fix performance regression in the DictByteArrayDecoderImpl #38784

Merged
merged 11 commits into from
Nov 27, 2023

Conversation

mapleFU
Member

@mapleFU mapleFU commented Nov 19, 2023

Rationale for this change

Make some of the changes mentioned in #38432.

I believe this might fix #38577

Problem1:

The BinaryHelper might call Prepare() or Prepare(estimated-output-binary-length) for the data. The estimate behaves differently depending on the encoding:

  1. For plain-encoded ByteArray, len_ is close to the data-page size, so reserving based on it is reasonable.
  2. For dictionary encoding, the data page is just an RLE-encoded page of indices, so its len_ may not be directly related to the size of the decoded binary output.
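
A minimal sketch of that distinction (hedged; the function and parameter names below are hypothetical, not the actual BinaryHelper code):

#include <cstdint>

// Sketch only: why reserving based on the page payload size (len_) is
// reasonable for plain-encoded ByteArray pages but not for dictionary pages.
int64_t EstimatedOutputBytes(bool is_plain_encoded, int64_t page_payload_len,
                             int64_t known_output_bytes) {
  if (is_plain_encoded) {
    // Plain ByteArray pages store a 4-byte length prefix plus the value bytes,
    // so the payload size tracks the decoded output size closely.
    return page_payload_len;
  }
  // Dictionary data pages hold only RLE/bit-packed indices, so their payload
  // size says little about how many bytes the decoded values will occupy.
  return known_output_bytes;
}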

Problem2:

Prepare uses ::arrow::kBinaryMemoryLimit as the minimum value; it should use this->chunk_space_remaining_ instead.
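
Roughly, the intended clamping looks like the following sketch (illustrative only; estimated_binary_length is a placeholder name, while chunk_space_remaining_, acc_ and the builder are the members referenced elsewhere in this PR):

// Clamp the reservation by the space left in the current output chunk rather
// than by the global ::arrow::kBinaryMemoryLimit.
const int64_t to_reserve =
    std::min<int64_t>(estimated_binary_length, this->chunk_space_remaining_);
RETURN_NOT_OK(acc_->builder->ReserveData(to_reserve));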

Problem3:

std::optional<int64_t> is hard for some compilers to optimize.
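
A small standalone illustration of the concern (hypothetical functions, not the Arrow code):

#include <cstdint>
#include <optional>

// A defaulted std::optional parameter means every call site materializes an
// optional (engaged flag plus payload), which some compilers fail to fold away
// on a hot path.
int64_t WithOptional(int64_t len, std::optional<int64_t> estimate = std::nullopt) {
  return estimate.value_or(len);
}

// A plain-integer overload gives the caller nothing extra to construct.
int64_t WithoutOptional(int64_t len) { return len; }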

What changes are included in this PR?

Document the behavior of BinaryHelper and try to fix it.

Are these changes tested?

No

Are there any user-facing changes?

Regression fixes

@mapleFU
Member Author

mapleFU commented Nov 19, 2023

@rok @pitrou I tried to figure out how the regression happened and added a comment about it. Would you mind taking a look?

@rok
Member

rok commented Nov 20, 2023

Thanks for working on this @mapleFU !

@github-actions github-actions bot added the awaiting changes label and removed the awaiting review label Nov 20, 2023
@mapleFU
Member Author

mapleFU commented Nov 20, 2023

@ursabot please benchmark

@ursabot

ursabot commented Nov 20, 2023

Benchmark runs are scheduled for commit c75517b. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@github-actions github-actions bot added the awaiting change review label and removed the awaiting changes label Nov 20, 2023
@wgtmac wgtmac changed the title from "GH-38432: [C++][Parquet] Encoding: Dict Arrow Decoder Regression Fix" to "GH-38432: [C++][Parquet] Fix regression in the DictByteArrayDecoderImpl" Nov 20, 2023

Thanks for your patience. Conbench analyzed the 5 benchmarking runs that have been run so far on PR commit c75517b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Member

@wgtmac wgtmac left a comment


How should we interpret the benchmark report to verify the regression is fixed?

@jorisvandenbossche
Member

How should we interpret the benchmark report to verify the regression is fixed?

The problem is that the machine ursa-i9-9960x is currently not running (not fully sure why), and it's on this one that the affected benchmark runs (see #38437 (comment) in the previous PR, where I linked to the relevant benchmark run back then).

@mapleFU
Member Author

mapleFU commented Nov 20, 2023

Would you mind helping re-run or trigger that benchmark? @jorisvandenbossche 🤔

@mapleFU
Member Author

mapleFU commented Nov 20, 2023

@ursabot please benchmark

@ursabot

ursabot commented Nov 20, 2023

Benchmark runs are scheduled for commit f74b4c1. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou
Member

pitrou commented Nov 20, 2023

@mapleFU Can you please make your PR message more descriptive? It should help understand what this PR is about.

@pitrou
Member

pitrou commented Nov 20, 2023

Also, I don't understand why this would fix dictionary decoding, and why this would be ok for non-dictionary decoding. This lacks a serious analysis IMHO.


Thanks for your patience. Conbench analyzed the 5 benchmarking runs that have been run so far on PR commit f74b4c1.

There were 6 benchmark results indicating a performance regression:

The full Conbench report has more details.

@mapleFU
Member Author

mapleFU commented Nov 20, 2023

Also, I don't understand why this would fix dictionary decoding, and why this would be ok for non-dictionary decoding. This lacks a serious analysis IMHO.

This is straightforward:

  1. len_ is the size of the page's data payload.
  2. PlainByteArrayDecoder's len_ is similar to the final output size (each record carries a length prefix plus its bytes). But for the dictionary decoder, len_ may be unrelated to the decoded ByteArray size, which can make the program reserve unnecessary memory. Moreover, it doesn't decrease len_ after each decode.
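
As a made-up numeric example: a dictionary-encoded data page holding 100,000 RLE/bit-packed indices might only be a few kilobytes, while the dictionary strings those indices expand to could occupy many megabytes once decoded, so a reservation sized from that page's len_ can be far off the real output size.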

@mapleFU
Member Author

mapleFU commented Nov 24, 2023

@ursabot please benchmark

@ursabot

ursabot commented Nov 24, 2023

Benchmark runs are scheduled for commit 60cdb80. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@mapleFU mapleFU marked this pull request as ready for review November 24, 2023 15:03
@mapleFU
Member Author

mapleFU commented Nov 24, 2023

@jorisvandenbossche I think the problem is fixed, aha

See https://conbench.ursa.dev/compare/runs/f12445ad2fb84f32a924a483ab75e0db...a478ff692d97498eade4936be3d54fcf/

I'll merge this later if there are no objections.

RETURN_NOT_OK(acc_->builder->Reserve(entries_remaining_));
}
return Status::OK();
}
Member


What difference does it make to have a separate method?

Member Author

@mapleFU mapleFU Nov 24, 2023


https://godbolt.org/z/5bf1K55a5

I found that it might not get optimized on the hot path. Prepare is not in the hot path, but PrepareNextInput is.

@pitrou
Member

pitrou commented Nov 24, 2023

@mapleFU Please don't merge a change if you don't understand how it works.

@pitrou
Member

pitrou commented Nov 24, 2023

Note that @jorisvandenbossche above didn't test the same changes...

@mapleFU
Member Author

mapleFU commented Nov 24, 2023

#38784 (comment)

@pitrou I think the critical reason is here. The compiler has a hard time optimizing:

Status PrepareNextInput(int64_t, std::optional<int64_t> = std::nullopt)

The function above is called frequently on a hot path. The other two reasons relate to #38577.
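
Roughly, the split being discussed looks like the following sketch (an illustration under that assumption, not the exact code in this PR; PushChunk is a hypothetical helper name, while acc_, builder and chunk_space_remaining_ are the members referenced elsewhere in the thread):

// Cold path: called once per chunk, so constructing the std::optional here is
// harmless.
Status Prepare(std::optional<int64_t> estimated_data_length = std::nullopt) {
  if (estimated_data_length.has_value()) {
    RETURN_NOT_OK(acc_->builder->ReserveData(
        std::min<int64_t>(*estimated_data_length, this->chunk_space_remaining_)));
  }
  return Status::OK();
}

// Hot path: called per decoded value, with a plain int64_t parameter so the
// caller has nothing extra to construct or branch on.
Status PrepareNextInput(int64_t next_value_length) {
  if (ARROW_PREDICT_FALSE(next_value_length > this->chunk_space_remaining_)) {
    RETURN_NOT_OK(PushChunk());  // hypothetical: roll over to a new output chunk
  }
  return Status::OK();
}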

@mapleFU
Member Author

mapleFU commented Nov 24, 2023

I've tried a benchmark here: #38784 (comment)

@pitrou
Member

pitrou commented Nov 24, 2023

@pitrou I think the critical reason is here. The compiler is hard to optimize:

Are you able to measure a difference between those two versions of the code? Is there a micro-benchmark?

@mapleFU
Member Author

mapleFU commented Nov 24, 2023

Sure, please wait for a minute.

Also, I'm trying LLVM 17; maybe I can simplify the case using Godbolt.


Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 60cdb80.

There were 2 benchmark results indicating a performance regression:

The full Conbench report has more details.

@mapleFU
Member Author

mapleFU commented Nov 24, 2023

The current benchmark is out. @jorisvandenbossche would you mind taking a look?

@pitrou https://godbolt.org/z/5bf1K55a5 I used the same compiler options, and the std::optional version seems to emit more instructions on the hot path?

@github-actions github-actions bot added the awaiting changes label and removed the awaiting merge label Nov 24, 2023
@jorisvandenbossche
Member

Strangely, on the latest version I don't directly see any effect locally, but at least the online benchmarks now seem to confirm some improvement (several of the similar benchmarks are consistent): for example https://conbench.ursa.dev/compare/benchmark-results/0655f5af97857ba780005944ff57195f...06560d57f1f772cc800009452aab4863/

@mapleFU
Member Author

mapleFU commented Nov 24, 2023

The difference between the current version and the 13.0.0 version is that it calls BinaryBuilder->Reserve() more often. I don't think that should make performance worse, so I reverted some changes in 60cdb80. Maybe we can reconsider this later.

This patch also separates out PrepareNextInput. Maybe it's a compiler-specific problem: I checked a similar case in quick-bench, and removing the std::optional<> argument doesn't benefit performance with gcc 12.3 and C++17. However, the benchmark here shows that performance does improve. Maybe we can ask some C++ experts for help.

You can decide how to handle this patch later. Maybe I'm going a bit crazy from spending so long on this without knowing why, but at least we found that:

  1. Reserve might be related to the problem, but on ursabot it doesn't affect performance much.
  2. PrepareNextInput in the patch might prevent the compiler from doing some optimizations in the ursabot benchmark.

@mapleFU
Member Author

mapleFU commented Nov 25, 2023

Will wait until Monday night to see if there are any further requests for review.

@mapleFU
Member Author

mapleFU commented Nov 27, 2023

@pitrou Would you like to check this in?

@pitrou pitrou changed the title from "GH-38432: [C++][Parquet] Trying to Fix regression in the DictByteArrayDecoderImpl" to "GH-38432: [C++][Parquet] Try to fix performance regression in the DictByteArrayDecoderImpl" Nov 27, 2023
@pitrou pitrou merged commit 6070815 into apache:main Nov 27, 2023
39 of 40 checks passed
@pitrou pitrou removed the awaiting changes label Nov 27, 2023
@mapleFU mapleFU deleted the dict-decoder-regression-fix branch November 27, 2023 17:27

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 6070815.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

raulcd pushed a commit that referenced this pull request Nov 28, 2023
…tByteArrayDecoderImpl (#38784)

dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…he DictByteArrayDecoderImpl (apache#38784)


Successfully merging this pull request may close these issues.

Reading parquet file behavior change from 13.0.0 to 14.0.0
[C++] Parquet reading performance regressions
7 participants