
[C++] Parquet reading performance regressions #38432

Closed
jorisvandenbossche opened this issue Oct 24, 2023 · 14 comments · Fixed by #38784

Comments

jorisvandenbossche (Member) commented Oct 24, 2023

Looking at some of the Parquet read benchmarks we have (https://conbench.ursa.dev/c-benchmarks/file-read), two small regressions can be observed over the last few months.

For example, zooming in on that time range for nyctaxi_2010 with snappy compression (https://conbench.ursa.dev/benchmark-results/0653775157e07b6980005bce6e59a2ad/):

[screenshot: nyctaxi_2010 benchmark timeline]

And for fanniemae (https://conbench.ursa.dev/benchmark-results/0653774cc429733d8000b2db91fa63fa/); here only the first bump is noticeable:

[screenshot: fanniemae benchmark timeline]

There are two small bumps, which I think are caused by the following two PRs, in order of the bumps:

jorisvandenbossche (Member, author):

Although I should add that we also have "dataset-read" benchmarks (https://conbench.ursa.dev/c-benchmarks/dataset-read), which also exercise Parquet reading, and there I don't directly see any visible regressions (although those read from S3, and might be less sensitive to small regressions in the reading path).

Also, the "file-read" benchmark actually uses pq.read_table, which has already defaulted to pre_buffer=True for some time. So it can't be that change of default (on the dataset side) that causes the slowdown. The other change in that PR is switching CacheOptions::Defaults() to LazyDefaults().
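For intuition, the difference between the two cache defaults can be sketched with a toy model (plain Python, not Arrow's actual CacheOptions/ReadRangeCache implementation; all names here are illustrative): with eager defaults, registering a range triggers the read immediately, while a lazy cache defers the read until the range is first accessed.

```python
# Toy model of eager vs lazy read-range caching (illustrative only;
# this is not Arrow's C++ implementation).
class ToyReadRangeCache:
    def __init__(self, lazy: bool):
        self.lazy = lazy
        self.registered = []   # ranges the reader announced it will need
        self.fetched = set()   # ranges actually read from storage

    def cache(self, offset: int, length: int) -> None:
        """Register a byte range; eager mode issues the read right away."""
        self.registered.append((offset, length))
        if not self.lazy:
            self.fetched.add((offset, length))

    def read(self, offset: int, length: int) -> bytes:
        """Access a range; lazy mode issues the read on first access."""
        if (offset, length) not in self.fetched:
            self.fetched.add((offset, length))
        return b"\x00" * length  # stand-in for the actual bytes
```

A lazy cache can avoid reads for ranges that are never accessed, but it also shifts when the I/O cost is paid, which is the kind of change that could plausibly show up in a read benchmark.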

rok (Member) commented Oct 24, 2023

@jorisvandenbossche I can't seem to find the exact benchmark that produces this metric.
Is it possible that the new encoding added by #14341 is now used as the default for the benchmark, or was added as another codepath?

jorisvandenbossche (Member, author):

No, that benchmark just uses a default pq.read_table (https://github.com/voltrondata-labs/benchmarks/blob/93a57ba1019f65a4cea124d17772b0e8c60ac5ce/benchmarks/file_benchmark.py#L84), which should not use any of the newer encodings by default, AFAIK.

mapleFU (Member) commented Oct 24, 2023

Hmm, do we have any profiling output for the benchmark? Or what's the simplest way to reproduce this problem?

I think it might be related to caching, since pre-buffering on an SSD might not help. I don't think the encoding is related.

jorisvandenbossche (Member, author):

> I don't think encoding is related.

To be clear, there are potentially two separate issues here (maybe it was not the best idea to open a single issue for both): one related to the encoding change, and one related to the pre_buffer/caching change.

rok (Member) commented Oct 24, 2023

Looking at https://github.com/apache/arrow/pull/14341/files - we abstracted some logic into ArrowBinaryHelper that gets used when decoding plain-encoded string arrays. I don't see other potential culprits.

mapleFU (Member) commented Oct 24, 2023

I'll try to run the micro-benchmarks against https://github.com/apache/arrow/pull/14341/files

mapleFU (Member) commented Oct 24, 2023

I found the DictDecodingByteArray benchmark became slightly slower (about 2% or less); not sure whether that's the cause or the benchmark is just unstable...

Update: DictDecodingByteArray did become slower, let me analyze it...

rok (Member) commented Oct 24, 2023

Looking at the commits in the vicinity, it seems most likely that #14341 caused this, and from the timeline it doesn't look like noise (unless something changed with the benchmarking infra). Conbench comparison.

mapleFU added a commit to mapleFU/arrow that referenced this issue Oct 24, 2023
mapleFU (Member) commented Oct 24, 2023

#38437
@rok @jorisvandenbossche Would you mind running the same benchmark on #38437? I've tried changing the logic of ArrowBinaryBuilder back to its previous behavior, but I'm not sure whether it helps.

mapleFU added a commit to mapleFU/arrow that referenced this issue Oct 24, 2023
rok (Member) commented Oct 24, 2023

On eba5044 I ran:

```
archery benchmark run --output=run-head-1.json HEAD~1 --benchmark-filter=DictDecodingByteArray
archery benchmark diff HEAD run-head-1.json --benchmark-filter=DictDecodingByteArray
```

And actually got a small regression (though it could be noise):

```
--------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------
BM_DictDecodingByteArray/max-string-length:8/batch-size:512           2750 ns         2750 ns       257429 bytes_per_second=1.3859G/s items_per_second=186.194M/s
BM_DictDecodingByteArray/max-string-length:64/batch-size:512          3405 ns         3405 ns       207141 bytes_per_second=5.05129G/s items_per_second=150.383M/s
BM_DictDecodingByteArray/max-string-length:1024/batch-size:512       10643 ns        10642 ns        67129 bytes_per_second=23.1467G/s items_per_second=48.1097M/s
BM_DictDecodingByteArray/max-string-length:8/batch-size:2048          7744 ns         7744 ns        90086 bytes_per_second=1.96492G/s items_per_second=264.454M/s
BM_DictDecodingByteArray/max-string-length:64/batch-size:2048        10882 ns        10882 ns        63705 bytes_per_second=6.29198G/s items_per_second=188.207M/s
BM_DictDecodingByteArray/max-string-length:1024/batch-size:2048      56034 ns        56034 ns        14860 bytes_per_second=17.5095G/s items_per_second=36.5494M/s
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Non-regressions: (5)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                     benchmark       baseline      contender  change %                                                                                                                                                                                                      counters
  BM_DictDecodingByteArray/max-string-length:8/batch-size:2048  1.951 GiB/sec  1.965 GiB/sec     0.706   {'family_index': 0, 'per_family_instance_index': 3, 'run_name': 'BM_DictDecodingByteArray/max-string-length:8/batch-size:2048', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 86664}
BM_DictDecodingByteArray/max-string-length:1024/batch-size:512 23.075 GiB/sec 23.147 GiB/sec     0.310 {'family_index': 0, 'per_family_instance_index': 2, 'run_name': 'BM_DictDecodingByteArray/max-string-length:1024/batch-size:512', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 69531}
 BM_DictDecodingByteArray/max-string-length:64/batch-size:2048  6.302 GiB/sec  6.292 GiB/sec    -0.154  {'family_index': 0, 'per_family_instance_index': 4, 'run_name': 'BM_DictDecodingByteArray/max-string-length:64/batch-size:2048', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 64627}
  BM_DictDecodingByteArray/max-string-length:64/batch-size:512  5.097 GiB/sec  5.051 GiB/sec    -0.901  {'family_index': 0, 'per_family_instance_index': 1, 'run_name': 'BM_DictDecodingByteArray/max-string-length:64/batch-size:512', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 205168}
   BM_DictDecodingByteArray/max-string-length:8/batch-size:512  1.410 GiB/sec  1.386 GiB/sec    -1.730   {'family_index': 0, 'per_family_instance_index': 0, 'run_name': 'BM_DictDecodingByteArray/max-string-length:8/batch-size:512', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 259846}

----------------------------------------------------------------------------------------------------------------------------------------------------------------
Regressions: (1)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
                                                      benchmark       baseline      contender  change %                                                                                                                                                                                                       counters
BM_DictDecodingByteArray/max-string-length:1024/batch-size:2048 21.256 GiB/sec 17.510 GiB/sec   -17.626 {'family_index': 0, 'per_family_instance_index': 5, 'run_name': 'BM_DictDecodingByteArray/max-string-length:1024/batch-size:2048', 'repetitions': 1, 'repetition_index': 0, 'threads': 1, 'iterations': 15530}
```
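As a sanity check on the numbers above: the change % column is just the relative throughput difference between contender and baseline. Recomputing it for the regressed row (the small deviation from the printed -17.626 comes from the GiB/sec values being rounded for display):

```python
# Recompute "change %" for the regressed row from the printed throughputs.
baseline = 21.256    # GiB/sec, HEAD~1
contender = 17.510   # GiB/sec, HEAD
change_pct = (contender - baseline) / baseline * 100  # roughly -17.6
```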

mapleFU (Member) commented Oct 24, 2023

Because DictDecodingByteArray itself is unstable... T_T

mapleFU (Member) commented Oct 24, 2023

> Benchmark runs are scheduled for commit eba5044. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev/ for updates. A comment will be posted here when the runs are complete.

After reverting some changes, it seems dataset-read doesn't become faster. Maybe we can profile the two runs in detail..

jorisvandenbossche (Member, author) commented Oct 27, 2023

> After revert some change seems the dataset-read doesn't become faster.

I posted on the PR as well (#38437 (comment)), but from what I can see, the file-read benchmark did significantly improve with the PR.

mapleFU added a commit to mapleFU/arrow that referenced this issue Nov 19, 2023
mapleFU added a commit to mapleFU/arrow that referenced this issue Nov 20, 2023
pitrou pushed a commit that referenced this issue Nov 27, 2023
…tByteArrayDecoderImpl (#38784)

### Rationale for this change

Make some of the changes mentioned in #38432.

I believe this might fix #38577

Problem 1:

`BinaryHelper` may call `Prepare()` or `Prepare(estimated-output-binary-length)` for the data. This matters because:

1. For plain-encoded ByteArray, `len_` is close to the data-page size, so a `Reserve` based on it is reasonable.
2. For dictionary encoding, the data page is just an RLE-encoded page, so its `len_` may not be directly related to the output binary size.

Problem 2:

`Prepare` clamps against `::arrow::kBinaryMemoryLimit`; it should use `this->chunk_space_remaining_` instead.

Problem 3:

`std::optional<int64_t>` is hard for some compilers to optimize.
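Problem 2 can be illustrated with a small sketch (plain Python rather than Arrow's C++; the function names are made up for illustration): clamping a reservation only against the global binary-memory limit, instead of against the space left in the current chunk, lets a dictionary-encoded page request a far larger reservation than the chunk can actually hold.

```python
KBINARY_MEMORY_LIMIT = 2**31 - 1  # cap on a single binary array chunk

def reserve_before_fix(estimated_len: int, chunk_space_remaining: int) -> int:
    # Clamp only against the global limit: for dict-encoded pages, whose
    # len_ says little about the decoded output size, this can still ask
    # for a near-2GiB reservation.
    return min(estimated_len, KBINARY_MEMORY_LIMIT)

def reserve_after_fix(estimated_len: int, chunk_space_remaining: int) -> int:
    # Clamp against what the current chunk can still hold instead.
    return min(estimated_len, chunk_space_remaining)
```

With a wildly overestimated length and only a few KiB left in the chunk, the first variant reserves the full limit while the second reserves only what remains.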

### What changes are included in this PR?

Document the behavior of BinaryHelper, and try to fix it.

### Are these changes tested?

No

### Are there any user-facing changes?

Regression fixes

* Closes: #38432

Lead-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
@pitrou pitrou added this to the 15.0.0 milestone Nov 27, 2023
@raulcd raulcd modified the milestones: 15.0.0, 14.0.2 Nov 28, 2023
raulcd pushed a commit that referenced this issue Nov 28, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024