Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38432: [C++][Parquet] Trying to use original logic of BinaryBuilder #38437

Closed
wants to merge 4 commits into from

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Oct 24, 2023

Rationale for this change

Do some changes mentioned in #38432

What changes are included in this PR?

  1. for std::optional argument, it might be used in loop, so separate it
  2. Remove some Prepare.

Are these changes tested?

Already has tests

Are there any user-facing changes?

no

@github-actions
Copy link

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@mapleFU mapleFU changed the title [C++][Parquet] GH-38432: [DNM] Trying to use original logic of BinaryBuilder MINOR: [C++][Parquet] Trying to use original logic of BinaryBuilder Oct 24, 2023
@raulcd
Copy link
Member

raulcd commented Oct 24, 2023

@ursabot please benchmark

@ursabot
Copy link

ursabot commented Oct 24, 2023

Benchmark runs are scheduled for commit 2ff9ab4. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@mapleFU
Copy link
Member Author

mapleFU commented Oct 24, 2023

@ursabot please benchmark

@ursabot
Copy link

ursabot commented Oct 24, 2023

Benchmark runs are scheduled for commit eba5044. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow
Copy link

Thanks for your patience. Conbench analyzed the 0 benchmarking runs that have been run so far on PR commit 2ff9ab4.

None of the specified runs were found on the Conbench server.

The full Conbench report has more details.

@conbench-apache-arrow
Copy link

Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit eba5044.

There were 4 benchmark results indicating a performance regression:

The full Conbench report has more details.

@jorisvandenbossche
Copy link
Member

Looking at the benchmark run, there is a clear improvement in the affected file-read benchmarks. Asking for the comparison of the benchmarked commit with the PR base: https://conbench.ursa.dev/compare/runs/b9dce8c2b76f4c278256da7797cb932d...a810bbfce49c4c238686b2097866fc39/, and sorting by positive z score:

image

So this seems to solve the issue that is discussed in #38432

@mapleFU
Copy link
Member Author

mapleFU commented Oct 27, 2023

I didn't find some stable improvement from local benchmark, profile and others

But maybe we can first move forward.

@mapleFU mapleFU marked this pull request as ready for review October 27, 2023 08:09
@mapleFU mapleFU requested a review from wgtmac as a code owner October 27, 2023 08:09
@mapleFU
Copy link
Member Author

mapleFU commented Oct 27, 2023

@jorisvandenbossche @pitrou

I've revert some BinaryHelper change. I don't know the reason because I don't have the regression workload. Maybe you can take a look

@mapleFU mapleFU changed the title MINOR: [C++][Parquet] Trying to use original logic of BinaryBuilder GH-38432: [C++][Parquet] Trying to use original logic of BinaryBuilder Oct 27, 2023
@github-actions
Copy link

⚠️ GitHub issue #38432 has been automatically assigned in GitHub to PR creator.

std::optional<int64_t> estimated_remaining_data_length = {}) {
Status PrepareNextInput(int64_t next_value_length) {
if (ARROW_PREDICT_FALSE(!CanFit(next_value_length))) {
// This element would exceed the capacity of a chunk
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part doesn't call RETURN_NOT_OK(acc_->builder->Reserve(entries_remaining_));, this keeps some logic same as before https://github.com/apache/arrow/pull/14341/files , it leave Prepare() to allocate the data and don't push when chunk is switched. But I don't know if this is right

return Status::OK();
}

Status PrepareNextInputWithEstimatedLength(int64_t next_value_length,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split into two functions to make it fast and clear

@@ -1915,7 +1923,6 @@ class DictByteArrayDecoderImpl : public DictDecoderImpl<ByteArrayType>,
int32_t indices[kBufferSize];

ArrowBinaryHelper<ByteArrayType> helper(out, num_values);
RETURN_NOT_OK(helper.Prepare());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though I think this Prepare is ok, original code doesn't have this Prepare, perhaps it would make entries large?

@@ -1983,7 +1990,6 @@ class DictByteArrayDecoderImpl : public DictDecoderImpl<ByteArrayType>,
int values_decoded = 0;

ArrowBinaryHelper<ByteArrayType> helper(out, num_values);
RETURN_NOT_OK(helper.Prepare(len_));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

@pitrou
Copy link
Member

pitrou commented Nov 8, 2023

Can you please explain clearly why you're doing these changes? It's not obvious at all that they are correct.

@mapleFU
Copy link
Member Author

mapleFU commented Nov 8, 2023

That's because:

  1. [C++] Parquet reading performance regressions #38432 This issue mentioned a performance regression when using 14.0 and Find that DELTA_BYTE_ARRAY causing build arrow slower. After my change the regression disappear but I don't know why =_=
  2. I think Reading parquet file behavior change from 13.0.0 to 14.0.0 #38577 also raise from same reason, but I don't know if is ok to fix it.

@mapleFU mapleFU closed this Nov 19, 2023
@mapleFU mapleFU deleted the test-binary-helper branch November 19, 2023 07:44
@mapleFU
Copy link
Member Author

mapleFU commented Nov 19, 2023

I've created a simpler patch for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++] Parquet reading performance regressions
5 participants