-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-37470: [Python][Parquet] Add missing arguments to ParquetFileWriteOptions
#37469
Conversation
Thanks for opening a pull request! If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project. Then could you also rename the pull request title in the following format?
or
In the case of PARQUET issues on JIRA the title also supports:
See also: |
Would you mind create a Github issue, and rename the title to:
|
write_page_index
to be set for Datasetswrite_page_index
to be set for Datasets
|
write_page_index
to be set for Datasetswrite_page_index
to be set for Datasets
write_page_index
to be set for DatasetsParquetFileWriteOptions
35754e0
to
a041634
Compare
a041634
to
3ec080b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind add a test for write_page_index
? Since we need to ensure page index would be written.
Besides, disable statistics for column can force only write offset-index, maybe we can just test them
I've added a test for this. |
Great! I'll take a careful round later. By the way we can disable statistics to just disable Column Index but write offset index. Maybe this could be add to doc later. |
Are you saying |
That is indeed what you were saying as the test shows! That behaviour strikes me as quite unintuitive! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is ok here, however, I think we should export the flag to parquet/core.py
, and from that to _dataset_parquet.pyx
. You can take a look at your another patch ( #37665 ). Since just a _dataset_parquet
config is hard to use for users.
Cause I'm the author of that behavior lol ( #35958 ) When disable collect statistics, currently it would be hard to collect the Also, by the way, by default, if column index is enabled, the page header statistics will not be written. (Since spec says if column index exists, page header is not tent to be written) |
Yeah, that makes sense! Would be good to clarify in the docs - will see if I get to it in another PR.
Yeah, what gets written and what doesn't is quite confusing. In fact the spec doesn't techincally say that the page-level statistics when writing the ColumnIndex but that it isn't recommended (one might want both in order to support old readers). It's probably sensible default behaviour but it'd be nice to force being able to write both. |
The flags are already there. They were just never added to the arrow/python/pyarrow/parquet/core.py Line 976 in e56726c
|
f86d7da
to
c8ef06f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for late reply. Finally I got this. User can write the data with these options in ParquetWriter python sdk, but cannot use them with ParquetFileWriteOptions
.
This is LGTM.
@jorisvandenbossche @AlenkaF Would you mind take a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
One thought: would it be worth adding docstrings for |
Any updates? Should we rerun |
c8ef06f
to
73405ec
Compare
Thanks, I've done this now.
I can try to make time to get to this but it feels like a pretty independent issue - especially as the |
Agree totally, it deserves a special issue for sure! What you propose sounds great, thank you! |
|
All test passed now, can we try to checkin this patch? :-) |
Great, will merge 👍 |
After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit f52ebbb. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them. |
…leWriteOptions` (apache#37469) ### Rationale for this change I think this may have been missed when this feature was added. ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: apache#37470 Authored-by: Judah Rand <[email protected]> Signed-off-by: AlenkaF <[email protected]>
### Rationale for this change Picking up where #35453 left off. Closes #35331 This PR builds on top of #37469 ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: #35331 Lead-authored-by: Judah Rand <[email protected]> Co-authored-by: Will Jones <[email protected]> Signed-off-by: AlenkaF <[email protected]>
### Rationale for this change Picking up where apache#35453 left off. Closes apache#35331 This PR builds on top of apache#37469 ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: apache#35331 Lead-authored-by: Judah Rand <[email protected]> Co-authored-by: Will Jones <[email protected]> Signed-off-by: AlenkaF <[email protected]>
…leWriteOptions` (apache#37469) ### Rationale for this change I think this may have been missed when this feature was added. ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: apache#37470 Authored-by: Judah Rand <[email protected]> Signed-off-by: AlenkaF <[email protected]>
### Rationale for this change Picking up where apache#35453 left off. Closes apache#35331 This PR builds on top of apache#37469 ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: apache#35331 Lead-authored-by: Judah Rand <[email protected]> Co-authored-by: Will Jones <[email protected]> Signed-off-by: AlenkaF <[email protected]>
Rationale for this change
I think this may have been missed when this feature was added.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
write_page_index
to be set for Datasets #37470