[Python][Parquet] Parquet Support write and validate CRC #37242
Comments
@danepitkin @AlenkaF This seems useful to expose in Python indeed.
@frazar Would you mind commenting "take" here? GitHub can only assign an issue to someone who has replied to it.
I'll take this!
AlenkaF added a commit that referenced this issue Nov 20, 2023
…RC (#38360)

### Rationale for this change

The C++ Parquet API already supports enabling CRC checksums for read and write operations. CRC checksums are optional and can detect data corruption due to, for example, file storage issues or [cosmic rays](https://en.wikipedia.org/wiki/Soft_error). It would therefore be beneficial to expose this optional functionality in the Python API too.

This PR is based on a previous PR which became stale: #37439

### What changes are included in this PR?

The PyArrow interface is expanded to include a `page_checksum_enabled` flag.

### Are these changes tested?

[ ] NOT YET!

### Are there any user-facing changes?

The change is backward compatible. An additional, optional keyword argument is added to some interfaces.

Closes #37242
Supersedes #37439

* Closes: #37242

Lead-authored-by: Francesco Zardi <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Signed-off-by: AlenkaF <[email protected]>
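For readers who want to try the feature, here is a minimal sketch of a write/read round trip with page checksums. The commit message above names the flag `page_checksum_enabled`; the keyword names used below (`write_page_checksum` on `pq.write_table` and `page_checksum_verification` on `pq.read_table`) are an assumption based on released PyArrow versions (15.0 and later) and may differ in your installation. The helper name `roundtrip_with_checksums` is hypothetical.

```python
import os
import tempfile

try:
    import pyarrow as pa
    import pyarrow.parquet as pq
except ImportError:  # PyArrow is not installed; the helper below cannot be used
    pa = pq = None


def roundtrip_with_checksums(path):
    """Write a small table with page checksums, then read it back verifying them.

    Assumes PyArrow >= 15.0, where these keyword arguments are available.
    """
    table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
    # Ask the writer to store a CRC for every data page.
    pq.write_table(table, path, write_page_checksum=True)
    # Ask the reader to recompute and compare the CRC of each page it reads.
    return pq.read_table(path, page_checksum_verification=True)
```

Both options are off by default, so existing code keeps its current behavior. Calling `roundtrip_with_checksums(os.path.join(tempfile.mkdtemp(), "example.parquet"))` should return the same three-row table; a page whose stored checksum does not match is expected to raise an error on read instead of returning bad data.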
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…Page CRC (apache#38360)
Describe the enhancement requested
The C++ Parquet API already supports CRC checksums for both reading and writing.
Systems like S3 protect the integrity of stored data, but other storage such as HDDs or SSDs can corrupt data, and the network can also return bad results. Having CRC checks would help detect these cases.
It would be good to expose CRC support in the Python API as well.
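To illustrate why a per-page CRC helps, here is a small standalone sketch using only the Python standard library. It assumes the page checksum is a standard CRC-32 (the variant `zlib.crc32` computes); the helper names `page_crc` and `verify_page` and the sample payload are hypothetical and are not part of any Parquet API.

```python
import zlib


def page_crc(page_bytes: bytes) -> int:
    """Return the CRC-32 of a page's bytes as an unsigned 32-bit integer."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def verify_page(page_bytes: bytes, stored_crc: int) -> bool:
    """Recompute the CRC on read and compare it with the value stored at write time."""
    return page_crc(page_bytes) == stored_crc


# Writer side: compute the checksum and store it alongside the page.
page = b"example page payload" * 4
stored = page_crc(page)

# Simulate corruption in storage or transit by flipping a single bit.
corrupted = bytearray(page)
corrupted[10] ^= 0x01
corrupted = bytes(corrupted)

print(verify_page(page, stored))       # True: the intact page matches
print(verify_page(corrupted, stored))  # False: the bit flip is detected
```

A checksum like this only detects corruption; it cannot repair it. Its value is in turning silent data errors into an explicit failure at read time.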
Component(s)
Parquet, Python