[FEA] Implement shared parquet footer processing for Spark-RAPIDS #17716

Open
GregoryKimball opened this issue Jan 10, 2025 · 0 comments
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Is your feature request related to a problem? Please describe.
As of 25.02, the parquet TableScan stack in Spark-RAPIDS parses each file footer three times. First, the footer is parsed in C++ by spark-rapids-jni (code pointer). Second, the footer is parsed in Java by parquet-mr to process predicate pushdown. Third, Spark-RAPIDS writes a custom footer, which is then parsed by cuDF.
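
For readers less familiar with what "parsing the footer" involves, here is a minimal sketch of the first step each of those three parsers has to perform: locating the footer bytes. It relies only on the Parquet file layout for unencrypted files (the file ends with a 4-byte little-endian footer length followed by the `PAR1` magic). The function name `locate_footer` is hypothetical and is not part of cuDF, spark-rapids-jni, or parquet-mr; decoding the Thrift-encoded metadata that follows is the expensive part and is not shown.

```python
import struct

PARQUET_MAGIC = b"PAR1"  # magic bytes at both ends of an unencrypted Parquet file


def locate_footer(file_bytes: bytes) -> tuple[int, int]:
    """Return (offset, length) of the serialized footer metadata.

    Hypothetical helper for illustration only.
    """
    if (
        len(file_bytes) < 12
        or file_bytes[:4] != PARQUET_MAGIC
        or file_bytes[-4:] != PARQUET_MAGIC
    ):
        raise ValueError("not an unencrypted Parquet file")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", file_bytes[-8:-4])
    footer_start = len(file_bytes) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length exceeds file size")
    return footer_start, footer_len
```

Every parser in the stack repeats this lookup and then deserializes the same bytes, which is the redundancy this issue targets.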

Describe the solution you'd like
Starting with the single-file read case, we can reduce the time spent in footer processing.

| Topic | Owner | Status |
| --- | --- | --- |
| Refactor Spark-RAPIDS to use cudf::io::read_parquet_metadata instead of rapids::jni::deserialize_parquet_footer | Spark-RAPIDS | |
| Add features and refactor cudf::...aggregate_reader_metadata to meet the needs of Spark-RAPIDS | cuDF | |
| Refactor Spark-RAPIDS to use cuDF's predicate pushdown utility filter_row_groups (code pointer) instead of parquet-mr | Spark-RAPIDS | |
| Add features and refactor filter_row_groups to accommodate Spark conventions | cuDF | |
| Update the cuDF chunked parquet reader to accept aggregate_reader_metadata as a reader option to skip parsing the file footer | cuDF | |
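
The intent of the items above can be sketched as "parse once, then reuse the parsed metadata for both predicate pushdown and reading". The following is a conceptual model in plain Python, not cuDF or Spark-RAPIDS code: `RowGroupStats`, `ParsedFooter`, `FooterCache`, `select_row_groups_greater_than`, and `fake_parse` are all hypothetical names, and the pruning rule shown (keep a row group if its max exceeds the threshold) is only one simple example of statistics-based filtering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical per-row-group min/max statistics for one column."""
    index: int
    min_value: int
    max_value: int


@dataclass(frozen=True)
class ParsedFooter:
    """Hypothetical stand-in for already-parsed footer metadata."""
    row_groups: tuple


class FooterCache:
    """Parse each file's footer at most once and hand out the shared result."""

    def __init__(self, parse_fn):
        self._parse_fn = parse_fn
        self._cache = {}
        self.parse_count = 0  # number of times a real parse was performed

    def get(self, path):
        if path not in self._cache:
            self.parse_count += 1
            self._cache[path] = self._parse_fn(path)
        return self._cache[path]


def select_row_groups_greater_than(footer, threshold):
    """Return indices of row groups that may contain values > threshold."""
    return [rg.index for rg in footer.row_groups if rg.max_value > threshold]


def fake_parse(path):
    """Placeholder parser returning fixed statistics (assumption for the demo)."""
    return ParsedFooter(row_groups=(
        RowGroupStats(0, 0, 9),
        RowGroupStats(1, 10, 19),
        RowGroupStats(2, 20, 29),
    ))
```

In this model, the pushdown step and the reader both call `FooterCache.get` with the same path, so the footer is deserialized a single time regardless of how many consumers need it.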

Additional context
This issue is part of larger work to refactor TableScan in Spark-RAPIDS to improve support for high selectivity queries.
