Is your feature request related to a problem? Please describe.
As of 25.02, the Parquet TableScan stack in Spark-RAPIDS parses each file footer three times. First, the footer is parsed in C++ by spark-rapids-jni (code pointer). Second, the footer is parsed in Java by parquet-mr to process predicate pushdown. Third, Spark-RAPIDS writes a custom footer, which is then parsed by cuDF.
Describe the solution you'd like
Starting with the single-file read case, we can reduce the time spent in footer processing.
| topic | owner | status |
| --- | --- | --- |
| Refactor Spark-RAPIDS to use `cudf::io::read_parquet_metadata` instead of `rapids::jni::deserialize_parquet_footer` | Spark-RAPIDS | |
| Add features and refactor `cudf::...aggregate_reader_metadata` to meet the needs of Spark-RAPIDS | cuDF | |
| Refactor Spark-RAPIDS to use cuDF's predicate pushdown utility `filter_row_groups` (code pointer) instead of parquet-mr | Spark-RAPIDS | |
| Add features and refactor `filter_row_groups` to accommodate Spark conventions | cuDF | |
| Update the cuDF chunked parquet reader to accept `aggregate_reader_metadata` as a reader option to skip parsing the file footer | cuDF | |
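
To make the intended flow concrete, here is a minimal, library-free sketch of the "parse once, filter row groups, hand the parsed metadata to the reader" idea. It is a toy model only: `FooterMetadata`, `RowGroupStats`, `parse_footer`, `select_row_groups`, and `read_row_groups` are hypothetical names invented for illustration and do not reflect the actual cuDF, spark-rapids-jni, or parquet-mr APIs. It also assumes a single integer column with min/max statistics per row group.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RowGroupStats:
    """Hypothetical min/max statistics for one column in one row group."""
    index: int
    min_value: int
    max_value: int
    num_rows: int


@dataclass(frozen=True)
class FooterMetadata:
    """Hypothetical stand-in for a parsed Parquet footer."""
    row_groups: List[RowGroupStats]


parse_count = 0  # counts how many times the footer is parsed


def parse_footer(raw: List[Tuple[int, int, int]]) -> FooterMetadata:
    """Pretend to parse a footer; `raw` holds (min, max, num_rows) per row group."""
    global parse_count
    parse_count += 1
    return FooterMetadata(
        [RowGroupStats(i, lo, hi, n) for i, (lo, hi, n) in enumerate(raw)]
    )


def select_row_groups(meta: FooterMetadata,
                      lower: Optional[int] = None,
                      upper: Optional[int] = None) -> List[int]:
    """Keep row groups whose [min, max] range can overlap [lower, upper]."""
    kept = []
    for rg in meta.row_groups:
        if lower is not None and rg.max_value < lower:
            continue  # every value in this row group is below the range
        if upper is not None and rg.min_value > upper:
            continue  # every value in this row group is above the range
        kept.append(rg.index)
    return kept


def read_row_groups(meta: FooterMetadata, indices: List[int]) -> int:
    """Stand-in for a reader that accepts pre-parsed metadata.

    Returns the number of rows that would be decoded; it never re-parses.
    """
    return sum(meta.row_groups[i].num_rows for i in indices)


# Single-file case: parse once, then reuse the same object for both steps.
raw_footer = [(0, 99, 1000), (100, 199, 1000), (200, 299, 500)]
meta = parse_footer(raw_footer)
selected = select_row_groups(meta, lower=150, upper=250)
rows_to_decode = read_row_groups(meta, selected)
```

In this sketch the same `meta` object serves both the pushdown step and the read step, so the footer is parsed once rather than three times. The real work items above concern whether cuDF's metadata and filtering utilities can play these roles while respecting Spark conventions.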
Additional context
This issue is part of a larger effort to refactor TableScan in Spark-RAPIDS to improve support for high-selectivity queries.