[FEA] Implement shared parquet footer processing for Spark-RAPIDS #17716

GregoryKimball · 2025-01-10T21:03:52Z

Is your feature request related to a problem? Please describe.
As of 25.02, the parquet TableScan stack in Spark-RAPIDS parses each file footer three times. First, spark-rapids-jni parses the footer in C++ (code pointer). Second, parquet-mr parses it in Java to process predicate pushdown. Third, Spark-RAPIDS writes a custom footer, which cuDF then parses.
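
For context on what each parse repeats: an unencrypted Parquet file ends with the serialized footer metadata, then a 4-byte little-endian footer length, then the magic bytes `PAR1`. The sketch below is not Spark-RAPIDS or cuDF code. It is a minimal Python illustration, using hypothetical names (`read_footer_bytes`, `SharedFooter`, `make_fake_parquet`), of locating the footer once and handing the same parsed object to every consumer. The `parse` callable is an assumption standing in for a real Thrift decoder.

```python
import io
import struct

MAGIC = b"PAR1"  # trailing magic of an unencrypted Parquet file


def read_footer_bytes(f):
    """Return the raw (still Thrift-encoded) footer bytes of a Parquet file."""
    # The last 8 bytes are: 4-byte little-endian footer length + magic.
    f.seek(-8, io.SEEK_END)
    tail = f.read(8)
    if len(tail) != 8 or tail[4:] != MAGIC:
        raise ValueError("not a Parquet file (missing trailing PAR1 magic)")
    (footer_len,) = struct.unpack("<I", tail[:4])
    # The footer sits immediately before those 8 bytes.
    f.seek(-(8 + footer_len), io.SEEK_END)
    return f.read(footer_len)


class SharedFooter:
    """Hypothetical holder: parse the footer once, reuse the result."""

    def __init__(self, f, parse):
        self._f = f
        self._parse = parse      # assumed decoder: bytes -> metadata object
        self._parsed = None
        self.parse_count = 0     # only here to make the sharing observable

    def get(self):
        if self._parsed is None:
            self._parsed = self._parse(read_footer_bytes(self._f))
            self.parse_count += 1
        return self._parsed


def make_fake_parquet(footer: bytes) -> bytes:
    """Build a byte string with Parquet framing around an opaque footer."""
    return (MAGIC + b"column-chunk-bytes" + footer
            + struct.pack("<I", len(footer)) + MAGIC)
```

In this sketch, a predicate-pushdown step and a reader would both call `SharedFooter.get()` and receive the same object, so `parse_count` stays at 1 no matter how many consumers there are.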

Describe the solution you'd like
Starting with the single-file read case, we can reduce the time spent in footer processing.

| topic | owner | status |
| --- | --- | --- |
| Refactor Spark-RAPIDS to use `cudf::io::read_parquet_metadata` instead of `rapids::jni::deserialize_parquet_footer` | Spark-RAPIDS | |
| Add features and refactor `cudf::...aggregate_reader_metadata` to meet the needs of Spark-RAPIDS | cuDF | |
| Refactor Spark-RAPIDS to use cuDF's predicate pushdown utility `filter_row_groups` (code pointer) instead of parquet-mr | Spark-RAPIDS | |
| Add features and refactor `filter_row_groups` to accommodate Spark conventions | cuDF | |
| Update the cuDF chunked parquet reader to accept `aggregate_reader_metadata` as a reader option to skip parsing the file footer | cuDF | |
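
The row-group filtering items in the table above rely on the per-column min/max statistics stored in the footer. As a rough illustration of that idea only (this is not the signature or behavior of cuDF's `filter_row_groups`, and `RowGroupStats` and `prune_row_groups` are hypothetical names), a range predicate can skip any row group whose statistics cannot overlap the requested range:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics for a single integer column."""
    index: int
    min_value: Optional[int]  # None when the writer recorded no statistics
    max_value: Optional[int]


def prune_row_groups(stats: List[RowGroupStats], lo: int, hi: int) -> List[int]:
    """Return indices of row groups that may contain values in [lo, hi]."""
    keep = []
    for rg in stats:
        if rg.min_value is None or rg.max_value is None:
            # Without statistics nothing can be ruled out, so read it.
            keep.append(rg.index)
        elif rg.max_value >= lo and rg.min_value <= hi:
            # The [min, max] interval overlaps the predicate range.
            keep.append(rg.index)
    return keep
```

Row groups without statistics are always kept, so pruning can only skip data that provably cannot match.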

Additional context
This issue is part of a larger effort to refactor TableScan in Spark-RAPIDS to improve support for high-selectivity queries.


GregoryKimball added the cuIO, feature request, and libcudf labels Jan 10, 2025

GregoryKimball added this to the Parquet continuous improvement milestone Jan 10, 2025

GregoryKimball added this to libcudf Jan 10, 2025

revans2 mentioned this issue Jan 10, 2025

[FEA] Work with CUDF to enable footer parsing of parquet files. NVIDIA/spark-rapids#11954

Open
