row_groups filters does not use min_value/max_value statistics #491

Open
cclienti opened this issue Mar 24, 2020 · 6 comments

Comments

@cclienti

cclienti commented Mar 24, 2020

Hello,

In the file api.py, the function filter_out_stats uses the min and max statistics fields, but these are marked as deprecated in the parquet thrift specification.
https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift line: 201

As some parquet datasets generated by other tools carry only the min_value/max_value statistics, row_groups filtering is not usable on them.

version (conda): fastparquet / 0.3.2 / py37hdd07704_0

Best regards,
Christophe

@martindurant
Member

Thanks for pointing this out and thanks very much, parquet, for this subtle backwards-incompatible change.

Would you like to update fastparquet's thrift definition from the reference, and add min/max_value to every place that min/max is currently used? Presumably we should look to read the min/max_value first and fall back to min/max. A PR would be welcome - I don't know when I could find the time to do this.
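The fallback martindurant describes (read min_value/max_value first, then fall back to the deprecated min/max) could be sketched as below. Note this is a minimal illustration, not fastparquet's actual code: the `Stats` dataclass is a stand-in for the thrift-generated statistics object, and `get_bounds` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stats:
    """Stand-in for the thrift-generated Statistics struct (assumption,
    not fastparquet's real class)."""
    min: Optional[bytes] = None        # deprecated field
    max: Optional[bytes] = None        # deprecated field
    min_value: Optional[bytes] = None  # newer field
    max_value: Optional[bytes] = None  # newer field


def get_bounds(stats: Stats):
    """Return (lower, upper), preferring the newer min_value/max_value
    fields and falling back to the deprecated min/max."""
    lo = stats.min_value if stats.min_value is not None else stats.min
    hi = stats.max_value if stats.max_value is not None else stats.max
    return lo, hi
```

With this shape, a Spark-written string column that populates only min_value/max_value would still yield usable bounds, while older files that set only min/max keep working.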

@cclienti
Author

The thrift parquet code generated in fastparquet also seems to mention these deprecated fields. I wonder why they moved from min to min_value; I looked in the Apache JIRA projects but could not find any real explanation (https://issues.apache.org/jira/browse/IMPALA-4817, in the comments).

I agree, we can check first for min_value and then for min. I can do the PR.

@jalpes196

Hi @martindurant, I have been facing the same issue while reading parquet files written by the Spark parquet writer. Spark sets the min_value and max_value fields but not min and max. On further investigation, it turned out that Spark omits min and max only for string/bytes columns; it does set them for int columns. Correspondingly, the fastparquet reader fails to filter row groups when the filter is on a string/bytes column, but works fine on an int column.

@martindurant
Member

That's right: the parquet norms have moved on a bit, and fastparquet has not kept up. This one, in particular, would be an easy fix, but no one has yet offered the effort to make it. There are other similar issues arising from the evolution of the parquet standard ( #493 , etc).

@jalpes196

jalpes196 commented Apr 9, 2020

@martindurant I have modified the code to work around our problems with fastparquet. I would like to contribute this.

@martindurant
Member

Please go ahead!
I would check for min_value/max_value and populate min/max. Note that the column order parameter might mean that the two values should be flipped.
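The back-filling approach, including the possible swap martindurant warns about, might look like the following sketch. This is illustrative only: the statistics object is mocked with a SimpleNamespace, and the `flip` flag is a hypothetical parameter standing in for whatever column-order check the real fix would need; how to derive it from the thrift metadata is left open here.

```python
from types import SimpleNamespace


def backfill_minmax(stats, flip=False):
    """Copy min_value/max_value into the deprecated min/max slots so
    downstream code that only reads min/max keeps working.

    `flip` swaps the two bounds for the case where the column-order
    rules imply the values should be reversed (assumption: detecting
    that condition is out of scope for this sketch).
    """
    if stats.min_value is not None and stats.max_value is not None:
        lo, hi = stats.min_value, stats.max_value
        if flip:
            lo, hi = hi, lo
        # Only fill in the deprecated fields if the writer left them empty.
        if stats.min is None:
            stats.min = lo
        if stats.max is None:
            stats.max = hi
    return stats
```

Filling the legacy fields in place, rather than changing every read site, keeps the rest of the filtering code untouched, which matches the spirit of the suggestion above.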
