row_groups filters does not use min_value/max_value statistics #491

Open
cclienti opened this issue Mar 24, 2020 · 6 comments

Comments

@cclienti

cclienti commented Mar 24, 2020

Hello,

In the file api.py, the function filter_out_stats uses the min and max statistics fields, but these are marked as deprecated in the parquet thrift specification.
https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift line: 201

As some parquet datasets generated by other tools carry only the min_value/max_value statistics, row_groups filtering is not usable on them.

version (conda): fastparquet / 0.3.2 / py37hdd07704_0

Best regards,
Christophe

@martindurant
Member

Thanks for pointing this out and thanks very much, parquet, for this subtle backwards-incompatible change.

Would you like to update fastparquet's thrift definition from the reference, and add min/max_value to every place that min/max is currently used? Presumably we should look to read the min/max_value first and fall back to min/max. A PR would be welcome - I don't know when I could find the time to do this.
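The fallback martindurant describes (read min_value/max_value first, then fall back to the deprecated min/max) could be sketched as below. Note this is a minimal illustration, not fastparquet's actual code: the `Stats` dataclass is a stand-in for the thrift-generated statistics object, and `get_bounds` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Stats:
    """Stand-in for the thrift-generated Statistics struct (assumption,
    not fastparquet's real class)."""
    min: Optional[bytes] = None        # deprecated field
    max: Optional[bytes] = None        # deprecated field
    min_value: Optional[bytes] = None  # newer field
    max_value: Optional[bytes] = None  # newer field


def get_bounds(stats: Stats):
    """Return (lower, upper), preferring the newer min_value/max_value
    fields and falling back to the deprecated min/max."""
    lo = stats.min_value if stats.min_value is not None else stats.min
    hi = stats.max_value if stats.max_value is not None else stats.max
    return lo, hi
```

With this shape, a Spark-written string column that populates only min_value/max_value would still yield usable bounds, while older files that set only min/max keep working.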

@cclienti
Author

The thrift parquet code generated in fastparquet also seems to mention these deprecated fields. I wonder why they moved from min to min_value; I looked in the Apache JIRA projects but could not find any real explanation (https://issues.apache.org/jira/browse/IMPALA-4817, in the comments).

I agree, we can check first for min_value and then for min. I can do the PR.

@jalpes196

Hi @martindurant, I have been facing the same issue while reading parquet files written by the Spark parquet writer. Spark sets the min_value and max_value fields but not min and max. On further investigation, it turned out that Spark omits min and max only for string/bytes columns; it does set them for int columns. Correspondingly, the fastparquet reader fails to filter row groups when the filter is on a string/bytes column, but works fine on an int column.

@martindurant
Member

That's right: the parquet norms have moved on a bit, and fastparquet has not kept up. This one, in particular, would be an easy fix, but no one has yet offered the effort to make it. There are other similar issues arising from the evolution of the parquet standard ( #493 , etc).

@jalpes196

jalpes196 commented Apr 9, 2020

@martindurant I have modified the code to work around our problems with fastparquet. I would like to contribute this.

@martindurant
Member

Please go ahead!
I would check for min_value/max_value and populate min/max. Note that the column order parameter might mean that the two values should be flipped.
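The back-filling approach, including the possible swap martindurant warns about, might look like the following sketch. This is illustrative only: the statistics object is mocked with a SimpleNamespace, and the `flip` flag is a hypothetical parameter standing in for whatever column-order check the real fix would need; how to derive it from the thrift metadata is left open here.

```python
from types import SimpleNamespace


def backfill_minmax(stats, flip=False):
    """Copy min_value/max_value into the deprecated min/max slots so
    downstream code that only reads min/max keeps working.

    `flip` swaps the two bounds for the case where the column-order
    rules imply the values should be reversed (assumption: detecting
    that condition is out of scope for this sketch).
    """
    if stats.min_value is not None and stats.max_value is not None:
        lo, hi = stats.min_value, stats.max_value
        if flip:
            lo, hi = hi, lo
        # Only fill in the deprecated fields if the writer left them empty.
        if stats.min is None:
            stats.min = lo
        if stats.max is None:
            stats.max = hi
    return stats
```

Filling the legacy fields in place, rather than changing every read site, keeps the rest of the filtering code untouched, which matches the spirit of the suggestion above.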
