row_groups filters does not use min_value/max_value statistics #491
Comments
Thanks for pointing this out and thanks very much, parquet, for this subtle backwards-incompatible change. Would you like to update fastparquet's thrift definition from the reference, and add min/max_value to every place that min/max is currently used? Presumably we should look to read the min/max_value first and fall back to min/max. A PR would be welcome - I don't know when I could find the time to do this.
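A minimal sketch of the fallback described above (read min_value/max_value first, then fall back to min/max). This is not fastparquet's actual code: the Statistics dataclass is a hypothetical stand-in for the thrift-generated struct, and stat_bounds and can_skip_for_equality are illustrative names. For simplicity it compares raw byte strings, whereas real statistics would need decoding according to the column type.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Statistics:
    """Hypothetical stand-in for the thrift Statistics struct (not fastparquet's class)."""
    min: Optional[bytes] = None        # deprecated field
    max: Optional[bytes] = None        # deprecated field
    min_value: Optional[bytes] = None  # newer field
    max_value: Optional[bytes] = None  # newer field


def stat_bounds(stats: Statistics) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Return (lower, upper), preferring min_value/max_value over min/max."""
    lower = stats.min_value if stats.min_value is not None else stats.min
    upper = stats.max_value if stats.max_value is not None else stats.max
    return lower, upper


def can_skip_for_equality(stats: Statistics, value: bytes) -> bool:
    """True if the statistics prove that no row in the row group equals `value`."""
    lower, upper = stat_bounds(stats)
    if lower is None or upper is None:
        return False  # no usable statistics: the row group must be read
    return value < lower or value > upper


# A writer that fills only the newer fields (as reported for string columns).
new_only = Statistics(min_value=b"apple", max_value=b"mango")
# A writer that fills only the deprecated fields.
old_only = Statistics(min=b"apple", max=b"mango")

print(stat_bounds(new_only))
print(can_skip_for_equality(new_only, b"zebra"))
print(can_skip_for_equality(old_only, b"banana"))
```

With this ordering, files that carry only the deprecated fields keep working, while files that carry only the newer fields become filterable. A row group with neither pair is never skipped.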
The generated parquet thrift code in fastparquet also seems to mention that these fields are deprecated. I wonder why they moved from min to min_value, though. I looked through their JIRA projects but could not find a real explanation (see the comments on https://issues.apache.org/jira/browse/IMPALA-4817). I agree, we can check for min_value first and then fall back to min. I can do the PR.
Hi @martindurant, I have been facing the same issue when reading parquet files written with the Spark parquet writer. The Spark writer seems to set the min_value and max_value fields but not min and max. On further investigation, I found that it fails to set min and max only for string/bytes columns; for int columns it does set them. As a result, the fastparquet reader fails to filter row groups when the filter is on a string/bytes column, but works fine when it is on an int column.
That's right, the parquet norms have moved on a bit, and fastparquet has not kept up. This, in particular, would be an easy fix - but no one has offered the effort to fix it. There are other similar issues with the evolution of parquet standards (#493, etc.).
@martindurant I have modified the code to work around our problems with fastparquet. I would like to contribute this.
Please go ahead!
Hello,
In the file API.py, the function filter_out_stats uses the min and max statistic fields, but they are marked as deprecated in the parquet thrift specification: https://github.com/apache/arrow/blob/master/cpp/src/parquet/parquet.thrift (line 201).
As some parquet datasets can be generated by other tools with only min_value or max_value statistics, row_groups filtering is not usable for them.
Version (conda): fastparquet / 0.3.2 / py37hdd07704_0
Best regards,
Christophe