The way timedelta values (a.k.a. durations, intervals...) are stored in parquet does not follow the file format specification. According to the parquet specification, the logical type Interval should be stored as:
INTERVAL is used for an interval of time. It must annotate a fixed_len_byte_array of length 12. This array stores three little-endian unsigned integers that represent durations at different granularities of time. The first stores a number in months, the second stores a number in days, and the third stores a number in milliseconds. This representation is independent of any particular timezone or date.
(...)
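As an illustration of the layout quoted above, here is a minimal sketch in plain Python that packs and unpacks the 12-byte value with the standard `struct` module. The helper names (`encode_interval`, `decode_interval`, `timedelta_to_interval`) are hypothetical and not part of fastparquet or pyarrow, and the mapping from a `timedelta` (months always 0, sub-millisecond precision truncated) is only one possible choice, not something the specification or either library prescribes.

```python
import struct
from datetime import timedelta
from typing import Tuple

# Three little-endian unsigned 32-bit integers: months, days, milliseconds.
INTERVAL_LAYOUT = struct.Struct("<III")


def encode_interval(months: int, days: int, millis: int) -> bytes:
    """Pack the three components into the 12-byte INTERVAL layout."""
    return INTERVAL_LAYOUT.pack(months, days, millis)


def decode_interval(raw: bytes) -> Tuple[int, int, int]:
    """Unpack a 12-byte INTERVAL value into (months, days, millis)."""
    if len(raw) != INTERVAL_LAYOUT.size:
        raise ValueError("INTERVAL must be exactly 12 bytes")
    return INTERVAL_LAYOUT.unpack(raw)


def timedelta_to_interval(delta: timedelta) -> bytes:
    """Hypothetical mapping: months = 0, whole days, remainder in milliseconds."""
    if delta < timedelta(0):
        # All three fields are unsigned, so negative durations do not fit.
        raise ValueError("negative durations cannot be represented")
    # Precision below one millisecond is truncated (an assumption of this sketch).
    millis = delta.seconds * 1000 + delta.microseconds // 1000
    return encode_interval(0, delta.days, millis)
```

Under these assumptions, negative durations and anything finer than a millisecond cannot be stored, which is part of what a mapping from pandas timedelta values would have to deal with.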
Currently, fastparquet does not follow the format specification on this type. This affects the ability to read parquets written with other tools or to read with other tools parquets written with fastparquet if there is any field with this type.
I guess it might be a known issue rather than a bug, but I couldn't find info about it.
It isn't "known" in the sense that anyone has raised this before, but the INTERVAL type is a particularly unwieldy encoding, as you can see. pyarrow does not use it, but stores the data as INT64 like fastparquet.
Fair, and yet fastparquet and pyarrow do not seem to be compatible when writing and reading this type on a parquet file:
writing a timedelta with fastparquet and loading it with pyarrow transforms it to a datetime.time
writing a timedelta with pyarrow and loading it with fastparquet transforms it to an int
Only when you read the file with the same tool that wrote it (either of the two) is the timedelta type preserved.
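One way to picture the mismatch described above: if both tools store the same INT64 but the reader interprets it differently, the same raw number can surface as a duration, a time of day, or a bare integer. The sketch below uses only the standard library. It assumes the stored unit is microseconds and that the difference comes from how the column is annotated; neither assumption is confirmed in this thread, and the function names are hypothetical.

```python
from datetime import time, timedelta

MICROS_PER_SECOND = 1_000_000  # assumed storage unit: microseconds


def as_duration(raw_micros: int) -> timedelta:
    """Interpret the stored integer as an elapsed duration."""
    return timedelta(microseconds=raw_micros)


def as_time_of_day(raw_micros: int) -> time:
    """Interpret the stored integer as a wall-clock time since midnight."""
    seconds, micros = divmod(raw_micros, MICROS_PER_SECOND)
    minutes, secs = divmod(seconds, 60)
    hours, mins = divmod(minutes, 60)
    # datetime.time raises ValueError when hours > 23, so durations of
    # 24 hours or more have no time-of-day equivalent.
    return time(hours, mins, secs, micros)


def as_plain_int(raw_micros: int) -> int:
    """No annotation understood by the reader: the value stays an integer."""
    return raw_micros


stored = 5_400_000_000  # 1 hour 30 minutes, in microseconds
print(as_duration(stored), as_time_of_day(stored), as_plain_int(stored))
```

The time-of-day reading also shows that a timedelta spanning a day or more could not survive that interpretation at all.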
In any case, what would be the proper solution? Would a PR that implements the format specification for the INTERVAL type be desirable? Would there be any concern about compatibility with pyarrow?
Would a PR that implements the format specification for the INTERVAL type be desirable?
You are welcome to try, but I think it might be a little work. It is not a high priority for me (we have had this model for a long time!). Fixing reading arrow with the INT encoding is perhaps more important.
Describe the issue:
The way timedelta values (a.k.a. durations, intervals...) are stored in parquet does not follow the file format specification. According to the parquet specification, the logical type Interval should be stored as a fixed_len_byte_array of length 12 (see the specification excerpt quoted above).
Currently, fastparquet does not follow the format specification on this type. This affects the ability to read parquets written with other tools or to read with other tools parquets written with fastparquet if there is any field with this type.
I guess it might be a known issue rather than a bug, but I couldn't find info about it.
Minimal Complete Verifiable Example:
Then use either hangxie/parquet-tools, ktrueda/parquet-tools or any similar tool to inspect the schema to find that it looks like:
instead of something along the lines of
Anything else we need to know?:
There's a bit more context on this StackOverflow question
Environment: