The parquet format specification is not followed for Interval type (i.e. timedeltas) #937

Open
mgab opened this issue Oct 3, 2024 · 3 comments

@mgab

mgab commented Oct 3, 2024

Describe the issue:

The way timedelta values (a.k.a. durations, intervals...) are stored in parquet does not follow the file format specification. According to the parquet specification, the logical type Interval should be stored as:

INTERVAL is used for an interval of time. It must annotate a fixed_len_byte_array of length 12. This array stores three little-endian unsigned integers that represent durations at different granularities of time. The first stores a number in months, the second stores a number in days, and the third stores a number in milliseconds. This representation is independent of any particular timezone or date.
(...)
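To make the quoted layout concrete, here is a minimal sketch of packing and unpacking such a 12-byte value with the Python standard library only. The helper names `encode_interval` and `decode_interval` are hypothetical and are not part of fastparquet or any other library. It assumes the months field is supplied separately (a `timedelta` has no month component), and it shows two limitations of the encoding: negative durations cannot be represented, and sub-millisecond precision is dropped.

```python
import struct
from datetime import timedelta


def encode_interval(td: timedelta, months: int = 0) -> bytes:
    """Pack a duration as three little-endian unsigned 32-bit integers.

    Hypothetical helper, for illustration only. Layout: months, days,
    milliseconds, 4 bytes each, 12 bytes in total.
    """
    if td < timedelta(0):
        # All three fields are unsigned, so negative durations do not fit.
        raise ValueError("negative durations cannot be encoded as INTERVAL")
    # timedelta normalizes itself to days, seconds (< 86400), microseconds.
    # Microseconds below one millisecond are truncated here.
    millis = td.seconds * 1000 + td.microseconds // 1000
    return struct.pack("<III", months, td.days, millis)


def decode_interval(buf: bytes) -> tuple[int, int, int]:
    """Unpack a 12-byte value into (months, days, milliseconds)."""
    if len(buf) != 12:
        raise ValueError("INTERVAL values must be exactly 12 bytes long")
    return struct.unpack("<III", buf)


encoded = encode_interval(timedelta(seconds=30))
print(encoded.hex())             # 000000000000000030750000
print(decode_interval(encoded))  # (0, 0, 30000)
```

For comparison, the same 30-second value in a single INT64 column of microseconds is just the integer 30000000.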

Currently, fastparquet does not follow the format specification for this type. This affects interoperability in both directions whenever a field has this type: reading Parquet files written by other tools, and reading fastparquet-written files with other tools.

I guess it might be a known issue rather than a bug, but I couldn't find info about it.

Minimal Complete Verifiable Example:

import pandas as pd
from fastparquet import write

df = pd.DataFrame([{'seconds': 30, 'duration': pd.to_timedelta(30, unit='seconds')}])

write('/test/test.parquet', df)

Then inspect the schema with hangxie/parquet-tools, ktrueda/parquet-tools, or any similar tool. It looks like:

{"Tag":"name=Schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT64, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=INT64, convertedtype=TIME_MICROS, repetitiontype=OPTIONAL"}
]}

instead of something along the lines of

{"Tag":"name=Duckdb_schema",
 "Fields":[
  {"Tag":"name=Seconds, type=INT32, convertedtype=INT_32, repetitiontype=OPTIONAL"},
  {"Tag":"name=Duration, type=FIXED_LEN_BYTE_ARRAY, convertedtype=INTERVAL, length=12, repetitiontype=OPTIONAL"}
]}

Anything else we need to know?:

There's a bit more context on this StackOverflow question

Environment:

  • Pandas version: 2.2.2
  • Python version: 2024.5.0
  • Operating System: macOS 14.6.1
  • Install method (conda, pip, source): pip
@martindurant
Member

It isn't "known" in the sense that anyone has raised this before, but the INTERVAL type is a particularly unwieldy encoding, as you can see. pyarrow does not use it, but stores the data as INT64 like fastparquet.

@mgab
Author

mgab commented Oct 4, 2024

Fair, and yet fastparquet and pyarrow do not seem to be compatible when writing and reading this type on a parquet file:

  • writing a timedelta with fastparquet and loading it with pyarrow transforms it to a datetime.time
  • writing a timedelta with pyarrow and loading it with fastparquet transforms it to an int

The timedelta type is preserved only when the file is written and read with the same tool (either of the two).
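A plausible reading of the two observations above is that both tools store the same kind of INT64 value, and the result depends on which annotation the reader sees or honours. The sketch below is not fastparquet or pyarrow code; it only illustrates, with the standard library, how one raw integer becomes three different Python objects under three interpretations. The function names are hypothetical, and the mapping of each interpretation to a specific tool is an assumption based on the TIME_MICROS annotation in the schema dump earlier in this issue.

```python
from datetime import datetime, time, timedelta

# 30 seconds expressed as microseconds, i.e. the raw INT64 value on disk.
RAW_MICROS = 30_000_000


def as_duration(micros: int) -> timedelta:
    """Interpretation the writer intended: an elapsed amount of time."""
    return timedelta(microseconds=micros)


def as_time_of_day(micros: int) -> time:
    """Interpretation implied by a TIME_MICROS annotation:
    microseconds since midnight, i.e. a wall-clock time."""
    return (datetime.min + timedelta(microseconds=micros)).time()


def as_plain_int(micros: int) -> int:
    """Interpretation when no usable annotation is found."""
    return micros


print(as_duration(RAW_MICROS))     # 0:00:30
print(as_time_of_day(RAW_MICROS))  # 00:00:30
print(as_plain_int(RAW_MICROS))    # 30000000
```

Under this reading, the time-of-day interpretation also breaks down for durations of 24 hours or more, which a wall-clock time cannot represent.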

In any case, what would be the proper solution? Would a PR that implements the format specification for the INTERVAL type be desirable? Would there be any concern about the compatibility against pyarrow?

@martindurant
Member

Would a PR that implements the format specification for the INTERVAL type be desirable?

You are welcome to try, but I think it might be a little work. It is not a high priority for me (we have had this model for a long time!). Fixing reading arrow with the INT encoding is perhaps more important.
