[Python] PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file #31678
This seems to be related to the
I have the same problem, for datasets in which the schemas of the parquet files are identical except for the ordering of the columns. For me, that currently means I have to rewrite all parquet files with one unique schema (same column ordering). I wonder if it is really necessary for the ordering of the columns to be identical.
@legout Can you show the error you get and the code you're using with the dataset writer? It seems that the schema has to be the same when writing the same file, but I don't fully understand how you hit this when using the dataset API.
Sorry for my confusing comment; here are some more details. The parquet files of the dataset are exports from an Oracle database, written with another software (KNIME). Unfortunately, this leads to the parquet files having different column orderings, even though the data types of the columns are identical. I am still able to read the dataset (parquet files) using pyarrow.dataset or pyarrow.parquet.read_table, but building the combined metadata fails with:

RuntimeError: AppendRowGroups requires equal schemas.

I understand that the data types have to be identical, but I wonder why the column ordering is important here. I am currently on my mobile; I'll provide some sample code later.
Create a toy dataset with parquet files having identical column types but different column ordering:

```python
import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pds

# Same columns and types, but different column order.
t1 = pa.Table.from_pydict({"A": [1, 2, 3], "B": ["a", "b", "c"]})
t2 = pa.Table.from_pydict({"B": ["a", "b", "c"], "A": [1, 2, 3]})

temp_path = tempfile.mkdtemp()
pq.write_table(t1, os.path.join(temp_path, "t1.parquet"))
pq.write_table(t2, os.path.join(temp_path, "t2.parquet"))

# Reading the dataset works despite the differing column order.
ds = pds.dataset(temp_path)
print(ds.to_table())
```

Collect the metadata of the individual files and create the (global) metadata file:

```python
metadata_collector = [frag.metadata for frag in ds.get_fragments()]
metadata = metadata_collector[0]
# Fails with: RuntimeError: AppendRowGroups requires equal schemas.
metadata.append_row_groups(metadata_collector[1])
```
Oh, I get it now. This is not allowed, though it looks like it should be. The Parquet schema lives at the FileMetadata level (see https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1024), so different row groups in the same file must share the same schema.
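For illustration, a minimal sketch (reusing t1 and t2 from the toy example above) showing that pyarrow treats schemas with reordered fields as unequal:

```python
# Field order is part of schema equality, so the reordered schemas differ.
print(t1.schema.equals(t2.schema))  # False: same fields, different order
print(sorted(t1.schema.names) == sorted(t2.schema.names))  # True: same names
```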
Does this mean there is no other solution than rewriting the data with a unique column ordering?
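For reference, a minimal sketch of that rewrite workaround, assuming the toy files and temp_path from the example above (the column_order list and file names are illustrative):

```python
import os
import pyarrow.parquet as pq

column_order = ["A", "B"]  # one canonical ordering for every file
metadata_collector = []
for name in ["t1.parquet", "t2.parquet"]:
    path = os.path.join(temp_path, name)
    table = pq.read_table(path).select(column_order)  # reorder the columns
    pq.write_table(table, path)  # rewrite the file in canonical order
    md = pq.read_metadata(path)
    md.set_file_path(name)  # store the path relative to the dataset root
    metadata_collector.append(md)

# With identical schemas, combining the metadata now succeeds.
metadata = metadata_collector[0]
for md in metadata_collector[1:]:
    metadata.append_row_groups(md)
metadata.write_metadata_file(os.path.join(temp_path, "_metadata"))
```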
I have a similar issue with pyarrow 14.0.2. All partitions have equal schemas. The example is taken from https://arrow.apache.org/docs/14.0/python/parquet.html#writing-metadata-and-common-metadata-files
When all fields in the schema are nullable, this error does not occur. I think it is related to #31957.
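A minimal sketch of that nullability variant: two schemas with identical names and types, differing only in the nullable flag, also compare unequal:

```python
import pyarrow as pa

# Nullability is part of schema equality, just like field order.
s_nullable = pa.schema([pa.field("A", pa.int64(), nullable=True)])
s_required = pa.schema([pa.field("A", pa.int64(), nullable=False)])
print(s_nullable.equals(s_required))  # False: only the nullable flag differs
```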
I'm trying to follow the example here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files to write an example partitioned dataset, but I'm consistently getting an error about non-equal schemas. Here's an MCVE:
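(A sketch of the docs pattern being followed; the table contents and partition column are illustrative, not the original snippet.)

```python
# Sketch of the linked docs example: write a partitioned dataset while
# collecting per-file metadata, then combine it into a _metadata file.
import pathlib
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

root = pathlib.Path(tempfile.mkdtemp())
table = pa.table({"year": [2020, 2020, 2021], "n_legs": [2, 4, 5]})

metadata_collector = []
pq.write_to_dataset(
    table,
    str(root),
    partition_cols=["year"],
    metadata_collector=metadata_collector,
)

# Combining the collected metadata is the step that fails with:
# RuntimeError: AppendRowGroups requires equal schemas.
pq.write_metadata(
    table.schema,
    str(root / "_metadata"),
    metadata_collector=metadata_collector,
)
```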
This raises the error, but all schemas in the `metadata_collector` list seem to be the same.

Environment: macOS, Python 3.8.10
pyarrow: 7.0.0
pandas: 1.4.2
numpy: 1.22.3
Reporter: Kyle Barron
Note: This issue was originally created as ARROW-16287. Please see the migration documentation for further details.