[Python] PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file #31678

Open
asfimport opened this issue Apr 22, 2022 · 9 comments


@asfimport
Collaborator

I'm trying to follow the example here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files to write an example partitioned dataset, but I consistently get an error about non-equal schemas. Here's an MCVE:

from pathlib import Path
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)
metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row groups statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")


# Write the ``_metadata`` parquet file with row groups statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
) 

This raises the error

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      3 )
File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
   2322 metadata = read_metadata(where)
   2323 for m in metadata_collector:
-> 2324     metadata.append_row_groups(m)
   2325 metadata.write_metadata_file(where)
File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()
RuntimeError: AppendRowGroups requires equal schemas. 

But all schemas in the metadata_collector list seem to be the same:

all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True 

Environment: MacOS. Python 3.8.10.
pyarrow: '7.0.0'
pandas: '1.4.2'
numpy: '1.22.3'
Reporter: Kyle Barron

Note: This issue was originally created as ARROW-16287. Please see the migration documentation for further details.

@david-waterworth

This seems to be related to partition_cols: if you comment out that line in the write_to_dataset call, the error goes away. I cannot find an example of writing metadata for a partitioned dataset.

@legout

legout commented Aug 22, 2023

I have the same problem with datasets in which the schemas of the parquet files are identical except for the ordering of the columns.

For me, this currently means that I have to rewrite all parquet files with one unique schema (same column ordering). I wonder whether it is really necessary for the column ordering to be identical.

@mapleFU
Member

mapleFU commented Aug 22, 2023

@legout Can you show the error you get and the code you're using with the dataset writer? When writing to the same file the schema should be the same, so I don't fully understand how you hit this when using the dataset API.

@legout

legout commented Aug 22, 2023

Sorry for my confusing comment. Here are some more details.

The parquet files of the dataset are exports from an Oracle database, written with another tool (KNIME). Unfortunately, this leads to the parquet files having different column orderings, although the data types of the columns are identical.

I am able to read the dataset (parquet files) using pyarrow.dataset or pyarrow.parquet.read_table.
However, when trying to create the metadata and common metadata files according to https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files, I get this error:

RuntimeError: AppendRowGroups requires equal schemas.

I do understand that the data types have to be identical, but I wonder why the column ordering is important here.

I am currently on my mobile. I'll provide some sample code later.

@legout

legout commented Aug 22, 2023

Create a toy dataset with parquet files that have identical column types but different column ordering.

import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pds

t1 = pa.Table.from_pydict({"A": [1, 2, 3], "B": ["a", "b", "c"]})
t2 = pa.Table.from_pydict({"B": ["a", "b", "c"], "A": [1, 2, 3]})

temp_path = tempfile.mkdtemp()

pq.write_table(t1, os.path.join(temp_path, "t1.parquet"))
pq.write_table(t2, os.path.join(temp_path, "t2.parquet"))

ds = pds.dataset(temp_path)
print(ds.to_table())
pyarrow.Table
A: int64
B: string
----
A: [[1,2,3],[1,2,3]]
B: [["a","b","c"],["a","b","c"]]

Collect metadata of the individual files and create the (global) metadata file.

metadata_collector = [frag.metadata for frag in ds.get_fragments()]

metadata = metadata_collector[0]
metadata.append_row_groups(metadata_collector[1])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[193], line 2
      1 metadata = metadata_collector[0]
----> 2 metadata.append_row_groups(metadata_collector[1])

File ~/mambaforge/envs/pydala-dev/lib/python3.11/site-packages/pyarrow/_parquet.pyx:793, in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.
The two columns with index 0 differ.
column descriptor = {
  name: A,
  path: A,
  physical_type: INT64,
  converted_type: NONE,
  logical_type: None,
  max_definition_level: 1,
  max_repetition_level: 0,
}
column descriptor = {
  name: B,
  path: B,
  physical_type: BYTE_ARRAY,
  converted_type: UTF8,
  logical_type: String,
  max_definition_level: 1,
  max_repetition_level: 0,
}

@mapleFU
Member

mapleFU commented Aug 23, 2023

>>> metadata_collector[0].schema
<pyarrow._parquet.ParquetSchema object at 0x11e3cee80>
required group field_id=-1 schema {
  optional int64 field_id=-1 A;
  optional binary field_id=-1 B (String);
}

>>> metadata_collector[1].schema
<pyarrow._parquet.ParquetSchema object at 0x11e3ceec0>
required group field_id=-1 schema {
  optional binary field_id=-1 B (String);
  optional int64 field_id=-1 A;
}

Oh, I see. This is not allowed, though it looks like it should be.

The Parquet schema is stored once, at the "FileMetaData" level (see https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1024 ), so all row groups in one file's metadata must share the same schema, including the column order.

@legout

legout commented Aug 23, 2023

Does this mean there is no solution other than rewriting the data with a single, consistent column ordering?

@KernelA

KernelA commented Mar 14, 2024

I have a similar issue.

pyarrow 14.0.2

  parquet.write_metadata(
  File ".../lib/python3.9/site-packages/pyarrow/parquet/core.py", line 3589, in write_metadata
    metadata.append_row_groups(m)
  File "pyarrow/_parquet.pyx", line 807, in pyarrow._parquet.FileMetaData.append_row_groups
RuntimeError: AppendRowGroups requires equal schemas.
The two columns with index 0 differ.
column descriptor = {
  name: session_num,
  path: session_num,
  physical_type: INT64,
  converted_type: UINT_64,
  logical_type: Int(bitWidth=64, isSigned=false),
  max_definition_level: 0,
  max_repetition_level: 0,
}
column descriptor = {
  name: session_num,
  path: session_num,
  physical_type: INT64,
  converted_type: UINT_64,
  logical_type: Int(bitWidth=64, isSigned=false),
  max_definition_level: 1,
  max_repetition_level: 0,
}

All partitions have equal schemas. The example is taken from https://arrow.apache.org/docs/14.0/python/parquet.html#writing-metadata-and-common-metadata-files

@KernelA

KernelA commented Mar 24, 2024

When all fields in the schema are nullable, this error does not occur. I think it is related to #31957.

@kou changed the title from "PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file" to "[Python] PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file" on Mar 25, 2024