
[Python] Using unify_schema() during schema evolution fails #37898

Open
PoojaRavi1105 opened this issue Sep 27, 2023 · 5 comments

Comments

@PoojaRavi1105

Describe the usage question you have. Please include as many useful details as possible.

I am trying to merge multiple parquet files using pyarrow's unify_schema(). During schema evolution, the same field has 2 different data type structures. What is the best way to handle schema evolution in such cases while making sure there is no loss of data?

Component(s)

Parquet, Python

@AlenkaF
Member

AlenkaF commented Sep 27, 2023

As per the documentation: "Note that two fields with different types will fail merging."
https://arrow.apache.org/docs/python/generated/pyarrow.unify_schemas.html

What are you using to merge the parquet files once you get the unified schema?

@AlenkaF AlenkaF changed the title Using unify_schema() during schema evolution fails [Python] Using unify_schema() during schema evolution fails Sep 27, 2023
@PoojaRavi1105
Author

PoojaRavi1105 commented Sep 27, 2023

I'm reading the tables using pyarrow.parquet.read_table() and storing them in a list. Then I loop through the list of tables and use pyarrow.parquet.ParquetWriter.write_table() to write the merged parquet file, which is then uploaded to an S3 bucket.
Is there a feature planned to handle schema promotion? It would be helpful in scenarios like this one.

@PoojaRavi1105 PoojaRavi1105 reopened this Sep 27, 2023
@AlenkaF
Member

AlenkaF commented Sep 27, 2023

I see. There are already some open issues connected to this; I suggest reading through them:

Let me know if you get the information you are looking for.

@mapleFU
Member

mapleFU commented Nov 4, 2023

@PoojaRavi1105

  1. Currently the parquet dataset doesn't support Iceberg-style schema evolution via unify_schema.
  2. But when you set the schema on the dataset explicitly yourself, it can be read.

@sergun

sergun commented Feb 3, 2024

@PoojaRavi1105

  1. Currently the parquet dataset doesn't support Iceberg-style schema evolution via unify_schema.
  2. But when you set the schema on the dataset explicitly yourself, it can be read.

Regarding point 2:

It doesn't work for me when adding / removing columns (pyarrow==15.0.0). E.g.:

1.parquet has schema:

        pa.schema([
            ('id', pa.int64()),
            ('x', pa.int64()),
            ('a', pa.struct([
                ('y', pa.int64()),
            ])),
        ])

2.parquet has schema:

        pa.schema([
            ('id', pa.int64()),
            ('y', pa.int64()),
            ('a', pa.struct([
                ('x', pa.int64()),
            ])),
        ])

When I read them with a manually merged schema:

    dataset = ds.dataset(["1.parquet", "2.parquet",], schema=
        pa.schema([
            ('id', pa.int64()),
            ('x', pa.int64()),
            ('y', pa.int64()),
            ('a', pa.struct([
                ('x', pa.int64()),
                ('y', pa.int64()),
            ])),
        ])
    )

I get:

pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<x: int64> output fields: struct<x: int64, y: int64>

@AlenkaF, are there any plans to support this on the roadmap?

4 participants