
[Python] Using unify_schema() during schema evolution fails #37898

Open
PoojaRavi1105 opened this issue Sep 27, 2023 · 5 comments

Comments

@PoojaRavi1105

Describe the usage question you have. Please include as many useful details as possible.

I am trying to merge multiple parquet files using pyarrow's unify_schema(). During schema evolution, the same field has 2 different data type structures. What is the best way to handle schema evolution in such cases while making sure there is no loss of data?

Component(s)

Parquet, Python

@AlenkaF
Member

AlenkaF commented Sep 27, 2023

As per the documentation: "Note that two fields with different types will fail merging."
https://arrow.apache.org/docs/python/generated/pyarrow.unify_schemas.html

What are you using to merge the parquet files once you get the unified schema?

@AlenkaF AlenkaF changed the title Using unify_schema() during schema evolution fails [Python] Using unify_schema() during schema evolution fails Sep 27, 2023
@PoojaRavi1105
Author

PoojaRavi1105 commented Sep 27, 2023

I'm reading the tables using pyarrow.parquet.read_table() and storing them in a list. Then I loop through the list of tables and use pyarrow.parquet.ParquetWriter.write_table() to write the merged parquet file, which is then uploaded to an S3 bucket.
Is there a feature planned to handle schema promotion? It would be helpful in scenarios like this one.

@PoojaRavi1105 PoojaRavi1105 reopened this Sep 27, 2023
@AlenkaF
Member

AlenkaF commented Sep 27, 2023

I see. There are already some open issues connected to this; I suggest reading through them:

Let me know if you get the information you are looking for.

@mapleFU
Member

mapleFU commented Nov 4, 2023

@PoojaRavi1105

  1. Currently the parquet dataset doesn't support Iceberg-style schema evolution via unify_schema.
  2. But when you set the schema on the dataset explicitly yourself, it can be read.

@sergun

sergun commented Feb 3, 2024

@PoojaRavi1105

  1. Currently the parquet dataset doesn't support Iceberg-style schema evolution via unify_schema.
  2. But when you set the schema on the dataset explicitly yourself, it can be read.

Regarding point 2:

It doesn't work for me when adding / removing columns (pyarrow==15.0.0). E.g.:

1.parquet has schema:

        pa.schema([
            ('id', pa.int64()),
            ('x', pa.int64()),
            ('a', pa.struct([
                ('y', pa.int64()),
            ])),
        ])

2.parquet has schema:

        pa.schema([
            ('id', pa.int64()),
            ('y', pa.int64()),
            ('a', pa.struct([
                ('x', pa.int64()),
            ])),
        ])

When I read them with a manually merged schema:

    dataset = ds.dataset(["1.parquet", "2.parquet",], schema=
        pa.schema([
            ('id', pa.int64()),
            ('x', pa.int64()),
            ('y', pa.int64()),
            ('a', pa.struct([
                ('x', pa.int64()),
                ('y', pa.int64()),
            ])),
        ])
    )

I get:

pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<x: int64> output fields: struct<x: int64, y: int64>

@AlenkaF, are there any plans to support this on the roadmap?

4 participants