[Python][C++][Parquet] Support reading Parquet files having a permutation of column order #18353

Closed
asfimport opened this issue Mar 29, 2018 · 6 comments

Comments

@asfimport
Collaborator

asfimport commented Mar 29, 2018

See discussion in dask/fastparquet#320

Reporter: Wes McKinney / @wesm
Assignee: Alenka Frim / @AlenkaF

Related issues:

PRs and other links:

Note: This issue was originally created as ARROW-2366. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Uwe Korn / @xhochy:
This is the first iteration of schema evolution. While the issue here is quite simple, I would like to add general schema evolution / resolution rules to Arrow. My favorite would be to stick as close as possible to Avro's rules: http://avro.apache.org/docs/current/spec.html#Schema+Resolution

@asfimport
Collaborator Author

Wes McKinney / @wesm:
This will need to be addressed as part of general schema conformance in the C++ Datasets API.

cc @pitrou @nealrichardson

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
This is now implemented in the C++ Datasets project:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# create dummy dataset with column order permutation
import pathlib
basedir = pathlib.Path(".")
case = basedir / "dataset_column_order_permutation"
case.mkdir(exist_ok=True)

table1 = pa.table([[1, 2, 3], [.1, .2, .3]], names=['a', 'b'])
pq.write_table(table1, case / "data1.parquet")

table2 = pa.table([[.4, .5, .6], [4, 5, 6]], names=['b', 'a'])
pq.write_table(table2, case / "data2.parquet")

# reading with the old python implementation indeed raises on schema mismatch
pq.read_table(str(case))

# this reads fine
ds.dataset(str(case)).to_table().to_pandas()

So once pyarrow.parquet uses the datasets API under the hood (ARROW-8039), this issue should be solved. We can still add a test for it before closing this issue.

@asfimport
Collaborator Author

Antoine Pitrou / @pitrou:
@jorisvandenbossche Can we close this now? Perhaps we just need to add a test to ensure this works properly?

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Yes, it's on my "needs a test and then can close it" list, so I would prefer to keep it open for now.

@asfimport
Collaborator Author

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 11561
#11561
