
Can't write dfs when column order changes TypeError: expected list of bytes #320

Open
birdsarah opened this issue Mar 24, 2018 · 5 comments


@birdsarah

My code looked like this:

```python
from dask import delayed
import dask.dataframe as dd

dfs = [delayed(build_df_from_filename)(filename) for filename in test_index]
df = dd.from_delayed(dfs, meta=meta, divisions='sorted')
df.to_parquet('test.parquet', compression='snappy')
```

It failed with `TypeError: expected list of bytes`.

When I changed the engine to pyarrow, to_parquet worked, but then dd.read_parquet gave me the error:

```
ValueError: Schema in test.parquet//part.0.parquet was different. <pyarrow._parquet.ParquetSchema object at 0x7f51d48fcc48>
call_stack: BYTE_ARRAY UTF8
crawl_id: INT64
func_name: BYTE_ARRAY UTF8
in_iframe: BOOLEAN
location: BYTE_ARRAY UTF8
operation: BYTE_ARRAY UTF8
script_col: BYTE_ARRAY UTF8
script_line: BYTE_ARRAY UTF8
script_loc_eval: BYTE_ARRAY UTF8
script_url: BYTE_ARRAY UTF8
symbol: BYTE_ARRAY UTF8
time_stamp: BYTE_ARRAY UTF8
value: BYTE_ARRAY UTF8
file_name: BYTE_ARRAY UTF8
arguments: BYTE_ARRAY UTF8
call_id: BYTE_ARRAY UTF8
  vs <pyarrow._parquet.ParquetSchema object at 0x7f51d48fc4c8>
call_id: BYTE_ARRAY UTF8
arguments: BYTE_ARRAY UTF8
call_stack: BYTE_ARRAY UTF8
crawl_id: INT64
file_name: BYTE_ARRAY UTF8
func_name: BYTE_ARRAY UTF8
in_iframe: BOOLEAN
location: BYTE_ARRAY UTF8
operation: BYTE_ARRAY UTF8
script_col: BYTE_ARRAY UTF8
script_line: BYTE_ARRAY UTF8
script_loc_eval: BYTE_ARRAY UTF8
script_url: BYTE_ARRAY UTF8
symbol: BYTE_ARRAY UTF8
time_stamp: BYTE_ARRAY UTF8
value: BYTE_ARRAY UTF8
```

The schemas are the same, but the order is different.

I then added `df = df.sort_index(axis=1)` to my `build_df_from_filename` function.
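For anyone hitting the same thing, here is a minimal sketch of what that fix does (toy column names and values, not the real schema above): `sort_index(axis=1)` puts every partition's columns into the same alphabetical order, so frames built from differently ordered sources end up structurally identical.

```python
import pandas as pd

# Two frames with the same columns in different order, as might
# come from sources with unstable key order.
a = pd.DataFrame({"b": [1], "a": [2]})
b = pd.DataFrame({"a": [3], "b": [4]})

# sort_index(axis=1) sorts the columns alphabetically, giving
# every frame an identical column layout.
a = a.sort_index(axis=1)
b = b.sort_index(axis=1)

print(a.columns.tolist())  # ['a', 'b']
print(b.columns.tolist())  # ['a', 'b']
```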

After doing that, I decided to retry fastparquet, and it worked.

It's a bit bonkers to me that:

  1. my columns need to be ordered identically.
  2. pyarrow only decided to tell me about that on read.

Gripe over.

What would be awesome is a helpful error message about mismatched columns, not just `TypeError: expected list of bytes`.

@birdsarah
Author

birdsarah commented Mar 24, 2018

Small update: I'm still having problems with pyarrow; the sort_index addition still sometimes gives me differently ordered columns.

Edit: that's not true. It's a different issue.

@martindurant
Member

martindurant commented Mar 24, 2018

It is generally expected that every partition of a dask dataframe has exactly the same structure, as described by the meta zero-length dataframe; in fact, some operations will complain or fail if a difference is detected, even just in the ordering of the columns. Of course, many pandas operations involve selecting columns, so in those cases you would not notice. You could solve it easily by doing something like

```python
df = df.map_partitions(lambda d: d[df.columns.tolist()])
```
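What that one-liner does to each partition can be sketched in plain pandas (column names here are made up for illustration): every partition is reindexed to one canonical column order, typically the order of the dask dataframe's meta.

```python
import pandas as pd

# Canonical column order -- in the dask case this would come from
# df.columns.tolist() on the meta; hypothetical names here.
canonical = ["a", "b", "c"]

def reorder(partition: pd.DataFrame) -> pd.DataFrame:
    # This is what the lambda passed to map_partitions does to each
    # partition: select columns in a fixed order.
    return partition[canonical]

part = pd.DataFrame({"c": [1], "a": [2], "b": [3]})
print(reorder(part).columns.tolist())  # ['a', 'b', 'c']
```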

For the fastparquet side, it is implicitly expected that the ordering of column chunks within any given row-group is the same (see, for instance, #318 , where this assumption is used to greatly increase metadata parsing efficiency). This is hinted at, but not stated explicitly in the standard.

I don't know from the above what the actual error might be coming from when using fastparquet...

@martindurant
Member

PS: if you wanted arrow people to notice your concerns, you should also raise an issue there; or perhaps in dask, since maybe it's only the interaction of the two that can cause this.

@wesm

wesm commented Mar 26, 2018

Let us know what behavior change (if any) would be preferred in Apache Arrow -- it might be nice to accept schemas with a permutation of column order (but only if the column names are unique).
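To make the proposed rule concrete, here is a sketch of a permutation-tolerant schema comparison. This is not pyarrow's API, just the comparison logic described above: treat a schema as a name-to-type mapping, and only accept the shortcut when column names are unique.

```python
def schemas_match_ignoring_order(a, b):
    """Compare two schemas given as lists of (column_name, type_string)
    pairs, ignoring column order. Only safe when names are unique."""
    names_a = [name for name, _ in a]
    names_b = [name for name, _ in b]
    # Reject duplicate column names -- order-insensitive matching is
    # ambiguous in that case.
    if len(set(names_a)) != len(names_a) or len(set(names_b)) != len(names_b):
        return False
    return dict(a) == dict(b)

# Two of the columns from the error above, permuted between write and read.
written = [("call_id", "BYTE_ARRAY UTF8"), ("crawl_id", "INT64")]
read_back = [("crawl_id", "INT64"), ("call_id", "BYTE_ARRAY UTF8")]
print(schemas_match_ignoring_order(written, read_back))  # True
```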

@birdsarah
Author

@wesm: Arrow was extremely helpful in this case. I'd be using it more if it stored divisions with the index to allow for quick row selection (I might be using my terminology wrong; I'm new here).

In general, for both Arrow and fastparquet, it would be helpful if columns could be arbitrarily ordered. In this case I'm reading in JSON. There's no guarantee of key order, but the keys are all consistent. Allowing consistent columns, but not requiring a fixed order, seems like it would be extremely handy.
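A tiny sketch of that JSON situation (toy keys, not the real schema): pandas takes the column order from the first record it sees, so files whose records list the same keys in different orders produce differently ordered frames, and sorting the columns before writing gives every file the same layout.

```python
import json
import pandas as pd

# Consistent keys, inconsistent key order -- typical of JSON sources.
records = [
    json.loads('{"b": 1, "a": 2}'),
    json.loads('{"a": 3, "b": 4}'),
]

# Column order follows the first record ('b' then 'a'); sorting the
# columns normalizes the layout before writing to parquet.
df = pd.DataFrame(records).sort_index(axis=1)
print(df.columns.tolist())  # ['a', 'b']
```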
