
Can't write dfs when column order changes TypeError: expected list of bytes #320

Open
birdsarah opened this issue Mar 24, 2018 · 5 comments


@birdsarah

My code looked like this:

```python
from dask import delayed
import dask.dataframe as dd

dfs = [delayed(build_df_from_filename)(filename) for filename in test_index]
df = dd.from_delayed(dfs, meta=meta, divisions='sorted')
df.to_parquet('test.parquet', compression='snappy')
```

It failed with `TypeError: expected list of bytes`.

When I changed the engine to pyarrow, to_parquet worked, but then dd.read_parquet gave me the error:

```
ValueError: Schema in test.parquet//part.0.parquet was different. <pyarrow._parquet.ParquetSchema object at 0x7f51d48fcc48>
call_stack: BYTE_ARRAY UTF8
crawl_id: INT64
func_name: BYTE_ARRAY UTF8
in_iframe: BOOLEAN
location: BYTE_ARRAY UTF8
operation: BYTE_ARRAY UTF8
script_col: BYTE_ARRAY UTF8
script_line: BYTE_ARRAY UTF8
script_loc_eval: BYTE_ARRAY UTF8
script_url: BYTE_ARRAY UTF8
symbol: BYTE_ARRAY UTF8
time_stamp: BYTE_ARRAY UTF8
value: BYTE_ARRAY UTF8
file_name: BYTE_ARRAY UTF8
arguments: BYTE_ARRAY UTF8
call_id: BYTE_ARRAY UTF8
  vs <pyarrow._parquet.ParquetSchema object at 0x7f51d48fc4c8>
call_id: BYTE_ARRAY UTF8
arguments: BYTE_ARRAY UTF8
call_stack: BYTE_ARRAY UTF8
crawl_id: INT64
file_name: BYTE_ARRAY UTF8
func_name: BYTE_ARRAY UTF8
in_iframe: BOOLEAN
location: BYTE_ARRAY UTF8
operation: BYTE_ARRAY UTF8
script_col: BYTE_ARRAY UTF8
script_line: BYTE_ARRAY UTF8
script_loc_eval: BYTE_ARRAY UTF8
script_url: BYTE_ARRAY UTF8
symbol: BYTE_ARRAY UTF8
time_stamp: BYTE_ARRAY UTF8
value: BYTE_ARRAY UTF8
```

The schemas are the same, but the order is different.

I then added `df = df.sort_index(axis=1)` to my `build_df_from_filename` function.
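For anyone hitting the same thing, here is a minimal sketch of what that fix does (toy column names and values, not the real schema above): `sort_index(axis=1)` puts every partition's columns into the same alphabetical order, so frames built from differently ordered sources end up structurally identical.

```python
import pandas as pd

# Two frames with the same columns in different order, as might
# come from sources with unstable key order.
a = pd.DataFrame({"b": [1], "a": [2]})
b = pd.DataFrame({"a": [3], "b": [4]})

# sort_index(axis=1) sorts the columns alphabetically, giving
# every frame an identical column layout.
a = a.sort_index(axis=1)
b = b.sort_index(axis=1)

print(a.columns.tolist())  # ['a', 'b']
print(b.columns.tolist())  # ['a', 'b']
```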

After doing that, I decided to retry fastparquet, and it worked.

It's a bit bonkers to me that:

  1. my columns need to be ordered identically.
  2. pyarrow only decided to tell me about that on read.

Gripe over.

What would be awesome is a helpful error message about mismatched columns, not just `TypeError: expected list of bytes`.

@birdsarah
Author

birdsarah commented Mar 24, 2018

Small update: I'm still having problems with pyarrow; the sort_index addition still sometimes gives me differently ordered columns.

Edit: that's not true. It's a different issue.

@martindurant
Member

martindurant commented Mar 24, 2018

It is generally expected that every partition of a dask dataframe has exactly the same structure, as described by the meta zero-length dataframe; in fact, some operations will complain or fail if a difference is detected, even just in the ordering of the columns. Of course, many pandas operations involve selecting columns, so in those cases you would not notice. You could solve it easily by doing something like

```python
df = df.map_partitions(lambda d: d[df.columns.tolist()])
```
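What that one-liner does to each partition can be sketched in plain pandas (column names here are made up for illustration): every partition is reindexed to one canonical column order, typically the order of the dask dataframe's meta.

```python
import pandas as pd

# Canonical column order -- in the dask case this would come from
# df.columns.tolist() on the meta; hypothetical names here.
canonical = ["a", "b", "c"]

def reorder(partition: pd.DataFrame) -> pd.DataFrame:
    # This is what the lambda passed to map_partitions does to each
    # partition: select columns in a fixed order.
    return partition[canonical]

part = pd.DataFrame({"c": [1], "a": [2], "b": [3]})
print(reorder(part).columns.tolist())  # ['a', 'b', 'c']
```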

For the fastparquet side, it is implicitly expected that the ordering of column chunks within any given row-group is the same (see, for instance, #318 , where this assumption is used to greatly increase metadata parsing efficiency). This is hinted at, but not stated explicitly in the standard.

I don't know from the above what the actual error might be coming from when using fastparquet...

@martindurant
Member

PS: if you wanted arrow people to notice your concerns, you should also raise an issue there; or perhaps in dask, since maybe it's only the interaction of the two that can cause this.

@wesm

wesm commented Mar 26, 2018

Let us know what behavior change (if any) would be preferred in Apache Arrow -- it might be nice to accept schemas with a permutation of column order (but only if the column names are unique).
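To make the proposed rule concrete, here is a sketch of a permutation-tolerant schema comparison. This is not pyarrow's API, just the comparison logic described above: treat a schema as a name-to-type mapping, and only accept the shortcut when column names are unique.

```python
def schemas_match_ignoring_order(a, b):
    """Compare two schemas given as lists of (column_name, type_string)
    pairs, ignoring column order. Only safe when names are unique."""
    names_a = [name for name, _ in a]
    names_b = [name for name, _ in b]
    # Reject duplicate column names -- order-insensitive matching is
    # ambiguous in that case.
    if len(set(names_a)) != len(names_a) or len(set(names_b)) != len(names_b):
        return False
    return dict(a) == dict(b)

# Two of the columns from the error above, permuted between write and read.
written = [("call_id", "BYTE_ARRAY UTF8"), ("crawl_id", "INT64")]
read_back = [("crawl_id", "INT64"), ("call_id", "BYTE_ARRAY UTF8")]
print(schemas_match_ignoring_order(written, read_back))  # True
```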

@birdsarah
Author

@wesm: Arrow was extremely helpful in this case. I'd be using it more if it stored divisions with the index to allow for quick row selection (I might be using my terminology wrong; I'm new here).

In general, for both Arrow and fastparquet, it would be helpful if columns could be arbitrarily ordered. In this case I'm reading in JSON. There's no guarantee of key order, but the keys are all consistent. Allowing consistent columns, but not requiring a fixed order, seems like it would be extremely handy.
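A tiny sketch of that JSON situation (toy keys, not the real schema): pandas takes the column order from the first record it sees, so files whose records list the same keys in different orders produce differently ordered frames, and sorting the columns before writing gives every file the same layout.

```python
import json
import pandas as pd

# Consistent keys, inconsistent key order -- typical of JSON sources.
records = [
    json.loads('{"b": 1, "a": 2}'),
    json.loads('{"a": 3, "b": 4}'),
]

# Column order follows the first record ('b' then 'a'); sorting the
# columns normalizes the layout before writing to parquet.
df = pd.DataFrame(records).sort_index(axis=1)
print(df.columns.tolist())  # ['a', 'b']
```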
