Can't write dfs when column order changes TypeError: expected list of bytes
#320
Edit: that's not true — different issue.
It is generally expected that every partition of a dask dataframe has exactly the same structure, as described by the meta zero-length dataframe; in fact, some operations will complain/fail if a difference is detected, even just in the ordering of the columns. Of course many pandas operations involve selecting columns, so in those cases you would not notice. You could solve it easily by doing something like
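The original snippet from this comment was not captured, but the idea can be sketched with plain pandas (a hypothetical illustration, not the maintainer's exact code): pick one canonical column order and reindex every chunk/partition to it before handing the data to dask or fastparquet.

```python
import pandas as pd

# Two stand-in "partitions" whose columns arrived in different orders
# (e.g. built from JSON records with inconsistent key order).
parts = [
    pd.DataFrame({"a": [1], "b": [2]}),
    pd.DataFrame({"b": [3], "a": [4]}),
]

# Reindex every chunk to the first chunk's column order, so all
# partitions share the structure described by the meta.
cols = list(parts[0].columns)
parts = [p[cols] for p in parts]

print([list(p.columns) for p in parts])
```

With dask, the same reindexing could be applied per partition via `map_partitions`.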
On the fastparquet side, it is implicitly expected that the ordering of column chunks within any given row-group is the same (see, for instance, #318, where this assumption is used to greatly increase metadata parsing efficiency). This is hinted at, but not stated explicitly, in the standard. I don't know from the above where the actual error might be coming from when using fastparquet.
PS: if you want the Arrow people to notice your concerns, you should also raise an issue there — or perhaps in dask, since it may be only the interaction of the two that causes this.
Let us know what behavior change (if any) would be preferred in Apache Arrow -- it might be nice to accept schemas with a permutation of column order (but only if the column names are unique).
@wesm: Arrow was extremely helpful in this case. I'd be using it more if it stored divisions with the index to allow for quick row selection (I might be using my terminology wrong; I'm new here). In general, for both Arrow and fastparquet, it would be helpful if columns could be arbitrarily ordered. In this case I'm reading in JSON: there's no guarantee of key order, but the keys are all consistent. Allowing consistent columns without requiring a fixed order seems like it would be extremely handy.
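The JSON situation described above is easy to reproduce (a minimal sketch, assuming Python 3.7+, where `dict` and `json.loads` preserve key order): two records with the same keys in different order produce frames that differ only in column order, not schema.

```python
import json
import pandas as pd

# Same keys, different order in the source JSON; pandas keeps the key
# order it first encounters, so only the column order differs.
df1 = pd.DataFrame([json.loads('{"x": 1, "y": 2}')])
df2 = pd.DataFrame([json.loads('{"y": 2, "x": 1}')])

print(list(df1.columns), list(df2.columns))
```

Each such frame would become a dask partition with a different column order than the meta, triggering exactly the mismatch discussed here.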
My code looked like this:
It failed with `TypeError: expected list of bytes`. When I changed the engine to `pyarrow`, `to_parquet` worked, but then `dd.from_parquet` gave me the error: "The schemas are the same, but the order is different."
I then added `df = df.sort_index(axis=1)` to my `build_df_from_filename` function. After doing that, I decided to retry fastparquet, and it worked.
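The one-line fix can be seen in isolation: `sort_index(axis=1)` sorts the columns alphabetically, so every frame ends up with the same column order regardless of how its source keys were ordered.

```python
import pandas as pd

# Columns arrive in an arbitrary order...
df = pd.DataFrame({"b": [1], "a": [2], "c": [3]})

# ...and sorting along axis=1 normalizes them alphabetically.
df = df.sort_index(axis=1)

print(list(df.columns))  # -> ['a', 'b', 'c']
```

Applying this inside the per-file loader guarantees every partition matches the meta's column order before `to_parquet` is called.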
It's a bit bonkers to me that:
Gripe over.
What would be awesome would be a helpful error message about mismatched columns, not just `TypeError: expected list of bytes`.