TypeError when writing to S3 with partition_cols #503

Open
tammymendt opened this issue May 18, 2020 · 4 comments
@tammymendt
The issue can be reproduced as follows:

import pandas as pd

df = pd.DataFrame([
    [1, 'DE', 2.3],
    [2, 'BE', 4.5],
    [3, 'DE', 7.6],
    [4, 'DE', 4.8]
], columns=['id', 'country', 'value'])

df.to_parquet('s3://<my-s3-bucket>/<my-directory>', compression='gzip', index=False, engine='fastparquet', partition_cols=['country'])

When doing the same write operation without the partition_cols argument, it works fine. The error stack trace is as follows:

Traceback (most recent call last):
  File "<my-python-file>.py", line 20, in <module>
    engine='fastparquet')
  File "pandas/util/_decorators.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "pandas/core/frame.py", line 2116, in to_parquet
    **kwargs,
  File "pandas/io/parquet.py", line 264, in to_parquet
    **kwargs,
  File "pandas/io/parquet.py", line 185, in write
    **kwargs,
  File "fastparquet/writer.py", line 895, in write
    fn = join_path(filename, '_metadata')
  File "fastparquet/util.py", line 330, in join_path
    if path[0][0] == '/':
TypeError: 'S3File' object is not subscriptable

The code assumes the path[0] variable is a string, but here it is an S3File object. For an S3File object, the path string can be accessed using .path. Thus the check should look like path[0].path[0] instead.
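For illustration, a minimal sketch of what a more tolerant join_path could look like, assuming the only requirement is to recover the path string from a file-like argument (the .path attribute is what s3fs's S3File exposes; this is not fastparquet's actual code):

def join_path(*path):
    # Sketch only: accept both plain strings and open file objects.
    def as_str(part):
        # s3fs's S3File exposes its key as .path; use it when present.
        return part.path if hasattr(part, 'path') else part
    return '/'.join(as_str(p) for p in path)

With something like this, path[0][0] in the check above would read the first character of the actual key rather than trying to subscript the file object.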

The package versions are:

fastparquet==0.4.0
packaging==20.3
pandas==1.0.1
s3fs==0.4.2
@martindurant
Member

Can you please cross-post on pandas? fastparquet certainly handles this, so apparently the call is being made incorrectly, but I'm not sure exactly how.

(cc pandas-dev/pandas#33452 )

@tammymendt
Author

So pandas seems to assume that the first argument to the api.write function can be either a path or a buffer. In the case of an S3 file, it passes an S3File object (a buffer), not the file path string. Here is the function that does this (https://github.com/pandas-dev/pandas/blob/master/pandas/io/s3.py#L23). I think this behavior is intended, though.

However, the write function in fastparquet expects a filename (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L764). The write_simple function works fine with both a filepath and a File object (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L735). But the rest of the logic in the write function relies on the argument being a string.

Ideally, I suppose pandas should pass an argument to write that is always the same type of object with the same interface (so even when it's just a string, it should be wrapped by some class). That way the write function in fastparquet would not have to handle paths and buffers differently. I assume a change like this in pandas would likely break other parts of that code, since the get_filepath_or_buffer function is used quite a lot in pandas (https://github.com/pandas-dev/pandas/search?p=1&q=get_filepath_or_buffer&unscoped_q=get_filepath_or_buffer).
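As a possible workaround while the two sides disagree, one could skip pandas' path-to-buffer conversion entirely and call fastparquet directly with a string path plus s3fs's open function. A sketch, reusing the df from the snippet above; the bucket/directory placeholders are the same as in the report:

import fastparquet
import s3fs

# Sketch of a workaround: bypass pandas' path->buffer conversion and hand
# fastparquet a plain string path together with s3fs's file-opening hook.
s3 = s3fs.S3FileSystem()
fastparquet.write(
    '<my-s3-bucket>/<my-directory>',   # placeholder, as in the report above
    df,                                # the DataFrame from the repro snippet
    compression='gzip',
    file_scheme='hive',                # one directory per partition value
    partition_on=['country'],          # fastparquet's equivalent of partition_cols
    open_with=s3.open,
    mkdirs=lambda path: None,          # S3 has no directories to create
)

This follows the open_with pattern from fastparquet's documentation for writing to remote filesystems, so partitioned writes keep working while the pandas path handling is sorted out.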

@martindurant
Member

I believe this should now be fixed in at least pandas master (and probably in a released version too).

@tammymendt
Author

@martindurant cool, thanks. I will check, and if it's fixed I'll close the issue.
