TypeError when writing to S3 with partition_cols #503

Open
tammymendt opened this issue May 18, 2020 · 4 comments
@tammymendt
The issue can be reproduced as follows:

import pandas as pd

df = pd.DataFrame([
    [1, 'DE', 2.3],
    [2, 'BE', 4.5],
    [3, 'DE', 7.6],
    [4, 'DE', 4.8]
], columns=['id', 'country', 'value'])

df.to_parquet('s3://<my-s3-bucket>/<my-directory>', compression='gzip', index=False, engine='fastparquet', partition_cols=['country'])

When doing the same write operation without the partition_cols argument, it works fine. The error stack trace is as follows:

Traceback (most recent call last):
  File "<my-python-file>.py", line 20, in <module>
    engine='fastparquet')
  File "pandas/util/_decorators.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "pandas/core/frame.py", line 2116, in to_parquet
    **kwargs,
  File "pandas/io/parquet.py", line 264, in to_parquet
    **kwargs,
  File "pandas/io/parquet.py", line 185, in write
    **kwargs,
  File "fastparquet/writer.py", line 895, in write
    fn = join_path(filename, '_metadata')
  File "fastparquet/util.py", line 330, in join_path
    if path[0][0] == '/':
TypeError: 'S3File' object is not subscriptable

The code assumes the path[0] variable is a string, but here it is an S3File object. For an S3File object, the path string can be accessed using .path. Thus the check should look like path[0].path[0] instead.
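For illustration, a minimal sketch of what a more tolerant join_path could look like, assuming the only requirement is to recover the path string from a file-like argument (the .path attribute is what s3fs's S3File exposes; this is not fastparquet's actual code):

def join_path(*path):
    # Sketch only: accept both plain strings and open file objects.
    def as_str(part):
        # s3fs's S3File exposes its key as .path; use it when present.
        return part.path if hasattr(part, 'path') else part
    return '/'.join(as_str(p) for p in path)

With something like this, path[0][0] in the check above would read the first character of the actual key rather than trying to subscript the file object.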

The package versions are:

fastparquet==0.4.0
packaging==20.3
pandas==1.0.1
s3fs==0.4.2
@martindurant
Member

Can you please cross-post on pandas? fastparquet certainly handles this, so apparently the call is being made incorrectly, but I'm not sure exactly how.

(cc pandas-dev/pandas#33452 )

@tammymendt
Author

So pandas seems to assume that the first argument to the api.write function can be either a path or a buffer. In the case of an S3 file, it passes an S3File object (a buffer), not the file path string. Here is the function that does this (https://github.com/pandas-dev/pandas/blob/master/pandas/io/s3.py#L23). I think this behavior is intended, though.

However, the write function in fastparquet expects a filename (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L764). The write_simple function works fine with both a filepath and a File object (https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py#L735). But the rest of the logic in the write function relies on the argument being a string.

Ideally, I suppose pandas should pass an argument to write that is always the same type of object with the same interface (so even when it's just a string, it should be wrapped by some class). That way the write function in fastparquet would not have to handle paths and buffers differently. I assume a change like this in pandas would likely break other parts of that code, since the get_filepath_or_buffer function is used quite a lot in pandas (https://github.com/pandas-dev/pandas/search?p=1&q=get_filepath_or_buffer&unscoped_q=get_filepath_or_buffer).
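As a possible workaround while the two sides disagree, one could skip pandas' path-to-buffer conversion entirely and call fastparquet directly with a string path plus s3fs's open function. A sketch, reusing the df from the snippet above; the bucket/directory placeholders are the same as in the report:

import fastparquet
import s3fs

# Sketch of a workaround: bypass pandas' path->buffer conversion and hand
# fastparquet a plain string path together with s3fs's file-opening hook.
s3 = s3fs.S3FileSystem()
fastparquet.write(
    '<my-s3-bucket>/<my-directory>',   # placeholder, as in the report above
    df,                                # the DataFrame from the repro snippet
    compression='gzip',
    file_scheme='hive',                # one directory per partition value
    partition_on=['country'],          # fastparquet's equivalent of partition_cols
    open_with=s3.open,
    mkdirs=lambda path: None,          # S3 has no directories to create
)

This follows the open_with pattern from fastparquet's documentation for writing to remote filesystems, so partitioned writes keep working while the pandas path handling is sorted out.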

@martindurant
Member

I believe this should now be fixed in at least pandas master (and probably in a released version too).

@tammymendt
Author

@martindurant cool, thanks. I will check, and if it's fixed I'll close the issue.
