Failed writing a dataframe to '.avro' file #58

Open
Anna050689 opened this issue Sep 12, 2024 · 0 comments

Prerequisites:

  • Python 3.10
  • pandavro==1.8.0
  • fastavro==1.9.7

Steps to reproduce the issue:

  • Create a dataframe with the following data:
import pandas as pd

data = {
    'id': [545, 539, 643, 615, 502, 599, 542, 587, 537, 518],
    'first_name': ['caallai', 'Xzaaen', 'olrie', 'Iaairl', 'hfreiio', 'yieri', 'hcninn', 'irannir', 'Cmrnnan', 'Mnaeail'],
    'last_name': ['kroaoe', 'trrot', 'haill', 'kolide', 'errhnd', 'aoaoet', 'yBorrd', 'evbceyd', 'Wcnoee', 'eMloen'],
    'created_date': ['12/22/1992', '06/02/1992', '09/23/1998', '01/01/1997', '03/26/1990', '06/01/1996', '08/08/1992', '01/14/1995', '06/16/1992', '06/24/1991'],
    'Active': [False, False, False, False, False, True, False, False, False, True]
}
df = pd.DataFrame(data=data).astype('object')
  • Attempt to save the dataframe to an '.avro' file using the following command:
import pandavro as pdx

path = 'output.avro'
pdx.to_avro(path, df, schema=None)

Expected behavior:

The dataframe should be saved to an '.avro' file without any errors.

Actual behavior:

The following error is raised:

  File "fastavro/_write.pyx", line 779, in fastavro._write.writer
  File "fastavro/_write.pyx", line 687, in fastavro._write.Writer.__init__
  File "fastavro/_schema.pyx", line 173, in fastavro._schema.parse_schema
  File "fastavro/_schema.pyx", line 407, in fastavro._schema._parse_schema
  File "fastavro/_schema.pyx", line 475, in fastavro._schema.parse_field
  File "fastavro/_schema.pyx", line 233, in fastavro._schema._parse_schema
  File "fastavro/_schema.pyx", line 263, in fastavro._schema._parse_schema
TypeError: argument of type 'NoneType' is not iterable

The inferred schema is:

{
    'fields': [
        {'name': 'id', 'type': ['null', None]},
        {'name': 'first_name', 'type': ['null', 'string']},
        {'name': 'last_name', 'type': ['null', 'string']},
        {'name': 'created_date', 'type': ['null', 'string']},
        {'name': 'Active', 'type': ['null', 'boolean']}
    ],
    'name': 'Root',
    'type': 'record'
}

Additional Information:

The failure comes from schema inference: when the "id" column has dtype object, its type is inferred as ['null', None] instead of ['null', 'int'], and fastavro then fails while parsing that None entry.
When the "id" column has an integer dtype, the '.avro' file is saved successfully.
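
Columns affected by this can be spotted before writing by scanning the inferred schema for union members that are None. The following is a minimal sketch in plain Python; `find_untyped_fields` is a hypothetical helper name, not part of pandavro or fastavro, and the schema literal is copied from the inferred schema shown above.

```python
def find_untyped_fields(schema):
    """Return the names of record fields whose type, or any union member, is None."""
    untyped = []
    for field in schema.get('fields', []):
        field_type = field['type']
        # A union is represented as a list; a single type is a bare value.
        members = field_type if isinstance(field_type, list) else [field_type]
        if any(member is None for member in members):
            untyped.append(field['name'])
    return untyped


# Schema copied from the report above.
inferred_schema = {
    'fields': [
        {'name': 'id', 'type': ['null', None]},
        {'name': 'first_name', 'type': ['null', 'string']},
        {'name': 'last_name', 'type': ['null', 'string']},
        {'name': 'created_date', 'type': ['null', 'string']},
        {'name': 'Active', 'type': ['null', 'boolean']},
    ],
    'name': 'Root',
    'type': 'record',
}

print(find_untyped_fields(inferred_schema))  # ['id']
```

In practice the schema would presumably come from pandavro's `schema_infer(df)` rather than a literal; that call is not exercised here.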

Workaround:

As a temporary workaround, explicitly cast the "id" column to an integer dtype before saving the dataframe to an '.avro' file:

df['id'] = df['id'].astype('int')
pdx.to_avro(path, df, schema=None)
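
When several object columns hold plain integers or booleans, casting each one by hand gets tedious. A possible alternative, sketched below, is pandas' `DataFrame.infer_objects()`, which re-examines object columns and assigns a more specific dtype where the values allow it. The frame here is a shortened stand-in for the one in the report, and whether this is enough for pandavro in every case has not been verified.

```python
import pandas as pd

# Small object-dtype frame mirroring the shape of the one in the report.
df = pd.DataFrame({
    'id': [545, 539, 643],
    'first_name': ['caallai', 'Xzaaen', 'olrie'],
    'Active': [False, False, True],
}).astype('object')

# Every column starts out as object, which is what triggers the bad inference.
print(df.dtypes)

# infer_objects() returns a new frame; integer-only and boolean-only
# object columns receive integer and boolean dtypes respectively.
converted = df.infer_objects()
print(converted.dtypes)
```

The converted frame can then be passed to `pdx.to_avro(path, converted, schema=None)` in the same way as in the workaround above.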