Big strings cause AssertionError: found X raw bytes (expected Y) #2562

Closed
naroom opened this issue Sep 14, 2018 · 3 comments

Comments

naroom commented Sep 14, 2018

Writing very long strings with pyarrow causes an exception when the file is read back with fastparquet.

Traceback (most recent call last):
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
    read_fastparquet()
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
    dff = pf.to_pandas(['A'])
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
    index=index, assign=parts)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
    scheme=self.file_scheme)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
    cats, selfmade, assign=assign)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
    catdef=out.get(name+'-catdef', None))
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
    skip_nulls, selfmade=selfmade)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
    raw_bytes = _read_page(f, header, metadata)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
    page_header.uncompressed_page_size)
AssertionError: found 175532 raw bytes (expected 200026)
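
As a rough sanity check on the "expected" figure, the size of one uncompressed page can be estimated by hand. This is a sketch under assumptions not confirmed in this report: a v1 data page, PLAIN encoding (consistent with use_dictionary=False), one page per row group, and a nullable column with no nulls whose definition levels fit in a single RLE run. The helper names are hypothetical.

```python
# Back-of-the-envelope size of one uncompressed, PLAIN-encoded data page.
# Assumptions (not stated in the report): data page v1, one page per row
# group, a nullable column with no nulls.

LENGTH_PREFIX = 4  # PLAIN BYTE_ARRAY stores a 4-byte length before each value


def rle_levels_size(n_values):
    """Bytes used by definition levels stored as a single RLE run of 1s."""
    # The run header is the varint encoding of (n_values << 1).
    header = n_values << 1
    header_bytes = 1
    while header >= 0x80:
        header >>= 7
        header_bytes += 1
    # 4-byte length prefix + run header + one byte holding the 1-bit level.
    return 4 + header_bytes + 1


def expected_page_size(n_values, value_length):
    """Definition levels plus length-prefixed values for one page."""
    values = n_values * (LENGTH_PREFIX + value_length)
    return values + rle_levels_size(n_values)


print(expected_page_size(5, 40000))  # 200026
```

With ROW_GROUP_SIZE = 5 and ROW_LENGTH = 40000 this gives 200026, the same number the page header declares, so the declared size looks plausible for a five-row page. It does not account for the 175532 bytes that were actually found.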

If written with compression, it reports compression errors instead:

SNAPPY: snappy.UncompressError: Error while decompressing: invalid input

GZIP: zlib.error: Error -3 while decompressing data: incorrect header check

Minimal code to reproduce:

import os
import pandas as pd
import pyarrow
import pyarrow.parquet as arrow_pq
from fastparquet import ParquetFile

# data to generate
ROW_LENGTH = 40000  # decreasing below 32750ish eliminates exception
N_ROWS = 10

# file write params
ROW_GROUP_SIZE = 5  # Lower numbers eliminate exception, but strange data is read (e.g. Nones)
FILENAME = 'test.parquet'

def write_arrow():
    df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]})
    if os.path.isfile(FILENAME):
        os.remove(FILENAME)
    arrow_table = pyarrow.Table.from_pandas(df)
    arrow_pq.write_table(arrow_table,
                         FILENAME,
                         use_dictionary=False,
                         compression='NONE',
                         row_group_size=ROW_GROUP_SIZE)


def read_arrow():
    print("arrow:")
    table2 = arrow_pq.read_table(FILENAME)
    print(table2.to_pandas().head())


def read_fastparquet():
    print("fastparquet:")
    pf = ParquetFile(FILENAME)
    dff = pf.to_pandas(['A'])
    print(dff.head())


write_arrow()
read_arrow()
read_fastparquet()
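
The comment on ROW_LENGTH only gives an approximate threshold ("32750ish"). One way to pin it down is bisection over the string length. The sketch below is illustrative only: `find_threshold` and `fake_fails` are hypothetical names, and it assumes the failure is monotonic in string length, which this report does not verify. In a real run the predicate would write a file with strings of length n and return True if the fastparquet read raises.

```python
def find_threshold(fails, lo, hi):
    """Return the smallest n in (lo, hi] for which fails(n) is True.

    Assumes fails(lo) is False, fails(hi) is True, and that failure is
    monotonic in n (an assumption, not verified in this report).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Stand-in predicate so the sketch runs on its own. The cutoff is a
# placeholder taken from the "32750ish" comment, not a measured value.
def fake_fails(n):
    return n >= 32750


print(find_threshold(fake_fails, 1000, 40000))
```

Each probe halves the search range, so going from 1000 to 40000 takes about 16 write/read cycles rather than thousands.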

Versions:
fastparquet==0.1.6
pyarrow==0.10.0
pandas==0.22.0
sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'

I also opened an issue here: dask/fastparquet#375

wesm commented Sep 14, 2018

Can you please open a JIRA issue? Thanks

naroom commented Sep 14, 2018

naroom closed this as completed Sep 14, 2018

MicPie commented May 17, 2019

Hello @naroom,

Did you find a way to work around this error?

Kind regards,
Michael
