Big strings cause AssertionError: found X raw bytes (expected Y) #375

Open
naroom opened this issue Sep 13, 2018 · 3 comments


naroom commented Sep 13, 2018

Writing really long strings with pyarrow causes an exception when the file is read back with fastparquet.

Traceback (most recent call last):
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
    read_fastparquet()
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
    dff = pf.to_pandas(['A'])
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
    index=index, assign=parts)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
    scheme=self.file_scheme)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
    cats, selfmade, assign=assign)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
    catdef=out.get(name+'-catdef', None))
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
    skip_nulls, selfmade=selfmade)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
    raw_bytes = _read_page(f, header, metadata)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
    page_header.uncompressed_page_size)
AssertionError: found 175532 raw bytes (expected 200026)

If the file is written with compression, fastparquet reports decompression errors instead:

SNAPPY: snappy.UncompressError: Error while decompressing: invalid input

GZIP: zlib.error: Error -3 while decompressing data: incorrect header check

Minimal code to reproduce:

import os
import pandas as pd
import pyarrow
import pyarrow.parquet as arrow_pq
from fastparquet import ParquetFile

# data to generate
ROW_LENGTH = 40000  # decreasing below 32750ish eliminates exception
N_ROWS = 10

# file write params
ROW_GROUP_SIZE = 5  # Lower numbers eliminate exception, but strange data is read (e.g. Nones)
FILENAME = 'test.parquet'

def write_arrow():
    df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]})
    if os.path.isfile(FILENAME):
        os.remove(FILENAME)
    arrow_table = pyarrow.Table.from_pandas(df)
    arrow_pq.write_table(arrow_table,
                         FILENAME,
                         use_dictionary=False,
                         compression='NONE',
                         row_group_size=ROW_GROUP_SIZE)


def read_arrow():
    print("arrow:")
    table2 = arrow_pq.read_table(FILENAME)
    print(table2.to_pandas().head())


def read_fastparquet():
    print("fastparquet:")
    pf = ParquetFile(FILENAME)
    dff = pf.to_pandas(['A'])
    print(dff.head())


write_arrow()
read_arrow()
read_fastparquet()
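
For the compressed cases mentioned above, the write call is presumably unchanged apart from the compression argument; a minimal sketch, assuming df is built as in write_arrow() and reusing FILENAME and ROW_GROUP_SIZE from the script above:

# Hedged variant of the write_table() call in write_arrow(): only the
# compression argument changes. Reading the resulting file with fastparquet
# then raises the SNAPPY/GZIP errors quoted earlier instead of the
# AssertionError.
arrow_pq.write_table(pyarrow.Table.from_pandas(df),
                     FILENAME,
                     use_dictionary=False,
                     compression='SNAPPY',  # or 'GZIP'
                     row_group_size=ROW_GROUP_SIZE)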

Versions:
fastparquet==0.1.6
pyarrow==0.10.0
pandas==0.22.0
sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'

Also opened issue here: apache/arrow#2562

@martindurant (Member)

If you write with fastparquet too, then it works just fine. It will be tricky to find out why this happens.
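
A minimal sketch of that round trip, assuming the same data as the repro above (fastparquet.write with an integer row_group_offsets to get roughly 5-row groups; the output filename is made up):

import pandas as pd
from fastparquet import write, ParquetFile

df = pd.DataFrame({'A': ['A' * 40000 for _ in range(10)]})
# Write with fastparquet instead of pyarrow, then read back with fastparquet.
write('test_fp.parquet', df, row_group_offsets=5, compression=None)
print(ParquetFile('test_fp.parquet').to_pandas(['A']).head())  # no AssertionError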


dargueta commented Nov 7, 2018

I've found that changing the row group size changes the error message (see the sketch after this list):

  • 1: Weird data, seems to be all empty strings.
  • 2: found 55906 raw bytes (expected 80014)
  • 3: Weird data with Nones as mentioned earlier
  • 4: found 55648 raw bytes (expected 80014)
  • 5: found 175555 raw bytes (expected 200026)
  • 6: found 135543 raw bytes (expected 160022)
  • 7: RuntimeError: Ran out of input
  • 8: RuntimeError: Ran out of input
  • 9: RuntimeError: Ran out of input
  • 10: found 375482 raw bytes (expected 400046)
  • 11+ gives the same error message as 10 since there's only one group.
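
A sketch of the sweep behind that list, reusing the names and imports from the repro script above (the range and error handling are my own):

# For each row_group_size, rewrite the file with pyarrow and try to read it
# back with fastparquet, printing whatever error comes out.
for rgs in range(1, 12):
    df = pd.DataFrame({'A': ['A' * ROW_LENGTH for _ in range(N_ROWS)]})
    arrow_pq.write_table(pyarrow.Table.from_pandas(df), FILENAME,
                         use_dictionary=False, compression='NONE',
                         row_group_size=rgs)
    try:
        out = ParquetFile(FILENAME).to_pandas(['A'])
        print(rgs, 'read OK -- inspect out for empty strings / Nones')
    except Exception as exc:
        print(rgs, type(exc).__name__, exc)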

Changing the compression to GZIP gives "Not a gzipped file (b'AA')". It appears that either the columns aren't actually being compressed by pyarrow (unlikely), or fastparquet tries to decompress them twice.

Packages:

fastparquet==0.1.6
pandas==0.23.4
pyarrow==0.11.1

Python version:

3.7.0 (default, Oct 30 2018, 16:00:26) 
[Clang 10.0.0 (clang-1000.10.44.2)]

martindurant mentioned this issue Nov 8, 2018

adamhooper added a commit to CJWorkbench/cjworkbench that referenced this issue Nov 8, 2018:
"This happens with old files that we wrote using pyarrow. See dask/fastparquet#375"
@martindurant (Member)

I wonder, can someone check with pdb in fastparquet.core.read_col whether there is a dictionary page in these columns? I know that arrow starts off assuming a dictionary and reverts to plain encoding when the dictionary gets too big; so that's one difference I would expect to appear as the string size increases.
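
Short of stepping through read_col with pdb, one way to look for a dictionary page is via the column-chunk metadata that fastparquet exposes on row_groups; a sketch, assuming the Thrift field names from the parquet-format spec (dictionary_page_offset is None when no dictionary page was written):

from fastparquet import ParquetFile

pf = ParquetFile('test.parquet')
for i, rg in enumerate(pf.row_groups):
    for col in rg.columns:
        md = col.meta_data
        # dictionary_page_offset and encodings come straight from the parquet
        # Thrift metadata; a non-None offset means a dictionary page exists.
        print(i, md.path_in_schema, md.dictionary_page_offset, md.encodings)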
