Big strings cause AssertionError: found X raw bytes (expected Y) #375

Open
naroom opened this issue Sep 13, 2018 · 3 comments


naroom commented Sep 13, 2018

Writing really long strings with pyarrow causes an exception when the file is read back with fastparquet.

Traceback (most recent call last):
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
    read_fastparquet()
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
    dff = pf.to_pandas(['A'])
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
    index=index, assign=parts)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
    scheme=self.file_scheme)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
    cats, selfmade, assign=assign)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
    catdef=out.get(name+'-catdef', None))
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
    skip_nulls, selfmade=selfmade)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
    raw_bytes = _read_page(f, header, metadata)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
    page_header.uncompressed_page_size)
AssertionError: found 175532 raw bytes (expected 200026)

If the file is written with compression, fastparquet reports decompression errors instead:

SNAPPY: snappy.UncompressError: Error while decompressing: invalid input

GZIP: zlib.error: Error -3 while decompressing data: incorrect header check

Minimal code to reproduce:

import os
import pandas as pd
import pyarrow
import pyarrow.parquet as arrow_pq
from fastparquet import ParquetFile

# data to generate
ROW_LENGTH = 40000  # decreasing below 32750ish eliminates exception
N_ROWS = 10

# file write params
ROW_GROUP_SIZE = 5  # Lower numbers eliminate exception, but strange data is read (e.g. Nones)
FILENAME = 'test.parquet'

def write_arrow():
    df = pd.DataFrame({'A': ['A'*ROW_LENGTH for _ in range(N_ROWS)]})
    if os.path.isfile(FILENAME):
        os.remove(FILENAME)
    arrow_table = pyarrow.Table.from_pandas(df)
    arrow_pq.write_table(arrow_table,
                         FILENAME,
                         use_dictionary=False,
                         compression='NONE',
                         row_group_size=ROW_GROUP_SIZE)


def read_arrow():
    print("arrow:")
    table2 = arrow_pq.read_table(FILENAME)
    print(table2.to_pandas().head())


def read_fastparquet():
    print("fastparquet:")
    pf = ParquetFile(FILENAME)
    dff = pf.to_pandas(['A'])
    print(dff.head())


write_arrow()
read_arrow()
read_fastparquet()
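
For the compressed cases mentioned above, the write call is presumably unchanged apart from the compression argument; a minimal sketch, assuming df is built as in write_arrow() and reusing FILENAME and ROW_GROUP_SIZE from the script above:

# Hedged variant of the write_table() call in write_arrow(): only the
# compression argument changes. Reading the resulting file with fastparquet
# then raises the SNAPPY/GZIP errors quoted earlier instead of the
# AssertionError.
arrow_pq.write_table(pyarrow.Table.from_pandas(df),
                     FILENAME,
                     use_dictionary=False,
                     compression='SNAPPY',  # or 'GZIP'
                     row_group_size=ROW_GROUP_SIZE)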

Versions:
fastparquet==0.1.6
pyarrow==0.10.0
pandas==0.22.0
sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'

Also opened issue here: apache/arrow#2562

@martindurant (Member)

If you write with fastparquet too, then it works just fine. It will be tricky to find out why this happens.
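
A minimal sketch of that round trip, assuming the same data as the repro above (fastparquet.write with an integer row_group_offsets to get roughly 5-row groups; the output filename is made up):

import pandas as pd
from fastparquet import write, ParquetFile

df = pd.DataFrame({'A': ['A' * 40000 for _ in range(10)]})
# Write with fastparquet instead of pyarrow, then read back with fastparquet.
write('test_fp.parquet', df, row_group_offsets=5, compression=None)
print(ParquetFile('test_fp.parquet').to_pandas(['A']).head())  # no AssertionError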


dargueta commented Nov 7, 2018

I've found that changing the row group size changes the error message (see the sketch after this list):

  • 1: Weird data, seems to be all empty strings.
  • 2: found 55906 raw bytes (expected 80014)
  • 3: Weird data with Nones as mentioned earlier
  • 4: found 55648 raw bytes (expected 80014)
  • 5: found 175555 raw bytes (expected 200026)
  • 6: found 135543 raw bytes (expected 160022)
  • 7: RuntimeError: Ran out of input
  • 8: RuntimeError: Ran out of input
  • 9: RuntimeError: Ran out of input
  • 10: found 375482 raw bytes (expected 400046)
  • 11+ gives the same error message as 10 since there's only one group.
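
A sketch of the sweep behind that list, reusing the names and imports from the repro script above (the range and error handling are my own):

# For each row_group_size, rewrite the file with pyarrow and try to read it
# back with fastparquet, printing whatever error comes out.
for rgs in range(1, 12):
    df = pd.DataFrame({'A': ['A' * ROW_LENGTH for _ in range(N_ROWS)]})
    arrow_pq.write_table(pyarrow.Table.from_pandas(df), FILENAME,
                         use_dictionary=False, compression='NONE',
                         row_group_size=rgs)
    try:
        out = ParquetFile(FILENAME).to_pandas(['A'])
        print(rgs, 'read OK -- inspect out for empty strings / Nones')
    except Exception as exc:
        print(rgs, type(exc).__name__, exc)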

Changing the compression to GZIP gives "Not a gzipped file (b'AA')". It appears that either the columns aren't actually being compressed by pyarrow (unlikely), or fastparquet tries to decompress them twice.

Packages:

fastparquet==0.1.6
pandas==0.23.4
pyarrow==0.11.1

Python version:

3.7.0 (default, Oct 30 2018, 16:00:26) 
[Clang 10.0.0 (clang-1000.10.44.2)]

martindurant mentioned this issue Nov 8, 2018

adamhooper added a commit to CJWorkbench/cjworkbench that referenced this issue Nov 8, 2018:
"This happens with old files that we wrote using pyarrow. See dask/fastparquet#375"
@martindurant (Member)

I wonder, can someone check with pdb in fastparquet.core.read_col whether there is a dictionary page in these columns? I know that arrow starts off assuming a dictionary and reverts to plain encoding when the dictionary gets too big; so that's one difference I would expect to appear as the string size increases.
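
Short of stepping through read_col with pdb, one way to look for a dictionary page is via the column-chunk metadata that fastparquet exposes on row_groups; a sketch, assuming the Thrift field names from the parquet-format spec (dictionary_page_offset is None when no dictionary page was written):

from fastparquet import ParquetFile

pf = ParquetFile('test.parquet')
for i, rg in enumerate(pf.row_groups):
    for col in rg.columns:
        md = col.meta_data
        # dictionary_page_offset and encodings come straight from the parquet
        # Thrift metadata; a non-None offset means a dictionary page exists.
        print(i, md.path_in_schema, md.dictionary_page_offset, md.encodings)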
