Writing really long strings from pyarrow causes an exception when the file is read back with fastparquet.
Traceback (most recent call last):
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 47, in <module>
    read_fastparquet()
  File "/Users/twalker/repos/cloud-atlas/diag/right.py", line 41, in read_fastparquet
    dff = pf.to_pandas(['A'])
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 426, in to_pandas
    index=index, assign=parts)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/api.py", line 258, in read_row_group
    scheme=self.file_scheme)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 344, in read_row_group
    cats, selfmade, assign=assign)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 321, in read_row_group_arrays
    catdef=out.get(name+'-catdef', None))
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 235, in read_col
    skip_nulls, selfmade=selfmade)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 99, in read_data_page
    raw_bytes = _read_page(f, header, metadata)
  File "/Users/twalker/anaconda/lib/python2.7/site-packages/fastparquet/core.py", line 31, in _read_page
    page_header.uncompressed_page_size)
AssertionError: found 175532 raw bytes (expected 200026)
If the file is written with compression, fastparquet reports decompression errors instead:
SNAPPY: snappy.UncompressError: Error while decompressing: invalid input
GZIP: zlib.error: Error -3 while decompressing data: incorrect header check
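For context, the assertion at core.py line 31 in the traceback compares the number of bytes actually read for a data page against the uncompressed_page_size declared in that page's header. A rough, self-contained paraphrase of that kind of check, for illustration only (this is not fastparquet's actual source; the names are assumed from the traceback):

def _read_page_check(file_obj, page_header):
    # Illustration only: the page header promises uncompressed_page_size
    # bytes of raw page data (equal to compressed_page_size when the file
    # is written with compression='NONE'); if fewer bytes turn up in the
    # column chunk, an assertion like this one trips.
    raw_bytes = file_obj.read(page_header.compressed_page_size)
    assert len(raw_bytes) == page_header.uncompressed_page_size, \
        'found %d raw bytes (expected %d)' % (
            len(raw_bytes), page_header.uncompressed_page_size)
    return raw_bytes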
Minimal code to reproduce:
import os

import pandas as pd
import pyarrow
import pyarrow.parquet as arrow_pq
from fastparquet import ParquetFile

# data to generate
ROW_LENGTH = 40000  # decreasing below 32750ish eliminates exception
N_ROWS = 10

# file write params
ROW_GROUP_SIZE = 5  # Lower numbers eliminate exception, but strange data is read (e.g. Nones)
FILENAME = 'test.parquet'


def write_arrow():
    df = pd.DataFrame({'A': ['A' * ROW_LENGTH for _ in range(N_ROWS)]})
    if os.path.isfile(FILENAME):
        os.remove(FILENAME)
    arrow_table = pyarrow.Table.from_pandas(df)
    arrow_pq.write_table(arrow_table,
                         FILENAME,
                         use_dictionary=False,
                         compression='NONE',
                         row_group_size=ROW_GROUP_SIZE)


def read_arrow():
    print "arrow:"
    table2 = arrow_pq.read_table(FILENAME)
    print table2.to_pandas().head()


def read_fastparquet():
    print "fastparquet:"
    pf = ParquetFile(FILENAME)
    dff = pf.to_pandas(['A'])
    print dff.head()


write_arrow()
read_arrow()
read_fastparquet()
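Not part of the original script, but possibly a useful diagnostic: dumping the sizes pyarrow itself recorded for each row group of the written file. This is a sketch that reuses arrow_pq and FILENAME from the reproducer above and assumes pyarrow's ParquetFile metadata accessors (num_row_groups, row_group(), column(), total_compressed_size, total_uncompressed_size):

def inspect_metadata():
    # Print what pyarrow recorded per row group for column 'A', to compare
    # against the byte counts in fastparquet's assertion message.
    md = arrow_pq.ParquetFile(FILENAME).metadata
    print 'row groups: %d' % md.num_row_groups
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        col = rg.column(0)
        print 'row group %d: rows=%d compressed=%d uncompressed=%d' % (
            i, rg.num_rows, col.total_compressed_size,
            col.total_uncompressed_size)

With the parameters above (row groups of 5 strings of 40,000 characters each), the uncompressed figure should land close to the 200026 bytes mentioned in the assertion message.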
Versions:
fastparquet==0.1.6
pyarrow==0.10.0
pandas==0.22.0
sys.version '2.7.15 |Anaconda custom (64-bit)| (default, May 1 2018, 18:37:05) \n[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]'
Also opened issue here: dask/fastparquet#375