
TypeError("don't know what type: 14",) while reading from parquet #316

Open
michcio1234 opened this issue Mar 14, 2018 · 6 comments


michcio1234 commented Mar 14, 2018

Hello,
I'm experiencing a strange issue when reading from a parquet file. I'm not sure if this is the right place to post it; if not, please let me know where I should.

I have a rather big dask graph whose leaves read from parquet files (on Google Cloud Storage). Sometimes, seemingly at random, a computation fails with the following message in the worker:

distributed.worker - WARNING -  Compute Failed
Function:  _read_parquet_row_group
args:      (<gcsfs.dask_link.DaskGCSFileSystem object at 0x7f0d20774e10>, 'bucket/path/filename.parquet/part.0.parquet', ['index-name'], ['column', 'names', 'index-name'], <class 'fastparquet.parquet_thrift.parquet.ttypes.RowGroup'>
columns: [<class 'fastparquet.parquet_thrift.parquet.ttypes.ColumnChunk'>
file_offset: 1862717
file_path: part.0.parquet
meta_data: <class 'fastparquet.parquet_thrift.parquet.ttypes.ColumnMetaData'>
  codec: 2
  data_page_offset: 4
  dictionary_page_offset: None
  encoding_stats: [<class 'fastparquet.parquet_thrift.parquet.ttypes.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
  encodings: [3, 4, 0]
  index_page_offset: None
  key_value_metadata: []
  num_values: 609350
  path_in_schema: ['timestamp']
  statistics: <class 'fastparquet.parquet_thrift.parquet.ttypes.Statistics'>
    distinct_count: None
    max: b'\xb0\x07\x9c\xedSe\x05\x00'
    min: b'\x00\x1c\xf2/He\x05\x00'
    null_count: 0


kwargs:    {}
Exception: TypeError("don't know what type: 14",)

I'm not getting any traceback, just this information.
Strangely, if I try reading the same file later with dd.read_parquet().compute(), everything works fine. It happens for different files, and each time I can read the file later with no problems. Unfortunately, I cannot share the troublesome files.
I've found a somewhat similar issue here; I'm not sure if it's related, though.

Versions:

  • python 3.5.2
  • dask: 0.17.1
  • fastparquet: 0.1.3
  • thrift: 0.10.0
  • gcsfs: 0.0.3
All versions

asn1crypto==0.22.0
attrs==17.4.0
better-exceptions==0.2.1
better-exceptions-hook==1.0.0
bleach==2.1.3
blosc==1.4.4
bokeh==0.12.13
boto3==1.5.11
botocore==1.8.26
cachetools==2.0.1
cffi==1.10.0
click==6.7
cloudpickle==0.5.2
conda==4.4.7
cryptography==1.8.1
cycler==0.10.0
Cython==0.27.3
dask==0.17.1
decorator==4.1.2
distributed==1.21.3
docutils==0.14
drmigrator==0.1.7+20.g937a29c
drtools==1.9.0+1.g8f027e5
entrypoints==0.2.3
fastparquet==0.1.3
gcsfs==0.0.3
google-api-core==1.0.0
google-auth==1.4.1
google-auth-oauthlib==0.2.0
google-cloud-bigquery==0.31.0
google-cloud-core==0.28.1
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
heapdict==1.0.0
html5lib==1.0.1
httplib2==0.10.3
idna==2.6
ipykernel==4.8.2
ipython==6.2.1
ipython-genutils==0.2.0
jedi==0.11.1
Jinja2==2.10
jmespath==0.9.3
jsonschema==2.6.0
jupyter-client==5.2.3
jupyter-core==4.4.0
jupyterlab==0.31.12
jupyterlab-launcher==0.10.5
kiwisolver==1.0.1
llvmlite==0.21.0
locket==0.2.0
lockfile==0.12.2
luigi==2.7.1
lz4==0.10.1
MarkupSafe==1.0
matplotlib==2.2.0
mistune==0.8.3
mmh3==2.4
msgpack-python==0.4.8
nbconvert==5.3.1
nbformat==4.4.0
notebook==5.4.0
numba==0.36.2
numexpr==2.6.4
numpy==1.11.3
oauth2client==4.1.2
oauthlib==2.0.6
packaging==16.8
pandas==0.20.3
pandas-gbq==0.3.1
pandocfilters==1.4.2
parso==0.1.1
partd==0.3.8
pexpect==4.4.0
pickleshare==0.7.4
pluggy==0.6.0
prompt-toolkit==1.0.15
protobuf==3.5.2
psutil==5.4.1
ptyprocess==0.5.2
py==1.5.2
pyasn1==0.3.7
pyasn1-modules==0.1.5
pycosat==0.6.3
pycparser==2.18
pycrypto==2.6.1
Pygments==2.2.0
pykube==0.15.0
pyOpenSSL==17.0.0
pyparsing==2.2.0
pytest==3.3.2
python-daemon==2.1.2
python-dateutil==2.6.1
pytz==2017.3
PyYAML==3.12
pyzmq==17.0.0
requests==2.14.2
requests-oauthlib==0.8.0
rsa==3.4.2
ruamel-yaml===-VERSION
s3fs==0.1.2
s3transfer==0.1.11
scikit-learn==0.19.0
scipy==0.19.1
seaborn==0.8.1
Send2Trash==1.5.0
simplegeneric==0.8.1
six==1.10.0
sortedcontainers==1.5.7
tables==3.2.2
tblib==1.3.2
terminado==0.8.1
testpath==0.3.1
thrift==0.10.0
toolz==0.8.2
tornado==4.5.2
tqdm==4.19.7
traitlets==4.3.2
tzlocal==1.5.1
urllib3==1.22
wcwidth==0.1.7
webencodings==0.5.1
xgboost==0.7.post3
zict==0.1.3

I'd appreciate any help in resolving this issue. What should I check, and where should I look?
Regards,
Michal

@michcio1234 michcio1234 changed the title TypeError("don't know what type: 14",) TypeError("don't know what type: 14",) while reading from parquet Mar 14, 2018

martindurant commented Mar 14, 2018

I suppose this is the right place to log the issue, although I don't know what to make of your traceback.
This error is coming from thrift, not from fastparquet itself. I don't see the value 14 anywhere in the thrift data dump. My only guess is that it has something to do with the parquet thrift format specification version or the version of the thrift library. You may want to try a different or the latest version.

It would be good to list all versions, and, if possible, try to reproduce the error outside of distributed, so you can get more debugging information.

@michcio1234

@martindurant thank you for the quick response. I've updated my issue with the versions of all installed packages. I'll try to reproduce it outside of distributed.

@martindurant

It is possible that GCS isn't returning as many bytes as requested on fetch calls. This is not explicitly tested and may account for the intermittency of the errors you see. Can you try the following:

--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1330,6 +1330,7 @@ class GCSFile:
                                    end + self.blocksize)
                 self.end = end + self.blocksize
                 self.cache = self.cache + new
+        self.end = self.start + len(self.cache)

     def read(self, length=-1):

(or, even simpler, a statement that logs whenever start, end, and the cache do not tally)
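The invariant behind both suggestions can be illustrated with a toy model (hypothetical class and function names, not the actual gcsfs code): if the backend returns fewer bytes than requested, `end` overstates how much data is cached and later reads slice past the real data; the reassignment in the patch restores `end = start + len(cache)`.

```python
# Toy model of a read-ahead cache (hypothetical; not the real gcsfs.GCSFile).
# `fetch(start, end)` may return fewer bytes than requested (a "short read").
class CachedReader:
    def __init__(self, fetch, blocksize=4):
        self.fetch = fetch
        self.blocksize = blocksize
        self.start = 0
        self.end = 0
        self.cache = b""

    def _fetch(self, start, end):
        if end > self.end:
            new = self.fetch(self.end, end + self.blocksize)
            self.cache = self.cache + new
            # Without the patch, this line assumes the fetch was complete:
            self.end = end + self.blocksize
            # The suggested fix: trust only the bytes we actually cached.
            self.end = self.start + len(self.cache)


def short_fetch(start, end):
    # Simulates a backend that silently returns half the requested range.
    return b"x" * max(0, (end - start) // 2)


r = CachedReader(short_fetch)
r._fetch(0, 8)
assert r.end - r.start == len(r.cache)  # invariant holds after the fix
```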

@michcio1234

Thanks, I'll give it a try.
For now, I think I've worked around it by calling client.compute(..., retries=2).
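The workaround helps because the failure is transient: a second attempt at the same task usually succeeds. A toy sketch of the idea in plain Python (illustrative only; `client.compute(..., retries=2)` does the equivalent per task inside the distributed scheduler):

```python
# Toy retry loop, not the distributed API.
def with_retries(fn, retries):
    for attempt in range(retries + 1):
        try:
            return fn()
        except TypeError:
            if attempt == retries:
                raise


calls = {"n": 0}

def flaky_read():
    # Fails on the first call, like the intermittent parquet read.
    calls["n"] += 1
    if calls["n"] == 1:
        raise TypeError("don't know what type: 14")
    return b"row group bytes"


data = with_retries(flaky_read, retries=2)
assert data == b"row group bytes" and calls["n"] == 2
```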

@michcio1234

michcio1234 commented Mar 15, 2018

@martindurant if I understood you correctly, this is what you meant (in "even simpler" case):

@@ -1330,6 +1330,9 @@ def _fetch(self, start, end):
                                    end + self.blocksize)
                 self.end = end + self.blocksize
                 self.cache = self.cache + new
+        if self.end - self.start != len(self.cache):
+            warnings.warn('Start, end and cache do not tally (_fetch method). '
+                          'Fetching data may go wrong.')
 
     def read(self, length=-1):
         """

However, this warning appears even when files are being read correctly.

@martindurant

The above warning is indeed what I had in mind, but it occurs to me that it could also fire when end is beyond the end of the file, which is not a problem. Since things generally work for you, most of the warnings are probably benign.
