
TypeError("don't know what type: 14",) while reading from parquet #316

Open
michcio1234 opened this issue Mar 14, 2018 · 6 comments


michcio1234 commented Mar 14, 2018

Hello,
I'm experiencing a strange issue when reading from a parquet file. I'm not sure if this is the right place to post it; if not, please let me know where I should.

I have a rather big dask graph whose leaves read from parquet files (on Google Cloud Storage). Sometimes, seemingly at random, a computation fails with the following message in the worker:

distributed.worker - WARNING -  Compute Failed
Function:  _read_parquet_row_group
args:      (<gcsfs.dask_link.DaskGCSFileSystem object at 0x7f0d20774e10>, 'bucket/path/filename.parquet/part.0.parquet', ['index-name'], ['column', 'names', 'index-name'], <class 'fastparquet.parquet_thrift.parquet.ttypes.RowGroup'>
columns: [<class 'fastparquet.parquet_thrift.parquet.ttypes.ColumnChunk'>
file_offset: 1862717
file_path: part.0.parquet
meta_data: <class 'fastparquet.parquet_thrift.parquet.ttypes.ColumnMetaData'>
  codec: 2
  data_page_offset: 4
  dictionary_page_offset: None
  encoding_stats: [<class 'fastparquet.parquet_thrift.parquet.ttypes.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]
  encodings: [3, 4, 0]
  index_page_offset: None
  key_value_metadata: []
  num_values: 609350
  path_in_schema: ['timestamp']
  statistics: <class 'fastparquet.parquet_thrift.parquet.ttypes.Statistics'>
    distinct_count: None
    max: b'\xb0\x07\x9c\xedSe\x05\x00'
    min: b'\x00\x1c\xf2/He\x05\x00'
    null_count: 0


kwargs:    {}
Exception: TypeError("don't know what type: 14",)

I'm not getting any traceback, just this information.
Strangely, if I try reading the same file later with dd.read_parquet().compute(), everything works fine. It happens for different files, and each time I can read the file later with no problems. Unfortunately, I cannot share the troublesome files.
I've found a somewhat similar issue here; I'm not sure if it's related, though.

Versions:

  • python 3.5.2
  • dask: 0.17.1
  • fastparquet: 0.1.3
  • thrift: 0.10.0
  • gcsfs: 0.0.3
All versions

asn1crypto==0.22.0
attrs==17.4.0
better-exceptions==0.2.1
better-exceptions-hook==1.0.0
bleach==2.1.3
blosc==1.4.4
bokeh==0.12.13
boto3==1.5.11
botocore==1.8.26
cachetools==2.0.1
cffi==1.10.0
click==6.7
cloudpickle==0.5.2
conda==4.4.7
cryptography==1.8.1
cycler==0.10.0
Cython==0.27.3
dask==0.17.1
decorator==4.1.2
distributed==1.21.3
docutils==0.14
drmigrator==0.1.7+20.g937a29c
drtools==1.9.0+1.g8f027e5
entrypoints==0.2.3
fastparquet==0.1.3
gcsfs==0.0.3
google-api-core==1.0.0
google-auth==1.4.1
google-auth-oauthlib==0.2.0
google-cloud-bigquery==0.31.0
google-cloud-core==0.28.1
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
heapdict==1.0.0
html5lib==1.0.1
httplib2==0.10.3
idna==2.6
ipykernel==4.8.2
ipython==6.2.1
ipython-genutils==0.2.0
jedi==0.11.1
Jinja2==2.10
jmespath==0.9.3
jsonschema==2.6.0
jupyter-client==5.2.3
jupyter-core==4.4.0
jupyterlab==0.31.12
jupyterlab-launcher==0.10.5
kiwisolver==1.0.1
llvmlite==0.21.0
locket==0.2.0
lockfile==0.12.2
luigi==2.7.1
lz4==0.10.1
MarkupSafe==1.0
matplotlib==2.2.0
mistune==0.8.3
mmh3==2.4
msgpack-python==0.4.8
nbconvert==5.3.1
nbformat==4.4.0
notebook==5.4.0
numba==0.36.2
numexpr==2.6.4
numpy==1.11.3
oauth2client==4.1.2
oauthlib==2.0.6
packaging==16.8
pandas==0.20.3
pandas-gbq==0.3.1
pandocfilters==1.4.2
parso==0.1.1
partd==0.3.8
pexpect==4.4.0
pickleshare==0.7.4
pluggy==0.6.0
prompt-toolkit==1.0.15
protobuf==3.5.2
psutil==5.4.1
ptyprocess==0.5.2
py==1.5.2
pyasn1==0.3.7
pyasn1-modules==0.1.5
pycosat==0.6.3
pycparser==2.18
pycrypto==2.6.1
Pygments==2.2.0
pykube==0.15.0
pyOpenSSL==17.0.0
pyparsing==2.2.0
pytest==3.3.2
python-daemon==2.1.2
python-dateutil==2.6.1
pytz==2017.3
PyYAML==3.12
pyzmq==17.0.0
requests==2.14.2
requests-oauthlib==0.8.0
rsa==3.4.2
ruamel-yaml===-VERSION
s3fs==0.1.2
s3transfer==0.1.11
scikit-learn==0.19.0
scipy==0.19.1
seaborn==0.8.1
Send2Trash==1.5.0
simplegeneric==0.8.1
six==1.10.0
sortedcontainers==1.5.7
tables==3.2.2
tblib==1.3.2
terminado==0.8.1
testpath==0.3.1
thrift==0.10.0
toolz==0.8.2
tornado==4.5.2
tqdm==4.19.7
traitlets==4.3.2
tzlocal==1.5.1
urllib3==1.22
wcwidth==0.1.7
webencodings==0.5.1
xgboost==0.7.post3
zict==0.1.3

I'd appreciate any help in resolving this issue. What should I check, and where should I look?
Regards,
Michal

@michcio1234 michcio1234 changed the title TypeError("don't know what type: 14",) TypeError("don't know what type: 14",) while reading from parquet Mar 14, 2018

martindurant commented Mar 14, 2018

I suppose this is the right place to log the issue, although I don't know what to make of your traceback.
This error is coming from thrift, not from fastparquet itself. I don't see the value 14 anywhere in the thrift data dump. My only guess is that it has something to do with the parquet thrift format specification version or the version of the thrift library. You may want to try a different or the latest version.

It would be good to list all versions, and, if possible, try to reproduce the error outside of distributed, so you can get more debugging information.

@michcio1234

@martindurant thank you for the quick response. I've updated my issue with the versions of all installed packages. I'll try to reproduce it outside of distributed.

@martindurant

It is possible that GCS isn't returning as many bytes as requested on fetch calls. This is not explicitly tested and may account for the intermittency of the errors you see. Can you try the following:

--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1330,6 +1330,7 @@ class GCSFile:
                                    end + self.blocksize)
                 self.end = end + self.blocksize
                 self.cache = self.cache + new
+        self.end = self.start + len(self.cache)

     def read(self, length=-1):

(or, even simpler, a statement that logs whenever start, end, and the cache do not tally)
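The invariant behind both suggestions can be illustrated with a toy model (hypothetical class and function names, not the actual gcsfs code): if the backend returns fewer bytes than requested, `end` overstates how much data is cached and later reads slice past the real data; the reassignment in the patch restores `end = start + len(cache)`.

```python
# Toy model of a read-ahead cache (hypothetical; not the real gcsfs.GCSFile).
# `fetch(start, end)` may return fewer bytes than requested (a "short read").
class CachedReader:
    def __init__(self, fetch, blocksize=4):
        self.fetch = fetch
        self.blocksize = blocksize
        self.start = 0
        self.end = 0
        self.cache = b""

    def _fetch(self, start, end):
        if end > self.end:
            new = self.fetch(self.end, end + self.blocksize)
            self.cache = self.cache + new
            # Without the patch, this line assumes the fetch was complete:
            self.end = end + self.blocksize
            # The suggested fix: trust only the bytes we actually cached.
            self.end = self.start + len(self.cache)


def short_fetch(start, end):
    # Simulates a backend that silently returns half the requested range.
    return b"x" * max(0, (end - start) // 2)


r = CachedReader(short_fetch)
r._fetch(0, 8)
assert r.end - r.start == len(r.cache)  # invariant holds after the fix
```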

@michcio1234

Thanks, I'll give it a try.
For now, I think I've worked around it by calling client.compute(..., retries=2).
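The workaround helps because the failure is transient: a second attempt at the same task usually succeeds. A toy sketch of the idea in plain Python (illustrative only; `client.compute(..., retries=2)` does the equivalent per task inside the distributed scheduler):

```python
# Toy retry loop, not the distributed API.
def with_retries(fn, retries):
    for attempt in range(retries + 1):
        try:
            return fn()
        except TypeError:
            if attempt == retries:
                raise


calls = {"n": 0}

def flaky_read():
    # Fails on the first call, like the intermittent parquet read.
    calls["n"] += 1
    if calls["n"] == 1:
        raise TypeError("don't know what type: 14")
    return b"row group bytes"


data = with_retries(flaky_read, retries=2)
assert data == b"row group bytes" and calls["n"] == 2
```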

@michcio1234

michcio1234 commented Mar 15, 2018

@martindurant if I understood you correctly, this is what you meant (in "even simpler" case):

@@ -1330,6 +1330,9 @@ def _fetch(self, start, end):
                                    end + self.blocksize)
                 self.end = end + self.blocksize
                 self.cache = self.cache + new
+        if self.end - self.start != len(self.cache):
+            warnings.warn('Start, end and cache do not tally (_fetch method). '
+                          'Fetching data may go wrong.')
 
     def read(self, length=-1):
         """

However, this warning appears even when files are being read correctly.

@martindurant

The above warning is indeed what I had in mind, but it occurs to me that it could also fire when end is beyond the end of the file, which is not a problem. Since things generally work for you, most of the warnings are probably benign.
