TypeError("don't know what type: 14",) while reading from parquet #316
Comments
I suppose this is the right place to log this issue, although I don't know what to make of your traceback. It would be good to list all versions and, if possible, try to reproduce the error outside of distributed, so you can get more debugging information.
@martindurant thank you for the quick response. I've updated my issue with the versions of all installed packages. I'll try to reproduce it outside of distributed.
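For an intermittent failure like this one, a small retry harness can help capture the full traceback once the read is run outside of distributed. The following is only a sketch: `collect_failures` and `make_flaky_reader` are hypothetical helper names, and the flaky reader merely simulates the error so that the example runs without any cloud access.

```python
import traceback


def collect_failures(read_once, attempts=20):
    """Call `read_once` repeatedly and keep the traceback of every failure."""
    failures = []
    for attempt in range(attempts):
        try:
            read_once()
        except Exception:
            # format_exc() returns the full traceback of the active exception
            failures.append((attempt, traceback.format_exc()))
    return failures


def make_flaky_reader(fail_on):
    """Simulated reader (an assumption for this sketch): fails on the given call numbers."""
    calls = {"n": 0}

    def read_once():
        calls["n"] += 1
        if calls["n"] in fail_on:
            raise TypeError("don't know what type: 14")
        return b"ok"

    return read_once


failures = collect_failures(make_flaky_reader({2, 5}), attempts=6)
for attempt, tb in failures:
    print(f"attempt {attempt} failed:\n{tb}")
```

In a real session the simulated reader would be replaced by the actual read, for example a lambda that calls `dd.read_parquet(path).compute(scheduler="synchronous")`, where `path` stands for one of the affected files. The synchronous scheduler runs everything in the calling thread, which tends to make tracebacks easier to follow.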
It is possible that GCS isn't returning as many bytes as requested on fetch calls. This is not explicitly tested, and it may account for the intermittency of the errors you see. Can you try the following:
--- a/gcsfs/core.py
+++ b/gcsfs/core.py
@@ -1330,6 +1330,7 @@ class GCSFile:
                                end + self.blocksize)
             self.end = end + self.blocksize
             self.cache = self.cache + new
+            self.end = self.start + len(self.cache)
     def read(self, length=-1):
(or, even simpler, a statement that logs whenever the start, end and cache do not tally)
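To illustrate the hypothesis, here is a toy model of a forward read-ahead cache. It is not the gcsfs implementation: `ReadAheadBuffer`, `short_fetch` and the `resync` flag are hypothetical names introduced for this sketch, and the backend that returns at most 10 bytes per request is a simulated assumption. It shows how `end` can drift away from the cached bytes after a short read, and how recomputing `end` from the cache length keeps the two consistent.

```python
import logging

logger = logging.getLogger("readahead_sketch")

DATA = bytes(range(100))


def short_fetch(start, end):
    """Simulated backend (assumption): returns at most 10 bytes per request."""
    return DATA[start:min(end, start + 10)]


class ReadAheadBuffer:
    """Toy forward-only cache; a sketch, not the real GCSFile."""

    def __init__(self, fetch_range, blocksize, resync=True):
        self.fetch_range = fetch_range
        self.blocksize = blocksize
        self.resync = resync
        self.start = 0
        self.end = 0
        self.cache = b""

    def extend(self, end):
        """Grow the cache forward so that it should cover bytes up to `end`."""
        if end <= self.end:
            return  # the bookkeeping claims these bytes are already cached
        new = self.fetch_range(self.end, end + self.blocksize)
        self.end = end + self.blocksize  # what was requested
        self.cache = self.cache + new
        if self.end - self.start != len(self.cache):
            logger.warning("start=%d end=%d but cache holds %d bytes",
                           self.start, self.end, len(self.cache))
            if self.resync:
                # Trust what was actually received, as in the proposed patch
                self.end = self.start + len(self.cache)


stale = ReadAheadBuffer(short_fetch, blocksize=16, resync=False)
stale.extend(20)
print(stale.end, len(stale.cache))  # bookkeeping and cache disagree

fixed = ReadAheadBuffer(short_fetch, blocksize=16, resync=True)
fixed.extend(20)
fixed.extend(30)
print(fixed.end, len(fixed.cache))  # bookkeeping matches the cache
```

Without the resync, a later `extend` call returns early because `end` already claims the range is cached, so a reader slicing the cache would get the wrong bytes. That kind of silent misalignment would be consistent with a parser seeing an unexpected type code, although the thread does not confirm that this is the cause.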
Thanks, I'll give it a try.
@martindurant if I understood you correctly, this is what you meant (in the "even simpler" case):
@@ -1330,6 +1330,9 @@ def _fetch(self, start, end):
                                end + self.blocksize)
             self.end = end + self.blocksize
             self.cache = self.cache + new
+            if self.end - self.start != len(self.cache):
+                warnings.warn('Start, end and cache do not tally (_fetch method). '
+                              'Fetching data may go wrong.')
     def read(self, length=-1):
         """
However, this warning appears even when files are being read correctly.
The above warning is indeed what I had in mind, but it occurs to me that it could happen where
Hello
I'm experiencing a strange issue when reading from a parquet file. I'm not sure if this is the right place to post an issue; if not, please let me know where I should post it.
I have a rather big dask graph that reads from parquet (on Google Cloud Storage) at its leaves. Sometimes, seemingly at random, a computation fails with the following message in the worker:
I'm not getting any traceback, just this information.
Strangely, if I try reading the same file later with dd.read_parquet().compute(), everything works fine. It happens for different files, and each time I can read the file later with no problems. Unfortunately, I cannot share the troublesome files. I've found a somewhat similar issue here, though I'm not sure if it's related.
Versions:
All versions
I'd appreciate any help in resolving this issue. What should I check, where should I look?
Regards
Michal