awswrangler.s3.read_parquet() chunked=True 1 dataframe per result #2086
Comments
Hey,

Our documentation appears to be incorrect there. If chunked=True is passed, a chunk never spans more than one file, but a large file may be split into several chunks (up to 65,536 rows each, which is the size you are seeing). To give an example, let's say you have two files in your dataset. The first file has 70,000 rows, and the second has 50,000. By passing chunked=True, you would get three DataFrames: 65,536 rows and then the remaining 4,464 rows from the first file, followed by the 50,000 rows from the second.

Alternatively, if you explicitly pass a chunk size, rows from different files will be spliced together in order to ensure the chunk size. Therefore, if you pass an explicit integer, every chunk except possibly the last will contain exactly that many rows, even when that means combining rows from neighboring files.

I will update the documentation to better reflect the actual behavior. Let me know if this answers your question.
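For illustration, a minimal sketch of the two modes described above, assuming a placeholder bucket path and an arbitrary 100,000-row chunk size:

```python
import awswrangler as wr

# Placeholder dataset path -- substitute your own.
path = "s3://my-bucket/my-dataset/"

# chunked=True: chunks never span more than one file, but a large file
# may be split into several chunks (up to 65,536 rows each).
for i, df in enumerate(wr.s3.read_parquet(path, chunked=True)):
    print(f"chunk {i}: {len(df)} rows")

# chunked=<int>: rows from different files are spliced together so that
# every chunk (except possibly the last) has exactly this many rows.
for i, df in enumerate(wr.s3.read_parquet(path, chunked=100_000)):
    print(f"chunk {i}: {len(df)} rows")
```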
Thank you for the response! That definitely makes sense. Is there any world where we can chunk such that each chunk's size equals the number of rows in that particular Parquet file? i.e. for the data I have, there are roughly ~1,067,300 rows per Parquet file (but this will vary by a few hundred to a few thousand). If not, maybe you could recommend a solution that might be helpful for my use case. I appreciate your time and help!
Hey,

If I understand correctly, you want exactly one chunk for each file, and each of those chunks would have ~1,067,300 rows? If so, you could iterate through the files you have in your dataset (for example by listing them with wr.s3.list_objects) and call read_parquet on each file path individually, which gives you one DataFrame per file.

Let me know if this works for you,
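A minimal sketch of that per-file approach, assuming a placeholder bucket path and using wr.s3.list_objects to enumerate the files:

```python
import awswrangler as wr

# Placeholder prefix -- substitute your own dataset location.
path = "s3://my-bucket/my-dataset/"

# One DataFrame per file: list the objects under the prefix and
# call read_parquet on each Parquet file individually.
for file_path in wr.s3.list_objects(path):
    if not file_path.endswith(".parquet"):
        continue
    df = wr.s3.read_parquet(file_path)
    print(f"{file_path}: {len(df)} rows")
```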
Roughly that many rows; the number of rows will vary (the data is measurements of optical hardware and has an inconsistent number of entries), but yeah, one chunk for each file would be ideal. I will take a look at your recommendation! Thank you so much for taking the time.
Sorry @LeonLuttenberger, one more question:
Hey,
Did you mean you were seeing 11.5 GB of memory usage? 11.5 GB for 12 Parquet files, where each file is 8 MB, seems excessive. Does each have ~1,067,300 rows in it? Can I ask how many columns are in each, and what you used to measure the memory? I'd like to try to replicate these results to better understand what's happening.

The reason chunked reads exist is to keep memory usage low by only holding one chunk at a time. The main exception to the rule that they stay memory-friendly is if you keep every chunk around (for example by collecting them into a list or concatenating them), in which case the whole dataset ends up in memory anyway.
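A rough sketch of measuring the per-chunk memory footprint with pandas' memory_usage(deep=True), assuming a placeholder dataset path:

```python
import awswrangler as wr

# Placeholder dataset path -- substitute your own.
path = "s3://my-bucket/my-dataset/"

# Report the in-memory footprint of each chunk. Parquet is compressed and
# columnar, so a DataFrame usually takes several times more memory than
# the file does on disk.
for i, df in enumerate(wr.s3.read_parquet(path, chunked=True)):
    mib = df.memory_usage(deep=True).sum() / 1024**2
    print(f"chunk {i}: {len(df)} rows, {mib:.1f} MiB in memory")
```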
Hey,

Which version of awswrangler are you using? If you're already on the latest release, let me know and I'll try to reproduce this.

Cheers,
Is there any chance this is an M1 thing?
Hi there,
I have a question regarding the chunked=True option in awswrangler.s3.read_parquet(). I'm looking to load Parquet files from S3 in the most memory-efficient way possible. Our data has a differing number of rows per Parquet file, but the same number of columns (11). I'd like the results from read_parquet() to be separated into one pandas DataFrame per Parquet file, i.e. if, based on the filter_query, it returns 10 Parquet files, I will receive 10 pandas DataFrames in return. chunked=True works if the number of rows is the same every time, but with our data there will be a different number of rows from time to time, so hard-coding the chunk size isn't feasible.

The documentation says that with chunked=True a new DataFrame is returned for each file in the dataset. However, it also seems to be choosing an arbitrary size to chunk by (in my case it's chunks of 65,536).

Is there something I'm missing here with regards to this? Thanks very much for your help!
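To make the mismatch concrete, a rough sketch (assuming a placeholder dataset path) that compares the number of Parquet files under a prefix with the number of chunks chunked=True yields:

```python
import awswrangler as wr

# Placeholder dataset path -- substitute your own.
path = "s3://my-bucket/my-dataset/"

# Count files versus chunks. Because chunks are capped at 65,536 rows,
# the number of chunks can exceed the number of files.
n_files = sum(1 for p in wr.s3.list_objects(path) if p.endswith(".parquet"))
n_chunks = sum(1 for _ in wr.s3.read_parquet(path, chunked=True))
print(f"{n_files} files -> {n_chunks} chunks")
```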