Write without loading to RAM (skip pandas?) #476
In short: yes, it is often possible to load and process pandas datasets by chunk, and some of the loaders (CSV, ...) have methods for doing that. For this library, you can use [...]. However, you might find that [...].
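For readers following along, here is a minimal sketch of chunk-at-a-time reading with fastparquet's `ParquetFile.iter_row_groups`; the file name, column names, and `process()` helper are placeholders, not taken from this thread:

```python
from fastparquet import ParquetFile

def process(df):
    # placeholder for real per-chunk work
    print(len(df))

pf = ParquetFile("large_dataset.parquet")  # hypothetical input file

# Each iteration yields one row group as a pandas DataFrame, so only that
# row group's worth of data is held in memory at a time.
for chunk in pf.iter_row_groups(columns=["col_a", "col_b"]):
    process(chunk)
```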
Awesome - thank you. I think using the [...] will work. Pseudocode for anyone else interested:
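The original snippet is not preserved here; a hypothetical reconstruction, assuming `pandas.read_csv(chunksize=...)` for chunked reading and `fastparquet.write(..., append=True)` for incremental writing, might look like this:

```python
import pandas as pd
from fastparquet import write

first = True
# chunksize keeps only ~100k rows in memory at any one time
for chunk in pd.read_csv("big_input.csv", chunksize=100_000):
    # append=False creates the file on the first chunk; append=True adds
    # new row groups on later chunks
    write("big_output.parquet", chunk, append=not first)
    first = False
```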
Feel free to close this issue as needed.
For writing, I would not issue repeated append calls, but instead write separate files and load them together later. Again, dask can help with this sort of thing.
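A sketch of that "separate files, combine later" pattern; the paths, column name, and chunk source are illustrative assumptions, and the `engine="fastparquet"` argument assumes a dask version that still supports the fastparquet engine:

```python
import os
import dask.dataframe as dd
import pandas as pd
from fastparquet import write

os.makedirs("parts", exist_ok=True)

# 1) write one small parquet file per chunk, never holding the whole table in RAM
for i, chunk in enumerate(pd.read_csv("big_input.csv", chunksize=100_000)):
    write(f"parts/part-{i:05d}.parquet", chunk)

# 2) later, treat the parts as one lazy dataset and compute out-of-core
ddf = dd.read_parquet("parts/*.parquet", engine="fastparquet")
print(ddf["some_column"].mean().compute())
```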
Hi Martin, Basically, my understanding is that when you append to an existing parquet data set, the metadata gets updated. If I write one file at a time, the metadata does not get consolidated, and selective loading / loading by chunk becomes more difficult, does it not? Thanks for your advice on this, best
Each append requires reading the whole metadata, altering it in memory, and then writing it all to a file again. With detailed delving into the thrift code, it *would* be possible to read up to a certain row-group in the metadata and start writing the new metadata there; but this code doesn't exist, and I think it would be hard to write.
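If the worry in the previous comment is consolidated metadata, one option (an assumption on my part that `fastparquet.writer.merge` fits here, not something stated in the thread) is to write the parts separately and consolidate their footers once at the end, paying the metadata-rewrite cost a single time rather than once per appended chunk:

```python
import glob
from fastparquet import writer

part_files = sorted(glob.glob("parts/part-*.parquet"))

# merge() reads each part's footer and writes a single consolidated
# _metadata file next to the parts, so readers can still do selective,
# row-group-at-a-time loading over the whole collection.
writer.merge(part_files)
```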
My understanding of the pandas library is that it requires loading the entire dataset into memory. Is there any way to avoid this requirement and write data from a stream or stored file, without having preloaded the entire table into RAM via a pandas DataFrame?
My concern with using this library is that it may fail with larger source data files. Is there any collective best practice or mitigation to avoid such failures? Note this concern applies to very large datasets, but also to small worker nodes (e.g. in a CI/CD stack) with small amounts of RAM (1-4 GB).