Write without loading to RAM (skip pandas?) #476

Open
aaronsteers opened this issue Jan 13, 2020 · 5 comments

Comments

@aaronsteers

My understanding of the pandas library is that it requires loading the entire dataset into memory. Is there any way to avoid this requirement and write data from a stream or a stored file, without first loading the entire table into RAM as a pandas DataFrame?

My concern with using this library is that it may fail with larger source data files. Is there any collective best practice or mitigation to avoid such failures? Note that this concern applies to very large datasets, but also to small worker nodes (e.g. in a CI/CD stack) with limited RAM (1-4 GB).

@martindurant
Member

In short: yes, it is often possible to load and process pandas datasets by chunk, and some of the loaders (CSV...) have methods for doing that. For this library, you can use fastparquet.ParquetFile.iter_row_groups. A "row group" is the logical unit within parquet, and you cannot iterate in smaller pieces.

However, you might find that dask is your best bet for processing bigger-than-memory datasets in a more general sense than iterating over row groups yourself.
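
A minimal reading sketch of that approach, assuming a file named data.parquet and a placeholder process() function standing in for whatever per-chunk work is needed:

import fastparquet

pf = fastparquet.ParquetFile("data.parquet")

for df in pf.iter_row_groups():
    # df is a pandas DataFrame holding one row group at a time
    process(df)  # process() is a hypothetical placeholder, not part of fastparquet

Each iteration only materialises a single row group, so peak RAM usage is set by the row-group size chosen when the file was written.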

@aaronsteers
Author

aaronsteers commented Jan 13, 2020

Awesome - thank you. I think using the chunksize argument for read_csv() should mitigate this issue. I should be able to define a configurable variable like max_partition_rows or chunksize and then pass one "chunk" at a time to fastparquet's write() function (with iter_row_groups() as the reading counterpart). (Also, I should have clarified in my original post that I'm specifically looking to write parquet files with this library.)

Pseudocode for anyone else interested:

import pandas as pd
import fastparquet

# Each chunk is a pandas DataFrame of at most 100000 rows
data_iterator = pd.read_csv("large_data.csv", chunksize=100000)

for i, data_chunk in enumerate(data_iterator):
    # Write the first chunk, then append each later chunk as a new row group
    fastparquet.write("large_data.parquet", data_chunk,
                      file_scheme="hive", append=(i > 0))

Feel free to close this issue as needed.

@martindurant
Member

For writing, I would not issue repeated append calls, but instead write separate files and load them together later. Again, dask can help with this sort of thing.
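
A sketch of that pattern, reusing the chunked CSV read from above and a hypothetical output/ directory:

import os

import pandas as pd
import fastparquet

os.makedirs("output", exist_ok=True)

for i, chunk in enumerate(pd.read_csv("large_data.csv", chunksize=100000)):
    # One self-contained parquet file per chunk; no metadata is rewritten on later passes
    fastparquet.write(f"output/part.{i}.parquet", chunk)

The pieces can later be treated as one dataset, for example with dask.dataframe.read_parquet pointed at output/part.*.parquet, which builds a single lazy dataframe over all the files.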

@yohplala

yohplala commented Dec 25, 2020

For writing, I would not issue repeated append calls, but instead write separate files and load them together later. Again, dask can help with this sort of thing.

Hi Martin,
Please, could you explain why you advise against using append in such a case?
I intend to do the same, so if it is not recommended, I would prefer to know why.

Basically, my understanding is that when you append to an existing parquet dataset, the metadata gets updated.
Later on, I can use that metadata to select the data I want to load (for a time series, I can tell which parquet file contains the timestamp range of interest thanks to the min/max timestamp per file recorded in the metadata).

If I write one file at a time, the metadata does not get consolidated, and selective loading / loading by chunk becomes more difficult, does it not?
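
For reference, a minimal sketch of reading those per-row-group statistics back, assuming a hive-style dataset under a hypothetical output/ directory and a placeholder column named timestamp:

import fastparquet

pf = fastparquet.ParquetFile("output/")

# Min/max values recorded per row group, usable to decide which pieces to load
print(pf.statistics["min"]["timestamp"])
print(pf.statistics["max"]["timestamp"])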

Thanks for your advice on this, best regards

@martindurant
Member

Please, could you explain why you advise against using append in such a case?

Each append requires reading the whole metadata, altering it in memory, and then writing it all back to the file again. With detailed delving into the thrift code, it *would* be possible to read up to a certain row group in the metadata and start writing the new metadata there; but this code doesn't exist, and I think it would be hard to write.
The usefulness of a _metadata file for a dataset that is evolving is questionable.
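
If a consolidated _metadata file is still wanted, one option is to write the part files independently and consolidate once at the end rather than on every append; a sketch using fastparquet's writer.merge, assuming the hypothetical output/part.*.parquet layout from above and a shared schema across parts:

import glob

import fastparquet

# Collect the independently written part files (naive lexicographic order)
parts = sorted(glob.glob("output/part.*.parquet"))

# Build a consolidated _metadata file once, after all parts exist
fastparquet.writer.merge(parts)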
