Write without loading to RAM (skip pandas?) #476

Open
aaronsteers opened this issue Jan 13, 2020 · 5 comments

Comments

@aaronsteers

My understanding of the pandas library is that it requires loading the entire dataset into memory. Is there any way to avoid this requirement and write data from a stream or a stored file, without first loading the entire table into RAM as a pandas DataFrame?

My concern with using this library is that it may fail with larger source data files. Is there any collective best practice or mitigation to avoid such failures? Note that this concern applies to very large datasets, but also to small worker nodes (e.g. in a CI/CD stack) with limited RAM (1-4 GB).

@martindurant
Member

In short: yes, it is often possible to load and process pandas datasets by chunk, and some of the loaders (CSV...) have methods for doing that. For this library, you can use fastparquet.ParquetFile.iter_row_groups. A "row group" is the logical unit within parquet, and you cannot iterate in smaller pieces.

However, you might find that dask is your best bet for processing bigger-than-memory datasets in a more general sense than iterating over row groups yourself.
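
A minimal reading sketch of that approach, assuming a file named data.parquet and a placeholder process() function standing in for whatever per-chunk work is needed:

import fastparquet

pf = fastparquet.ParquetFile("data.parquet")

for df in pf.iter_row_groups():
    # df is a pandas DataFrame holding one row group at a time
    process(df)  # process() is a hypothetical placeholder, not part of fastparquet

Each iteration only materialises a single row group, so peak RAM usage is set by the row-group size chosen when the file was written.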

@aaronsteers
Author

aaronsteers commented Jan 13, 2020

Awesome - thank you. I think using the chunksize argument for read_csv() should mitigate this issue. I should be able to define a configurable variable like max_partition_rows or chunksize and then pass one "chunk" at a time to fastparquet's write() function (with iter_row_groups() as the reading counterpart). (Also, I should have clarified in my original post that I'm specifically looking to write parquet files with this library.)

Pseudocode for anyone else interested:

import pandas as pd
import fastparquet

# Each chunk is a pandas DataFrame of at most 100000 rows
data_iterator = pd.read_csv("large_data.csv", chunksize=100000)

for i, data_chunk in enumerate(data_iterator):
    # Write the first chunk, then append each later chunk as a new row group
    fastparquet.write("large_data.parquet", data_chunk,
                      file_scheme="hive", append=(i > 0))

Feel free to close this issue as needed.

@martindurant
Member

For writing, I would not issue repeated append calls, but instead write separate files and load them together later. Again, dask can help with this sort of thing.
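
A sketch of that pattern, reusing the chunked CSV read from above and a hypothetical output/ directory:

import os

import pandas as pd
import fastparquet

os.makedirs("output", exist_ok=True)

for i, chunk in enumerate(pd.read_csv("large_data.csv", chunksize=100000)):
    # One self-contained parquet file per chunk; no metadata is rewritten on later passes
    fastparquet.write(f"output/part.{i}.parquet", chunk)

The pieces can later be treated as one dataset, for example with dask.dataframe.read_parquet pointed at output/part.*.parquet, which builds a single lazy dataframe over all the files.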

@yohplala

yohplala commented Dec 25, 2020

For writing, I would not issue repeated append calls, but instead write separate files and load them together later. Again, dask can help with this sort of thing.

Hi Martin,
Please, could you explain why you advise against using append in such a case?
I intend to do the same, so if it is not recommended, I would prefer to know why.

Basically, my understanding is that when you append to an existing parquet dataset, the metadata gets updated.
Later on, I can use that metadata to select the data I want to load (for a time series, I can tell which parquet file contains the timestamp range of interest thanks to the min/max timestamp per file recorded in the metadata).

If I write one file at a time, the metadata does not get consolidated, and selective loading / loading by chunk becomes more difficult, does it not?
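
For reference, a minimal sketch of reading those per-row-group statistics back, assuming a hive-style dataset under a hypothetical output/ directory and a placeholder column named timestamp:

import fastparquet

pf = fastparquet.ParquetFile("output/")

# Min/max values recorded per row group, usable to decide which pieces to load
print(pf.statistics["min"]["timestamp"])
print(pf.statistics["max"]["timestamp"])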

Thanks for your advice on this, best regards

@martindurant
Member

Please, could you explain why you advise against using append in such a case?

Each append requires reading the whole metadata, altering it in memory, and then writing it all back to the file again. With detailed delving into the thrift code, it *would* be possible to read up to a certain row group in the metadata and start writing the new metadata there; but this code doesn't exist, and I think it would be hard to write.
The usefulness of a _metadata file for a dataset that is evolving is questionable.
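
If a consolidated _metadata file is still wanted, one option is to write the part files independently and consolidate once at the end rather than on every append; a sketch using fastparquet's writer.merge, assuming the hypothetical output/part.*.parquet layout from above and a shared schema across parts:

import glob

import fastparquet

# Collect the independently written part files (naive lexicographic order)
parts = sorted(glob.glob("output/part.*.parquet"))

# Build a consolidated _metadata file once, after all parts exist
fastparquet.writer.merge(parts)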
