read_parquet function takes up a lot of memory even before it returns the iterable object #3010

geetparekh · 2024-10-30T17:10:26Z

Describe the bug

The documentation of the read_parquet function suggests that the 'chunked' argument makes the function memory-friendly, because it returns an iterable of dataframes instead of a single dataframe. However, when tested with a 500MB parquet file and chunked = 1, the function uses more than 7GB of memory before it even returns the iterable object. This indicates that the function is doing something underneath (possibly loading the whole file into memory) before it can hand back a streamable object.

If this is expected behavior, then the function unfortunately cannot be considered memory-friendly, and the documentation should state that explicitly so that users know what to expect. If it is not expected behavior, then it is possibly a bug.

Sharing our code below:


import awswrangler as wr
import pyarrow


def get_table_chunks_from_s3_file(app_configs: dict, sqs_values_dict: dict):

    bucket = sqs_values_dict["BucketName"]
    key = sqs_values_dict["ObjectKey"]
    boto3_session = app_configs["boto3_session"]

    file_path = "s3://" + bucket + "/" + key

    # Below function call takes up high memory before being able to return the dataframes object.
    dataframes = wr.s3.read_parquet(
        path=file_path, chunked=1, boto3_session=boto3_session
    )

    for dataframe in dataframes:
        yield pyarrow.table(dataframe)

Note that in the above, I have added a comment to share which part of the code takes up a lot of memory.

How to Reproduce

Run the read_parquet function for a relatively large parquet file in S3 and check how much memory it consumes (through a memory profiler) before giving back an iterable object.
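
One way to take such a measurement without an external profiler is to compare the process's peak resident set size before and after the call. The sketch below is only an illustration: the helper name `peak_rss_mib` is hypothetical, it relies on the Unix-only `resource` module, and it assumes Linux, where `ru_maxrss` is reported in kibibytes. The `bytearray` allocation is a stand-in for the `wr.s3.read_parquet(...)` call.

```python
import resource


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (assumes Linux units: KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


before = peak_rss_mib()

# Stand-in workload; in the real test this would be the call that
# creates the iterable, e.g. wr.s3.read_parquet(path=..., chunked=1).
payload = bytearray(b"x") * (20 * 1024 * 1024)

after = peak_rss_mib()
print(f"peak RSS before: {before:.1f} MiB, after: {after:.1f} MiB")
```

Because this reports a high-water mark, it can only grow over the life of the process, so it is best read right after the call under test.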

Expected behavior

The function (as the documentation suggests) should not be taking up so much memory while trying to return an iterable of dataframes.

Your project

No response

Screenshots

No response

OS

Ubuntu 22.04

Python version

3.10.12

AWS SDK for pandas version

3.9.1

Additional context

Support Case ID: 172918156100319


kukushking · 2024-11-03T22:45:59Z

Hi @geetparekh, note the rows in the parquet datasets are organized in row groups, and rows within a row group must be read in one go. The assumption that chunked=1 will result in lower memory footprint is not correct.

import boto3
import awswrangler as wr
import pyarrow as pa

# Path to a public parquet dataset 440.6MB file
path = "s3://ursa-labs-taxi-data/2009/01/data.parquet"
session = boto3.Session()
  
def get_table_chunks_from_s3_file(path, chunked, session):
    dataframes = wr.s3.read_parquet(path=path, chunked=chunked, boto3_session=session)
    yield from dataframes

# this line runs immediately and returns a generator as expected
g = get_table_chunks_from_s3_file(path, 1000000, session)

# this line consumes the first item from the generator and loads 1000000 records
next(g)

The parquet file used for the test has 14092413 rows in 216 row groups:

> parquet-tools inspect data.parquet

############ file meta data ############
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 18
num_rows: 14092413
num_row_groups: 216
format_version: 1.0
serialized_size: 324078

...

Reading with chunked=1000000 memory never peaked above 1G... Given Parquet's efficient compression, this is expected.

I will continue doing some tests to reproduce your issue and keep you updated.

Possibly related to

geetparekh · 2024-11-07T16:44:45Z

Thank you @kukushking for explaining that. Our parquet file has only one row group, so that explains why the function loads the whole file into memory in our case.

I would definitely suggest stating that explicitly in the documentation so that users know about it.

geetparekh · 2024-11-07T16:59:25Z

On that note, you could also consider including an example in the documentation that explains what would be loaded into memory for a given number of row groups in the parquet file and a given requested chunk size.

For example, if a parquet file has 100K records in 100 row groups (for simplicity), each row group has roughly 1000 rows. If the requested chunk size is 5000, would the function end up loading 5 row groups into memory before returning the iterable object?
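
The arithmetic in that example can be sketched in plain Python. This only works out which row groups the rows of each chunk come from; it makes no claim about what read_parquet actually loads or when. The function name `row_groups_per_chunk` is hypothetical.

```python
from itertools import accumulate


def row_groups_per_chunk(row_group_sizes, chunk_size):
    """For each chunk of chunk_size rows, list the row-group indices it draws rows from."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ends = list(accumulate(row_group_sizes))  # exclusive end row of each row group
    total_rows = ends[-1] if ends else 0
    chunks = []
    first_group = 0
    for start in range(0, total_rows, chunk_size):
        stop = min(start + chunk_size, total_rows)
        # Skip row groups that end at or before the first row of this chunk.
        while ends[first_group] <= start:
            first_group += 1
        touched = []
        for group in range(first_group, len(ends)):
            touched.append(group)
            if ends[group] >= stop:
                break
        chunks.append(touched)
    return chunks


# 100 row groups of 1,000 rows each, chunk size 5,000.
chunks = row_groups_per_chunk([1_000] * 100, 5_000)
print(len(chunks), chunks[0])

# A single row group of 100,000 rows, chunk size 1.
single = row_groups_per_chunk([100_000], 1)
print(len(single), single[0])
```

In the first case each chunk draws on five row groups. In the second case every one of the chunks draws on the same single row group, which matches the situation described earlier in this thread.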

FredericKayser · 2024-11-11T18:41:25Z

Hey @geetparekh, @kukushking

I also ran into this issue, with a chunk size of 100000. I figured out that the parquet file is loaded completely before the first chunk is yielded, so I've raised PR #3016. With these changes the memory usage is reduced dramatically. Without the changes, I constantly ran out of memory.

It would be nice if you could review it soon and release a new patch version!

Thanks in advance,
Frederic

geetparekh added the bug label Oct 30, 2024

geetparekh changed the title ~~read_parquet function takes up a lot of memory even before it returns the first dataframe~~ read_parquet function takes up a lot of memory even before it returns the iterable object Oct 30, 2024

FredericKayser mentioned this issue Nov 11, 2024

fix: read parquet file in chunked mode per row group #3016

Merged

jaidisido linked a pull request Nov 15, 2024 that will close this issue

fix: read parquet file in chunked mode per row group #3016

Merged

jaidisido closed this as completed in #3016 Nov 21, 2024
