
fix: read parquet file in chunked mode per row group #3016

Merged
merged 5 commits into from
Nov 21, 2024

Conversation

FredericKayser
Contributor

Feature or Bugfix

  • Bugfix

Detail

Fixes the read_parquet function in chunked mode. Currently the whole file is loaded into memory instead of individual row groups, which can lead to memory issues.

Relates

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: cc48750
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: cc48750
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 3bad1cd
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 3bad1cd
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 5d51c45
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 5d51c45
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: fecdbd4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: fecdbd4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 5a3289d
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 5a3289d
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido jaidisido merged commit d485112 into aws:main Nov 21, 2024
16 of 17 checks passed
@vhnguyenae

Hi @FredericKayser, I'm not sure whether this is a bug, but this change broke one of our unit tests for an empty data frame.
With awswrangler 3.10.0, wr.s3.read_parquet(..., chunked=10) returns an iterator of length 0.
With awswrangler 3.10.1, the same call returns an iterator of length 1.
It's just a simple debug: print("Hello len iterator_df", len(list(iterator_df)))
Is this considered a regression?
If so, I can provide more detail if you need more information to investigate.
Thank you

@FredericKayser
Contributor Author

Hey @vhnguyenae, in my opinion the new behavior is intended. It enables the consumer to receive the columns even though the file has no content rows. Could you share why the old behavior is relevant for you? In my view, a solid implementation should be able to handle both variants, since you are just receiving an empty data frame.
@kukushking @jaidisido what do you think?
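To illustrate the point with plain pandas (not awswrangler itself): an empty chunk can still carry the file's schema, which an empty iterator cannot.

```python
import pandas as pd

# An empty DataFrame can still expose the file's columns and dtypes.
chunk = pd.DataFrame({"id": pd.Series(dtype="int64"),
                      "name": pd.Series(dtype="object")})

assert chunk.empty                            # no content rows
assert list(chunk.columns) == ["id", "name"]  # schema is preserved
```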

@vhnguyenae

Hi @FredericKayser,
Thank you for the quick feedback. This is the sample code I had before to process the data.
To keep the old behaviour, I need to add an extra check that the data_frame is not empty before processing it; with the previous version, the for loop was never entered because the iterator size was 0.

    iterator_df = wr.s3.read_parquet(path=s3_path, chunked=10)

    for data_frame in iterator_df:
        # Extra check added to fix our test so that we don't
        # process the data of an empty dataframe
        if not data_frame.empty:
            data_frame = process_data(data_frame, ....)

btw Happy new year 🥳 🎉

@FredericKayser
Contributor Author

Hi @vhnguyenae, first of all, happy new year :D Is this change only causing trouble with the test, while the production code still works?

@vhnguyenae

Hi @FredericKayser, we haven't encountered any issue in the production environment yet, but I think the behavior has also changed when we have to process an empty data frame, since the test case covers that scenario 😁

@FredericKayser
Contributor Author

@vhnguyenae Can you explain what exactly your test does and which asserts you use?

@vhnguyenae

vhnguyenae commented Dec 31, 2024

Our process_data does many things, but let's just say we add an extra column with a hardcoded value:
df['new_column'] = 'random'

Previous code without the empty-dataframe check:

    result_df = pd.DataFrame()

    iterator_df = wr.s3.read_parquet(path=s3_path, chunked=10)

    for data_frame in iterator_df:
        data_frame = process_data(data_frame, ....)
        result_df = pd.concat([result_df, data_frame], ignore_index=True)
    return result_df

If we don't enter the for loop, result_df stays empty.
If we enter the for loop without the empty-dataframe check, process_data returns a dataframe with new_column and its new value, and we concat it to result_df.

So the behavior of our function changed: instead of returning an empty df, we now return a non-empty df with a hardcoded column.

@FredericKayser
Contributor Author

@vhnguyenae Okay, thank you :) So when you add a hardcoded column to an empty data frame, the data frame stays empty but gains a new column. My suggestion is to pre-check the results received from the library and then decide whether to run process_data on the data frame. The library only guarantees that you receive a DataFrame (or an iterator of DataFrames), not its exact shape, so in my opinion ensuring the data frame contains data is your responsibility, as the specification makes no such guarantee. This would also make your code more stable and robust.

I hope this helps solve your problem :)
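A defensive version of that loop might look like this (a sketch; collect is an illustrative name, and process_data stands in for the caller's own function):

```python
import pandas as pd

def collect(iterator_df, process_data):
    # Skip schema-only (empty) chunks so process_data only sees real rows;
    # this handles both an empty iterator and a single empty chunk.
    parts = [process_data(df) for df in iterator_df if not df.empty]
    return pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()
```

With this shape, an empty input yields an empty result under both the old and the new chunking behavior.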

@vhnguyenae

Thanks @FredericKayser for the clarification, that's what I already did in my code 😁

Development

Successfully merging this pull request may close these issues.

read_parquet function takes up a lot of memory even before it returns the iterable object
5 participants