
fix: read parquet file in chunked mode per row group #3016

Merged
merged 5 commits into from
Nov 21, 2024

Conversation

FredericKayser
Contributor

Feature or Bugfix

  • Bugfix

Detail

Fixes the read_parquet function in chunked mode. Currently the whole file is loaded into memory instead of individual row groups, which can lead to memory issues.

Relates

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: cc48750
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: cc48750
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 3bad1cd
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 3bad1cd
  • Result: FAILED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 5d51c45
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 5d51c45
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: fecdbd4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: fecdbd4
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-4rfo0GHQ0u9a
  • Commit ID: 5a3289d
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@malachi-constant
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubDistributedCodeBuild6-jWcl5DLmvupS
  • Commit ID: 5a3289d
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido jaidisido merged commit d485112 into aws:main Nov 21, 2024
16 of 17 checks passed
@vhnguyenae

Hi @FredericKayser, I'm not sure whether this is a bug, but this change broke one of our unit tests for an empty data frame.
With awswrangler 3.10.0, wr.s3.read_parquet(..., chunked=10) returns an iterator of length 0.
With awswrangler 3.10.1, the same call returns an iterator of length 1.
It's just a simple debug: print("Hello len iterator_df", len(list(iterator_df)))
Is this considered a regression?
If so, I can provide more detail if you need more information to investigate.
Thank you

@FredericKayser
Contributor Author

Hey @vhnguyenae, in my opinion the new behavior is intended. It enables the consumer to receive the columns even though the file has no content rows. Could you share why the old behavior is relevant for you? In my view, a solid implementation should be able to handle both variants, since you are just receiving an empty data frame.
@kukushking @jaidisido what do you think?
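To illustrate the point with plain pandas (not awswrangler itself): an empty chunk can still carry the file's schema, which an empty iterator cannot.

```python
import pandas as pd

# An empty DataFrame can still expose the file's columns and dtypes.
chunk = pd.DataFrame({"id": pd.Series(dtype="int64"),
                      "name": pd.Series(dtype="object")})

assert chunk.empty                            # no content rows
assert list(chunk.columns) == ["id", "name"]  # schema is preserved
```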

@vhnguyenae

Hi @FredericKayser,
Thank you for the quick feedback. This is the sample code I had before to process the data.
To keep the old behaviour, I need to add an extra check that the data_frame is not empty before processing it; with the previous version, the for loop was never entered because the iterator size was 0.

    iterator_df = wr.s3.read_parquet(path=s3_path, chunked=10)

    for data_frame in iterator_df:
        # Extra check added to fix our test so that we don't
        # process the data of an empty dataframe
        if not data_frame.empty:
            data_frame = process_data(data_frame, ....)

btw Happy new year 🥳 🎉

@FredericKayser
Contributor Author

Hi @vhnguyenae, first of all, happy new year :D Is this change only causing trouble with the test, while the production code still works?

@vhnguyenae

Hi @FredericKayser, we haven't encountered any issue in the production environment yet, but I think the behavior has also changed when we have to process an empty data frame, since the test case covers that scenario 😁

@FredericKayser
Contributor Author

@vhnguyenae Can you explain what exactly your test does and which asserts you use?

@vhnguyenae

vhnguyenae commented Dec 31, 2024

Our process_data does many things, but let's just say we add an extra column with a hardcoded value:
df['new_column'] = 'random'

Previous code without the empty-dataframe check:

    result_df = pd.DataFrame()

    iterator_df = wr.s3.read_parquet(path=s3_path, chunked=10)

    for data_frame in iterator_df:
        data_frame = process_data(data_frame, ....)
        result_df = pd.concat([result_df, data_frame], ignore_index=True)
    return result_df

If we don't enter the for loop, result_df stays empty.
If we enter the for loop without the empty-dataframe check, process_data returns a dataframe with new_column and its new value, and we concat it to result_df.

So the behavior of our function changed: instead of returning an empty df, we now return a non-empty df with a hardcoded column.

@FredericKayser
Contributor Author

@vhnguyenae Okay, thank you :) So when you add a hardcoded column to an empty data frame, the data frame stays empty but gains a new column. My suggestion is to pre-check the results received from the library and then decide whether to run process_data on the data frame. The library only guarantees that you receive a DataFrame (or an iterator of DataFrames), not its exact shape, so in my opinion ensuring the data frame contains data is your responsibility, as the specification makes no such guarantee. This would also make your code more stable and robust.

I hope this helps solve your problem :)
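A defensive version of that loop might look like this (a sketch; collect is an illustrative name, and process_data stands in for the caller's own function):

```python
import pandas as pd

def collect(iterator_df, process_data):
    # Skip schema-only (empty) chunks so process_data only sees real rows;
    # this handles both an empty iterator and a single empty chunk.
    parts = [process_data(df) for df in iterator_df if not df.empty]
    return pd.concat(parts, ignore_index=True) if parts else pd.DataFrame()
```

With this shape, an empty input yields an empty result under both the old and the new chunking behavior.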

@vhnguyenae

Thanks @FredericKayser for the clarification, that's what I already did in my code 😁

Development

Successfully merging this pull request may close these issues.

read_parquet function takes up a lot of memory even before it returns the iterable object
5 participants