fix: read parquet file in chunked mode per row group #3016
Conversation
Hi @FredericKayser, I'm not sure whether this is a bug or not, but this change broke one of our unit tests for empty data frames.
Hey @vhnguyenae, in my opinion the new behavior is intended. It enables the consumer to receive the columns even though the file has no content rows. Could you maybe share your thoughts on why this specific behavior is relevant for you? I think a solid implementation should be able to deal with both variants, as you are just receiving an empty data frame.
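(A small pandas illustration of the distinction discussed here; the column names are made up:)

```python
import pandas as pd

# A frame with the file's columns but no rows: df.empty is True,
# yet the schema is still available to the consumer.
df = pd.DataFrame(columns=["id", "value"])
print(df.empty)          # True
print(list(df.columns))  # ['id', 'value']
```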
Hi @FredericKayser,

```python
iterator_df = wr.s3.read_parquet(path=s3_path, chunked=10)
for data_frame in iterator_df:
    # I had to add this extra check in order to fix our test so that
    # we don't process the data of an empty dataframe
    if not data_frame.empty:
        data_frame = process_data(data_frame, ....)
```

Btw, happy new year 🥳 🎉
Hi @vhnguyenae, first of all, happy new year :D Is this change just causing trouble in the test, while the production code still works?
Hi @FredericKayser, we haven't encountered any issues in the production environment yet, but I think the behavior has also changed when we have to process an empty data frame, since the test case is used to cover that scenario 😁
@vhnguyenae Can you maybe explain what exactly your test does and which asserts you use?
Our process_data does many things, but let's just say we add an extra column with a hardcoded value. Previous code, without the empty-dataframe check:

```python
result_df = DataFrame()
iterator_df = wr.s3.read_parquet(path=s3_path, chunked=10)
for data_frame in iterator_df:
    data_frame = process_data(data_frame, ....)
    result_df = pd.concat([result_df, data_frame], ignore_index=True)
return result_df
```

If we don't enter the for loop, then result_df stays an empty DataFrame with no columns. So the behavior of our function has changed: instead of returning an empty df, we now return a df with a hardcoded column.
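(A hypothetical reproduction of the behavior change described above; this process_data is a stand-in for the user's real function:)

```python
import pandas as pd

def process_data(df):
    # Stand-in for the real processing step: it only adds a
    # column with a hardcoded value.
    df["source"] = "hardcoded"
    return df

# Old behavior: an empty file yields no chunks, the loop never runs,
# and the function returns a DataFrame with no columns at all.
# New behavior: one empty-but-schema'd chunk is yielded, process_data
# runs on it, and the result carries the hardcoded column (still 0 rows).
chunk = pd.DataFrame(columns=["id"])  # what the iterator now yields
result = pd.concat([pd.DataFrame(), process_data(chunk)], ignore_index=True)
print(list(result.columns))  # ['id', 'source']
print(len(result))           # 0
```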
@vhnguyenae Okay, thank you :) So when you just add a hardcoded column to an empty data frame, the data frame is still empty, but has a new column. My suggestion is that you pre-check the received results from the library and then decide whether to run process_data. I hope this helps solve your problem :)
Thanks @FredericKayser for the clarification, that's what I already did in my code 😁
Feature or Bugfix
Bugfix
Detail
Fixes the read_parquet function in chunked mode. Currently, the whole file gets loaded into memory instead of individual row groups, which can lead to memory issues.
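(A minimal sketch of the per-row-group reading pattern, using pyarrow directly; this illustrates the technique under a made-up file path and is not the PR's actual implementation:)

```python
import pyarrow.parquet as pq

def iter_row_groups(path):
    """Yield one pandas DataFrame per parquet row group."""
    parquet_file = pq.ParquetFile(path)
    for i in range(parquet_file.num_row_groups):
        # Only the current row group is materialized in memory.
        yield parquet_file.read_row_group(i).to_pandas()

for chunk in iter_row_groups("example.parquet"):
    print(len(chunk))
```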
Relates
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.