[GOBBLIN-1898] Improve performance of ORCWriter Self Tune #3762

Merged
merged 25 commits into from
Sep 12, 2023

Conversation

@Will-Lo Will-Lo commented Sep 6, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

The ORCWriter's new self-tuning feature reduces write frequency when ingesting datasets with a low volume of records.

This is primarily caused by the assumption that the native ORC writer will be saturated, which yields an estimated memory footprint of STRIPE_SIZE + avgSizeOfRecord * rowsBetweenMemoryCheck.

However, this is generally not the case when a low-volume dataset has only a few records to write, and the assumption then causes slow writes. We should utilize a newer API on ORCWriter brought in by apache/orc#1057.
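
As a rough illustration of why the saturated estimate is pessimistic for low-volume datasets, the footprint formula above can be worked through with made-up numbers. This is only a sketch: the class name, method name, and all values below are hypothetical and are not taken from the Gobblin code.

```java
// Sketch only: names and numbers are hypothetical, not Gobblin's implementation.
public final class SaturatedFootprintSketch {

    // Footprint under the "writer is saturated" assumption:
    // STRIPE_SIZE + avgSizeOfRecord * rowsBetweenMemoryCheck
    static long saturatedEstimateBytes(long stripeSizeBytes,
                                       long avgRecordSizeBytes,
                                       long rowsBetweenMemoryCheck) {
        return stripeSizeBytes + avgRecordSizeBytes * rowsBetweenMemoryCheck;
    }

    public static void main(String[] args) {
        long stripeSize = 64L * 1024 * 1024; // assumed 64 MiB stripe
        long avgRecord = 1024;               // assumed 1 KiB per record
        long rows = 500;                     // assumed rows between memory checks

        long estimate = saturatedEstimateBytes(stripeSize, avgRecord, rows);
        long bufferedOnly = avgRecord * rows;

        // The stripe term dominates even though only a few rows are buffered.
        System.out.println("saturated estimate bytes: " + estimate);
        System.out.println("buffered rows only bytes: " + bufferedOnly);
    }
}
```

With these assumed values the stripe term accounts for nearly the whole estimate, which is the situation the description says rarely matches a writer that has only received a few records.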

Some other improvements so that the writer's memory consumption fluctuates less:

  1. Persist more state between writers so that each new writer starts from a more reasonable baseline derived from past writers
  2. When increasing batchSize, raise it by a percentage of the error (proportional control) so that batchSize changes cause less overshoot in memory usage
  3. Expose more configuration knobs to manually tune the limits of the self-tuning writer
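
The proportional-control idea in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class, method, and parameter names are hypothetical, and the choice to shrink straight to the target when it is lower is an assumption made here for simplicity.

```java
// Sketch only: hypothetical names, not the actual Gobblin ORC writer code.
public final class ProportionalBatchSizeSketch {

    /**
     * Moves currentBatchSize toward targetBatchSize.
     * When growing, only a fraction (gain) of the error is applied per step.
     * When the target is not larger, this sketch jumps straight to it (an assumption).
     * The result is clamped to [minBatchSize, maxBatchSize].
     */
    static int nextBatchSize(int currentBatchSize, int targetBatchSize,
                             double gain, int minBatchSize, int maxBatchSize) {
        int next;
        if (targetBatchSize > currentBatchSize) {
            long error = (long) targetBatchSize - currentBatchSize;
            // Grow by at least one row so the loop always makes progress.
            next = currentBatchSize + (int) Math.max(1, Math.round(error * gain));
        } else {
            next = targetBatchSize;
        }
        return Math.max(minBatchSize, Math.min(maxBatchSize, next));
    }

    public static void main(String[] args) {
        int batch = 100;
        // Each step closes half of the remaining gap to the target of 1000.
        for (int i = 0; i < 5; i++) {
            batch = nextBatchSize(batch, 1000, 0.5, 1, 5000);
            System.out.println("step " + i + ": batchSize=" + batch);
        }
    }
}
```

A gain below 1 trades slower convergence for smaller jumps in buffered memory between checks; the min and max arguments stand in for the kind of manual limits mentioned in item 3.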

Bumps ORC Version from 1.6.8 to 1.7.4

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Unit tests

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

codecov-commenter commented Sep 6, 2023

Codecov Report

Merging #3762 (0576518) into master (673773a) will increase coverage by 1.59%.
Report is 6 commits behind head on master.
The diff coverage is n/a.

@@             Coverage Diff              @@
##             master    #3762      +/-   ##
============================================
+ Coverage     47.13%   48.72%   +1.59%     
+ Complexity    10893     3552    -7341     
============================================
  Files          2148      694    -1454     
  Lines         85039    28229   -56810     
  Branches       9438     3287    -6151     
============================================
- Hits          40082    13755   -26327     
+ Misses        41315    13079   -28236     
+ Partials       3642     1395    -2247     

see 1463 files with indirect coverage changes

@ZihanLi58 ZihanLi58 left a comment

+1

@Will-Lo Will-Lo merged commit 5f021ba into apache:master Sep 12, 2023
6 checks passed