[GOBBLIN-1898] Improve performance of ORCWriter Self Tune #3762
…at a regular interval
… properties, fix some more bugs
…o be more accurate to what it's for
… children byte size
…timateMemory(), and improve state management and smoothen algorithm and additional configurations
Codecov Report
@@             Coverage Diff              @@
##             master    #3762      +/-   ##
============================================
+ Coverage     47.13%   48.72%    +1.59%
+ Complexity    10893     3552     -7341
============================================
  Files          2148      694     -1454
  Lines         85039    28229    -56810
  Branches       9438     3287     -6151
============================================
- Hits          40082    13755    -26327
+ Misses        41315    13079    -28236
+ Partials       3642     1395     -2247
see 1463 files with indirect coverage changes
gobblin-modules/gobblin-orc/src/main/java/org/apache/gobblin/writer/GobblinBaseOrcWriter.java
+1
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
Description
The ORCWriter's new self-tuning feature leads to less frequent writes when ingesting datasets with a low volume of records.
This is primarily caused by the assumption that the native ORC writer is always saturated, which puts the estimated memory footprint at STRIPE_SIZE + avgSizeOfRecord * rowsBetweenMemoryCheck.
However, this is generally not the case when a low-volume dataset has only a few records to write, so the overestimate causes slow writes. We should instead use a newer API on the ORC writer brought in by apache/orc#1057.
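To make the overestimate concrete, here is a minimal sketch of the two footprint calculations. It is illustrative only and is not the code in GobblinBaseOrcWriter: the class name, method names, and the sample sizes (64 MB stripe, 2 MB writer-reported estimate, 1 KB records, 500 rows between checks) are all assumptions chosen for the example, and the "measured" input simply stands in for whatever figure the newer ORC writer API reports.

```java
// Hypothetical sketch; names and numbers are assumptions, not Gobblin or ORC code.
public class OrcMemoryEstimateSketch {

    // Footprint under the "writer is saturated" assumption:
    // STRIPE_SIZE + avgSizeOfRecord * rowsBetweenMemoryCheck.
    static long saturatedEstimate(long stripeSizeBytes,
                                  long avgSizeOfRecordBytes,
                                  long rowsBetweenMemoryCheck) {
        return stripeSizeBytes + avgSizeOfRecordBytes * rowsBetweenMemoryCheck;
    }

    // Same formula, but the fixed stripe size is replaced by a figure
    // reported by the writer itself (assumed to be available to the caller).
    static long measuredEstimate(long writerReportedBytes,
                                 long avgSizeOfRecordBytes,
                                 long rowsBetweenMemoryCheck) {
        return writerReportedBytes + avgSizeOfRecordBytes * rowsBetweenMemoryCheck;
    }

    public static void main(String[] args) {
        long stripeSize = 64L * 1024 * 1024;    // assumed stripe size
        long writerReported = 2L * 1024 * 1024; // assumed low-volume writer usage
        long avgRecord = 1024;                  // assumed average record size
        long rows = 500;                        // assumed rows between memory checks

        System.out.println("saturated estimate: "
            + saturatedEstimate(stripeSize, avgRecord, rows));
        System.out.println("measured estimate:  "
            + measuredEstimate(writerReported, avgRecord, rows));
    }
}
```

With these assumed inputs the saturated estimate is dominated by the stripe size even though the writer holds far less data, which is the gap this PR aims to close for low-volume datasets.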
Some other improvements so that the writer's memory consumption fluctuates less:
Bumps ORC Version from 1.6.8 to 1.7.4
Tests
Unit tests
Commits