Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GOBBLIN-1869] Create instrumented orc writer #3732

Merged
merged 4 commits into from
Aug 11, 2023

Conversation

Will-Lo
Copy link
Contributor

@Will-Lo Will-Lo commented Aug 4, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

Currently the metrics for streaming mainly relate to the Kafka consumption side. We want to also capture metrics through events from the ORCWriter to capture the memory usage and internal performance of the writer.

This PR captures the following information:

  1. Schema name
  2. Bytes written
  3. Records written
  4. Rowbatch size (buffer size)
  5. Resize count

Logs of the fields looks like the below:

2023-08-03 18:23:46 PDT INFO  [StreamModelTaskRunner] org.apache.gobblin.writer.InstrumentedGobblinOrcWriter  - Emitting ORC event metrics
2023-08-03 18:23:46 PDT INFO  [StreamModelTaskRunner] org.apache.gobblin.writer.InstrumentedGobblinOrcWriter  - Metrics schema name: FeatureTrackingEvent
2023-08-03 18:23:46 PDT INFO  [StreamModelTaskRunner] org.apache.gobblin.writer.InstrumentedGobblinOrcWriter  - Rowbatch size: 1
2023-08-03 18:23:46 PDT INFO  [StreamModelTaskRunner] org.apache.gobblin.writer.InstrumentedGobblinOrcWriter  - Resize count: 54
2023-08-03 18:23:46 PDT INFO  [StreamModelTaskRunner] org.apache.gobblin.writer.InstrumentedGobblinOrcWriter  - Bytes written: 299372195
2023-08-03 18:23:46 PDT INFO  [StreamModelTaskRunner] org.apache.gobblin.writer.InstrumentedGobblinOrcWriter  - Records written: 934

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
Copy link

codecov-commenter commented Aug 4, 2023

Codecov Report

Merging #3732 (0f7ad19) into master (0151890) will increase coverage by 0.11%.
Report is 13 commits behind head on master.
The diff coverage is 9.09%.

@@             Coverage Diff              @@
##             master    #3732      +/-   ##
============================================
+ Coverage     46.95%   47.07%   +0.11%     
- Complexity    10811    10865      +54     
============================================
  Files          2143     2147       +4     
  Lines         84557    84819     +262     
  Branches       9390     9410      +20     
============================================
+ Hits          39704    39928     +224     
- Misses        41235    41269      +34     
- Partials       3618     3622       +4     
Files Changed Coverage Δ
.../gobblin/writer/GenericRecordToOrcValueWriter.java 60.21% <0.00%> (-0.33%) ⬇️
...apache/gobblin/writer/GobblinOrcWriterBuilder.java 0.00% <0.00%> (ø)
...e/gobblin/writer/InstrumentedGobblinOrcWriter.java 0.00% <0.00%> (ø)
...rg/apache/gobblin/writer/GobblinBaseOrcWriter.java 70.68% <50.00%> (-0.22%) ⬇️

... and 38 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

}

GobblinTrackingEvent createOrcWriterMetadataEvent() throws IOException {
GobblinEventBuilder builder = new GobblinEventBuilder(ORC_WRITER_METRICS_NAME);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to include the name space in this event? Also any specific reason that we don't use eventSubmitter but directly call metricContext to submit event?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Namespace helps with filtering the events since we have so many GobblinTrackingEvent types now.
Uh MetricContext not particularly a strong reason but I used it for uniformity, the InstrumentedDataWriter classes use the metric Context (albeit it is recording metrics not submitting events). But it also helps that it adds metadata tags from the metric context at times compared to the event submitter, not that I think they're being used meaningfully right now. It might also come in handy if we want to record metrics/timers using this too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm referring to this line and think that you should set namespace. https://github.com/apache/gobblin/blob/master/gobblin-metrics-libs/gobblin-metrics-base/src/main/java/org/apache/gobblin/metrics/event/EventSubmitter.java#L97
And evenSubmitter should be even easier for you to set metadata for only the event (not other metrics). But I'm fine if you want to use metric context for consistency, just make sure we have namespace set there

Copy link
Contributor

@ZihanLi58 ZihanLi58 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@ZihanLi58 ZihanLi58 merged commit 666ae2e into apache:master Aug 11, 2023
6 checks passed
phet pushed a commit to phet/gobblin that referenced this pull request Aug 15, 2023
* Creates an instrumented ORC Writer that emits events on commit and close

* Adds ORC writer metrics to track high record conversion metrics

* Fix add final to config strings

* Add namespace and use event submitter to send events
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants