Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create and traverse avro schemas once per task (#296) #834

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

smcnamara2-stripe
Copy link

Summary

This adds a SchemaTraverser to the chronon row creation logic, which allows callers of Row.to to specify how the output schema should be constructed.

At the core, this code is centered around removing the AvroConversions.fromChrononSchema call in the lambda passed from fromChrononRow.

Why / Goal

The proximate goal is to allow faster Avro Schema construction, avoiding repeated calls to fromChrononSchema in a tight loop. This has resulted in a 90+% reduction in CPU usage during the reduce phase of the GroupByUpload aggregation.

At stripe, several of our largest GroupByUpload apps are now using 50-80% fewer vcore-seconds per run.

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • Live on Stripe's internal fork.

Checklist

  • Documentation update

Reviewers

@smcnamara2-stripe smcnamara2-stripe changed the title [master] Create and traverse avro schemas once per task (#296) Create and traverse avro schemas once per task (#296) Sep 20, 2024
Copy link
Collaborator

@pengyu-hou pengyu-hou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Solid change!

@pengyu-hou pengyu-hou mentioned this pull request Oct 11, 2024
4 tasks
Copy link
Collaborator

@hzding621 hzding621 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. This leads to huge performance improvements in GB upload tasks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants