Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to load config_name_mapping from disk just once #1013

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

nemacysts
Copy link
Member

There's enough files (and enough YAML in these files) that the IO/YAML parsing takes a significant amount of time.

While it's nice to always load from the source-of-truth (i.e., the files on-disk) - it's not worth paying the performance penalty (especially at the scale we're seeing internally) for a negligible benefit.

Locally (with a large test config), this results in a ~5x improvement in timings. More concretely, my test configs were taking ~30ish seconds without this diff and ~6ish seconds with this diff.

There's enough files (and enough YAML in these files) that the IO/YAML
parsing takes a significant amount of time.

While it's nice to always load from the source-of-truth (i.e., the files
on-disk) - it's not worth paying the performance penalty (especially at
the scale we're seeing internally) for a negligible benefit.

Locally (with a large test config), this results in a ~5x improvement in
timings. More concretely, my test configs were taking ~30ish seconds
*without* this diff and ~6ish seconds *with* this diff.
jfongatyelp
jfongatyelp previously approved these changes Dec 23, 2024
Copy link
Contributor

@jfongatyelp jfongatyelp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there were some things calculating a hash for these configs (reading the files listed from the manifestfile), and those seem to still be reading directly from disk and not the cache -- since we're writing back to the file on every update, that should all be fine but do we need to worry about getting out of sync there?

@nemacysts
Copy link
Member Author

nemacysts commented Jan 9, 2025

I see there were some things calculating a hash for these configs (reading the files listed from the manifestfile), and those seem to still be reading directly from disk and not the cache -- since we're writing back to the file on every update, that should all be fine but do we need to worry about getting out of sync there?

@jfongatyelp good question - i think i looked at this when making these changes, but PTO has wiped all memory of what i thought. let me dig into this again!

EDIT: i think it's fine since the cache is really more for skipping the expensive repeated yaml parsing, but i think it's probably worth making get_hash use the cache for consistency (and to remove that IO :p)

jfongatyelp
jfongatyelp previously approved these changes Jan 9, 2025
# rather than hit the disk to get the raw string - let's convert
# the in-memory dict to a yaml string and hash that to save a couple
# ms (in testing, ~3ms over loading from disk and ~1ms over dumping to json :p)
# TODO: consider storing the hash alongside the config so that we only calculate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea! ++

@nemacysts
Copy link
Member Author

@jfongatyelp sorry - i didn't realize i had a typo leftover after one last attempt at timing json vs yaml vs disk hashing!

@nemacysts nemacysts requested a review from jfongatyelp January 9, 2025 22:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants