-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to load config_name_mapping from disk just once #1013
base: master
Are you sure you want to change the base?
Conversation
There's enough files (and enough YAML in these files) that the IO/YAML parsing takes a significant amount of time. While it's nice to always load from the source-of-truth (i.e., the files on-disk) - it's not worth paying the performance penalty (especially at the scale we're seeing internally) for a negligible benefit. Locally (with a large test config), this results in a ~5x improvement in timings. More concretely, my test configs were taking ~30ish seconds *without* this diff and ~6ish seconds *with* this diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see there were some things calculating a hash for these configs (reading the files listed from the manifestfile), and those seem to still be reading directly from disk and not the cache -- since we're writing back to the file on every update, that should all be fine but do we need to worry about getting out of sync there?
@jfongatyelp good question - i think i looked at this when making these changes, but PTO has wiped all memory of what i thought. let me dig into this again! EDIT: i think it's fine since the cache is really more for skipping the expensive repeated yaml parsing, but i think it's probably worth making get_hash use the cache for consistency (and to remove that IO :p) |
# rather than hit the disk to get the raw string - let's convert | ||
# the in-memory dict to a yaml string and hash that to save a couple | ||
# ms (in testing, ~3ms over loading from disk and ~1ms over dumping to json :p) | ||
# TODO: consider storing the hash alongside the config so that we only calculate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea! ++
@jfongatyelp sorry - i didn't realize i had a typo leftover after one last attempt at timing json vs yaml vs disk hashing! |
There's enough files (and enough YAML in these files) that the IO/YAML parsing takes a significant amount of time.
While it's nice to always load from the source-of-truth (i.e., the files on-disk) - it's not worth paying the performance penalty (especially at the scale we're seeing internally) for a negligible benefit.
Locally (with a large test config), this results in a ~5x improvement in timings. More concretely, my test configs were taking ~30ish seconds without this diff and ~6ish seconds with this diff.