Updates postprocessing for new sampling / increased geographic resolution #275
Merged
+220,685
−537,058
Disables reporting of HVAC system counts which were incorrect because of the introduction of zone multipliers. These may be re-enabled after correct handling of zone multipliers in the reporting system.
Enable writing of metadata to S3 or local filesystem depending on the input. Requires configuration of fsspec to account for possible Pyathena import.
With many upgrades the fkt was becoming too large to collect without memory issues. Transition to a single fkt and add a method to get the fkt with a specified upgrade_id added.
Climate zones in fkt are actually a mix of CEC and ASHRAE. Add mapping from CEC to ASHRAE into fkt creation.
Not all columns to be plotted will be exported normally.
Allow the desired geographies, aggregation levels, partitioning, and file types to be passed into export_metadata_and_annual_results for reusability.
Continue writing uncompressed CSVs if writing locally.
fkt creation is non-deterministic. Cache fkt so it can be reused when a data export is interrupted and needs to be restarted. Hive partition the ComStock wide data to enable much faster filtering down to the upgrade, which was a major bottleneck in exporting data for each geography.
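A minimal sketch of the Hive-style layout described above, using only the Python standard library rather than whatever dataframe library the PR itself uses. The column names (`bldg_id`, `upgrade`, `kwh`) and the helper names are hypothetical. Each upgrade gets its own `upgrade=<value>` directory, so a reader that wants one upgrade opens only that directory instead of scanning every row.

```python
import csv
import tempfile
from pathlib import Path

def write_hive_partitioned(rows, root, key="upgrade"):
    """Write rows into root/<key>=<value>/part-0.csv, one directory per key value."""
    by_value = {}
    for row in rows:
        by_value.setdefault(row[key], []).append(row)
    for value, group in by_value.items():
        part_dir = Path(root) / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # The partition value lives in the directory name, so drop it from the file.
        fields = [f for f in group[0] if f != key]
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)

def read_partition(root, value, key="upgrade"):
    """Read a single partition without touching the others."""
    part_file = Path(root) / f"{key}={value}" / "part-0.csv"
    with open(part_file, newline="") as f:
        return list(csv.DictReader(f))

# Illustrative data only.
rows = [
    {"bldg_id": "1", "upgrade": "0", "kwh": "100"},
    {"bldg_id": "1", "upgrade": "1", "kwh": "80"},
    {"bldg_id": "2", "upgrade": "0", "kwh": "250"},
]
root = tempfile.mkdtemp()
write_hive_partitioned(rows, root)
upgrade_0 = read_partition(root, "0")
```

Filtering to an upgrade becomes a directory lookup, which is the property the commit relies on to remove the per-geography export bottleneck.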
Use the 'detailed' keyword to export all columns present in the raw data, primarily used for internal testing and debugging.
Actually compress gz file, code from @wenyikuang
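For illustration, a small standard-library sketch of writing a CSV that is genuinely gzip-compressed and then checking it. This is not the code contributed in the commit; the function names and sample columns are hypothetical. The check relies on the fact that every gzip stream starts with the two magic bytes `0x1f 0x8b`, so a plain CSV saved with a `.gz` extension would fail it.

```python
import csv
import gzip
import os
import tempfile

def write_csv_gz(path, header, rows):
    # "wt" opens a text-mode stream whose bytes are gzip-compressed on write.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

def is_gzip(path):
    # Every gzip stream begins with the magic bytes 0x1f 0x8b.
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# Illustrative data only.
out_path = os.path.join(tempfile.mkdtemp(), "results.csv.gz")
write_csv_gz(out_path, ["bldg_id", "kwh"], [[1, 100.0], [2, 250.0]])
compressed = is_gzip(out_path)

# Round-trip the file to confirm the contents survive compression.
with gzip.open(out_path, "rt", newline="", encoding="utf-8") as f:
    round_trip = list(csv.reader(f))
```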
Collect all geographic aggregates in one frame then split the collected frame into individual geographies. This speeds up the processing significantly for by_state aggregates.
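The collect-once-then-split pattern can be sketched as follows. This assumes the collected result fits in memory as a list of row dictionaries; the PR works on dataframes, and the function and column names here are hypothetical.

```python
from collections import defaultdict

def split_by_geography(collected_rows, geo_col):
    """Partition already-collected rows into one list per geography value."""
    frames = defaultdict(list)
    for row in collected_rows:
        frames[row[geo_col]].append(row)
    return dict(frames)

# One collected result covering every state (illustrative data only).
collected = [
    {"state": "CO", "bldg_id": 1, "kwh": 100.0},
    {"state": "CO", "bldg_id": 2, "kwh": 250.0},
    {"state": "WY", "bldg_id": 1, "kwh": 40.0},
]
per_state = split_by_geography(collected, "state")
```

The expensive step (collecting) runs once, and the split is a single in-memory pass over the collected rows rather than one query per state.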
PUMAs should be added to fkt during creation, so this change is temporary.
For aggregates, the full dataframe may be collected without memory issues. For non-aggregates, need to collect a dataframe per geography to avoid memory issues.
If there is no aggregation, collect the entire first-level geography at once and then sub-divide this to separate files later.
Allows an array of variables to be used when aggregating. Helpful when you need multiple geographic variables like state and climate zone included in the output files.
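As a rough illustration of aggregating by a list of variables, the sketch below groups on a tuple built from every requested column. It is plain Python with hypothetical names and made-up values, not the PR's implementation.

```python
from collections import defaultdict

def aggregate(rows, group_cols, value_col):
    """Sum value_col for each distinct combination of group_cols."""
    totals = defaultdict(float)
    for row in rows:
        # A tuple key lets one function handle one or many grouping variables.
        key = tuple(row[c] for c in group_cols)
        totals[key] += row[value_col]
    return dict(totals)

# Illustrative data only.
rows = [
    {"state": "CO", "climate_zone": "5B", "kwh": 100.0},
    {"state": "CO", "climate_zone": "5B", "kwh": 50.0},
    {"state": "CO", "climate_zone": "6B", "kwh": 30.0},
]
by_state = aggregate(rows, ["state"], "kwh")
by_state_and_zone = aggregate(rows, ["state", "climate_zone"], "kwh")
```

Passing `["state", "climate_zone"]` keeps both variables in the output key, which matches the use case described in the commit message.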
Fixes national-scale partitioning to work when no aggregation is supplied.
Reduce the bottleneck for metadata export by parallelizing writing metadata files
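One way to parallelize file writes is a thread pool, sketched below with the standard library. Whether the PR uses threads, processes, or something else is not stated here, so treat the approach, the `max_workers` value, and all names as assumptions.

```python
import csv
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_one(out_dir, name, rows):
    """Write one geography's rows to <out_dir>/<name>.csv and return the path."""
    path = os.path.join(out_dir, f"{name}.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["bldg_id", "kwh"])
        writer.writerows(rows)
    return path

def write_all(out_dir, frames, max_workers=4):
    # File writes are I/O-bound, so threads can overlap them despite the GIL.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(write_one, out_dir, n, r) for n, r in frames.items()]
        # result() re-raises any exception raised in a worker thread.
        return sorted(f.result() for f in futures)

# Illustrative data only.
frames = {"CO": [[1, 100.0]], "WY": [[1, 40.0], [2, 5.0]]}
out_dir = tempfile.mkdtemp()
written = write_all(out_dir, frames)
```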
asparke2 added the labels reporting measure (PR improves or adds reporting measures) and postprocessing (PR improves or adds postprocessing content) on Jan 22, 2025
ChristopherCaradonna approved these changes on Jan 23, 2025
Pull request overview
This pull request makes major changes to postprocessing to accommodate the new sampling method, also known as increased geographic resolution. Instead of one row of metadata per modeled building, there is now an apportionment step that creates a set of weights. This step can reuse a modeled building ID across many different geographies. The per-building annual results (results.csv, ~150k rows by ~1,300 columns) are joined onto the weights table of ~27M rows. The resulting table would be ~27M rows by ~1,300 columns for each upgrade, which is too large to compute in memory on most computers and would create a prohibitively large file on disk. Instead, the data is calculated and exported in two separate ways:
metadata_and_annual_results: this is the highest-resolution data published, aggregated to the census tract, so a single building ID appears at most once per census tract. Exporting this as one file would produce ~16M rows, which is too large. Instead, we slice the data by state and by county within each state, which yields files that are manageable in CSV format.
metadata_and_annual_results_aggregates: these are results aggregated by a given set of geographies (e.g. county, PUMA), so a single building ID appears at most once per geography. As a result, the files are a much more manageable size, especially in CSV format.
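The two export shapes above can be sketched in plain Python. This is a toy under stated assumptions, not the PR's code: the column names and values are made up, and summing a building's weights within a geography is an assumed way of making each building ID appear at most once there. The last line gives a rough size for the unaggregated join, assuming every cell takes 8 bytes (real columns mix types, so this is only an order-of-magnitude figure).

```python
from collections import defaultdict

# Per-building annual results: one row per modeled building (hypothetical columns).
results = {
    101: {"kwh": 100.0},
    102: {"kwh": 250.0},
}

# Apportionment weights: a building ID can repeat across and within geographies.
weights = [
    {"bldg_id": 101, "tract": "T1", "county": "C1", "weight": 1.5},
    {"bldg_id": 101, "tract": "T1", "county": "C1", "weight": 0.5},
    {"bldg_id": 101, "tract": "T2", "county": "C1", "weight": 2.0},
    {"bldg_id": 102, "tract": "T2", "county": "C1", "weight": 1.0},
]

def aggregate_weights(weights, geo_col):
    """Collapse weights so each (geography, bldg_id) pair appears once (assumed: sum)."""
    summed = defaultdict(float)
    for w in weights:
        summed[(w[geo_col], w["bldg_id"])] += w["weight"]
    return summed

def join_results(summed, results):
    """Attach the per-building results to each aggregated weight row."""
    return [
        {"geo": geo, "bldg_id": bldg_id, "weight": weight, **results[bldg_id]}
        for (geo, bldg_id), weight in sorted(summed.items())
    ]

by_tract = join_results(aggregate_weights(weights, "tract"), results)
by_county = join_results(aggregate_weights(weights, "county"), results)

# Rough size of the unaggregated join, assuming 8 bytes per cell.
full_join_gb = 27_000_000 * 1_300 * 8 / 1e9
```

Under the 8-byte assumption the full join comes to roughly 280 GB per upgrade, which is consistent with the description's point that it cannot be held in memory on most machines. Aggregating to a coarser geography reduces the row count because more weight rows collapse onto the same (geography, building) pair.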
There are also a lot of Rubocop syntax changes pulled from main, which was unintentional, but we've decided not to spend the time to remove those changes from this PR.
Pull Request Author
This pull request makes changes to (select all that apply):
Author pull request checklist:
comstock_column_definitions.csv
options_lookup.tsv files updated
.yml files
README.md files
test/reporting_measure_tests.txt, test/workflow_measure_tests.txt, or test/upgrade_measure_tests.txt
Review Checklist
This will not be exhaustively relevant to every PR.
ComStock Licensing Language - Add to Beginning of Each Code File