
Updates postprocessing for new sampling / increased geographic resolution #275

Merged
merged 133 commits into sdr_2024_r2 from wenyi/euss_r3
Jan 23, 2025

Conversation

asparke2
Member

Pull request overview

This pull request makes major changes to the postprocessing to accommodate the new sampling method (i.e., the increased geographic resolution). Instead of one row of metadata per modeled building, there is now an apportionment step that creates a set of weights. This step can reuse a modeled building ID across many different geographies. The per-building annual results (results.csv, ~150k rows by ~1,300 columns) are joined onto the weights table of ~27M rows. The resulting file would be ~27M rows by ~1,300 columns (for each upgrade), which is too big to compute in memory on most computers and would create a prohibitively large file on disk. Instead, the data is calculated and exported in two separate ways:

  1. metadata_and_annual_results: this is the highest-resolution data published, aggregated to the census tract. This means a single building ID appears at most once per census tract. Exporting this as one file would result in ~16M rows, which is too big. Instead, we slice the file by state, and by county within each state. This produces files of manageable size in CSV format.

  2. metadata_and_annual_results_aggregates: this is results aggregated by a given set of geographies (e.g. county, PUMA, etc.). This means a single building ID appears at most once per geography. Because of this, the files are a much more manageable size, especially in CSV format.

There are also a lot of Rubocop syntax changes pulled from main, which was unintentional, but we've decided not to spend the time to remove those changes from this PR.

Pull Request Author

This pull request makes changes to (select all that apply):

  • Documentation
  • Infrastructure (includes apptainer image, buildstock batch, dependencies, continuous integration tests)
  • Sampling
  • Workflow Measures
  • Upgrade Measures
  • Reporting Measures
  • Postprocessing

Author pull request checklist:

  • Tagged the pull request with the appropriate label (documentation, infrastructure, sampling, workflow measure, upgrade measure, reporting measure, postprocessing) to help categorize changes in the release notes.
  • Added tests for new measures
  • Updated measure .xml(s)
  • Register values added to comstock_column_definitions.csv
  • Both options_lookup.tsv files updated
  • 10k+ test run
  • Change documentation written
  • Measure documentation written
  • ComStock documentation updated
  • Changes reflected in example .yml files
  • Changes reflected in README.md files
  • Added 'See ComStock License' language to first two lines of each code file
  • Implements corresponding measure tests and indexing path in test/reporting_measure_tests.txt, test/workflow_measure_tests.txt, or test/upgrade_measure_tests.txt
  • All new and existing tests pass the CI

Review Checklist

Not every item will be relevant to every PR.

  • Perform a code review on GitHub
  • All related changes have been implemented: data and method additions, changes, tests
  • If fixing a defect, verify by running develop branch and reproducing defect, then running PR and reproducing fix
  • Reviewed change documentation
  • Ensured code files contain License reference
  • Results differences are reasonable
  • Make sure newly added measures have tests and are indexed properly
  • CI status: all tests pass

ComStock Licensing Language - Add to Beginning of Each Code File

# ComStock™, Copyright (c) 2023 Alliance for Sustainable Energy, LLC. All rights reserved.
# See top level LICENSE.txt file for license terms.

Disables reporting of HVAC system counts which were incorrect because of the introduction of zone multipliers. These may be re-enabled after correct handling of zone multipliers in the reporting system.
Enable writing of metadata to S3 or local filesystem depending on the input. Requires configuration of fsspec to account for possible Pyathena import.
With many upgrades the fkt was becoming too large to collect without memory issues. Transition to a single fkt and add a method to get the fkt with a specified upgrade_id added.
Climate zones in fkt are actually a mix of CEC and ASHRAE. Add mapping from CEC to ASHRAE into fkt creation.
Not all columns to be plotted will be exported normally.
Allow the desired geographies, aggregation levels, partitioning, and file types to be passed into export_metadata_and_annual_results for reusability.
Continue writing uncompressed CSVs if writing locally.
fkt creation is non-deterministic. Cache fkt so it can be reused when a data export is interrupted and needs to be restarted. Hive partition the ComStock wide data to enable much faster filtering down to the upgrade, which was a major bottleneck in exporting data for each geography.
Use the 'detailed' keyword to export all columns present in the raw data, primarily used for internal testing and debugging.
Actually compress the .gz file (code from @wenyikuang).
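The fix amounts to writing through a gzip stream instead of just naming the output file `.gz`; a stdlib sketch (path and contents hypothetical):

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "results.csv.gz")

# Writing through gzip.open produces a genuinely compressed file; merely
# giving an uncompressed file a ".gz" name does not.
with gzip.open(path, "wt") as f:
    f.write("building_id,kwh\n1,100.0\n")

# The gzip magic bytes confirm the file really is compressed.
with open(path, "rb") as f:
    magic = f.read(2)
```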
Collect all geographic aggregates in one frame then split the collected frame into individual geographies. This speeds up the processing significantly for by_state aggregates.
PUMAs should be added to fkt during creation, so this change is temporary.
For aggregates, the full dataframe may be collected without memory issues. For non-aggregates, a dataframe must be collected per geography to avoid memory issues.
If there is no aggregation, collect the entire first-level geography at once and then sub-divide this to separate files later.
Allows an array of variables to be used when aggregating. Helpful when you need multiple geographic variables like state and climate zone included in the output files.
Fixes national-scale partitioning to work when no aggregation is supplied.
Reduce the bottleneck for metadata export by parallelizing the writing of metadata files.
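Parallelizing the per-geography file writes can be sketched with the stdlib; the paths, geographies, and contents below are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_metadata_file(job):
    # Each worker writes one geography's metadata file independently, so
    # the slices are written concurrently rather than one at a time.
    path, text = job
    with open(path, "w") as f:
        f.write(text)
    return path

out_dir = tempfile.mkdtemp()
jobs = [
    (os.path.join(out_dir, f"metadata_{geo}.csv"), f"geography\n{geo}\n")
    for geo in ("CO", "WA", "NY")
]
with ThreadPoolExecutor(max_workers=4) as pool:
    written = list(pool.map(write_metadata_file, jobs))
```

Threads are a reasonable fit here because the work is dominated by file I/O; a process pool would be the alternative if serialization cost dominated instead.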
@asparke2 asparke2 added reporting measure PR improves or adds reporting measures postprocessing PR improves or adds postprocessing content labels Jan 22, 2025
@asparke2 asparke2 mentioned this pull request Jan 22, 2025
@ChristopherCaradonna ChristopherCaradonna merged commit 2c3cf2b into sdr_2024_r2 Jan 23, 2025
1 check was pending
@ChristopherCaradonna ChristopherCaradonna deleted the wenyi/euss_r3 branch January 23, 2025 18:57
7 participants