
Updates postprocessing for new sampling / increased geographic resolution #275

Merged
merged 133 commits into sdr_2024_r2 from wenyi/euss_r3
Jan 23, 2025

Conversation

asparke2
Member

Pull request overview

This pull request makes major changes to the postprocessing to accommodate the new sampling method (i.e., the increased geographic resolution). Instead of one row of metadata per modeled building, there is now an apportionment step that creates a set of weights. This step can reuse a modeled building ID across many different geographies. The per-building annual results (results.csv, ~150k rows by ~1,300 columns) are joined onto the weights table of ~27M rows. The resulting file would be ~27M rows by ~1,300 columns (for each upgrade), which is too big to compute in memory on most computers and would create a prohibitively large file on disk. Instead, the data is calculated and exported in two separate ways:

  1. metadata_and_annual_results: this is the highest-resolution data published, aggregated to the census tract. This means a single building ID appears at most once per census tract. Exporting this as one file would result in ~16M rows, which is too big. Instead, we slice the file by state, and by county within each state. This produces files of manageable size in CSV format.

  2. metadata_and_annual_results_aggregates: this is results aggregated by a given set of geographies (e.g. county, PUMA, etc.). This means a single building ID appears at most once per geography. Because of this, the files are a much more manageable size, especially in CSV format.

There are also a lot of Rubocop syntax changes pulled from main, which was unintentional, but we've decided not to spend the time to remove those changes from this PR.

Pull Request Author

This pull request makes changes to (select all that apply):

  • Documentation
  • Infrastructure (includes apptainer image, buildstock batch, dependencies, continuous integration tests)
  • Sampling
  • Workflow Measures
  • Upgrade Measures
  • Reporting Measures
  • Postprocessing

Author pull request checklist:

  • Tagged the pull request with the appropriate label (documentation, infrastructure, sampling, workflow measure, upgrade measure, reporting measure, postprocessing) to help categorize changes in the release notes.
  • Added tests for new measures
  • Updated measure .xml(s)
  • Register values added to comstock_column_definitions.csv
  • Both options_lookup.tsv files updated
  • 10k+ test run
  • Change documentation written
  • Measure documentation written
  • ComStock documentation updated
  • Changes reflected in example .yml files
  • Changes reflected in README.md files
  • Added 'See ComStock License' language to first two lines of each code file
  • Implements corresponding measure tests and indexing path in test/reporting_measure_tests.txt, test/workflow_measure_tests.txt, or test/upgrade_measure_tests.txt
  • All new and existing tests pass the CI

Review Checklist

Not every item will be relevant to every PR.

  • Perform a code review on GitHub
  • All related changes have been implemented: data and method additions, changes, tests
  • If fixing a defect, verify by running develop branch and reproducing defect, then running PR and reproducing fix
  • Reviewed change documentation
  • Ensured code files contain License reference
  • Results differences are reasonable
  • Make sure newly added measures have tests and are indexed properly
  • CI status: all tests pass

ComStock Licensing Language - Add to Beginning of Each Code File

# ComStock™, Copyright (c) 2023 Alliance for Sustainable Energy, LLC. All rights reserved.
# See top level LICENSE.txt file for license terms.

Disables reporting of HVAC system counts which were incorrect because of the introduction of zone multipliers. These may be re-enabled after correct handling of zone multipliers in the reporting system.
Enable writing of metadata to S3 or local filesystem depending on the input. Requires configuration of fsspec to account for possible Pyathena import.
With many upgrades the fkt was becoming too large to collect without memory issues. Transition to a single fkt and add a method to get the fkt with a specified upgrade_id added.
Climate zones in fkt are actually a mix of CEC and ASHRAE. Add mapping from CEC to ASHRAE into fkt creation.
Not all columns to be plotted will be exported normally.
Allow the desired geographies, aggregation levels, partitioning, and file types to be passed into export_metadata_and_annual_results for reusability.
Continue writing uncompressed CSVs if writing locally.
fkt creation is non-deterministic. Cache fkt so it can be reused when a data export is interrupted and needs to be restarted. Hive partition the ComStock wide data to enable much faster filtering down to the upgrade, which was a major bottleneck in exporting data for each geography.
Use the 'detailed' keyword to export all columns present in the raw data, primarily used for internal testing and debugging.
Actually compress the .gz file (code from @wenyikuang).
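The fix amounts to writing through a gzip stream instead of just naming the output file `.gz`; a stdlib sketch (path and contents hypothetical):

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "results.csv.gz")

# Writing through gzip.open produces a genuinely compressed file; merely
# giving an uncompressed file a ".gz" name does not.
with gzip.open(path, "wt") as f:
    f.write("building_id,kwh\n1,100.0\n")

# The gzip magic bytes confirm the file really is compressed.
with open(path, "rb") as f:
    magic = f.read(2)
```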
Collect all geographic aggregates in one frame then split the collected frame into individual geographies. This speeds up the processing significantly for by_state aggregates.
PUMAs should be added to fkt during creation, so this change is temporary.
For aggregates, the full dataframe may be collected without memory issues. For non-aggregates, a dataframe must be collected per geography to avoid memory issues.
If there is no aggregation, collect the entire first-level geography at once and then sub-divide this to separate files later.
Allows an array of variables to be used when aggregating. Helpful when you need multiple geographic variables like state and climate zone included in the output files.
Fixes national-scale partitioning to work when no aggregation is supplied.
Reduce the bottleneck for metadata export by parallelizing the writing of metadata files.
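Parallelizing the per-geography file writes can be sketched with the stdlib; the paths, geographies, and contents below are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_metadata_file(job):
    # Each worker writes one geography's metadata file independently, so
    # the slices are written concurrently rather than one at a time.
    path, text = job
    with open(path, "w") as f:
        f.write(text)
    return path

out_dir = tempfile.mkdtemp()
jobs = [
    (os.path.join(out_dir, f"metadata_{geo}.csv"), f"geography\n{geo}\n")
    for geo in ("CO", "WA", "NY")
]
with ThreadPoolExecutor(max_workers=4) as pool:
    written = list(pool.map(write_metadata_file, jobs))
```

Threads are a reasonable fit here because the work is dominated by file I/O; a process pool would be the alternative if serialization cost dominated instead.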
@asparke2 asparke2 added reporting measure PR improves or adds reporting measures postprocessing PR improves or adds postprocessing content labels Jan 22, 2025
@asparke2 asparke2 mentioned this pull request Jan 22, 2025
@ChristopherCaradonna ChristopherCaradonna merged commit 2c3cf2b into sdr_2024_r2 Jan 23, 2025
1 check was pending
@ChristopherCaradonna ChristopherCaradonna deleted the wenyi/euss_r3 branch January 23, 2025 18:57
7 participants