Inconsistent leading zeros in emissions_unit_id_epa
#3992
Labels
bug
Things that are just plain broken.
internal-onboarding
Good first issues, for folks who have access to all of our systems.
Describe the bug
As has been noted in other issues, leading zeros can cause issues when crosswalking data (#964 and #2366).
It appears that EPA data suffers from the same issue with emissions_unit_id_epa, specifically across the hourly CEMS emissions data and the core_epa__assn_eia_epacamd table.
Some examples:
Plant 50852, unit 002001:
Plant 2446, unit 051B:
It appears that EPA is inconsistent about whether it strips leading zeros across its data products: in some cases it strips them, and in others it leaves them in place. Even within the PSDC, some units have leading zeros and others have had them stripped. This causes problems when trying to crosswalk both sources across time.
I'd propose that PUDL programmatically strip leading zeros from this field like you already do for boiler and generator IDs.
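To illustrate the proposal, here is a minimal sketch of the kind of normalization I have in mind. It is not PUDL's existing boiler/generator ID code. The function name strip_leading_zeros is made up for this example, and the two key sets are hypothetical: they reuse the unit IDs above, but which source carries the zeros is assumed for illustration.

```python
import re


def strip_leading_zeros(unit_id):
    """Drop leading zeros from an ID string, keeping at least one character."""
    if unit_id is None:
        return None
    # The lookahead keeps the final character, so "000" becomes "0", not "".
    return re.sub(r"^0+(?=.)", "", str(unit_id).strip())


# Hypothetical (plant_id, unit_id) keys for the same units in two sources.
cems_keys = {(50852, "002001"), (2446, "051B")}
crosswalk_keys = {(50852, "2001"), (2446, "51B")}

# Joining on the raw IDs finds nothing in common.
raw_matches = cems_keys & crosswalk_keys

# Joining on the normalized IDs recovers both units.
normalized_matches = {
    (plant, strip_leading_zeros(unit)) for plant, unit in cems_keys
} & {(plant, strip_leading_zeros(unit)) for plant, unit in crosswalk_keys}

print(len(raw_matches), len(normalized_matches))
```

One thing worth checking before applying this everywhere: if any plant has two genuinely distinct units whose IDs differ only by leading zeros (say "01" and "1"), stripping would merge them, so a uniqueness check on (plant, unit) after normalization seems prudent.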
Bug Severity
How badly is this bug affecting you?
Expected behavior
Consistent EPA unit IDs across datasets
Software Environment?
N/A
Additional context
This is causing some issues with crosswalking in OGE. Currently, we are manually stripping leading zeros once we import the data.