Inconsistent leading zeros in emissions_unit_id_epa #3992

Open
grgmiller opened this issue Dec 17, 2024 · 1 comment
Labels
  • bug: Things that are just plain broken.
  • internal-onboarding: Good first issues, for folks who have access to all of our systems.

Comments

@grgmiller
Collaborator

Describe the bug

As noted in other issues (#964 and #2366), leading zeros can cause problems when crosswalking data.

It appears that EPA data suffers from the same problem with emissions_unit_id_epa, specifically across the hourly CEMS emissions data and the core_epa__assn_eia_epacamd table.

Some examples:

Plant 50852, unit 002001:

  • CAMD online database: 002001
  • PSDC: 2001
  • PUDL hourly emissions table: 2001

Plant 2446, unit 051B:

  • CAMD online database: 051B
  • PSDC: 051B
  • PUDL hourly emissions table: 051B

It appears that EPA is inconsistent about whether it strips leading zeros across its data products: in some cases it strips them, and in others it leaves them in place. Even within the PSDC, some unit IDs keep their leading zeros and others have them stripped. This causes problems when trying to crosswalk both sources across time.
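
To illustrate the kind of mismatch described above, here is a minimal sketch that flags unit IDs which differ between two sources only by leading zeros. The function name `find_leading_zero_mismatches` and the `(plant_id, unit_id)` tuple layout are assumptions made for this example; they are not part of PUDL, OGE, or any EPA tooling.

```python
def find_leading_zero_mismatches(source_a, source_b):
    """Return (plant_id, id_in_a, id_in_b) triples whose unit IDs differ
    only by leading zeros. Hypothetical helper, for illustration only."""

    def normalize(unit_id):
        # Strip every leading zero; keep a single "0" if nothing is left.
        return unit_id.lstrip("0") or "0"

    # Index source_b by (plant_id, normalized unit ID).
    by_key = {}
    for plant_id, unit_id in source_b:
        by_key.setdefault((plant_id, normalize(unit_id)), set()).add(unit_id)

    mismatches = []
    for plant_id, unit_id in source_a:
        for other in sorted(by_key.get((plant_id, normalize(unit_id)), ())):
            if other != unit_id:
                mismatches.append((plant_id, unit_id, other))
    return mismatches


# The two examples from this issue: CAMD online database vs. PUDL hourly table.
camd = [(50852, "002001"), (2446, "051B")]
pudl_hourly = [(50852, "2001"), (2446, "051B")]
mismatches = find_leading_zero_mismatches(camd, pudl_hourly)
```

With the two examples from this issue, only plant 50852 is flagged, since 051B is written identically in both sources.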

I'd propose that PUDL programmatically strip leading zeros from this field like you already do for boiler and generator IDs.

Bug Severity

How badly is this bug affecting you?

  • Medium: With some effort, I can work around the bug.

Expected behavior

Consistent EPA unit IDs across datasets

Software Environment?

N/A

Additional context

This is causing some issues with crosswalking in OGE. Currently, we are manually stripping leading zeros once we import the data.

@e-belfer
Member

e-belfer commented Jan 7, 2025

Right now we call the helper method remove_leading_zeros_from_numeric_strings on the emissions_unit_id_epa columns in core_epa__assn_eia_epacamd and in the transform method in pudl.transform.epacems. The problem is that EPA unit IDs aren't always numeric, as the example provided above (051B) illustrates. We only remove leading zeros when the ID is entirely numeric (e.g., 002001 becomes 2001), not when it includes letters, because we haven't seen cases where alphanumeric values have inconsistent leading zeros. This is the same way we process our generator and boiler IDs.
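
As a rough sketch of the numeric-only behavior described here (this is not the actual body of remove_leading_zeros_from_numeric_strings; the function name below is hypothetical and the logic is an assumption based on the description above):

```python
import re


def strip_leading_zeros_if_numeric(unit_id: str) -> str:
    """Strip leading zeros only when the ID consists entirely of digits.

    Hypothetical illustration, not PUDL's implementation.
    """
    if re.fullmatch(r"[0-9]+", unit_id):
        # Keep a single "0" for all-zero IDs instead of an empty string.
        return unit_id.lstrip("0") or "0"
    # Alphanumeric IDs such as "051B" are returned unchanged.
    return unit_id


numeric_result = strip_leading_zeros_if_numeric("002001")
alphanumeric_result = strip_leading_zeros_if_numeric("051B")
```

Under this rule 002001 becomes 2001 while 051B is left alone, so an alphanumeric ID that appears both with and without leading zeros would still fail to match.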

Thanks for catching this one and alerting us, @grgmiller! Are you seeing cases where alphanumeric IDs with leading zeros cause matching problems?

@bendnorman added the internal-onboarding label on Jan 9, 2025
Projects
Status: Backlog