Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add column mapping for proposed multifuel table #3988

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

jmelot
Copy link

@jmelot jmelot commented Dec 8, 2024

Overview

Closes #3438

What problem does this address?

What did you change?

Documentation

Make sure to update relevant aspects of the documentation.

Tasks

Preview Give feedback

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

Preview Give feedback

@e-belfer
Copy link
Member

e-belfer commented Dec 9, 2024

@jmelot Hello! Thanks for opening this PR, excited to get this in! I'm happy to answer any questions you have as you work on this, and review it when you reach a good pausing point. Just ping me!

@e-belfer e-belfer added community eia860 Anything having to do with EIA Form 860 new-data Requests for integration of new data. labels Dec 9, 2024
@jmelot
Copy link
Author

jmelot commented Dec 11, 2024

Thanks @e-belfer, I appreciate it!

  • I have been able to run the pipeline end to end and examine the multifuel outputs. I followed some of the other transformations I saw applied in transform/eia860.py but I doubt I got everything that's needed, could you please let me know what else would be useful? I'll also do another pass to make sure I got all the boolean columns and spot check the proposed column mapping.
  • One thing I noticed was that the operable and retired multifuel tables have Operating month and Operating year columns while the proposed table has Current month and Current year tables which presumably have a different meaning. I wasn't sure whether it would be better to store the proposed values in the same generator_operating_{month,year} columns or keep them separate. For now, I've stored them in the same column.

@jmelot jmelot marked this pull request as ready for review December 11, 2024 14:04
@e-belfer e-belfer requested a review from aesharpe December 16, 2024 19:28
Copy link
Member

@aesharpe aesharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of feedback

Thanks for your additions @jmelot! You did a nice job mimicking existing functions for other tables. I've added a few comments and suggested additions below!

Here are some similar issues to review for context.
wind: #3460
solar: #3461
storage: #3463
equipment: #3462

If you look at the transform issues in the task lists and then the transform PRs linked to those issues you may find some helpful comparisons. For example, #3522.

Regarding your comment about operating date:

One thing I noticed was that the operable and retired multifuel tables have Operating month and Operating year columns while the proposed table has Current month and Current year tables which presumably have a different meaning. I wasn't sure whether it would be better to store the proposed values in the same generator_operating_{month,year} columns or keep them separate. For now, I've stored them in the same column.

In the proposed generators table, this "current" column refers to the current planned operating date and is referred to as current_planned_generator_operating_month/year. The generator table also has an original_planned_generator_operating_month/year column, so it makes sense to distinguish the two with "current" and "original" prefixes. Because this table doesn't have an "original" column, I could see us going either way:

A) Keep the column name generator_operating_month/year and relying on the operational_status column to know whether it's planned or in effect.

B) Re-use the current_planned_generator_operating_month/year column name for consistency and extra specificity.

@e-belfer or @cmgosnell what do you think?

Add table to harvested asset factory

Add this table to the harvested_entity_asset_factory and finished_eia_assets in src/pudl/transform/eia.py. See the description of the core layer in our docs for more detail about the harvesting process.

Add table to RESOURCE_METADATA

Add the table schema and information to the src/pudl/metadata/resources/eia860.py/RESOURCE_METADATA dictionary.

Update release notes

You'll want to add a description of your changes to our docs/release_notes.rst file

Testing

This is more of a question for @e-belfer and @cmgosnell -- should we add this table to any of the testing infrastructure? E.g.: test_minmax_rows or test_unique_rows_eia in eia_test.py? It's not clear to me given the exclusion of the other generator subset tables (scd_generators_solar etc.) from both of these functions.

src/pudl/transform/eia.py Outdated Show resolved Hide resolved
src/pudl/transform/eia860.py Outdated Show resolved Hide resolved
src/pudl/transform/eia860.py Show resolved Hide resolved
@cmgosnell
Copy link
Member

Regarding your comment about operating date:

  • definitely think this looks and sounds like current_planned_generator_operating_month/year. great catch that these columns look different! So i think this clearly fits under your suggestion B) austne

@cmgosnell
Copy link
Member

Re testing

All of the test/validate/eia_test.py stuff is all dependent on the PudlTabl, which we are deprecating. But rn most or all of the scd tables are included in these tests via _register_output_methods. and then the name of the values in that _register_output_methods dictionary are used in these validation tests.

We are planning on re-vamping our data validation process and depreciating PudlTabl ( an early example of that is using the similar tests for FERC1 in this pr #3860)

if you wanted to add new tables into test_minmax_rows I believe the steps would be:

  1. add any new tables to _register_output_methods. keys are the asset names (which would be the core_ table names) - value can be the same tbh the shortened names are only the maintain backwards compatibility with our old method names.
  2. add these new tables with the parameters for test_minmax_rows with the counts you see now.
  3. w/ or w/o adding new tables into the validation tests you'll definitely want to run the data validations using make pytest-validate

Also general comment, with all of these steps to fully fully integrate these new tables, please feel free to say "I'd like to pass this off" or "can this be out of scope" or "i'll need more support in order to make ___ happen" or whatever! The extract mappings and the transform function is already really helpful!

@jmelot
Copy link
Author

jmelot commented Dec 20, 2024

Thanks @aesharpe and @cmgosnell (and sorry to be slow getting back to you). On a quick skim, this makes sense. I'll have much more time next week, and I'll take a closer look and try to make the necessary edits before the new year.

Copy link
Member

@aesharpe aesharpe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmelot I left a little comment about boolean column naming and primary keys, otherwise just need the release notes!

I'll try and materialize the assets in a bit to see if it works / I have any other feeback. I'll also get back to you with more specific testing specifications in a moment.

Comment on lines 90 to 93
"air_permit_limits": {
"type": "boolean",
"description": "True if there are air permit limits",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've started trying to add a verb suffix to boolean columns highlight their binary nature. E.g.: is_{x} or has_{x} or served_{x}. This column would thus be has_air_permit_limits. Applicable to a few other column names you added.

(this is also a great reminder to add this to our naming conventions docs page...)

src/pudl/metadata/resources/eia860.py Show resolved Hide resolved
@jmelot
Copy link
Author

jmelot commented Jan 3, 2025

Thanks, on it (and sorry this is dragging out a bit, the holidays were busier than I expected)! I think I was running into some kind of schema mismatch error when I last ran etl_full so after I've finished updating the boolean column names I'll give it another try and see if I can track that down.

@jmelot
Copy link
Author

jmelot commented Jan 4, 2025

I updated the boolean column names, added a short release note, and updated the primary key!

Two questions:

  • When I was updating src/pudl/metadata/fields.py I felt like I was kind of guessing at some of the columns' meanings and on a few of them I was unsure enough that I just wrote "TK" for now. I looked around here but didn't see any column-level documentation. Do you know of a good resource where I could look to see the meaning of these columns so I can improve the column descriptions?
  • I tried rerunning the etl_full pipeline but it died with the error below. I made sure to update my database schema so I'm not sure what I'm doing wrong here - none of my other column renamings resulted in a similar error message. Does anything stand out to you?
Screenshot 2025-01-04 at 4 38 34 PM

@cmgosnell cmgosnell self-assigned this Jan 6, 2025
@aesharpe aesharpe self-assigned this Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
community eia860 Anything having to do with EIA Form 860 new-data Requests for integration of new data.
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

Integrate EIA-860 Multifuel table
4 participants