Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes #3005 and #3592: Hide null ORCID fields in contributor representation and correct flipped starting/ending balance for FERC1 liabilities #3969

Closed
wants to merge 6 commits into from

Conversation

yolandazzz13
Copy link

This PR addresses two distinct issues:

Issue #3005: Ensures that null ORCID fields in contributor do not appear in the datapackage.json file or in class representations.

I added a repr method to the Contributor class that excludes None values in its string representation. This ensures that fields like ORCID are omitted when not present.
Verified that this change applies consistently across all cases where Contributor class is called. This is done by checking the output values for DataSource.from_id("eia860").contributors, in which eia860 is replaced with other datasets in src/pudl/metadata/sources.py.

Issue #3592: Corrected the flipped starting and ending balances in out_ferc1_yearly_other_regulatory_liabilities_sched278

Changed the FERC1 transformation logic to:
Assign "beg_yr_bal" to "starting_balance"
Assign "end_yr_bal" to "ending_balance"
whereas previously the fields are flipped, which caused the bug.

Verified the change actually worked by running the ETL pipeline and materialized out_ferc1_yearly_other_regulatory_liabilities_sched278, then accessed multiple data rows to confirm that the starting balance of one year is equal to the ending balance of the previous year.
image

Testing:
I ran make pytest-coverage and got 84 passed, 4 skipped, 6 xfailed, 2 xpassed, 888 warnings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there are no changes to the database schema, we don't need to have an alembic migration. This file can be removed.

Comment on lines -4652 to +4653
"beg_yr_bal": "ending_balance",
"end_yr_bal": "starting_balance",
"beg_yr_bal": "starting_balance",
"end_yr_bal": "ending_balance",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

omg we literally just switched them 🤦🏼 Thank you for fixing!

@@ -897,11 +897,19 @@ class Contributor(PudlMeta):
"work package leader",
] = "project member"
organization: String | None = None
orcid: String | None = None
orcid: String | None | None = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe there's an extra | None in this attribute definition.

Comment on lines +902 to +906
def __repr__(self):
"""Representation excluding None values."""
attrs = {k: v for k, v in self.__dict__.items() if v is not None}
attr_str = ", ".join(f"{k}={repr(v)}" for k, v in attrs.items())
return f"{self.__class__.__name__}({attr_str})"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what the consequences of having omitting the attribute in all cases of None values will be. Under the datapackage spec it's allowed, but I could also see the existence of those fields being expected on the other end where this is being used in the pudl-archiver repo -- potentially with the None value being used to check for completeness of the fields we want to have.

I'm also not sure if __repr__() is the method being used to output the attributes of the class. This is an immutable Pydandic BaseModel, so getting the JSON representation is probably being done via Contributor.model_dump().

If you instantiate a Contributor and use model_dump() does it behave as expected?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this and found that redefining __repr__ doesn't affect the output of model_dump() and it seems like the right place for this to happen is probably over in the pudl-archiver repository since the serialization behavior we're trying to fix is specific to when we're creating a new datapackage.json for use on Zenodo.

I went ahead and tried to make a PR that does this

@zaneselvans zaneselvans added bug Things that are just plain broken. ferc1 Anything having to do with FERC Form 1 datapkg Frictionless data package input, output, metadata, manipulation metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. zenodo Issues having to do with Zenodo data archiving and retrieval. community labels Nov 21, 2024
Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that there's no change to the database schema, and the null ORCID issue is (hopefully) fixed in catalyst-cooperative/pudl-archiver#486 I think we should just merge in the fix to the swapped start/end date columns.

@zaneselvans
Copy link
Member

I went ahead and merged in the null ORCID fix in the pudl-archiver repo.

@zaneselvans
Copy link
Member

I brought the flipped start/end balance fix into #3979 and the null ORCID fix is made in the pudl-archiver repo so I'm going to close this. Thank you @yolandazzz13 for moving this fixes forward!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things that are just plain broken. community datapkg Frictionless data package input, output, metadata, manipulation ferc1 Anything having to do with FERC Form 1 metadata Anything having to do with the content, formatting, or storage of metadata. Mostly datapackages. zenodo Issues having to do with Zenodo data archiving and retrieval.
Projects
Archived in project
2 participants