Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to splink FERC to EIA record linkage notebook #3976

Merged
merged 4 commits into from
Nov 27, 2024

Conversation

katie-lamb
Copy link
Member

@katie-lamb katie-lamb commented Nov 26, 2024

Overview

This PR updates the devtools notebook for FERC to EIA record linkage with splink. This notebook isn't actually run in production, but it's helpful for visualizing the model weights and the decisions that go into making the match.

What problem does this address?

The notebook now runs with the latest version of splink and reads in parquet inputs from S3 instead of from dagster storage.

What did you change?

Updated the notebook to have more documentation cells and run with the latest version of splink. Also, fixed an error that was cropping up in the plant parts list creation with datetime types. I think this came up because I'm now reading in parquet inputs from S3 instead of dagster storage, and the report_date column was an object instead of a datetime.

Documentation

Make sure to update relevant aspects of the documentation.

Tasks

Preview Give feedback

Testing

How did you make sure this worked? How can a reviewer verify this?

Run the FERC to EIA match notebook from top to bottom.

To-do list

Preview Give feedback

@katie-lamb katie-lamb added the record-linkage Issues related to connecting related records / entities that don't have explicit IDs or keys. label Nov 26, 2024
@katie-lamb katie-lamb requested a review from zschira November 26, 2024 03:50
@katie-lamb katie-lamb self-assigned this Nov 26, 2024
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@zaneselvans zaneselvans added eia923 Anything having to do with EIA Form 923 eia860 Anything having to do with EIA Form 860 labels Nov 26, 2024
Comment on lines 1486 to 1489
if not pd.api.types.is_datetime64_any_dtype(ppe_part_df["report_date"]):
ppe_part_df["report_date"] = pd.to_datetime(ppe_part_df["report_date"])
if not pd.api.types.is_datetime64_any_dtype(multi_gran_df["report_date"]):
multi_gran_df["report_date"] = pd.to_datetime(multi_gran_df["report_date"])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of janky but I think when inputs are read in from S3 (instead of from Dagster storage) the report date is an object not a datetime. I was getting errors, so I threw this in there, but it's a little sloppy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be in the python module? Or a pre-processing step in the notebook? If the dataframe doesn't match the expected schema, then maybe it should fail?

When we read from Dagster storage if it's not a persisted asset it'll be a pickled dataframe, with the datetime type intact. If we're reading from a persisted asset, it'll be coming from Parquet, and IIRC our IOManager will enforce the correct PUDL dtypes when we read the data in. If you're reading a Parquet file from S3 and passing it in outside of the Dagster context, you might try using dtype_backend="pyarrow" to ensure that the datetime column gets parsed not as an object.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay let me try using the pyarrow backend.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pyarrow backend introduced other type issues, so I updated to cast to a datetime within the notebook, after reading in the Parquet inputs.

@katie-lamb katie-lamb marked this pull request as ready for review November 26, 2024 19:33
Copy link
Member

@zschira zschira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@katie-lamb katie-lamb added this pull request to the merge queue Nov 27, 2024
Merged via the queue into main with commit 6d9e272 Nov 27, 2024
17 checks passed
@katie-lamb katie-lamb deleted the splink-notebook-update branch November 27, 2024 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923 record-linkage Issues related to connecting related records / entities that don't have explicit IDs or keys.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants