Updates to splink FERC to EIA record linkage notebook #3976
Conversation
src/pudl/analysis/plant_parts_eia.py (Outdated)
if not pd.api.types.is_datetime64_any_dtype(ppe_part_df["report_date"]):
    ppe_part_df["report_date"] = pd.to_datetime(ppe_part_df["report_date"])
if not pd.api.types.is_datetime64_any_dtype(multi_gran_df["report_date"]):
    multi_gran_df["report_date"] = pd.to_datetime(multi_gran_df["report_date"])
This is kind of janky, but I think when inputs are read in from S3 (instead of from Dagster storage) the report date is an object, not a datetime. I was getting errors, so I threw this in there, but it's a little sloppy.
Should this be in the Python module? Or a pre-processing step in the notebook? If the dataframe doesn't match the expected schema, then maybe it should fail?
When we read from Dagster storage, if it's not a persisted asset it'll be a pickled dataframe with the datetime type intact. If we're reading from a persisted asset, it'll be coming from Parquet, and IIRC our IOManager will enforce the correct PUDL dtypes when we read the data in. If you're reading a Parquet file from S3 and passing it in outside of the Dagster context, you might try using dtype_backend="pyarrow" to ensure that the datetime column gets parsed as a datetime rather than as an object.
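For anyone following along, a minimal sketch of that suggestion, assuming pandas >= 2.0 with s3fs installed; the S3 path here is made up for illustration and may not match the actual PUDL bucket layout:

import pandas as pd

# Hypothetical S3 path, for illustration only.
url = "s3://pudl.catalyst.coop/nightly/out_eia__yearly_plant_parts.parquet"

# dtype_backend="pyarrow" keeps the Arrow timestamp type from the Parquet
# file, so report_date comes in as a datetime instead of a generic object.
ppe_part_df = pd.read_parquet(url, dtype_backend="pyarrow")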
Ah okay let me try using the pyarrow backend.
The pyarrow backend introduced other type issues, so I updated to cast to a datetime within the notebook, after reading in the Parquet inputs.
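A minimal sketch of that workaround, assuming the notebook loads its inputs with pd.read_parquet; the path is hypothetical:

import pandas as pd

# Hypothetical input path; the notebook's actual input location may differ.
url = "s3://pudl.catalyst.coop/nightly/out_eia__yearly_plant_parts.parquet"
ppe_part_df = pd.read_parquet(url)

# Cast explicitly after reading, so downstream code that expects a
# datetime64 report_date doesn't break on an object-typed column.
ppe_part_df["report_date"] = pd.to_datetime(ppe_part_df["report_date"])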
Looks good!
Overview
This PR updates the devtools notebook for FERC to EIA record linkage with splink. This notebook isn't actually run in production, but it's helpful for visualizing the model weights and the decisions that go into making the match.
What problem does this address?
The notebook now runs with the latest version of splink and reads in Parquet inputs from S3 instead of from Dagster storage.
What did you change?
Updated the notebook to have more documentation cells and run with the latest version of splink. Also fixed an error that was cropping up in the plant parts list creation with datetime types. I think this came up because I'm now reading in Parquet inputs from S3 instead of Dagster storage, and the report_date column was an object instead of a datetime.
Documentation
Make sure to update relevant aspects of the documentation.
Tasks
Testing
How did you make sure this worked? How can a reviewer verify this?
Run the FERC to EIA match notebook from top to bottom.
To-do list