Updates to splink FERC to EIA record linkage notebook #3976
Conversation
src/pudl/analysis/plant_parts_eia.py (Outdated)
if not pd.api.types.is_datetime64_any_dtype(ppe_part_df["report_date"]):
    ppe_part_df["report_date"] = pd.to_datetime(ppe_part_df["report_date"])
if not pd.api.types.is_datetime64_any_dtype(multi_gran_df["report_date"]):
    multi_gran_df["report_date"] = pd.to_datetime(multi_gran_df["report_date"])
This is kind of janky, but I think when inputs are read in from S3 (instead of from Dagster storage) the report date is an object, not a datetime. I was getting errors, so I threw this in there, but it's a little sloppy.
Should this be in the Python module? Or a pre-processing step in the notebook? If the dataframe doesn't match the expected schema, then maybe it should fail?
When we read from Dagster storage, if it's not a persisted asset it'll be a pickled dataframe with the datetime type intact. If we're reading from a persisted asset, it'll be coming from Parquet, and IIRC our IOManager will enforce the correct PUDL dtypes when we read the data in. If you're reading a Parquet file from S3 and passing it in outside of the Dagster context, you might try using dtype_backend="pyarrow" to ensure that the datetime column gets parsed as a datetime rather than as an object.
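For anyone following along, a minimal sketch of that suggestion, assuming pandas >= 2.0 with s3fs installed; the S3 path here is made up for illustration and may not match the actual PUDL bucket layout:

import pandas as pd

# Hypothetical S3 path, for illustration only.
url = "s3://pudl.catalyst.coop/nightly/out_eia__yearly_plant_parts.parquet"

# dtype_backend="pyarrow" keeps the Arrow timestamp type from the Parquet
# file, so report_date comes in as a datetime instead of a generic object.
ppe_part_df = pd.read_parquet(url, dtype_backend="pyarrow")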
Ah okay let me try using the pyarrow backend.
The pyarrow backend introduced other type issues, so I updated to cast to a datetime within the notebook, after reading in the Parquet inputs.
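A minimal sketch of that workaround, assuming the notebook loads its inputs with pd.read_parquet; the path is hypothetical:

import pandas as pd

# Hypothetical input path; the notebook's actual input location may differ.
url = "s3://pudl.catalyst.coop/nightly/out_eia__yearly_plant_parts.parquet"
ppe_part_df = pd.read_parquet(url)

# Cast explicitly after reading, so downstream code that expects a
# datetime64 report_date doesn't break on an object-typed column.
ppe_part_df["report_date"] = pd.to_datetime(ppe_part_df["report_date"])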
Looks good!
Overview
This PR updates the devtools notebook for FERC to EIA record linkage with splink. This notebook isn't actually run in production, but it's helpful for visualizing the model weights and the decisions that go into making the match.
What problem does this address?
The notebook now runs with the latest version of splink and reads in Parquet inputs from S3 instead of from Dagster storage.
What did you change?
Updated the notebook to have more documentation cells and run with the latest version of splink. Also fixed an error that was cropping up in the plant parts list creation with datetime types. I think this came up because I'm now reading in Parquet inputs from S3 instead of Dagster storage, and the report_date column was an object instead of a datetime.
Documentation
Make sure to update relevant aspects of the documentation.
Tasks
Testing
How did you make sure this worked? How can a reviewer verify this?
Run the FERC to EIA match notebook from top to bottom.
To-do list