Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix validation test_fbp_ferc1_mismatched_fuels error #3025

Merged
merged 3 commits into from
Nov 8, 2023

Conversation

cmgosnell
Copy link
Member

@cmgosnell cmgosnell commented Nov 7, 2023

PR Overview

We were getting the following error:

FAILED test/validate/fbp_ferc1_test.py::test_fbp_ferc1_mismatched_fuels[ferc1_annual] - AssertionError: Too many records (9.69%) have mismatched primary fuel types.

I determined that this was happening bc the nulls in the fuel table were somehow getting converted to "nan":

image

first solution:

A part of the pudl.analysis.classify_plants_ferc1.fuel_by_plant_ferc1 step does convert nulls to "" and there is a post-processing step that we were already applying (pudl.analysis.classify_plants_ferc1.revert_filled_in_string_nulls). I simply added "nan" in here.

Still tbh do no know why these "nan"s were all of a sudden appearing but hey.. they are gone now 👀

current solution:

it looooks as if the "nan" is added in right here...

for fuel_str in fuel_categories:
    try:
        mmbtu_mask = df[f"{fuel_str}_fraction_mmbtu"] > thresh
        df.loc[mmbtu_mask, "primary_fuel_by_mmbtu"] = fuel_str
    except KeyError:
        pass

    try:
        cost_mask = df[f"{fuel_str}_fraction_cost"] > thresh
        df.loc[cost_mask, "primary_fuel_by_cost"] = fuel_str
    except KeyError:
        pass

So I stopped em in their tracks 🚥 . these columns don't exist before this loop and not all of the records get labeled with a fuel_str bc not all of the records meet the thresh so I added the whole column before the loop w/ values of pd.NA

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@zaneselvans
Copy link
Member

This is disconcerting. Previously the ID assignment process was expecting empty strings in there. Rather than turning the mysterious "nan” strings into np.nan at the end did you look at the place where the np.nan values are turned into empty strings on the way in?

Copy link
Member

@zaneselvans zaneselvans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the code snipped you cite as the source of the "nan" values, how are they getting there?

for fuel_str in fuel_categories:
    try:
        mmbtu_mask = df2[f"{fuel_str}_fraction_mmbtu"] > thresh
        df2.loc[mmbtu_mask, "primary_fuel_by_mmbtu"] = fuel_str
    except KeyError:
        pass

    try:
        cost_mask = df2[f"{fuel_str}_fraction_cost"] > thresh
        df2.loc[cost_mask, "primary_fuel_by_cost"] = fuel_str
    except KeyError:
        pass

Is it that "nan" is somehow showing up the fuel_categories? That doesn't seem right, since it's just the keys of the pudl.transform.params.ferc1.FUEL_CATEGORIES["categories"] dictionary.

src/pudl/analysis/classify_plants_ferc1.py Show resolved Hide resolved
src/pudl/analysis/classify_plants_ferc1.py Show resolved Hide resolved
Copy link

codecov bot commented Nov 8, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f25e921) 88.7% compared to head (b7533cf) 88.7%.
Report is 1 commits behind head on dev.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #3025   +/-   ##
=====================================
  Coverage   88.7%   88.7%           
=====================================
  Files         91      91           
  Lines      11010   11011    +1     
=====================================
+ Hits        9768    9769    +1     
  Misses      1242    1242           
Files Coverage Δ
src/pudl/analysis/classify_plants_ferc1.py 92.5% <100.0%> (+<0.1%) ⬆️
src/pudl/output/ferc1.py 88.2% <ø> (-0.1%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cmgosnell
Copy link
Member Author

Is it that "nan" is somehow showing up the fuel_categories? That doesn't seem right, since it's just the keys of the pudl.transform.params.ferc1.FUEL_CATEGORIES["categories"] dictionary.

no, the fuel cats have a specific na_category but that definitely isn't the way the "nan"s are coming from. see my comment above

@cmgosnell cmgosnell merged commit a2bdffa into dev Nov 8, 2023
9 checks passed
@cmgosnell cmgosnell deleted the fix_mystery_fuel_cat_nan branch November 8, 2023 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants