
Issue 3770 transform phmsagas data #3929

Conversation

@seeess1 (Contributor) commented Oct 24, 2024:

Overview

This is still WIP!

Relates to #3770.

What problem does this address?

The goal of this PR is to complete the first transformation of raw PHMSA data into a core asset.

What did you change?

  • Added a new transform script for PHMSA data.
  • Made one helper method (fix_eia_na) more generic since it can be applied across data sets (and updated references accordingly).
  • Added a few other helper methods.
  • Updated column map files with 2023 values (pulled from whatever was in 2022).
  • Specifically, changed the ordering of columns in src/pudl/package_data/phmsagas/column_maps/yearly_distribution.csv since we had fax columns mapped to emails and vice versa.

Will come back to everything below once this draft PR has gone through an initial round of review.

Documentation

Make sure to update relevant aspects of the documentation.

Tasks


Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list



@e-belfer e-belfer self-requested a review October 24, 2024 18:11
@e-belfer added the phmsa (Data from the Pipeline and Hazardous Material Safety Administration) and community labels Oct 24, 2024
@e-belfer (Member) left a comment:

This is looking really good! A few comments on what's already been done, you're making great progress.

Some next steps:

  • the values in percent_unaccounted_for_gas look like they could use some normalization (I see some negative values)
  • once you're happy with the table, we can go ahead and write it to the database. See here for guidance on how to do this.

src/pudl/helpers.py (resolved conversation)

# Remove unwanted characters (parentheses, spaces, periods, and dashes) from the main phone number
phone_main = re.sub(
    r"[^\d]", "", phone_main.replace(".0", "")
)
Member:

Can you explain what the special instance of replace is doing here?

seeess1 (author):

There were lots of phone numbers in the raw data with ".0" at the end of the value. So I'm using replace to remove those first.

Member:

You might want to use a regex that only matches this pattern at the end of the string to avoid accidentally removing characters from the middle of a number that could be reported like 303.443.0123

seeess1 (author):

Good point. I'm removing that logic. The regex should be all that's needed here.

seeess1 (author):

Actually I'm gonna leave in one line toward the top to get rid of these substrings just to make sure the rest of the number standardization is working with cleaner data.

Updated.
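
For reference, a minimal sketch of the end-anchored cleanup being discussed (the raw value and variable names are illustrative, not the PR's actual code):

import re

raw = "303.443.0123"             # a dotted US number that a bare .replace(".0", "") would mangle
raw = re.sub(r"\.0$", "", raw)   # only strip a trailing ".0" float artifact (no match for this value)
digits = re.sub(r"\D", "", raw)  # then drop every remaining non-digit character
print(digits)                    # prints 3034430123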

intl_code = phone_main[:-10] # Digits before the last 10
main_number = phone_main[-10:] # Last 10 digits are the phone number
formatted_phone = (
    f"+{intl_code}-{main_number[:3]}-{main_number[3:6]}-{main_number[6:]}"
)
Member:

International phone numbers don't all follow the pattern 3-3-#, so we probably just want to report the number without enforcing formatting in this case.

seeess1 (author):

Updating that now.

seeess1 (author):

Updated
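
A sketch of one plausible reading of the final rule (hyphenate 10-digit US numbers, pass longer international numbers through with just a leading "+"); the helper name is hypothetical:

def format_digits(digits: str) -> str:
    """Hyphenate 10-digit US numbers; longer numbers keep their digits with only a "+" prefix."""
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    if len(digits) > 10:
        return f"+{digits}"  # don't force a 3-3-4 pattern onto international numbers
    return digits  # fewer than 10 digits: returned with no hyphens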

additional_information,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,,,,,,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information
preparer_name,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name
preparer_title,,,,,,,,,,,,,,,,,,,,,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title
preparer_phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,pphone,pphone,pphone,pphone,pphone,pphone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone
Member:

Good catch!

@asset(
ins={"raw_data": AssetIn(key="raw_phmsagas__yearly_distribution")},
io_manager_key=None, # TODO: check on this one ("pudl_io_manager"? something else?)
compute_kind=None, # TODO: check on this one
Member:

All this does is add a pretty graphic to dagster; it's totally optional, but if you want to include it, use "pandas".

seeess1 (author):

Good to know. I'll play around with this a little bit more.

# Ensure all "report_year" values have four digits
mask = df["report_year"] < 100

# Convert 2-digit years to appropriate 4-digit format (assume cutoff at year 50)
Member:

Are you seeing values as low as 50? Anything lower than 1990 would be a red flag for me at this point (at least until we bring in the older data in #3290).

Member:

If this column is unreliable, we can also parse the spreadsheet data year and add it as a column during the extraction.

seeess1 (author):

No values below 50 in the data right now. I mostly put this in to keep it flexible enough to also be able to handle post-2000 data if we started seeing values like "00", "01", etc. The column still feels pretty reliable to me overall. It was just a 2 vs 4 digit year formatting issue.
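
A minimal sketch of the cutoff-at-50 logic described here (illustrative only, not the PR's exact code):

import numpy as np
import pandas as pd

def expand_two_digit_years(df: pd.DataFrame, cutoff: int = 50) -> pd.DataFrame:
    """Expand 2-digit report years: values at or above the cutoff become 19xx, values below become 20xx."""
    mask = df["report_year"] < 100
    df.loc[mask, "report_year"] = np.where(
        df.loc[mask, "report_year"] >= cutoff,
        df.loc[mask, "report_year"] + 1900,  # e.g. 97 -> 1997
        df.loc[mask, "report_year"] + 2000,  # e.g. 1 -> 2001
    )
    return df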

return exception_cols


def standardize_state_columns(df: pd.DataFrame, state_columns: list) -> pd.DataFrame:
Member:

See POLITICAL_SUBDIVISIONS in pudl.metadata.dfs - there's already a mapping that can be used to convert names to state code, so this function should be redundant.

seeess1 (author):

Updated

].apply(lambda col: col.str.title())

# List of state columns to standardize
df = standardize_state_columns(
Member:

See my comments in the helpers function - use POLITICAL_SUBDIVISIONS here instead of creating a new function, but this is definitely a step we still want to take! Would also be good to check that there aren't any non-valid entries here.

@zaneselvans - is there any conceivable reason we don't already use an enum constraint on the state field in fields.py? I can't imagine us not wanting to enforce restrictions on this one.

@zaneselvans (Member) commented Oct 26, 2024:

I think we've used an enum constraint for the state column in EPA CEMS because it's a huge table and we want it to be interpreted as a categorical value to save space, and we don't use foreign key constraints in the hourly data that's only written to parquet.

But for tables going into the database we can either constrain the value by defining a data source or table specific enum constraint on the field (see the bottom of pudl/metadata/fields.py -- this would result in it being a categorical dtype) or by defining a foreign key constraint that points to the subdivision_code column of the core_pudl__codes_subdivisions (which would result in it remaining a string).

We haven't historically applied either kind of constraint on the state fields, but we really should! The enum route is probably the simpler for now.

seeess1 (author):

Getting rid of my duplicate function and utilizing POLITICAL_SUBDIVISIONS. Plus adding an enum constraint to "state" now.
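
A rough sketch of what that lookup could look like, assuming the POLITICAL_SUBDIVISIONS columns referenced elsewhere in this thread (subdivision_type, subdivision_name, subdivision_code); the helper name is hypothetical:

import pandas as pd
from pudl.metadata.dfs import POLITICAL_SUBDIVISIONS

def state_names_to_codes(col: pd.Series) -> pd.Series:
    """Map full US state names to two-letter codes; values that don't match are left as-is (uppercased)."""
    states = POLITICAL_SUBDIVISIONS[POLITICAL_SUBDIVISIONS.subdivision_type == "state"]
    name_to_code = dict(zip(states.subdivision_name.str.title(), states.subdivision_code))
    return col.str.title().map(name_to_code).fillna(col.str.upper())

Applied per column, e.g. df["office_address_state"] = state_names_to_codes(df["office_address_state"]).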

seeess1 (author):

Updated

# (on the scale of thousands) do not actually sum up to "excavation_damage_total".
df[YEARLY_DISTRIBUTION_OPERATORS_COLUMNS["columns_with_nas_to_fill"]] = df[
    YEARLY_DISTRIBUTION_OPERATORS_COLUMNS["columns_with_nas_to_fill"]
].fillna(0)
Member:

We distinguish between NA as "not reported" and 0 as "reported as 0", so filling this would actually change the meaning here (from "I didn't fill this in" to "I had 0 incidents"). Please drop this step.

seeess1 (author):

Updated

)

# Fill NA values with zeroes because these columns are simply counts.
# Note that "excavation_damage..." columns should sum up to the value in "excavation_damage_total". However, many rows
Member:

We should definitely add this validation as an asset check, and use fillna(0) there to do the comparison calculation.

seeess1 (author):

Got it. I'm seeing that even if I fill in NAs with zero, there are almost 4k rows where the excavation_damage... values don't sum to excavation_damage_total. How would you recommend handling these?
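
The comparison the asset check would need could be as simple as something like this (a sketch; only excavation_damage_total comes from the thread, and the component column names are placeholders):

import pandas as pd

EXCAVATION_PART_COLS = [
    "excavation_damage_one_call",  # placeholder component column names
    "excavation_damage_locating",
    "excavation_damage_excavation_practices",
]

def count_excavation_total_mismatches(df: pd.DataFrame) -> int:
    """Count rows where the component columns (NAs treated as 0) don't add up to the reported total."""
    parts_sum = df[EXCAVATION_PART_COLS].fillna(0).sum(axis=1)
    total = df["excavation_damage_total"].fillna(0)
    return int((parts_sum != total).sum())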

}

for col in state_columns:
    df[col] = df[col].replace(state_abbreviations).str.upper()
Member:

When you have a complete mapping between old and new values, the col.map() method should be much faster. However any value that appears in the series but not in the dictionary defining the mapping gets converted to NA.
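
A small sketch of that map-with-fallback pattern, reusing the names from the snippet above (assuming state_abbreviations maps full names to codes):

# .map() is faster than .replace() for a complete mapping, but unmapped values become NA,
# so fall back to the original value for anything not in the dictionary before uppercasing.
for col in state_columns:
    df[col] = df[col].map(state_abbreviations).fillna(df[col]).str.upper()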


Comment on lines 2191 to 2203
"""Standardize phone numbers in the specified columns of the DataFrame.
US numbers: ###-###-####
International numbers with the international code at the beginning.
Numbers with extensions will be appended with "x#".
Non-numeric entries will be returned as np.nan. Entries with fewer than
10 digits will be returned with no hyphens.

Args:
df: The DataFrame to modify.
columns: A list of the names of the columns that need to be standardized
Returns:
The modified DataFrame with standardized phone numbers in the same column.
"""
Member:

This docstring won't render correctly in the docs. Check how it looks in the built API documentation.

Suggested change

"""Standardize phone numbers in the specified columns of the DataFrame.
US numbers: ###-###-####
International numbers with the international code at the beginning.
Numbers with extensions will be appended with "x#".
Non-numeric entries will be returned as np.nan. Entries with fewer than
10 digits will be returned with no hyphens.
Args:
df: The DataFrame to modify.
columns: A list of the names of the columns that need to be standardized
Returns:
The modified DataFrame with standardized phone numbers in the same column.
"""

"""Standardize phone numbers in the specified columns of the DataFrame.

US numbers: ###-###-####
International numbers with the international code at the beginning.
Numbers with extensions will be appended with "x#".
Non-numeric entries will be returned as np.nan. Entries with fewer than
10 digits will be returned with no hyphens.

Args:
    df: The DataFrame to modify.
    columns: A list of the names of the columns that need to be standardized

Returns:
    The modified DataFrame with standardized phone numbers in the same column.
"""

seeess1 (author):

Updated

Comment on lines 2264 to 2267
"""Analyze columns of a DataFrame for missing or invalid values. Note that this is purely for analysis
and does not perform any data transformation or cleaning.
This function checks each column for missing or custom missing values and prints
a summary of the findings for string (object), numeric, and datetime columns.
Member:

Suggested change
"""Analyze columns of a DataFrame for missing or invalid values. Note that this is purely for analysis
and does not perform any data transformation or cleaning.
This function checks each column for missing or custom missing values and prints
a summary of the findings for string (object), numeric, and datetime columns.
"""Analyze columns of a DataFrame for missing or invalid values.
Note that this is purely for analysis and does not perform any data
transformation or cleaning. This function checks each column for
missing or custom missing values and prints a summary of the findings
for string (object), numeric, and datetime columns.

seeess1 (author):

Updated

@seeess1 (author) commented Oct 30, 2024:

This is looking really good! A few comments on what's already been done, you're making great progress.

Some next steps:

  • the values in percent_unaccounted_for_gas look like they could use some normalization (I see some negative values)
  • once you're happy with the table, we can go ahead and write it to the database. See here for guidance on how to do this.

Ah yep I meant to point that part out about percent_unaccounted_for_gas. I noticed that too. I honestly wasn't sure how we wanted to handle this one. Any suggestions?

@zaneselvans (Member) commented:

Would percent_unaccounted_for_gas < 0 mean that extra gas appeared from somewhere?

Is the distribution of the absolute values of positive and negative values similar?

Does it seem likely that different respondents just assumed different sign conventions here? Can we see what question they're responding to and if it is phrased ambiguously? Like is "5% of the gas was unaccounted for" supposed to be -5% or 5%?

@seeess1 (author) commented Nov 3, 2024:

@e-belfer @zaneselvans I've responded to the initial round of comments. I still haven't solved the riddle of the negative unaccounted gas values. And I need to take a closer look at io_manager_key and compute_kind. But hopefully the PR is starting to shape up here.

@e-belfer e-belfer self-requested a review November 4, 2024 14:00
@e-belfer (Member) left a comment:

A few minor comments, but I think you're basically ready to move on to writing this to the PUDL DB. The next steps should be outlined in the issue, but obviously let me know if you have questions about anything!

Regarding the unaccounted for gas - for now, if there's not a clear pattern in the problem, I think knowing the scale of the problem is useful, and we should write an asset check that will give us a sense of whether the problem changes in scale in the future. But, we can hold off on actually trying to fix this problem in the PR.

& (POLITICAL_SUBDIVISIONS.subdivision_type == "state"),
"subdivision_name",
]
)
Member:

Can you label this one and put it up near its country code peers at the top of this file, so it's easier to find? E.g.,

Suggested change
)
)
""" Two-letter ANSI state codes."""

seeess1 (author):

Updated

@@ -4249,6 +4250,7 @@
# different tables. E.g. state of the utility's HQ address vs. state that a
# plant is located in vs. state in which a utility provides service.
"description": "Two letter US state abbreviation.",
"constraints": {"enum": US_STATE_CODES},
Member:

To enforce this, we'll need to make a new alembic migration (you can do this alongside adding the table to the database). E.g.,:

alembic revision --autogenerate -m "Enforce state constraint and whatever other changes you're making" && alembic upgrade head

See here for more info.

Note that this might kick up some errors in other tables that have invalid values we haven't noticed before, so we should keep an eye out for this when rerunning all the assets.

seeess1 (author):

I'm going to work on this part next.

df = df.replace(na_patterns, np.nan, regex=True)

# Attempt to infer original types where possible to avoid unwanted type changes
df = df.infer_objects()
Member:

We do a lot of dtype manipulation and enforcement downstream of this function, which typically gets called as one of the first steps when dtypes are poorly enforced. I think we ought to drop this step, as I'm worried it'll introduce some unforeseen consequences.

seeess1 (author):

Updated

"""Replace common ill-posed EIA NA spreadsheet values with np.nan.

Currently replaces empty string, single decimal points with no numbers,
any single whitespace character, and hyphens with np.nan.
any single whitespace character, and hyphens with np.nan. Also converts
columns to their original types where possible (using 'infer_objects')
Member:

See below re: my thoughts on this part.

seeess1 (author):

Could you point me to the part you're talking about? I'm having a hard time finding it.

Member:

Ah this was vaguely in reference to my comment about convert_dtypes() vs infer_objects(), sorry that wasn't clearer!

seeess1 (author):

Whoops! Looks like my documentation was misleading. No call is made to infer_objects in this method. I've removed that part of the function description.

and does not perform any data transformation or cleaning.
This function checks each column for missing or custom missing values and prints
a summary of the findings for string (object), numeric, and datetime columns.
def analyze_missing_values(
Member:

@seeess1 Am I right that this currently isn't getting called anywhere? I'm assuming you were using it in the notebook / to get a quick lay of the land of what was going on in each column.

I could see the value of keeping a version of this as a parameterized asset check on a per-table basis (e.g., make sure no more than 10% of values per year are NAs) or as dagster asset metadata attached to an asset, but we certainly don't want to have 5 log warnings for each column we're analyzing as it'll be too much noise to interpret. However, I think that's out of scope for this PR.

In the interim, would you be up for noting in the docstring that this function should be used during development as a helping tool and not in any actual final code?

seeess1 (author):

Yep all of that is correct. This was just a tool for analysis that could be used in the middle of dev work.

But I honestly don't feel too precious about it. I'll add a note in the docstring now to make sure that it's clear this is just an analytical tool and not one used in any data transformation. But I'm also totally okay if you want to drop it entirely.

seeess1 (author):

Updated

def core_phmsagas__yearly_distribution_operators(
raw_data: pd.DataFrame,
) -> pd.DataFrame:
"""Build the :ref:`core_phmsagas__yearly_distribution_operators`."""
Member:

Here, include a short bulleted list of the transforms you make in this function so that this shows up in our docs. See pudl.transform.eia860 for some examples.

seeess1 (author):

Updated

]

# Fill NaN values with pd.NA, then cast to "Int64" nullable integer type
df[cols_to_convert] = df[cols_to_convert].fillna(pd.NA).astype("Int64")
Member:

Non-blocking: convert_dtypes() might more simply achieve what you're trying to do here, since it handles the conversion of nulls to Pandas nullable dtypes.
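
In other words, something like the following (illustrative only, using cols_to_convert from the snippet above):

# Explicit cast: floats containing NaN become pandas' nullable Int64, with NaN mapped to pd.NA.
df[cols_to_convert] = df[cols_to_convert].astype("Int64")

# Or let pandas choose nullable dtypes for those columns in one step.
df[cols_to_convert] = df[cols_to_convert].convert_dtypes()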

seeess1 (author):

Updated

"preparer_phone",
"preparer_title",
],
"columns_to_convert_to_ints": [
Member:

federal_land_leaks_repaired_or_scheduled should also be in this list, I believe!

seeess1 (author):

Good catch

seeess1 (author):

Updated

"services_shutoff_valve_in_system",
"services_shutoff_valve_installed",
],
"capitalization_exclusion": ["headquarters_address_state", "office_address_state"],
Member:

preparer_email should also get excluded here, emails are case insensitive

seeess1 (author):

Updated

@@ -206,7 +206,7 @@ class ZenodoDoiSettings(BaseSettings):
ferc60: ZenodoDoi = "10.5281/zenodo.13149090"
ferc714: ZenodoDoi = "10.5281/zenodo.13149091"
gridpathratoolkit: ZenodoDoi = "10.5281/zenodo.10892394"
phmsagas: ZenodoDoi = "10.5281/zenodo.10493790"
phmsagas: ZenodoDoi = "10.5281/zenodo.13624526"
Member:

@seeess1 One more thing to flag - we published new PHMSA data this month, so the DOI should now be:

Suggested change
phmsagas: ZenodoDoi = "10.5281/zenodo.13624526"
phmsagas: ZenodoDoi = "10.5281/zenodo.14026420"

seeess1 (author):

Updated

@seeess1 (author) commented Nov 24, 2024:

A few minor comments, but I think you're basically ready to move on to writing this to the PUDL DB. The next steps should be outlined in the issue, but obviously let me know if you have questions about anything!

Regarding the unaccounted for gas - for now, if there's not a clear pattern in the problem, I think knowing the scale of the problem is useful, and we should write an asset check that will give us a sense of whether the problem changes in scale in the future. But, we can hold off on actually trying to fix this problem in the PR.

Added an asset check to help us keep an eye out for any significant upticks here.

I'm gonna work on the final steps for getting this into the PUDL DB.

@zaneselvans (Member) commented:

I could see the change of adding an ENUM constraint on the state field being complicated since it changes the type to categorical rather than string which may make it hard to work with different state columns in different tables together. If that's a frustration or a hold up then I think just cleaning the column and then leaving it unconstrained (until we can come back and really deal with state across the board) is also totally fine.

@seeess1 (author) commented Dec 7, 2024:

Hey @e-belfer! Made some more updates (commit here):

  • Updated the __init__.py files.
  • Added a new PHMSA resource in pudl.metadata.resources.
  • Added new fields.

I think the last couple things are for me to:

  • Define the PK (more on that below).
  • Create a migration.
  • Run all tests to make sure nothing is off.

Does that sound right?

Primary keys

Regarding primary keys, I figured that the logical PK identifier for this asset would be the combined operator_id_phmsa and report_number. But apparently there are a few rows with the same values here but slightly different metadata (for example, different office addresses for each row). I've attached the output from filtering down to rows that have the same combined values for these columns. Let me know how you'd like to handle these.
duplicates.csv

@e-belfer (Member) commented Dec 9, 2024:

@seeess1 Always some annoying duplicates! Low stakes here since all the actual data is luckily identical, but thanks for catching this. Some thoughts:

  • The last four look completely identical to me and could get handled by df.drop_duplicates() unless I'm missing some tiny difference?
  • For Williamsville, the one that doesn't have any "testing" in the names will be the one we want.
  • For the other 2 cases, would prefer to just have a consistent rule and in this case the first row for each looks intuitively more correct (both city addresses actually match the system they're located in). Would be good to add a tiny assertion in here that checks that you aren't dropping more than the expected number of rows of data (2) in case this becomes a huge problem in the next year of data, e.g.
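
A hedged sketch of how the drop_duplicates() step, the "test"-row rule, and the guard-rail assertion could fit together (column names come from this thread; the tolerance is a placeholder, and the "keep the first row" rule for the remaining two cases is omitted):

import pandas as pd

def dedupe_operator_reports(df: pd.DataFrame, max_expected_drops: int = 6) -> pd.DataFrame:
    """Drop exact duplicates, then prefer non-"test" rows within each (operator_id_phmsa, report_number) group."""
    n_before = len(df)
    df = df.drop_duplicates()
    is_test = (
        df["office_address_street"].str.contains("test", case=False, na=False)
        | df["headquarters_address_street"].str.contains("test", case=False, na=False)
    )
    group_keys = [df["operator_id_phmsa"], df["report_number"]]
    group_has_non_test = ~is_test.groupby(group_keys).transform("all")
    df = df[~(is_test & group_has_non_test)]
    dropped_rows = n_before - len(df)
    assert dropped_rows <= max_expected_drops, f"Dropped more rows than expected: {dropped_rows}"
    return df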

@e-belfer (Member) left a comment:

Looking on track! I can do a final review once the last little bits are in place, but your remaining to-do list looks right to me.


Args:
df: The DataFrame to analyze.
custom_missing_values: Optional list of custom values to consider
custom_missing_values: Optional list of custom values to consider
Member:

Missing an indent here, I think!

seeess1 (author):

Updated

"Percentage is calculated as follows: "
"Take the sum of: purchased gas + produced gas minus customer use + company use "
"+ appropriate adjustments. Then divide by the sum of customer use + company use "
"+ appropriate adjustments. Multiply the output by 100."
Member:

Out of curiosity, were you seeing lots of values between 0-1 in this column, or mostly 0-100?

seeess1 (author):

Here are the count breakdowns I'm seeing:

percent_unaccounted_for_gas == 0        14787
0 < percent_unaccounted_for_gas <= 1    11041
1 < percent_unaccounted_for_gas < 50    23665
percent_unaccounted_for_gas >= 50          51

@seeess1 (author) commented Dec 12, 2024:

Looking on track! I can do a final review once the last little bits are in place, but your remaining to-do list looks right to me.

Pushed up my latest updates. Let me know what you think.

@e-belfer e-belfer self-requested a review December 17, 2024 17:11
@e-belfer (Member) left a comment:

Right now the pandera schema check is failing when I run this locally, and I'm not totally sure why based on the metadata returned from the check failure.

Are you seeing the same issue on your end? I'd suggest getting this into alembic and written to the DB and then we can try to debug the problem, as we have some dtype enforcement built into the IO managers.

Otherwise, one recommendation about the other asset checks: simplify them. I think the way you've set this up won't work conceptually, since the check gets called after the problematic data has already been dropped. Let me know if that all makes sense?

(As always, just let me know if you need any additional support on my end to get this over the line!)


if error:
@asset_check(asset=spec.asset, blocking=True)
Member:

Not entirely following the logic here - the asset check will run on the df after you've dropped these rows in the transform step to create a unique PK, so the non_unique_groups returned in this asset check should always be 0. Rather than writing an asset check here, we'll want to actually embed this assertion in the code itself.

Could be as simple as something like this in the actual filtering functions you've written:

dropped_rows = unfiltered_df.shape[0] - filtered_df.shape[0]
assert dropped_rows <= 3, f"Dropped more than expected number of rows due to non-unique groups. Rows dropped: {dropped_rows}"

Comment on lines +208 to +215
1. For any group of rows with the same combination of "operator_id_phmsa"
and "report_number":
- If at least one row in the group does not contain the string "test"
(case-insensitive) in either "office_address_street" or
"headquarters_address_street", keep only the rows in the group
that do not contain "test" in these columns.
- If all rows in the group contain "test" in either of the columns,
leave the group unchanged.
Member:

The indentation here is causing make docs-build to fail. Should look like:

Suggested change

1. For any group of rows with the same combination of "operator_id_phmsa"
and "report_number":
- If at least one row in the group does not contain the string "test"
(case-insensitive) in either "office_address_street" or
"headquarters_address_street", keep only the rows in the group
that do not contain "test" in these columns.
- If all rows in the group contain "test" in either of the columns,
leave the group unchanged.

1. For any group of rows with the same combination of "operator_id_phmsa"
   and "report_number":

   - If at least one row in the group does not contain the string "test"
     (case-insensitive) in either "office_address_street" or
     "headquarters_address_street", keep only the rows in the group
     that do not contain "test" in these columns.
   - If all rows in the group contain "test" in either of the columns,
     leave the group unchanged.

return AssetCheckResult(passed=False, metadata={"errors": error})

return AssetCheckResult(passed=True)

return _check
return [_check_percent_unaccounted_for_gas, _check_pk_deduplication]
Member:

Right now these asset checks aren't loaded into dagster, at least partially because the function to produce them isn't called anywhere, but possibly also because you're returning a list of checks here rather than just one check. If we aren't parameterizing the checks (e.g., specifying expected values by year) then looking at the way in which multiple checks are defined in eia860.py would be a better analogy than the FERC 714 model.
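
For illustration, the suggested shape is roughly one stand-alone, module-level check per condition rather than a factory returning a list; everything below (names, the threshold, and the way the asset value is passed in) is an assumption, not the repo's actual code:

import pandas as pd
from dagster import AssetCheckResult, asset_check

@asset_check(asset="core_phmsagas__yearly_distribution_operators", blocking=False)
def check_percent_unaccounted_for_gas(
    core_phmsagas__yearly_distribution_operators: pd.DataFrame,
) -> AssetCheckResult:
    """Track how many negative percent_unaccounted_for_gas values show up in each release."""
    df = core_phmsagas__yearly_distribution_operators
    n_negative = int((df["percent_unaccounted_for_gas"] < 0).sum())
    return AssetCheckResult(
        passed=bool(n_negative < 5000),  # placeholder tolerance
        metadata={"negative_values": n_negative},
    )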

@e-belfer e-belfer self-assigned this Jan 6, 2025
@e-belfer e-belfer mentioned this pull request Jan 7, 2025
@e-belfer (Member) commented Jan 7, 2025:

@seeess1 Thanks for all your work on this one! I'm closing this and picking up your work in #4005 to get this over the last few hurdles!

@e-belfer e-belfer closed this Jan 7, 2025