Issue 3770 transform phmsagas data #3929
Conversation
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks.
For more information, see https://pre-commit.ci
This is looking really good! A few comments on what's already been done; you're making great progress.
Some next steps:
- The values in percent_unaccounted_for_gas look like they could use some normalization (I see some negative values).
- Once you're happy with the table, we can go ahead and write it to the database. See here for guidance on how to do this.
src/pudl/helpers.py
Outdated
# Remove unwanted characters (parentheses, spaces, periods, and dashes) from the main phone number
phone_main = re.sub(
    r"[^\d]", "", phone_main.replace(".0", "")
)
Can you explain what the special instance of replace is doing here?
There were lots of phone numbers in the raw data with ".0" at the end of the value. So I'm using replace to remove those first.
You might want to use a regex that only matches this pattern at the end of the string to avoid accidentally removing characters from the middle of a number that could be reported like 303.443.0123
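For example, a standalone sketch (the variable name comes from the snippet above):

```python
import re

phone_main = "303.443.0123"
# Unanchored, re.sub(r"\.0", "", ...) would mangle this to "303.443123";
# anchored at the end of the string it only strips a literal trailing ".0".
cleaned = re.sub(r"\.0$", "", phone_main)  # "303.443.0123" is left unchanged
```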
Good point. I'm removing that logic. The regex should be all that's needed here.
Actually I'm gonna leave in one line toward the top to get rid of these substrings just to make sure the rest of the number standardization is working with cleaner data.
Updated.
src/pudl/helpers.py
Outdated
intl_code = phone_main[:-10]  # Digits before the last 10
main_number = phone_main[-10:]  # Last 10 digits are the phone number
formatted_phone = (
    f"+{intl_code}-{main_number[:3]}-{main_number[3:6]}-{main_number[6:]}"
)
International phone numbers don't all follow the pattern 3-3-#, so we probably just want to report the number without enforcing formatting in this case.
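A minimal sketch of that change (assuming the digits-only phone_main produced earlier in the function; the sample value is hypothetical):

```python
phone_main = "443134960000"  # hypothetical digits-only value after cleaning
if len(phone_main) == 10:
    # US numbers: ###-###-####
    formatted_phone = f"{phone_main[:3]}-{phone_main[3:6]}-{phone_main[6:]}"
else:
    # International numbers: keep the digits as-is, no grouping enforced
    formatted_phone = f"+{phone_main}"
```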
Updating that now.
Updated
additional_information,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,,,,,,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information
preparer_name,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name
preparer_title,,,,,,,,,,,,,,,,,,,,,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title
preparer_phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,pphone,pphone,pphone,pphone,pphone,pphone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone
Good catch!
@asset(
    ins={"raw_data": AssetIn(key="raw_phmsagas__yearly_distribution")},
    io_manager_key=None,  # TODO: check on this one ("pudl_io_manager"? something else?)
    compute_kind=None,  # TODO: check on this one
)
All this does is add a pretty graphic to dagster; it's totally optional, but if you want to include it, use pandas.
Good to know. I'll play around with this a little bit more.
# Ensure all "report_year" values have four digits
mask = df["report_year"] < 100

# Convert 2-digit years to appropriate 4-digit format (assume cutoff at year 50)
Are you seeing values as low as 50? Anything lower than 1990 would be a red flag for me at this point (at least until we bring in the older data in #3290).
If this column is unreliable, we can also parse the spreadsheet data year and add it as a column during the extraction.
No values below 50 in the data right now. I mostly put this in to keep it flexible enough to also be able to handle post-2000 data if we started seeing values like "00", "01", etc. The column still feels pretty reliable to me overall. It was just a 2 vs 4 digit year formatting issue.
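For reference, the cutoff logic being described might look roughly like this sketch (pandas/numpy assumed; the cutoff at 50 is the assumption under discussion):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"report_year": [90, 99, 1, 21, 2015]})
mask = df["report_year"] < 100
df.loc[mask, "report_year"] = np.where(
    df.loc[mask, "report_year"] >= 50,
    df.loc[mask, "report_year"] + 1900,  # e.g. 90 -> 1990
    df.loc[mask, "report_year"] + 2000,  # e.g. 21 -> 2021
)
# report_year is now [1990, 1999, 2001, 2021, 2015]
```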
src/pudl/helpers.py
Outdated
    return exception_cols


def standardize_state_columns(df: pd.DataFrame, state_columns: list) -> pd.DataFrame:
See POLITICAL_SUBDIVISIONS in pudl.metadata.dfs - there's already a mapping that can be used to convert names to state codes, so this function should be redundant.
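A rough sketch of what that might look like (the subdivision_* columns are referenced elsewhere in this thread; the country_code == "USA" filter is an assumption):

```python
from pudl.metadata.dfs import POLITICAL_SUBDIVISIONS

states = POLITICAL_SUBDIVISIONS[
    (POLITICAL_SUBDIVISIONS.country_code == "USA")
    & (POLITICAL_SUBDIVISIONS.subdivision_type == "state")
]
name_to_code = dict(zip(states.subdivision_name, states.subdivision_code))

# Full state names become two-letter codes; existing codes pass through unchanged
for col in state_columns:
    df[col] = df[col].str.title().map(name_to_code).fillna(df[col].str.upper())
```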
Updated
src/pudl/transform/phmsagas.py
Outdated
].apply(lambda col: col.str.title())

# List of state columns to standardize
df = standardize_state_columns(
See my comments in the helpers function - use POLITICAL_SUBDIVISIONS here instead of creating a new function, but this is definitely a step we still want to take! Would also be good to check that there aren't any invalid entries here.
@zaneselvans - is there any conceivable reason we don't already use an enum constraint on the state field in fields.py? I can't imagine us not wanting to enforce restrictions on this one.
I think we've used an enum constraint for the state column in EPA CEMS because it's a huge table and we want it to be interpreted as a categorical value to save space, and we don't use foreign key constraints in the hourly data that's only written to parquet.
But for tables going into the database, we can either constrain the value by defining a data source or table specific enum constraint on the field (see the bottom of pudl/metadata/fields.py -- this would result in it being a categorical dtype), or by defining a foreign key constraint that points to the subdivision_code column of core_pudl__codes_subdivisions (which would result in it remaining a string).
We haven't historically applied either kind of constraint on the state fields, but we really should! The enum route is probably the simpler one for now.
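For reference, the enum route might look roughly like this in pudl/metadata/fields.py (a sketch; the surrounding dict structure and the import path are assumptions):

```python
from pudl.metadata.enums import US_STATE_CODES  # assumed location of the enum

FIELD_METADATA = {
    "state": {
        "type": "string",
        "description": "Two letter US state abbreviation.",
        "constraints": {"enum": US_STATE_CODES},
    },
}
```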
Getting rid of my duplicate function and utilizing POLITICAL_SUBDIVISIONS. Plus adding an enum constraint to "state" now.
Updated
src/pudl/transform/phmsagas.py
Outdated
# (on the scale of thousands) do not actually sum up to "excavation_damage_total".
df[YEARLY_DISTRIBUTION_OPERATORS_COLUMNS["columns_with_nas_to_fill"]] = df[
    YEARLY_DISTRIBUTION_OPERATORS_COLUMNS["columns_with_nas_to_fill"]
].fillna(0)
We distinguish between NA as "not reported" and 0 as "reported as 0", so filling this would actually change the meaning here (from "I didn't fill this in" to "I had 0 incidents"). Please drop this step.
Updated
src/pudl/transform/phmsagas.py
Outdated
)

# Fill NA values with zeroes because these columns are simply counts.
# Note that "excavation_damage..." columns should sum up to the value in "excavation_damage_total". However, many rows
We should definitely add this validation as an asset check, and use fillna(0) there to do the comparison calculation.
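A rough sketch of such a check (the column and asset names are from this thread; the regex-based column selection and the dagster wiring are assumptions, not the final implementation):

```python
import pandas as pd
from dagster import AssetCheckResult, asset_check

@asset_check(asset="core_phmsagas__yearly_distribution_operators")
def check_excavation_damage_total(df: pd.DataFrame) -> AssetCheckResult:
    # Sum the component columns, treating NAs as 0 for the comparison only
    components = df.filter(regex=r"^excavation_damage").drop(
        columns="excavation_damage_total"
    )
    mismatched = components.fillna(0).sum(axis=1) != df[
        "excavation_damage_total"
    ].fillna(0)
    return AssetCheckResult(
        passed=bool(not mismatched.any()),
        metadata={"n_mismatched_rows": int(mismatched.sum())},
    )
```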
Got it. I'm seeing that even if I fill in NAs with zero, there are almost 4k rows where the excavation_damage... values don't sum to excavation_damage_total. How would you recommend handling these?
src/pudl/helpers.py
Outdated
}

for col in state_columns:
    df[col] = df[col].replace(state_abbreviations).str.upper()
When you have a complete mapping between old and new values, the col.map() method should be much faster. However any value that appears in the series but not in the dictionary defining the mapping gets converted to NA.
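e.g., a small map-with-fallback sketch using the names from the snippet above:

```python
for col in state_columns:
    mapped = df[col].map(state_abbreviations)
    # .map() returns NA for anything missing from the dict, so fall back to
    # the original value rather than silently nulling unmapped entries
    df[col] = mapped.fillna(df[col]).str.upper()
```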
Updated logic for standardizing state abbreviations here:
src/pudl/helpers.py
Outdated
"""Standardize phone numbers in the specified columns of the DataFrame. | ||
US numbers: ###-###-#### | ||
International numbers with the international code at the beginning. | ||
Numbers with extensions will be appended with "x#". | ||
Non-numeric entries will be returned as np.nan. Entries with fewer than | ||
10 digits will be returned with no hyphens. | ||
|
||
Args: | ||
df: The DataFrame to modify. | ||
columns: A list of the names of the columns that need to be standardized | ||
Returns: | ||
The modified DataFrame with standardized phone numbers in the same column. | ||
""" |
This docstring won't render correctly in the docs. Check how it looks in the built API documentation.
"""Standardize phone numbers in the specified columns of the DataFrame. | |
US numbers: ###-###-#### | |
International numbers with the international code at the beginning. | |
Numbers with extensions will be appended with "x#". | |
Non-numeric entries will be returned as np.nan. Entries with fewer than | |
10 digits will be returned with no hyphens. | |
Args: | |
df: The DataFrame to modify. | |
columns: A list of the names of the columns that need to be standardized | |
Returns: | |
The modified DataFrame with standardized phone numbers in the same column. | |
""" | |
"""Standardize phone numbers in the specified columns of the DataFrame. | |
US numbers: ###-###-#### | |
International numbers with the international code at the beginning. | |
Numbers with extensions will be appended with "x#". | |
Non-numeric entries will be returned as np.nan. Entries with fewer than | |
10 digits will be returned with no hyphens. | |
Args: | |
df: The DataFrame to modify. | |
columns: A list of the names of the columns that need to be standardized | |
Returns: | |
The modified DataFrame with standardized phone numbers in the same column. | |
""" |
Updated
src/pudl/helpers.py
Outdated
"""Analyze columns of a DataFrame for missing or invalid values. Note that this is purely for analysis | ||
and does not perform any data transformation or cleaning. | ||
This function checks each column for missing or custom missing values and prints | ||
a summary of the findings for string (object), numeric, and datetime columns. |
"""Analyze columns of a DataFrame for missing or invalid values. Note that this is purely for analysis | |
and does not perform any data transformation or cleaning. | |
This function checks each column for missing or custom missing values and prints | |
a summary of the findings for string (object), numeric, and datetime columns. | |
"""Analyze columns of a DataFrame for missing or invalid values. | |
Note that this is purely for analysis and does not perform any data | |
transformation or cleaning. This function checks each column for | |
missing or custom missing values and prints a summary of the findings | |
for string (object), numeric, and datetime columns. |
Updated
Ah yep, I meant to point out that part about the negative percent_unaccounted_for_gas values.
Is the distribution of the absolute values of the positive and negative values similar? Does it seem likely that different respondents just assumed different sign conventions here? Can we see what question they're responding to and whether it is phrased ambiguously? Like, is "5% of the gas was unaccounted for" supposed to be -5% or 5%?
For more information, see https://pre-commit.ci
@e-belfer @zaneselvans I've responded to the initial round of comments. I still haven't solved the riddle of the negative unaccounted gas values. And I need to take a closer look at
A few minor comments, but I think you're basically ready to move on to writing this to the PUDL DB. The next steps should be outlined in the issue, but obviously let me know if you have questions about anything!
Regarding the unaccounted for gas - for now, if there's not a clear pattern in the problem, I think knowing the scale of the problem is useful, and we should write an asset check that will give us a sense of whether the problem changes in scale in the future. But, we can hold off on actually trying to fix this problem in the PR.
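The core of that check could be as simple as this sketch (the function name and the 5% tolerance are placeholders, not something agreed in this thread):

```python
import pandas as pd

def unaccounted_gas_negative_share(df: pd.DataFrame) -> float:
    """Share of rows reporting a negative percent_unaccounted_for_gas."""
    return float((df["percent_unaccounted_for_gas"] < 0).mean())

# An asset check could then report this as metadata and fail only if it grows:
# passed = unaccounted_gas_negative_share(df) <= 0.05  # placeholder tolerance
```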
src/pudl/metadata/enums.py
Outdated
& (POLITICAL_SUBDIVISIONS.subdivision_type == "state"), | ||
"subdivision_name", | ||
] | ||
) |
Can you label this one and put it up near its country code peers at the top of this file, so it's easier to find? E.g.,
Suggested change:

)
"""Two-letter ANSI state codes."""
Updated
@@ -4249,6 +4250,7 @@
     # different tables. E.g. state of the utility's HQ address vs. state that a
     # plant is located in vs. state in which a utility provides service.
     "description": "Two letter US state abbreviation.",
+    "constraints": {"enum": US_STATE_CODES},
To enforce this, we'll need to make a new alembic migration (you can do this alongside adding the table to the database). E.g.,:
alembic revision --autogenerate -m "Enforce state constraint and whatever other changes you're making" && alembic upgrade head
See here for more info.
Note that this might kick up some errors in other tables that have invalid values we haven't noticed before, so we should keep an eye out for this when rerunning all the assets.
I'm going to work on this part next.
src/pudl/helpers.py
Outdated
df = df.replace(na_patterns, np.nan, regex=True)

# Attempt to infer original types where possible to avoid unwanted type changes
df = df.infer_objects()
We do a lot of dtype manipulation and enforcement downstream of this function, which typically gets called as one of the first steps when dtypes are poorly enforced. I think we ought to drop this step, as I'm worried it'll introduce some unforeseen consequences.
Updated
src/pudl/helpers.py
Outdated
"""Replace common ill-posed EIA NA spreadsheet values with np.nan. | ||
|
||
Currently replaces empty string, single decimal points with no numbers, | ||
any single whitespace character, and hyphens with np.nan. | ||
any single whitespace character, and hyphens with np.nan. Also converts | ||
columns to their original types where possible (using 'infer_objects') |
See below re: my thoughts on this part.
Could you point me to the part you're talking about? I'm having a hard time finding it.
Ah, this was vaguely in reference to my comment about convert_dtypes() vs infer_objects(), sorry that wasn't clearer!
Whoops! Looks like my documentation was misleading. No call is made to infer_objects in this method. I've removed that part of the function description.
and does not perform any data transformation or cleaning.
This function checks each column for missing or custom missing values and prints
a summary of the findings for string (object), numeric, and datetime columns.

def analyze_missing_values(
@seeess1 Am I right that this currently isn't getting called anywhere? I'm assuming you were using it in the notebook / to get a quick lay of the land of what was going on in each column.
I could see the value of keeping a version of this as a parameterized asset check on a per-table basis (e.g., make sure no more than 10% of values per year are NAs) or as dagster asset metadata attached to an asset, but we certainly don't want to have 5 log warnings for each column we're analyzing as it'll be too much noise to interpret. However, I think that's out of scope for this PR.
In the interim, would you be up for noting in the docstring that this function should be used during development as a helping tool and not in any actual final code?
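If a version of it survives, the parameterized per-table check might boil down to something like this sketch (the function name and the 10% threshold are assumptions):

```python
import pandas as pd

def check_na_share_by_year(
    df: pd.DataFrame, col: str, max_na_share: float = 0.10
) -> bool:
    """Return False if any report year has too many NAs in the given column."""
    na_share = df.groupby("report_year")[col].apply(lambda s: s.isna().mean())
    return bool((na_share <= max_na_share).all())
```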
Yep all of that is correct. This was just a tool for analysis that could be used in the middle of dev work.
But I honestly don't feel too precious about it. I'll add a note in the docstring now to make sure that it's clear this is just an analytical tool and not one used in any data transformation. But I'm also totally okay if you want to drop it entirely.
Updated
src/pudl/transform/phmsagas.py
Outdated
def core_phmsagas__yearly_distribution_operators(
    raw_data: pd.DataFrame,
) -> pd.DataFrame:
    """Build the :ref:`core_phmsagas__yearly_distribution_operators`."""
Here, include a short bulleted list of the transforms you make in this function so that this shows up in our docs. See pudl.transform.eia860 for some examples.
Updated
src/pudl/transform/phmsagas.py
Outdated
]

# Fill NaN values with pd.NA, then cast to "Int64" nullable integer type
df[cols_to_convert] = df[cols_to_convert].fillna(pd.NA).astype("Int64")
Non-blocking: convert_dtypes() might more simply achieve what you're trying to do here, since it handles the conversion of nulls to Pandas nullable dtypes.
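i.e., roughly (a sketch over the same column list):

```python
# convert_dtypes() picks pandas nullable dtypes (e.g. Int64) automatically,
# so the explicit fillna(pd.NA) + astype("Int64") combination isn't needed
df[cols_to_convert] = df[cols_to_convert].convert_dtypes()
```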
Updated
"preparer_phone", | ||
"preparer_title", | ||
], | ||
"columns_to_convert_to_ints": [ |
federal_land_leaks_repaired_or_scheduled should also be in this list, I believe!
Good catch
Updated
src/pudl/transform/phmsagas.py
Outdated
"services_shutoff_valve_in_system", | ||
"services_shutoff_valve_installed", | ||
], | ||
"capitalization_exclusion": ["headquarters_address_state", "office_address_state"], |
preparer_email should also get excluded here; emails are case-insensitive.
Updated
src/pudl/workspace/datastore.py
Outdated
@@ -206,7 +206,7 @@ class ZenodoDoiSettings(BaseSettings):
     ferc60: ZenodoDoi = "10.5281/zenodo.13149090"
     ferc714: ZenodoDoi = "10.5281/zenodo.13149091"
     gridpathratoolkit: ZenodoDoi = "10.5281/zenodo.10892394"
-    phmsagas: ZenodoDoi = "10.5281/zenodo.10493790"
+    phmsagas: ZenodoDoi = "10.5281/zenodo.13624526"
@seeess1 One more thing to flag - we published new PHMSA data this month, so the DOI should now be:
phmsagas: ZenodoDoi = "10.5281/zenodo.13624526" | |
phmsagas: ZenodoDoi = "10.5281/zenodo.14026420" |
Updated
For more information, see https://pre-commit.ci
Added an asset check to help us keep an eye out for any significant upticks here. I'm gonna work on the final steps for getting this into the PUDL DB.
I could see the change of adding an
…om/seeess1/pudl into issue-3770-transform-phmsagas-data
For more information, see https://pre-commit.ci
Hey @e-belfer! Made some more updates (commit here):
I think the last couple things are for me to:
Does that sound right?

Primary keys

Regarding primary keys, I figured that the logical PK identifier for this asset would be the combined
@seeess1 Always some annoying duplicates! Low stakes here since all the actual data is luckily identical, but thanks for catching this. Some thoughts:
Looking on track! I can do a final review once the last little bits are in place, but your remaining to-do list looks right to me.
src/pudl/helpers.py
Outdated
Args:
    df: The DataFrame to analyze.
    custom_missing_values: Optional list of custom values to consider
Missing an indent here, I think!
Updated
"Percentage is calculated as follows: " | ||
"Take the sum of: purchased gas + produced gas minus customer use + company use " | ||
"+ appropriate adjustments. Then divide by the sum of customer use + company use " | ||
"+ appropriate adjustments. Multiply the output by 100." |
Out of curiosity, were you seeing lots of values between 0-1 in this column, or mostly 0-100?
Here are the count breakdowns I'm seeing:
percent_unaccounted_for_gas == 0          14787
0 < percent_unaccounted_for_gas <= 1      11041
1 < percent_unaccounted_for_gas < 50      23665
percent_unaccounted_for_gas >= 50            51
…om/seeess1/pudl into issue-3770-transform-phmsagas-data
For more information, see https://pre-commit.ci
Pushed up my latest updates. Let me know what you think.
Right now the pandera schema check is failing when I run this locally, and I'm not totally sure why based on the metadata returned from the check failure.
Are you seeing the same issue on your end? I'd suggest getting this into alembic and written to the DB and then we can try to debug the problem, as we have some dtype enforcement built into the IO managers.
Otherwise, one recommendation about the other asset checks: simplify them. I think the way you've set this up won't work conceptually, since the check gets called after the problematic data has already been dropped. Let me know if that all makes sense?
(As always, just let me know if you need any additional support on my end to get this over the line!)
if error:

@asset_check(asset=spec.asset, blocking=True)
Not entirely following the logic here - the asset check will run on the df after you've dropped these rows in the transform step to create a unique PK, so the non_unique_groups returned in this asset check should always be 0. Rather than writing an asset check here, we'll want to actually embed this assertion in the code itself.
Could be as simple as something like this in the actual filtering functions you've written:

assert unfiltered_df.shape[0] - filtered_df.shape[0] <= 3, (
    f"Dropped more than the expected number of rows due to non-unique groups. "
    f"Rows dropped: {dropped_rows}"
)
1. For any group of rows with the same combination of "operator_id_phmsa"
   and "report_number":
   - If at least one row in the group does not contain the string "test"
     (case-insensitive) in either "office_address_street" or
     "headquarters_address_street", keep only the rows in the group
     that do not contain "test" in these columns.
   - If all rows in the group contain "test" in either of the columns,
     leave the group unchanged.
The indentation here is causing make docs-build to fail. Should look like:
Suggested change:

1. For any group of rows with the same combination of "operator_id_phmsa"
   and "report_number":

   - If at least one row in the group does not contain the string "test"
     (case-insensitive) in either "office_address_street" or
     "headquarters_address_street", keep only the rows in the group
     that do not contain "test" in these columns.
   - If all rows in the group contain "test" in either of the columns,
     leave the group unchanged.
        return AssetCheckResult(passed=False, metadata={"errors": error})

    return AssetCheckResult(passed=True)

-return _check
+return [_check_percent_unaccounted_for_gas, _check_pk_deduplication]
Right now these asset checks aren't loaded into dagster, at least partially because the function to produce them isn't called anywhere, but possibly also because you're returning a list of checks here rather than just one check. If we aren't parameterizing the checks (e.g., specifying expected values by year), then looking at the way multiple checks are defined in eia860.py would be a better analogy than the FERC 714 model.
Overview
This is still WIP!
Relates to #3770.
What problem does this address?
Goal of the PR is to complete the first transformation of raw PHMSA data into a core asset.
What did you change?
- Made a helper (fix_eia_na) more generic since it can be applied across data sets (and updated references accordingly).
- Fixed column mappings in /Users/sam/Documents/pudl/src/pudl/package_data/phmsagas/column_maps/yearly_distribution.csv since we had fax columns mapped to emails and vice versa.

Will come back to everything below once this draft PR has gone through an initial round of review.
Documentation
Make sure to update relevant aspects of the documentation.
Tasks
Testing
How did you make sure this worked? How can a reviewer verify this?
To-do list