
Issue 3770 transform phmsagas data #3929

Conversation

@seeess1 (Contributor) commented Oct 24, 2024:

Overview

This is still WIP!

Relates to #3770.

What problem does this address?

The goal of this PR is to complete the first transformation of raw PHMSA data into a core asset.

What did you change?

  • Added a new transform script for PHMSA data.
  • Made one helper method (fix_eia_na) more generic since it can be applied across data sets (and updated references accordingly).
  • Added a few other helper methods.
  • Updated column map files with 2023 values (pulled from whatever was in 2022).
  • Specifically, changed the ordering of columns in src/pudl/package_data/phmsagas/column_maps/yearly_distribution.csv since we had fax columns mapped to emails and vice versa.

Will come back to everything below once this draft PR has gone through an initial round of review.

Documentation

Make sure to update relevant aspects of the documentation.

Tasks


Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list



@e-belfer e-belfer self-requested a review October 24, 2024 18:11
@e-belfer added the phmsa (Data from the Pipeline and Hazardous Material Safety Administration) and community labels Oct 24, 2024
@e-belfer (Member) left a comment:

This is looking really good! A few comments on what's already been done, you're making great progress.

Some next steps:

  • the values in percent_unaccounted_for_gas look like they could use some normalization (I see some negative values)
  • once you're happy with the table, we can go ahead and write it to the database. See here for guidance on how to do this.

src/pudl/helpers.py (resolved conversation)

# Remove unwanted characters (parentheses, spaces, periods, and dashes) from the main phone number
phone_main = re.sub(
    r"[^\d]", "", phone_main.replace(".0", "")
)
Member:

Can you explain what the special instance of replace is doing here?

seeess1 (author):

There were lots of phone numbers in the raw data with ".0" at the end of the value. So I'm using replace to remove those first.

Member:

You might want to use a regex that only matches this pattern at the end of the string to avoid accidentally removing characters from the middle of a number that could be reported like 303.443.0123

seeess1 (author):

Good point. I'm removing that logic. The regex should be all that's needed here.

seeess1 (author):

Actually I'm gonna leave in one line toward the top to get rid of these substrings just to make sure the rest of the number standardization is working with cleaner data.

Updated.
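
For reference, a minimal sketch of the end-anchored cleanup being discussed (the raw value and variable names are illustrative, not the PR's actual code):

import re

raw = "303.443.0123"             # a dotted US number that a bare .replace(".0", "") would mangle
raw = re.sub(r"\.0$", "", raw)   # only strip a trailing ".0" float artifact (no match for this value)
digits = re.sub(r"\D", "", raw)  # then drop every remaining non-digit character
print(digits)                    # prints 3034430123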

intl_code = phone_main[:-10] # Digits before the last 10
main_number = phone_main[-10:] # Last 10 digits are the phone number
formatted_phone = (
    f"+{intl_code}-{main_number[:3]}-{main_number[3:6]}-{main_number[6:]}"
)
Member:

International phone numbers don't all follow the pattern 3-3-#, so we probably just want to report the number without enforcing formatting in this case.

seeess1 (author):

Updating that now.

seeess1 (author):

Updated
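
A sketch of one plausible reading of the final rule (hyphenate 10-digit US numbers, pass longer international numbers through with just a leading "+"); the helper name is hypothetical:

def format_digits(digits: str) -> str:
    """Hyphenate 10-digit US numbers; longer numbers keep their digits with only a "+" prefix."""
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    if len(digits) > 10:
        return f"+{digits}"  # don't force a 3-3-4 pattern onto international numbers
    return digits  # fewer than 10 digits: returned with no hyphens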

additional_information,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,comments,,,,,,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information,additional_information
preparer_name,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,pname,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name,preparers_name
preparer_title,,,,,,,,,,,,,,,,,,,,,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title,preparers_title
preparer_phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,phone,pphone,pphone,pphone,pphone,pphone,pphone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone,preparers_phone
Member:

Good catch!

@asset(
ins={"raw_data": AssetIn(key="raw_phmsagas__yearly_distribution")},
io_manager_key=None, # TODO: check on this one ("pudl_io_manager"? something else?)
compute_kind=None, # TODO: check on this one
Member:

All this does is add a pretty graphic to dagster; it's totally optional, but if you want to include it, use "pandas".

seeess1 (author):

Good to know. I'll play around with this a little bit more.

# Ensure all "report_year" values have four digits
mask = df["report_year"] < 100

# Convert 2-digit years to appropriate 4-digit format (assume cutoff at year 50)
Member:

Are you seeing values as low as 50? Anything lower than 1990 would be a red flag for me at this point (at least until we bring in the older data in #3290).

Member:

If this column is unreliable, we can also parse the spreadsheet data year and add it as a column during the extraction.

seeess1 (author):

No values below 50 in the data right now. I mostly put this in to keep it flexible enough to also be able to handle post-2000 data if we started seeing values like "00", "01", etc. The column still feels pretty reliable to me overall. It was just a 2 vs 4 digit year formatting issue.
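
A minimal sketch of the cutoff-at-50 logic described here (illustrative only, not the PR's exact code):

import numpy as np
import pandas as pd

def expand_two_digit_years(df: pd.DataFrame, cutoff: int = 50) -> pd.DataFrame:
    """Expand 2-digit report years: values at or above the cutoff become 19xx, values below become 20xx."""
    mask = df["report_year"] < 100
    df.loc[mask, "report_year"] = np.where(
        df.loc[mask, "report_year"] >= cutoff,
        df.loc[mask, "report_year"] + 1900,  # e.g. 97 -> 1997
        df.loc[mask, "report_year"] + 2000,  # e.g. 1 -> 2001
    )
    return df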

return exception_cols


def standardize_state_columns(df: pd.DataFrame, state_columns: list) -> pd.DataFrame:
Member:

See POLITICAL_SUBDIVISIONS in pudl.metadata.dfs - there's already a mapping that can be used to convert names to state code, so this function should be redundant.

seeess1 (author):

Updated

].apply(lambda col: col.str.title())

# List of state columns to standardize
df = standardize_state_columns(
Member:

See my comments in the helpers function - use POLITICAL_SUBDIVISIONS here instead of creating a new function, but this is definitely a step we still want to take! Would also be good to check that there aren't any non-valid entries here.

@zaneselvans - is there any conceivable reason we don't already use an enum constraint on the state field in fields.py? I can't imagine us not wanting to enforce restrictions on this one.

@zaneselvans (Member) commented Oct 26, 2024:

I think we've used an enum constraint for the state column in EPA CEMS because it's a huge table and we want it to be interpreted as a categorical value to save space, and we don't use foreign key constraints in the hourly data that's only written to parquet.

But for tables going into the database we can either constrain the value by defining a data source or table specific enum constraint on the field (see the bottom of pudl/metadata/fields.py -- this would result in it being a categorical dtype) or by defining a foreign key constraint that points to the subdivision_code column of the core_pudl__codes_subdivisions (which would result in it remaining a string).

We haven't historically applied either kind of constraint on the state fields, but we really should! The enum route is probably the simpler for now.

seeess1 (author):

Getting rid of my duplicate function and utilizing POLITICAL_SUBDIVISIONS. Plus adding an enum constraint to "state" now.
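
A rough sketch of what that lookup could look like, assuming the POLITICAL_SUBDIVISIONS columns referenced elsewhere in this thread (subdivision_type, subdivision_name, subdivision_code); the helper name is hypothetical:

import pandas as pd
from pudl.metadata.dfs import POLITICAL_SUBDIVISIONS

def state_names_to_codes(col: pd.Series) -> pd.Series:
    """Map full US state names to two-letter codes; values that don't match are left as-is (uppercased)."""
    states = POLITICAL_SUBDIVISIONS[POLITICAL_SUBDIVISIONS.subdivision_type == "state"]
    name_to_code = dict(zip(states.subdivision_name.str.title(), states.subdivision_code))
    return col.str.title().map(name_to_code).fillna(col.str.upper())

Applied per column, e.g. df["office_address_state"] = state_names_to_codes(df["office_address_state"]).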

seeess1 (author):

Updated

# (on the scale of thousands) do not actually sum up to "excavation_damage_total".
df[YEARLY_DISTRIBUTION_OPERATORS_COLUMNS["columns_with_nas_to_fill"]] = df[
    YEARLY_DISTRIBUTION_OPERATORS_COLUMNS["columns_with_nas_to_fill"]
].fillna(0)
Member:

We distinguish between NA as "not reported" and 0 as "reported as 0", so filling this would actually change the meaning here (from "I didn't fill this in" to "I had 0 incidents"). Please drop this step.

seeess1 (author):

Updated

)

# Fill NA values with zeroes because these columns are simply counts.
# Note that "excavation_damage..." columns should sum up to the value in "excavation_damage_total". However, many rows
Member:

We should definitely add this validation as an asset check, and use fillna(0) there to do the comparison calculation.

seeess1 (author):

Got it. I'm seeing that even if I fill in NAs with zero, there are almost 4k rows where the excavation_damage... values don't sum to excavation_damage_total. How would you recommend handling these?
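
The comparison the asset check would need could be as simple as something like this (a sketch; only excavation_damage_total comes from the thread, and the component column names are placeholders):

import pandas as pd

EXCAVATION_PART_COLS = [
    "excavation_damage_one_call",  # placeholder component column names
    "excavation_damage_locating",
    "excavation_damage_excavation_practices",
]

def count_excavation_total_mismatches(df: pd.DataFrame) -> int:
    """Count rows where the component columns (NAs treated as 0) don't add up to the reported total."""
    parts_sum = df[EXCAVATION_PART_COLS].fillna(0).sum(axis=1)
    total = df["excavation_damage_total"].fillna(0)
    return int((parts_sum != total).sum())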

}

for col in state_columns:
    df[col] = df[col].replace(state_abbreviations).str.upper()
Member:

When you have a complete mapping between old and new values, the col.map() method should be much faster. However any value that appears in the series but not in the dictionary defining the mapping gets converted to NA.
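
A small sketch of that map-with-fallback pattern, reusing the names from the snippet above (assuming state_abbreviations maps full names to codes):

# .map() is faster than .replace() for a complete mapping, but unmapped values become NA,
# so fall back to the original value for anything not in the dictionary before uppercasing.
for col in state_columns:
    df[col] = df[col].map(state_abbreviations).fillna(df[col]).str.upper()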


Comment on lines 2191 to 2203
"""Standardize phone numbers in the specified columns of the DataFrame.
US numbers: ###-###-####
International numbers with the international code at the beginning.
Numbers with extensions will be appended with "x#".
Non-numeric entries will be returned as np.nan. Entries with fewer than
10 digits will be returned with no hyphens.

Args:
df: The DataFrame to modify.
columns: A list of the names of the columns that need to be standardized
Returns:
The modified DataFrame with standardized phone numbers in the same column.
"""
Member:

This docstring won't render correctly in the docs. Check how it looks in the built API documentation.

Suggested change

"""Standardize phone numbers in the specified columns of the DataFrame.
US numbers: ###-###-####
International numbers with the international code at the beginning.
Numbers with extensions will be appended with "x#".
Non-numeric entries will be returned as np.nan. Entries with fewer than
10 digits will be returned with no hyphens.
Args:
df: The DataFrame to modify.
columns: A list of the names of the columns that need to be standardized
Returns:
The modified DataFrame with standardized phone numbers in the same column.
"""

"""Standardize phone numbers in the specified columns of the DataFrame.

US numbers: ###-###-####
International numbers with the international code at the beginning.
Numbers with extensions will be appended with "x#".
Non-numeric entries will be returned as np.nan. Entries with fewer than
10 digits will be returned with no hyphens.

Args:
    df: The DataFrame to modify.
    columns: A list of the names of the columns that need to be standardized

Returns:
    The modified DataFrame with standardized phone numbers in the same column.
"""

seeess1 (author):

Updated

Comment on lines 2264 to 2267
"""Analyze columns of a DataFrame for missing or invalid values. Note that this is purely for analysis
and does not perform any data transformation or cleaning.
This function checks each column for missing or custom missing values and prints
a summary of the findings for string (object), numeric, and datetime columns.
Member:

Suggested change
"""Analyze columns of a DataFrame for missing or invalid values. Note that this is purely for analysis
and does not perform any data transformation or cleaning.
This function checks each column for missing or custom missing values and prints
a summary of the findings for string (object), numeric, and datetime columns.
"""Analyze columns of a DataFrame for missing or invalid values.
Note that this is purely for analysis and does not perform any data
transformation or cleaning. This function checks each column for
missing or custom missing values and prints a summary of the findings
for string (object), numeric, and datetime columns.

seeess1 (author):

Updated

@seeess1 (author) commented Oct 30, 2024:

This is looking really good! A few comments on what's already been done, you're making great progress.

Some next steps:

  • the values in percent_unaccounted_for_gas look like they could use some normalization (I see some negative values)
  • once you're happy with the table, we can go ahead and write it to the database. See here for guidance on how to do this.

Ah yep I meant to point that part out about percent_unaccounted_for_gas. I noticed that too. I honestly wasn't sure how we wanted to handle this one. Any suggestions?

@zaneselvans (Member) commented:

Would percent_unaccounted_for_gas < 0 mean that extra gas appeared from somewhere?

Is the distribution of the absolute values of positive and negative values similar?

Does it seem likely that different respondents just assumed different sign conventions here? Can we see what question they're responding to and if it is phrased ambiguously? Like is "5% of the gas was unaccounted for" supposed to be -5% or 5%?

@seeess1 (author) commented Nov 3, 2024:

@e-belfer @zaneselvans I've responded to the initial round of comments. I still haven't solved the riddle of the negative unaccounted gas values. And I need to take a closer look at io_manager_key and compute_kind. But hopefully the PR is starting to shape up here.

@e-belfer e-belfer self-requested a review November 4, 2024 14:00
@e-belfer (Member) left a comment:

A few minor comments, but I think you're basically ready to move on to writing this to the PUDL DB. The next steps should be outlined in the issue, but obviously let me know if you have questions about anything!

Regarding the unaccounted for gas - for now, if there's not a clear pattern in the problem, I think knowing the scale of the problem is useful, and we should write an asset check that will give us a sense of whether the problem changes in scale in the future. But, we can hold off on actually trying to fix this problem in the PR.

& (POLITICAL_SUBDIVISIONS.subdivision_type == "state"),
"subdivision_name",
]
)
Member:

Can you label this one and put it up near its country code peers at the top of this file, so it's easier to find? E.g.,

Suggested change
)
)
""" Two-letter ANSI state codes."""

seeess1 (author):

Updated

@@ -4249,6 +4250,7 @@
# different tables. E.g. state of the utility's HQ address vs. state that a
# plant is located in vs. state in which a utility provides service.
"description": "Two letter US state abbreviation.",
"constraints": {"enum": US_STATE_CODES},
Member:

To enforce this, we'll need to make a new alembic migration (you can do this alongside adding the table to the database). E.g.,:

alembic revision --autogenerate -m "Enforce state constraint and whatever other changes you're making" && alembic upgrade head

See here for more info.

Note that this might kick up some errors in other tables that have invalid values we haven't noticed before, so we should keep an eye out for this when rerunning all the assets.

seeess1 (author):

I'm going to work on this part next.

df = df.replace(na_patterns, np.nan, regex=True)

# Attempt to infer original types where possible to avoid unwanted type changes
df = df.infer_objects()
Member:

We do a lot of dtype manipulation and enforcement downstream of this function, which typically gets called as one of the first steps when dtypes are poorly enforced. I think we ought to drop this step, as I'm worried it'll introduce some unforeseen consequences.

seeess1 (author):

Updated

"""Replace common ill-posed EIA NA spreadsheet values with np.nan.

Currently replaces empty string, single decimal points with no numbers,
any single whitespace character, and hyphens with np.nan.
any single whitespace character, and hyphens with np.nan. Also converts
columns to their original types where possible (using 'infer_objects')
Member:

See below re: my thoughts on this part.

seeess1 (author):

Could you point me to the part you're talking about? I'm having a hard time finding it.

Member:

Ah this was vaguely in reference to my comment about convert_dtypes() vs infer_objects(), sorry that wasn't clearer!

seeess1 (author):

Whoops! Looks like my documentation was misleading. No call is made to infer_objects in this method. I've removed that part of the function description.

and does not perform any data transformation or cleaning.
This function checks each column for missing or custom missing values and prints
a summary of the findings for string (object), numeric, and datetime columns.
def analyze_missing_values(
Member:

@seeess1 Am I right that this currently isn't getting called anywhere? I'm assuming you were using it in the notebook / to get a quick lay of the land of what was going on in each column.

I could see the value of keeping a version of this as a parameterized asset check on a per-table basis (e.g., make sure no more than 10% of values per year are NAs) or as dagster asset metadata attached to an asset, but we certainly don't want to have 5 log warnings for each column we're analyzing as it'll be too much noise to interpret. However, I think that's out of scope for this PR.

In the interim, would you be up for noting in the docstring that this function should be used during development as a helping tool and not in any actual final code?

seeess1 (author):

Yep all of that is correct. This was just a tool for analysis that could be used in the middle of dev work.

But I honestly don't feel too precious about it. I'll add a note in the docstring now to make sure that it's clear this is just an analytical tool and not one used in any data transformation. But I'm also totally okay if you want to drop it entirely.

seeess1 (author):

Updated

def core_phmsagas__yearly_distribution_operators(
raw_data: pd.DataFrame,
) -> pd.DataFrame:
"""Build the :ref:`core_phmsagas__yearly_distribution_operators`."""
Member:

Here, include a short bulleted list of the transforms you make in this function so that this shows up in our docs. See pudl.transform.eia860 for some examples.

seeess1 (author):

Updated

]

# Fill NaN values with pd.NA, then cast to "Int64" nullable integer type
df[cols_to_convert] = df[cols_to_convert].fillna(pd.NA).astype("Int64")
Member:

Non-blocking: convert_dtypes() might more simply achieve what you're trying to do here, since it handles the conversion of nulls to Pandas nullable dtypes.
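
In other words, something like the following (illustrative only, using cols_to_convert from the snippet above):

# Explicit cast: floats containing NaN become pandas' nullable Int64, with NaN mapped to pd.NA.
df[cols_to_convert] = df[cols_to_convert].astype("Int64")

# Or let pandas choose nullable dtypes for those columns in one step.
df[cols_to_convert] = df[cols_to_convert].convert_dtypes()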

seeess1 (author):

Updated

"preparer_phone",
"preparer_title",
],
"columns_to_convert_to_ints": [
Member:

federal_land_leaks_repaired_or_scheduled should also be in this list, I believe!

seeess1 (author):

Good catch

seeess1 (author):

Updated

"services_shutoff_valve_in_system",
"services_shutoff_valve_installed",
],
"capitalization_exclusion": ["headquarters_address_state", "office_address_state"],
Member:

preparer_email should also get excluded here, emails are case insensitive

seeess1 (author):

Updated

@@ -206,7 +206,7 @@ class ZenodoDoiSettings(BaseSettings):
ferc60: ZenodoDoi = "10.5281/zenodo.13149090"
ferc714: ZenodoDoi = "10.5281/zenodo.13149091"
gridpathratoolkit: ZenodoDoi = "10.5281/zenodo.10892394"
phmsagas: ZenodoDoi = "10.5281/zenodo.10493790"
phmsagas: ZenodoDoi = "10.5281/zenodo.13624526"
Member:

@seeess1 One more thing to flag - we published new PHMSA data this month, so the DOI should now be:

Suggested change
phmsagas: ZenodoDoi = "10.5281/zenodo.13624526"
phmsagas: ZenodoDoi = "10.5281/zenodo.14026420"

seeess1 (author):

Updated

@seeess1 (author) commented Nov 24, 2024:

A few minor comments, but I think you're basically ready to move on to writing this to the PUDL DB. The next steps should be outlined in the issue, but obviously let me know if you have questions about anything!

Regarding the unaccounted for gas - for now, if there's not a clear pattern in the problem, I think knowing the scale of the problem is useful, and we should write an asset check that will give us a sense of whether the problem changes in scale in the future. But, we can hold off on actually trying to fix this problem in the PR.

Added an asset check to help us keep an eye out for any significant upticks here.

I'm gonna work on the final steps for getting this into the PUDL DB.

@zaneselvans (Member) commented:

I could see the change of adding an ENUM constraint on the state field being complicated since it changes the type to categorical rather than string which may make it hard to work with different state columns in different tables together. If that's a frustration or a hold up then I think just cleaning the column and then leaving it unconstrained (until we can come back and really deal with state across the board) is also totally fine.

@seeess1 (author) commented Dec 7, 2024:

Hey @e-belfer! Made some more updates (commit here):

  • Updated the __init__.py files.
  • Added a new PHMSA resource in pudl.metadata.resources.
  • Added new fields.

I think the last couple things are for me to:

  • Define the PK (more on that below).
  • Create a migration.
  • Run all tests to make sure nothing is off.

Does that sound right?

Primary keys

Regarding primary keys, I figured that the logical PK identifier for this asset would be the combined operator_id_phmsa and report_number. But apparently there are a few rows with the same values here but slightly different metadata (for example, different office addresses for each row). I've attached the output from filtering down to rows that have the same combined values for these columns. Let me know how you'd like to handle these.
duplicates.csv

@e-belfer (Member) commented Dec 9, 2024:

@seeess1 Always some annoying duplicates! Low stakes here since all the actual data is luckily identical, but thanks for catching this. Some thoughts:

  • The last four look completely identical to me and could get handled by df.drop_duplicates() unless I'm missing some tiny difference?
  • For Williamsville, the one that doesn't have any "testing" in the names will be the one we want.
  • For the other 2 cases, would prefer to just have a consistent rule and in this case the first row for each looks intuitively more correct (both city addresses actually match the system they're located in). Would be good to add a tiny assertion in here that checks that you aren't dropping more than the expected number of rows of data (2) in case this becomes a huge problem in the next year of data, e.g.
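
A hedged sketch of how the drop_duplicates() step, the "test"-row rule, and the guard-rail assertion could fit together (column names come from this thread; the tolerance is a placeholder, and the "keep the first row" rule for the remaining two cases is omitted):

import pandas as pd

def dedupe_operator_reports(df: pd.DataFrame, max_expected_drops: int = 6) -> pd.DataFrame:
    """Drop exact duplicates, then prefer non-"test" rows within each (operator_id_phmsa, report_number) group."""
    n_before = len(df)
    df = df.drop_duplicates()
    is_test = (
        df["office_address_street"].str.contains("test", case=False, na=False)
        | df["headquarters_address_street"].str.contains("test", case=False, na=False)
    )
    group_keys = [df["operator_id_phmsa"], df["report_number"]]
    group_has_non_test = ~is_test.groupby(group_keys).transform("all")
    df = df[~(is_test & group_has_non_test)]
    dropped_rows = n_before - len(df)
    assert dropped_rows <= max_expected_drops, f"Dropped more rows than expected: {dropped_rows}"
    return df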

@e-belfer (Member) left a comment:

Looking on track! I can do a final review once the last little bits are in place, but your remaining to-do list looks right to me.


Args:
df: The DataFrame to analyze.
custom_missing_values: Optional list of custom values to consider
custom_missing_values: Optional list of custom values to consider
Member:

Missing an indent here, I think!

seeess1 (author):

Updated

"Percentage is calculated as follows: "
"Take the sum of: purchased gas + produced gas minus customer use + company use "
"+ appropriate adjustments. Then divide by the sum of customer use + company use "
"+ appropriate adjustments. Multiply the output by 100."
Member:

Out of curiosity, were you seeing lots of values between 0-1 in this column, or mostly 0-100?

seeess1 (author):

Here are the count breakdowns I'm seeing:

percent_unaccounted_for_gas == 0        14787
0 < percent_unaccounted_for_gas <= 1    11041
1 < percent_unaccounted_for_gas < 50    23665
percent_unaccounted_for_gas >= 50          51

@seeess1 (author) commented Dec 12, 2024:

Looking on track! I can do a final review once the last little bits are in place, but your remaining to-do list looks right to me.

Pushed up my latest updates. Let me know what you think.

@e-belfer e-belfer self-requested a review December 17, 2024 17:11
@e-belfer (Member) left a comment:

Right now the pandera schema check is failing when I run this locally, and I'm not totally sure why based on the metadata returned from the check failure.

Are you seeing the same issue on your end? I'd suggest getting this into alembic and written to the DB and then we can try to debug the problem, as we have some dtype enforcement built into the IO managers.

Otherwise, one recommendation about the other asset checks: simplify them. I think the way you've set this up won't work conceptually, since the check gets called after the problematic data has already been dropped. Let me know if that all makes sense?

(As always, just let me know if you need any additional support on my end to get this over the line!)


if error:
@asset_check(asset=spec.asset, blocking=True)
Member:

Not entirely following the logic here - the asset check will run on the df after you've dropped these rows in the transform step to create a unique PK, so the non_unique_groups returned in this asset check should always be 0. Rather than writing an asset check here, we'll want to actually embed this assertion in the code itself.

Could be as simple as something like this in the actual filtering functions you've written:

dropped_rows = unfiltered_df.shape[0] - filtered_df.shape[0]
assert dropped_rows <= 3, f"Dropped more than expected number of rows due to non-unique groups. Rows dropped: {dropped_rows}"

Comment on lines +208 to +215
1. For any group of rows with the same combination of "operator_id_phmsa"
and "report_number":
- If at least one row in the group does not contain the string "test"
(case-insensitive) in either "office_address_street" or
"headquarters_address_street", keep only the rows in the group
that do not contain "test" in these columns.
- If all rows in the group contain "test" in either of the columns,
leave the group unchanged.
Member:

The indentation here is causing make docs-build to fail. Should look like:

Suggested change

1. For any group of rows with the same combination of "operator_id_phmsa"
and "report_number":
- If at least one row in the group does not contain the string "test"
(case-insensitive) in either "office_address_street" or
"headquarters_address_street", keep only the rows in the group
that do not contain "test" in these columns.
- If all rows in the group contain "test" in either of the columns,
leave the group unchanged.

1. For any group of rows with the same combination of "operator_id_phmsa"
   and "report_number":

   - If at least one row in the group does not contain the string "test"
     (case-insensitive) in either "office_address_street" or
     "headquarters_address_street", keep only the rows in the group
     that do not contain "test" in these columns.
   - If all rows in the group contain "test" in either of the columns,
     leave the group unchanged.

return AssetCheckResult(passed=False, metadata={"errors": error})

return AssetCheckResult(passed=True)

return _check
return [_check_percent_unaccounted_for_gas, _check_pk_deduplication]
Member:

Right now these asset checks aren't loaded into dagster, at least partially because the function to produce them isn't called anywhere, but possibly also because you're returning a list of checks here rather than just one check. If we aren't parameterizing the checks (e.g., specifying expected values by year) then looking at the way in which multiple checks are defined in eia860.py would be a better analogy than the FERC 714 model.
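
For illustration, the suggested shape is roughly one stand-alone, module-level check per condition rather than a factory returning a list; everything below (names, the threshold, and the way the asset value is passed in) is an assumption, not the repo's actual code:

import pandas as pd
from dagster import AssetCheckResult, asset_check

@asset_check(asset="core_phmsagas__yearly_distribution_operators", blocking=False)
def check_percent_unaccounted_for_gas(
    core_phmsagas__yearly_distribution_operators: pd.DataFrame,
) -> AssetCheckResult:
    """Track how many negative percent_unaccounted_for_gas values show up in each release."""
    df = core_phmsagas__yearly_distribution_operators
    n_negative = int((df["percent_unaccounted_for_gas"] < 0).sum())
    return AssetCheckResult(
        passed=bool(n_negative < 5000),  # placeholder tolerance
        metadata={"negative_values": n_negative},
    )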

@e-belfer e-belfer self-assigned this Jan 6, 2025
@e-belfer e-belfer mentioned this pull request Jan 7, 2025
@e-belfer (Member) commented Jan 7, 2025:

@seeess1 Thanks for all your work on this one! I'm closing this and picking up your work in #4005 to get this over the last few hurdles!

@e-belfer e-belfer closed this Jan 7, 2025