Remove duplicates in the mine_id_msha column #3213

Open
bendnorman opened this issue Jan 4, 2024 · 0 comments

bendnorman commented Jan 4, 2024

Based on the comment in src/pudl/transform/eia923.py, the mine_id_msha column should contain unique IDs:

# If we actually *have* an MSHA ID for a mine, then we have a totally
# unique identifier for that mine, and we can safely drop duplicates and
# keep just one copy of that mine, no matter how different all the other
# fields associated with the mine info are... Here we split out all the
# coalmine records that have an MSHA ID, remove them from the CMI
# data frame, drop duplicates, and then bring the unique mine records
# back into the overall CMI dataframe...
cmi_with_msha = cmi_df[cmi_df["mine_id_msha"] > 0]
cmi_with_msha = cmi_with_msha.drop_duplicates(
    subset=[
        "mine_id_msha",
    ]
)
cmi_df.drop(cmi_df[cmi_df["mine_id_msha"] > 0].index)
cmi_df = pd.concat([cmi_df, cmi_with_msha])
cmi_df = cmi_df.drop_duplicates(subset=coalmine_cols)
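but they aren't: cmi_df.drop(...) returns a new dataframe, and that result is never assigned back to cmi_df. The records with an MSHA ID are therefore never removed, so concatenating the deduplicated copies back in leaves the original duplicates in place.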

It's been a while since I've played with this data so I'm not sure what the best option is.
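One possible option, sketched here on made-up data rather than the real EIA-923 tables: split the dataframe into records with and without an MSHA ID, deduplicate only the former, and concatenate the two parts. The column names mine_name and state, the sample rows, and the variable names buggy, fixed, and cmi_without_msha are assumptions for illustration; only mine_id_msha, cmi_df, cmi_with_msha, and coalmine_cols come from the snippet above. The sketch keeps the first record seen for each MSHA ID, which may or may not be the right record to keep.

```python
import pandas as pd

# Hypothetical toy data: two differing records share MSHA ID 101.
coalmine_cols = ["mine_name", "state", "mine_id_msha"]
cmi_df = pd.DataFrame(
    {
        "mine_name": ["Alpha", "Alpha Mine", "Beta", "Gamma"],
        "state": ["WV", "WV", "KY", "PA"],
        "mine_id_msha": [101, 101, 0, 202],
    }
)

# Reproduce the current behavior: the result of drop() is discarded,
# so the original MSHA records stay in the frame.
buggy = cmi_df.copy()
buggy_with_msha = buggy[buggy["mine_id_msha"] > 0].drop_duplicates(
    subset=["mine_id_msha"]
)
buggy.drop(buggy[buggy["mine_id_msha"] > 0].index)  # return value is ignored
buggy = pd.concat([buggy, buggy_with_msha]).drop_duplicates(subset=coalmine_cols)

# Possible fix: build the two halves explicitly instead of relying on drop().
has_msha = cmi_df["mine_id_msha"] > 0
cmi_with_msha = cmi_df[has_msha].drop_duplicates(subset=["mine_id_msha"])
cmi_without_msha = cmi_df[~has_msha]
fixed = pd.concat([cmi_without_msha, cmi_with_msha]).drop_duplicates(
    subset=coalmine_cols
)

# Check whether any positive MSHA ID appears more than once.
buggy_ids = buggy.loc[buggy["mine_id_msha"] > 0, "mine_id_msha"]
fixed_ids = fixed.loc[fixed["mine_id_msha"] > 0, "mine_id_msha"]
print("duplicates with current code:", buggy_ids.duplicated().any())
print("duplicates with sketched fix:", fixed_ids.duplicated().any())
```

Simply assigning the result of the existing drop() call back to cmi_df would also work; the boolean mask just avoids computing the same filter twice. Either way, records that differ in other fields but share an MSHA ID get collapsed into one, so downstream tables that reference the dropped records may need checking.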

Projects
Status: Icebox