Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor entity matching name cleaner to be more efficient #3953

Merged
merged 14 commits into from
Dec 18, 2024

Conversation

katie-lamb
Copy link
Member

@katie-lamb katie-lamb commented Nov 8, 2024

Overview

As part of the SEC to EIA record linkage development, I had to make some changes to the PUDL company name cleaning module to make it more efficient and useful. The code for this module was originally pulled OS Climate's repo, but it was no longer maintained there. I didn't make significant changes when I pulled out that module, and thus it had some quirks and inefficiencies.

What problem does this address?

  • The name cleaner was very slow. It's still pretty slow on big datasets, but is about 3x faster than previously

What did you change?

  • Instead of using apply to apply the regex replacement rules, I used pd.Series.replace so that this replacement is vectorized.
  • Removed some coupling in the cleaning rules and restructured the CompanyNameCleaner class
  • Made some updates to the regex rules to be more effective

Documentation

Make sure to update relevant aspects of the documentation.

Tasks

Preview Give feedback

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

Preview Give feedback

@katie-lamb katie-lamb marked this pull request as ready for review December 2, 2024 22:41
@katie-lamb katie-lamb requested a review from zschira December 2, 2024 22:41
@katie-lamb
Copy link
Member Author

@zschira I'm not sure how to fix the docs error... I posted about this in Slack, but here's the problem:

The warning (treated as error) occurs here and is about duplicate object descriptions for the class variables that I describe here in my CompanyNameCleaner class. When I look at the doc page that's generated by sphinx, it looks fine, although it lists all of these attributes twice, first with the full description and then with no description. Am I not doing the docstring for this class correctly? Is there something I should include so sphinx ignores this?

class CompanyNameCleaner(BaseModel):
"""Class to normalize/clean up text based company names.

Attributes:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I resolve the warning/error you need to move the attribute doc strings directly below the attribute definitions:

cleaning_rules_list: list[str] = DEFAULT_CLEANING_RULES_LIST
    """A list of cleaning rules that the
    
    CompanyNameCleaner should apply. Will be validated to ensure
    rules comply to allowed cleaning functions."""

You can use this pudl class as a reference.

I think you're getting a duplicate warning because sphinx is creating empty doc strings for the attributes and then it looks at the doc string of the class and creates doc strings for the attributes again.

@katie-lamb
Copy link
Member Author

Thanks for the help @bendnorman ! I think now it's just the same error that you mentioned was generally causing problems in PUDL?

Copy link
Member

@zschira zschira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of small questions, but overall looks good!

One random thing I started thinking about while looking at this is that maybe we were going about this the wrong way when trying to think up a way to avoid illegal states with string cleaning, and instead a better approach would be to just provide as much insight into what's happening as possible. It could be cool to try to turn this into a series of dagster op's so you could see the order in the dagster UI and inspect intermediate outputs.

This is all just some random thoughts though, not something I think we should try to actually do right now.

src/pudl/analysis/record_linkage/name_cleaner.py Outdated Show resolved Hide resolved
"""Apply the cleaning rules from the dictionary of regex rules."""
if self.place_word_the_at_beginning:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this rule handled differently? Is it a combination of two rules?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to do with the relationship between the "remove_word_the_from_the_end", "remove_word_the_from_the_beginning", and place_word_the_at_beginning rules. I decided that if you want to place the word the at the beginning, then you should do this first in case you want to then remove "the" from the end or "the" from the beginning. These rules are all kind of in conflict and feed into the idea that there are "states" that we go through - an op to handle "the" would make sense in a later refactor.

@katie-lamb katie-lamb added this pull request to the merge queue Dec 18, 2024
Merged via the queue into main with commit 0dd0530 Dec 18, 2024
17 checks passed
@katie-lamb katie-lamb deleted the rl-cleaning-upates branch December 18, 2024 21:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants