Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update linking.core.classifier and linking.core.threshold #175

Merged
merged 11 commits into from
Dec 6, 2024

Conversation

riley-harper
Copy link
Contributor

@riley-harper riley-harper commented Dec 6, 2024

These are two core modules which we have been using for our recent work. This PR makes a few updates to both of them, with a couple of breaking changes and some refactoring.

linking.core.classifier

  • choose_classifier() no longer filters the threshold and threshold_ratio keys out of the params argument. This was redundant, since we are already filtering these keys out of the hyper-parameters dictionary in the places where we call choose_classifier(). It was also confusing because not all of the if branches in the function handled the params the same way. Some did filter these keys out, some didn't. Now choose_classifier() does not need to know about the config structure, and just accepts a dictionary of hyper-parameters to pass along to the appropriate classifier constructor.

linking.core.threshold

  • Instead of taking the entire training_conf as an argument, predict_using_thresholds() now takes a decision argument. Previously, it extracted decision from the training config, then didn't use the training config for anything else. Like linking.core.classifier.choose_classifier(), this change makes predict_using_thresholds() less coupled to the config structure. Note that the order of arguments has also changed. decision appears at the end of the list of arguments, after id_col. training_conf appeared before id_col. This is to support possibly giving decision a default value and making it optional in the future.
  • Added type hints, documentation, a little bit of logging, and unit tests for linking.core.threshold
  • Rewrote some SQL queries in _apply_alpha_threshold() and _apply_threshold_ratio() as PySpark Column expressions. This lets us avoid using Python f-strings to interpolate values from the configuration file (like alpha_threshold) into queries, which could change the logic of the query. The PySpark expressions are also more composable and in my opinion a little easier to work with and understand.

The output type of choose_classifier() is really hard to write down
precisely because of the way PySpark types are set up. It's something
like tuple["Classifier", "Transformer"], but for some reason
SQLTransformer is not a subtype of Transformer.
The caller is responsible for passing a dictionary of hyper-parameters
to choose_classifier(), and this dictionary should not include hlink's
threshold or threshold_ratio. Both of the places where we call
choose_classifier() (training and model exploration) already handle
this.
…he whole training config

This makes it clear which part of the config predict_with_thresholds() is using
and makes it easier to call. It also means that predict_with_thresholds() does
not need to know about the structure of the config.
This prevents a possible SQL injection error by setting alpha_threshold to
something weird.  It's also a bit easier to read and work with in my
experience. It's more composable since you can build up the expression instead
of having to write all of the SQL at once.
This is just a bit cleaner to read, and makes clear the names of the columns
that we're adding. We can't select ratio and prediction at once because
prediction depends on ratio.
@riley-harper riley-harper requested a review from ccdavis December 6, 2024 18:06
@riley-harper
Copy link
Contributor Author

These changes are for #172 and #174. Once we merge v4-dev into main, we can close those issues.

Copy link

@ccdavis ccdavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good comments.

The switch away from f-strings sure looks cleaner.

@riley-harper riley-harper merged commit 3c9043c into v4-dev Dec 6, 2024
6 checks passed
@riley-harper riley-harper deleted the core-arguments branch December 6, 2024 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants