Update linking.core.classifier and linking.core.threshold #175
Merged
Conversation
The output type of `choose_classifier()` is really hard to write down precisely because of the way PySpark types are set up. It's something like `tuple["Classifier", "Transformer"]`, but for some reason `SQLTransformer` is not a subtype of `Transformer`.
The caller is responsible for passing a dictionary of hyper-parameters to `choose_classifier()`, and this dictionary should not include hlink's `threshold` or `threshold_ratio`. Both of the places where we call `choose_classifier()` (training and model exploration) already handle this.
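For illustration, here is a minimal sketch of that caller-side filtering. The `train_model` wrapper, the `training_settings` layout, and the exact `choose_classifier()` signature are assumptions for this sketch, not hlink's actual code:

```python
from hlink.linking.core.classifier import choose_classifier

def train_model(training_settings: dict, dep_var: str):
    chosen_model = training_settings["chosen_model"]
    # choose_classifier() no longer filters these keys itself, so the caller
    # must strip hlink's config keys and pass pure hyper-parameters.
    hyperparams = {
        key: value
        for (key, value) in chosen_model.items()
        if key not in {"type", "threshold", "threshold_ratio"}
    }
    return choose_classifier(chosen_model["type"], hyperparams, dep_var)
```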
…he whole training config

This makes it clear which part of the config `predict_with_thresholds()` is using and makes it easier to call. It also means that `predict_with_thresholds()` does not need to know about the structure of the config.
This prevents a possible SQL injection error caused by setting `alpha_threshold` to something weird. It's also a bit easier to read and work with, in my experience, and it's more composable, since you can build up the expression instead of having to write all of the SQL at once.
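As a hedged sketch of the difference (the column name and threshold logic here are illustrative, not hlink's exact code):

```python
from pyspark.sql.functions import col, when

alpha_threshold = 0.8

# Before (roughly): an f-string, which would interpolate a malformed or
# malicious alpha_threshold value straight into the SQL text.
# sql = f"CASE WHEN probability >= {alpha_threshold} THEN 1 ELSE 0 END"

# After: a Column expression. alpha_threshold is treated as a value, not as
# SQL text, and the expression can be built up and reused piece by piece.
prediction = when(col("probability") >= alpha_threshold, 1).otherwise(0)
```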
This is just a bit cleaner to read, and makes clear the names of the columns that we're adding. We can't select `ratio` and `prediction` at once because `prediction` depends on `ratio`.
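A sketch of the two-step pattern (the ratio formula and column names are illustrative assumptions, since `prediction` must be added in a step after `ratio` exists):

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, when

def add_ratio_and_prediction(
    df: DataFrame, alpha_threshold: float, threshold_ratio: float
) -> DataFrame:
    # ratio has to be added first, since prediction reads it in the next step.
    with_ratio = df.withColumn(
        "ratio", col("probability") / col("second_best_probability")
    )
    return with_ratio.withColumn(
        "prediction",
        when(
            (col("probability") >= alpha_threshold)
            & (col("ratio") >= threshold_ratio),
            1,
        ).otherwise(0),
    )
```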
ccdavis approved these changes on Dec 6, 2024.
Good comments.
The switch away from f-strings sure looks cleaner.
These are two core modules which we have been using for our recent work. This PR makes a few updates to both of them, with a couple of breaking changes and some refactoring.
`linking.core.classifier`

`choose_classifier()` no longer filters the `threshold` and `threshold_ratio` keys out of the `params` argument. This was redundant, since we are already filtering these keys out of the hyper-parameters dictionary in the places where we call `choose_classifier()`. It was also confusing because not all of the if branches in the function handled `params` the same way: some filtered these keys out, some didn't. Now `choose_classifier()` does not need to know about the config structure, and just accepts a dictionary of hyper-parameters to pass along to the appropriate classifier constructor.
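To make the new contract concrete, here is a minimal sketch, assuming a `choose_classifier(model_type, params, dep_var)` shape. The real function supports more model types and also returns a post-classification transformer:

```python
from typing import Any

from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

def choose_classifier(model_type: str, params: dict[str, Any], dep_var: str):
    # params is forwarded unchanged: no filtering of threshold/threshold_ratio.
    if model_type == "random_forest":
        return RandomForestClassifier(labelCol=dep_var, **params)
    if model_type == "logistic_regression":
        return LogisticRegression(labelCol=dep_var, **params)
    raise ValueError(f"unknown model type: {model_type}")
```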
`linking.core.threshold`

Instead of taking the whole `training_conf` as an argument, `predict_using_thresholds()` now takes a `decision` argument. Previously, it extracted `decision` from the training config, then didn't use the training config for anything else. Like `linking.core.classifier.choose_classifier()`, this change makes `predict_using_thresholds()` less coupled to the config structure. Note that the order of arguments has also changed: `decision` appears at the end of the argument list, after `id_col`, whereas `training_conf` appeared before `id_col`. This is to support possibly giving `decision` a default value and making it optional in the future.

Rewrote the conditions in `_apply_alpha_threshold()` and `_apply_threshold_ratio()` as PySpark `Column` expressions. This lets us avoid using Python f-strings to interpolate values from the configuration file (like `alpha_threshold`) into queries, which could change the logic of the query. The PySpark expressions are also more composable and, in my opinion, a little easier to work with and understand.
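For reference, a hedged sketch of the new call shape; the threshold values, the column name, and the `decision` value here are illustrative assumptions:

```python
from hlink.linking.core.threshold import predict_using_thresholds

# scored is assumed to be a DataFrame of match probabilities from the model.
predictions = predict_using_thresholds(
    scored,
    0.8,                                    # alpha_threshold
    1.3,                                    # threshold_ratio
    "histid",                               # id_col (illustrative name)
    "drop_duplicate_with_threshold_ratio",  # decision, now after id_col
)
```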