Update linking.core.classifier and linking.core.threshold #175

riley-harper · 2024-12-06T18:06:29Z

These are two core modules which we have been using for our recent work. This PR makes a few updates to both of them, with a couple of breaking changes and some refactoring.

linking.core.classifier

choose_classifier() no longer filters the threshold and threshold_ratio keys out of the params argument. This was redundant, since we are already filtering these keys out of the hyper-parameters dictionary in the places where we call choose_classifier(). It was also confusing because not all of the if branches in the function handled the params the same way. Some did filter these keys out, some didn't. Now choose_classifier() does not need to know about the config structure, and just accepts a dictionary of hyper-parameters to pass along to the appropriate classifier constructor.
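As a rough sketch of what the caller is now responsible for, the snippet below separates threshold and threshold_ratio from the rest of a hyper-parameters dictionary before anything is handed to a classifier constructor. The helper name split_hyper_params() and the example values are made up for illustration; this is not hlink's actual code.

```python
from typing import Any, Dict, Tuple

# Keys that belong to hlink's thresholding step, not to the classifier itself.
THRESHOLD_KEYS = {"threshold", "threshold_ratio"}


def split_hyper_params(
    raw_params: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: split thresholds from classifier hyper-parameters."""
    thresholds = {k: v for k, v in raw_params.items() if k in THRESHOLD_KEYS}
    hyper_params = {k: v for k, v in raw_params.items() if k not in THRESHOLD_KEYS}
    return thresholds, hyper_params


# Illustrative values only; they are not taken from a real config file.
raw_params = {
    "maxDepth": 7,
    "numTrees": 100,
    "threshold": 0.8,
    "threshold_ratio": 1.2,
}

thresholds, hyper_params = split_hyper_params(raw_params)
# hyper_params is what a caller would pass along as the params argument.
```

Doing the split once at the call site keeps the classifier-selection code free of config-specific keys, which is the decoupling described above.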

linking.core.threshold

Instead of taking the entire training_conf as an argument, predict_using_thresholds() now takes a decision argument. Previously, it extracted decision from the training config and then didn't use the training config for anything else. Like the change to linking.core.classifier.choose_classifier(), this makes predict_using_thresholds() less coupled to the config structure. Note that the order of arguments has also changed: decision now appears at the end of the argument list, after id_col, whereas training_conf appeared before id_col. This leaves room to give decision a default value and make it optional in the future.
Added type hints, documentation, a little bit of logging, and unit tests for linking.core.threshold.
Rewrote some SQL queries in _apply_alpha_threshold() and _apply_threshold_ratio() as PySpark Column expressions. This lets us avoid using Python f-strings to interpolate values from the configuration file (like alpha_threshold) into queries, which could change the logic of the query. The PySpark expressions are also more composable and in my opinion a little easier to work with and understand.
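For anyone updating call sites, here is a minimal sketch of the argument change to predict_using_thresholds(). The function below is a stand-in with a placeholder body, and everything other than id_col and decision (the remaining parameter names, the example decision string, and the list used in place of a DataFrame) is an assumption for illustration.

```python
from typing import Any, Dict, List


def predict_using_thresholds_sketch(
    pred_df: List[Dict[str, Any]],  # stand-in for a Spark DataFrame
    alpha_threshold: float,  # assumed parameter name
    threshold_ratio: float,  # assumed parameter name
    id_col: str,
    decision: str,  # new: comes last, after id_col
) -> Dict[str, Any]:
    """Stand-in only: reports what it received instead of thresholding."""
    return {"rows": len(pred_df), "id_col": id_col, "decision": decision}


# Hypothetical training config; only the "decision" key matters here.
training_conf = {
    "decision": "drop_duplicate_with_threshold_ratio",
    "dependent_var": "match",
}

# The caller now pulls decision out of the config itself.
decision = training_conf["decision"]

result = predict_using_thresholds_sketch(
    pred_df=[{"id": "a", "probability": 0.9}],
    alpha_threshold=0.8,
    threshold_ratio=1.3,
    id_col="id",
    decision=decision,
)
```

Passing the arguments by keyword, as in the call above, is one way to keep call sites from breaking silently when positional order changes.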

The output type of choose_classifier() is really hard to write down precisely because of the way PySpark types are set up. It's something like tuple["Classifier", "Transformer"], but for some reason SQLTransformer is not a subtype of Transformer.

The caller is responsible for passing a dictionary of hyper-parameters to choose_classifier(), and this dictionary should not include hlink's threshold or threshold_ratio. Both of the places where we call choose_classifier() (training and model exploration) already handle this.

…he whole training config

This makes it clear which part of the config predict_using_thresholds() is using and makes it easier to call. It also means that predict_using_thresholds() does not need to know about the structure of the config.

This prevents a possible SQL injection error caused by setting alpha_threshold to something weird in the config. It's also a bit easier to read and work with in my experience, and it's more composable since you can build up the expression piece by piece instead of having to write all of the SQL at once.
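To make the hazard concrete without needing a Spark session, here is a small stand-in that uses Python's built-in sqlite3 module rather than PySpark. It shows how an f-string lets a weird alpha_threshold value rewrite the query, and how treating the threshold as a value avoids that. The table, column names, and function names are hypothetical; the actual change in this PR uses PySpark Column expressions, not sqlite3 placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (id TEXT, probability REAL)")
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?)",
    [("a", 0.9), ("b", 0.4)],
)


def count_above_fstring(alpha_threshold):
    """Unsafe: the threshold is spliced into the query text."""
    query = f"SELECT COUNT(*) FROM predictions WHERE probability >= {alpha_threshold}"
    return conn.execute(query).fetchone()[0]


def count_above_bound(alpha_threshold):
    """Safer: the threshold is validated and bound as a value."""
    value = float(alpha_threshold)  # raises ValueError for non-numeric input
    query = "SELECT COUNT(*) FROM predictions WHERE probability >= ?"
    return conn.execute(query, (value,)).fetchone()[0]


weird_threshold = "0.8 OR 1 = 1"

# With a normal threshold both versions agree.
normal_fstring = count_above_fstring(0.8)
normal_bound = count_above_bound(0.8)

# With the weird value, the f-string query matches every row,
# because the text becomes "probability >= 0.8 OR 1 = 1".
weird_fstring = count_above_fstring(weird_threshold)

try:
    count_above_bound(weird_threshold)
    bound_rejected = False
except ValueError:
    bound_rejected = True
```

The same idea carries over to Column expressions: the threshold stays a Python value that is compared against a column, so it never becomes part of the query text.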

riley-harper · 2024-12-06T18:11:12Z

These changes are for #172 and #174. Once we merge v4-dev into main, we can close those issues.

ccdavis

Good comments.

The switch away from f-strings sure looks cleaner.

riley-harper added 11 commits December 5, 2024 09:58

[#174] Add type hints to linking.core.threshold

49bda13

[#174] Add a couple of unit tests for linking.core.threshold

28bcd03

[#174] Do some minor refactoring and cleanup of linking.core.threshold

5424513

[#174] Rewrite some thresholding code to use PySpark exprs instead of…

647a751

… SQL

[#174] Use withColumn() instead of select("*", ...)

b5c8ae9

This is just a bit cleaner to read, and makes clear the names of the columns that we're adding. We can't select ratio and prediction at once because prediction depends on ratio.

[#174] Improve the error message when there's no probability column

1ffb6d1

[#174] Update documentation and add a few logging debug statements

d32c2bf

riley-harper requested a review from ccdavis December 6, 2024 18:06

ccdavis approved these changes Dec 6, 2024

riley-harper merged commit 3c9043c into v4-dev Dec 6, 2024
6 checks passed

riley-harper deleted the core-arguments branch December 6, 2024 19:22
