Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't handle threshold and threshold_ratio in core.classifier.choose_classifier() #172

Open
riley-harper opened this issue Dec 4, 2024 · 0 comments

Comments

@riley-harper
Copy link
Contributor

Some of the code in choose_classifier() explicitly excludes the threshold and threshold_ratio keys from the params dict. This is because these attributes are stored alongside the parameters in the config file. But choose_classifier() should not be responsible for handling that. It should just accept a dictionary of hyper-parameters and pass that along to the correct ML model class.

Each time we call choose_classifier(), we have already extracted threshold and threshold_ratio from the hyper-parameters dict. So this change should just require removing some code from choose_classifier() and maybe updating some documentation. This is technically a breaking change, so we should add it in with v4.

@riley-harper riley-harper added this to the v4.0.0 milestone Dec 4, 2024
@riley-harper riley-harper changed the title Don't handle threshold and threshold_ratio in core.pipeline.choose_classifier() Don't handle threshold and threshold_ratio in core.classifier.choose_classifier() Dec 5, 2024
riley-harper added a commit that referenced this issue Dec 5, 2024
The output type of choose_classifier() is really hard to write down
precisely because of the way PySpark types are set up. It's something
like tuple["Classifier", "Transformer"], but for some reason
SQLTransformer is not a subtype of Transformer.
riley-harper added a commit that referenced this issue Dec 5, 2024
The caller is responsible for passing a dictionary of hyper-parameters
to choose_classifier(), and this dictionary should not include hlink's
threshold or threshold_ratio. Both of the places where we call
choose_classifier() (training and model exploration) already handle
this.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant