
Support a randomized parameter search in model exploration #167

Open

riley-harper opened this issue Nov 26, 2024 · 4 comments
Labels
component: model exploration, type: feature (A new feature or enhancement to a feature)

Comments

@riley-harper
Contributor

riley-harper commented Nov 26, 2024

Currently there are two ways to generate the list of model (hyper)parameters to search in model exploration. You can either provide an explicit list of all of the parameter settings that you would like to test, or you can set param_grid = true and provide a grid of parameters and thresholds to test, like this:

model_parameters = [
  {type = "random_forest", maxDepth = [5, 15, 25], numTrees = [50, 75, 100], threshold = [0.5, 0.6, 0.7], threshold_ratio = [1.0, 1.2, 1.3], minInstancesPerNode = [1, 2]}
]

We would like to add a third option, randomized parameter search. With this option, users will specify parameters as either a distribution over a range or a list of choices. They'll set a new num_samples configuration setting which tells hlink how many model parameter settings it should sample from the given distributions.

To do this, we'll need to upgrade from a single param_grid: bool flag to something a little more complex. Maybe a new training.model_parameter_search table would work well:

# Equivalent to param_grid = true. We can still accept param_grid = true but print a
# deprecation message and internally convert it to this representation.
model_parameter_search = {strategy = "grid"}
# Equivalent to param_grid = false. Like param_grid = true, we can accept this but deprecate it.
# In this mode, we just take exactly what's in model_parameters and test it.
# This is still the default.
model_parameter_search = {strategy = "explicit"}
# The new feature.
model_parameter_search = {strategy = "randomized", num_samples = 20}

Users could also write this with TOML's standard table syntax for clarity.

[training.model_parameter_search]
strategy = "randomized"
num_samples = 20

When the strategy is "randomized", each parameter can be either a list of values to sample from (uniformly) or a table that defines a distribution and its arguments. We may be able to make good use of scipy.stats here.

[[model_parameters]]
type = "random_forest"
maxDepth = {low = 5, high = 26, distribution = "randint"}
numTrees = {low = 50, high = 101, distribution = "randint"}
minInstancesPerNode = [1, 2]
# Not entirely sure how this will work yet
threshold = {low = 0.5, high = 0.7, distribution = "uniform"}
threshold_ratio = {low = 1.0, high = 1.3, distribution = "uniform"}
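
As a rough sketch of how the sampling could work under these semantics (this is not hlink's actual implementation; the function name and the exact inclusivity of the ranges are assumptions), each of the num_samples settings would be built by drawing every parameter either from its distribution table or from its list of choices:

```
import random

# Illustrative only: draw one parameter setting from a spec like the
# random_forest example above. A table maps onto a named distribution, and a
# list means "pick one of these values uniformly".
def draw_setting(param_spec, rng):
    setting = {"type": param_spec["type"]}
    for name, spec in param_spec.items():
        if name == "type":
            continue
        if isinstance(spec, dict):
            if spec["distribution"] == "randint":
                setting[name] = rng.randint(spec["low"], spec["high"])
            elif spec["distribution"] == "uniform":
                setting[name] = rng.uniform(spec["low"], spec["high"])
            else:
                raise ValueError(f"unknown distribution: {spec['distribution']}")
        elif isinstance(spec, list):
            setting[name] = rng.choice(spec)
        else:
            raise ValueError(f"expected a list or a distribution table for {name}")
    return setting

rng = random.Random()
spec = {
    "type": "random_forest",
    "maxDepth": {"low": 5, "high": 26, "distribution": "randint"},
    "numTrees": {"low": 50, "high": 101, "distribution": "randint"},
    "minInstancesPerNode": [1, 2],
}
samples = [draw_setting(spec, rng) for _ in range(20)]  # num_samples = 20
```

Whether the low/high bounds are inclusive, and whether to build this on the standard library or on scipy.stats, would be part of the design.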

Outstanding questions:

  • How do thresholds work with randomized parameter search? Is it possible that we'd want to do grid search on the thresholds, but do randomized parameter search on the hyperparameters? Should we have a separate num_threshold_samples to support randomized parameter search on thresholds?
  • Which distributions should we support? "randint" and "uniform" seem indispensable.
@riley-harper
Contributor Author

riley-harper commented Nov 26, 2024

Maybe the way to go is to also support the same options for thresholds with a threshold_search attribute.

riley-harper added a commit that referenced this issue Nov 26, 2024
We can just pass the list of model_parameters from the config file to this
function.
riley-harper added a commit that referenced this issue Nov 26, 2024
This will make this piece of code easier to understand and test.
riley-harper added a commit that referenced this issue Nov 26, 2024
…rch setting

One of these tests is failing because we haven't implemented this logic in the
_get_model_parameters() function yet.
riley-harper added a commit that referenced this issue Nov 27, 2024
The new training.model_parameter_search is a more flexible version of
param_grid. We still support param_grid, but eventually we will want to
completely switch over to model_parameter_search instead.
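
A sketch of how that back-compatibility could look (the function and variable names are illustrative, not hlink's actual code): the deprecated flag is translated into the new table, and "explicit" stays the default.

```
def resolve_model_parameter_search(training_config):
    # Illustrative only: prefer the new table, fall back to the deprecated flag.
    if "model_parameter_search" in training_config:
        return training_config["model_parameter_search"]
    if "param_grid" in training_config:
        print(
            "Deprecation warning: training.param_grid is deprecated, "
            "please use training.model_parameter_search instead"
        )
        strategy = "grid" if training_config["param_grid"] else "explicit"
        return {"strategy": strategy}
    # With neither setting present, "explicit" remains the default behavior.
    return {"strategy": "explicit"}
```
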
riley-harper added a commit that referenced this issue Nov 27, 2024
- randint returns a random integer in an inclusive range
- uniform returns a random float in an inclusive range
riley-harper added a commit that referenced this issue Nov 27, 2024
This makes this code more flexible and easier to understand. It also handles a
weird case where the toml library returns a subclass of dict in some
situations, and built-in Python dicts in other situations.
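
One way to picture that dict-subclass wrinkle (an assumption about the kind of fix, not the actual diff): checking values against collections.abc.Mapping treats the toml library's dict subclass and a plain built-in dict the same way.

```
from collections.abc import Mapping

# Illustrative only: a toml parser may return its own dict subclass, and a
# Mapping check covers both it and a plain dict when looking for a
# distribution table.
def looks_like_distribution(value):
    return isinstance(value, Mapping) and "distribution" in value
```
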
riley-harper added a commit that referenced this issue Nov 27, 2024
…gy randomized

This lets users set some parameters to a particular value and only sample others. It's mostly a convenience: previously you could get the same behavior by passing the parameter as a one-element list, like `maxDepth = [7]`.

This commit introduces the extra convenience of just specifying the parameter
as a value, like `maxDepth = 7`. So now you can do something like this:

```
[[training.model_parameters]]
type = "random_forest"
maxDepth = 7
numTrees = [1, 10, 20]
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
```

maxDepth will always be 7, numTrees will be randomly sampled from the list 1,
10, 20, and subsamplingRate will be sampled uniformly from the range [0.1,
0.9].
@riley-harper
Contributor Author

The last commit makes the randomized search a little more flexible for users by letting them pass particular values, lists to sample from, or dictionaries defining distributions in model_parameters. For example,

[training.model_parameter_search]
strategy = "randomized"
num_samples = 50

[[training.model_parameters]]
type = "random_forest"
# maxDepth is always 7, and impurity is always "entropy"
maxDepth = 7
impurity = "entropy"
# subsamplingRate is sampled from the interval [0.1, 0.9] uniformly
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
# numTrees is randomly sampled from the list 1, 10, 50, 100
numTrees = [1, 10, 50, 100]

@riley-harper
Contributor Author

We should use training.seed to ensure reproducible results for the model parameter search.
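
A minimal illustration of why that works (the literal seed value below just stands in for whatever training.seed is set to): two samplers seeded with the same value produce identical draws, so a parameter search can be reproduced exactly.

```
import random

seed = 2024  # stand-in for the training.seed config value
rng_a = random.Random(seed)
rng_b = random.Random(seed)
# Both generators yield the same sequence of draws.
assert [rng_a.uniform(0.1, 0.9) for _ in range(3)] == [rng_b.uniform(0.1, 0.9) for _ in range(3)]
```
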

@riley-harper
Contributor Author

riley-harper commented Dec 2, 2024

I talked with some users and got some questions answered.

  1. We should use a grid search on thresholds when we sample the parameters with the randomized strategy. This means that we'll need to pass the thresholds through unchanged so that _calc_threshold_matrix() can handle them later.
  2. randint and uniform are about all we need for distributions. It would be nice to have a normal distribution as well, which should be pretty easy to add; I'm pretty sure it's supported by the Python random.Random() class (see the sketch below).
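
For what it's worth, random.Random does provide a normal draw via normalvariate(mu, sigma), so a hypothetical "normal" table could map onto it roughly like this (the mean/standard_deviation key names are placeholders, not settled config syntax):

```
import random

rng = random.Random()
# Hypothetical config: rate = {distribution = "normal", mean = 0.5, standard_deviation = 0.1}
spec = {"distribution": "normal", "mean": 0.5, "standard_deviation": 0.1}
value = rng.normalvariate(spec["mean"], spec["standard_deviation"])
```
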

riley-harper added a commit that referenced this issue Dec 2, 2024
Only the hyper-parameters to the model should be affected by
training.model_parameter_search.strategy. thresholds and
threshold_ratios should be passed through unchanged on each model.
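
A minimal sketch of that separation (the helper name and key list are assumptions, not hlink's code): threshold settings are split off before sampling and reattached untouched, leaving _calc_threshold_matrix() to expand them later.

```
# Illustrative only: keep threshold settings out of the randomized sampling.
THRESHOLD_KEYS = {"threshold", "threshold_ratio"}

def split_model_parameters(param_spec):
    thresholds = {k: v for k, v in param_spec.items() if k in THRESHOLD_KEYS}
    hyper_params = {k: v for k, v in param_spec.items() if k not in THRESHOLD_KEYS}
    return hyper_params, thresholds
```
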
riley-harper added this to the v4.0.0 milestone Dec 4, 2024
riley-harper added the type: feature (A new feature or enhancement to a feature) label and removed the enhancement label Dec 4, 2024