
Add Randomized Parameter Search #168

Merged: 24 commits merged into v4-dev from randomized_parameter_search on Dec 4, 2024
Conversation

riley-harper
Contributor

This work is for issue #167, which we can close once the v4-dev branch is merged into main and released.

Previously, there were two strategies for searching for the best parameters for a model in model exploration, and the training.param_grid config option switched between them. When param_grid was false, which was the default, model exploration took the contents of training.model_parameters and tested them without any changes or transformations. I've named this strategy "explicit" because the user explicitly writes out each combination of parameters they would like to test. When param_grid was true, the user provided lists of possible values for each parameter in model_parameters. Model exploration then generated every possible combination of those values and tested each combination serially. This strategy is called a "grid" search because it generates a grid of possible parameter combinations.
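For example, with param_grid = true, maxDepth = [5, 7] and numTrees = [50, 100] expand into 2 × 2 = 4 combinations; with param_grid = false, the user would instead write out each of those four combinations as its own model_parameters entry.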

This PR adds a third strategy, "randomized" search, which samples each parameter from a list or distribution to create a set number N of parameter combinations to test. This differs from grid search, in which every possible combination of parameters becomes a test case. Randomized search should speed up searches for parameters. Grid search may still be helpful in some situations when you need more precision and would like to test a range of values very thoroughly.
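For example, a grid over four parameters with five candidate values each means 5 × 5 × 5 × 5 = 625 test cases, while a randomized search with num_samples = 50 tests exactly 50 combinations no matter how long the candidate lists are.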

To add a third strategy, we have deprecated the param_grid option and replaced it with training.model_parameter_search. param_grid still works, but model exploration prints a warning message when you use it. model_parameter_search may be

[training.model_parameter_search]
strategy = "explicit"

or

[training.model_parameter_search]
strategy = "grid"

or

[training.model_parameter_search]
strategy = "randomized"
num_samples = 50

The explicit and grid strategies correspond exactly to the previous behavior of the param_grid option. The randomized strategy adds new behavior. When the strategy is randomized, each parameter in model_parameters may take one of three forms.

  1. A value, like an integer, string, or float. This parameter is "pinned" in place and is not randomized.
  2. A list of values. The value for this parameter will be sampled from the given list at random. Each element has an equal chance of being chosen during each sample. The sampling is with replacement, so the same element may be chosen multiple times.
  3. A dictionary that defines a distribution from which the parameter should be sampled. Model exploration currently supports three distributions: "randint", "uniform", and "normal".
    • randint requires "low" and "high" values and returns a random integer in the inclusive range [low, high].
    • uniform requires "low" and "high" values and returns a random float in the inclusive range [low, high].
    • normal requires a "mean" and a "standard_deviation" and returns a float sampled from the normal distribution with the given mean and standard deviation.

Here's an example configuration for randomized parameter search.

[training.model_parameter_search]
strategy = "randomized"
num_samples = 50

[[training.model_parameters]]
type = "random_forest"
# maxDepth is always 7, and impurity is always "entropy"
maxDepth = 7
impurity = "entropy"
# subsamplingRate is sampled from the interval [0.1, 0.9] uniformly
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
# numTrees is randomly sampled from the list 1, 10, 50, 100
numTrees = [1, 10, 50, 100]
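
For intuition, here is a minimal Python sketch of how a spec like the one above might be sampled. The function names are hypothetical; this is not the actual hlink implementation, just the behavior described in this PR expressed with the standard library's random module.

```python
import random


def sample_parameter(spec):
    """Draw one value for a single parameter spec.

    A plain value is "pinned" and returned as-is, a list is sampled
    uniformly at random, and a dict names a distribution to draw from.
    """
    if isinstance(spec, list):
        return random.choice(spec)
    if isinstance(spec, dict):
        distribution = spec["distribution"]
        if distribution == "randint":
            # random.randint() is inclusive on both ends, matching [low, high]
            return random.randint(spec["low"], spec["high"])
        if distribution == "uniform":
            return random.uniform(spec["low"], spec["high"])
        if distribution == "normal":
            return random.gauss(spec["mean"], spec["standard_deviation"])
        raise ValueError(f"unknown distribution: {distribution}")
    return spec  # pinned value, e.g. maxDepth = 7


def sample_combinations(model_parameters, num_samples):
    """Generate num_samples parameter combinations from one model spec."""
    return [
        {name: sample_parameter(spec) for name, spec in model_parameters.items()}
        for _ in range(num_samples)
    ]
```

Applied to the random_forest spec above with num_samples = 50, this would produce 50 combinations, each with maxDepth pinned at 7 and impurity at "entropy", subsamplingRate drawn uniformly from [0.1, 0.9], and numTrees picked from the list.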

We can just pass the list of model_parameters from the config file to this
function.
This will make this piece of code easier to understand and test.
…rch setting

One of these tests is failing because we haven't implemented this logic in the
_get_model_parameters() function yet.
The new training.model_parameter_search is a more flexible version of
param_grid. We still support param_grid, but eventually we will want to
completely switch over to model_parameter_search instead.
- randint returns a random integer in an inclusive range
- uniform returns a random float in an inclusive range
This makes this code more flexible and easier to understand. It also handles a
weird case where the toml library returns a subclass of dict in some
situations, and built-in Python dicts in other situations.
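
As a small illustration of that quirk, the class below is a stand-in for whatever dict subclass the toml library uses for inline tables, not its actual class:

```python
class InlineTableDict(dict):
    """Stand-in for the dict subclass some TOML parsers use for inline tables."""


spec = InlineTableDict(distribution="uniform", low=0.1, high=0.9)

print(type(spec) is dict)      # False: an exact type check misses the subclass
print(isinstance(spec, dict))  # True: isinstance covers dict and its subclasses
```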
…gy randomized

This lets users set some parameters to a particular value, and only sample
others.  It's mostly a convenience because previously you could get the same
behavior by passing the parameter as a one-element list, like `maxDepth = [7]`.

This commit introduces the extra convenience of just specifying the parameter
as a value, like `maxDepth = 7`. So now you can do something like this:

```
[[training.model_parameters]]
type = "random_forest"
maxDepth = 7
numTrees = [1, 10, 20]
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
```

maxDepth will always be 7, numTrees will be randomly sampled from the list 1,
10, 20, and subsamplingRate will be sampled uniformly from the range [0.1,
0.9].
Only the hyper-parameters to the model should be affected by
training.model_parameter_search.strategy. thresholds and
threshold_ratios should be passed through unchanged on each model.
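In other words, even under the randomized strategy, a threshold given as a list should reach every sampled parameter combination as that same list, not as a single randomly chosen element.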
@riley-harper
Contributor Author

I haven't written user docs for this yet because I figure there will be a lot more changes to model exploration as well. We should make sure to write some good documentation once we have started narrowing in on how everything will work together.

@riley-harper riley-harper requested a review from ccdavis December 3, 2024 16:14
I renamed _get_model_parameters()'s training_config argument to
"training_settings" to match the changes made in v4-dev.

@ccdavis ccdavis left a comment


I read it over and understand the feature. At first I was confused about how num_samples got implemented, but I found it. Good to go.

@riley-harper
Copy link
Contributor Author

The failing test is also failing on v4-dev. It's not related to randomized parameter search as far as I can tell.

@riley-harper riley-harper merged commit 85802d3 into v4-dev Dec 4, 2024
0 of 3 checks passed
@riley-harper riley-harper deleted the randomized_parameter_search branch December 4, 2024 17:55