
Add Randomized Parameter Search #168

Merged: 24 commits merged into v4-dev from randomized_parameter_search on Dec 4, 2024
Conversation

riley-harper
Contributor

This work is for issue #167, which we can close once the v4-dev branch is merged into main and released.

Previously, there were two strategies for searching for the best parameters for a model in model exploration, and the training.param_grid config option switched between them. When param_grid was false, which was the default, model exploration took the contents of training.model_parameters and tested them without any changes or transformations. I've named this strategy "explicit" because the user explicitly writes out each combination of parameters they would like to test. When param_grid was true, the user provided lists of possible values for each parameter in model_parameters. Model exploration then generated every possible combination of those values and tested each combination serially. This strategy is called a "grid" search because it generates a grid of possible parameter combinations.
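For example, with param_grid = true, maxDepth = [5, 7] and numTrees = [50, 100] expand into 2 × 2 = 4 combinations; with param_grid = false, the user would instead write out each of those four combinations as its own model_parameters entry.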

This PR adds a third strategy, "randomized" search, which samples each parameter from a list or distribution to create a set number N of parameter combinations to test. This differs from grid search, in which every possible combination of parameters becomes a test case. Randomized search should speed up searches for parameters. Grid search may still be helpful in some situations when you need more precision and would like to test a range of values very thoroughly.
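For example, a grid over four parameters with five candidate values each means 5 × 5 × 5 × 5 = 625 test cases, while a randomized search with num_samples = 50 tests exactly 50 combinations no matter how long the candidate lists are.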

To add a third strategy, we have deprecated the param_grid option and replaced it with training.model_parameter_search. param_grid still works, but model exploration prints a warning message when you use it. model_parameter_search may be

[training.model_parameter_search]
strategy = "explicit"

or

[training.model_parameter_search]
strategy = "grid"

or

[training.model_parameter_search]
strategy = "randomized"
num_samples = 50

The explicit and grid strategies correspond exactly to the previous behavior of the param_grid option. The randomized strategy adds new behavior. When the strategy is randomized, each parameter in model_parameters may take one of three forms.

  1. A value, like an integer, string, or float. This parameter is "pinned" in place and is not randomized.
  2. A list of values. The value for this parameter will be sampled from the given list at random. Each element has an equal chance of being chosen during each sample. The sampling is with replacement, so the same element may be chosen multiple times.
  3. A dictionary that defines a distribution from which the parameter should be sampled. Model exploration currently supports three distributions: "randint", "uniform", and "normal".
    • randint requires "low" and "high" values and returns a random integer in the inclusive range [low, high].
    • uniform requires "low" and "high" values and returns a random float in the inclusive range [low, high].
    • normal requires a "mean" and a "standard_deviation" and returns a float sampled from the normal distribution with the given mean and standard deviation.

Here's an example configuration for randomized parameter search.

[training.model_parameter_search]
strategy = "randomized"
num_samples = 50

[[training.model_parameters]]
type = "random_forest"
# maxDepth is always 7, and impurity is always "entropy"
maxDepth = 7
impurity = "entropy"
# subsamplingRate is sampled from the interval [0.1, 0.9] uniformly
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
# numTrees is randomly sampled from the list 1, 10, 50, 100
numTrees = [1, 10, 50, 100]
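
For intuition, here is a minimal Python sketch of how a spec like the one above might be sampled. The function names are hypothetical; this is not the actual hlink implementation, just the behavior described in this PR expressed with the standard library's random module.

```python
import random


def sample_parameter(spec):
    """Draw one value for a single parameter spec.

    A plain value is "pinned" and returned as-is, a list is sampled
    uniformly at random, and a dict names a distribution to draw from.
    """
    if isinstance(spec, list):
        return random.choice(spec)
    if isinstance(spec, dict):
        distribution = spec["distribution"]
        if distribution == "randint":
            # random.randint() is inclusive on both ends, matching [low, high]
            return random.randint(spec["low"], spec["high"])
        if distribution == "uniform":
            return random.uniform(spec["low"], spec["high"])
        if distribution == "normal":
            return random.gauss(spec["mean"], spec["standard_deviation"])
        raise ValueError(f"unknown distribution: {distribution}")
    return spec  # pinned value, e.g. maxDepth = 7


def sample_combinations(model_parameters, num_samples):
    """Generate num_samples parameter combinations from one model spec."""
    return [
        {name: sample_parameter(spec) for name, spec in model_parameters.items()}
        for _ in range(num_samples)
    ]
```

Applied to the random_forest spec above with num_samples = 50, this would produce 50 combinations, each with maxDepth pinned at 7 and impurity at "entropy", subsamplingRate drawn uniformly from [0.1, 0.9], and numTrees picked from the list.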

We can just pass the list of model_parameters from the config file to this
function.
This will make this piece of code easier to understand and test.
…rch setting

One of these tests is failing because we haven't implemented this logic in the
_get_model_parameters() function yet.
The new training.model_parameter_search is a more flexible version of
param_grid. We still support param_grid, but eventually we will want to
completely switch over to model_parameter_search instead.
- randint returns a random integer in an inclusive range
- uniform returns a random float in an inclusive range
This makes this code more flexible and easier to understand. It also handles a
weird case where the toml library returns a subclass of dict in some
situations, and built-in Python dicts in other situations.
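
As a small illustration of that quirk, the class below is a stand-in for whatever dict subclass the toml library uses for inline tables, not its actual class:

```python
class InlineTableDict(dict):
    """Stand-in for the dict subclass some TOML parsers use for inline tables."""


spec = InlineTableDict(distribution="uniform", low=0.1, high=0.9)

print(type(spec) is dict)      # False: an exact type check misses the subclass
print(isinstance(spec, dict))  # True: isinstance covers dict and its subclasses
```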
…gy randomized

This lets users set some parameters to a particular value, and only sample
others.  It's mostly a convenience because previously you could get the same
behavior by passing the parameter as a one-element list, like `maxDepth = [7]`.

This commit introduces the extra convenience of just specifying the parameter
as a value, like `maxDepth = 7`. So now you can do something like this:

```
[[training.model_parameters]]
type = "random_forest"
maxDepth = 7
numTrees = [1, 10, 20]
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
```

maxDepth will always be 7, numTrees will be randomly sampled from the list 1,
10, 20, and subsamplingRate will be sampled uniformly from the range [0.1,
0.9].
Only the hyper-parameters to the model should be affected by
training.model_parameter_search.strategy. thresholds and
threshold_ratios should be passed through unchanged on each model.
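In other words, even under the randomized strategy, a threshold given as a list should reach every sampled parameter combination as that same list, not as a single randomly chosen element.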
@riley-harper
Contributor Author

I haven't written user docs for this yet because I figure there will be a lot more changes to model exploration as well. We should make sure to write some good documentation once we have started narrowing in on how everything will work together.

@riley-harper riley-harper requested a review from ccdavis December 3, 2024 16:14
I renamed _get_model_parameters()'s training_config argument to
"training_settings" to match the changes made in v4-dev.

@ccdavis ccdavis left a comment


I read it over and understand the feature. At first I was confused about how num_samples got implemented, but I found it. Good to go.

@riley-harper
Copy link
Contributor Author

The failing test is also failing on v4-dev. It's not related to randomized parameter search as far as I can tell.

@riley-harper riley-harper merged commit 85802d3 into v4-dev Dec 4, 2024
0 of 3 checks passed
@riley-harper riley-harper deleted the randomized_parameter_search branch December 4, 2024 17:55