
Support a randomized parameter search in model exploration #167

Open

riley-harper opened this issue Nov 26, 2024 · 4 comments
Labels
component: model exploration, type: feature (A new feature or enhancement to a feature)

Comments

@riley-harper
Contributor

riley-harper commented Nov 26, 2024

Currently there are two ways to generate the list of model (hyper)parameters to search in model exploration. You can either provide an explicit list of all of the parameter settings that you would like to test, or you can set param_grid = true and provide a grid of parameters and thresholds to test, like this:

model_parameters = [
  {type = "random_forest", maxDepth = [5, 15, 25], numTrees = [50, 75, 100], threshold = [0.5, 0.6, 0.7], threshold_ratio = [1.0, 1.2, 1.3], minInstancesPerNode = [1, 2]}
]

We would like to add a third option, randomized parameter search. With this option, users will specify parameters as either a distribution over a range or a list of choices. They'll set a new num_samples configuration setting which tells hlink how many model parameter settings it should sample from the given distributions.

To do this, we'll need to upgrade from a single param_grid: bool flag to something a little more complex. Maybe a new training.model_parameter_search table would work well:

# Equivalent to param_grid = true. We can still accept param_grid = true but print a
# deprecation message and internally convert it to this representation.
model_parameter_search = {strategy = "grid"}
# Equivalent to param_grid = false. Like param_grid = true, we can accept this but deprecate it.
# In this mode, we just take exactly what's in model_parameters and test it.
# This is still the default.
model_parameter_search = {strategy = "explicit"}
# The new feature.
model_parameter_search = {strategy = "randomized", num_samples = 20}

Users could also write this with TOML's standard table syntax for clarity.

[training.model_parameter_search]
strategy = "randomized"
num_samples = 20

When the strategy is "randomized", each parameter can be either a list of values to sample from (uniformly) or a table that defines a distribution and its arguments. We may be able to make good use of scipy.stats here.

[[model_parameters]]
type = "random_forest"
maxDepth = {low = 5, high = 26, distribution = "randint"}
numTrees = {low = 50, high = 101, distribution = "randint"}
minInstancesPerNode = [1, 2]
# Not entirely sure how this will work yet
threshold = {low = 0.5, high = 0.7, distribution = "uniform"}
threshold_ratio = {low = 1.0, high = 1.3, distribution = "uniform"}
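
As a rough sketch of how the sampling could work under these semantics (this is not hlink's actual implementation; the function name and the exact inclusivity of the ranges are assumptions), each of the num_samples settings would be built by drawing every parameter either from its distribution table or from its list of choices:

```
import random

# Illustrative only: draw one parameter setting from a spec like the
# random_forest example above. A table maps onto a named distribution, and a
# list means "pick one of these values uniformly".
def draw_setting(param_spec, rng):
    setting = {"type": param_spec["type"]}
    for name, spec in param_spec.items():
        if name == "type":
            continue
        if isinstance(spec, dict):
            if spec["distribution"] == "randint":
                setting[name] = rng.randint(spec["low"], spec["high"])
            elif spec["distribution"] == "uniform":
                setting[name] = rng.uniform(spec["low"], spec["high"])
            else:
                raise ValueError(f"unknown distribution: {spec['distribution']}")
        elif isinstance(spec, list):
            setting[name] = rng.choice(spec)
        else:
            raise ValueError(f"expected a list or a distribution table for {name}")
    return setting

rng = random.Random()
spec = {
    "type": "random_forest",
    "maxDepth": {"low": 5, "high": 26, "distribution": "randint"},
    "numTrees": {"low": 50, "high": 101, "distribution": "randint"},
    "minInstancesPerNode": [1, 2],
}
samples = [draw_setting(spec, rng) for _ in range(20)]  # num_samples = 20
```

Whether the low/high bounds are inclusive, and whether to build this on the standard library or on scipy.stats, would be part of the design.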

Outstanding questions:

  • How do thresholds work with randomized parameter search? Is it possible that we'd want to do grid search on the thresholds, but do randomized parameter search on the hyperparameters? Should we have a separate num_threshold_samples to support randomized parameter search on thresholds?
  • Which distributions should we support? "randint" and "uniform" seem indispensable.
@riley-harper
Contributor Author

riley-harper commented Nov 26, 2024

Maybe the way to go is to also support the same options for thresholds with a threshold_search attribute.

riley-harper added a commit that referenced this issue Nov 26, 2024
We can just pass the list of model_parameters from the config file to this
function.
riley-harper added a commit that referenced this issue Nov 26, 2024
This will make this piece of code easier to understand and test.
riley-harper added a commit that referenced this issue Nov 26, 2024
…rch setting

One of these tests is failing because we haven't implemented this logic in the
_get_model_parameters() function yet.
riley-harper added a commit that referenced this issue Nov 27, 2024
The new training.model_parameter_search is a more flexible version of
param_grid. We still support param_grid, but eventually we will want to
completely switch over to model_parameter_search instead.
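
A sketch of how that back-compatibility could look (the function and variable names are illustrative, not hlink's actual code): the deprecated flag is translated into the new table, and "explicit" stays the default.

```
def resolve_model_parameter_search(training_config):
    # Illustrative only: prefer the new table, fall back to the deprecated flag.
    if "model_parameter_search" in training_config:
        return training_config["model_parameter_search"]
    if "param_grid" in training_config:
        print(
            "Deprecation warning: training.param_grid is deprecated, "
            "please use training.model_parameter_search instead"
        )
        strategy = "grid" if training_config["param_grid"] else "explicit"
        return {"strategy": strategy}
    # With neither setting present, "explicit" remains the default behavior.
    return {"strategy": "explicit"}
```
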
riley-harper added a commit that referenced this issue Nov 27, 2024
- randint returns a random integer in an inclusive range
- uniform returns a random float in an inclusive range
riley-harper added a commit that referenced this issue Nov 27, 2024
This makes this code more flexible and easier to understand. It also handles a
weird case where the toml library returns a subclass of dict in some
situations, and built-in Python dicts in other situations.
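
One way to picture that dict-subclass wrinkle (an assumption about the kind of fix, not the actual diff): checking values against collections.abc.Mapping treats the toml library's dict subclass and a plain built-in dict the same way.

```
from collections.abc import Mapping

# Illustrative only: a toml parser may return its own dict subclass, and a
# Mapping check covers both it and a plain dict when looking for a
# distribution table.
def looks_like_distribution(value):
    return isinstance(value, Mapping) and "distribution" in value
```
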
riley-harper added a commit that referenced this issue Nov 27, 2024
…gy randomized

This lets users set some parameters to a particular value and only sample others. It's mostly a convenience: previously you could get the same behavior by passing the parameter as a one-element list, like `maxDepth = [7]`.

This commit introduces the extra convenience of just specifying the parameter
as a value, like `maxDepth = 7`. So now you can do something like this:

```
[[training.model_parameters]]
type = "random_forest"
maxDepth = 7
numTrees = [1, 10, 20]
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
```

maxDepth will always be 7, numTrees will be randomly sampled from the list 1,
10, 20, and subsamplingRate will be sampled uniformly from the range [0.1,
0.9].
@riley-harper
Contributor Author

The last commit makes the randomized search a little more flexible for users by letting them pass particular values, lists to sample from, or dictionaries defining distributions in model_parameters. For example,

[training.model_parameter_search]
strategy = "randomized"
num_samples = 50

[[training.model_parameters]]
type = "random_forest"
# maxDepth is always 7, and impurity is always "entropy"
maxDepth = 7
impurity = "entropy"
# subsamplingRate is sampled from the interval [0.1, 0.9] uniformly
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
# numTrees is randomly sampled from the list 1, 10, 50, 100
numTrees = [1, 10, 50, 100]

@riley-harper
Contributor Author

We should use training.seed to ensure reproducible results for the model parameter search.
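
A minimal illustration of why that works (the literal seed value below just stands in for whatever training.seed is set to): two samplers seeded with the same value produce identical draws, so a parameter search can be reproduced exactly.

```
import random

seed = 2024  # stand-in for the training.seed config value
rng_a = random.Random(seed)
rng_b = random.Random(seed)
# Both generators yield the same sequence of draws.
assert [rng_a.uniform(0.1, 0.9) for _ in range(3)] == [rng_b.uniform(0.1, 0.9) for _ in range(3)]
```
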

@riley-harper
Contributor Author

riley-harper commented Dec 2, 2024

I talked with some users and got some questions answered.

  1. We should use a grid search on thresholds when we sample the parameters with the randomized strategy. This means that we'll need to pass the thresholds through unchanged so that _calc_threshold_matrix() can handle them later.
  2. randint and uniform are about all we need for distributions. It would be nice to have a normal distribution as well, which should be pretty easy to add; I'm pretty sure it's supported by the Python random.Random() class (see the sketch below).
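
For what it's worth, random.Random does provide a normal draw via normalvariate(mu, sigma), so a hypothetical "normal" table could map onto it roughly like this (the mean/standard_deviation key names are placeholders, not settled config syntax):

```
import random

rng = random.Random()
# Hypothetical config: rate = {distribution = "normal", mean = 0.5, standard_deviation = 0.1}
spec = {"distribution": "normal", "mean": 0.5, "standard_deviation": 0.1}
value = rng.normalvariate(spec["mean"], spec["standard_deviation"])
```
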

riley-harper added a commit that referenced this issue Dec 2, 2024
Only the hyper-parameters to the model should be affected by
training.model_parameter_search.strategy. thresholds and
threshold_ratios should be passed through unchanged on each model.
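
A minimal sketch of that separation (the helper name and key list are assumptions, not hlink's code): threshold settings are split off before sampling and reattached untouched, leaving _calc_threshold_matrix() to expand them later.

```
# Illustrative only: keep threshold settings out of the randomized sampling.
THRESHOLD_KEYS = {"threshold", "threshold_ratio"}

def split_model_parameters(param_spec):
    thresholds = {k: v for k, v in param_spec.items() if k in THRESHOLD_KEYS}
    hyper_params = {k: v for k, v in param_spec.items() if k not in THRESHOLD_KEYS}
    return hyper_params, thresholds
```
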
riley-harper added this to the v4.0.0 milestone Dec 4, 2024
riley-harper added the type: feature (A new feature or enhancement to a feature) label and removed the enhancement label Dec 4, 2024