Support a randomized parameter search in model exploration #167
Comments
Maybe the way to go is to also support the same options for thresholds with a `num_threshold_samples` setting.
We can just pass the list of model_parameters from the config file to this function.
This will make this piece of code easier to understand and test.
…rch setting

One of these tests is failing because we haven't implemented this logic in the `_get_model_parameters()` function yet.
The new training.model_parameter_search is a more flexible version of param_grid. We still support param_grid, but eventually we will want to completely switch over to model_parameter_search instead.
- `randint` returns a random integer in an inclusive range
- `uniform` returns a random float in an inclusive range
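For example, a config sketch along these lines (the parameter names are just illustrations borrowed from the random_forest examples below, and the `low`/`high` keys for `randint` are an assumption mirroring the `uniform` example):

```
[[training.model_parameters]]
type = "random_forest"
# randint: a random integer from the inclusive range [1, 20]
maxDepth = {distribution = "randint", low = 1, high = 20}
# uniform: a random float from the inclusive range [0.1, 0.9]
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
```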
This makes this code more flexible and easier to understand. It also handles a weird case where the toml library returns a subclass of dict in some situations, and built-in Python dicts in other situations.
…gy randomized

This lets users set some parameters to a particular value, and only sample others. It's mostly a convenience because previously you could get the same behavior by passing the parameter as a one-element list, like `maxDepth = [7]`. This commit introduces the extra convenience of just specifying the parameter as a value, like `maxDepth = 7`. So now you can do something like this:

```
[[training.model_parameters]]
type = "random_forest"
maxDepth = 7
numTrees = [1, 10, 20]
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
```

maxDepth will always be 7, numTrees will be randomly sampled from the list 1, 10, 20, and subsamplingRate will be sampled uniformly from the range [0.1, 0.9].
The last commit makes the randomized search a little more flexible for users by letting them pass particular values, lists to sample from, or dictionaries defining distributions:

```
[training.model_parameter_search]
strategy = "randomized"
num_samples = 50

[[training.model_parameters]]
type = "random_forest"
# maxDepth is always 7, and impurity is always "entropy"
maxDepth = 7
impurity = "entropy"
# subsamplingRate is sampled from the interval [0.1, 0.9] uniformly
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
# numTrees is randomly sampled from the list 1, 10, 50, 100
numTrees = [1, 10, 50, 100]
```
We should use …
I talked with some users and got some questions answered.

Only the hyper-parameters to the model should be affected by `training.model_parameter_search.strategy`. `thresholds` and `threshold_ratios` should be passed through unchanged on each model.
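For example, under a config sketch like this one (the threshold values are made up, and exactly where the threshold keys live in the config is an assumption), only the hyper-parameters would be sampled; the thresholds would be applied to every sampled model unchanged:

```
[training.model_parameter_search]
strategy = "randomized"
num_samples = 25

[[training.model_parameters]]
type = "random_forest"
numTrees = [1, 10, 50, 100]
subsamplingRate = {distribution = "uniform", low = 0.1, high = 0.9}
# Not sampled: each of the 25 sampled models is evaluated at these values as-is
thresholds = [0.5, 0.7, 0.9]
threshold_ratios = [1.0, 1.3]
```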
Currently there are two ways to generate the list of model (hyper)parameters to search in model exploration. You can either provide a list of all of the models that you would like to test, or you can set `param_grid = true` and provide a grid of parameters and thresholds to test, like this:
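(A sketch of that kind of grid config; the keys and values here are illustrative rather than copied from a real hlink config.)

```
[training]
param_grid = true

[[training.model_parameters]]
type = "random_forest"
maxDepth = [5, 7, 9]
numTrees = [10, 50, 100]
thresholds = [0.5, 0.8]
threshold_ratios = [1.0, 1.3]
```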
We would like to add a third option, randomized parameter search. With this option, users will specify parameters as either a distribution over a range or a list of choices. They'll set a new `num_samples` configuration setting which tells hlink how many model parameter settings it should sample from the given distributions.

To do this, we'll need to upgrade from a single `param_grid: bool` flag to something a little more complex. Maybe a new `training.model_parameter_search` table would work well; users could also write this with the other table syntax for clarity.
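For instance, something like this sketch, first as an inline table and then with the standard table syntax (built from the `strategy` and `num_samples` settings discussed in the comments above):

```
[training]
model_parameter_search = {strategy = "randomized", num_samples = 50}
```

or, equivalently:

```
[training.model_parameter_search]
strategy = "randomized"
num_samples = 50
```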
When the strategy is "randomized", parameters can either be a list of values to be sampled from (uniformly) or a table which defines a distribution and arguments for the distribution. We may be able to make good use of `scipy.stats` here.
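As a rough sketch of how `scipy.stats` could back those distribution tables (this is not hlink's implementation; the `sample_parameter` helper and the spec format are hypothetical, modeled on the TOML examples in this issue):

```
from scipy.stats import randint, uniform


def sample_parameter(spec):
    """Draw one value for a single model parameter spec.

    A bare value is kept as-is, a list means "pick one element at random",
    and a dict names a distribution to sample from.
    """
    if isinstance(spec, dict):
        if spec["distribution"] == "randint":
            # scipy's randint excludes the upper bound, so add 1 to get the
            # inclusive range described in this issue
            return int(randint(spec["low"], spec["high"] + 1).rvs())
        if spec["distribution"] == "uniform":
            # uniform(loc, scale) samples floats from [loc, loc + scale]
            return float(uniform(spec["low"], spec["high"] - spec["low"]).rvs())
        raise ValueError(f"unknown distribution: {spec['distribution']}")
    if isinstance(spec, list):
        # choose one element of the list at random
        return spec[randint(0, len(spec)).rvs()]
    # a bare value is fixed and returned unchanged for every sample
    return spec


# One randomized sample for the random_forest settings above
params = {
    "maxDepth": 7,
    "numTrees": [1, 10, 50, 100],
    "subsamplingRate": {"distribution": "uniform", "low": 0.1, "high": 0.9},
}
print({name: sample_parameter(spec) for name, spec in params.items()})
```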
Outstanding questions:

- `num_threshold_samples` to support randomized parameter search on thresholds?