
Voting methods for feature ranking in efs #112

Merged: 75 commits from voting_methods into main, Nov 30, 2024
a5d1b38
add stability selection article
bblodfon Jul 31, 2024
4cc3815
add Rcpp code for approval voting feature ranking method
bblodfon Jul 31, 2024
21ae7d7
add citation
bblodfon Jul 31, 2024
ccffa4b
extra check during init()
bblodfon Jul 31, 2024
108ddc2
update doc + use the Rcpp interface for approval voting
bblodfon Jul 31, 2024
589df2e
add templates for params in ArchiveBatchFSelect + updocs
bblodfon Jul 31, 2024
e520c77
use testthat expectations (not checkmate ones!)
bblodfon Jul 31, 2024
0ecc618
add test for newly implemented voting methods
bblodfon Jul 31, 2024
2622c96
update test for av
bblodfon Jul 31, 2024
97f21c4
fix note
bblodfon Jul 31, 2024
f84f91c
refactor AV_rcpp, add SAV_rcpp
bblodfon Aug 1, 2024
3614d93
add norm_score, and SAV R function
bblodfon Aug 1, 2024
0a1eb49
add sav, improve doc
bblodfon Aug 1, 2024
fc5d24d
fix efs test
bblodfon Aug 1, 2024
6df3bbd
update and improve test for AV
bblodfon Aug 1, 2024
fc86503
add sav test
bblodfon Aug 1, 2024
0d9eccf
Merge branch 'main' into voting_methods
bblodfon Aug 7, 2024
87d68d4
add borda score
bblodfon Aug 7, 2024
fa05f09
update tests
bblodfon Aug 7, 2024
6a89966
add seq and revseq PAV Rcpp methods
bblodfon Aug 12, 2024
5c09975
add R functions for the PAV methods
bblodfon Aug 12, 2024
103bf45
comment printing
bblodfon Aug 12, 2024
ff17d11
add tests for PAV methods
bblodfon Aug 12, 2024
b6f4b5e
add PAV methods to efs
bblodfon Aug 12, 2024
3a248cf
refactor: do not use C++ RNGs
bblodfon Aug 13, 2024
92ce0df
fix startsWith
bblodfon Aug 13, 2024
283003e
updocs
bblodfon Aug 13, 2024
567f456
fix data.table note
bblodfon Aug 13, 2024
e55ae24
add committee_size parameter, refactor borda score
bblodfon Aug 19, 2024
9a37e60
add large data test for seq pav
bblodfon Aug 19, 2024
58ab928
refactor C++ code, add optimized PAV
bblodfon Aug 21, 2024
61c0907
remove revseq-PAV method, use optimized seqPAV
bblodfon Aug 21, 2024
8654a38
update tests
bblodfon Aug 21, 2024
47e3dcf
remove suboptimal seqPAV function
bblodfon Aug 23, 2024
b369c6e
shuffle candidates outside Rcpp functions (same tie-breaking)
bblodfon Aug 23, 2024
6b7fb03
optimize Phragmen a bit => do not randomly select the candidate with …
bblodfon Aug 23, 2024
60065f9
add phragmen's rule in efs
bblodfon Aug 23, 2024
8ffa44f
correct borda score + use phragmens rule
bblodfon Aug 23, 2024
852ff35
add tests for Phragmen's rule
bblodfon Aug 23, 2024
5623812
correct weighted Phragmen's rule
bblodfon Sep 18, 2024
7e3be3e
add specific test for phragmen's rule
bblodfon Sep 18, 2024
25387c4
Merge branch 'main' into voting_methods
bblodfon Sep 19, 2024
1eef6c6
run document()
bblodfon Sep 19, 2024
f2ccbda
show data.table result after using ':='
bblodfon Oct 17, 2024
bea5e39
add n_resamples field + nicer obj print
bblodfon Oct 17, 2024
2d21fc7
cover edge case (eg lasso resulted in no features getting selected)
bblodfon Oct 24, 2024
ad9fd2e
Merge branch 'main' into voting_methods
bblodfon Oct 25, 2024
7f3ab3b
updocs
bblodfon Oct 25, 2024
4137404
small styling fix
bblodfon Oct 25, 2024
d151303
add Stabl ref
bblodfon Oct 31, 2024
83529b6
more descriptive name
bblodfon Oct 31, 2024
49bb097
add embedded ensemble feature selection
bblodfon Oct 31, 2024
6f3923f
remove print()
bblodfon Nov 1, 2024
123624e
add TOCHECK comment on benchmark design
bblodfon Nov 5, 2024
0581cdc
use internal valid task
be-marc Nov 11, 2024
14acd73
simplify
be-marc Nov 11, 2024
81b475d
...
be-marc Nov 11, 2024
79747ad
store_models = FALSE
be-marc Nov 11, 2024
331f231
...
be-marc Nov 11, 2024
081acc8
separate the use of inner_measure and measure used in the test sets
bblodfon Nov 18, 2024
efc0155
updocs
bblodfon Nov 18, 2024
0e2f93f
update tests
bblodfon Nov 18, 2024
3bca203
Merge branch 'main' into voting_methods
bblodfon Nov 18, 2024
d457221
refactor: expect_vector => expect_numeric
bblodfon Nov 18, 2024
9cb56b1
fix partial arg match
bblodfon Nov 18, 2024
cc36179
fix example
bblodfon Nov 18, 2024
816376a
use fastVoteR for feature ranking
bblodfon Nov 23, 2024
3dae249
pass named list to callback parameter
be-marc Nov 25, 2024
fd5afbc
skip test if fastVoteR is not available
bblodfon Nov 25, 2024
c937024
refactor: better handling of inner measure
bblodfon Nov 26, 2024
8e506c8
add tests for embedded_ensemble_fselect()
bblodfon Nov 26, 2024
3bd1772
update NEWs
bblodfon Nov 26, 2024
9e05dca
add active_measure field
bblodfon Nov 26, 2024
832bd7f
remove Remotes as fastVoteR is now on CRAN :)
bblodfon Nov 27, 2024
8c0d73f
refine doc
bblodfon Nov 29, 2024
2 changes: 2 additions & 0 deletions DESCRIPTION
@@ -41,6 +41,7 @@ Suggests:
mlr3learners,
mlr3pipelines,
rpart,
fastVoteR,
testthat (>= 3.0.0)
Config/testthat/edition: 3
Config/testthat/parallel: true
@@ -74,6 +75,7 @@ Collate:
'assertions.R'
'auto_fselector.R'
'bibentries.R'
'embedded_ensemble_fselect.R'
'ensemble_fselect.R'
'extract_inner_fselect_archives.R'
'extract_inner_fselect_results.R'
1 change: 1 addition & 0 deletions NAMESPACE
@@ -36,6 +36,7 @@ export(auto_fselector)
export(callback_batch_fselect)
export(clbk)
export(clbks)
export(embedded_ensemble_fselect)
export(ensemble_fselect)
export(extract_inner_fselect_archives)
export(extract_inner_fselect_results)
4 changes: 4 additions & 0 deletions NEWS.md
@@ -1,5 +1,9 @@
# mlr3fselect (development version)

* Use [fastVoteR](https://github.com/bblodfon/fastVoteR) for feature ranking in `EnsembleFSResult()` objects
* Add embedded ensemble feature selection `embedded_ensemble_fselect()`
* Refactor `ensemble_fselect()` and `EnsembleFSResult()`

# mlr3fselect 1.2.1

* compatibility: mlr3 0.22.0
235 changes: 182 additions & 53 deletions R/EnsembleFSResult.R
@@ -16,7 +16,7 @@
#' Whether to add the learner, task and resampling information from the benchmark result.
#'
#' @references
#' `r format_bib("das1999")`
#' `r format_bib("das1999", "meinshausen2010")`
#'
#' @export
#' @examples
@@ -27,7 +27,8 @@
#' learners = lrns(c("classif.rpart", "classif.featureless")),
#' init_resampling = rsmp("subsampling", repeats = 2),
#' inner_resampling = rsmp("cv", folds = 3),
#' measure = msr("classif.ce"),
#' inner_measure = msr("classif.ce"),
#' measure = msr("classif.acc"),
#' terminator = trm("none")
#' )
#'
@@ -43,7 +44,16 @@
#' # returns a ranking of all features
#' head(efsr$feature_ranking())
#'
#' # returns the empirical pareto front (nfeatures vs error)
#' # returns the empirical pareto front, i.e. n_features vs measure (error)
#' efsr$pareto_front()
#'
#' # returns the knee points (optimal trade-off between n_features and performance)
#' efsr$knee_points()
#'
#' # change to use the inner optimization measure
#' efsr$set_active_measure(which = "inner")
#'
#' # Pareto front is calculated on the inner measure
#' efsr$pareto_front()
#' }
EnsembleFSResult = R6Class("EnsembleFSResult",
@@ -62,26 +72,53 @@ EnsembleFSResult = R6Class("EnsembleFSResult",
#'
#' @param result ([data.table::data.table])\cr
#' The result of the ensemble feature selection.
#' Column names should include `"resampling_iteration"`, `"learner_id"`, `"features"`
#' and `"n_features"`.
#' Mandatory column names should include `"resampling_iteration"`, `"learner_id"`,
#' `"features"` and `"n_features"`.
#' A column named as `{measure$id}` (scores on the test sets) must also be
#' always present.
#' The column with the `{inner_measure$id}` (scores on the train sets) is not mandatory,
#' but note that it should be named as `{inner_measure$id}_inner` to distinguish from
#' the `{measure$id}`.
#' @param features ([character()])\cr
#' The vector of features of the task that was used in the ensemble feature
#' selection.
#' @param benchmark_result ([mlr3::BenchmarkResult])\cr
#' The benchmark result object.
#' @param measure_id (`character(1)`)\cr
#' Column name of `"result"` that corresponds to the measure used.
#' @param minimize (`logical(1)`)\cr
#' If `TRUE` (default), lower values of the measure correspond to higher performance.
initialize = function(result, features, benchmark_result = NULL, measure_id,
minimize = TRUE) {
#' @param measure ([mlr3::Measure])\cr
#' The measure used to score the learners on the test sets generated
#' during the ensemble feature selection process.
#' This will be the 'active' measure used in methods of this object, but this
#' can be changed with `$set_active_measure()`.
#' @param inner_measure ([mlr3::Measure])\cr
#' The inner measure used to optimize and score the learners on the train sets
#' generated during the ensemble feature selection process.
Member (be-marc):

Can we say that differently? Scoring on a train set sounds wrong. Is this the outer train set which is split by the inner resampling? We score the inner resample result?

Contributor Author (bblodfon):

Yes, it's the outer train set. The inner_resampling generates N train/test splits. The inner_measure is used to optimize/tune on the train set and you get the best subset and final model + score on that train set. We use these final models to also score the corresponding test splits (the inner resampling result you ask about), with the measure. In embedded efs we only do the second (no inner_measure is needed/used).

Contributor Author (bblodfon):

I can change the wording to specifically mention the train/test splits of the inner resampling (I also mention that earlier in the doc), what do you think?

Member (be-marc):

> you get the best subset and final model + score on that train set

It is the final model with the best subset and corresponding performance estimated on the inner resampling. There is no scoring on the outer training set but scoring on the inner resampling result. This is very similar to nested resampling. Maybe stick to the words used below figure 4.5:

https://mlr3book.mlr-org.com/chapters/chapter4/hyperparameter_optimization.html#sec-nested-resampling

Contributor Author (bblodfon):

Yes, sorry Marc, it's as you say. When I was writing the above comment, I meant the outer resampling (what we call init_resampling) as the one that generates the train/test splits. And yes, we are pretty much doing nested CV, with the outer resampling being the N-times holdout split. I will update the doc.
initialize = function(
result,
features,
benchmark_result = NULL,
measure,
inner_measure = NULL
) {
assert_data_table(result)
private$.measure_id = assert_string(measure_id, null.ok = FALSE)
mandatory_columns = c("resampling_iteration", "learner_id", "features", "n_features")
assert_names(names(result), must.include = c(mandatory_columns, measure_id))
private$.measure = assert_measure(measure)
private$.active_measure = "outer"
measure_ids = c(private$.measure$id)
if (!is.null(inner_measure)) {
private$.inner_measure = assert_measure(inner_measure)
# special end-fix required for inner measure
measure_ids = c(measure_ids, sprintf("%s_inner", private$.inner_measure$id))
}

# the non-NULL measure ids should be defined as columns in the dt result
mandatory_columns = c("resampling_iteration", "learner_id", "features",
"n_features", measure_ids)
assert_names(names(result), must.include = mandatory_columns)
private$.result = result
private$.features = assert_character(features, any.missing = FALSE, null.ok = FALSE)
private$.minimize = assert_logical(minimize, null.ok = FALSE)

# check that all feature sets are subsets of the task features
assert_subset(unlist(result$features), private$.features)

self$benchmark_result = if (!is.null(benchmark_result)) assert_benchmark_result(benchmark_result)

self$man = "mlr3fselect::ensemble_fs_result"
@@ -99,7 +136,8 @@
#'
#' @param ... (ignored).
print = function(...) {
catf(format(self))
catf("%s with %s learners and %s initial resamplings",
format(self), self$n_learners, self$n_resamples)
print(private$.result[, c("resampling_iteration", "learner_id", "n_features"), with = FALSE])
},

@@ -110,43 +148,102 @@
},

#' @description
#' Calculates the feature ranking.
#' Use this function to change the active measure.
#'
#' @param which (`character(1)`)\cr
#' Which [measure][mlr3::Measure] from the ensemble feature selection result
#' to use in methods of this object.
#' Should be either `"inner"` (optimization measure used in training sets)
#' or `"outer"` (measure used in test sets, default value).
set_active_measure = function(which = "outer") {
assert_choice(which, c("inner", "outer"))

# check if `inner_measure` is an `mlr3::Measure`
if (which == "inner" && is.null(private$.inner_measure)) {
stop("No inner_measure was defined during initialization")
}

private$.active_measure = which
},

#' @description
#' Calculates the feature ranking via [fastVoteR::rank_candidates()].
#'
#' @details
#' The feature ranking process is built on the following framework: models act as voters, features act as candidates, and voters select certain candidates (features).
#' The feature ranking process is built on the following framework: models act as *voters*, features act as *candidates*, and voters select certain candidates (features).
#' The primary objective is to compile these selections into a consensus ranked list of features, effectively forming a committee.
#' Currently, only `"approval_voting"` method is supported, which selects the candidates/features that have the highest approval score or selection frequency, i.e. appear the most often.
#'
#' For every feature a score is calculated, which depends on the `"method"` argument.
#' The higher the score, the higher the ranking of the feature.
#' Note that some methods output a feature ranking instead of a score per feature, so we always include **Borda's score**, which is method-agnostic, i.e. it can be used to compare the feature rankings across different methods.
#'
#' We shuffle the input candidates/features so that we enforce random tie-breaking.
#' Users should set the same `seed` for consistent comparison between the different feature ranking methods and for reproducibility.
#'
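The voters/candidates framework described above can be made concrete with a small sketch. The package itself delegates to fastVoteR's `rank_candidates()` in R; the following Python snippet (feature names, feature sets, and weights are invented for illustration) shows the two core ideas: weighted approval voting, where each model approves the features it selected and a feature's score is the summed weight of its approving models, and the method-agnostic Borda score, which maps any ranking linearly onto [0, 1].

```python
def approval_scores(voters, candidates, weights):
    """Weighted approval voting: each model (voter) approves the features
    (candidates) it selected; a feature's score is the total weight of
    the models that picked it."""
    scores = {c: 0.0 for c in candidates}
    for selected, w in zip(voters, weights):
        for feat in selected:
            scores[feat] += w
    return scores

def borda_scores(ranking):
    """Method-agnostic Borda score: the top feature gets 1, the last gets 0,
    linearly spaced in between."""
    m = len(ranking)
    return {feat: (m - 1 - i) / (m - 1) for i, feat in enumerate(ranking)}

# three models' selected feature sets, all voters weighted equally
voters = [["x1", "x2"], ["x1", "x3"], ["x1", "x2"]]
weights = [1.0, 1.0, 1.0]

scores = approval_scores(voters, ["x1", "x2", "x3"], weights)
ranking = sorted(scores, key=scores.get, reverse=True)  # x1 > x2 > x3
borda = borda_scores(ranking)
```

With `use_weights = TRUE` the weights would instead be each model's performance score (or its inverse for a minimized measure such as classification error), so better-performing models get a louder vote.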
#' @param method (`character(1)`)\cr
#' The method to calculate the feature ranking.
#' The method to calculate the feature ranking. See [fastVoteR::rank_candidates()]
#' for a complete list of available methods.
#' Approval voting (`"av"`) is the default method.
#' @param use_weights (`logical(1)`)\cr
#' The default value (`TRUE`) uses weights equal to the performance scores
#' of each voter/model (or the inverse scores if the measure is minimized).
#' If `FALSE`, we treat all voters as equal and assign them all a weight equal to 1.
#' @param committee_size (`integer(1)`)\cr
#' Number of top selected features in the output ranking.
#' This parameter can be used to speed-up methods that build a committee sequentially
#' (`"seq_pav"`), by requesting only the top N selected candidates/features
#' and not the complete feature ranking.
#' @param shuffle_features (`logical(1)`)\cr
#' Whether to shuffle the task features randomly before computing the ranking.
#' Shuffling ensures consistent random tie-breaking across methods and prevents
#' deterministic biases when features with equal scores are encountered.
#' Default is `TRUE` and it's advised to set a seed before running this function.
#' Set to `FALSE` if deterministic ordering of features is preferred (same as
#' during initialization).
#'
#' @return A [data.table::data.table] listing all the features, ordered by decreasing inclusion probability scores (depending on the `method`)
feature_ranking = function(method = "approval_voting") {
assert_choice(method, choices = "approval_voting")

# cached results
if (!is.null(private$.feature_ranking[[method]])) {
return(private$.feature_ranking[[method]])
    #' @return A [data.table::data.table] listing all the features, ordered by decreasing scores (depends on the `"method"`). Columns are as follows:
    #' - `"feature"`: Feature names.
    #' - `"score"`: Scores assigned to each feature based on the selected method (if applicable).
    #' - `"norm_score"`: Normalized scores (if applicable), scaled to the range \eqn{[0,1]}, which can be loosely interpreted as **selection probabilities** (Meinshausen et al. (2010)).
    #' - `"borda_score"`: Borda scores for method-agnostic comparison, ranging in \eqn{[0,1]}, where the top feature receives a score of 1 and the lowest-ranked feature receives a score of 0.
    #' This column is always included so that feature ranking methods that output only rankings also have a feature-wise score.
#'
feature_ranking = function(method = "av", use_weights = TRUE, committee_size = NULL, shuffle_features = TRUE) {
requireNamespace("fastVoteR")

# candidates => all features, voters => list of selected (best) features sets
candidates = private$.features
voters = private$.result$features

# calculate weights
if (use_weights) {
# voter weights are the (inverse) scores
measure = self$measure # get active measure
measure_id = ifelse(private$.active_measure == "inner",
sprintf("%s_inner", measure$id),
measure$id)

scores = private$.result[, get(measure_id)]
weights = if (measure$minimize) 1 / scores else scores
} else {
# all voters are equal
weights = rep(1, length(voters))
}

count_tbl = sort(table(unlist(private$.result$features)), decreasing = TRUE)
features_selected = names(count_tbl)
features_not_selected = setdiff(private$.features, features_selected)

res_fs = data.table(
feature = features_selected,
inclusion_probability = as.vector(count_tbl) / nrow(private$.result)
# get consensus feature ranking
res = fastVoteR::rank_candidates(
voters = voters,
candidates = candidates,
weights = weights,
committee_size = committee_size,
method = method,
borda_score = TRUE,
shuffle_candidates = shuffle_features
)

res_fns = data.table(
feature = features_not_selected,
inclusion_probability = 0
)

res = rbindlist(list(res_fs, res_fns))
setnames(res, "candidate", "feature")

private$.feature_ranking[[method]] = res
private$.feature_ranking[[method]]
res
},

#' @description
@@ -222,8 +319,11 @@
pareto_front = function(type = "empirical") {
assert_choice(type, choices = c("empirical", "estimated"))
result = private$.result
measure_id = private$.measure_id
minimize = private$.minimize
measure = self$measure # get active measure
measure_id = ifelse(private$.active_measure == "inner",
sprintf("%s_inner", measure$id),
measure$id)
minimize = measure$minimize

# Keep only n_features and performance scores
cols_to_keep = c("n_features", measure_id)
@@ -261,6 +361,8 @@
# Transform the data (x => 1/x)
n_features_inv = NULL
pf[, n_features_inv := 1 / n_features]
# remove edge cases where no features were selected
pf = pf[n_features > 0]

# Fit the linear model
form = mlr3misc::formulate(lhs = measure_id, rhs = "n_features_inv")
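The hunk above fits the "estimated" Pareto front as a linear model of the score on the reciprocal of the number of features, after dropping the edge case of zero selected features. A minimal Python sketch of the same idea (the data points are invented; the R code uses `lm()` via `mlr3misc::formulate`):

```python
import numpy as np

# empirical front: error drops as more features are added, then flattens
n_features = np.array([1, 2, 4, 8, 16])
scores = np.array([0.40, 0.31, 0.27, 0.25, 0.24])  # e.g. classif.ce

mask = n_features > 0            # guard against the "no features selected" case
x = 1.0 / n_features[mask]       # the x => 1/x transform from the hunk above
X = np.column_stack([np.ones_like(x), x])

# least-squares fit of score ~ 1/n_features
beta, *_ = np.linalg.lstsq(X, scores[mask], rcond=None)
predicted = X @ beta             # smooth front at the observed sizes
```

Because the score decreases toward an asymptote as features are added, the coefficient on `1/n_features` comes out positive and the intercept approximates the best achievable error.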
@@ -298,8 +400,11 @@
knee_points = function(method = "NBI", type = "empirical") {
assert_choice(method, choices = c("NBI"))
assert_choice(type, choices = c("empirical", "estimated"))
measure_id = private$.measure_id
minimize = private$.minimize
measure = self$measure # get active measure
measure_id = ifelse(private$.active_measure == "inner",
sprintf("%s_inner", measure$id),
measure$id)
minimize = measure$minimize

pf = if (type == "empirical") self$pareto_front() else self$pareto_front(type = "estimated")

Expand Down Expand Up @@ -346,26 +451,50 @@ EnsembleFSResult = R6Class("EnsembleFSResult",
uniqueN(private$.result$learner_id)
},

#' @field measure (`character(1)`)\cr
#' Returns the measure id used in the ensemble feature selection.
#' @field measure ([mlr3::Measure])\cr
#' Returns the active measure to use in methods of this object.
measure = function(rhs) {
assert_ro_binding(rhs)
private$.measure_id

if (private$.active_measure == "outer") {
private$.measure
} else {
private$.inner_measure
}
},

#' @field active_measure (`character(1)`)\cr
#' Specifies the type of the active measure.
#' Can be one of the two:
#'
#' - `"outer"`: measure used in the test sets of the ensemble feature
#' selection process.
#' - `"inner"`: measure used for optimization and scoring the train sets.
active_measure = function(rhs) {
assert_ro_binding(rhs)
private$.active_measure
},

#' @field n_resamples (`character(1)`)\cr
#' Returns the number of times the task was initially resampled in the ensemble feature selection.
n_resamples = function(rhs) {
assert_ro_binding(rhs)
uniqueN(self$result$resampling_iteration)
}
),

private = list(
.result = NULL, # with no R6 classes
.stability_global = NULL,
.stability_learner = NULL,
.feature_ranking = NULL,
.features = NULL,
.measure_id = NULL,
.minimize = NULL
.measure = NULL,
.inner_measure = NULL,
.active_measure = NULL
)
)

#' @export
as.data.table.EnsembleFSResult = function(x, ...) {
x$result
}