New Down-Sampling PipeOps (Tomek, Nearmiss) based on themis #817

Merged 28 commits on Sep 24, 2024
Changes from 20 commits
Commits
b24579f
init PipeOps Tomek and Nearmiss
advieser Aug 29, 2024
86fc40a
working PipeOpTomek
advieser Aug 31, 2024
7aae18c
PipeOpTomek tests
advieser Aug 31, 2024
ef58815
typo
advieser Aug 31, 2024
2bf7004
Added PipeOpNearmiss tests
advieser Aug 31, 2024
33b125d
added tests
advieser Aug 31, 2024
995f117
docs: small additions
advieser Aug 31, 2024
2311c65
modified params in test
advieser Aug 31, 2024
953d14e
Working PipeOpNearmiss
advieser Aug 31, 2024
0303284
Docs changes in PipeOpTomek
advieser Aug 31, 2024
f8dd881
remove dev comments
advieser Aug 31, 2024
b4d2b29
docs: document()
advieser Aug 31, 2024
12945db
added themis to suggests
advieser Aug 31, 2024
25f6420
docs: simplified examples
advieser Aug 31, 2024
714f1a1
docs: corrections in examples
advieser Aug 31, 2024
0e704a2
Correcting corrections in examples
advieser Aug 31, 2024
488600a
document()
advieser Sep 1, 2024
460c397
code review changes
advieser Sep 21, 2024
b4ca973
Updated NEWS.md
advieser Sep 21, 2024
0d96c5a
get in data.table
advieser Sep 21, 2024
e23a812
static type checker var defs
mb706 Sep 24, 2024
59f35c1
static type checker var defs II
mb706 Sep 24, 2024
a0de64b
Merge branch 'master' into themis_pipeops
advieser Sep 24, 2024
48e0de3
test for uncommon row_ids
advieser Sep 24, 2024
00c397a
Merge branch 'themis_pipeops' of https://github.com/mlr-org/mlr3pipel…
advieser Sep 24, 2024
92ec133
document
advieser Sep 24, 2024
22f80a2
update version
advieser Sep 24, 2024
d3402ef
Merge branch 'master' into themis_pipeops
advieser Sep 24, 2024
5 changes: 4 additions & 1 deletion DESCRIPTION
@@ -88,7 +88,8 @@ Suggests:
vtreat,
future,
htmlwidgets,
-ranger
+ranger,
+themis
ByteCompile: true
Encoding: UTF-8
Config/testthat/edition: 3
@@ -145,6 +146,7 @@ Collate:
'PipeOpMutate.R'
'PipeOpNMF.R'
'PipeOpNOP.R'
+'PipeOpNearmiss.R'
'PipeOpOVR.R'
'PipeOpPCA.R'
'PipeOpProxy.R'
@@ -164,6 +166,7 @@ Collate:
'PipeOpSubsample.R'
'PipeOpTextVectorizer.R'
'PipeOpThreshold.R'
+'PipeOpTomek.R'
'PipeOpTrafo.R'
'PipeOpTuneThreshold.R'
'PipeOpUnbranch.R'
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -81,6 +81,7 @@ export(PipeOpMultiplicityImply)
export(PipeOpMutate)
export(PipeOpNMF)
export(PipeOpNOP)
+export(PipeOpNearmiss)
export(PipeOpOVRSplit)
export(PipeOpOVRUnite)
export(PipeOpPCA)
@@ -108,6 +109,7 @@ export(PipeOpTaskPreproc)
export(PipeOpTaskPreprocSimple)
export(PipeOpTextVectorizer)
export(PipeOpThreshold)
+export(PipeOpTomek)
export(PipeOpTuneThreshold)
export(PipeOpUnbranch)
export(PipeOpVtreat)
3 changes: 2 additions & 1 deletion NEWS.md
@@ -1,6 +1,7 @@
# mlr3pipelines 0.6.0-9000

-* New PipeOp `PipeOpRowApply` / `po("rowapply")`
+* New PipeOp: `PipeOpRowApply` / `po("rowapply")`
+* New down-sampling PipeOps for imbalanced data: `PipeOpTomek` / `po("tomek")` and `PipeOpNearmiss` / `po("nearmiss")`

# mlr3pipelines 0.6.0

109 changes: 109 additions & 0 deletions R/PipeOpNearmiss.R
@@ -0,0 +1,109 @@
#' @title Nearmiss Down-Sampling
#'
#' @usage NULL
#' @name mlr_pipeops_nearmiss
#' @format [`R6Class`][R6::R6Class] object inheriting from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @description
#' Generates a more balanced data set by down-sampling the instances of non-minority classes using the NEARMISS algorithm.
#'
#' The algorithm down-samples by selecting instances from the non-minority classes that have the smallest mean distance
#' to their `k` nearest neighbors of different classes.
#' Only numeric and integer features are taken into account; these must not contain missing values.
#'
#' This can only be applied to [classification tasks][mlr3::TaskClassif]. Multiclass classification is supported.
#'
#' See [`themis::nearmiss`] for details.
#'
#' @section Construction:
#' ```
#' PipeOpNearmiss$new(id = "nearmiss", param_vals = list())
#' ```
#'
#' * `id` :: `character(1)`\cr
#' Identifier of resulting object, default `"nearmiss"`.
#' * `param_vals` :: named `list`\cr
#' List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default `list()`.
#'
#' @section Input and Output Channels:
#' Input and output channels are inherited from [`PipeOpTaskPreproc`].
#'
#' The output during training is the input [`Task`][mlr3::Task] with the rows removed from the non-minority classes.
#' The output during prediction is the unchanged input.
#'
#' @section State:
#' The `$state` is a named `list` with the `$state` elements inherited from [`PipeOpTaskPreproc`].
#'
#' @section Parameters:
#' The parameters are the parameters inherited from [`PipeOpTaskPreproc`], as well as
#' * `k` :: `integer(1)`\cr
#' Number of nearest neighbors used for calculating the mean distances. Default is `5`.
#' * `under_ratio` :: `numeric(1)`\cr
#' Ratio of the majority-to-minority frequencies. Each non-minority class is down-sampled so that it contains
#' (at most) `under_ratio` times as many instances as the minority class.
#' Default is `1`. For details, see [`themis::nearmiss`].
#'
#' @section Fields:
#' Only fields inherited from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @section Methods:
#' Only methods inherited from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @references
#' `r format_bib("zhang2003")`
#'
#' @family PipeOps
#' @template seealso_pipeopslist
#' @include PipeOpTaskPreproc.R
#' @export
#' @examples
#' \dontshow{ if (requireNamespace("themis")) \{ }
#' library("mlr3")
#'
#' # Create example task
#' task = tsk("wine")
#' task$head()
#' table(task$data(cols = "type"))
#'
#' # Down-sample and balance data
#' pop = po("nearmiss")
#' nearmiss_result = pop$train(list(task))[[1]]$data()
#' nrow(nearmiss_result)
#' table(nearmiss_result$type)
#' \dontshow{ \} }
PipeOpNearmiss = R6Class("PipeOpNearmiss",
inherit = PipeOpTaskPreproc,
public = list(
initialize = function(id = "nearmiss", param_vals = list()) {
ps = ps(
k = p_int(lower = 1, default = 5, tags = c("train", "nearmiss")),
under_ratio = p_dbl(lower = 0, default = 1, tags = c("train", "nearmiss"))
)
super$initialize(id, param_set = ps, param_vals = param_vals, packages = "themis", can_subset_cols = FALSE,
task_type = "TaskClassif", tags = "imbalanced data")
}
),
private = list(

.train_task = function(task) {
# Return task unchanged, if no feature columns exist
if (!length(task$feature_names)) {
return(task)
}
# At least one numeric or integer feature required
if (!any(task$feature_types$type %in% c("numeric", "integer"))) {
stop("Nearmiss needs at least one numeric or integer feature to work.")
}
# Subset columns to only include integer/numeric features and the target
cols = c(task$feature_types[get("type") %in% c("integer", "numeric"), get("id")], task$target_names)
# Down-sample data
dt = setDT(invoke(themis::nearmiss, df = task$data(cols = cols), var = task$target_names,
.args = self$param_set$get_values(tags = "nearmiss")))

keep = task$row_ids[as.integer(row.names(dt))]
task$filter(keep)
}
)
)

mlr_pipeops$add("nearmiss", PipeOpNearmiss)
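
The selection rule described in the `PipeOpNearmiss` documentation above can be illustrated in base R. This is a minimal sketch under stated assumptions, not the `themis::nearmiss` implementation: the helper name `nearmiss_keep` is hypothetical, distances are plain Euclidean, and "neighbors of different classes" is taken literally as all observations outside the class being down-sampled.

```r
# Illustrative NearMiss-style selection (hypothetical helper, base R only).
# Returns the positions of the observations that would be kept.
nearmiss_keep = function(x, y, k = 1, under_ratio = 1) {
  counts = table(y)
  minority = names(counts)[which.min(counts)]
  # Target size for every non-minority class
  n_target = floor(min(counts) * under_ratio)
  d = as.matrix(dist(x))  # pairwise Euclidean distances
  keep = which(y == minority)  # the minority class is kept in full
  for (cls in setdiff(names(counts), minority)) {
    members = which(y == cls)
    others = which(y != cls)
    # Mean distance of each member to its k nearest neighbors from other classes
    score = vapply(members, function(i) {
      mean(sort(d[i, others])[seq_len(min(k, length(others)))])
    }, numeric(1))
    n_keep = min(length(members), n_target)
    # Keep the members with the smallest mean distance
    keep = c(keep, members[order(score)[seq_len(n_keep)]])
  }
  sort(keep)
}

x = matrix(c(0, 1, 2, 3, 10, 11), ncol = 1)
y = factor(c("a", "a", "a", "a", "b", "b"))
kept = nearmiss_keep(x, y, k = 1, under_ratio = 1)
table(y[kept])
```

With `under_ratio = 1`, the majority class `"a"` is reduced to the two observations closest to class `"b"`, so both classes end up with the same frequency.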
98 changes: 98 additions & 0 deletions R/PipeOpTomek.R
@@ -0,0 +1,98 @@
#' @title Tomek Down-Sampling
#'
#' @usage NULL
#' @name mlr_pipeops_tomek
#' @format [`R6Class`][R6::R6Class] object inheriting from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @description
#' Generates a cleaner data set by removing all majority-minority Tomek links.
#'
#' The algorithm down-samples the data by removing all pairs of observations that form a Tomek link,
#' i.e. a pair of observations from different classes that are each other's nearest neighbors.
#' Only numeric and integer features are taken into account; these must not contain missing values.
#'
#' This can only be applied to [classification tasks][mlr3::TaskClassif]. Multiclass classification is supported.
#'
#' See [`themis::tomek`] for details.
#'
#' @section Construction:
#' ```
#' PipeOpTomek$new(id = "tomek", param_vals = list())
#' ```
#'
#' * `id` :: `character(1)`\cr
#' Identifier of resulting object, default `"tomek"`.
#' * `param_vals` :: named `list`\cr
#' List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default `list()`.
#'
#' @section Input and Output Channels:
#' Input and output channels are inherited from [`PipeOpTaskPreproc`].
#'
#' The output during training is the input [`Task`][mlr3::Task] with all rows removed that are part of a Tomek link.
#' The output during prediction is the unchanged input.
#'
#' @section State:
#' The `$state` is a named `list` with the `$state` elements inherited from [`PipeOpTaskPreproc`].
#'
#' @section Parameters:
#' The parameters are the parameters inherited from [`PipeOpTaskPreproc`].
#'
#' @section Fields:
#' Only fields inherited from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @section Methods:
#' Only methods inherited from [`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @references
#' `r format_bib("tomek1976")`
#'
#' @family PipeOps
#' @template seealso_pipeopslist
#' @include PipeOpTaskPreproc.R
#' @export
#' @examples
#' \dontshow{ if (requireNamespace("themis")) \{ }
#' library("mlr3")
#'
#' # Create example task
#' task = tsk("iris")
#' task$head()
#' table(task$data(cols = "Species"))
#'
#' # Down-sample data
#' pop = po("tomek")
#' tomek_result = pop$train(list(task))[[1]]$data()
#' nrow(tomek_result)
#' table(tomek_result$Species)
#' \dontshow{ \} }
PipeOpTomek = R6Class("PipeOpTomek",
inherit = PipeOpTaskPreproc,
public = list(
initialize = function(id = "tomek", param_vals = list()) {
super$initialize(id, param_set = ps(), param_vals = param_vals, packages = "themis", can_subset_cols = FALSE,
task_type = "TaskClassif", tags = "imbalanced data")
}
),
private = list(

.train_task = function(task) {
# Return task unchanged, if no feature columns exist
if (!length(task$feature_names)) {
return(task)
}
# At least one numeric or integer feature required
if (!any(task$feature_types$type %in% c("numeric", "integer"))) {
stop("Tomek needs at least one numeric or integer feature to work.")
}
# Subset columns to only include integer/numeric features and the target
cols = c(task$feature_types[get("type") %in% c("integer", "numeric"), get("id")], task$target_names)
# Down-sample data
dt = setDT(invoke(themis::tomek, df = task$data(cols = cols), var = task$target_names))

keep = task$row_ids[as.integer(row.names(dt))]
task$filter(keep)
}
)
)

mlr_pipeops$add("tomek", PipeOpTomek)
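
To make the notion of a Tomek link concrete, and to show why `.train_task` maps row positions back through `task$row_ids`, here is a small base R sketch. It is an illustration under assumptions, not the `themis::tomek` implementation: the helper `find_tomek_links` and the example row ids are hypothetical, and both members of each link are dropped, following the description in the documentation above.

```r
# Illustrative Tomek link detection (hypothetical helper, base R only).
# A Tomek link is a pair of observations from different classes that are
# each other's nearest neighbor.
find_tomek_links = function(x, y) {
  d = as.matrix(dist(x))  # pairwise Euclidean distances
  diag(d) = Inf           # an observation is not its own neighbor
  nn = unname(apply(d, 1, which.min))  # position of each row's nearest neighbor
  idx = seq_len(nrow(d))
  # Mutual nearest neighbors that carry different class labels
  which(nn[nn] == idx & y[nn] != y)
}

x = matrix(c(0, 1, 5, 6, 10), ncol = 1)
y = factor(c("a", "b", "a", "a", "b"))
# Row ids need not be 1..n, so positions must be mapped back to ids
row_ids = c(11L, 22L, 33L, 44L, 55L)

in_link = find_tomek_links(x, y)
keep_pos = setdiff(seq_along(y), in_link)
keep_ids = row_ids[keep_pos]
keep_ids
```

Observations 1 and 2 are mutual nearest neighbors with different labels, so they form a link and are dropped; observations 3 and 4 are mutual nearest neighbors but share a class, so they stay. Indexing `row_ids` by position mirrors the `task$row_ids[as.integer(row.names(dt))]` step, which keeps the filter correct when row ids are not consecutive.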
20 changes: 20 additions & 0 deletions R/bibentries.R
@@ -52,5 +52,25 @@ bibentries = c(
author = "Yujun Wu and Dennis D Boos and Leonard A Stefanski",
title = "Controlling Variable Selection by the Addition of Pseudovariables",
journal = "Journal of the American Statistical Association"
),

zhang2003 = bibentry("inproceedings",
year = "2003",
author = "Zhang, J. and Mani, I.",
title = "KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction",
booktitle = "Proceedings of Workshop on Learning from Imbalanced Datasets (ICML)"
),

tomek1976 = bibentry("article",
doi = "10.1109/TSMC.1976.4309452",
author = "I. Tomek",
year = "1976",
title = "Two Modifications of CNN",
journal = "IEEE Transactions on Systems, Man and Cybernetics",
volume = "6",
number = "11",
pages = "769--772",
publisher = "IEEE"
)

)
2 changes: 2 additions & 0 deletions man/PipeOp.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 2 additions & 0 deletions man/PipeOpEnsemble.Rd

2 changes: 2 additions & 0 deletions man/PipeOpImpute.Rd

2 changes: 2 additions & 0 deletions man/PipeOpTargetTrafo.Rd

2 changes: 2 additions & 0 deletions man/PipeOpTaskPreproc.Rd

2 changes: 2 additions & 0 deletions man/PipeOpTaskPreprocSimple.Rd

2 changes: 2 additions & 0 deletions man/mlr_pipeops.Rd
