Doc fix for PipeOpEncode and easy type conversion
closes #749
mb706 committed Apr 21, 2024
1 parent a6eb367 commit 44f76b2
Showing 75 changed files with 545 additions and 446 deletions.
14 changes: 12 additions & 2 deletions R/PipeOpEncode.R
Expand Up @@ -5,7 +5,7 @@
#' @format [`R6Class`] object inheriting from [`PipeOpTaskPreprocSimple`]/[`PipeOpTaskPreproc`]/[`PipeOp`].
#'
#' @description
#' Encodes columns of type `factor`, `character` and `ordered`.
#' Encodes columns of type `factor` and `ordered`.
#'
#' Possible encodings are `"one-hot"` encoding, as well as encoding according to `stats::contr.helmert()`, `stats::contr.poly()`,
#' `stats::contr.sum()` and `stats::contr.treatment()`.
Expand All @@ -14,6 +14,8 @@
#'
#' Use the [`PipeOpTaskPreproc`] `$affect_columns` functionality to only encode a subset of columns, or only encode columns of a certain type.
#'
#' `character`-type features can be encoded by converting them to `factor` features first, using [`ppl("convert_types", "character", "factor")`][mlr_graphs_convert_types].
#'
#' @section Construction:
#' ```
#' PipeOpEncode$new(id = "encode", param_vals = list())
Expand All @@ -26,7 +28,7 @@
#' @section Input and Output Channels:
#' Input and output channels are inherited from [`PipeOpTaskPreproc`].
#'
#' The output is the input [`Task`][mlr3::Task] with all affected `factor`, `character` or `ordered` parameters encoded according to the `method`
#' The output is the input [`Task`][mlr3::Task] with all affected `factor` and `ordered` features encoded according to the `method`
#' parameter.
#'
#' @section State:
Expand Down Expand Up @@ -78,6 +80,14 @@
#'
#' poe$param_set$values$method = "sum"
#' poe$train(list(task))[[1]]$data()
#'
#' # converting character-columns
#' data_chr = data.table::data.table(x = factor(letters[1:3]), y = letters[1:3])
#' task_chr = TaskClassif$new("task_chr", data_chr, "x")
#'
#' goe = ppl("convert_types", "character", "factor") %>>% po("encode")
#'
#' goe$train(task_chr)[[1]]$data()
PipeOpEncode = R6Class("PipeOpEncode",
inherit = PipeOpTaskPreprocSimple,
public = list(
Expand Down
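
The contrast functions named in the `PipeOpEncode` description can be inspected directly in base R. The following is a minimal sketch of the coding matrices that `stats::contr.*()` produce for a three-level factor; it does not call `PipeOpEncode` itself, and the variable names (`lv`, `one_hot`, and so on) are chosen only for illustration.

```r
# Base R only: the stats::contr.*() functions referenced in the documentation.
lv = c("a", "b", "c")

# Full indicator coding: one column per level.
one_hot = contr.treatment(lv, contrasts = FALSE)

# Treatment coding: the first level is the baseline and gets no column.
treatment = contr.treatment(lv)

# Sum-to-zero coding: the last level is coded as -1 in every column.
sum_coded = contr.sum(lv)

# Helmert coding.
helmert = contr.helmert(lv)

# Orthogonal polynomial coding for three equally spaced levels.
poly = contr.poly(length(lv))

print(one_hot)
print(treatment)
print(sum_coded)
print(helmert)
print(poly)
```

Each matrix has one row per factor level; its columns correspond to the numeric columns that such a coding generates for one factor.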
89 changes: 89 additions & 0 deletions R/pipeline_convert_types.R
@@ -0,0 +1,89 @@
#' @include mlr_graphs.R

#' @title Convert Column Types
#' @name mlr_graphs_convert_types
#' @description
#' Converts all columns of type `type_from` to `type_to`, using the corresponding R function (e.g. `as.numeric()`, `as.factor()`).
#' It is possible to further subset the columns that should be affected using the `affect_columns` argument.
#' The resulting [`Graph`] contains a [`PipeOpColApply`], followed, if appropriate, by a [`PipeOpFixFactors`].
#'
#' @param type_from `character` \cr
#' Which column types to convert. May be any combination of `"logical"`, `"integer"`, `"numeric"`, `"factor"`, `"ordered"`, `"character"`, or `"POSIXct"`.
#' @param type_to `character(1)` \cr
#' Which type to convert to. Must be a scalar value, exactly one of the types allowed in `type_from`.
#' @param affect_columns `function` | [`Selector`] | `NULL` \cr
#' Which columns to affect. This argument can further restrict the columns being converted, beyond the `type_from` argument.
#' Must be a [`Selector`]-like function, which takes a [`Task`][mlr3::Task] as argument and returns a `character` vector of feature names to use.
#' @param id `character(1)` | `NULL` \cr
#' ID to give to the constructed [`PipeOp`]s.
#' Defaults to an ID built automatically from `type_from` and `type_to`.
#' If a [`PipeOpFixFactors`] is appended, its ID will be `paste0(id, "_ff")`.
#' @param fixfactors `logical(1)` | `NULL` \cr
#' Whether to append a [`PipeOpFixFactors`]. Defaults to `TRUE` if and only if `type_to` is `"factor"` or `"ordered"`.
#' @param more_args `list` \cr
#' Additional arguments to give to the conversion function. This could e.g. be used to pass the timezone to `as.POSIXct`.
#'
#' @return [`Graph`]
#' @export
#' @examples
#'
#' data_chr = data.table::data.table(
#'   x = factor(letters[1:3]),
#'   y = letters[1:3],
#'   z = letters[1:3]
#' )
#' task_chr = TaskClassif$new("task_chr", data_chr, "x")
#' str(task_chr$data())
#'
#' graph = ppl("convert_types", "character", "factor")
#' str(graph$train(task_chr)[[1]]$data())
#'
#' graph_z = ppl("convert_types", "character", "factor",
#'   affect_columns = selector_name("z"))
#' graph_z$train(task_chr)[[1]]$data()
#'
#' # `affect_columns` and `type_from` are both applied. The following
#' # looks for a 'numeric' column with name 'z', which is not present;
#' # the task is therefore unchanged.
#' graph_z = ppl("convert_types", "numeric", "factor",
#'   affect_columns = selector_name("z"))
#' graph_z$train(task_chr)[[1]]$data()
#'
pipeline_convert_types = function(type_from, type_to, affect_columns = NULL, id = NULL, fixfactors = NULL, more_args = list()) {
  coltypes = mlr_reflections$task_feature_types

  assert_character(type_from, any.missing = FALSE, unique = TRUE)
  assert_subset(type_from, coltypes)
  assert_choice(type_to, coltypes)
  assert_function(affect_columns, null.ok = TRUE)
  assert_string(id, null.ok = TRUE)
  assert_flag(fixfactors, null.ok = TRUE)
  assert_list(more_args)

  selector = selector_type(type_from)
  if (!is.null(affect_columns)) {
    selector = selector_intersect(selector, affect_columns)
  }
  if (is.null(id)) {
    id = sprintf("convert_%s_to_%s",
      paste(names(coltypes)[match(type_from, coltypes)], collapse = ""),
      names(coltypes)[match(type_to, coltypes)]
    )
  }
  converter = get(paste0("as.", type_to))
  if (length(more_args)) {
    converter = crate(function(x) {
      mlr3misc::invoke(converter, x = x, .args = more_args)
    }, converter, more_args)
  }
  if (is.null(fixfactors)) {
    fixfactors = type_to %in% c("factor", "ordered")
  }
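  # Use the combined `selector` (type_from intersected with affect_columns), as documented,
  # and give the appended PipeOpFixFactors the documented ID `paste0(id, "_ff")`.
  po("colapply",
    id = id, applicator = converter, affect_columns = selector
  ) %>>!% if (fixfactors) po("fixfactors", id = paste0(id, "_ff"))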
}

mlr_graphs$add("convert_types", pipeline_convert_types)
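
The converter lookup in `pipeline_convert_types()` resolves `type_to` to the matching base R `as.*()` function and, when `more_args` is non-empty, wraps it so the extra arguments are passed along. A minimal base R sketch of that idea follows; `make_converter` is a hypothetical helper, not part of mlr3pipelines, and it uses `do.call()` in place of `crate()` and `mlr3misc::invoke()`.

```r
# Hypothetical helper mirroring the converter lookup, using base R only.
make_converter = function(type_to, more_args = list()) {
  # Resolve e.g. "factor" to as.factor(), "POSIXct" to as.POSIXct().
  converter = get(paste0("as.", type_to), mode = "function")
  if (length(more_args)) {
    force(more_args)
    # Wrap the converter so the extra arguments are supplied on every call.
    return(function(x) do.call(converter, c(list(x), more_args)))
  }
  converter
}

to_factor = make_converter("factor")
to_time = make_converter("POSIXct", more_args = list(tz = "UTC"))

fct = to_factor(c("a", "b", "a"))
tm = to_time("2024-04-21 12:00:00")

print(levels(fct))
print(format(tm, "%Y-%m-%d %H:%M", tz = "UTC"))
```

Passing `tz` through `more_args` corresponds to the timezone use case mentioned in the `more_args` documentation.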

8 changes: 4 additions & 4 deletions man/Graph.Rd

16 changes: 8 additions & 8 deletions man/PipeOp.Rd

12 changes: 6 additions & 6 deletions man/PipeOpEnsemble.Rd

12 changes: 6 additions & 6 deletions man/PipeOpImpute.Rd

20 changes: 10 additions & 10 deletions man/PipeOpTargetTrafo.Rd
