-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
On futuremice() and reproducibility #557
Comments
Thanks for raising these points, @edbonneville! The reason for using
However, this will not solve the second point. I again agree that reproducibility is important, but I think that this will remain an issue regardless of how we adapt set.seed(123)
purrr::map_dbl(1:10, ~rnorm(1))
#> [1] -0.56047565 -0.23017749 1.55870831 0.07050839 0.12928774 1.71506499
#> [7] 0.46091621 -1.26506123 -0.68685285 -0.44566197
set.seed(123)
future::plan("multisession")
furrr::future_map_dbl(1:10, ~rnorm(1),
.options = furrr::furrr_options(seed = TRUE))
#> [1] -2.03403746 -0.37386282 0.21503119 -1.17845115 -1.45315445 -2.06945243
#> [7] 1.75945588 -0.06722657 0.92327056 -0.56712447 Created on 2023-05-19 with reprex v2.0.2 I haven't done much research into whether it is possible to obtain exactly reproducible results regardless of the future strategy, so if you know a way to mitigate this issue I would be very interested to implement it! Given this, and because |
Thanks for the detailed answer @thomvolker ! I would agree your suggestion regarding the use of Your example on future strategies I do not think is fair, since set.seed(123)
future::plan("sequential")
furrr::future_map_dbl(
1:10,
~ rnorm(1),
.options = furrr::furrr_options(seed = TRUE)
)
#> Warning: package 'purrr' was built under R version 4.2.2
#> [1] -2.03403746 -0.37386282 0.21503119 -1.17845115 -1.45315445 -2.06945243
#> [7] 1.75945588 -0.06722657 0.92327056 -0.56712447
set.seed(123)
future::plan("multisession")
furrr::future_map_dbl(
1:10,
~ rnorm(1),
.options = furrr::furrr_options(seed = TRUE)
)
#> [1] -2.03403746 -0.37386282 0.21503119 -1.17845115 -1.45315445 -2.06945243
#> [7] 1.75945588 -0.06722657 0.92327056 -0.56712447 .. which does yield the same results. If we take the
We would need to contact the maintainer of the
.. and it turns out this is competitive with # Globals
library(mice)
library(future)
library(microbenchmark)
library(ggplot2)
# Proposed function
futuremice3 <- function(data,
m,
# data.init + printFlag added here since not part of mids list
data.init = NULL,
printFlag = FALSE,
...) {
# Run initial mice() to run all checks
mids.init <- mice::mice(
data,
m = 1,
printFlag = printFlag,
data.init = data.init,
...
)
total_m <- m # Keep desired m separate, otherwise will be overwritten in list2env call (turned to m = 1)
# Populate the function environment with the element of the mids list a single time
list2env(x = mids.init, envir = environment())
maxit <- iteration
# Closure which makes use of the above variables
# It is the mice() function but from this line:
# https://github.com/amices/mice/blob/ea6d83453b39f489c4991662565dc3c8d32cfd4b/R/mice.R#LL428C5-L428C5
add.extra.imputations <- function(m, ...) {
# data frame for storing the event log
state <- list(it = 0, im = 0, dep = "", meth = "", log = FALSE)
loggedEvents <- data.frame(it = 0, im = 0, dep = "", meth = "",
out = "")
# initialize imputations
nmis <- apply(is.na(data), 2, sum)
imp <- mice:::initialize.imp(
data, m, ignore, where, blocks, visitSequence,
method, nmis, data.init
)
# and iterate...
from <- 1
to <- from + maxit - 1
q <- mice:::sampler(
data, m, ignore, where, imp, blocks, method,
visitSequence, predictorMatrix, formulas, blots,
post, c(from, to), printFlag, ...
)
if (!state$log) loggedEvents <- NULL
if (state$log) row.names(loggedEvents) <- seq_len(nrow(loggedEvents))
## save, and return
midsobj <- list(
data = data,
imp = q$imp,
m = m,
where = where,
blocks = blocks,
call = call,
nmis = nmis,
method = method,
predictorMatrix = predictorMatrix,
visitSequence = visitSequence,
formulas = formulas,
post = post,
blots = blots,
ignore = ignore,
seed = seed,
iteration = q$iteration,
lastSeedValue = get(".Random.seed",
envir = globalenv(), mode = "integer",
inherits = FALSE
),
chainMean = q$chainMean,
chainVar = q$chainVar,
loggedEvents = loggedEvents,
version = packageVersion("mice"),
date = Sys.Date()
)
oldClass(midsobj) <- "mids"
if (!is.null(midsobj$loggedEvents)) {
warning("Number of logged events: ", nrow(midsobj$loggedEvents),
call. = FALSE
)
}
midsobj
}
# Same procedure as futuremice2(), but now will be much faster
mids_list <- future.apply::future_lapply(
X = seq_len(total_m - 1),
FUN = function(x) add.extra.imputations(m = 1, ...),
future.seed = TRUE
)
imp <- Reduce(f = mice::ibind, x = c(list(mids.init), mids_list))
return(imp)
}
# Generate data like futuremice vignette here:
# https://github.com/gerkovink/miceVignettes/blob/ec5235007366b7b9b56723154d7e9475da5f8c43/futuremice/Vignette_futuremice.Rmd#L123
set.seed(20230520)
large_covmat <- diag(8)
large_covmat[large_covmat == 0] <- 0.5
large_data <- MASS::mvrnorm(
n = 10000,
mu = c(0, 0, 0, 0, 0, 0, 0, 0),
Sigma = large_covmat
)
large_data_with_missings <- ampute(large_data, prop = 0.8, mech = "MCAR")$amp
head(large_data_with_missings)
#> V1 V2 V3 V4 V5 V6
#> 1 -1.5119431 -1.71881533 -1.7187446 -0.4127288 -1.89156861 -0.7399172
#> 2 0.2485728 0.04659958 -0.9330806 1.5763965 1.39634609 NA
#> 3 0.6672462 -1.14396682 NA -0.8888281 -0.06917341 0.2223045
#> 4 -0.2795157 -1.44177321 0.7633885 NA -0.32258739 -0.3794018
#> 5 0.6114431 -0.57473468 -0.1585314 -0.1170578 0.74479866 0.3170481
#> 6 -0.1955333 0.70432678 NA 0.8053712 -0.17290571 0.1633985
#> V7 V8
#> 1 -0.6165913 -2.59207333
#> 2 0.2685161 -0.32164220
#> 3 -0.7292466 0.40230824
#> 4 -0.2542349 -1.02424692
#> 5 -0.5111598 NA
#> 6 0.7281126 0.06050003
cores <- 3L
m <- 50
# Will take a good 10 minutes
mb <- microbenchmark(
"futuremice" = futuremice(large_data_with_missings, m = m, n.core = cores),
"futuremice3" = {
plan(multisession, workers = cores)
futuremice3(large_data_with_missings, m = m)
},
"sequential" = mice(large_data_with_missings, m = m, printFlag = FALSE),
times = 10
)
mb
#> Unit: seconds
#> expr min lq mean median uq max neval cld
#> futuremice 12.78954 12.86768 12.97507 12.90699 13.04418 13.38369 10 a
#> futuremice3 12.72041 12.78315 13.31132 12.99573 13.69118 14.86115 10 a
#> sequential 19.80765 19.94718 20.17635 20.15332 20.43851 20.55581 10 b
#autoplot(mb) Definitely worth doing some extra checking, but this would solve the reproducibility issue without sacrificing speed (at least in the above example with large-ish data). And of course it is not very clean that I had to copy code directly from the |
Thanks for another good contribution! You're right, I should have compared with a sequential plan. I also like your suggestion, but I would need to check whether we don't lose mice functionality in futuremice in this set-up. My ideal solution would be to simply parallelize the sampler within mice, such that there is no duplicate code whatsoever, and that relies on a "parallel = TRUE" argument in mice, and with the possibility of setting your own plan and everything, but that requires some more thinking first! |
Dear
{mice}
team,Thank you for your continued work on this package!
I had a couple of suggestions/questions regarding the
futuremice()
function. The first is on how theplan()
is currently set within the function, which not recommended as per one of the{future}
vignettes. If someone is usingfuturemice()
because they want to make use of parallel processing anyway, would it not be more flexible to let them set their ownplan()
? Multiple arguments (e.g. future.plan, use.logical, n.core, parallelseed) would then no longer be necessary, and would allow advanced users to specify nested futures (e.g. when you want to parallelize imputations within parallelized simulation replications).The second (related) point is that I think one of the major advantages of using this future framework is that your results can be reproducible regardless of how another user decides to parallelize the same analysis (e.g. local computer vs. computing cluster, differing numbers of cores etc). In the current implementation,
futuremice()
will not be able to reproduce the same set of imputations when only the number of cores differs (same dataset, number of imputations, and seed). For example:This is a side effect of
n.core
dictating how them
imputations are spread out across the cores. A way to make the imputations reproducible regardless ofn.core
might look something like this:Of course if you do some benchmarks,
futuremice2()
will be slower thanfuturemice()
, since the former callsmice()
a wholem
times while the latter onlyn.core
times. Nevertheless, a)futuremice2()
will still outperform sequentialmice()
(depending on size of dataset and number of imputations); b) you could speed it up further by for example callingmice()
a first time to perform the necessary checks, and then only call a function which would boil down to runningmice()
from here, which uses the defined predictorMatrix/maxit/method/etc from the first call.So my main point: the current implementation does optimize computational resources nicely, but not reproducibility - so something like
futuremice2()
could be a compromise. What do you think? I guess two arguments for keeping the current implementation is that a)futuremice()
should focus on optimizing speed; b) we should anyway be using a large enough number of imputed datasets such that our results do not differ much between independent imputation rounds.. 😅Thanks!
Session info
The text was updated successfully, but these errors were encountered: