Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delayed learner fit does not yield same result as regular fit when Lrnr_randomForest is used #402

Open
tq21 opened this issue Nov 16, 2022 · 1 comment
Labels

Comments

@tq21
Copy link
Contributor

tq21 commented Nov 16, 2022

When Lrnr_randomForest is a candidate learner in a super learner library, the delayed learner fit and regular fit do not yield the same prediction result.

Here is an example borrowed from test-delayed_sl3.R:

library(delayed)
library(SuperLearner)
library(future)

plan(sequential)
  
data(cpp_imputed)
task <- sl3_Task$new(
  cpp_imputed, 
  covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn"),
  outcome = "haz")
  
lrnr_rf <- Lrnr_randomForest$new()
lrnr_glmnet <- Lrnr_glmnet$new()
lrnr_glm_fast <- Lrnr_glm_fast$new()
  
stack <- Stack$new(lrnr_rf, lrnr_glmnet, lrnr_glm_fast)
sl <- Lrnr_sl$new(learners = stack)

set.seed(123)
test_delayed <- delayed_learner_train(sl, task)
sched <- Scheduler$new(test_delayed, SequentialJob)
set.seed(123)
fit_delayed <- sched$compute()
preds_delayed <- fit_delayed$predict()

set.seed(123)
fit <- sl$train(task)
preds <- fit$predict()
expect_equal(as.numeric(preds_delayed), as.numeric(preds))

Running the code above, preds_delayed and preds do not match.

Note that if we remove Lrnr_randomForest, result is reproducible:

plan(sequential)

data(cpp_imputed)
task <- sl3_Task$new(
  cpp_imputed, 
  covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn"),
  outcome = "haz")

lrnr_glmnet <- Lrnr_glmnet$new()
lrnr_glm_fast <- Lrnr_glm_fast$new()

stack <- Stack$new(lrnr_glmnet, lrnr_glm_fast)
sl <- Lrnr_sl$new(learners = stack)

set.seed(123)
test_delayed <- delayed_learner_train(sl, task)
sched <- Scheduler$new(test_delayed, SequentialJob)
set.seed(123)
fit_delayed <- sched$compute()

set.seed(123)
fit <- sl$train(task)

preds_delayed <- fit_delayed$predict()
preds <- fit$predict()

expect_equal(preds_delayed, preds)

Removing set.seed(123) above the line test_delayed <- delayed_learner_train(sl, task), preds_delayed and preds do not match.

@jeremyrcoyle
Copy link
Collaborator

So calling lrnr$train on any learner is going to be running delayed anyways. See: https://github.com/tlverse/sl3/blob/master/R/Lrnr_base.R#L224

The difference between the two runs is that by default we use FutureJob, not SequentialJob, which is controllable by this function: sl3:::sl3_delayed_job_type. Changing the example to be consistent about job type fixes the issue:

library(delayed)
library(future)
library(sl3)
library(testthat)
plan(sequential)

data(cpp_imputed)
task <- sl3_Task$new(
  cpp_imputed, 
  covariates = c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn"),
  outcome = "haz")

lrnr_rf <- Lrnr_randomForest$new()

# FutureJob is reproducible
set.seed(123)
test_delayed <- delayed_learner_train(lrnr_rf, task)
sched <- Scheduler$new(test_delayed, FutureJob)
fit_delayed <- sched$compute()
preds_delayed <- fit_delayed$predict()

set.seed(123)
options(sl3.enable.future = TRUE) # the default
fit <- lrnr_rf$train(task)
preds <- fit$predict()
expect_equal(as.numeric(preds_delayed), as.numeric(preds))

# SequentialJob is reproducible
set.seed(123)
test_delayed <- delayed_learner_train(lrnr_rf, task)
sched <- Scheduler$new(test_delayed, SequentialJob)
fit_delayed <- sched$compute()
preds_delayed <- fit_delayed$predict()

set.seed(123)
options(sl3.enable.future = FALSE)
fit <- lrnr_rf$train(task)
preds <- fit$predict()
expect_equal(as.numeric(preds_delayed), as.numeric(preds))

The two job types currently handle seeds differently, which may be fixable, but I believe we changed FutureJob to fix other reproducibility issues, especially under parallelization. Maybe an issue on delayed is appropriate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants