Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

recalibratePlpRefit returns a dataframe instead of an object of class runPlp #470

Open
Volpym opened this issue Jul 12, 2024 · 3 comments
Open

Comments

@Volpym
Copy link

Volpym commented Jul 12, 2024

Describe the bug
Running recalibratePlpRefit doesn't return an object of class runPlp that is recalibrated on the new data object, as stated in the documentation but an object of type data.frame. Am I missing something?

Set up (please run in R "sessionInfo()" and copy the output here):

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8      
 [6] LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_2.1.5                  PatientLevelPrediction_5.0.5 FeatureExtraction_3.5.2      Andromeda_0.6.6              dplyr_1.1.4                 
[6] DatabaseConnector_6.3.2      memuse_4.2-3                

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1     remotes_2.4.2        purrr_1.0.2          splines_4.2.1        rJava_1.0-11         lattice_0.20-45      vctrs_0.6.5         
 [8] generics_0.1.3       utf8_1.2.4           blob_1.2.4           survival_3.3-1       rlang_1.1.4          pkgbuild_1.4.4       pillar_1.9.0        
[15] glue_1.7.0           withr_3.0.0          DBI_1.2.3            ParallelLogger_3.3.0 Cyclops_3.4.1        bit64_4.0.5          dbplyr_2.5.0        
[22] plyr_1.8.9           lifecycle_1.0.4      memoise_2.0.1        tzdb_0.4.0           callr_3.7.6          fastmap_1.2.0        ps_1.7.7            
[29] parallel_4.2.1       curl_5.2.1           fansi_1.0.6          urltools_1.7.3       triebeard_0.4.1      Rcpp_1.0.12          backports_1.5.0     
[36] checkmate_2.3.1      cachem_1.1.0         desc_1.4.2           vroom_1.6.5          jsonlite_1.8.8       bit_4.0.5            hms_1.1.3           
[43] processx_3.8.4       SqlRender_1.18.0     rprojroot_2.0.4      grid_4.2.1           cli_3.6.3            tools_4.2.1          magrittr_2.0.3      
[50] tibble_3.2.1         RSQLite_2.3.7        crayon_1.5.3         pkgconfig_2.0.3      Matrix_1.5-0         pROC_1.18.5          rstudioapi_0.16.0   
[57] R6_2.5.1             compiler_4.2.1   

To Reproduce

cdmDatabaseSchema <- 'demo_cdm'
cdmDatabaseName <- 'postgres'
dbms <- "postgresql"
user <- 'postgres'
pw <- 'postgres'
server <- '<ip>/postgres'
port <- '5438'

connectionDetails <- DatabaseConnector::createConnectionDetails(dbms = dbms,
                                                                server = server,
                                                                user = user,
                                                                password = pw,
                                                                port = port)
databaseDetails <- PatientLevelPrediction::createDatabaseDetails(  connectionDetails,
                                           cdmDatabaseSchema,
                                           cdmDatabaseName,
                                           tempEmulationSchema = cdmDatabaseSchema,
                                           cohortDatabaseSchema = cdmDatabaseSchema,
                                           cohortTable = "sinusitiscohort",
                                           outcomeDatabaseSchema = cdmDatabaseSchema,
                                           outcomeTable = "sinusitiscohort",
                                           outcomeIds = 5,
                                           cdmVersion = 5.3,
                                           cohortId = 3)

covariateSettings <- FeatureExtraction::createCovariateSettings(
  useDemographicsGender = T, 
  useDemographicsAgeGroup = T, 
  useConditionGroupEraLongTerm = T
), 

restrictPlpDataSettings <- PatientLevelPrediction::createRestrictPlpDataSettings(
  studyStartDate = "",
  studyEndDate = "",
  firstExposureOnly = F,
  washoutPeriod = 0,
  sampleSize = NULL
)

plpData <- PatientLevelPrediction::getPlpData(databaseDetails, covariateSettings, restrictPlpDataSettings)
outcomeId <- 5
pop1 <- PatientLevelPrediction::createStudyPopulationSettings(riskWindowStart = 1, 
                                      riskWindowEnd = 365,
                                      requireTimeAtRisk = T, 
                                      minTimeAtRisk = 364, 
                                      includeAllOutcomes = T)

newPopulation <- PatientLevelPrediction::createStudyPopulation(plpData, outcomeId,pop1, NULL)
model<-loadPlpModel('<path_to_rds>')
recalibratedModel <- recalibratePlpRefit(model, newPopulation, plpData)

Context
I have divided the Eunomia dataset into three chunks. I trained a PLP model on the first chunk and now I am recalibrating it using the data from the second chunk. Next, I plan to recalibrate this model again using the third chunk of data. Finally, I will compare this final recalibrated model with the PLP model trained on the entire Eunomia dataset.

@jreps
Copy link
Collaborator

jreps commented Jul 12, 2024

The documentation is definitely wrong. Thanks for reporting this. It appears to return the prediction for the new population with the original model and recalibrated model. It also adds attributes for how the model needed to be adjusted, but these seem hidden. It does refit the model, so this could be edited to return a runPlp, but I need to see whether it is used inside other functions before editing.

@Volpym
Copy link
Author

Volpym commented Jul 14, 2024

Could you provide some information on how to generate the runPlp object myself?

@egillax
Copy link
Collaborator

egillax commented Nov 8, 2024

Hi @Volpym,

I'm now looking at the recalibration code a bit in relation to a study I'm doing. First, if you want to return the model from the recalibratePlpRefit something like the following should work:

Modified recalibratePlpRefit
recalibratePlpRefit <- function(
    plpModel,
    newPopulation,
    newData, 
    returnModel = FALSE) {
  checkNotNull(plpModel)
  checkNotNull(newPopulation)
  checkNotNull(newData)
  checkIsClass(plpModel, "plpModel")
  checkIsClass(newData, "plpData")
  checkBoolean(returnModel)

  # get selected covariates
  includeCovariateIds <- plpModel$covariateImportance %>%
    dplyr::filter(.data$covariateValue != 0) %>%
    dplyr::select("covariateId") %>%
    dplyr::pull()

  # check which covariates are included in new data
  containedIds <- newData$covariateData$covariateRef %>% dplyr::collect()
  noShrinkage <- intersect(includeCovariateIds, containedIds$covariateId)

  # add intercept
  noShrinkage <- append(noShrinkage, 0, 0)

  setLassoRefit <- setLassoLogisticRegression(
    includeCovariateIds = includeCovariateIds,
    noShrinkage = noShrinkage,
    maxIterations = 10000 # increasing this due to test code often not converging
  )

  newData$labels <- newPopulation

  newData$folds <- data.frame(
    rowId = newData$labels$rowId,
    index = sample(2, length(newData$labels$rowId), replace = TRUE)
  )

  # add dummy settings to fit model
  attr(newData, "metaData")$outcomeId <- attr(newPopulation, "metaData")$outcomeId
  attr(newData, "metaData")$targetId <- attr(newPopulation, "metaData")$targetId
  attr(newData, "metaData")$restrictPlpDataSettings <- attr(newPopulation, "metaData")$restrictPlpDataSettings
  attr(newData, "metaData")$covariateSettings <- newData$metaData$covariateSettings
  attr(newData, "metaData")$populationSettings <- attr(newPopulation, "metaData")$populationSettings
  attr(newData$covariateData, "metaData")$featureEngineeringSettings <- PatientLevelPrediction::createFeatureEngineeringSettings()
  attr(newData$covariateData, "metaData")$preprocessSettings <- PatientLevelPrediction::createPreprocessSettings()
  attr(newData, "metaData")$splitSettings <- PatientLevelPrediction::createDefaultSplitSetting()
  attr(newData, "metaData")$sampleSettings <- PatientLevelPrediction::createSampleSettings()

  newModel <- tryCatch(
    {
      fitPlp(
        trainData = newData,
        modelSettings = setLassoRefit,
        analysisId = "recalibrationRefit",
        analysisPath = NULL
      )
    },
    error = function(e) {
      ParallelLogger::logInfo(e)
      return(NULL)
    }
  )
  if (is.null(newModel)) {
    ParallelLogger::logInfo("Recalibration fit failed")
    return(NULL)
  }

  newModel$prediction$evaluationType <- "recalibrationRefit"

  oldPred <- predictPlp(
    plpModel = plpModel,
    plpData = newData,
    population = newPopulation,
    timepoint = 0
  )

  oldPred$evaluationType <- "validation"

  prediction <- rbind(
    oldPred,
    newModel$prediction[, colnames(oldPred)]
  )

  if (!is.null(newModel$covariateImportance)) {
    adjust <- newModel$covariateImportance %>%
      dplyr::filter(.data$covariateValue != 0) %>%
      dplyr::select(
        "covariateId",
        "covariateValue"
      )
  } else {
    adjust <- c()
  }

  newIntercept <- newModel$model$coefficients[names(newModel$model$coefficients) == "(Intercept)"]

  attr(prediction, "metaData")$recalibratePlpRefit <- list(adjust = adjust, newIntercept = newIntercept)
  
  if (returnModel) {
    return(model = newModel)
  } else {
    return(prediction)
  }
}

But this is not a runPlp object, but rather a plpModel object. I don't think it ever should have returned a runPlp object, it was probably some copy paster error in the docs.

But I've been thinking about our three types of recalibration, recalibrationInTheLarge, weakRecalibration and the recalibratePlpRefit . I think it would be good to get the recalibrated model out and ready to be used.

Could you tell me more about your use case and for example why you are looking at the recalibratePlpRefit rather than the other two methods ?

Egill

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants