Commit

Merge pull request #149 from istallworthy/more-feedback

vignette edits

istallworthy authored Nov 20, 2023
2 parents ef49d44 + e7be529 commit 408071b
Showing 9 changed files with 85 additions and 154 deletions.
Binary file modified .DS_Store
2 changes: 1 addition & 1 deletion _pkgdown.yml
@@ -4,7 +4,7 @@ template:
   bootswatch: litera
 navbar:
   structure:
-    left: [intro, reference, articles, tutorials, news]
+    left: [intro, reference, articles, news]
     right: [search, github]
 footer:
   structure:
1 change: 1 addition & 0 deletions examplePipelineRevised.Rmd
@@ -281,6 +281,7 @@ data <- lapply(data, function(x){
factor_covars <- c("state", "TcBlac2", "BioDadInHH2", "HomeOwnd", "PmBlac2",
                   "PmMrSt2", "SurpPreg", "RHealth", "SmokTotl", "DrnkFreq",
                   "RHasSO.6", "RHasSO.15", "RHasSO.24", "RHasSO.35", "RHasSO.58")
data <- lapply(data, function(x) {
  x[, factor_covars] <- as.data.frame(lapply(x[, factor_covars], as.factor))
  x })
Binary file added inst/imgfile.png
Binary file added man/figures/dev 2.jpg
Binary file added man/figures/dev image.jpeg
38 changes: 15 additions & 23 deletions vignettes/Preliminary_Steps.Rmd
@@ -107,10 +107,9 @@ All data must be wide format and contain an “ID” column for subject identification


#### P1a. Format single data frame of long data
-Users beginning with a single data frame in long format (with or without missingness) can utilize a helper function `formatLongData` to summarize exposure and outcome data and convert to required variable names. This function takes a dataset in long format and any variables for time (time_var), ID (id_var), and missing data (missing) with alternative variables and re-labels them according to what is required by the package. It also classes any factor confounders (factor_confounders) as factors in the data and all others as numeric.
+Users beginning with a single data frame in long format (with or without missingness) can use the helper function `formatLongData()` to summarize exposure and outcome data and convert variables to the required names. This function takes a dataset in long format with alternatively named variables for time (`time_var`), ID (`id_var`), and missing data (`missing`) and re-labels them as required by the package. It also classes any factor confounders (`factor_confounders`) as factors in the data and all others as numeric.

```{r}
data_long_f <- formatLongData(data = data_long, exposure = exposure,
                              exposure_time_pts = exposure_time_pts, outcome = outcome,
                              time_var = "Tage", id_var = "S_ID",
@@ -158,39 +157,37 @@ data <- read.csv('/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/


### P2. Impute Data to Account for Missingness
-The functions of the *devMSMs* package accept data in the form of a single data frame with no missing values or m imputed datasets in the form of either a mids object (output from the mice package or via `imputeData`) or a list of imputed datasets. Most developmental data from humans will have some amount of missing data. Given that the creation of IPTW balancing weights requires complete data, we recommend imputing data. Imputation assumes a missing data mechanism of missing at random (MAR) and no more than 20% missing data in total (Leyrat et al., 2021). Given existing work demonstrating its superiority, *devMSMS* implements the ‘within’ approach for imputed data, conducting all steps on each imputed dataset before pooling estimates using Rubin’s rules to create final average predictions and contrast comparisons in *Worfklows* vignettes Step 5 (Leyrat et al, 2021; Granger et al., 2019).
+The functions of the *devMSMs* package accept data in the form of a single data frame with no missing values, or m imputed datasets in the form of either a mids object (output from the mice package or via `imputeData()`) or a list of imputed datasets. Most developmental data from humans will have some amount of missing data. Given that the creation of IPTW balancing weights requires complete data, we recommend imputing data. Imputation assumes a missing data mechanism of missing at random (MAR) and no more than 20% missing data in total (Leyrat et al., 2021). Given existing work demonstrating its superiority, *devMSMs* implements the ‘within’ approach for imputed data, conducting all steps on each imputed dataset before pooling estimates using Rubin’s rules to create final average predictions and contrast comparisons in *Workflows* vignettes Step 5 (Leyrat et al., 2021; Granger et al., 2019).
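To make the three accepted forms concrete, here is a minimal sketch (the object names are hypothetical placeholders):

```{r}
#any one of these three forms can be passed to devMSMs functions
data <- data_complete                       #a single data frame with no missing values
data <- mice::mice(data_wide, m = 5)        #a mids object (from mice or imputeData())
data <- list(imp1, imp2, imp3, imp4, imp5)  #a list of imputed data frames
```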

#### P2a. Multiply impute single wide, formatted data frame using mice
-Users have the option of using the helper `imputeData` function to impute their correctly formatted wide data. This step can take a while to run. The user can specify how many imputed datasets to create (default m = 5). `imputeData` draws on the `mice` function from the *mice* package (van Buuren & Oudshoorn, 2011) to conduct multiple imputation by chained equations (mice). All other variables present in the dataset are used to impute missing data in each column.
+Users have the option of using the helper `imputeData()` function to impute their correctly formatted wide data. This step can take a while to run. The user can specify how many imputed datasets to create (default m = 5). `imputeData()` draws on the `mice()` function from the *mice* package (van Buuren & Oudshoorn, 2011) to conduct multiple imputation by chained equations (MICE). All other variables present in the dataset are used to impute missing data in each column.

The user can specify the imputation method through the `method` field drawing from the following list: “pmm” (predictive mean matching), “midastouch” (weighted predictive mean matching), “sample” (random sample from observed values), “rf” (random forest) or “cart” (classification and regression trees). Random forest imputation is the default given evidence for its efficiency and superior performance (Shah et al., 2014).
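As a sketch (reusing the core inputs defined elsewhere in this vignette), selecting predictive mean matching instead of the random forest default might look like:

```{r}
#sketch: override the default "rf" method with predictive mean matching
imputed_pmm <- imputeData(data = data_wide, exposure = exposure, outcome = outcome,
                          m = 5, method = "pmm",
                          home_dir = home_dir, save.out = FALSE)
```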

The parameter `read_imps_from_file` allows you to read already imputed data in from local storage (`TRUE`) so that you do not have to re-run this imputation code multiple times (`FALSE`; default). Users may use this parameter to supply their own mids object of imputed data from the *mice* package (saved with the file name ‘all_imp.rds’). Be sure to inspect the console for any warnings, as well as the resulting imputed datasets. Any variables that have missing data following imputation may need to be removed due to high collinearity and/or low variability.
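One way to carry out that inspection (a sketch using the *mice* API, assuming `imputed_data` is the mids object created below) is to tally any residual missingness in each completed dataset:

```{r}
#sketch: count remaining NAs per variable in each imputed dataset
remaining_na <- sapply(seq_len(imputed_data$m), function(i) {
  colSums(is.na(mice::complete(imputed_data, i)))
})
remaining_na[rowSums(remaining_na) > 0, , drop = FALSE]  #candidates for removal
```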

The required inputs for this function are a data frame in wide format (formatted according to the pre-requirements listed above), m, the number of imputed datasets to create, a path to the home directory (if `save.out = TRUE`), exposure (e.g., “variable”), and outcome (e.g., “variable.t”). The home directory path, exposure, and outcome should already be defined if the user completed the Specifying Core Inputs vignette.

-Optional inputs are as follows. The user can specify an imputation method compatible with `mice` (see above). Additionally, the user can specify in `maxit` the number of interactions for `mice::mice()` to conduct (default is 5). The user can also specify `para_proc`, a logical indicator indicating whether or not to speed up imputing using parallel processing (default = TRUE). This uses 2 cores using functions from the *parallel*, *doRNG*, and *doParallel* packages.
+Optional inputs are as follows. The user can specify an imputation method compatible with `mice()` (see above). Additionally, the user can specify in `maxit` the number of iterations for `mice::mice()` to conduct (default is 5). The user can also specify `para_proc`, a logical indicator of whether to speed up imputing using parallel processing (default = TRUE), which uses 2 cores via functions from the *parallel*, *doRNG*, and *doParallel* packages.

The user may also specify any additional inputs accepted by `mice::mice()`; we advise consulting the [*mice* documentation](https://www.rdocumentation.org/packages/mice/versions/3.16.0/topics/mice) for more information.
The user can also indicate if they have already created imputed datasets from this function and wish to read them in (`read_imps_from_file = TRUE`) rather than recreate them (default = FALSE). They can also set `save.out = FALSE` to suppress saving intermediate and final output to the local home directory (saving is recommended; default = TRUE).
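For instance, on a later run a user might re-load the saved imputations and suppress file output (a sketch; other arguments as in the chunk below):

```{r}
#sketch: re-use previously saved imputations ('all_imp.rds') instead of re-imputing
imputed_data <- imputeData(data = data_wide, exposure = exposure, outcome = outcome,
                           read_imps_from_file = TRUE,
                           home_dir = home_dir, save.out = FALSE)
```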

For this example, we create 5 imputed datasets using the default random forest method with 5 iterations, and assign the output to `data` for use with *devMSMs*.

```{r}
#optional; number of imputations (default is 5)
-m <- 5 #empirical example
+m <- 5
#optional; provide an imputation method: pmm, midastouch, sample, cart, rf (default)
-method <- "rf" #empirical example
+method <- "rf"
#optional; maximum iterations for imputation (default is 5)
-maxit <- 5 #empirical example
+maxit <- 5
imputed_data <- imputeData(data = data_wide, exposure = exposure, outcome = outcome,
                           m = m, method = method, maxit = maxit, para_proc = FALSE,
                           read_imps_from_file = FALSE,
                           home_dir = home_dir, save.out = TRUE)
data <- imputed_data
```

We can also read in previously saved imputed data.
@@ -204,12 +201,10 @@ data <- readRDS("/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/t
Users can also read in, as a list, imputed data created using a different function and saved locally as .csv files (labeled “1”:m) in a single folder.

```{r}
#read in imputed csv files to list
-folder <- "/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/imputations/" # these are final imputations for empirical example; change this to match your local folder
+folder <- "/Users/isabella/Library/CloudStorage/Box-Box/BSL General/MSMs/testing/testing data/continuous outcome/continuous exposure/imputations/"
files <- list.files(folder, full.names = TRUE, pattern = "\\.csv") #make sure pattern matches suffix of your data
#if you want to use the package with a list of imputed data from above
data <- lapply(files, function(file) {
  imp_data <- read.csv(file)
  imp_data
@@ -239,11 +234,10 @@ epochs <- data.frame(epochs = c("Infancy", #list user-specified names
Exposure histories are the units by which users will test their substantive hypotheses, and their construction should be determined by both theoretical and practical reasoning. We strongly recommend users verify and inspect exposure histories a priori in relation to their data and hypotheses.

### P4a. Create high and low cutoff values for continuous exposures
-First, for continuously distributed exposures (regardless of whether or not exposure epochs are specified), we recommend users indicate high and low cutoff values as an optional input to the `compareHistories(`) *devMSMs* function (see *Workflows* vignettes). To do so, they specify to `hi_lo_cut`, as a list, a quantile value (0-1) above which will be considered high levels exposure, followed by a quantile value (0-1) below which will be considered low levels of exposure (default is median split). These values may have to be revised following inspection of the sample distribution across the resulting exposure histories in the subsequent steps. These final values should be used in creating exposure histories in Step 5 of the *Workflows* vignettes.
+First, for continuously distributed exposures (regardless of whether or not exposure epochs are specified), we recommend users indicate high and low cutoff values as an optional input to the `compareHistories()` *devMSMs* function (see *Workflows* vignettes). To do so, they specify to `hi_lo_cut`, as a list, a quantile value (0-1) above which will be considered high levels of exposure, followed by a quantile value (0-1) below which will be considered low levels of exposure (default is a median split). These values may have to be revised following inspection of the sample distribution across the resulting exposure histories in the subsequent steps. These final values should be used in creating exposure histories in Step 5 of the *Workflows* vignettes.

```{r}
-hi_lo_cut <- c(0.6, 0.3) #empirical example
+hi_lo_cut <- c(0.6, 0.3)
```


@@ -259,25 +253,23 @@ These final reference and comparison values established at this step should be u
```{r}
reference <- c("l-l-l", "l-l-h")
-comparison <- c("h-h-h", "h-l-l", "l-l-h", "h-h-l", "l-h-h") #empirical example final choice
+comparison <- c("h-h-h", "h-l-l", "l-l-h", "h-h-l", "l-h-h")
```

### P4c. Inspect exposure histories and data
-For all users, we highly recommend use of the helper `inspectData` function (with the original dataset long or wide format or imputed data in the case of missingness) to summarize exposure, outcome, and confounders and inspect the sample distribution among exposure histories. Based on any user-specified exposure epochs and high and low quantile values (for continuous exposures), this function outputs a table showing the sample distribution across all histories, as shown below in Table 2.
+For all users, we highly recommend use of the helper `inspectData()` function (with the original dataset in long or wide format, or imputed data in the case of missingness) to summarize exposure, outcome, and confounders and inspect the sample distribution among exposure histories. Based on any user-specified exposure epochs and high and low quantile values (for continuous exposures), this function outputs a table showing the sample distribution across all histories.

We strongly suggest visually inspecting this table and revising the designation of epochs and/or high and low quantile values (for continuous exposures) until each history contains a reasonable number of participants. While there is no gold-standard required number per history cell, users should guard against extrapolation beyond the scope of the data. For example, in our data, when using 75th and 25th percentile cutoffs, some histories represented fewer than two cases, and we thus re-evaluated our cutoffs (see the sketch below). Users may wish to revise any epoch designation and high and low cutoff values, where applicable. The function conducts summaries and history distribution inspection for each imputed dataset if imputed data are supplied.

*insert Table 2*
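As a sketch of that iteration (`hi_lo_cut_try` is a hypothetical name, and the 75th/25th quantiles are the illustrative values we re-evaluated, not a recommendation), one might compare history cell sizes under alternative cutoffs before settling on final values:

```{r}
#sketch: re-tabulate history cell sizes under stricter quantile cutoffs
hi_lo_cut_try <- c(0.75, 0.25) #in our data, these left some histories with <2 cases
inspectData(data = data, exposure = exposure, exposure_time_pts = exposure_time_pts,
            outcome = outcome, ti_confounders = ti_confounders,
            tv_confounders = tv_confounders, epochs = epochs,
            hi_lo_cut = hi_lo_cut_try, home_dir = home_dir, save.out = FALSE)
```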

-The required inputs for `inspectData` are: data (as a data frame in wide or long format, a list of imputed data frames in wide format, or a mids object), a path to the home directory, exposure (e.g., “variable”), and outcome (e.g., “variable.t”).
+The required inputs for `inspectData()` are: data (as a data frame in wide or long format, a list of imputed data frames in wide format, or a mids object), a path to the home directory, exposure (e.g., “variable”), and outcome (e.g., “variable.t”).

Optional inputs are time-varying confounders (e.g., “variable.t”), epochs, high/low cutoff values for continuous exposures, and specification of reference and comparison histories (see above); the user can also set `verbose = FALSE` to suppress console output (recommended and default is TRUE) and `save.out = FALSE` to suppress saving intermediate and final output to the local home directory (recommended and default = TRUE). The specification of exposure epochs should be kept consistent throughout the use of the *devMSMs* package (see *Workflows* vignettes). The home directory path, exposure, exposure time points, confounders, and outcome should already be defined if the user completed the Specify Required Package Core Inputs vignette.

-The helper `inspectData` function outputs the following files into the home directory: a correlation plot of all variables in the dataset (Figure 2), tables of exposure (Table 3) and outcome (Table 4) descriptive statistics, and two summary tables of the confounders considered at each time point (Table 5 & 6).
+The helper `inspectData()` function outputs the following files into the home directory: a correlation plot of all variables in the dataset (Figure 2), tables of exposure (Table 3) and outcome (Table 4) descriptive statistics, and two summary tables of the confounders considered at each time point (Tables 5 & 6).

```{r}
inspectData(data = data, exposure = exposure, exposure_time_pts = exposure_time_pts, outcome = outcome, # required input
            ti_confounders = ti_confounders, tv_confounders = tv_confounders, # required input
            epochs = epochs, hi_lo_cut = hi_lo_cut, reference = reference, comparison = comparison, #optional input
