Imputing a multivariate response variable in tidy long format #655

wlandau · 2024-07-23T16:36:36Z

Background

I work with longitudinal clinical trial data and longitudinal models like the mixed model for repeated measures (MMRM). These multivariate models typically assume independent patients and correlated observations within patients.

There seems to be a consensus among the Clinical Data Interchange Standards Consortium (CDISC), modeling packages like brms, and the Tidyverse to represent this longitudinal data in tidy long form.

Data

An example dataset is the FEV dataset from the mmrm package, where the response variable FEV1 is recorded for multiple patients (USUBJID) who were randomized to different treatment groups (ARMCD) and measured over multiple time points (AVISIT).

data("fev_data", package = "mmrm")
tibble::as_tibble(fev_data)
#> # A tibble: 800 × 10
#>    USUBJID AVISIT ARMCD RACE                      SEX    FEV1_BL  FEV1 WEIGHT VISITN VISITN2
#>    <fct>   <fct>  <fct> <fct>                     <fct>    <dbl> <dbl>  <dbl>  <int>   <dbl>
#>  1 PT1     VIS1   TRT   Black or African American Female    25.3  NA    0.677      1  -0.626
#>  2 PT1     VIS2   TRT   Black or African American Female    25.3  40.0  0.801      2   0.184
#>  3 PT1     VIS3   TRT   Black or African American Female    25.3  NA    0.709      3  -0.836
#>  4 PT1     VIS4   TRT   Black or African American Female    25.3  20.5  0.809      4   1.60 
#>  5 PT2     VIS1   PBO   Asian                     Male      45.0  NA    0.465      1   0.330
#>  6 PT2     VIS2   PBO   Asian                     Male      45.0  31.5  0.233      2  -0.820
#>  7 PT2     VIS3   PBO   Asian                     Male      45.0  36.9  0.360      3   0.487
#>  8 PT2     VIS4   PBO   Asian                     Male      45.0  48.8  0.507      4   0.738
#>  9 PT3     VIS1   PBO   Black or African American Female    43.5  NA    0.682      1   0.576
#> 10 PT3     VIS2   PBO   Black or African American Female    43.5  36.0  0.892      2  -0.305
#> # ℹ 790 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Above, each independent patient (index variable USUBJID) has multiple rows of data and the rows within a patient (index variable AVISIT) are correlated repeated measures.
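
As a quick check of this structure, the sketch below counts rows per patient and missing values per column. It assumes only that the mmrm package is installed; the object names `rows_per_patient` and `missing_per_column` are illustrative and do not come from the original post.

```r
data("fev_data", package = "mmrm")

# Number of rows (visits) recorded for each patient.
rows_per_patient <- table(fev_data$USUBJID)
range(rows_per_patient)

# Number of missing values in each column; show only columns with any.
missing_per_column <- colSums(is.na(fev_data))
missing_per_column[missing_per_column > 0]

# Missing FEV1 values broken down by visit.
table(visit = fev_data$AVISIT, missing = is.na(fev_data$FEV1))
```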

Model

In brms, I can run a multivariate model on this tidy long dataset by adding a unstr(time = AVISIT, gr = USUBJID) term in the model formula.

library(brms)
formula <- brmsformula(
  FEV1 ~ FEV1_BL + AVISIT + ARMCD + AVISIT*ARMCD + RACE + SEX + WEIGHT
    + unstr(time = AVISIT, gr = USUBJID)
)
fit <- brm(data = fev_data, formula = formula)

brms has incredibly smooth native integration with mice which pools analyses automatically. I would strongly prefer to use this integration for multiple imputation in MNAR scenarios.

Problem

However, I am not sure if it would be appropriate to do so.

From chapter 3 and chapter 4 of Flexible Imputation of Missing Data, mice seems to have different ideas about how to represent univariate vs multivariate data. According to the book, a univariate missing data problem is when only one column in the data has missing values, and a multivariate one is when multiple columns have missing values. Throughout the book, each row in a dataset is implicitly referred to as a "case", whereas my colleagues and I think of a "case" as a patient with multiple rows. All this implies a non-tidy wide format for the data, rather than the conventional tidy long one we prefer to work with.

In my line of work, there is only one partially missing column in the dataset, but the underlying statistical problem is still multivariate. I am wondering what is the most appropriate way to use mice in this ubiquitous scenario.

Experiments

When I naively plug the full FEV dataset into mice, I get warnings about "logged events".

library(mice)
mice(fev_data)
# ...
#> 1                                                                                                                                                                                                                                                   VISITN
#> 2                                                                                                                                             USUBJIDPT54, USUBJIDPT142, USUBJIDPT199, ARMCDTRT, RACEBlack or African American, SEXFemale, WEIGHT, VISITN2
#> 3 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 4                                                                                                                                             USUBJIDPT54, USUBJIDPT142, USUBJIDPT199, ARMCDTRT, RACEBlack or African American, SEXFemale, WEIGHT, VISITN2
#> 5 mice detected that your data are (nearly) multi-collinear.\nIt applied a ridge penalty to continue calculations, but the results can be unstable.\nDoes your dataset contain duplicates, linear transformation, or factors with unique respondent names?
#> 6                                                                                                                                             USUBJIDPT54, USUBJIDPT142, USUBJIDPT199, ARMCDTRT, RACEBlack or African American, SEXFemale, WEIGHT, VISITN2
#> Warning message:
#> Number of logged events: 51
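
The logged events can also be inspected as a data frame rather than read off the console. The following is a minimal sketch, assuming the `loggedEvents` element of the returned `mids` object with its usual columns (`it`, `im`, `dep`, `meth`, `out`); the names `imp_naive` and `events` are illustrative.

```r
library(mice)
data("fev_data", package = "mmrm")

# Run the naive imputation quietly and keep the result for inspection.
imp_naive <- suppressWarnings(mice(fev_data, printFlag = FALSE, seed = 1))

# One row per logged event: iteration, imputation number, target variable,
# the method or check that triggered the event, and the affected terms.
events <- imp_naive$loggedEvents
head(events)

# Which target variables and which checks generated the events.
table(events$dep, events$meth)
```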

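From the messages, mice was checking for collinearity involving the patient ID variable USUBJID, which is not meaningful here because USUBJID is an identifier rather than a predictor. Dropping it, along with VISITN (which also appears in the logged events), removes the warnings, but then the model is univariate where presumably each row is a "case".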

library(dplyr)
fev_data %>%
  select(-USUBJID, -VISITN) %>%
  mice()

Unless I am missing something about how PMM works, it seems like an appropriate imputation method should at least have some awareness of USUBJID. To tell mice that the data is really multivariate, it seems like I need to pivot it to wide form first.

library(tidyverse)
data <- fev_data %>%
  pivot_wider(
    id_cols = c("ARMCD", "FEV1_BL", "USUBJID", "SEX", "RACE"),
    names_from = "AVISIT",
    values_from = "FEV1"
  ) %>%
  select(-USUBJID)
data
#> # A tibble: 200 × 8
#>    ARMCD FEV1_BL SEX    RACE                       VIS1  VIS2  VIS3  VIS4
#>    <fct>   <dbl> <fct>  <fct>                     <dbl> <dbl> <dbl> <dbl>
#>  1 TRT      25.3 Female Black or African American  NA    40.0  NA    20.5
#>  2 PBO      45.0 Male   Asian                      NA    31.5  36.9  48.8
#>  3 PBO      43.5 Female Black or African American  NA    36.0  NA    37.2
#>  4 TRT      31.6 Female Asian                      33.9  33.7  NA    54.5
#>  5 PBO      43.6 Male   Black or African American  32.3  NA    46.8  41.7
#>  6 PBO      21.6 Male   Black or African American  NA    NA    39.0  NA  
#>  7 PBO      25.5 Female Asian                      31.9  32.9  NA    48.3
#>  8 PBO      47.8 Male   Black or African American  32.2  35.9  45.5  53.0
#>  9 TRT      50.9 Male   White                      47.2  46.6  NA    58.1
#> 10 PBO      57.7 Female Black or African American  NA    NA    45.0  NA  
#> # ℹ 190 more rows
#> # ℹ Use `print(n = ...)` to see more rows
imputed <- mice(data)

But this does not work in my case because it is no longer possible to regress on time-varying predictors. For example, if I pivot FEV1 to wide form, I need to drop the variable WEIGHT (or try to convert WEIGHT to multiple columns as well, which does not really make sense). In addition, I would need to switch to a much more complicated multivariate modeling syntax in brms, which may make it difficult or impossible to specify the precise covariance and correlation structures I need. It is tempting to hack into the imputed object and pivot each imputed dataset back to long form, but it seems unwise to try.

wlandau · 2024-07-23T17:09:27Z

Author

Hmm... maybe I was looking for https://insightsengineering.github.io/rbmi all along...

hanneoberman · 2024-07-24T07:23:56Z

Maintainer

Hi @wlandau! Your data looks like long format clustered data. mice is able to produce valid imputations once you accommodate the clustering structure (i.e. measurements within patients). Here's the chapter explaining how to: stefvanbuuren.name/fimd/ch-longitudinal.
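
A minimal, untested sketch of what that could look like for the FEV data, assuming the pan package is installed and using mice's `2l.pan` method with the cluster variable coded as `-2` in the predictor matrix. The column selection and the names `long`, `meth`, `pred`, and `imp_long` are illustrative choices, not something taken from the chapter. Note that a random-intercept imputation model implies an exchangeable within-patient correlation, which is simpler than the unstructured covariance in the analysis model above.

```r
library(mice)
data("fev_data", package = "mmrm")

# Keep the variables used in the analysis model, in long format.
long <- as.data.frame(fev_data)[, c(
  "USUBJID", "AVISIT", "ARMCD", "RACE", "SEX", "FEV1_BL", "WEIGHT", "FEV1"
)]

# The two-level methods expect the cluster identifier as an integer.
long$USUBJID <- as.integer(long$USUBJID)

# Impute FEV1 with a two-level normal model; other columns are complete.
meth <- make.method(long)
meth["FEV1"] <- "2l.pan"

# In the FEV1 row, -2 marks the cluster variable and 1 marks fixed effects.
pred <- make.predictorMatrix(long)
pred["FEV1", "USUBJID"] <- -2

imp_long <- mice(
  long,
  method = meth,
  predictorMatrix = pred,
  m = 5,
  seed = 1,
  printFlag = FALSE
)
```

Because the data stay in long format, time-varying covariates such as WEIGHT remain ordinary columns.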

1 reply

wlandau Jul 24, 2024
Author

Thanks! I'll take a look.

stefvanbuuren · 2024-07-29T13:31:48Z

Maintainer

Some generic advice, not knowing how well it applies to your case:

If you can, convert the data into a wide format before imputation and convert it back into a long format for making plots or for subject-level analyses. Imputing wide-format data (all patient data in one row) with mice is easy, flexible, and conforms to the thinking of subject-matter specialists.
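
A minimal, untested sketch of this wide-then-long workflow for the FEV data from the original post. It assumes that the time-varying WEIGHT is spread into one column per visit (a choice the original post questions) and that USUBJID is kept for bookkeeping but excluded as a predictor. The helper `to_long()` and the other object names are hypothetical.

```r
library(mice)
library(tidyr)
library(dplyr)
data("fev_data", package = "mmrm")

# Long -> wide: one row per patient, one column per visit for each
# time-varying variable (FEV1_VIS1, ..., WEIGHT_VIS4).
wide <- fev_data %>%
  select(USUBJID, ARMCD, RACE, SEX, FEV1_BL, AVISIT, FEV1, WEIGHT) %>%
  pivot_wider(
    id_cols = c(USUBJID, ARMCD, RACE, SEX, FEV1_BL),
    names_from = AVISIT,
    values_from = c(FEV1, WEIGHT)
  ) %>%
  as.data.frame()

# Keep the patient ID in the data, but never use it as a predictor.
pred <- make.predictorMatrix(wide)
pred[, "USUBJID"] <- 0

imp_wide <- mice(wide, predictorMatrix = pred, m = 5, seed = 1, printFlag = FALSE)

# Wide -> long for one completed dataset.
to_long <- function(completed) {
  completed %>%
    pivot_longer(
      cols = matches("^(FEV1|WEIGHT)_VIS"),
      names_to = c(".value", "AVISIT"),
      names_sep = "_"
    ) %>%
    mutate(AVISIT = factor(AVISIT, levels = levels(fev_data$AVISIT)))
}

# A list with one completed long-format data frame per imputation.
long_list <- lapply(complete(imp_wide, action = "all"), to_long)
```

A list of completed data frames like `long_list` can be passed as the `data` argument of `brms::brm_multiple()`, which accepts either a `mids` object or a list of data frames, so the long-format model formula does not need to change.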

It might be challenging to convert into a wide format if patient timing differs widely. In that case, we end up using hundreds of different time points. You might try to define time bins, but in my mind, a better solution is to use a broken stick model <doi:10.18637/jss.v106.i07> to convert long to wide (repeated measures). The accompanying software also supports the generation of multiple imputations.

Every problem is different, so use whatever suits you.
