Commit

add citation, update vignette, spell checking
Yinqi committed May 18, 2020
1 parent 0023fe1 commit 3a0e9e6
Showing 17 changed files with 172 additions and 50 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -6,7 +6,7 @@ Author: Yinqi Zhao, Cheng Peng, Zhao Yang, David V. Conti
Maintainer: Yinqi Zhao <[email protected]>
Description: An implementation for the 'LUCID' method to jointly estimate latent unknown clusters/subgroups with integrated data.
An EM algorithm is used to obtain the latent cluster assignment and model parameter estimates.
-Feature selection is achieved by applying the regularization method.
+Feature selection is achieved by applying the L1 regularization method.
Depends: R (>= 3.6.0)
License: GPL-3
Encoding: UTF-8
2 changes: 1 addition & 1 deletion R/boot_lucid.R
@@ -8,7 +8,7 @@
#' @param CoG Optional, matrix. Covariates to be adjusted for estimating the latent cluster.
#' @param CoY Optional, matrix. Covariates to be adjusted for estimating the outcome.
#' @param model A LUCID model fitted by \code{\link{est.lucid}}.
-#' @param R Number of boostrap iterations.
+#' @param R Number of bootstrap iterations.
#' @param n Number of CPU cores to be used in the bootstrap
#'
#' @return A list of estimates with their 95 percent CI.
2 changes: 1 addition & 1 deletion R/def_lucid.R
@@ -1,6 +1,6 @@
#' Control parameters for EM algorithm
#'
-#' @param tol Convergence critieria for the EM algorithm. Default is 0.001.
+#' @param tol Convergence criteria for the EM algorithm. Default is 0.001.
#' @param max_itr Maximum number of iterations in each try of fitting process, integer, default is 1000.
#' @param max_tot.itr Maximum number of total iterations, integer, default is 10000.
#'
2 changes: 1 addition & 1 deletion README.Rmd
@@ -49,7 +49,7 @@ and the expectation of the complete log likelihood can be written as
knitr::include_graphics("man/figures/equation4.png")
```

-At each iteration, in the E-step, compute the expectation of the complete data log likelihood by plugging in the posterior probability and then in the M-step, update the parameters by maximizing the expected complete likelihood function. Detailed derivations of the EM algorithm for LUDID can be found elsewhere.
+At each iteration, in the E-step, compute the expectation of the complete data log likelihood by plugging in the posterior probability and then in the M-step, update the parameters by maximizing the expected complete likelihood function. Detailed derivations of the EM algorithm for LUCID can be found elsewhere.


## Installation
2 changes: 1 addition & 1 deletion README.md
@@ -58,7 +58,7 @@ At each iteration, in the E-step, compute the expectation of the
complete data log likelihood by plugging in the posterior probability
and then in the M-step, update the parameters by maximizing the expected
complete likelihood function. Detailed derivations of the EM algorithm
-for LUDID can be found elsewhere.
+for LUCID can be found elsewhere.
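The E-step/M-step cycle this paragraph describes can be sketched on a toy example. Below is a minimal Python illustration of EM for a one-dimensional, two-component Gaussian mixture with a fixed shared standard deviation; it is a generic sketch of the algorithm, not the LUCID implementation (all names here are ours).

```python
import math

def em_gmm_1d(x, mu_init, sigma=1.0, pi=0.5, n_iter=50):
    """Toy EM for a 2-component 1-D Gaussian mixture (fixed shared sigma)."""
    mu1, mu2 = mu_init
    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) that each point
        # belongs to component 1, given the current parameters
        r = []
        for xi in x:
            p1 = pi * math.exp(-0.5 * ((xi - mu1) / sigma) ** 2)
            p2 = (1 - pi) * math.exp(-0.5 * ((xi - mu2) / sigma) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: maximize the expected complete-data log likelihood,
        # which here has closed-form responsibility-weighted updates
        n1 = sum(r)
        pi = n1 / len(x)
        mu1 = sum(ri * xi for ri, xi in zip(r, x)) / n1
        mu2 = sum((1 - ri) * xi for ri, xi in zip(r, x)) / (len(x) - n1)
    return mu1, mu2, pi

x = [-2.2, -2.1, -2.0, -1.9, 1.8, 2.0, 2.1, 2.2]
mu1, mu2, pi = em_gmm_1d(x, mu_init=(-1.0, 1.0))
```

With well-separated data the responsibilities converge to nearly 0/1 and each component mean approaches the average of its cluster.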

## Installation

Binary file modified data/G2.rda
Binary file not shown.
Binary file modified data/Y2.rda
Binary file not shown.
Binary file modified data/Z2.rda
Binary file not shown.
18 changes: 18 additions & 0 deletions inst/CITATION
@@ -0,0 +1,18 @@
citHeader("To cite LUCIDus in publications use:")

citEntry(
entry = "Article",
title = "A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits",
author = "Peng, Cheng and Wang, Jun and Asante, Isaac and Louie, Stan and Jin, Ran and Chatzi, Lida and Casey, Graham and Thomas, Duncan C and Conti, David V",
journal = "Bioinformatics",
year = "2019",
issn = "1367-4803",
doi = "10.1093/bioinformatics/btz667",
url = "https://doi.org/10.1093/bioinformatics/btz667",

textVersion =
paste("Cheng Peng, Jun Wang, Isaac Asante, Stan Louie, Ran Jin, Lida Chatzi, Graham Casey, Duncan C Thomas, David V Conti (2019).",
"A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits.",
"Bioinformatics, btz667.",
"URL https://doi.org/10.1093/bioinformatics/btz667")
)
2 changes: 1 addition & 1 deletion man/boot.lucid.Rd


2 changes: 1 addition & 1 deletion man/def.control.Rd


190 changes: 147 additions & 43 deletions vignettes/LUCID_vignette.Rmd
@@ -24,7 +24,7 @@ The **LUCIDus** package is aiming to provide researchers in the genetic epidemio
This package is an implementation of the novel statistical method proposed in the research paper "A Latent Unknown Clustering Integrating Multi-Omics Data (LUCID) with Phenotypic Traits^[https://doi.org/10.1093/bioinformatics/btz667]", published in *Bioinformatics*. LUCID improves subtype classification, which leads to better diagnostics as well as prognostics, and could be a potential solution for efficient targeted treatments and successful personalized medicine.

## Introduction to the LUCID (Latent Unknown Clustering with Integrated Data)
-Multi-omics data combined with the phenotypic trait are integrated by jointly modeling their relationships through a latent cluster variable, which is illustrated by the directed acyclic graph (DAG) below. (A screenshot from [LUCID papaer](https://doi.org/10.1093/bioinformatics/btz667))
+Multi-omics data combined with the phenotypic trait are integrated by jointly modeling their relationships through a latent cluster variable, which is illustrated by the directed acyclic graph (DAG) below. (A screenshot from [LUCID paper](https://doi.org/10.1093/bioinformatics/btz667))

```{r out.width="50%", echo=FALSE}
knitr::include_graphics("DAG.png")
@@ -75,35 +75,97 @@ Here, we use a simulated data set in the LUCID package to illustrate the framewo

### Identify the number of latent clusters.
The first step in fitting a LUCID model is to identify the number of latent clusters. The Bayesian Information Criterion (BIC) is commonly used to compare and choose among a group of non-nested models, so we adapt the BIC to decide the number of latent clusters for LUCID. We can use `tune.lucid()` for this analysis.
```
> set.seed(1)
> tune.K <- tune.lucid(G = G2, Z = Z2, Y = Y2, K = 2:5)
> tune.K
$res.K
  K       BIC
1 2  99025.54
2 3  99607.07
3 4 100196.34
4 5 100778.79

$res.tune
NULL

$optimal
Rho_G Rho_Z_InvCov Rho_Z_CovMu    K      BIC
   NA           NA          NA 2.00 99025.54
```

The function returns a list. `list$res.K` lists a series of models together with their BICs, and `list$optimal` gives the best K. Here, the model with K = 2 has the lowest BIC (99025.54), so we decide the optimal K to be 2.
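Choosing K boils down to an argmin over the candidate BICs. A tiny Python sketch using the values printed by `tune.lucid()` above (the dictionary name is ours):

```python
# BIC per candidate number of clusters, copied from the tune.lucid() output
bic = {2: 99025.54, 3: 99607.07, 4: 100196.34, 5: 100778.79}

# choose the K with the smallest BIC
best_K = min(bic, key=bic.get)
print(best_K)  # 2
```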

### Fit the LUCID model
We use the `est.lucid()` function to fit the model. The `useY` option deserves special attention. By default, `useY = TRUE`, which means we are interested in estimating the latent structure of the data and use the information in the outcome for clustering. On the other hand, if the primary question is to test the association of the clusters with the outcome, we should set it to `FALSE`, since we do not want to bias the estimation of the clusters by including the outcome in the estimation. Here is an example which focuses on the estimation of the clustering.
```
> fit1 <- est.lucid(G = G2, Z = Z2, Y = Y2, K = 2)
initialize the LUCID ...
iteration 1 : E-step finished.
iteration 1 : M-step finished, loglike = -112022.6
iteration 2 : E-step finished.
iteration 2 : M-step finished, loglike = -65392.77
iteration 3 : E-step finished.
iteration 3 : M-step finished, loglike = -64538.53
iteration 4 : E-step finished.
iteration 4 : M-step finished, loglike = -60593.42
iteration 5 : E-step finished.
iteration 5 : M-step finished, loglike = -50891.01
iteration 6 : E-step finished.
iteration 6 : M-step finished, loglike = -48932.31
iteration 7 : E-step finished.
iteration 7 : M-step finished, loglike = -48932.31
Success: LUCID converges!
> fit1
An object estimated by LUCID model
Outcome type: normal
Number of clusters: K = 2
Variance-Covariance structure for biomarkers: EII model
```

Then use `summary()` to display the model parameters.
```
> summary(fit1)
----------Summary of the LUCID model----------
K = 2 , log likelihood = -48932.31 , BIC = 99025.54
(1) Y (normal outcome): the mean and the sd of the Gaussian mixture Y for each latent cluster
                mu        sd
cluster1 -1.989017 0.9803328
cluster2  2.030915 1.0101779
(2) Z: estimates of biomarker means for each latent cluster
        cluster1     cluster2
CZ1 -2.007621361  2.016969684
CZ2 -1.963055183  1.940025207
CZ3 -2.003261127  2.006120325
CZ4 -2.006636548  1.984085152
CZ5 -1.998671158  1.965863805
NZ1  0.003405598 -0.028354380
NZ2  0.010965841  0.031840666
NZ3 -0.029590811  0.009409087
NZ4  0.018534552  0.005757923
NZ5  0.028345495  0.062087843
(3) E: the odds ratio of being assigned to each latent cluster for each exposure
                original        OR
CG1.cluster2 -0.70100547 0.4960863
CG2.cluster2 -0.72988086 0.4819664
CG3.cluster2 -0.72459572 0.4845204
CG4.cluster2 -0.64141179 0.5265485
CG5.cluster2 -0.68018375 0.5065239
NG1.cluster2 -0.21877995 0.8034985
NG2.cluster2  0.08447478 1.0881454
NG3.cluster2 -0.04780871 0.9533161
NG4.cluster2  0.01291939 1.0130032
NG5.cluster2  0.01306431 1.0131500
```
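In panel (3) of the summary, the `OR` column is the exponential of the `original` column of log odds: for instance, CG1's coefficient of about -0.701 maps to an odds ratio of about 0.496. A quick Python check of that arithmetic:

```python
import math

log_odds = -0.70100547       # 'original' estimate for CG1.cluster2 above
odds_ratio = math.exp(log_odds)  # exponentiate log odds to get the OR
print(odds_ratio)
```

An OR below 1 means the exposure lowers the odds of assignment to cluster 2.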


For a better understanding of the model, we can visualize it with `plot()`.
```
plot(fit1)
```

@@ -112,53 +174,95 @@ knitr::include_graphics("fig3.png")
```

### Variable Selection Procedure
As we can see from the table and the Sankey diagram, many genetic effects and biomarker effects are almost 0. To follow the principle of parsimony, it's reasonable to conduct a variable selection procedure to achieve a simpler model. There are 3 tuning parameters in the variable selection: `Rho_G` is the penalty for genetic features, and `Rho_Z_InvCov` and `Rho_Z_CovMu` are the penalties for biomarkers. To choose the best combination of tuning parameters, we perform a grid search and use BIC to identify the optimal choice. Here is an example searching 18 combinations.
```
> tune.par <- tune.lucid(G = G2, Z = Z2, Y = Y2, family = "normal", K = 2,
+                        Rho_G = c(0.01, 0.02),
+                        Rho_Z_InvCov = c(0.1, 0.15, 0.2),
+                        Rho_Z_CovMu = seq(80, 100, by = 10))
> tune.par
$res.K
  K
1 2

$res.tune
   Rho_G Rho_Z_InvCov Rho_Z_CovMu K       BIC
1   0.01         0.10          80 2  98958.38
2   0.01         0.10          90 2  98657.88
3   0.01         0.10         100 2 108764.74
4   0.01         0.15          80 2  99103.11
5   0.01         0.15          90 2  98802.57
6   0.01         0.15         100 2  98678.07
7   0.01         0.20          80 2  99289.15
8   0.01         0.20          90 2  98988.58
9   0.01         0.20         100 2  98864.04
10  0.02         0.10          80 2  98983.32
11  0.02         0.10          90 2  98682.81
12  0.02         0.10         100 2  98558.35
13  0.02         0.15          80 2  99128.04
14  0.02         0.15          90 2  98827.50
15  0.02         0.15         100 2  98703.01
16  0.02         0.20          80 2  99314.08
17  0.02         0.20          90 2  99013.51
18  0.02         0.20         100 2  98888.97

$optimal
   Rho_G Rho_Z_InvCov Rho_Z_CovMu K      BIC
12  0.02          0.1         100 2 98558.35
```
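The grid search is an exhaustive sweep: fit the penalized model at every (Rho_G, Rho_Z_InvCov, Rho_Z_CovMu) combination and keep the one with the smallest BIC. A Python sketch of that selection step, using a few rows copied from the `$res.tune` table above (the variable names are ours):

```python
# (Rho_G, Rho_Z_InvCov, Rho_Z_CovMu, BIC) rows from the $res.tune table
grid = [
    (0.01, 0.10,  80,  98958.38),
    (0.01, 0.10,  90,  98657.88),
    (0.01, 0.10, 100, 108764.74),
    (0.02, 0.10,  90,  98682.81),
    (0.02, 0.10, 100,  98558.35),
]

# keep the combination with the smallest BIC (last tuple element)
best = min(grid, key=lambda row: row[-1])
print(best[:3])  # (0.02, 0.1, 100)
```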

The best combination is Rho_G = 0.02, Rho_Z_InvCov = 0.1 and Rho_Z_CovMu = 100. Next, refit the model with penalties.
```
> fit2 <- est.lucid(G = G2, Z = Z2, Y = Y2, K = 2, tune = def.tune(Rho_G = 0.02, Rho_Z_InvCov = 0.1, Rho_Z_CovMu = 1000, Select_G = TRUE, Select_Z = TRUE))
initialize the LUCID ...
iteration 1 : E-step finished.
iteration 1 : M-step finished, loglike = -59892.22
iteration 2 : E-step finished.
iteration 2 : M-step finished, loglike = -49756.34
iteration 3 : E-step finished.
iteration 3 : M-step finished, loglike = -49750.85
iteration 4 : E-step finished.
iteration 4 : M-step finished, loglike = -49750.85
Success: LUCID converges!
```

Let's check the variables selected by LUCID.
```
> fit2$select
$selectG
 CG1  CG2  CG3  CG4  CG5  NG1   NG2   NG3   NG4   NG5
TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE

$selectZ
 CZ1  CZ2  CZ3  CZ4  CZ5   NZ1   NZ2   NZ3   NZ4   NZ5
TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
```
Not all of the non-significant variables are dropped; in practice, we could slightly increase the penalties to achieve better variable selection results. We can then refit the model using the selected features.

### Inference on Parameters
It is hard to derive the asymptotic distribution of the LUCID estimates, so we use a bootstrap method to perform inference on the parameters, implemented in the function `boot.lucid()`. Let's try the bootstrap on our fitted model.
```
> boot <- boot.lucid(G = G2, Z = Z2, Y = Y2, model = fit2, R = 100)
initialize the LUCID ...
iteration 1 : E-step finished.
iteration 1 : M-step finished, loglike = -63133.32
iteration 2 : E-step finished.
iteration 2 : M-step finished, loglike = -48932.81
iteration 3 : E-step finished.
iteration 3 : M-step finished, loglike = -48932.31
iteration 4 : E-step finished.
iteration 4 : M-step finished, loglike = -48932.31
Success: LUCID converges!
```

We can add bootstrap SE and 95% CI to the summary table by specifying the option `boot.se`.
```
summary(fit1, boot.se = boot)
```
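The 95% CI returned by the bootstrap is, conceptually, a percentile interval over the R re-estimates of each parameter. A minimal Python sketch of the percentile method (an illustration of the idea, not the `boot.lucid` internals; names are ours):

```python
import math

def percentile_ci(estimates, alpha=0.05):
    """Percentile bootstrap CI: the alpha/2 and 1-alpha/2 order statistics."""
    s = sorted(estimates)
    n = len(s)
    lo = s[math.floor(alpha / 2 * (n - 1))]        # lower order statistic
    hi = s[math.ceil((1 - alpha / 2) * (n - 1))]   # upper order statistic
    return lo, hi

# toy stand-in for R = 100 bootstrap replicates of one parameter
boot_estimates = list(range(1, 101))
print(percentile_ci(boot_estimates))  # (3, 98)
```

With real bootstrap replicates, the interval brackets the middle 95% of the re-estimated parameter values.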




## Acknowledgments
- Cheng Peng, Ph.D.
- David V. Conti, Ph.D.
Binary file removed vignettes/fig1.png
Binary file not shown.
Binary file removed vignettes/fig2.png
Binary file not shown.
Binary file removed vignettes/fig4.png
Binary file not shown.
Binary file removed vignettes/fig5.png
Binary file not shown.
Binary file removed vignettes/fig6.png
Binary file not shown.
