Commit

add citation, update vignette, spell checking
Yinqi committed May 18, 2020
1 parent 0023fe1 commit 3a0e9e6
Showing 17 changed files with 172 additions and 50 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -6,7 +6,7 @@ Author: Yinqi Zhao, Cheng Peng, Zhao Yang, David V. Conti
Maintainer: Yinqi Zhao <[email protected]>
Description: An implementation for the 'LUCID' method to jointly estimate latent unknown clusters/subgroups with integrated data.
An EM algorithm is used to obtain the latent cluster assignment and model parameter estimates.
-Feature selection is achieved by applying the regularization method.
+Feature selection is achieved by applying the L1 regularization method.
Depends: R (>= 3.6.0)
License: GPL-3
Encoding: UTF-8
2 changes: 1 addition & 1 deletion R/boot_lucid.R
@@ -8,7 +8,7 @@
#' @param CoG Optional, matrix. Covariates to be adjusted for estimating the latent cluster.
#' @param CoY Optional, matrix. Covariates to be adjusted for estimating the outcome.
#' @param model A LUCID model fitted by \code{\link{est.lucid}}.
-#' @param R Number of boostrap iterations.
+#' @param R Number of bootstrap iterations.
#' @param n Number of CPU cores to be used in the bootstrap
#'
#' @return A list of estimates with their 95 percent CI.
2 changes: 1 addition & 1 deletion R/def_lucid.R
@@ -1,6 +1,6 @@
#' Control parameters for EM algorithm
#'
-#' @param tol Convergence critieria for the EM algorithm. Default is 0.001.
+#' @param tol Convergence criteria for the EM algorithm. Default is 0.001.
#' @param max_itr Maximum number of iterations in each try of fitting process, integer, default is 1000.
#' @param max_tot.itr Maximum number of total iterations, integer, default is 10000.
#'
2 changes: 1 addition & 1 deletion README.Rmd
@@ -49,7 +49,7 @@ and the expectation of the complete log likelihood can be written as
knitr::include_graphics("man/figures/equation4.png")
```

-At each iteration, in the E-step, compute the expectation of the complete data log likelihood by plugging in the posterior probability and then in the M-step, update the parameters by maximizing the expected complete likelihood function. Detailed derivations of the EM algorithm for LUDID can be found elsewhere.
+At each iteration, in the E-step, compute the expectation of the complete data log likelihood by plugging in the posterior probability and then in the M-step, update the parameters by maximizing the expected complete likelihood function. Detailed derivations of the EM algorithm for LUCID can be found elsewhere.


## Installation
2 changes: 1 addition & 1 deletion README.md
@@ -58,7 +58,7 @@ At each iteration, in the E-step, compute the expectation of the
complete data log likelihood by plugging in the posterior probability
and then in the M-step, update the parameters by maximizing the expected
complete likelihood function. Detailed derivations of the EM algorithm
-for LUDID can be found elsewhere.
+for LUCID can be found elsewhere.
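The E-step/M-step cycle this paragraph describes can be sketched on a toy example. Below is a minimal Python illustration of EM for a one-dimensional, two-component Gaussian mixture with a fixed shared standard deviation; it is a generic sketch of the algorithm, not the LUCID implementation (all names here are ours).

```python
import math

def em_gmm_1d(x, mu_init, sigma=1.0, pi=0.5, n_iter=50):
    """Toy EM for a 2-component 1-D Gaussian mixture (fixed shared sigma)."""
    mu1, mu2 = mu_init
    for _ in range(n_iter):
        # E-step: posterior probability (responsibility) that each point
        # belongs to component 1, given the current parameters
        r = []
        for xi in x:
            p1 = pi * math.exp(-0.5 * ((xi - mu1) / sigma) ** 2)
            p2 = (1 - pi) * math.exp(-0.5 * ((xi - mu2) / sigma) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: maximize the expected complete-data log likelihood,
        # which here has closed-form responsibility-weighted updates
        n1 = sum(r)
        pi = n1 / len(x)
        mu1 = sum(ri * xi for ri, xi in zip(r, x)) / n1
        mu2 = sum((1 - ri) * xi for ri, xi in zip(r, x)) / (len(x) - n1)
    return mu1, mu2, pi

x = [-2.2, -2.1, -2.0, -1.9, 1.8, 2.0, 2.1, 2.2]
mu1, mu2, pi = em_gmm_1d(x, mu_init=(-1.0, 1.0))
```

With well-separated data the responsibilities converge to nearly 0/1 and each component mean approaches the average of its cluster.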

## Installation

Binary file modified data/G2.rda
Binary file not shown.
Binary file modified data/Y2.rda
Binary file not shown.
Binary file modified data/Z2.rda
Binary file not shown.
18 changes: 18 additions & 0 deletions inst/CITATION
@@ -0,0 +1,18 @@
citHeader("To cite LUCIDus in publications use:")

citEntry(
entry = "Article",
title = "A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits",
author = "Peng, Cheng and Wang, Jun and Asante, Isaac and Louie, Stan and Jin, Ran and Chatzi, Lida and Casey, Graham and Thomas, Duncan C and Conti, David V",
journal = "Bioinformatics",
year = "2019",
issn = "1367-4803",
doi = "10.1093/bioinformatics/btz667",
url = "https://doi.org/10.1093/bioinformatics/btz667",

textVersion =
paste("Cheng Peng, Jun Wang, Isaac Asante, Stan Louie, Ran Jin, Lida Chatzi, Graham Casey, Duncan C Thomas, David V Conti (2019).",
"A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits.",
"Bioinformatics, btz667.",
"URL https://doi.org/10.1093/bioinformatics/btz667")
)
2 changes: 1 addition & 1 deletion man/boot.lucid.Rd


2 changes: 1 addition & 1 deletion man/def.control.Rd


190 changes: 147 additions & 43 deletions vignettes/LUCID_vignette.Rmd
@@ -24,7 +24,7 @@ The **LUCIDus** package is aiming to provide researchers in the genetic epidemio
This package is an implementation of the novel statistical method proposed in the research paper "A Latent Unknown Clustering Integrating Multi-Omics Data (LUCID) with Phenotypic Traits^[https://doi.org/10.1093/bioinformatics/btz667]", published in *Bioinformatics*. LUCID improves subtype classification, which leads to better diagnostics as well as prognostics, and could be a potential solution for efficient targeted treatments and successful personalized medicine.

## Introduction to the LUCID (Latent Unknown Clustering with Integrated Data)
-Multi-omics data combined with the phenotypic trait are integrated by jointly modeling their relationships through a latent cluster variable, which is illustrated by the directed acyclic graph (DAG) below. (A screenshot from [LUCID papaer](https://doi.org/10.1093/bioinformatics/btz667))
+Multi-omics data combined with the phenotypic trait are integrated by jointly modeling their relationships through a latent cluster variable, which is illustrated by the directed acyclic graph (DAG) below. (A screenshot from [LUCID paper](https://doi.org/10.1093/bioinformatics/btz667))

```{r out.width="50%", echo=FALSE}
knitr::include_graphics("DAG.png")
@@ -75,35 +75,97 @@ Here, we use a simulated data set in the LUCID package to illustrate the framewo

### Identify the number of latent clusters.
The first step in fitting a LUCID model is to identify the number of latent clusters. The Bayesian Information Criterion (BIC) is commonly used to compare and choose among a group of non-nested models, so we adapt the BIC to decide the number of latent clusters for LUCID. We can use `tune.lucid()` for this analysis.
```
> set.seed(1)
> tune.K <- tune.lucid(G = G2, Z = Z2, Y = Y2, K = 2:5)
> tune.K
$res.K
  K       BIC
1 2  99025.54
2 3  99607.07
3 4 100196.34
4 5 100778.79

$res.tune
NULL

$optimal
Rho_G Rho_Z_InvCov Rho_Z_CovMu    K      BIC
   NA           NA          NA 2.00 99025.54
```

The function returns a list. `list$res.K` lists a series of models together with their BICs, and `list$optimal` gives the best K. Here, the model with K = 2 has the lowest BIC (99025.54), so we decide the optimal K to be 2.
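Choosing K boils down to an argmin over the candidate BICs. A tiny Python sketch using the values printed by `tune.lucid()` above (the dictionary name is ours):

```python
# BIC per candidate number of clusters, copied from the tune.lucid() output
bic = {2: 99025.54, 3: 99607.07, 4: 100196.34, 5: 100778.79}

# choose the K with the smallest BIC
best_K = min(bic, key=bic.get)
print(best_K)  # 2
```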

### Fit the LUCID model
We use the `est.lucid()` function to fit the model. The `useY` option deserves special attention. By default, `useY = TRUE`, which means we are interested in estimating the latent structure of the data and use the information in the outcome for clustering. On the other hand, if the primary question is to test the association of the clusters with the outcome, we should set it to `FALSE`, since we do not want to bias the estimation of the clusters by including the outcome in the estimation. Here is an example which focuses on the estimation of the clustering.
```
> fit1 <- est.lucid(G = G2, Z = Z2, Y = Y2, K = 2)
initialize the LUCID ...
iteration 1 : E-step finished.
iteration 1 : M-step finished, loglike = -112022.6
iteration 2 : E-step finished.
iteration 2 : M-step finished, loglike = -65392.77
iteration 3 : E-step finished.
iteration 3 : M-step finished, loglike = -64538.53
iteration 4 : E-step finished.
iteration 4 : M-step finished, loglike = -60593.42
iteration 5 : E-step finished.
iteration 5 : M-step finished, loglike = -50891.01
iteration 6 : E-step finished.
iteration 6 : M-step finished, loglike = -48932.31
iteration 7 : E-step finished.
iteration 7 : M-step finished, loglike = -48932.31
Success: LUCID converges!
> fit1
An object estimated by LUCID model
Outcome type: normal
Number of clusters: K = 2
Variance-Covariance structure for biomarkers: EII model
```

Then use `summary()` to display the model parameters.
```
> summary(fit1)
----------Summary of the LUCID model----------
K = 2 , log likelihood = -48932.31 , BIC = 99025.54
(1) Y (normal outcome): the mean and the sd of the Gaussian mixture Y for each latent cluster
                mu        sd
cluster1 -1.989017 0.9803328
cluster2  2.030915 1.0101779
(2) Z: estimates of biomarker means for each latent cluster
        cluster1     cluster2
CZ1 -2.007621361  2.016969684
CZ2 -1.963055183  1.940025207
CZ3 -2.003261127  2.006120325
CZ4 -2.006636548  1.984085152
CZ5 -1.998671158  1.965863805
NZ1  0.003405598 -0.028354380
NZ2  0.010965841  0.031840666
NZ3 -0.029590811  0.009409087
NZ4  0.018534552  0.005757923
NZ5  0.028345495  0.062087843
(3) E: the odds ratio of being assigned to each latent cluster for each exposure
                original        OR
CG1.cluster2 -0.70100547 0.4960863
CG2.cluster2 -0.72988086 0.4819664
CG3.cluster2 -0.72459572 0.4845204
CG4.cluster2 -0.64141179 0.5265485
CG5.cluster2 -0.68018375 0.5065239
NG1.cluster2 -0.21877995 0.8034985
NG2.cluster2  0.08447478 1.0881454
NG3.cluster2 -0.04780871 0.9533161
NG4.cluster2  0.01291939 1.0130032
NG5.cluster2  0.01306431 1.0131500
```
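In panel (3) of the summary, the `OR` column is the exponential of the `original` column of log odds: for instance, CG1's coefficient of about -0.701 maps to an odds ratio of about 0.496. A quick Python check of that arithmetic:

```python
import math

log_odds = -0.70100547       # 'original' estimate for CG1.cluster2 above
odds_ratio = math.exp(log_odds)  # exponentiate log odds to get the OR
print(odds_ratio)
```

An OR below 1 means the exposure lowers the odds of assignment to cluster 2.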


For a better understanding of the model, we can visualize it with `plot()`.
```
plot(fit1)
```

@@ -112,53 +174,95 @@ knitr::include_graphics("fig3.png")
```

### Variable Selection Procedure
As we can see from the table and the Sankey diagram, many genetic effects and biomarker effects are almost 0. To follow the principle of parsimony, it's reasonable to conduct a variable selection procedure to achieve a simpler model. There are 3 tuning parameters in the variable selection: `Rho_G` is the penalty for genetic features, and `Rho_Z_InvCov` and `Rho_Z_CovMu` are the penalties for biomarkers. To choose the best combination of tuning parameters, we perform a grid search and use BIC to identify the optimal choice. Here is an example searching 18 combinations.
```
> tune.par <- tune.lucid(G = G2, Z = Z2, Y = Y2, family = "normal", K = 2,
+                        Rho_G = c(0.01, 0.02),
+                        Rho_Z_InvCov = c(0.1, 0.15, 0.2),
+                        Rho_Z_CovMu = seq(80, 100, by = 10))
> tune.par
$res.K
  K
1 2

$res.tune
   Rho_G Rho_Z_InvCov Rho_Z_CovMu K       BIC
1   0.01         0.10          80 2  98958.38
2   0.01         0.10          90 2  98657.88
3   0.01         0.10         100 2 108764.74
4   0.01         0.15          80 2  99103.11
5   0.01         0.15          90 2  98802.57
6   0.01         0.15         100 2  98678.07
7   0.01         0.20          80 2  99289.15
8   0.01         0.20          90 2  98988.58
9   0.01         0.20         100 2  98864.04
10  0.02         0.10          80 2  98983.32
11  0.02         0.10          90 2  98682.81
12  0.02         0.10         100 2  98558.35
13  0.02         0.15          80 2  99128.04
14  0.02         0.15          90 2  98827.50
15  0.02         0.15         100 2  98703.01
16  0.02         0.20          80 2  99314.08
17  0.02         0.20          90 2  99013.51
18  0.02         0.20         100 2  98888.97

$optimal
   Rho_G Rho_Z_InvCov Rho_Z_CovMu K      BIC
12  0.02          0.1         100 2 98558.35
```
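The grid search is an exhaustive sweep: fit the penalized model at every (Rho_G, Rho_Z_InvCov, Rho_Z_CovMu) combination and keep the one with the smallest BIC. A Python sketch of that selection step, using a few rows copied from the `$res.tune` table above (the variable names are ours):

```python
# (Rho_G, Rho_Z_InvCov, Rho_Z_CovMu, BIC) rows from the $res.tune table
grid = [
    (0.01, 0.10,  80,  98958.38),
    (0.01, 0.10,  90,  98657.88),
    (0.01, 0.10, 100, 108764.74),
    (0.02, 0.10,  90,  98682.81),
    (0.02, 0.10, 100,  98558.35),
]

# keep the combination with the smallest BIC (last tuple element)
best = min(grid, key=lambda row: row[-1])
print(best[:3])  # (0.02, 0.1, 100)
```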

The best combination is Rho_G = 0.02, Rho_Z_InvCov = 0.1 and Rho_Z_CovMu = 100. Next, refit the model with penalties.
```
> fit2 <- est.lucid(G = G2, Z = Z2, Y = Y2, K = 2, tune = def.tune(Rho_G = 0.02, Rho_Z_InvCov = 0.1, Rho_Z_CovMu = 1000, Select_G = TRUE, Select_Z = TRUE))
initialize the LUCID ...
iteration 1 : E-step finished.
iteration 1 : M-step finished, loglike = -59892.22
iteration 2 : E-step finished.
iteration 2 : M-step finished, loglike = -49756.34
iteration 3 : E-step finished.
iteration 3 : M-step finished, loglike = -49750.85
iteration 4 : E-step finished.
iteration 4 : M-step finished, loglike = -49750.85
Success: LUCID converges!
```

Let's check the variables selected by LUCID.
```
> fit2$select
$selectG
 CG1  CG2  CG3  CG4  CG5  NG1   NG2   NG3   NG4   NG5
TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE

$selectZ
 CZ1  CZ2  CZ3  CZ4  CZ5   NZ1   NZ2   NZ3   NZ4   NZ5
TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
```
Not all of the non-significant variables are dropped; in practice, we could slightly increase the penalties to achieve better variable selection results. We can then refit the model using the selected features.

### Inference on Parameters
It is hard to derive the asymptotic distribution of the LUCID estimates, so we use a bootstrap method to perform inference on the parameters, implemented in the function `boot.lucid()`. Let's try the bootstrap on our fitted model.
```
> boot <- boot.lucid(G = G2, Z = Z2, Y = Y2, model = fit2, R = 100)
initialize the LUCID ...
iteration 1 : E-step finished.
iteration 1 : M-step finished, loglike = -63133.32
iteration 2 : E-step finished.
iteration 2 : M-step finished, loglike = -48932.81
iteration 3 : E-step finished.
iteration 3 : M-step finished, loglike = -48932.31
iteration 4 : E-step finished.
iteration 4 : M-step finished, loglike = -48932.31
Success: LUCID converges!
```

We can add bootstrap SE and 95% CI to the summary table by specifying the option `boot.se`.
```
summary(fit1, boot.se = boot)
```
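The 95% CI returned by the bootstrap is, conceptually, a percentile interval over the R re-estimates of each parameter. A minimal Python sketch of the percentile method (an illustration of the idea, not the `boot.lucid` internals; names are ours):

```python
import math

def percentile_ci(estimates, alpha=0.05):
    """Percentile bootstrap CI: the alpha/2 and 1-alpha/2 order statistics."""
    s = sorted(estimates)
    n = len(s)
    lo = s[math.floor(alpha / 2 * (n - 1))]        # lower order statistic
    hi = s[math.ceil((1 - alpha / 2) * (n - 1))]   # upper order statistic
    return lo, hi

# toy stand-in for R = 100 bootstrap replicates of one parameter
boot_estimates = list(range(1, 101))
print(percentile_ci(boot_estimates))  # (3, 98)
```

With real bootstrap replicates, the interval brackets the middle 95% of the re-estimated parameter values.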




## Acknowledgments
- Cheng Peng, Ph.D.
- David V. Conti, Ph.D.
Binary file removed vignettes/fig1.png
Binary file not shown.
Binary file removed vignettes/fig2.png
Binary file not shown.
Binary file removed vignettes/fig4.png
Binary file not shown.
Binary file removed vignettes/fig5.png
Binary file not shown.
Binary file removed vignettes/fig6.png
Binary file not shown.
