Skip to content

Commit

Permalink
chore: render site
Browse files Browse the repository at this point in the history
  • Loading branch information
jhrcook committed Jan 31, 2022
1 parent 126c1c1 commit e799c4e
Show file tree
Hide file tree
Showing 13 changed files with 4,062 additions and 152 deletions.
1 change: 1 addition & 0 deletions docs/about.html
Original file line number Diff line number Diff line change
Expand Up @@ -2402,6 +2402,7 @@ <h3>${suggestion.title}</h3>
<a href="notes/18_basis-function-models_bda3-20.html">18. Notes on 'Ch 20. Basis function models'</a>
<a href="notes/19_gaussian-processes_bda3-21.html">19. Notes on 'Ch 21. Gaussian process models'</a>
<a href="notes/20_finite-mixture-models_bda3-22.html">20. Notes on 'Ch 22. Finite mixture models'</a>
<a href="notes/21_Dirichlet-process-models_bda3-23.html">21. Notes on 'Ch 23. Dirichlet process models'</a>
</div>
</div>
<div class="nav-dropdown">
Expand Down
207 changes: 180 additions & 27 deletions docs/book-exercises/bda-exercises-19.Rmd
Original file line number Diff line number Diff line change
@@ -1,11 +1,12 @@
---
title: "Chapter 17 Exercises - Reproducing the the 'serial dilution assay'"
title: "Chapter 19 Exercises - Reproducing the the 'serial dilution assay'"
date: "2021-12-10"
output: distill::distill_article
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 300)
# dpi = 400, fig.width = 7, fig.height = 4.5, fig.retina = TRUE
rfiles <- list.files(here::here("src"), full.names = TRUE, pattern = "R$")
for (rfile in rfiles) {
Expand All @@ -15,17 +16,24 @@ for (rfile in rfiles) {
set.seed(0)
```

My goal here is to fit the serial dilution model model and perform basic inferential analysis as demonstrated in Chapter 17.
The original publication by Gelman *et al.* is also available[^1].
Unfortunately, I could not find the original data analyzed and was only able to copy the standards data from the book.
> I turned this exercise into a blog post on my [webiste](https://joshuacook.netlify.app).
[^1]: Gelman A, Chew GL, Shnaidman M. Bayesian analysis of serial dilution assays. Biometrics. 2004 Jun;60(2):407-17. doi: 10.1111/j.0006-341X.2004.00185.x. PMID: 15180666. https://pubmed.ncbi.nlm.nih.gov/15180666/
In chapter 17 "Parametric nonlinear models" of *Bayesian Data Analysis*[^1] by Gelman *et al.*, the authors present an example of fitting a curve to a [serial dilution](https://en.wikipedia.org/wiki/Serial_dilution) standard curve and using it to estimate unknown concentrations.
Below, I build the model with Stan and fit it using MCMC.
Unfortunately, I was unable to find the original data in Gelman's original publication of the model[^2].
The best I could do was copy the data for the standard curve from a table in the book and build the model to fit that data.

> The source code for this post is in a [repository](https://github.com/jhrcook/bayesian-data-analysis-course) of my work for Aki Vehtari's Bayesian Data Analysis [course](https://avehtari.github.io/BDA_course_Aalto/index.html).
[^1]: Gelman, Andrew, et al. Bayesian Data Analysis. CRC Press Etc., 2015.
[^2]: Gelman A, Chew GL, Shnaidman M. Bayesian analysis of serial dilution assays. Biometrics. 2004 Jun;60(2):407-17. doi: 10.1111/j.0006-341X.2004.00185.x. PMID: 15180666. https://pubmed.ncbi.nlm.nih.gov/15180666/

## Setup

```{r setup-show, message=FALSE, warning=FALSE}
library(rstan)
library(tidybayes)
library(patchwork)
library(tidyverse)
options(mc.cores = parallel::detectCores())
Expand All @@ -35,14 +43,16 @@ theme_set(
theme_classic() +
theme(
panel.grid.major = element_line(),
strip.background = element_blank()
strip.background = element_blank(),
plot.title = element_text(hjust = 0.5)
)
)
SNS_BLUE <- "#1F77B4"
STAN_RED <- "#B2171D"
```

Data copied from the book Figure 19.3 on page 473.
As mentioned above, I couldn't find the original data, so I copied it from the book's figure 19.3 on page 473.

```{r}
dilution_standards_data <- tibble::tribble(
Expand All @@ -58,29 +68,87 @@ dilution_standards_data <- tibble::tribble(
) %>%
mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>%
unnest(c(y, rep))
head(dilution_standards_data)
knitr::kable(dilution_standards_data)
```

The following plot shows the two standard dilution curves.
They are quite similar.

```{r}
dilution_standards_data %>%
data_plot <- dilution_standards_data %>%
ggplot(aes(x = conc, y = y, color = rep)) +
geom_line(alpha = 0.5, linetype = 2) +
geom_point(alpha = 0.8) +
scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
scale_color_brewer(type = "qual", palette = "Set1")
scale_color_brewer(type = "qual", palette = "Set1") +
theme(
legend.position = c(0.8, 0.2),
legend.background = element_blank()
) +
labs(
x = "concentration",
y = "y",
title = "Serial dilution standard curve",
color = "replicate"
)
data_plot
```

## Modeling

### Model specification

The Stan model is quite simple, but I ran into some problems fitting it, discussed below.
The model uses a normal likelihood to describe the posterior distribution $p(y|x)$.
The mean of the likelihood is defined for a given concentration $x$ using the standard equation used in the field:

$$
\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 + (x/\beta_3)^{-\beta_4}} \\
$$


The model is a scaled and shifted logistic curve.
This structure results in the following interpretations for $\beta$, all of which are restricted to positive values:

1. $\beta_1$: color intensity when the concentration is 0
2. $\beta_2$: increase to saturation
3. $\beta_3$: the inflection point of the curve
4. $\beta_4$: rate of saturation

Below are the prior distributions for $\beta$. Note that they are are drastically different scales - this is critical to help the model fit the data.

$$
\beta_1 \sim N(10, 2.5) \\
\beta_2 \sim N(100, 5) \\
\beta_3 \sim N(0, 1) \\
\beta_4 \sim N(0, 2.5)
$$

The measurement error of the model, representing the variance in the model's likelihood is defined as follows:

$$
\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A} \rgroup^{2\alpha} \sigma^2_y
$$

Here, $\alpha$, restricted to lie between 0 and 1, allows the variance to be higher for larger measurement values.
$A$ is a constant (set to 30 by the authors) that allows $\sigma_y$ to be more easily interpreted as the variance from "typical" measurements.
Below are the priors for the new variables in the model.

$$
\alpha \sim \text{Beta}(1, 1) \qquad
\sigma \sim |N(0, 2.5)|
$$

### In Stan

Below is the Stan code for the model.
It looks very similar to the mathematical description of the model, a nice feature of the Stan probabilistic programming language.

The centrality and variance of the likelihood are calculated separately as `g` and `tau` so they can be used in the `model` and `generated quantities` block without duplicating the code.
The `log_lik` is calculated so that PSIS-LOO cross validation could be conducted if desired.
The `log_lik` is calculated so that PSIS-LOO cross validation can be estimated.
I also included the ability to provide new data to make predictions over as `xnew`.

```{r}
dilution_model_file <- here::here("models", "serial-dilution.stan")
Expand All @@ -89,43 +157,55 @@ writeLines(readLines(dilution_model_file))

### Sampling

As mentioned above, there were some problems with fitting the model.
The main problem was that the $\beta$ parameters are on *wildly* different scales, such as $\beta_2$ is around 100 while $\beta_3$ is less than 0.1.
Therefore, in order for the model to fit correctly, I needed to assign different prior distributions for each parameter.
As mentioned above, specifically defining the prior distributions for each $\beta$ is necessary for MCMC to accurately sample from the posterior.
With those helping restrict the range of their values, the model fit very well.

```{r}
xnew <- seq(0, max(dilution_standards_data$conc), 0.001)
model_data <- list(
N = nrow(dilution_standards_data),
A = 30,
x = dilution_standards_data$conc,
y = dilution_standards_data$y
y = dilution_standards_data$y,
M = length(xnew),
xnew = xnew
)
dilution_model <- stan(
dilution_model_file,
model_name = "serial-dilution",
data = model_data
data = model_data,
refresh = 1000
)
```

### Posterior distributions

The next step is to analyze the posterior draws of the model.
We can check the success of MCMC by visualizing the traces of the chains, looking for good mixing ("fuzzy caterpillars") and checking diagnostic values such as $\widehat{R}$ and $n_\text{eff}$.
The trace plots are shown below followed by a table of the posteriors with the diagnostic values.
Everything looks good suggesting MCMC was successful.

```{r}
rstan::stan_trace(dilution_model, pars = "beta", ncol=1) +
model_pars <- c("beta", "alpha", "sigma")
rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) +
scale_x_continuous(expand = expansion(c(0, 0))) +
scale_y_continuous(expand = expansion(c(0.02, 0.02)))
scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
theme(legend.position = "bottom")
```

```{r}
print(dilution_model, pars=c("beta", "sigma", "alpha"))
print(dilution_model, pars = model_pars)
```

The following density plots show the posterior distributions of the model parameters $\beta$, $\alpha$, and $\sigma$.

```{r}
rstan::stan_dens(
dilution_model, pars = c("beta", "sigma", "alpha"),
separate_chains=TRUE
dilution_model,
pars = model_pars,
separate_chains = FALSE,
alpha = 0.6
) +
scale_x_continuous(expand = expansion(c(0, 0))) +
scale_y_continuous(expand = expansion(c(0, 0.02)))
Expand All @@ -134,6 +214,7 @@ rstan::stan_dens(
### Posterior predictive check

Below is a plot of the posterior predictive distributions of the model on the original data.
1,000 individual simulations are plotted in blue and the mean in black.
The simulated curves visually appear to correspond well with the observed data indicating the model has good fit.

```{r}
Expand All @@ -150,21 +231,93 @@ dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>%
```

```{r}
plt_n_draws <- 100
plt_n_draws <- 1000
plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws)
ppc_mean <- dilution_post_pred %>%
group_by(conc) %>%
summarize(value = mean(value)) %>%
ungroup()
dilution_post_pred %>%
filter(draw %in% !!plt_draws) %>%
mutate(grp = glue::glue("{draw}-{rep}")) %>%
ggplot(aes(x = conc, y = value)) +
facet_wrap(vars(rep), nrow = 1) +
geom_line(aes(group = draw), alpha = 0.1) +
geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) +
geom_line(group = "a", data = ppc_mean, color = "black") +
geom_point(data = ppc_mean, color = "black") +
geom_line(
aes(y = y, group = rep),
data = dilution_standards_data,
color = STAN_RED
) +
geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) +
scale_x_continuous(expand = expansion(c(0, 0))) +
scale_y_continuous(expand = expansion(c(0.02, 0.02)))
scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
labs(
x = "concentration",
y = "y",
title = "Posterior predictive distribution"
)
```

I also had the model make posterior predictions on concentrations across the observed range at smaller step-sizes.
The mean and 89% HDI are shown in blue below along with the observed data in red.
The inset plot is a zoomed-in view of the posterior predictive distribution at the lower concentrations.

```{r}
ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean)
ynew_hdi <- apply(
rstan::extract(dilution_model, pars = "ynew")$ynew,
2,
bayestestR::hdi,
ci = 0.89
)
ynew_ppc <- tibble(
conc = xnew,
ynew_mean = ynew_mean,
ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]),
ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]])
)
```

```{r}
plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) {
ppc_df %>%
ggplot(aes(x = conc, y = ynew_mean)) +
geom_ribbon(
aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi),
fill = SNS_BLUE,
alpha = 0.5
) +
geom_line(group = "a") +
geom_line(
aes(y = y, group = rep),
data = obs_df,
color = STAN_RED
) +
geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) +
scale_x_continuous(expand = expansion(c(0, 0))) +
scale_y_continuous(expand = expansion(c(0.02, 0.02)))
}
ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) +
labs(
x = "concentration",
y = "y",
title = "Posterior predictive distribution"
)
sub_max <- 0.04
sub_ppc_plot <- plot_posterior_pred(
ynew_ppc %>% filter(conc <= sub_max),
dilution_standards_data %>% filter(conc <= sub_max),
pt_size = 0.6
) +
theme(axis.title = element_blank())
ppc_plot +
inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5)
```

---
Expand Down
Loading

0 comments on commit e799c4e

Please sign in to comment.