-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathexercises-19.Rmd
324 lines (264 loc) · 9.96 KB
/
exercises-19.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
# Chapter 19 Exercises - Reproducing the the 'serial dilution assay'
2021-12-10
```{r setup, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 300)
# dpi = 400, fig.width = 7, fig.height = 4.5, fig.retina = TRUE
rfiles <- list.files(here::here("src"), full.names = TRUE, pattern = "R$")
for (rfile in rfiles) {
source(rfile)
}
set.seed(0)
```
> I turned this exercise into a blog post on my [webiste](https://joshuacook.netlify.app).
In chapter 17 "Parametric nonlinear models" of *Bayesian Data Analysis* by Gelman *et al.* [@bda3], the authors present an example of fitting a curve to a [serial dilution](https://en.wikipedia.org/wiki/Serial_dilution) standard curve and using it to estimate unknown concentrations.
Below, I build the model with Stan and fit it using MCMC.
Unfortunately, I was unable to find the original data in Gelman's original publication of the model [@Gelman2004-kh].
The best I could do was copy the data for the standard curve from a table in the book and build the model to fit that data.
> The source code for this post is in a [repository](https://github.com/jhrcook/bayesian-data-analysis-course) of my work for Aki Vehtari's Bayesian Data Analysis [course](https://avehtari.github.io/BDA_course_Aalto/index.html).
## Setup
```{r setup-show, message=FALSE, warning=FALSE}
library(rstan)
library(tidybayes)
library(patchwork)
library(tidyverse)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
theme_set(
theme_classic() +
theme(
panel.grid.major = element_line(),
strip.background = element_blank(),
plot.title = element_text(hjust = 0.5)
)
)
SNS_BLUE <- "#1F77B4"
STAN_RED <- "#B2171D"
```
As mentioned above, I couldn't find the original data, so I copied it from the book's figure 19.3 on page 473.
```{r}
dilution_standards_data <- tibble::tribble(
~conc, ~dilution, ~y,
0.64, 1, c(101.8, 121.4),
0.32, 1 / 2, c(105.2, 114.1),
0.16, 1 / 4, c(92.7, 93.3),
0.08, 1 / 8, c(72.4, 61.1),
0.04, 1 / 16, c(57.6, 50.0),
0.02, 1 / 32, c(38.5, 35.1),
0.01, 1 / 64, c(26.6, 25.0),
0, 0, c(14.7, 14.2),
) %>%
mutate(rep = purrr::map(conc, ~ c("a", "b"))) %>%
unnest(c(y, rep))
knitr::kable(dilution_standards_data)
```
The following plot shows the two standard dilution curves.
They are quite similar.
```{r}
data_plot <- dilution_standards_data %>%
ggplot(aes(x = conc, y = y, color = rep)) +
geom_line(alpha = 0.5, linetype = 2) +
geom_point(alpha = 0.8) +
scale_x_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
scale_y_continuous(expand = expansion(c(0, 0.02)), limits = c(0, NA)) +
scale_color_brewer(type = "qual", palette = "Set1") +
theme(
legend.position = c(0.8, 0.2),
legend.background = element_blank()
) +
labs(
x = "concentration",
y = "y",
title = "Serial dilution standard curve",
color = "replicate"
)
data_plot
```
## Modeling
### Model specification
The model uses a normal likelihood to describe the posterior distribution $p(y|x)$.
The mean of the likelihood is defined for a given concentration $x$ using the standard equation used in the field:
$$
\text{E}[y | x, \beta] = g(x, \beta) = \beta_1 + \frac{\beta_2}{1 + (x/\beta_3)^{-\beta_4}} \\
$$
The model is a scaled and shifted logistic curve.
This structure results in the following interpretations for $\beta$, all of which are restricted to positive values:
1. $\beta_1$: color intensity when the concentration is 0
2. $\beta_2$: increase to saturation
3. $\beta_3$: the inflection point of the curve
4. $\beta_4$: rate of saturation
Below are the prior distributions for $\beta$. Note that they are are drastically different scales - this is critical to help the model fit the data.
$$
\beta_1 \sim N(10, 2.5) \\
\beta_2 \sim N(100, 5) \\
\beta_3 \sim N(0, 1) \\
\beta_4 \sim N(0, 2.5)
$$
The measurement error of the model, representing the variance in the model's likelihood is defined as follows:
$$
\tau(\alpha, \sigma_y, g(x, \beta), A) = \lgroup \frac{g(x,\beta)}{A} \rgroup^{2\alpha} \sigma^2_y
$$
Here, $\alpha$, restricted to lie between 0 and 1, allows the variance to be higher for larger measurement values.
$A$ is a constant (set to 30 by the authors) that allows $\sigma_y$ to be more easily interpreted as the variance from "typical" measurements.
Below are the priors for the new variables in the model.
$$
\alpha \sim \text{Beta}(1, 1) \qquad
\sigma \sim |N(0, 2.5)|
$$
### In Stan
Below is the Stan code for the model.
It looks very similar to the mathematical description of the model, a nice feature of the Stan probabilistic programming language.
The centrality and variance of the likelihood are calculated separately as `g` and `tau` so they can be used in the `model` and `generated quantities` block without duplicating the code.
The `log_lik` is calculated so that PSIS-LOO cross validation can be estimated.
I also included the ability to provide new data to make predictions over as `xnew`.
```{r}
dilution_model_file <- here::here("models", "serial-dilution.stan")
writeLines(readLines(dilution_model_file))
```
### Sampling
As mentioned above, specifically defining the prior distributions for each $\beta$ is necessary for MCMC to accurately sample from the posterior.
With those helping restrict the range of their values, the model fit very well.
```{r}
xnew <- seq(0, max(dilution_standards_data$conc), 0.001)
model_data <- list(
N = nrow(dilution_standards_data),
A = 30,
x = dilution_standards_data$conc,
y = dilution_standards_data$y,
M = length(xnew),
xnew = xnew
)
dilution_model <- stan(
dilution_model_file,
model_name = "serial-dilution",
data = model_data,
refresh = 1000
)
```
### Posterior distributions
The next step is to analyze the posterior draws of the model.
We can check the success of MCMC by visualizing the traces of the chains, looking for good mixing ("fuzzy caterpillars") and checking diagnostic values such as $\widehat{R}$ and $n_\text{eff}$.
The trace plots are shown below followed by a table of the posteriors with the diagnostic values.
Everything looks good suggesting MCMC was successful.
```{r}
model_pars <- c("beta", "alpha", "sigma")
rstan::stan_trace(dilution_model, pars = model_pars, ncol = 2, alpha = 0.7) +
scale_x_continuous(expand = expansion(c(0, 0))) +
scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
theme(legend.position = "bottom")
```
```{r}
print(dilution_model, pars = model_pars)
```
The following density plots show the posterior distributions of the model parameters $\beta$, $\alpha$, and $\sigma$.
```{r}
rstan::stan_dens(
dilution_model,
pars = model_pars,
separate_chains = FALSE,
alpha = 0.6
) +
scale_x_continuous(expand = expansion(c(0, 0))) +
scale_y_continuous(expand = expansion(c(0, 0.02)))
```
### Posterior predictive check
Below is a plot of the posterior predictive distributions of the model on the original data.
1,000 individual simulations are plotted in blue and the mean in black.
The simulated curves visually appear to correspond well with the observed data indicating the model has good fit.
```{r}
dilution_post_pred <- rstan::extract(dilution_model, "ypred")$ypred %>%
as.data.frame() %>%
as_tibble() %>%
set_names(seq(1, ncol(.))) %>%
mutate(draw = 1:n()) %>%
pivot_longer(-c(draw), names_to = "idx") %>%
left_join(
dilution_standards_data %>% mutate(idx = as.character(1:n())),
by = "idx"
)
```
```{r}
plt_n_draws <- 1000
plt_draws <- sample(1:max(dilution_post_pred$draw), plt_n_draws)
ppc_mean <- dilution_post_pred %>%
group_by(conc) %>%
summarize(value = mean(value)) %>%
ungroup()
dilution_post_pred %>%
filter(draw %in% !!plt_draws) %>%
mutate(grp = glue::glue("{draw}-{rep}")) %>%
ggplot(aes(x = conc, y = value)) +
geom_line(aes(group = grp), alpha = 0.05, color = SNS_BLUE) +
geom_line(group = "a", data = ppc_mean, color = "black") +
geom_point(data = ppc_mean, color = "black") +
geom_line(
aes(y = y, group = rep),
data = dilution_standards_data,
color = STAN_RED
) +
geom_point(aes(y = y), data = dilution_standards_data, color = STAN_RED) +
scale_x_continuous(expand = expansion(c(0, 0))) +
scale_y_continuous(expand = expansion(c(0.02, 0.02))) +
labs(
x = "concentration",
y = "y",
title = "Posterior predictive distribution"
)
```
I also had the model make posterior predictions on concentrations across the observed range at smaller step-sizes.
The mean and 89% HDI are shown in blue below along with the observed data in red.
The inset plot is a zoomed-in view of the posterior predictive distribution at the lower concentrations.
```{r}
ynew_mean <- apply(rstan::extract(dilution_model, pars = "ynew")$ynew, 2, mean)
ynew_hdi <- apply(
rstan::extract(dilution_model, pars = "ynew")$ynew,
2,
bayestestR::hdi,
ci = 0.89
)
ynew_ppc <- tibble(
conc = xnew,
ynew_mean = ynew_mean,
ynew_hdi_low = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[2]]),
ynew_hdi_hi = purrr::map_dbl(ynew_hdi, ~ unlist(.x)[[3]])
)
```
```{r}
plot_posterior_pred <- function(ppc_df, obs_df, pt_size = 1.5) {
ppc_df %>%
ggplot(aes(x = conc, y = ynew_mean)) +
geom_ribbon(
aes(ymin = ynew_hdi_low, ymax = ynew_hdi_hi),
fill = SNS_BLUE,
alpha = 0.5
) +
geom_line(group = "a") +
geom_line(
aes(y = y, group = rep),
data = obs_df,
color = STAN_RED
) +
geom_point(aes(y = y), data = obs_df, size = pt_size, color = STAN_RED) +
scale_x_continuous(expand = expansion(c(0, 0))) +
scale_y_continuous(expand = expansion(c(0.02, 0.02)))
}
ppc_plot <- plot_posterior_pred(ynew_ppc, dilution_standards_data) +
labs(
x = "concentration",
y = "y",
title = "Posterior predictive distribution"
)
sub_max <- 0.04
sub_ppc_plot <- plot_posterior_pred(
ynew_ppc %>% filter(conc <= sub_max),
dilution_standards_data %>% filter(conc <= sub_max),
pt_size = 0.6
) +
theme(axis.title = element_blank())
ppc_plot +
inset_element(sub_ppc_plot, left = 0.5, bottom = 0.05, right = 0.9, top = 0.5)
```
---
## Session info
```{r}
sessionInfo()
```