-
Notifications
You must be signed in to change notification settings - Fork 8
/
Copy pathassignment-02.Rmd
177 lines (140 loc) · 5.58 KB
/
assignment-02.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
# Assignment 2
2021-08-30
**[Assignment 2](`r paste0(CM_URL, "assignment-02.pdf")`)**
## Setup
```{r setup, warning=FALSE, message=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = "#>", dpi = 300)
for (f in list.files(here::here("src"), pattern = "R$", full.names = TRUE)) {
source(f)
}
# Libraries
library(aaltobda)
library(glue)
library(tidyverse)
# Data
algae <- readLines(here::here("data", "algae.txt"))
algae <- as.integer(algae)
```
## Exercise 1. Inference for binomial proportion
**Algae status is monitored in 274 sites at Finnish lakes and rivers.**
**The observations for the 2008 algae status at each site are presented in file "algae.txt" (’0’: no algae, ’1’: algae present).**
**Let $\pi$ be the probability of a monitoring site having detectable blue-green algae levels and $y$ the observations in algae.**
**Use a binomial model for the observations $y$ and a $\text{Beta}(2, 10)$ prior for binomial model parameter $\pi$ to formulate a Bayesian model.**
**a) Formulate (1) the likelihood $p(y|\pi)$ as a function of $\pi$, (2) the prior $p(\pi)$, and (3) the resulting posterior $p(\pi|y)$.**
```{r}
print(head(algae))
print(paste("Number of data points:", length(algae)))
print(paste("Number of 1's:", sum(algae)))
print(paste("Prop. of 1's:", round(mean(algae), 3)))
```
1. likelihood: $p(y|\pi) = \text{Beta}(44, 274-44)$
2. prior: $p(\pi) = \text{Beta}(2, 10)$
3. posterior: $p(\pi|y) = \text{Beta}(46, 240)$
**b) What can you say about the value of the unknown \pi according to the observations and your prior knowledge?**
**Summarize your results with a point estimate (i.e. $E(\pi|y)$) and a 90% posterior interval.**
Some test data provided by the question to check if I'm on the right track.
```{r}
algae_test <- c(0, 1, 1, 0, 0, 0)
```
```{r}
beta_point_est <- function(prior_alpha, prior_beta, data) {
y <- sum(data)
n <- length(data)
posterior <- (prior_alpha + y) / (prior_alpha + prior_beta + n)
return(posterior)
}
posterior_test <- beta_point_est(
prior_alpha = 2, prior_beta = 10, data = algae_test
)
stopifnot(round(posterior_test, 7) == 0.2222222)
beta_point_est(prior_alpha = 2, prior_beta = 10, data = algae)
```
```{r}
beta_interval <- function(prior_alpha, prior_beta, data, prob = 0.9) {
y <- sum(data)
n <- length(data)
p_low <- (1 - prob) / 2
q_low <- qbeta(p_low, prior_alpha + y, prior_beta + n - y)
q_high <- qbeta(1 - p_low, prior_alpha + y, prior_beta + n - y)
return(c(q_low, q_high))
}
posterior_interval_test <- beta_interval(
prior_alpha = 2, prior_beta = 10, data = algae_test, prob = 0.9
)
stopifnot(round(posterior_interval_test, 7) == c(0.0846451, 0.3956414))
beta_interval(prior_alpha = 2, prior_beta = 10, data = algae, prob = 0.9)
```
**c) What is the probability that the proportion of monitoring sites with detectable algae levels $\pi$ is smaller than $\pi_\theta = 0.2$ that is known from historical records?**
```{r}
beta_low <- function(prior_alpha, prior_beta, data, pi_0 = 0.2) {
y <- sum(data)
n <- length(data)
prob <- pbeta(pi_0, prior_alpha + y, prior_beta + n - y, lower.tail = TRUE)
return(prob)
}
test_prob <- beta_low(
prior_alpha = 2, prior_beta = 10, data = algae_test, pi_0 = 0.2
)
stopifnot(round(test_prob, 7) == 0.4511238)
beta_low(prior_alpha = 2, prior_beta = 10, data = algae, pi_0 = 0.2)
```
**d) What assumptions are required in order to use this kind of a model with this type of data?**
**(No need to discuss exchangeability yet, as it is discussed in more detail in BDA Chapter 5 and Lecture 7.)**
We need to assume that the data is independently and identically distributed, which included the assumption that the data is exchangable.
We are also assuming the data is accuractely collected in a consistent manner.
We are assuming there are no subgroups within the data that would necessitate a hierarchical model to account for the random effects variation.
**e) Make prior sensitivity analysis by testing a couple of different reasonable priors and plot the different posteriors.**
**Summarize the results by one or two sentences.**
```{r}
# Some interesting priors.
priors <- tibble::tribble(
~prior_alpha, ~prior_beta,
1, 1,
2, 2,
2, 10,
4, 20,
20, 100,
) %>%
mutate(
lbl = glue("Beta({prior_alpha},{prior_beta})"),
lbl = fct_inorder(lbl)
)
priors
```
```{r}
posterior_distribution <- function(prior_alpha,
prior_beta,
data,
step = 0.001,
...) {
y <- sum(data)
n <- length(data)
pi <- seq(0, 1, step)
posterior <- dbeta(pi, prior_alpha + y, prior_beta + n - y)
return(tibble(pi = pi, posterior = posterior))
}
# Get the posterior distribution for each prior.
posteriors <- priors %>%
mutate(posterior = purrr::map2(
prior_alpha, prior_beta, posterior_distribution,
data = algae
)) %>%
unnest(posterior)
# Plot the most interesting region of the posteriors.
posteriors %>%
filter(0.05 < pi & pi < 0.3) %>%
ggplot(aes(x = pi, y = posterior)) +
geom_line(aes(group = lbl, color = lbl), size = 0.9) +
scale_x_continuous(expand = expansion()) +
scale_y_continuous(expand = expansion(mult = c(0.01, 0.02))) +
scale_color_brewer(type = "qual", palette = "Set1") +
theme_bw() +
theme(legend.position = c(0.85, 0.7)) +
labs(x = "pi", y = "posterior probability", color = "prior")
```
The posterior is not very sensitive to the prior save for the exception of an overly-confident prior of $\text{Beta}(20, 100)$.
This is likely due to the large amount of data, meaning that the likelihood dominated the posterior.
---
```{r}
sessionInfo()
```