-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path19-2cats.Rmd
671 lines (436 loc) · 33.1 KB
/
19-2cats.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
# Shuffling two categorical variables {#contingency}
```{r, echo = FALSE, warning=FALSE, message=FALSE}
library(tidyverse)
library(DT)
library(knitr)
library(blogdown)
library(stringr)
library(tweetrmd)
library(emo)
library(tufte)
library(cowplot)
library(lubridate)
library(ggthemes)
library(kableExtra)
library(ggforce)
library(datasauRus)
library(ggridges)
library(randomNames)
library(infer)
library(tiktokrmd)
library(ggridges)
library(colorspace)
options(crayon.enabled = FALSE)
```
```{block2, type='rmdnote'}
This text follows Chapter 9 of our textbook. **The reading below is required,** @whitlock2020 is not.
```
<span style="color: Blue;font-size:22px;"> Motivating scenarios: </span> <span style="color: Black;font-size:18px;"> We are interested to estimate an association between two categorical variables, and test if there is a non-random association. </span>
**Learning goals: By the end of this chapter you should be able to**
- Summarize the association between to categorical variables as
- A relative risk OR
- A (log) odds ratio
- Test the null hypothesis of no association by
- Permutation / Fishers exact test OR
- $\chi^2$ test
```{block2, type='rmdwarning'}
No additional reading is assigned. But [this paper](https://peerj.com/articles/9089/) by my colleague here [@fieberg2020] helped lays out why this is a good way to do and teach statistics.
```
## Associations between categorical variables.
One of the most common statistical questions we come across is "are the two variables associated?"
For instance, we might want to know if people who have and have not gotten a vaccine for some virus have a different probabilities of contracting or spreading that virus, or expressing any negative side effects of the vaccine.
Such an association would be quite interesting, and if it arises in a randomized controlled experiment, it would suggest that vaccination caused this outcome.
```{r modernadat, fig.cap='Participants receiving a placebo had a much higher incidence of covid than those receiving the Moderna vaccine. Data from the Moderna [press release](https://investors.modernatx.com/news-releases/news-release-details/moderna-announces-primary-efficacy-analysis-phase-3-cove-study).', echo=FALSE, fig.height=2, fig.width=3}
moderna_data <- tibble(treatment = c("vaccinated", "unvaccinated"),
prop_infected = c(11/15000, 185 / 15000))
ggplot(moderna_data , aes(x = treatment, y = prop_infected)) +
geom_col()+
scale_y_continuous(limits = c(0,.1)) +
labs(title = "Moderna data") +
theme()
```
## Summarizing associations between two categorical variables
When we are looking into the association between two categorical variables, we usually display data in a contingency table. For example, the contingency table below shows the outcomes of the Moderna COVID vaccine trial. Note that a contingency table is not 'tidy' but is useful for communicating result with readers.
```{r}
covid_counts <- tibble( outcome = c("Covid" ,"No COVID"),
Moderna = c( 11, 14989),
No_vaccine = c( 185, 14815))
kable(covid_counts)
```
### Relative Risk
The best summary of an association between two categorical variables is the relative risk, $\widehat{RR}$. The relative risk is the probability of a bad outcome conditional on being in scenario one divided by the probability of that bad outcome conditional on being in the other scenario. For example if 10 of 100 smokers die of lung cancer and 1 of 100 non-smokers die of lung cancer, the relative risk of dying of lung cancer given that you're a smoker is $\frac{10/100}{1/100} = 10$ times great than that given that you're a non-smoker. Note that hats $\hat{ }$ over p and RR indicate that these are estimates, not parameters.
$$\widehat{RR} = \frac{\widehat{p_1}}{\widehat{p_2}} \text{ , where } \widehat{p_1} = \frac{n_\text{bad outcomes in scenario one}}{n_\text{individuals in scenario one}}$$
For example, the relative risk of getting sick if vaccinated is the probability of getting sick if vaccinated ($\widehat{p}(\text{sick| vaccine})$) divided by the probability of getting sick if unvaccinated ($\widehat{p}(\text{sick| no vaccine})$). In the Moderna COVID vaccine trial
- 11 of 15,000 individuals in the vaccine group caught coronavirus ($\hat{p} = \frac{11}{15000} = \frac{0.11}{150}= \frac{0.07\overline{33}}{100}$).
- 185 of 15,000 individuals in the placebo group caught coronavirus ($\hat{p} = \frac{185}{15000} = \frac{1.85}{150}= \frac{1.2\overline{333}}{100}$).
So the relative risk of covid infection in the vaccinated was $\frac{11 / 15000}{185 / 15000} = \frac{185}{11} = 0.0595$ -- in other words, receiving the Moderna vaccine was associated with about a roughly 16.8-fold (i.e. 1/0.0595) decrease in the risk of getting covid. The relative risk can take any value between zero and infinity, with
- A relative risk below one indicating that option one (the one in the numerator) is safer, and
- A relative risk above one indicating that option two (the one in the denominator) is safer.
The relative risk is a fantastic summary statistic because it is straightforward, matches how we think, and is easy to communicate. Therefore, the relative risk should be reported whenever probabilities can be estimated without bias,
### The (log) Odds-Ratio
While the relative risk is a simple summary to communicate it suffers from two challenges
1. We often cannot study rare outcomes without some selection bias.
2. It has some non-desirable properties for statistical modeling.
For these reasons, we often deal with the less intuitive (log) Odds Ratio, rather than the simpler relative risk😭. <span style="color: LightGrey;"> Jason Kerwin argues that odds ratios are so confusing that they are [basically useless]( https://jasonkerwin.com/nonparibus/2014/07/04/odds-ratios-are-a-catastrophe/), do you agree?</span>
**The odds of an outcome** is the estimated probability of that outcome, $\widehat{p}$, divided by the estimated probability of not that outcome, $1 - \widehat{p}$. That is $\text{Odds} = \frac{\widehat{p}}{1-\widehat{p}}$.
**The odds ratio** is the odds of an outcome in scenario one divided by the odds of an outcome in the other scenario.
For example if 10 of 100 smokers die of lung cancer and 1 of 100 non-smokers die of lung cancer, the odds ratio of dying of lung cancer for smokers relative to non-smokers is $\frac{(10/100)/(90/100)}{(1/100)/(99/100)} = \frac{1/9}{1/99} = 99/11 = 9$. This shows
1. The odds ratio does not equal the relative risk.
2. The denominator drops out of the top and bottom -- meaning this will work even if we don't have true absolute probabilities. This will be the case in studies of rare outcomes in which it makes sense to increase the number of these rare outcomes we see (as we do in case-control studies).
## Quantifying uncertainty in associations between two catergorical variables
There is an equation for the uncertainty in the log odds ratio, but it isn't illuminating, and is a pain. I include it at the end of this chapter but I can't come up with a good reason to teach it to you. Instead, let's bootstrap!!! In the real world, you will simply use the default standard error that R gives you, but that doesn't do much to help us understand anything.
So let's **approximate the sampling distribution by resampling outcomes with replacement** and summarizing variability in this resampling to describe uncertainty.
```{r, echo=FALSE, message=FALSE, warning=FALSE}
covid_long <- covid_counts %>%
pivot_longer(cols = -1,names_to = "treatment",values_to = "n") %>%
uncount(n)
```
```{r, warning = FALSE, message=FALSE, eval=FALSE}
covid_long <- covid_counts %>%
pivot_longer(cols = -1,names_to = "treatment",values_to = "n") %>%
uncount(n)
n_reps <- 10000
boot_risk <- replicate(n = n_reps,simplify = FALSE,
expr = covid_long %>%
group_by(treatment)%>%
slice_sample(prop = 1, replace = TRUE) %>%
group_by(treatment) %>%
summarise(risk = mean(outcome == "Covid"))) %>%
bind_rows() %>%
mutate(replicate_id = rep(1:n_reps, each =2))
```
```{r, echo=FALSE, message=FALSE, warning=FALSE}
boot_risk <- read_csv("data/boot_moderna.csv")
```
Let's visualize this uncertainty by plotting this approximation of the sampling distribution.
```{r modernasamplingsdist, fig.cap = 'Bootstrapping participants in Moderna trials by treatment to consider uncertainty in our estimate of risk.', warning = FALSE, message=FALSE, fig.height=1.8, fig.width=6}
ggplot(boot_risk, aes(x = risk, fill = treatment))+
geom_density(alpha = .8, color = NA)+
theme_tufte()+
scale_y_continuous(expand = c(0,0))+
scale_x_continuous(breaks = seq(0,0.02,.002))+
annotate(geom = "text", x = c(0.0015, .0115), y = 575, label = c("vax","no vax"), hjust = 0)+
theme(axis.line = element_line(color = "black"),
legend.position = "none")+
labs(title = "Bootstrapped distribution of covid risk",
x = "risk of covid (from Moderna Study)")
```
**We can summarize this uncertainty for each treatment with the bootstrap standard error and the bootstrap confidence intervals of the risk**
```{r, warning = FALSE, message=FALSE}
boot_risk %>%
group_by(treatment) %>%
summarise(se_risk = sd(risk),
lower_95_CI = quantile(risk, prob = 0.025),
upper_95_CI = quantile(risk, prob = 0.975))
```
So a reasonable bound is that we think the risk of getting COVID over the course of the Moderna Vaccine trial was between 0.0333% and 0.113% for people who got the vaccine and between 1.05% and 1.4% for people who did not.
**To quantify uncertainty in the relative risk and odds ratio we pair replicates together and use our summaries of risk among the vaccinated and unvaccinated in each replicate, and then calculate eg confidence intervals and standard errors.**
```{r}
boot_risk %>%
pivot_wider(id_cols = replicate_id, names_from = treatment, values_from = risk) %>%
mutate(relative_risk = Moderna / No_vaccine,
odds_ratio = ( Moderna)*(1- Moderna) / ((No_vaccine)*(1 - No_vaccine))) %>%
summarise_at(.vars = c("relative_risk", "odds_ratio"), quantile, probs = c(0.025,0.975)) %>%
mutate(CI = c("lower_95", "upper_95") )
```
```{r echo = FALSE}
rm(boot_risk)
```
## Testing the null hypothesis of no association
```{r, echo=FALSE, fig.height=3, fig.width=2.5, out.extra='style="float:right; padding:10px"'}
null_data <- tibble(treatment = c("vaccinated", "unvaccinated"),
prop_infected = c(.01,.01))
tmp <- ggplot(null_data, aes(x = treatment, y = prop_infected)) +
geom_col()+
scale_y_continuous(limits = c(0,.1)) +
labs(title = "Null Hypothesis") +
theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot_grid(tmp , ggdraw(tmp + labs(title = "Alternative Hypothesis")) + draw_label(" NO", color = "red", size = 60), ncol = 1 )
```
1. State the null hypothesis and its alternative.
- Null hypotheses: There **is NO association** between our two categories. e.g. *The proportion of people with covid* ***does not differ*** *between the vaccinated and unvaccinated people*.
- Alternate hypotheses: There **is an association** between our two categories. e.g. *The proportion of people with covid* ***differs*** *between vaccinated and unvaccinated people*.
2. Calculate a test statistic to summarize our data.
3. Compare the observed test statistic to the sampling distribution of this statistic from the null model.
4. Interpret these results. If the test statistic is in an extreme tail of this sampling distribution, we reject the null hypothesis, otherwise we do not.
### By permutation
First let's test this null by permutation. Remember in permutation we generate a sampling distribution under the null by shuffling explanatory and response variables. We need a test statistic -- so let's take the odds ratio. <span style="color: lightgrey;"> (p-values and conclusions will be the exact same if we chose the relative risk).</span>
#### ***Shuffle labels for one replicates***
- Use the `sample()` function with `replace = FALSE` to shuffle labels
- So we the shuffle treatments across outcomes for each replicate to generate data from the null.
```{r, message=FALSE, warning=FALSE}
# shuffle the relationship between explanatory and response variable
covid_one_perm <- covid_long %>%
mutate(treatment = sample(treatment, replace = FALSE))
```
Count outcomes of the permutation
```{r}
table(covid_one_perm )
```
And summarize the association
```{r, message=FALSE, warning=FALSE, eval=FALSE}
# Summarize sampling distribution of permutation
covid_one_perm_summary <- covid_one_perm %>%
group_by(treatment)%>%
summarise(risk = mean(outcome == "Covid"),
odds = risk / (1-risk))
covid_one_perm_summary %>%
select(treatment, odds) %>%
pivot_wider(names_from = treatment,values_from = odds,names_prefix = "odds.") %>%
mutate(odds_ratio = odds.Moderna / odds.No_vaccine)
```
#### ***Shuffle labels for many replicates***
So now, lets make a sampling distribution from the null model by shuffling and summarizing a bunch of times!
```{r, message=FALSE, warning=FALSE, eval=FALSE}
n_reps <- 10000
covid_many_perm <- replicate(n = n_reps, simplify = FALSE,
expr = covid_long %>%
mutate(treatment = sample(treatment, replace = FALSE)) %>%
group_by(treatment)%>%
summarise(risk = mean(outcome == "Covid"),
odds = risk / (1-risk)) ) %>%
bind_rows()%>%
mutate(replicate_id = rep(1:n_reps, each = 2))
```
```{r, echo = FALSE, message=FALSE, warning=FALSE}
covid_many_perm <- read_csv("data/covid_many_perm.csv")
```
```{r, echo=FALSE}
DT::datatable(head(covid_many_perm %>% mutate_at(c("risk","odds"), round, digits = 5), n = 1000),
options = list(autoWidth = TRUE,pageLength = 10, lengthMenu = c(10, 50, 100)
))
```
Let's summarize these results as an odds ratio
```{r}
moderna_perm_summaries <- covid_many_perm %>%
select(treatment, odds, replicate_id) %>%
pivot_wider(id_cols = replicate_id,
names_from = treatment,
values_from = odds,
names_prefix = "odds.") %>%
mutate(odds_ratio = odds.Moderna / odds.No_vaccine)
```
#### ***Plot the data***
```{r modernaperm, fig.cap ="Comparing the efficacy of the Moderna COVID vaccine (red dashed line) to its sampling distribution under the null (bars). Each group consisted of 15,000 participants. The odds ratio and log odds ratio are presented in **A** and **B**, respectively", fig.height=2.3, fig.width=8}
actual_odds_ratio <- ((11 /15000) / (1 - (11 /15000) )) / ((185 /15000) / (1- (185/15000)))
odds_ratio_plot <- ggplot(moderna_perm_summaries, aes(x = odds_ratio)) +
geom_histogram(color = "white", binwidth = .04, size = .005) +
geom_vline(xintercept = c( actual_odds_ratio, 1 /actual_odds_ratio), color = "firebrick", lty = 2) +
annotate(x = -2, y =1500, label = "Actual\nodds\nratio", geom = "text", hjust = 0, vjust = 1, color = "firebrick")+
labs(title = "Null Sampling distribution: Moderna trial", subtitle = "odds ratio")
log_odds_ratio_plot <- ggplot(moderna_perm_summaries, aes(x = log(odds_ratio))) +
geom_histogram(color = "white", bins = 270, size = .03) +
geom_vline(xintercept = log(c( actual_odds_ratio, 1 / actual_odds_ratio)), color = "firebrick", lty = 2) +
annotate(x = log(actual_odds_ratio), y =500, label = " Actual\n log odds\n ratio", geom = "text", hjust = 0, vjust = 1, color = "firebrick")+
labs(title = "Null Sampling distribution: Moderna trial", subtitle = "log odds ratio")
plot_grid(odds_ratio_plot, log_odds_ratio_plot, labels = "AUTO")
```
From Figure \@ref(fig:modernaperm) it is pretty clear that it would be quite strange to see a result as extreme as we did if the null were true.
Figure \@ref(fig:modernaperm)B highlights the practical utility of using log odds ratios as opposed to odds ratios -- as it shows that both tails of the distribution are equidistant from the null, in contrast to the misleading impression from \@ref(fig:modernaperm)A.
#### ***Find a p-value***
Remember the p-value is the probability that we would observe our test statistic, or something more extreme if the null were true. While it is clear from Figure \@ref(fig:modernaperm) that it would be quite strange to see a result as extreme as we did if the null were true, we want to quantify this weirdness with a p-value.
We can approximate this by looking at the proportion of values from our permutation that are as or more extreme than our actual observation.
Because the odds ratio is a ratio, we take values that are as small as ours (or smaller) or as large as the reciprocal of ours (or larger) as being as or more extreme. The code below is written so it works whether our observed odds ratio was very small or very big.
```{r}
observed_both_tails <- c(actual_odds_ratio, 1 / actual_odds_ratio)
moderna_perm_summaries %>%
mutate(as_or_more_extreme = odds_ratio <= min(observed_both_tails) |
odds_ratio >= max(observed_both_tails) ) %>%
summarise(p_val = mean(as_or_more_extreme )) %>%
pull()
```
We find that none of ten-thousand permutations were as or more extreme than our data, so we conclude that our p-value is less than 0.0001, and reject the null hypothesis that (Moderna) vaccination status is not associated with COVID risk.
### With math
The permutation above did the job. We came to a solid conclusion. But there are a few downsides
- It can be a bit slow -- it takes my computer like five minutes to permute a data set this large 10,000 times. While five minutes isn't a lot of time it would add up if for example we wanted to do this for every mutation in a genome.
- It is influenced by chance. Every permutation will have a slightly different p-value due to the chance outcomes of random shuffling. While this shouldn't impact our conclusions it's a bit unsatisfying.
- We couldn't calculate an exact answer when p is really low. Again if p is really low, we don't really need to know it precisely, because we're going to reject the null, but again, we'd like a more satisfactory answer.
For these reasons, we usually only apply a permutation to cases without mathematically solved null sampling distributions. For the case of associations between categorial variables there is a relevant null sampling distribution -- the $\chi^2$ distribution!
#### $\chi^2$ (aka chi squared)
$\chi^2$ is a test statistic that quantifies the difference between observed and expected counts. That is, $\chi^2$ summarizes the fit of categorical data to expectations.We calculate $\chi^2$ as
\begin{equation}
\chi^2 = \sum \frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i}
(\#eq:chi2)
\end{equation}
Where each $i$ is a different potential outcome (in our example: COVID without vaccine, COVID with vaccine, NO COVID without vaccine, NO COVID with vaccine), and observations and **expectations are counts** not proportions.
We use the multiplication rule $P_{A \text{ AND } B} = P_{A} \times P_B$ to find null expectations (i.e. assuming independence) for proportions. We then convert these expected proportions to counts by multiplying by sample size. In our case
```{r, message=FALSE, warning=FALSE}
moderna_chi2_calcs <- covid_long %>%
group_by(treatment, outcome) %>%
summarise(count = n()) %>%
ungroup() %>%
mutate(total_n = sum(count)) %>%
group_by(outcome) %>%
mutate(p_covid = sum(count) / total_n) %>%
ungroup() %>%
group_by(treatment) %>%
mutate(p_treat = sum(count) / total_n) %>%
ungroup() %>%
mutate(expected_count = p_covid * p_treat * total_n,
chi2 = (expected_count - count)^2/ expected_count)
moderna_chi2_calcs
```
The code above finds the contribution of each category to $\chi^2$ in column `chi2`, as $\frac{(\text{Observed}_i - \text{Expected}_i)^2}{\text{Expected}_i}$ (Eq. \@ref(eq:chi2)). Note that because treatments where equally spread out, we just get the same values twice, this won't happen with unequal treatment proportions.
Adding these all up, we have a $\chi^2$ value of $`r moderna_chi2_calcs %>% mutate( x = paste("\\frac{(", count," - ",expected_count, ")^2}{",expected_count,"}",sep = "")) %>% pull(x) %>% paste0(collapse = " + ")`$ = `r paste( moderna_chi2_calcs %>% pull(chi2) %>% round(digits = 2), collapse = " + ")` = `r moderna_chi2_calcs %>% pull(chi2) %>% sum() %>% round(digits = 2)`.
#### The $\chi^2$ distribution {-}
So, is our $\chi^2$ value of `r moderna_chi2_calcs %>% pull(chi2) %>% sum() %>% round(digits = 2)` low or high? We can compare this to the null sampling distribution of $\chi^2$ to find out. To do so, let us first look at the distribution of $\chi^2$ values in our permutations under the null model. To do so we first permute as above, but now we're going through more tedious calculations to find the $\chi^2$ value.
```{r, eval=FALSE}
n_reps <- 10000
# permute
moderna_perm_for_chi2 <- replicate(n = n_reps, simplify = FALSE,
expr = covid_long %>%
mutate(treatment = sample(treatment, replace = FALSE)) %>%
# Boring formating stuff...
group_by(treatment, outcome) %>%
summarise(count = n(),.groups = "drop") %>%
mutate(total_n = sum(count)) %>%
group_by(outcome) %>%
mutate(p_covid = sum(count) / total_n) %>%
ungroup() %>%
group_by(treatment) %>%
mutate(p_treat = sum(count) / total_n) %>%
ungroup() %>%
mutate(expected_count = p_covid * p_treat * total_n) %>%
summarise(chi2 = sum((expected_count - count)^2/ expected_count)))%>%
bind_rows()
```
```{r, echo = FALSE, message=FALSE, warning=FALSE}
covid_chi2_perm <- read_csv("data/moderna_perm_for_chi2.csv")
```
```{r modernachi2perm, fig.cap= 'Distribution of chi2 values from our null permutation.', warning=FALSE, message=FALSE, fig.height= 2.4, fig.width =5}
moderna_chi2_plot <- ggplot(covid_chi2_perm , aes(x = chi2))+
geom_histogram(aes(y = ..density.. ),color = "white", binwidth = 1, size = .1)+
scale_y_continuous(limits = c(0,1)) +
labs(x = expression(chi^2,
title = "Null Sampling distribution: Moderna trial",
subtitle = expression(paste("Permutation-based ", chi^2))))
moderna_chi2_plot
```
As above (e.g. Figure \@ref(fig:modernaperm)), we see from Figure \@ref(fig:modernachi2perm) that the observed $\chi^2$ value of `r moderna_chi2_calcs %>% pull(chi2) %>% sum() %>% round(digits = 2)` is way outside of what we would expect from the null.
But remember, the whole reason we calculate $\chi^2$ is because there is a nice mathematical function that captures this null sampling distribution without permutation!
##### **Degrees of freedom**
It turns out that the distribution of $\chi^2$ values under the null depends on something called the **degrees of freedom**. The degrees of freedom specify the additional number of values we need to know all the outcomes in our study.
In the case of the Moderna trial, we know the number of participants, and need to calculate $p_\text{vaccinated}$ and $p_\text{covid +}$ from our data to calculate $\chi^2$. This means that if you gave me this information, plus the counts for one box in our table (e.g. the number of vaccinated people who got COVID) I could use basic algebra to learn all the other boxes. So we say there is one degree of freedom. In general, the degrees of freedom for associations between count data equals:
$$df = (\text{# potential values for explanatory var} -1) \times (\text{# potential values for response var} -1)$$.
In our case, this is $(2-1)\times(2-1)=1 \text{ degree of freedom}$. We see that the actual $\chi^2$ distribution with one degree of freedom
```{r modernachi2, fig.cap= 'Comparing the distribution of chi2 values from our null permutation (histogram in bars, density curve as black line) to the chi2 distribution with one degree of freedom (red)', warning=FALSE, message=FALSE, fig.height= 2.4, fig.width =5}
moderna_chi2_plot +
geom_density(adjust = 2, size = 2)+
stat_function(fun = dchisq, args = list(df = 1), color = "red", lwd = 2, alpha = .4)+
annotate(x = 1, y =.9, color = "red", geom = "label", label= expression(paste(chi^2,"distribution. df =1")) , hjust = 0)+
annotate(x = 1, y =.7, geom = "label", label= "Fitting a density curve to permutations" , hjust = 0)
```
***Effect size vs surprise*** It's worth noting a major difference between the relative risk and log odds ratio as compared to the $\chi^2$ value:
- The relative risk and log odds ratio quantify the association between categorical variables.
- $\chi^2$ quantifies the evidence for a deviation from expectations.
An identical estimated relative risk of 2:1 would have a way bigger $chi^2$ value if it came from two-thousand observations than if it came from twenty observations. As such $\chi^2$ does not capture the effect size in the same way as the relative risk or log odds ratio.
#### **A p-value**
We can use the $\chi^2$ distribution to find a p-value! Specificaly we can ask `R` to find the area under the $\chi^2$ distribution that is as or more extreme than our $\chi^2$ value with the `pchisq` function.
```{r}
observed_chi2 <-moderna_chi2_calcs %>% pull(chi2) %>% sum()
pchisq(observed_chi2, df = 1, lower.tail = FALSE)
```
This is an exceptionally small p-value. If we did 10,000,000,000,000,000,000,000,000,000,000,000 such experiments for which th null were true, only one would be as extreme as what we saw. So we (still) reject the null hypothesis.
Note that we told `R` that we
1. Had one degree of freedom (argument `df = 1`).
2. We were only interested in the upper tail (argument ` lower.tail = FALSE`).
So, why do we only look at one tail here? It's because both extreme associations (either an association between vaccine and COVID, or association with no vaccine and COVID (both tails in Figure \@ref(fig:modernaperm), result in large $\chi^2$ values) -- an exceptionally small $\chi^2$ value does not correspond to either tail we care about.
#### $\chi^2$ contingency test with the `chisq.test()` function
We can avoid all of this tedious calculation by using the`chisq.test()` function in R. Recall our initial data structure `covid_counts`
```{r, echo=FALSE}
covid_counts
```
We can type
```{r}
covid_chisq_test <- covid_counts %>%
dplyr::select(- outcome) %>% # to ignore labels
chisq.test(x = ., correct = FALSE)
covid_chisq_test
```
Note three things:
1. We typed `correct = FALSE`, this tells `R` we want the basic $\chi^2$ test and not to use a fancy correction. See [`help(chisq.test)`](https://stat.ethz.ch/R-manual/R-patched/library/stats/html/chisq.test.html) for more info.
2. The p-value was reported as `< 2.2e-16`. By default, this is what R tells us when the p-value is very small.
3. This is easy to look at and read, but not easy to work with because it isn't tidy.
We can solve issues 2 and 3, above, with the [`tidy`](https://generics.r-lib.org/reference/tidy.html) function in the [`broom`](https://broom.tidymodels.org/) package, which tidies output from `R`'s hypothesis testing output.
```{r}
library(broom)
tidy(covid_chisq_test)
```
Reassuringly, `R`'s answers match ours!!!
## More on the $\chi^2$ test.
### Uses of the $\chi^2$ test
The $\chi^2$ test tests the null hypothesis that counts follow expectations of some null model.
- We introduced a specific and common use of the $\chi^2$ test --- the $\chi^2$ contingency test, which is used to test for associations between two categorical outcomes.
- However, we can use the same framework to test the null hypothesis that babies have an equal probability of being born on any da of the week (e.g. a probability of 1/7 for each day), or if data fit a common distribution for counts such as the Poisson (if events are independent over time or space) or binomial (if counts are independent within groups). Much of this is covered in Chapter 8 of @whitlock2020 , which we are skipping here (I have never used any of that chapter outside of teaching). Feel free to look there for more information if you care.
### Assumptions of the $\chi^2$ test.
The $\chi^2$ test assumes
- Random sampling (without bias) in each group.
- Independent observations in each group.
- All categories have an expected count greater than one.
- No more than 20% of categories have an expected count of less than five.
When these assumptions are not met, we can be sure that the null $\chi^2$ distribution will match the mathematical approximation represented by the $\chi^2$ distribution.
The Moderna trial met all of these assumptions, but sometimes studies don't. When data do not meet test assumptions, we have a few options.
1. Ignore violations of assumptions if they are super minor.
2. Permute or simulate to generate a null that does not depend on mathematical assumptions (I did this a bunch in [this paper](https://nph.onlinelibrary.wiley.com/doi/10.1111/nph.16180) [@pickup2019]).
3. Collapse observations into smaller sensible categories e.g.
| Initial Groups | Collapsed Groups 1 | Collapsed Groups 2 |
|-|-|-|
| no COVID | no COVID | no or asymptomatic COVID |
| asymptomatic COVID | COVID | no orasymptomatic COVID |
| mild COVID | COVID | mild COVID |
| severe COVID | COVID | severe COVID or DEATH |
| COVID death | COVID | severe COVID or DEATH |
**Noote:** it would not make sense to have categories like No COVID and COVID death vs. asymptomatic COVID, mild COVID and severe COVID.
4. Use a more appropriate test.
#### Fisher's exact test as an alternative to the $\chi^2$ test.
Fisher's exact test is a common solution to the issue of having expected counts too small for a $\chi^2$ contingency test. Fisher's exact test is basically a permutation test, but instead of randomly shuffling, we use fancy math to lay out all possible permutations in their expected counts. This works well for cases with few observations because the math is doable, but gets to be impractical when there are many observations.
##### ***EXAMPLE: Fisher's exact test*** {-}
Could vampire bats preferentially prey in cows in estrus? For some reason ¯\\__(ツ)_/¯, @turner1975 looked into this and found:
```{r}
cows <- tibble(bitten = c("bit","not bit"),
estrus = c(15, 7),
not_estrus = c( 6, 322))
kable(cows)
```
A quick calculation (Relative Risk = $\frac{15 / (15+7)}{6/(6+322)}= \frac{15 \times 328}{22 \times 6} = \frac{5\times164}{22} = \frac{5\times82}{11} = \frac{410}{11} = 37.3$) shows that the risk of being bitten is about 37 times greater for cows in estrus, than those not in estrus. (Odds Ratio = $\frac{ (15/22)/(7/22)}{(6/328)/(322/328)} = \frac{ 15/7}{6/322} = \frac{15 \times 322}{6 \times 7} = \frac{15 \times 46}{6} = 5 \times 23 = 115$).
The visualization below (Fig. \@ref(fig:mosaicbite)) makes this point clearly.
```{r, mosaicbite, fig.cap="Visualizng cows being bitten by vampire bats.", fig.height=2.5, fig.width=2.5}
par(mar =c(1,1,1,1))
mosaicplot(column_to_rownames(cows, var = "bitten"), color = c("firebrick","cornsilk"),main="")
```
###### ***Calculating expectations*** {-}
As above, we calculate expected counts as $P_A \times P_B \times n$
```{r}
pivot_longer(cows, cols = contains("estrus"),
names_to = c("condition"),
values_to = c("count")) %>%
mutate(tot = sum(count)) %>%
group_by(condition) %>% mutate(prop_condition = sum(count) / tot) %>% ungroup() %>%
group_by(bitten) %>% mutate(prop_bite = sum(count) / tot) %>% ungroup() %>%
mutate(expected_count = prop_condition * prop_bite * tot) %>% kable(digits = 3)
```
Because we expect fewer than a count of five in more than 20% of our cells, we violate the assumptions of a $\chi^2$ test. So, we will use the Fisher's exact test.
```{r}
cows_exact_test <- cows %>%
dplyr::select(- bitten) %>% # to ignore labels
fisher.test()
cows_exact_test
tidy(cows_exact_test)
```
Again the extremely small p-value means that such an extreme association would be produced in about one in every 1,000,000,000,000,000 universes in which the null model is true.
We therefore reject the null hypothesis and conclude that the probability of being bitten by a vampire bat is larger for cows in estrus than cows not in estrus.
## Silly equations
Take the contingency table like the one below. The odds ratio is $\widehat{OR}=\frac{a\times d}{b \times c}$, and we estimate the standard error of the log odds ratio as $\sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}$. Th 95% CI for the log odds ratio is
$$log(\widehat{OR}) \pm 1.96 \times \sqrt{\frac{1}{a}+\frac{1}{b}+\frac{1}{c}+\frac{1}{d}}$$
And we can exponentiate this to estimate the 95% CI for the odds ratio.
| | Treatment | Control |
|-|-|-|
| Focal outcome | a | b |
| Alternate outcome | c | d |
## Quiz
```{r, echo=FALSE}
include_app("https://brandvain.shinyapps.io/2cats/",height = '800')
```
```{r, echo=FALSE}
rm(list = ls())
```