---
title: "Introduction to randomisation, bootstrap, and Monte Carlo methods in R"
author: "Brad Duthie"
date: "8 March 2023"
output:
  html_document:
    code_folding: show
bibliography: randomisation_notes.bib
---
Contents
================================================================================
********************************************************************************
**These notes focus on randomisation approaches to statistical hypothesis testing. After reading through this, you will have been introduced to some key statistical techniques in randomisation and Monte Carlo methods that can be coded in R.**
********************************************************************************
- [Introduction: Randomisation and Monte Carlo](#intro)
- [Random number generation in R: key functions](#rand_r)
- [Random normal: `rnorm`](#rnorm)
- [Random uniform: `runif`](#runif)
- [Random binomial: `rbinom`](#rbinom)
- [Random poisson: `rpois`](#rpois)
- [Random sampled values: `sample`](#sample)
- [Randomisation for hypothesis testing](#rand_h)
- [Bootstrapping confidence intervals](#rand_b)
- [Monte Carlo method for spatial data](#rand_m)
- [Conclusions](#conclusions)
- [Practice problems](#practice)
- [Appendix](#appendix)
********************************************************************************
<a name="intro">Introduction: Randomisation and Monte Carlo</a>
================================================================================
The R programming language has many built-in base functions for inferential statistics, and a growing library of [R packages](https://cran.r-project.org/web/packages/) continues to add to the list of functions available in R for performing specific statistical analyses. Users of R, however, need not be restricted to these functions; the flexibility of programming allows us to write our own code to address specific statistical questions and tailor our analyses to our own data. The freedom to code our own analyses makes it possible to apply statistical methods that would otherwise be very difficult to use.
In the preface to his book, "Randomization, Bootstrap, and Monte Carlo Methods in Biology", @Manly2007 emphasises how twentieth century advances in computation power have made the titular methods of his book possible for researchers. In these notes, I will introduce some basic examples of how to use randomisation, bootstrap, and Monte Carlo methods in the R programming language. Throughout these examples, I hope to also show how the process of randomisation and resampling applied in these methods can make the logic of null hypothesis testing in frequentist statistics more intuitive. I have especially tried to do this with respect to p-values and confidence intervals, which are often misunderstood.
I will begin by explaining key R functions for generating random values. Next, I will show how a t-test can be performed using randomisation, followed by the generation of confidence intervals by sampling with replacement. Finally, I will show how a Monte Carlo test can be written in R to test whether or not a population of plants is aggregated within a fixed area.
<a name="rand_r">Random number generation in R: key functions</a>
================================================================================
There are several functions in R that generate random numbers from particular distributions, or sample elements from vectors or lists. Below, I introduce some of these functions and show how to use them.
<a name="rnorm">**Sampling from a normal distribution**</a>
The `rnorm` function returns some number (`n`) of (pseudo)randomly generated numbers given a set mean ($\mu$; `mean`) and standard deviation ($\sigma$; `sd`), such that $X\sim\mathcal N(\mu,\sigma^2)$. The default is to draw from a standard [normal (a.k.a., "Gaussian") distribution](https://en.wikipedia.org/wiki/Normal_distribution) (i.e., $\mu = 0$ and $\sigma = 1$).
```{r}
rand_norms_10 <- rnorm(n = 10, mean = 0, sd = 1);
```
The above code stores a vector of ten numbers, shown below.
```{r, echo = FALSE}
print(rand_norms_10);
```
We can verify that a standard normal distribution is generated by plotting a histogram of a very large number of values created using `rnorm`.
```{r}
rand_norms_10000 <- rnorm(n = 10000, mean = 0, sd = 1);
hist(rand_norms_10000, xlab = "Random value (X)", col = "grey",
     main = "", cex.lab = 1.5, cex.axis = 1.5);
```
<a name="runif">**Sampling from a uniform distribution**</a>
Like the `rnorm` function, the `runif` function returns some number (`n`) of random numbers, but from a [uniform distribution](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)) with a range from $a$ (`min`) to $b$ (`max`) such that $X\sim\mathcal U(a,b)$, where $-\infty < a < b < \infty$. The default is to draw from a standard uniform distribution (i.e., $a = 0$ and $b = 1$).
```{r}
rand_unifs_10 <- runif(n = 10, min = 0, max = 1);
```
The above code stores a vector of ten numbers, shown below.
```{r, echo = FALSE}
print(rand_unifs_10);
```
As with the randomly generated normally distributed numbers, we can verify that a standard uniform distribution is generated by plotting a histogram of a very large number of values created using `runif`.
```{r}
rand_unifs_10000 <- runif(n = 10000, min = 0, max = 1);
hist(rand_unifs_10000, xlab = "Random value (X)", col = "grey",
     main = "", cex.lab = 1.5, cex.axis = 1.5);
```
<a name="rbinom">**Sampling from a binomial distribution**</a>
Like previous functions, the `rbinom` function returns some number (`n`) of random numbers, but the arguments and output can be slightly confusing at first. Recall that a [binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution) describes the number of 'successes' for some number of independent trials ($\Pr(success) = p$). The `rbinom` function returns the number of successes after `size` trials, in which the probability of success in each trial is `prob`. For a concrete example, suppose we want to simulate the flipping of a fair coin 1000 times, and we want to know how many times that coin comes up heads ('success'). We can do this with the following code.
```{r}
coin_flips <- rbinom(n = 1, size = 1000, prob = 0.5);
print(coin_flips);
```
The above result shows that the coin came up heads `r coin_flips` times. Note, however, the (required) argument `n` above. This allows the user to set the number of sequences to run. In other words, if we set `n = 2`, then this could simulate the flipping of a fair coin 1000 times once to see how many times heads comes up, then repeating the whole process to see how many times heads comes up again (or, if it is more intuitive, the flipping of two separate fair coins 1000 times).
```{r}
coin_flips_2 <- rbinom(n = 2, size = 1000, prob = 0.5);
print(coin_flips_2);
```
In the above, a fair coin was flipped 1000 times and returned `r coin_flips_2[1]` heads, and then another fair coin was flipped 1000 times and returned `r coin_flips_2[2]` heads. As with the `rnorm` and `runif` functions, we can check to see what the distribution of the binomial function looks like if we repeat this process. Suppose, in other words, that we want to see the distribution of times heads comes up after 1000 flips. We can, for example, simulate the process of flipping 1000 times in a row with 10000 different coins using the code below.
```{r}
coin_flips_10000 <- rbinom(n = 10000, size = 1000, prob = 0.5);
```
I have not printed the above `coin_flips_10000` for obvious reasons, but we can use a histogram to look at the results.
```{r}
hist(coin_flips_10000, xlab = "Random value (X)", col = "grey",
     main = "", cex.lab = 1.5, cex.axis = 1.5);
```
As would be expected, most of the time 'heads' occurs around 500 times out of 1000, but usually the actual number will be a bit lower or higher due to chance. Note that if we want to simulate the results of individual flips in a single trial, we can do so as follows.
```{r}
flips_10 <- rbinom(n = 10, size = 1, prob = 0.5);
```
```{r, echo = FALSE}
print(flips_10);
```
In the above, there are `n = 10` trials, but each trial consists of only a single coin flip (`size = 1`). But we can equally well interpret the results as a series of `n` coin flips that come up either heads (`1`) or tails (`0`). This latter interpretation can be especially useful to write code that randomly decides whether some event will happen (`1`) or not (`0`) with some probability `prob`.
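As a quick illustration of that last point (a hypothetical sketch, not part of the original analysis), a single `size = 1` draw can decide whether some simulated event happens with a given probability:

```{r}
# Hypothetical example: decide whether an event occurs with probability 0.3
pr_event <- 0.3;
event    <- rbinom(n = 1, size = 1, prob = pr_event);
if(event == 1){
    print("The event occurred");
}else{
    print("The event did not occur");
}
```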
<a name="rpois">**Sampling from a poisson distribution**</a>
Many processes in biology can be described by a [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution). A Poisson process describes events happening with some given probability over an interval of time or space such that $X\sim Poisson(\lambda)$, where the rate parameter $\lambda$ is both the mean and the variance of the Poisson distribution (note that by definition, $\lambda > 0$). Sampling from a Poisson distribution can be done in R with `rpois`, which takes only two arguments specifying the number of values to be returned (`n`) and the rate parameter (`lambda`).
```{r}
rand_poissons <- rpois(n = 10, lambda = 1);
print(rand_poissons);
```
There are no default values for `rpois`. We can plot a histogram of a large number of values to see the distribution when $\lambda = 4$ below.
```{r}
rand_poissons_10000 <- rpois(n = 10000, lambda = 4);
hist(rand_poissons_10000, xlab = "Random value (X)", col = "grey",
     main = "", cex.lab = 1.5, cex.axis = 1.5);
```
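Since $\lambda$ is both the mean and the variance, a quick sanity check (not in the original notes) is to confirm that both sample statistics are close to 4 for a large sample:

```{r}
# Mean and variance of a Poisson sample should both be close to lambda
rand_poissons_10000 <- rpois(n = 10000, lambda = 4);
print(mean(rand_poissons_10000)); # Close to 4
print(var(rand_poissons_10000));  # Also close to 4
```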
<a name="sample">**Sampling values from a vector**</a>
Sometimes it is useful to sample a set of values from a vector or list. The R function `sample` is very flexible for sampling a subset of numbers or elements from some structure (`x`) in R according to some set of probabilities (`prob`). Elements can be sampled from `x` some number of times (`size`) with or without replacement (`replace`), though an error will be returned if the `size` of the sample is larger than the number of elements in `x` while `replace = FALSE` (the default). To start simply, suppose we want to ask R to pick a random number from one to ten with equal probability.
```{r}
rand_number_1 <- sample(x = 1:10, size = 1);
```
The above code will set `rand_number_1` to a randomly selected value, in this case `r rand_number_1`. Because we have not specified a probability vector `prob`, the function assumes that every element in `1:10` is sampled with equal probability. We can increase the `size` of the sample to `10` below.
```{r}
rand_number_10 <- sample(x = 1:10, size = 10);
print(rand_number_10);
```
Note that all numbers from 1 to 10 have been sampled, but in a random order. This is because the default is to sample *without* replacement, meaning that once a number has been sampled for the first element in `rand_number_10`, it is no longer available to be sampled again. To allow sampling with replacement, we can change the default.
```{r}
rand_number_10_r <- sample(x = 1:10, size = 10, replace = TRUE);
print(rand_number_10_r);
```
Note that the numbers {`r as.numeric(names(which(table(rand_number_10_r) > 1)))`} are now repeated in the set of randomly sampled values above. We can also specify the probability of sampling each element using the `prob` argument, a vector of sampling weights (R normalises these internally, but it is clearest to supply weights that sum to 1). Below is an example in which the numbers 1-5 are each sampled with a probability of 0.05, while the numbers 6-10 are each sampled with a probability of 0.15, thereby biasing sampling toward larger numbers.
```{r}
prob_vec <- c( rep(x = 0.05, times = 5), rep(x = 0.15, times = 5) );
rand_num_bias <- sample(x = 1:10, size = 10, replace = TRUE, prob = prob_vec);
print(rand_num_bias);
```
Note that `rand_num_bias` above contains more numbers from 6-10 than from 1-5.
This kind of sampling can be very useful. If, for example, we have a large data set and want to look at a random subset of rows, we can sample row indices and use them to subset the data. Consider the built-in R base data set `ChickWeight`, a table with data on the effect of diet on chick weights.
```{r}
data("ChickWeight")
head(ChickWeight);
```
The table continues on with `r dim(ChickWeight)[1]` rows of data. But suppose instead of wanting to see the first six rows (as returned from using `head` above), we instead want to see a random assortment of rows. We can use `sample` to ask for a random sample of 10 rows as follows (note that `dim(ChickWeight)[1]` returns the length of the first dimension -- i.e., the total number of rows).
```{r}
rand_rows <- sample(x = 1:dim(ChickWeight)[1], size = 10);
print(ChickWeight[rand_rows,]);
```
The above now shows a random collection of rows, including `r rownames(ChickWeight[rand_rows,])`.
Finally, this kind of sampling does not need to be done only on subsets of numbers. Lists of characters can also be sampled using `sample`. We can take a look at this using the in-built data set `mtcars` in R, in which each row name is a different car. We can get the car names using the R function `rownames`, which will return a full list of all the cars in the data set.
```{r}
data("mtcars");
head(mtcars);
car_names <- rownames(mtcars);
print(car_names);
```
If we want to sample two of these cars, for whatever reason, we can use `sample` in the same way that we did when sampling numbers.
```{r}
cars_sampled <- sample(x = car_names, size = 2, replace = FALSE);
print(cars_sampled);
```
With this procedure, we have therefore sampled `r cars_sampled[1]` and `r cars_sampled[2]`. There are even more types of structures `x` that can be sampled with `sample`, but for now I will move on to using `sample` for statistical hypothesis testing.
<a name="rand_h">Randomisation for hypothesis testing</a>
================================================================================
The ability to generate random values and sample randomly from a vector of numbers can be used for statistical hypothesis testing. To demonstrate this, I will use the [Bumpus data set](https://www.fieldmuseum.org/blog/hermon-bumpus-and-house-sparrows) of House Sparrow (*Passer domesticus*) survival, which was also used in the Stirling Coding Club [notes on writing manuscripts in Rmarkdown](https://stirlingcodingclub.github.io/Manuscripts_in_Rmarkdown/Rmarkdown_notes.html). A version of it can be viewed and downloaded [on GitHub](https://github.com/StirlingCodingClub/randomisation/blob/master/data/Bumpus_data.csv). The data set comes from Hermon Bumpus, who collected House Sparrows after a particularly severe storm in Rhode Island, USA to measure the morphological traits of birds that survived or perished [@Bumpus1898].
```{r}
birds <- read.csv("data/Bumpus_data.csv");
```
The first few rows look like the below.
```{r, echo = FALSE}
library(knitr);
kable(head(birds));
```
Suppose we hypothesise surviving sparrows will have a lower mass than perishing sparrows. If we want to do a simple test to see whether or not the body mass (`wgt`) of living sparrows is lower than that of dead sparrows, then we could use the base R function `t.test` as follows.
```{r}
bird_t_test <- t.test(birds[,5] ~ birds[,2], alternative = "less");
print(bird_t_test);
```
In the above, `birds[,5]` specifies the column of the response variable (sparrow mass) as a function of `birds[,2]` (`alive` versus `dead`). Specifying `alternative = "less"` indicates that our alternative hypothesis is that the true difference in the means is less than zero (i.e., $\mu_{alive} - \mu_{dead} < 0$). The difference between the estimated means is `r as.numeric(bird_t_test$estimate[1])` $-$ `r as.numeric(bird_t_test$estimate[2])` $=$ `r as.numeric(bird_t_test$estimate[1]) - as.numeric(bird_t_test$estimate[2])`. As indicated from the p-value shown above, if we assume that the null hypothesis is true (i.e., sparrows that lived do *not* have less mass than sparrows that died), then the probability of getting a difference in mass this low or lower is `r bird_t_test$p.value`.
The use of a t-test is sufficient for the above question; our data set is fairly large (`r sum(birds[,2] == "alive")` living sparrows and `r sum(birds[,2] == "dead")` dead sparrows), and there is no reason to believe the statistical assumptions of the t-test are being violated. Nevertheless, we can use resampling to test the same null hypothesis in a slightly different way -- a way that does not make the same standard t-test assumptions (e.g., normality). Consider the null hypothesis again, that the true body mass of living sparrows is not less than the true body mass of dead sparrows. In the t-test, we assume this null hypothesis and ask what is the probability of getting a value of $\hat\mu_{alive} - \hat\mu_{dead}$ as low as we did (note the hats above $\mu$ are used to indicate the observed *estimate* of the mean rather than the *true* mean).
**A different way to state the null hypothesis of the t-test is that *observed membership to the groups 'alive' versus 'dead' has no effect on whether or not the difference between 'alive' and 'dead' sparrow body mass is negative*.** In other words, we can randomly shuffle group membership and not expect the difference $\hat\mu_{alive} - \hat\mu_{dead}$ to be much lower than the difference in the actual observed data set. If we shuffle group membership many times, we can generate a distribution of differences between living and dead sparrow body masses that would be expected *if group membership were random* (i.e., if sparrow status as alive or dead were in fact random with respect to body mass). We can then compare this distribution with the actual observed difference from the data set to find the probability of getting such a low value of $\hat\mu_{alive} - \hat\mu_{dead}$ if group membership were random -- this comparison **is a p-value**, just derived in a different way with slightly different assumptions. It is, however, still the probability of getting a value as extreme as the one observed *given that the null hypothesis is true*. Below is a simple example of how to recreate the null distribution and compare it to the observed difference between group means to get a p-value.
```{r}
iter <- 9999; # Total iterations (+1 for observed data = 10k)
diff <- NULL; # To add difference between groups
N_birds <- dim(birds)[1]; # Total number of birds
for(i in 1:iter){
    bird_samp  <- sample(x = birds[,2], size = N_birds, replace = FALSE);
    samp_alive <- which(bird_samp == "alive");
    samp_dead  <- which(bird_samp == "dead");
    mn_samp_a  <- mean(birds[samp_alive, 5]);
    mn_samp_d  <- mean(birds[samp_dead, 5]);
    diff[i]    <- mn_samp_a - mn_samp_d;
}
```
Within the `for` loop above, what I have done is randomly shuffle the "alive" versus "dead" column from the `birds` data set. Hence, `bird_samp` is a vector of group membership that is random with respect to sparrow body mass in column 5 of `birds`. The `which` functions within the loop find which sparrows are randomly assigned to be "alive" and "dead". The following two lines get the mean values of body mass for each randomly generated group. And the last line within the `for` loop gets the difference between these means.
We could equally well calculate a t-statistic within the loop, or even the difference between group medians, rather than the difference between group means [@Manly2007]. What matters is that we are comparing the observed difference between groups with the equivalent differences created from the randomly generated distribution. All of the randomly generated values are now stored in `diff`, so we can plot them with a histogram to see the null distribution.
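To sketch that flexibility, the loop below records the difference between group *medians* instead. It is self-contained, using simulated masses and group sizes in place of the real Bumpus columns (all values here are made up for illustration):

```{r}
# Null distribution of the difference between group *medians*, built by
# shuffling group labels; data are simulated stand-ins for the real columns
set.seed(1);
group <- rep(x = c("alive", "dead"), times = c(60, 76)); # Hypothetical sizes
mass  <- rnorm(n = 136, mean = 25.8, sd = 1.3);          # Hypothetical masses
iter     <- 999;
med_diff <- rep(x = NA, times = iter);
for(i in 1:iter){
    shuffled    <- sample(x = group, size = length(group), replace = FALSE);
    med_diff[i] <- median(mass[shuffled == "alive"]) -
                   median(mass[shuffled == "dead"]);
}
hist(med_diff, xlab = "Random group median difference", col = "grey",
     main = "");
```

An observed median difference would then be compared against `med_diff` exactly as with the means.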
```{r}
xlabel <- expression( paste("Random group mean difference (",
                            mu[alive] - mu[dead], ")" )
                    ); # Gets the Greek mu with subscripts
hist(diff, xlab = xlabel, col = "grey", main = "", cex.lab = 1.5,
     cex.axis = 1.5, breaks = 20);
```
Recall that the actual observed difference was `r as.numeric(bird_t_test$estimate[1]) - as.numeric(bird_t_test$estimate[2])`. This is quite far to the left of the above distribution. We can add an arrow indicating the observed value on the histogram using the code below.
```{r, eval = FALSE}
obs_alive <- which(birds[,2] == "alive");
obs_dead <- which(birds[,2] == "dead");
obs_diff <- mean(birds[obs_alive,5]) - mean(birds[obs_dead,5]);
arrows(x0 = obs_diff, x1 = obs_diff, y0 = 500, y1 = 10, lwd = 3, length = 0.1);
```
```{r, echo = FALSE}
xlabel <- expression( paste("Random group mean difference (",
                            mu[alive] - mu[dead], ")" )
                    ); # Gets the Greek mu with subscripts
hist(diff, xlab = xlabel, col = "grey", main = "", cex.lab = 1.5,
     cex.axis = 1.5, breaks = 20);
obs_alive <- which(birds[,2] == "alive");
obs_dead <- which(birds[,2] == "dead");
obs_diff <- mean(birds[obs_alive,5]) - mean(birds[obs_dead,5]);
arrows(x0 = obs_diff, x1 = obs_diff, y0 = 500, y1 = 10, lwd = 3, length = 0.1);
```
To get a p-value, we can simply count the number of values less than or equal to the observed `obs_diff` in our randomly generated distribution above, then divide by the total number of randomly generated (9999) and observed (1) differences.
```{r}
less_or_equal_obs <- sum(diff <= obs_diff) + 1;
total_generated <- length(diff) + 1;
new_p_value <- less_or_equal_obs / total_generated;
```
Using the randomisation method, we get a value of $P =$ `r new_p_value`. Note that this p-value would change slightly if we were to re-run the analysis, but not by much if the number of times we resample is sufficiently high[^1]. Also note that the p-value we obtained after randomisation is quite close to the value obtained from the standard t-test, which was `r bird_t_test$p.value`. Conceptually, the two tests are doing the same thing -- assuming that the mean value of one group is not less than the mean value of another group, and asking what the probability of getting such a low difference in means would be if this assumption were in fact true. This was a very basic example of a randomisation test in R, but I hope that it illustrates how useful randomisation for hypothesis testing can be. Randomisation is a highly flexible tool, which allows researchers to test hypotheses that might otherwise be difficult to address due to restrictive statistical assumptions of standard tests or peculiarities of specific data sets. It is, of course, important to carefully consider what needs to be randomised for a particular question of interest to ensure that the appropriate null distribution is generated.
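One practical tip when re-running such analyses (general R behaviour, not specific to these notes): setting the random seed before a randomisation test makes the resulting p-value exactly reproducible, because runs started from the same seed produce identical random numbers.

```{r}
# Two runs from the same seed generate identical random values
set.seed(42);
first_run  <- rnorm(n = 5);
set.seed(42);
second_run <- rnorm(n = 5);
print(identical(first_run, second_run)); # TRUE
```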
<a name="rand_b">Bootstrapping confidence intervals</a>
================================================================================
We can apply the randomisation method to generating confidence intervals around some parameter value by resampling data with replacement. Recall the somewhat confusing frequentist definition of a confidence interval: an interval constructed so that, if we were to repeatedly resample the population and recompute the interval each time, it would contain the true parameter value with some stated confidence. For example, for a 95\% confidence interval we would expect the true mean to fall within the lower and upper bounds 95\% of the time if we were to go out, re-sample from our population, and re-calculate the interval bounds. Verbally, this is challenging to state, but the idea is well-reflected in the process of bootstrapping using resampling with replacement. The idea is that the distribution of values in the data represents the best information available on the distribution of the whole population of interest. Hence, to see what would happen if we were to re-sample from the whole population itself, we can sample with replacement from the collected data [@Manly2007].
To show a simple example of bootstrapping 95\% confidence intervals, I will again use the Bumpus data set. This time, however, I will focus on the whole data set of sparrows, not separating sparrows that are alive from those that are dead. The goal will be to generate 95\% bootstrapped confidence intervals for bird total length (`totlen`), which is found in column 3 of `birds`.
The general idea will be to repeatedly resample a new column `totlen` from the already existing column of data, with replacement. In other words, with each resampling, we will make a new `totlen` column of the same length as the original `totlen`, but some of the values will be repeated while others will be omitted due to the process of resampling with replacement. We then calculate the mean total sparrow length from the new column created by resampling. The standard deviation of these resampled means estimates the standard error of the mean, as if we had repeatedly gone back out into the field to collect new samples of sparrows and measured their lengths.
```{r}
iter <- 10000;
resampled_means <- NULL;
N_birds <- dim(birds)[1]; # Total number of birds
for(i in 1:iter){ # Loop over iterations, not over individual birds
    resampled_totlen   <- sample(x = birds[,3], size = N_birds, replace = TRUE);
    resampled_means[i] <- mean(resampled_totlen);
}
sd_resampled_means <- sd(resampled_means);
```
In the above loop, $`r iter`$ new data sets of `r N_birds` sparrow lengths are generated by sampling with replacement to create `resampled_totlen`. The mean of these resampled sparrow total lengths is calculated and appended to `resampled_means`. Outside of the loop, the standard deviation of the `resampled_means` is calculated to be `r sd_resampled_means`. Again, this standard deviation estimates what the standard deviation of *estimated mean* values of sparrow length would be if we were to go out and re-collect the same number of sparrows that Bumpus did. The standard deviation of means is, by definition, a standard error, so we can use this standard deviation to calculate 95\% confidence intervals just as we would with the standard error of the original data.
$$\hat\mu_{len} - \sigma z_{\alpha/2} < \mu_{len} < \hat\mu_{len} + \sigma z_{\alpha/2}$$
In the above, $\mu_{len}$ is the true mean total sparrow length, $\sigma$ is the standard deviation of the resampled means (`sd_resampled_means` in the code above), and $z_{\alpha/2}$ is the corresponding $z$ score from the standard normal distribution for which the upper-tail probability equals $\alpha/2$ (for a 95\% confidence interval, this value is 1.96). Note again the hats above $\mu$ to differentiate the parameter estimate ($\hat\mu_{len}$) from the true parameter ($\mu_{len}$). We can then simply substitute $z_{\alpha/2} = 1.96$, $\sigma =$ `sd_resampled_means` $=$ `r sd_resampled_means`, and $\hat\mu_{len} =$ `mean(birds[,3])` $=$ `r mean(birds[,3])` to get the following 95\% confidence intervals.
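The value 1.96 need not be memorised; it is the standard normal quantile leaving $\alpha/2 = 0.025$ in the upper tail, which R returns via `qnorm`:

```{r}
# Recover the critical z score for a 95% confidence interval
alpha  <- 0.05;
z_crit <- qnorm(p = 1 - alpha / 2);
print(z_crit); # Approximately 1.96
```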
```{r}
lower_95CI <- mean(birds[,3]) - 1.96 * sd_resampled_means;
upper_95CI <- mean(birds[,3]) + 1.96 * sd_resampled_means;
```
This results in an estimated mean total bird length of `r mean(birds[,3])` with a lower 95\% confidence interval of `r lower_95CI` and an upper 95\% confidence interval of `r upper_95CI`. We can compare this with 95\% confidence intervals calculated in the standard way using standard errors from the song sparrow data set as follows.
$$\hat\mu_{len} - \frac{\sigma}{\sqrt{n}} z_{\alpha/2} < \mu_{len} < \hat\mu_{len} + \frac{\sigma}{\sqrt{n}} z_{\alpha/2}$$
In the above, all that has changed is the $\sigma / \sqrt{n}$ term: $\sigma$ is now the standard deviation of the observed sparrow total lengths (i.e., *not* the standard deviation of the means of resampled lengths, as above), divided by the square root of the total number of sparrows observed (i.e., `N_birds` in the code above). This results in the confidence intervals calculated below.
```{r}
lower_95CI_std <- mean(birds[,3]) - 1.96 * ( sd(birds[,3]) / sqrt(N_birds) );
upper_95CI_std <- mean(birds[,3]) + 1.96 * ( sd(birds[,3]) / sqrt(N_birds) );
```
This standard method results in the same estimated mean total bird length of `r mean(birds[,3])`, but with a lower 95\% confidence interval of `r lower_95CI_std` and an upper 95\% confidence interval of `r upper_95CI_std`.
There are several different methods of using bootstrapping to generate confidence intervals [@Manly2007]. My hope is that the above is enough to show the general logic of bootstrapping with resampling, along with a helpful example. **Note that means are not the only parameter values that can be bootstrapped to obtain confidence intervals**. By resampling data with replacement, confidence intervals can be generated around any number of parameter value estimates (e.g., variances, correlations, custom indices, etc.).
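One widely used alternative is the "percentile" method, in which the interval bounds are simply the 2.5\% and 97.5\% quantiles of the resampled means. A self-contained sketch is below, with simulated lengths standing in for the Bumpus data (the distribution parameters are made up for illustration):

```{r}
# Percentile bootstrap: interval bounds are quantiles of the resampled means
set.seed(7);
lengths  <- rnorm(n = 136, mean = 160, sd = 4); # Hypothetical total lengths
iter     <- 10000;
boot_mns <- rep(x = NA, times = iter);
for(i in 1:iter){
    boot_samp   <- sample(x = lengths, size = length(lengths), replace = TRUE);
    boot_mns[i] <- mean(boot_samp);
}
perc_CI <- quantile(x = boot_mns, probs = c(0.025, 0.975));
print(perc_CI);
```

With large samples and a roughly symmetric statistic, the percentile bounds are typically close to those from the standard-error method above.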
<a name="rand_m">Monte Carlo method for spatial data</a>
================================================================================
The previous two examples of [hypothesis testing](#rand_h) and [bootstrapping](#rand_b) involved resampling from existing data sets to address a statistical question. Next, I will consider the case in which we address a null hypothesis with a distribution that cannot be derived by sampling from the observed data. **The goal will be to test whether or not plants of one species are randomly distributed across a fixed sampling area.**
The data set I will use includes locations of [Sonoran Desert rock fig](https://en.wikipedia.org/wiki/Ficus_petiolaris) trees (*Ficus petiolaris*) from a field site in Baja, Mexico. These trees were mapped in the lab of [John Nason](https://www.eeob.iastate.edu/faculty/nason/). Fig trees have a fascinating ecology, with mutualistic and antagonistic interactions with accompanying species of obligately pollinating and parasitising [fig wasps](https://en.wikipedia.org/wiki/Fig_wasp).
![](images/fig_tree.jpg)
The locations of these fig trees were originally collected for a much different purpose from what I am using them for here, but they are useful for a simplified example of Monte Carlo methods as applied to spatial data. The data are available [on GitHub](https://github.com/StirlingCodingClub/randomisation/blob/master/data/FpetTrees.csv).
```{r}
fpet <- read.csv("data/FpetTrees.csv");
```
The first six rows are shown below.
```{r, echo = FALSE}
lat1 <- 28.289;
lat2 <- 28.2918;
lon1 <- -113.1145;
lon2 <- -113.1095;
kable(head(fpet));
```
We can plot the latitude and longitude of each of the `r dim(fpet)[1]` trees below. The brown box shows the study plot at the field site, randomly set[^2] from latitude `r lat1` to `r lat2` and longitude `r lon1` to `r lon2`.
```{r, fig.height = 7, fig.width = 7}
lat1 <- 28.289;
lat2 <- 28.2918;
lon1 <- -113.1145;
lon2 <- -113.1095;
plot(x = fpet[,4], y = fpet[,3], xlab = "Longitude", ylab = "Latitude",
pch = 20, cex.lab = 1.25, cex.axis = 1.25,
xlim = c(lon1 - 0.001, lon2 + 0.001),
ylim = c(lat1 - 0.0005, lat2 + 0.0005));
polygon(x = c(lon1, lon1, lon2, lon2), y = c(lat1, lat2, lat2, lat1),
col = "tan");
points(x = fpet[,4], y = fpet[,3], pch = 20);
```
The focal question is whether or not these trees are randomly distributed within the brown box. This is not a question that can be answered by randomising the latitude and longitude coordinates of trees. What we need to do instead is compare the distribution of the trees above with that of trees with randomly placed latitude and longitude coordinates within the box. This is a job for a Monte Carlo test, which compares an observed test statistic with one derived from a theoretical null model [@Manly2007]. In this case, the test statistic we will use is distance to the nearest neighbour (i.e., for a focal *F. petiolaris* tree, how close the nearest conspecific is). The null model that we will assume is a set of randomly sampled latitude and longitude coordinates within the fixed study area (note that we would normally want to translate the latitude and longitude coordinates to the [Universal Transverse Mercator (UTM)](https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system) coordinate system for calculating distances between trees. I am skipping this step to avoid making things unnecessarily complicated, and because it is not directly related to the logic of the Monte Carlo test).
More formally, our null hypothesis will be that the mean nearest neighbour distance for a focal tree in the observed data will not differ significantly from the mean nearest neighbour distance obtained from the same number of trees randomly distributed within the sampling area (brown box). To test this null hypothesis, we can randomly place `r dim(fpet)[1]` trees on the sampling area and calculate the mean nearest neighbour distance for the randomly placed trees. If we repeat this procedure a large number of times, then we can build a null distribution of nearest neighbour distances for trees that are randomly placed within the study area (i.e., random latitude and longitude coordinates). First, we need to write a function that calculates the mean distance to the nearest neighbouring tree; the below will work (again, recognising that the coordinates should *really* be translated to [UTM](https://en.wikipedia.org/wiki/Universal_Transverse_Mercator_coordinate_system)).
```{r}
get_mean_nearest_neighbour <- function(trees, lat = 3, lon = 4){
N_trees <- dim(trees)[1];
tree_lat <- trees[,lat];
tree_lon <- trees[,lon];
nearests <- rep(x = NA, times = N_trees);
for(i in 1:N_trees){
nearest_neigh <- Inf;
for(j in 1:N_trees){
if(i != j){
                diff_lat <- tree_lat[i] - tree_lat[j];
                diff_lon <- tree_lon[i] - tree_lon[j];
sq_lat <- diff_lat * diff_lat;
sq_lon <- diff_lon * diff_lon;
neigh_d <- sqrt(sq_lat + sq_lon);
if(neigh_d < nearest_neigh){
nearest_neigh <- neigh_d;
}
}
}
nearests[i] <- nearest_neigh;
}
return(mean(nearests));
}
```
The arguments `lat` and `lon` specify the column locations for latitude and longitude coordinates, respectively. The details of the function are not important; I am just using it to get a reasonable test statistic for the Monte Carlo test, which is shown in the code below. The idea of the code is to randomly sample real numbers for tree latitudes and longitudes within the defined study area (note the use of [`runif`](#runif) to do this). Given the random placement of these trees, the function `get_mean_nearest_neighbour` above returns the mean distance to the nearest neighbouring tree, calculated across all of the randomly placed trees. This mean nearest neighbour distance is then stored so that a null distribution can be built.
```{r, eval = FALSE}
iteration <- 9999;
N_trees <- dim(fpet)[1];
simulated_NN <- NULL;
while(iteration > 0){
rand_lat <- runif(n = N_trees, min = lat1, max = lat2);
rand_lon <- runif(n = N_trees, min = lon1, max = lon2);
fpet[,3] <- rand_lat;
fpet[,4] <- rand_lon;
    mean_NN  <- get_mean_nearest_neighbour(fpet);
simulated_NN[iteration] <- mean_NN;
iteration <- iteration - 1;
}
```
```{r, echo = FALSE}
iteration <- 9999;
```
In the code above, a `while` loop is used to find the mean nearest neighbour distance of randomly placed trees `r iteration` times. Within the loop, trees are randomly placed with [`runif`](#runif), and the same number of trees are placed as were observed in the actual data set; the `min` and `max` values in `runif` reflect the boundaries of the sampling area. We can see what these randomly placed trees look like by plotting some of the iterations below.
```{r, echo = FALSE, fig.height = 8, fig.width = 8}
iteration <- 16;
N_trees <- dim(fpet)[1];
simulated_NN <- NULL;
par(mfrow = c(4, 4), mar = c(0, 0, 0, 0));
while(iteration > 0){
rand_lat <- runif(n = N_trees, min = lat1, max = lat2);
rand_lon <- runif(n = N_trees, min = lon1, max = lon2);
fpet[,3] <- rand_lat;
fpet[,4] <- rand_lon;
mean_NN <- get_mean_nearest_neighbour(fpet);
simulated_NN[iteration] <- mean_NN;
plot(x = fpet[,4], y = fpet[,3], xlab = "", ylab = "",
pch = 20, cex.lab = 1.25, cex.axis = 1.25, xaxt = "n", yaxt = "n",
xlim = c(lon1 - 0.001, lon2 + 0.001),
ylim = c(lat1 - 0.0005, lat2 + 0.0005));
polygon(x = c(lon1, lon1, lon2, lon2), y = c(lat1, lat2, lat2, lat1),
col = "tan");
points(x = fpet[,4], y = fpet[,3], pch = 20);
iteration <- iteration - 1;
}
```
Below, the distribution of mean distance to the nearest neighbour is plotted for the 9999 randomly generated tree study areas. The arrow shows the actual observed mean distance between nearest neighbours, as calculated from the original data set.
```{r, echo = FALSE}
iteration <- 9999;
N_trees <- dim(fpet)[1];
simulated_NN <- NULL;
while(iteration > 0){
rand_lat <- runif(n = N_trees, min = lat1, max = lat2);
rand_lon <- runif(n = N_trees, min = lon1, max = lon2);
fpet[,3] <- rand_lat;
fpet[,4] <- rand_lon;
mean_NN <- get_mean_nearest_neighbour(fpet);
simulated_NN[iteration] <- mean_NN;
iteration <- iteration - 1;
}
hist(simulated_NN, xlab = "Mean tree nearest neighbour distance",
col = "grey", main = "", cex.lab = 1.5, cex.axis = 1.5, breaks = 20);
fpet <- read.csv("data/FpetTrees.csv");
obs  <- get_mean_nearest_neighbour(fpet);
arrows(x0 = obs, x1 = obs, y0 = 500, y1 = 10, lwd = 3, length = 0.1);
```
It appears, from the position of the mean tree nearest neighbour distance in the observed data, that the *F. petiolaris* trees are no more or less spatially aggregated than would be expected by chance.
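This visual check can be formalised as a two-tailed Monte Carlo p-value: count how many simulated means are at least as extreme as the observed mean, just as in the randomisation tests earlier. The sketch below is self-contained, so `simulated_NN` and `obs` are filled with placeholder values standing in for the objects computed from the fig tree data above.

```r
# Sketch of a two-tailed Monte Carlo p-value from a null distribution.
# `simulated_NN` and `obs` are placeholders standing in for the values
# computed from the fig tree data above.
set.seed(98);
simulated_NN <- rnorm(n = 9999, mean = 4e-04, sd = 5e-05); # placeholder null
obs          <- 3.8e-04;                                   # placeholder observed
null_mean    <- mean(simulated_NN);
more_extreme <- sum(abs(simulated_NN - null_mean) >= abs(obs - null_mean));
p_value      <- (more_extreme + 1) / (length(simulated_NN) + 1);
```

With the real `simulated_NN` and `obs`, a small `p_value` would indicate that the trees are significantly more (or less) aggregated than expected under the random placement null model.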
<a name="conclusions">Conclusions</a>
================================================================================
I hope that the above introduction has been helpful for showing some different kinds of statistical methods that are possible in R. For further reading on randomisation, bootstrap, and Monte Carlo methods, I recommend the text of @Manly2007.
<a name="practice">Practice problems</a>
================================================================================
Below are some practice problems. **To see the answers**, click on the '**Details**' arrows at the bottom left of each question. Note that your answers might differ from mine -- there is more than one way to solve each problem!
1. Generate a histogram of random numbers sampled from a gamma distribution with a shape parameter ($k$) of 2 and a scale parameter ($\theta$) of 3 using the `rgamma` function in base R. Try different shape and scale parameters to see how the gamma distribution changes with different parameter combinations.
```{r, eval = FALSE}
rand_gams <- rgamma(n = 10000, shape = 2, scale = 3);
hist(rand_gams, col = "grey", xlab = "Random gamma values");
```
2. Using the base R data set `trees`, calculate 95% confidence intervals for tree Girth, Height, and Volume. Note, to read in the data, type `data("trees")` in the console. You should then have the data table `trees`, with three columns and 31 rows. Feel free to use the functions in the [appendix](#appendix) to do the bootstrapping for you. Try bootstrapping more than once with the same function to see what happens, and try comparing bootstrapped mean versus median confidence intervals.
```{r, eval = FALSE}
data(trees);
Girth_mean_CIs <- simpleboot(trees$Girth);
Girth_medi_CIs <- medianboot(trees$Girth);
Height_mean_CIs <- simpleboot(trees$Height);
Height_medi_CIs <- medianboot(trees$Height);
Volume_mean_CIs <- simpleboot(trees$Volume);
Volume_medi_CIs <- medianboot(trees$Volume);
```
3. Using the base R data set `mtcars`, use a randomisation test to see if there is a significant correlation between car miles per gallon (`mpg`) and car weight (`wt`).
```{r}
data(mtcars);
true_cor <- cor(mtcars[,1], mtcars[,6]);
iters <- 999;
the_cors <- NULL;
for(i in 1:iters){
samp_wt <- sample(mtcars[,6], size = dim(mtcars)[1]);
the_cors[i] <- cor(mtcars[,1], samp_wt);
}
more_extr <- sum(abs(the_cors) >= abs(true_cor));
p_val <- (more_extr + 1) / (iters + 1);
```
<a name = "appendix">Appendix: Some bootstrap functions</a>
================================================================================
To use the functions `simpleboot` and `medianboot`, highlight the code below and read it into the console. The first function returns lower and upper 95% bootstrapped confidence intervals around the mean of a vector `freqs`. The second function returns lower and upper 95% bootstrapped confidence intervals around the median of a vector `freqs`.
```{r}
# A simple bootstrap function from Manly (2007 p. 46)
simpleboot <- function(freqs, repli = 1000, alpha = 0.05){
vals <- NULL;
i <- 0;
while(i < repli){
boot <- sample(x = freqs, size = length(freqs), replace = TRUE);
strap <- mean(boot);
vals <- c(vals, strap);
i <- i + 1;
}
vals <- sort(x = vals, decreasing = FALSE);
lowCI <- vals[round( (alpha * 0.5) * repli )];
highCI <- vals[round( (1 - (alpha * 0.5) ) * repli )];
CIs <- c(lowCI, highCI);
return(CIs);
}
# A simple bootstrap function for the median
medianboot <- function(freqs, repli = 1000, alpha = 0.05){
vals <- NULL;
i <- 0;
while(i < repli){
boot <- sample(x = freqs, size = length(freqs), replace = TRUE);
strap <- median(boot);
vals <- c(vals, strap);
i <- i + 1;
}
vals <- sort(x = vals, decreasing = FALSE);
lowCI <- vals[round( (alpha * 0.5) * repli)];
highCI <- vals[round( (1 - (alpha * 0.5) ) * repli)];
CIs <- c(lowCI, highCI);
return(CIs);
}
```
References
================================================================================
[^1]: It is *not* acceptable, of course, to re-run a randomisation analysis until achieving some p-value that is below a desired threshold.
[^2]: No, not really. The actual objective was not to randomly sample trees within a given area, but rather to map all of the trees that we could find for a variety of different purposes. But it is useful for illustration to pretend that we had instead sampled trees within a fixed area described by the brown box above.