-
Notifications
You must be signed in to change notification settings - Fork 44
/
Copy path17-probability.qmd
483 lines (321 loc) · 12.7 KB
/
17-probability.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
# Probability
We will use code to answer the following probability questions:
* What is the chance that two people in this room have the same birthday?
* What is the chance that a casino loses money on 1,000 people betting on black?
* How many people need to play to lower the chance of losing money to less than 1%?
* What is the distribution of the number of people that win the lottery if the probability of winning is 1 in a million and 500,000 people by one tickets?
* Why did banks underestimate their risk during the 2008 financial crisis.
## Monte Carlo
Suppose we take a random sample of $n$ people not born on Feb 29. What is the chance that two or more people have the same birthday?
We can figure this out with math, but lets use a _Monte Carlo_ simulation instead.
The `sample` function take a random sample with or without replacement. So here is a randome sample of $n = 25$, 365 days
```{r}
set.seed(2023-10-1)
n <- 25
bdays <- sample(1:365, size = n, replace = TRUE)
```
Did two or more people have the same birthday?
```{r}
any(duplicated(bdays))
```
The probability can be thought of as taking $n$ samples over and over again, for ever. The proportion of times we see two or more can be defined as the probability. So approximate $\infty$ with 100,000 and report the proportion for $n=25$.
```{r}
n <- 25
B <- 10^5
replicate(B,{
bdays <- sample(1:365, size = n, replace = TRUE)
any(duplicated(bdays))
}) |> mean()
```
Use Monte Carlo to estimate the probability for $n=1,\dots,50$. Make a plot of the probability versus $n$.
```{r}
n <- seq(1, 50)
pr <- function(n){
replicate(B,{
bdays <- sample(1:365, size = n, replace = TRUE)
any(duplicated(bdays))
}) |> mean()
}
p <- sapply(n, pr)
plot(n, p)
```
We can compute the exact probability:
$$
1 - 1 \times \frac{364}{365}\times\frac{363}{365} \dots \frac{365-n + 1}{365}
$$
Write a function that computes this for any $n$ then add a curve to the previous plot to confirm that our approximation worked.
```{r}
epr <- function(n){
1 - prod(seq(365, 365 - n + 1)/365)
}
ep <- sapply(n, epr)
plot(n, p)
lines(n, ep, col = "red")
```
## Random variables
What is the chance that a casino loses money on 1,000 people betting on black?
Let's play once:
```{r}
x <- sample(c(-1, 1), size = 1, prob = c(9/19, 10/19))
```
If we play 10 times we get different values:
```{r}
replicate(10, sample(c(-1, 1), size = 1, prob = c(9/19, 10/19)))
```
:::{.callout-note}
We can get the same results using
```{r}
sample(c(-1, 1), size = 10, replace = TRUE, prob = c(9/19, 10/19))
```
even faster is to use
```{r}
x <- rbinom(10, 1, 10/19); x <- x*2 - 1
```
:::
These are the outcomes of a random variable $X$ with
$$
\mbox{Pr}(X = 1) = 10/19 \mbox{ and } \mbox{Pr}(X=-1) = 9/19
$$
We are interested in the sum of $n$ realizations of this random variable:
$$
S_n = \sum_{i=1}^{n} X_i
$$
And want to know
$$
\mbox{Pr}(S_n<0)
$$
Here is one realization of $S_n$ with $n=1000$:
```{r}
x <- sample(c(-1, 1), size = 1000, replace = TRUE, prob = c(9/19, 10/19))
s <- sum(x)
```
We can use math to compute the probability above, but we will instead use a Monte Carlo simulation. This can be useful for checking our math and aslo for cases when we can't do the math.
### Monte Carlo
We can see the distribution of $S$ by randomly generating it an infinite number of times. Let's approximate with 100,000 and look at the distribution of $S$
```{r}
s <- replicate(10^5,{
x <- sample(c(-1, 1), size = 1000, replace = TRUE, prob = c(9/19, 10/19))
sum(x)
})
plot(table(s))
```
We can use this compute the probability of interest:
```{r}
mean(s < 0)
```
## CLT
Do you recognize the distribution of $S_n$ shown above?
In your Probability class you learn about the Central Limit Theorem: the distribution of the sum of independent equally distributed random variables can be approximated with a normally distribution.
But for this to be useful, we need to know the _mean_ $\mbox{E}[S_n]$ and _standard error_ $\mbox{SE}[S_n]$
Once we know this we can estimate the probability of $S_n$ being in an interval, for any interval $(a,b)$:
$$
\mbox{Pr}(a<S_n<b) = \mbox{Pr}\left(\frac{a-\mbox{E}[S_n]}{\mbox{SE}[S_n]} < Z < \frac{b-\mbox{E}[S_n]}{\mbox{SE}[S_n]}\right)
$$
with
$$
Z = \frac{S_n-\mbox{E}[S_n]}{\mbox{SE}[S_n]}
$$
an standard normal random variable.
So all we need is $\mbox{E}[S_n]$ and $\mbox{SE}[S_n]$. This is usually easier to compute than the actual distribution of $S_n$
What is the expected value of $X_i$? Call it $\mu$.
$$
\mu = \mbox{E}[X_i] = -1 \times 9/19 + 1 \times 10/19 = 1/19 \approx 0.05
$$
What is the standard error? Call it $\sigma$
$$
\begin{align}
\sigma^2 &= \mbox{Var}[X_i] = \mbox{E}[(X_i-\mu)^2] = (-1 - 1/19)^2 \times 9/19 + (1 - 1/19)^2 \times 10/19\\
& = 4 \times 10/19 \times 9/19
\end{align}
$$
Let's check with a Monte Carlo simulation:
```{r}
x <- sample(c(-1, 1), size = 10^6, replace = TRUE, prob = c(9/19, 10/19))
mean(x)
sqrt(mean((x - mean(x))^2))
sd(x)
```
:::{.callout-note}
The `sd` function in R does not compute the standard error or population standard deviation. It returns an estimate of the population standard deviation based on s sample. It divides by $n-1$ instead of $n$. Here there practically no difference as $n$ is 1 million.
:::
This let's us compute:
$$
\mbox{E}[S_n] = \mbox{E}\left[\sum_{i=1}^n X_i\right] = \sum_{i=1}^n \mbox{E}[X_i] = n \mu
$$
$$
\mbox{Var}[S_n] =\mbox{Var}[\sum_{i=1}^n X_i] = \sum_{i=1}^n\mbox{Var}[X_i] = n \sigma^2
$$
So
$$
\mbox{SE}[S] = \sqrt{n} \sigma
$$
So
$$
\mbox{Pr}(S_n < 0) = \mbox{Pr}\left(Z < \frac{-n\mu}{\sqrt{n}\sigma}\right)
= \mbox{Pr}\left(Z < \frac{-\sqrt{n}\mu}{\sigma}\right)
$$
```{r}
n <- 1000
mu <- 1/19
sigma <- 2*sqrt(9/19*10/19)
pnorm(-sqrt(n)*mu/sigma)
```
Now to answer "How many people need to play to lower the chance of losing money to less than 1%?" we want:
$$
\mbox{Pr}\left(Z < \frac{-\sqrt{n}\mu}{\sigma}\right)
$$
Notice that the the quantity $\frac{-\sqrt{n}\mu}{\sigma}$ goes to $-\infty$ as $n$ gets larger so we can make $\mbox{Pr}(S_n<0)$ as small as we desire by simply making the $n$ larger.
We again can use CLT
$$
\mbox{Pr}\left(Z<\frac{-\sqrt{n}\mu}{\sigma} \right) \leq 0.01 \implies \frac{-\sqrt{n}\mu}{\sigma} = \Phi^{-1}(0.01)\implies \sqrt{n} = -\frac{\sigma}{\mu}{\Phi^{-1}(0.01)}
$$
```{r}
n <- ceiling((-sigma/mu*qnorm(0.01))^2)
```
## Exercises
(@) Check that you selected the correct $n$ with Monte Carlo simulation.
```{r}
## Hint: copy and paste one of the previous Monte Carlo simulations and change n
s <- replicate(10^5,{
x <- sample(c(-1, 1), size = n, replace = TRUE, prob = c(9/19, 10/19))
sum(x)
})
mean(s < 0)
```
(@) Suppose instead of the sum we are interested in the average:
$$
\bar{X}_n = S_n / n
$$
(@) What are the expected value of $\bar{X}$?
$$
\mbox{E}[\bar{X}_n] = \mbox{E}[\bar{S}_n/n] = 1/n \mbox{E}[\bar{S}_n] = 1/n \times n \times \mu = \mu
$$
(@) What is the standard error of $\bar{X}_n$?
$$
\mbox{SE}[\bar{X}_n] = 1/n \mbox{SE}[\bar{S}_n] = 1/n \times \sqrt{n} \sigma = \sigma / \sqrt{n}
$$
(@) What do the previous two results tell us about the difference between $\bar{X}_n$ and $\mu$ when $n$ is very large.
## Law of small numbers
What is the distribution of the number of people that win the lottery if the probability of winning is 1 in a million and 500,000 people by one tickets?
If we define $X_i$ as 1 if person $i$'s ticket won and 0 otherwise, then the number of people that win is the sum
$$
S_n = \sum_{i=1}^n X_i
$$
with $n=500,000$ and
$$
\mbox{Pr}(X_i=1) = 0.000001
$$.
CLT says $S_n$ should follow a normal distribution. Let's run Monte Carlo to check:
```{r}
s <- replicate(10^3,{
x <- sample(c(0, 1), size = 5*10^5, replace = TRUE,
prob = c(1 - 10^-6, 10^-6))
sum(x)
})
plot(table(s))
```
Does that look normal?
What happened? The _rule of thumb_ of $n=30$ was well surpassed.
The problem here is that $S_n$ can't be negative, the expected value is $n/10^{6}$, but the standard error is about $\sqrt{n}/10^3$. So to even get a symmetric distribution we need $\sqrt{n}/10^3$ at least twice as big as $n/10^{6}$.
However, there is a distribution the does approximate the distribution of $S_n$.
## Exercise
(@) Compare the estimates of the probabilities
$$
\mbox{Pr}(S_n = k), k=0,1,2,3,4
$$
To the probabilities of a Poisson distributed variavbel with rate $np = 500000/1000000 = 0.5$. Hint: Use `dpois` function.
```{r}
sapply(0:3, function(k) mean(s == k))
dpois(0:3, 0.5)
```
## 2008 Financial Crisis
```{r}
p <- 0.04 #default prob
s <- replicate(10^5,{
x <- sample(c(-200000, 10000), size = 10000, replace = TRUE, prob = c(p, 1-p))
sum(x)
})
hist(s)
```
```{r}
p <- 0.04 #default prob
s <- replicate(10^5,{
p <- 0.04 + runif(1, -0.02, 0.02)
x <- sample(c(-200000, 10000), size = 10000, replace = TRUE, prob = c(p, 1 - p))
sum(x)
})
hist(s)
```
What assumption is violated now?
Remember the formula for sum of variance is actually:
$$
\mbox{Var}(X_1 + X_2) = \mbox{Var}(X_1) + \mbox{Var}(X_2) + 2\mbox{Cov}(X_1, X_2)
$$
If variables are positively correlated, variance goes up.
## Exercises
Use Monte Carlo simulations to answer these questions. Feel free to confirm using math.
(@) Two teams, say the Celtics and the Cavs, are playing a seven game series. The Cavs are a better team and have a 60% chance of winning each game. What is the probability that the Celtics win **at least** one game? Use a Monte Carlo simulation to compute (or confirm) your answer.
```{r}
b <- 10^5
nowins <- replicate(b, {
x <- sample(c(0, 1), size = 4, replace = TRUE, prob = c(0.6, 0.4))
sum(x) == 0
})
1 - mean(nowins)
## using math
1 - 0.6^4
```
(@) Two teams, say the Cavs and the Warriors, are playing a seven game championship series. The first to win four games, therefore, wins the series. The teams are equally good so they each have a 50-50 chance of winning each game. If the Cavs lose the first game, what is the probability that they win the series?
(@) Two teams, $A$ and $B$, are playing a seven game series. Team $A$ is better than team $B$ and has a $p>0.5$ chance of winning each game. Write a function that uses Monte Carlo simulation to compute the probability of the underdog winning the series for any value $p$. The plot the results for `p <- seq(0.5, 0.95, 0.025)`.
(@) Repeat the exercise above, but now keep the probability fixed at `p <- 0.75` and compute the probability for different series lengths: best of 1 game, 3 games, 5 games,... Specifically, `n <- seq(1, 25, 2)`.
(@) Assume the distribution of female heights is approximated by a normal distribution with a mean of 64 inches and a standard deviation of 3 inches. If we pick a female at random, what is the probability that she is between 61 and 67 inches?
(@) Imagine the distribution of male adults is approximately normal with an expected value of 69 and a standard deviation of 3. How tall is the male in the 99th percentile? Hint: use `qnorm`.
(@) The distribution of IQ scores is approximately normally distributed. The average is 100 and the standard deviation is 15. Suppose you want to know the distribution of the highest IQ across all graduating classes if 10,000 people are born each in your school district. Run a Monte Carlo simulation with `B=1000` generating 10,000 IQ scores and keeping the highest. Make a histogram.
(@) In American Roulette you can also bet on green. There are 18 reds, 18 blacks and 2 greens (0 and 00). What are the chances the green comes out?
```{r}
p <- 1/19
```
(@) The payout for winning on green is \$17 dollars. This means that if you bet a dollar and it lands on green, you get \$17. Create a sampling model using sample to simulate the random variable $X$ for your winnings.
```{r}
sample(c(-1, 17), size = 1, replace = TRUE, prob = c(1-p, p))
```
(@) Compute the expected value of $X$.
```{r}
-1 * (1-p) + 17*p
```
(@) Compute the standard error of $X$.
```{r}
18 * sqrt(p * (1-p))
```
(@) Now create a random variable $S$ that is the sum of your winnings after betting on green 1000 times.
```{r}
n <- 1000
x <- sample(c(-1, 17), size = n, replace = TRUE, prob = c(1-p, p))
sum(x)
```
(@) What is the expected value of $S$?
```{r}
mu <- n * (-1 * (1-p) + 17*p)
```
(@) What is the standard error of $S$?
```{r}
se <- sqrt(n) * 18*sqrt(p*(1 - p))
```
(@) Use CLT to estimate the probability that you end up winning money?
$$
Pr(Z > -\mu/se) = ?
$$
```{r}
1 - pnorm(-mu/se)
```
(@) Create a Monte Carlo simulation that generates 1,000 outcomes of $S$. Compute the average and standard deviation of the resulting list to confirm the results of 6 and 7.
```{r}
B <- 100000
s <- replicate(B,{
n <- 1000
x <- sample(c(-1, 17), size = n, replace = TRUE, prob = c(18/19, 1/19))
sum(x)
})
mean(s > 0)
```
(@) Now check your answer based on CLT using the Monte Carlo result.