# Introduction to statistical machine learning {#sec-mlintro}
```{r env_ml, echo = FALSE, message = FALSE}
library("ggplot2")
```
The next chapters will focus on concepts from statistics (hypothesis
testing in chapter \@ref(sec-testing)) and machine learning (chapters
\@ref(sec-ul), \@ref(sec-dimred) and \@ref(sec-sl)). Before diving
into the technical details, it is useful to learn (or remind
ourselves) why these techniques are so important when analysing
(i.e. looking to understand) high throughput biomedical data. The
goals of these chapters are to

- understand what these techniques do, how to apply them, and what
  their respective limitations are;
- learn to frame questions more accurately in the light of the
  available analysis techniques;
- be in a position to formalise a research question and define what
  data is needed before generating it;
- ultimately, extract meaningful results from a dataset.
## Hypothesis testing
Let's start with the following experiment. Researchers are interested
in the expression of three genes,
[A1CF](https://www.genecards.org/cgi-bin/carddisp.pl?gene=A1CF),
[BRCA1](https://www.genecards.org/cgi-bin/carddisp.pl?gene=BRCA1) and
[TP53](https://www.genecards.org/cgi-bin/carddisp.pl?gene=TP53) in the
absence and presence of a certain drug in a model cell line. The
expression of these genes is measured in four replicates under each condition.
```{r, echo = FALSE, fig.cap = "Distributions of the expression of the genes A1CF, BRCA1 and TP53 under the control (no drug) and drug at concentrations 1 and 5.", fig.width = 8}
expd <- ge <- data.frame(sample = paste0("S", 1:12),
group = rep(c("CTRL", "DRUG", "DRUG"), each = 4),
concentration = factor(rep(c(0, 1, 5), each = 4)),
replicate = rep(1:4, 3),
stringsAsFactors = FALSE)
set.seed(1)
ge$A1CF <- rnorm(12, 6, 2)
ge$BRCA1 <- c(abs(rnorm(4, 2, 1)), rnorm(4, 8, 2), rnorm(4, 13, 2))
ge$TP53 <- c(rnorm(4, 10, 5), rnorm(4, 10, 3), rnorm(4, 10, 2))
ge <- tidyr::pivot_longer(ge,
names_to = "gene",
values_to = "expression",
c(A1CF, BRCA1, TP53))
ggplot(ge, aes(x = concentration, y = expression,
colour = concentration, label = replicate)) +
geom_point() +
facet_grid(. ~ gene) +
theme(legend.position="none") +
ggrepel::geom_text_repel(nudge_x = 0.2, segment.size = 0.5)
```
`r msmbstyle::question_begin()`
Make sure you understand the visualisation above.
1. Which genes would you call differentially expressed, i.e. which
   show a different expression between conditions?
2. What criteria do you rely on to conclude whether the genes are or
aren't differentially expressed?
`r msmbstyle::question_end()`
Now imagine that, instead of 3 genes, we had 20000!
`r msmbstyle::question_begin()`
Formalise the experimental design, i.e. describe, using a table, all
the variables that define what the experiment measures.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r expdex1}
knitr::kable(expd)
```
`r msmbstyle::solution_end()`
Statistical hypothesis testing (chapter \@ref(sec-testing)) will help
us formalise when to call genes differentially expressed.
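As a preview of what such a formalisation might look like, one could,
for example, compare BRCA1 expression between the control and drug
samples with a two-sample t-test, using the simulated `ge` data from
above. This is only a sketch; the reasoning behind such tests is the
topic of chapter \@ref(sec-testing).

```{r ttestpreview}
## Preview: compare BRCA1 expression between the control (CTRL) and
## drug (DRUG) samples of the simulated data above using a
## two-sample t-test.
t.test(expression ~ group, data = subset(ge, gene == "BRCA1"))
```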
## Handling more data
There are many more types of patterns that we would be interested in
identifying in omics data. Let's reconsider the linear regression
model from section \@ref(sec-lm). Below is a figure from
@Majumder2013. Among the 20 scatter plots below, one represents the
actual data, and 19 are simulations that are based on the real
data.
```{r regsim, echo = FALSE, fig.cap = "One of these plots is the plot of the actual data, and the remaining are null plots, produced by simulating data from a null model that assumes no effect ($H_{0}$ is true).", fig.fullwidth = TRUE}
knitr::include_graphics("./figs/uasa_a_808157_o_f0004g.png")
```
`r msmbstyle::question_begin()`
- Which plot is the most different from the others, in the sense that
  it shows the steepest slope?
- Do you think this is the real data? If it is not, what would you
conclude as to whether the trend in the real data is relevant or
not?
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
Imagine that instead of 20 correlations, we had thousands of them!
How would you address that challenge?
`r msmbstyle::question_end()`
An additional complication with large data is the appearance of
spurious positive results, called **false positives**. In the example
below, we calculate the [Pearson correlation
coefficient](https://en.m.wikipedia.org/wiki/Pearson_correlation_coefficient)
between two vectors $x$ and $y$:
$$r_{xy} = \frac{\sum_{i = 1}^{n}(x_{i} - \bar x)(y_{i} - \bar y)}{\sqrt{\sum_{i = 1}^{n}(x_{i} - \bar x)^{2}} \sqrt{\sum_{i = 1}^{n}(y_{i} - \bar y)^{2}}}$$
where $n$ is the length of the vectors, $x_{i}$ and $y_{i}$ are the
$i^{th}$ elements of the $x$ and $y$ vectors, and $\bar x$ and $\bar y$
are their respective means.
A correlation coefficient ranges from -1 (for perfectly
anti-correlated vectors) to 1 (for perfectly correlated
vectors). Non-correlated data have a correlation coefficient close
to 0.
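As a sanity check, the formula above can be computed directly and
compared to the result of R's `cor` function. Below is a minimal
illustration with two short, made-up vectors `u` and `v`:

```{r pearsonbyhand}
## Compute the Pearson correlation coefficient 'by hand' and compare
## it to the built-in cor() function (u and v are arbitrary vectors).
u <- c(1, 3, 5, 7, 11)
v <- c(2, 5, 4, 9, 10)
r_manual <- sum((u - mean(u)) * (v - mean(v))) /
    (sqrt(sum((u - mean(u))^2)) * sqrt(sum((v - mean(v))^2)))
r_manual
cor(u, v)
```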
```{r corfig, fig.cap = "Example of correlation, anti-correlation and lack thereof.", echo = FALSE, message = FALSE, fig.width = 8}
x <- rnorm(60)
y1 <- x + rnorm(60, 0, 0.3)
y2 <- -y1
y3 <- rnorm(60)
df <- data.frame(x = c(x, x, x),
y = c(y1, y2, y3),
type = rep(c(paste0("correlation (coefficient ", round(cor(x, y1), 2), ")"),
paste0("anti-correlation (coefficient ", round(cor(x, y2), 2), ")"),
paste0("no correlation (coefficient ", round(cor(x, y3), 2), ")")),
each = 60))
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm") +
facet_grid(. ~ type)
```
`r msmbstyle::question_begin()`
Generate two random vectors `x` and `y` of length 60 using
`rnorm`. Plot them, add a linear regression line, find the slope of
the regression line, and calculate their correlation coefficient using
the `cor` function.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r corex1, message = FALSE}
x <- rnorm(60)
y <- rnorm(60)
## fit linear model
fit <- lm(y ~ x)
## model coefficients (the slope is the coefficient of x)
fit
## correlation
cor(x, y)
## with base graphics
plot(x, y)
abline(fit)
## or with ggplot2
ggplot(data.frame(x, y),
aes(x, y)) +
geom_point() + geom_smooth(method = "lm")
```
`r msmbstyle::solution_end()`
Let's expand the code above and calculate the correlation coefficients
between one random vector and hundreds of others, keeping only the
best one (in absolute terms). Let's repeat this procedure 1000 times
to obtain 1000 best correlations (based on @Fan:2014). The figure
below shows the distribution of these 1000 best correlations when
choosing the best among 800 or 6400 random vectors. We see that with
more (random) data, we increase the risk of obtaining spurious
correlations.
```{r spurcor, cache = TRUE, fig.cap = "Illustration of spurious correlation. Distribution of the maximum absolute sample correlation coefficients between a vector and 800 or 6400 others. Based on 1000 repetitions.", echo=FALSE}
## Generated with https://gist.github.com/lgatto/d8eb59effb815996eac789debb51690d
knitr::include_graphics("./figs/spurcor.png")
```
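The exact code used to produce the figure is available at the link in
the chunk above; a scaled-down sketch of the procedure, with
arbitrary (and much smaller) choices for the vector length, the
number of candidate vectors and the number of repetitions, could look
like this:

```{r spurcorsketch, cache = TRUE}
## Scaled-down sketch of the spurious correlation simulation: for
## each repetition, correlate one random vector with p others and
## record the largest absolute correlation (n and p are arbitrary).
set.seed(123)
max_abs_cor <- function(n, p) {
    x <- rnorm(n)
    others <- matrix(rnorm(n * p), ncol = p)
    max(abs(cor(x, others)))
}
best_cors <- replicate(100, max_abs_cor(n = 60, p = 200))
summary(best_cors)
hist(best_cors, xlab = "Maximum absolute correlation")
```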
`r msmbstyle::question_begin()`
Calculate 100 correlation coefficients for pairs of random vectors of
length 10. What are the mean, maximum and minimum of your simulated
coefficients? Out of those 100 coefficients, how many would you
consider of interest if you didn't know they were produced by random
data?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r corex2}
set.seed(42)
rs <- replicate(100, cor(rnorm(10), rnorm(10)))
summary(rs)
## Assuming a totally arbitrary value of 0.75, we would get
rs[abs(rs) > .75]
hist(rs)
rug(rs)
abline(v = rs[abs(rs) > .75], col = "red", lty = "dotted")
```
`r msmbstyle::solution_end()`
## What's an outlier?
We are often interested in identifying outliers, i.e. points that are
different from the others. One way to define *different* is to
measure distances between all the samples, or from each sample to a
virtual *average sample*. Let's try this ourselves for a dataset of 5
samples that have each been characterised by measuring the expression
of genes $x$ and $y$.
```{r dist0, fig.cap = "A simulated dataset composed of 5 samples."}
set.seed(111)
samples <- data.frame(x = rnorm(5, 5, 1),
y = rnorm(5, 5, 1))
plot(samples, cex = 3)
text(samples$x, samples$y, 1:5)
```
Visually, we could consider sample 4 to be an outlier. Let's start by
calculating all pairwise [Euclidean
distances](https://en.wikipedia.org/wiki/Euclidean_distance) with the
`dist` function.
```{r dist1}
dist(samples)
```
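As a reminder, the Euclidean distance between two samples is the
square root of the sum of the squared differences between their
coordinates. We can verify this manually against the corresponding
entry of the matrix above:

```{r euclbyhand}
## Euclidean distance between samples 1 and 2, computed manually; it
## should match the corresponding entry of dist(samples) above.
sqrt(sum((samples[1, ] - samples[2, ])^2))
```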
`r msmbstyle::question_begin()`
- Familiarise yourself with the Euclidean distance.
- What other distances can be calculated with the `dist` function?
- What can you say about our assumption that sample 4 is an outlier
given the distance matrix above?
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
An alternative approach would be to calculate the distance of each
sample to an average sample, represented by a red point on the figure
below.
```{r avgsample, echo = FALSE, fig.cap = "The average sample is shown as a red point among the 5 samples simulated above."}
avg_sample <- colMeans(samples)
plot(samples, cex = 3)
text(samples$x, samples$y, 1:5)
points(avg_sample["x"], avg_sample["y"], pch = 19, col = "red", cex = 2)
```
- Compute such an average sample and visualise it as suggested above.
- Calculate the distance between that average sample and each of the 5
real data points.
- Verify that sample 4 is still an outlier.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r dist2}
avg_sample <- colMeans(samples)
plot(samples, cex = 3)
text(samples$x, samples$y, 1:5)
points(avg_sample["x"], avg_sample["y"], pch = 19, col = "red", cex = 2)
dists <- dist(rbind(samples, avg_sample))
## convert the `dist` object to a matrix that we can subset easily
dists <- as.matrix(dists)
dists[6, ]
which.max(dists[6, ])
```
`r msmbstyle::solution_end()`
The concept of an outlier is very intuitive, but becomes murky when the
number of dimensions becomes very high.
## The curse of dimensionality
There is one very well documented feature of high-dimensional data,
known as the curse of dimensionality. When the number of dimensions
increases, data become increasingly sparse, distances and densities
become difficult to define meaningfully, and more and more points
become outliers.
Below, we demonstrate how the number of outliers increases with the
number of dimensions ([source](https://lgatto.github.io/curse-dimensionality/)):
> The simulation generates N points in hypercubes of d dimensions and
> counts how many of those points are outliers, defined as being in
> the 1% outer shell of the space. In 1 dimension, 1% of the uniformly
> distributed points are outliers. In 50 dimensions, already 60% of
> the points are outliers; in 100 dimensions, almost 90% of the
> points are outliers.
```{r curse, cache = FALSE}
d <- 1:200   ## number of dimensions
N <- 1000    ## number of points per dimension setting
## for each dimension d, generate N points uniformly in the unit hypercube
dat <- lapply(d, function(i) replicate(i, runif(N)))
## proportion of points falling in the 1% outer shell of the hypercube
nout <- function(m) mean(apply(m, 1, function(i) any(i < 0.01 | i > 0.99)))
res <- sapply(dat, nout)
plot(res, type = "l",
xlab = "Number of dimensions",
ylab = "Proportion of 1% outliers",
main = "Curse of Dimensionality")
grid()
```
## Machine learning
The dataset below is composed of 100 data points (samples). For each
data point, we have measured values x and y (two genes). We have a
dataset of 100 points in 2 dimensions.
```{r unlabex, fig.cap = "A dataset of 100 unlabelled points measured along two dimensions.", echo = FALSE}
set.seed(123)
dat <- rbind(mvtnorm::rmvnorm(50, c(0, 0)),
mvtnorm::rmvnorm(50, c(3, 3)))
dat <- as.data.frame(dat)
colnames(dat) <- c("x", "y")
plot(dat)
```
`r msmbstyle::question_begin()`
- How many groups do you think there are among these 100 observations?
- How would you intuitively define them?
`r msmbstyle::question_end()`
Now imagine that we know the number of groups (which would be called
classes in this setting) and the label of each point. In the figure
below, we have two classes, red and black, and we know exactly which
points belong to the red class and which to the black class.
```{r labex, fig.cap = "A dataset of 100 labelled points measured along two dimensions.", echo = FALSE}
dat$class <- factor(rep(c("A", "B"), each = 50))
plot(dat[, 1:2], col = dat$class, pch = 19)
```
We now obtain data for three new points (samples), annotated as 1, 2,
and 3 below.
```{r labex2, fig.cap = "A dataset of 100 labelled points and three new, unlabelled, points.", echo = FALSE}
plot(dat[, 1:2], col = dat$class, pch = 19)
new_points <- data.frame(x = c(0, 4, 1.65),
y = c(0, 5, 1.7))
points(new_points$x, new_points$y, cex = 2.5)
text(new_points$x, new_points$y, 1:3)
```
`r msmbstyle::question_begin()`
To what class (red or black) should points 1, 2, and 3 be assigned?
`r msmbstyle::question_end()`
The examples presented above fall squarely into the discipline of
machine learning. The first example is a case of **unsupervised
learning**. In unsupervised learning, we aim to find patterns in a
dataset without the help of any additional information or
labels. Typical applications of unsupervised learning are clustering
(as in our first example and chapter \@ref(sec-ul)) and dimensionality
reduction (chapter \@ref(sec-dimred)), as sketched below.
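As an illustration, here is a minimal clustering sketch applied to the
simulated `dat` object above, using k-means with two groups (two is an
assumption based on the visualisation, not something the algorithm is
told by labels):

```{r kmeanssketch}
## Cluster the 100 unlabelled points into two groups with k-means
## (centers = 2 is an assumption based on the scatter plot above).
set.seed(1)
km <- kmeans(dat[, c("x", "y")], centers = 2)
table(km$cluster)
plot(dat[, c("x", "y")], col = km$cluster, pch = 19)
```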
In our second example, we were given labels for each observation. This
is an example of **supervised learning**, which aims at learning a
mapping from inputs to target outputs with the help of known
labels. Examples of supervised learning are classification, as in the
example above (see chapter \@ref(sec-sl)), and regression, where the
target isn't a category but a continuous value.
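For example, a simple classifier such as k-nearest neighbours could
assign the three new points to one of the two classes. Below is a
minimal sketch using the `class::knn` function and the simulated
`dat` and `new_points` objects from above; the choice of k = 5
neighbours is arbitrary:

```{r knnsketch}
## Predict the class of the three new points from their 5 nearest
## labelled neighbours (k = 5 is an arbitrary choice).
class::knn(train = dat[, c("x", "y")],
           test  = new_points,
           cl    = dat$class,
           k     = 5)
```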
There are additional types of machine learning, such as
semi-supervised learning (combining the two approaches above),
self-supervised learning (supervised learning without explicit
labels, i.e. where the labels need to be derived automatically from
the data) and reinforcement learning (where the learning is tuned
automatically as feedback is received), but these are beyond the
scope of this course.