# Introduction to statistical machine learning {#sec-mlintro}
```{r env_ml, echo = FALSE, message = FALSE}
library("ggplot2")
```
The next chapters will focus on concepts from statistics (hypothesis
testing in chapter \@ref(sec-testing)) and machine learning (chapters
\@ref(sec-ul), \@ref(sec-dimred) and \@ref(sec-sl)). Before diving
into the technical details, it is useful to learn (or remind
ourselves) why these techniques are so important when analysing
(i.e. looking to understand) high throughput biomedical data. The
goals of these chapters are to

- understand what these techniques do, how to apply them, and what
  their respective limitations are;
- learn to frame questions more accurately in the light of the
  available analysis techniques;
- be in a position to formalise a research question and define what
  data is needed before generating it;
- ultimately, extract meaningful results from a dataset.
## Hypothesis testing
Let's start with the following experiment. Researchers are interested
in the expression of three genes,
[A1CF](https://www.genecards.org/cgi-bin/carddisp.pl?gene=A1CF),
[BRCA1](https://www.genecards.org/cgi-bin/carddisp.pl?gene=BRCA1) and
[TP53](https://www.genecards.org/cgi-bin/carddisp.pl?gene=TP53) in the
absence and presence of a certain drug in a model cell line. The
expression of these genes is measured in four replicates under each condition.
```{r, echo = FALSE, fig.cap = "Distributions of the expression of the genes A1CF, BRCA1 and TP53 under the control (no drug) and drug at concentrations 1 and 5.", fig.width = 8}
expd <- ge <- data.frame(sample = paste0("S", 1:12),
group = rep(c("CTRL", "DRUG", "DRUG"), each = 4),
concentration = factor(rep(c(0, 1, 5), each = 4)),
replicate = rep(1:4, 3),
stringsAsFactors = FALSE)
set.seed(1)
ge$A1CF <- rnorm(12, 6, 2)
ge$BRCA1 <- c(abs(rnorm(4, 2, 1)), rnorm(4, 8, 2), rnorm(4, 13, 2))
ge$TP53 <- c(rnorm(4, 10, 5), rnorm(4, 10, 3), rnorm(4, 10, 2))
ge <- tidyr::pivot_longer(ge,
names_to = "gene",
values_to = "expression",
c(A1CF, BRCA1, TP53))
ggplot(ge, aes(x = concentration, y = expression,
colour = concentration, label = replicate)) +
geom_point() +
facet_grid(. ~ gene) +
theme(legend.position="none") +
ggrepel::geom_text_repel(nudge_x = 0.2, segment.size = 0.5)
```
`r msmbstyle::question_begin()`
Make sure you understand the visualisation above.
1. Which genes would you call differentially expressed, i.e. which
   show a different expression between conditions?
2. What criteria do you rely on to conclude whether the genes are or
aren't differentially expressed?
`r msmbstyle::question_end()`
Now imagine that, instead of 3 genes, we had 20000!
`r msmbstyle::question_begin()`
Formalise the experimental design, i.e. describe, using a table, all
the variables that define what the experiment measures.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r expdex1}
knitr::kable(expd)
```
`r msmbstyle::solution_end()`
Statistical hypothesis testing (chapter \@ref(sec-testing)) will help
us formalise when to call genes differentially expressed.
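As a preview of what such a formalisation might look like, one could,
for example, compare BRCA1 expression between the control and drug
samples with a two-sample t-test, using the simulated `ge` data from
above. This is only a sketch; the reasoning behind such tests is the
topic of chapter \@ref(sec-testing).

```{r ttestpreview}
## Preview: compare BRCA1 expression between the control (CTRL) and
## drug (DRUG) samples of the simulated data above using a
## two-sample t-test.
t.test(expression ~ group, data = subset(ge, gene == "BRCA1"))
```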
## Handling more data
There are many more types of patterns that we would be interested in
identifying in omics data. Let's reconsider the linear regression
model from section \@ref(sec-lm). Below is a figure from
@Majumder2013. Among the 20 scatter plots below, one represents the
actual data, and 19 are simulations that are based on the real
data.
```{r regsim, echo = FALSE, fig.cap = "One of these plots is the plot of the actual data, and the remaining are null plots, produced by simulating data from a null model that assumes no effect ($H_{0}$ is true).", fig.fullwidth = TRUE}
knitr::include_graphics("./figs/uasa_a_808157_o_f0004g.png")
```
`r msmbstyle::question_begin()`
- Which plot is the most different from the others, in the sense that
  it shows the steepest slope?
- Do you think this is the real data? If it is not, what would you
conclude as to whether the trend in the real data is relevant or
not?
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
Imagine that instead of 20 correlations, we had thousands of them!
How would you address that challenge?
`r msmbstyle::question_end()`
An additional complication with large data is the appearance of
spurious positive results, called **false positives**. In the example
below, we calculate the [Pearson correlation
coefficient](https://en.m.wikipedia.org/wiki/Pearson_correlation_coefficient)
between two vectors $x$ and $y$:
$$r_{xy} = \frac{\sum_{i = 1}^{n}(x_{i} - \bar x)(y_{i} - \bar y)}{\sqrt{\sum_{i = 1}^{n}(x_{i} - \bar x)^{2}} \sqrt{\sum_{i = 1}^{n}(y_{i} - \bar y)^{2}}}$$
where $n$ is the length of the vectors, $x_{i}$ and $y_{i}$ are the
$i^{th}$ elements of the $x$ and $y$ vectors, and $\bar x$ and $\bar y$
are their respective means.
A correlation coefficient ranges from -1 (for perfectly
anti-correlated vectors) to 1 (for perfectly correlated
vectors). Non-correlated data have a correlation coefficient close
to 0.
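As a sanity check, the formula above can be computed directly and
compared to the result of R's `cor` function. Below is a minimal
illustration with two short, made-up vectors `u` and `v`:

```{r pearsonbyhand}
## Compute the Pearson correlation coefficient 'by hand' and compare
## it to the built-in cor() function (u and v are arbitrary vectors).
u <- c(1, 3, 5, 7, 11)
v <- c(2, 5, 4, 9, 10)
r_manual <- sum((u - mean(u)) * (v - mean(v))) /
    (sqrt(sum((u - mean(u))^2)) * sqrt(sum((v - mean(v))^2)))
r_manual
cor(u, v)
```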
```{r corfig, fig.cap = "Example of correlation, anti-correlation and lack thereof.", echo = FALSE, message = FALSE, fig.width = 8}
x <- rnorm(60)
y1 <- x + rnorm(60, 0, 0.3)
y2 <- -y1
y3 <- rnorm(60)
df <- data.frame(x = c(x, x, x),
y = c(y1, y2, y3),
type = rep(c(paste0("correlation (coefficient ", round(cor(x, y1), 2), ")"),
paste0("anti-correlation (coefficient ", round(cor(x, y2), 2), ")"),
paste0("no correlation (coefficient ", round(cor(x, y3), 2), ")")),
each = 60))
ggplot(df, aes(x = x, y = y)) +
geom_point() +
geom_smooth(method = "lm") +
facet_grid(. ~ type)
```
`r msmbstyle::question_begin()`
Generate two random vectors `x` and `y` of length 60 using
`rnorm`. Plot them, add a linear regression line, find the slope of
the regression line, and calculate their correlation coefficient using
the `cor` function.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r corex1, message = FALSE}
x <- rnorm(60)
y <- rnorm(60)
## fit linear model
fit <- lm(y ~ x)
## model coefficients (the slope is the coefficient of x)
fit
## correlation
cor(x, y)
## with base graphics
plot(x, y)
abline(fit)
## or with ggplot2
ggplot(data.frame(x, y),
aes(x, y)) +
geom_point() + geom_smooth(method = "lm")
```
`r msmbstyle::solution_end()`
Let's expand the code above and calculate the correlation coefficients
between one random vector and hundreds of others, keeping only the
best one (in absolute terms). Let's repeat this procedure 1000 times
to obtain 1000 best correlations (based on @Fan:2014). The figure
below shows the distribution of these 1000 best correlations when
choosing the best among 800 or 6400 random vectors. We see that with
more (random) data, we increase the risk of obtaining spurious
correlations.
```{r spurcor, cache = TRUE, fig.cap = "Illustration of spurious correlation. Distribution of the maximum absolute sample correlation coefficients between a vector and 800 or 6400 others. Based on 1000 repetitions.", echo=FALSE}
## Generated with https://gist.github.com/lgatto/d8eb59effb815996eac789debb51690d
knitr::include_graphics("./figs/spurcor.png")
```
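The exact code used to produce the figure is available at the link in
the chunk above; a scaled-down sketch of the procedure, with
arbitrary (and much smaller) choices for the vector length, the
number of candidate vectors and the number of repetitions, could look
like this:

```{r spurcorsketch, cache = TRUE}
## Scaled-down sketch of the spurious correlation simulation: for
## each repetition, correlate one random vector with p others and
## record the largest absolute correlation (n and p are arbitrary).
set.seed(123)
max_abs_cor <- function(n, p) {
    x <- rnorm(n)
    others <- matrix(rnorm(n * p), ncol = p)
    max(abs(cor(x, others)))
}
best_cors <- replicate(100, max_abs_cor(n = 60, p = 200))
summary(best_cors)
hist(best_cors, xlab = "Maximum absolute correlation")
```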
`r msmbstyle::question_begin()`
Calculate 100 correlation coefficients for pairs of random vectors of
length 10. What are the mean, maximum and minimum of your simulated
coefficients? Out of those 100 coefficients, how many would you
consider of interest if you didn't know they were produced by random
data?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r corex2}
set.seed(42)
rs <- replicate(100, cor(rnorm(10), rnorm(10)))
summary(rs)
## Assuming a totally arbitrary value of 0.75, we would get
rs[abs(rs) > .75]
hist(rs)
rug(rs)
abline(v = rs[abs(rs) > .75], col = "red", lty = "dotted")
```
`r msmbstyle::solution_end()`
## What's an outlier?
We are often interested in identifying outliers, i.e. points that are
different from the others. One way to define *different* is to
measure distances between all the samples, or from each sample to a
virtual *average sample*. Let's try this ourselves for a dataset of 5
samples that have each been characterised by measuring the expression
of genes $x$ and $y$.
```{r dist0, fig.cap = "A simulated dataset composed of 5 samples."}
set.seed(111)
samples <- data.frame(x = rnorm(5, 5, 1),
y = rnorm(5, 5, 1))
plot(samples, cex = 3)
text(samples$x, samples$y, 1:5)
```
Visually, we could consider sample 4 to be an outlier. Let's start by
calculating all pairwise [Euclidean
distances](https://en.wikipedia.org/wiki/Euclidean_distance) with the
`dist` function.
```{r dist1}
dist(samples)
```
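As a reminder, the Euclidean distance between two samples is the
square root of the sum of the squared differences between their
coordinates. We can verify this manually against the corresponding
entry of the matrix above:

```{r euclbyhand}
## Euclidean distance between samples 1 and 2, computed manually; it
## should match the corresponding entry of dist(samples) above.
sqrt(sum((samples[1, ] - samples[2, ])^2))
```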
`r msmbstyle::question_begin()`
- Familiarise yourself with the Euclidean distance.
- What other distances can be calculated with the `dist` function?
- What can you say about our assumption that sample 4 is an outlier
given the distance matrix above?
`r msmbstyle::question_end()`
`r msmbstyle::question_begin()`
An alternative approach would be to calculate the distance of each
sample to an average sample, represented by a red point on the figure
below.
```{r avgsample, echo = FALSE, fig.cap = "The average sample is shown as a red point among the 5 samples simulated above."}
avg_sample <- colMeans(samples)
plot(samples, cex = 3)
text(samples$x, samples$y, 1:5)
points(avg_sample["x"], avg_sample["y"], pch = 19, col = "red", cex = 2)
```
- Compute such an average sample and visualise it as suggested above.
- Calculate the distance between that average sample and each of the 5
real data points.
- Verify that sample 4 is still an outlier.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r dist2}
avg_sample <- colMeans(samples)
plot(samples, cex = 3)
text(samples$x, samples$y, 1:5)
points(avg_sample["x"], avg_sample["y"], pch = 19, col = "red", cex = 2)
dists <- dist(rbind(samples, avg_sample))
## convert the `dist` object to a matrix that we can subset easily
dists <- as.matrix(dists)
dists[6, ]
which.max(dists[6, ])
```
`r msmbstyle::solution_end()`
The concept of an outlier is very intuitive, but becomes murky when the
number of dimensions becomes very high.
## The curse of dimensionality
There is one very well documented feature of high-dimensional data,
known as the curse of dimensionality. When the number of dimensions
increases, data become increasingly sparse, distances and densities
become difficult to define meaningfully, and more and more points
become outliers.
Below, we demonstrate how the number of outliers increases with the
number of dimensions ([source](https://lgatto.github.io/curse-dimensionality/)):
> The simulation generates N points in hypercubes of d dimensions and
> counts how many of those points are outliers, defined as being in
> the 1% outer shell of the space. In 1 dimension, 1% of the uniformly
> distributed points are outliers. In 50 dimensions, already 60% of
> the points are outliers; in 100 dimensions, almost 90% of the
> points are outliers.
```{r curse, cache = FALSE}
d <- 1:200   ## number of dimensions
N <- 1000    ## number of points per dimension setting
## for each dimension d, generate N points uniformly in the unit hypercube
dat <- lapply(d, function(i) replicate(i, runif(N)))
## proportion of points falling in the 1% outer shell of the hypercube
nout <- function(m) mean(apply(m, 1, function(i) any(i < 0.01 | i > 0.99)))
res <- sapply(dat, nout)
plot(res, type = "l",
xlab = "Number of dimensions",
ylab = "Proportion of 1% outliers",
main = "Curse of Dimensionality")
grid()
```
## Machine learning
The dataset below is composed of 100 data points (samples). For each
data point, we have measured values x and y (two genes). We have a
dataset of 100 points in 2 dimensions.
```{r unlabex, fig.cap = "A dataset of 100 unlabelled points measured along two dimensions.", echo = FALSE}
set.seed(123)
dat <- rbind(mvtnorm::rmvnorm(50, c(0, 0)),
mvtnorm::rmvnorm(50, c(3, 3)))
dat <- as.data.frame(dat)
colnames(dat) <- c("x", "y")
plot(dat)
```
`r msmbstyle::question_begin()`
- How many groups do you think there are among these 100 observations?
- How would you intuitively define them?
`r msmbstyle::question_end()`
Now imagine that we know the number of groups (which would be called
classes in this setting) and the label of each point. In the figure
below, we have two classes, red and black, and we know exactly which
points belong to the red class and which to the black class.
```{r labex, fig.cap = "A dataset of 100 labelled points measured along two dimensions.", echo = FALSE}
dat$class <- factor(rep(c("A", "B"), each = 50))
plot(dat[, 1:2], col = dat$class, pch = 19)
```
We now obtain data for three new points (samples), annotated as 1, 2,
and 3 below.
```{r labex2, fig.cap = "A dataset of 100 labelled points and three new, unlabelled, points.", echo = FALSE}
plot(dat[, 1:2], col = dat$class, pch = 19)
new_points <- data.frame(x = c(0, 4, 1.65),
y = c(0, 5, 1.7))
points(new_points$x, new_points$y, cex = 2.5)
text(new_points$x, new_points$y, 1:3)
```
`r msmbstyle::question_begin()`
To what class (red or black) should points 1, 2, and 3 be assigned?
`r msmbstyle::question_end()`
The examples presented above fall squarely into the discipline of
machine learning. The first example is a case of **unsupervised
learning**. In unsupervised learning, we aim to find patterns in a
dataset without the help of any additional information or
labels. Typical applications of unsupervised learning are clustering
(as in our first example and chapter \@ref(sec-ul)) and dimensionality
reduction (chapter \@ref(sec-dimred)), as sketched below.
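As an illustration, here is a minimal clustering sketch applied to the
simulated `dat` object above, using k-means with two groups (two is an
assumption based on the visualisation, not something the algorithm is
told by labels):

```{r kmeanssketch}
## Cluster the 100 unlabelled points into two groups with k-means
## (centers = 2 is an assumption based on the scatter plot above).
set.seed(1)
km <- kmeans(dat[, c("x", "y")], centers = 2)
table(km$cluster)
plot(dat[, c("x", "y")], col = km$cluster, pch = 19)
```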
In our second example, we were given labels for each observation. This
is an example of **supervised learning**, which aims at learning a
mapping from inputs to target outputs with the help of known
labels. Examples of supervised learning are classification, as in the
example above (see chapter \@ref(sec-sl)), and regression, where the
target isn't a category but a continuous value.
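For example, a simple classifier such as k-nearest neighbours could
assign the three new points to one of the two classes. Below is a
minimal sketch using the `class::knn` function and the simulated
`dat` and `new_points` objects from above; the choice of k = 5
neighbours is arbitrary:

```{r knnsketch}
## Predict the class of the three new points from their 5 nearest
## labelled neighbours (k = 5 is an arbitrary choice).
class::knn(train = dat[, c("x", "y")],
           test  = new_points,
           cl    = dat$class,
           k     = 5)
```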
There are additional types of machine learning, such as
semi-supervised learning (combining the two approaches above),
self-supervised learning (supervised learning without explicit
labels, i.e. where the labels need to be derived automatically from
the data) and reinforcement learning (where the learning is tuned
automatically as feedback is received), but these are beyond the
scope of this course.