# Unsupervised learning: clustering {#sec-ul}
## Introduction
After learning about dimensionality reduction and PCA, in this chapter
we will focus on **clustering**. The goal of clustering algorithms is
to find homogeneous subgroups within the data; the grouping is based
on similarities (or distances) between observations. The result of a
clustering algorithm is a partition of the observations (or features)
into distinct, generally non-overlapping, groups. Even if imperfect
(i.e. ignoring intermediate states), these groups are often very
useful for the interpretation and assessment of the data.
In this chapter we will:

- Introduce clustering methods.
- Learn how to use a widely used non-parametric clustering algorithm,
  **k-means**.
- Learn how to use recursive clustering approaches known as
  **hierarchical clustering**.
- Observe the influence of clustering parameters and distance metrics
  on the outputs.
- Provide a real-life example of how to apply clustering to omics data.
Clustering is a widely used technique in many areas of omics data
analysis. The article *How does gene expression clustering work?*
[@Dhaeseleer:2005] is a useful read to put the following chapter
into context.
## How to measure similarities?
Clustering algorithms rely on a measure of similarity between
features to group them into clusters. An important step is thus to
decide how to measure these similarities. A range of distance
measures is available:
The **Euclidean** distance between two points $A = (a_{1}, \ldots , a_{p})$ and
$B = (b_{1}, \ldots , b_{p})$ in a $p$-dimensional space (for the $p$ features)
is the square root of the sum of squares of the differences in all
$p$ coordinate directions:
$$ d(A, B) = \sqrt{ (a_{1} - b_{1})^{2} + (a_{2} - b_{2})^{2} + \ldots + (a_{p} - b_{p})^{2} } $$
The **Minkowski** distance allows the exponent to be $m$ instead of 2, as in the
Euclidean distance:

$$ d(A, B) = ( | a_{1} - b_{1} |^{m} + | a_{2} - b_{2} |^{m} + \ldots + | a_{p} - b_{p} |^{m} )^{\frac{1}{m}} $$
The **maximum** (or $L_{\infty}$) distance is the maximum of the absolute differences
between coordinates:

$$ d(A, B) = \max_{i} | a_{i} - b_{i} | $$
The **Manhattan** (or City Block, Taxicab or $L_{1}$) distance takes the sum of the
absolute differences in all coordinates:

$$ d(A, B) = | a_{1} - b_{1} | + | a_{2} - b_{2} | + \ldots + | a_{p} - b_{p} | $$
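As a quick sanity check (this chunk is an addition to the text, using
two made-up points `A` and `B`), the formulas above can be compared
with what the `dist()` function, used throughout this chapter,
computes through its `method` argument:

```{r}
## Two made-up points in 3 dimensions to check the formulas above.
A <- c(1, 2, 3)
B <- c(4, 6, 3)
sqrt(sum((A - B)^2))                           ## Euclidean, by hand: 5
dist(rbind(A, B))                              ## same value
dist(rbind(A, B), method = "minkowski", p = 3) ## Minkowski with m = 3
dist(rbind(A, B), method = "maximum")          ## max(abs(A - B)): 4
dist(rbind(A, B), method = "manhattan")        ## sum(abs(A - B)): 7
```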
When computing a **binary** distance, the vectors are regarded as
binary bits, with non-zero elements *on* and zero elements
*off*. The distance is the proportion of bits in which only one is *on*
amongst those in which at least one is *on*. For example, the binary
distance between the vectors `x` and `y` below is 0.4 because, out of
the 5 pairs of bits that have at least one *on* bit (all but the
second pair), 2 have only one *on* bit (the first and fifth).
```{r}
x <- c(0, 0, 1, 1, 1, 1)
y <- c(1, 0, 1, 1, 0, 1)
```
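As a check (this chunk is an addition, not part of the original
exercise), `dist()` with `method = "binary"` returns exactly this
proportion:

```{r}
## 2 of the 5 pairs with at least one "on" bit differ: 2/5 = 0.4
dist(rbind(x, y), method = "binary")
```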
The **Canberra** distance computes the sum of the absolute
differences between elements, each divided by the sum of their
absolute values:

$$ d(x, y) = \sum_{i} \frac{ | x_{i} - y_{i} | }{ | x_{i} | + | y_{i} | } $$
**Correlation-based distance** uses the Pearson correlation (see the
`cor` function and chapter \@ref(sec-mlintro)) or its absolute value
and transforms it into a distance metric:

$$ d(A, B) = 1 - cor(A, B) $$
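To illustrate why this metric is useful (the profiles below are made
up for this sketch), two vectors with the same shape but different
scales have a correlation-based distance of 0, even though their
Euclidean distance is large:

```{r}
## Two made-up profiles with identical shapes but different scales.
a <- c(1, 2, 3, 4, 5)
b <- 10 * a
1 - cor(a, b)        ## 0: same shape, hence "close"
sqrt(sum((a - b)^2)) ## large Euclidean distance
```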
These are some of the distances available in the `dist` function, or
in the `bioDist` package. See also section 5.3 from [@MSMB] for
additional details.
`r msmbstyle::question_begin()`
Generate the 5 data points along 2 dimensions as illustrated below and
calculate all their Euclidean pairwise distances using
`dist()`. Interpret the output of the function. What class is it?
```{r, fig.cap = "A simulated dataset composed of 5 samples."}
set.seed(111)
samples <- data.frame(x = rnorm(5, 5, 1),
                      y = rnorm(5, 5, 1))
plot(samples, cex = 3)
text(samples$x, samples$y, 1:5)
```
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r}
dist(samples)
class(dist(samples))
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Calculate the position of the *average sample* and visualise it on the
figure above. Which samples are respectively the closest and furthest
from the *average sample*?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r, fig.cap = "Visualisation of the average sample, in red."}
avg_sample <- colMeans(samples)
plot(samples, cex = 3)
text(samples$x, samples$y, 1:5)
points(avg_sample["x"], avg_sample["y"], pch = 19, col = "red", cex = 2)
```
```{r}
(dists <- dist(rbind(samples, avg_sample)))
dists <- as.matrix(dists)
d6 <- dists[6, -6]
which.min(d6) ## closest
which.max(d6) ## furthest
```
`r msmbstyle::solution_end()`
## k-means clustering
The k-means clustering algorithm[^learnkmean] aims at partitioning *n*
observations into a fixed number of *k* homogeneous clusters.
[^learnkmean]: We will learn how the algorithm works below.
In R, we use
```{r, eval = FALSE}
stats::kmeans(x, centers = 3, nstart = 10)
```
where
- `x` is a numeric data matrix
- `centers` is the pre-defined number of clusters
- the k-means algorithm has a random component and can be repeated
`nstart` times to improve the returned model
To learn about k-means, let's use the `giris` dataset, which provides
the expression of 4 genes in 150 patients grouped in categories A, B
or C.
```{r girisplot, fig.cap = "The *giris* dataset describes 150 patients by the expression of 4 genes. Each patient was assigned a grade A, B or C."}
library("rWSBIM1322")
data(giris)
pairs(giris[, 1:4], col = giris$GRADE)
```
`r msmbstyle::question_begin()`
Run the k-means algorithm on the `giris` data, save the results in a
new variable `cl`, and explore its output when printed. What is the
class of `cl`?
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r solkmcl1}
cl <- kmeans(giris[, -5], 3, nstart = 10)
class(cl)
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
The actual results of the algorithm, i.e. the cluster memberships, can
be accessed in the `cluster` element of the clustering result
output. Extract and interpret these data.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r solkmcl2}
cl$cluster
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Produce 2 PCA plots, colouring one with the original grade, and the
other one with the results of the clustering algorithm. Compare the
two plots.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r solkmcl3, message = FALSE, fig.cap = "PCA analysis of the `giris` data, highlighting the original groups (top) and the clustering results (bottom).", fig.width = 10, fig.height = 8, fig.fullwidth = TRUE}
pca <- prcomp(giris[, -5], scale = TRUE, center = TRUE)
summary(pca)
library("factoextra")
library("patchwork")
p1 <- fviz_pca(pca, habillage = giris$GRADE)
p2 <- fviz_pca(pca, habillage = cl$cluster)
p1 / p2
```
`r msmbstyle::solution_end()`
`r msmbstyle::question_begin()`
Compare the categories and clustering results and count the number of
matches/mismatches.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r solkmcl4}
table(giris$GRADE, cl$cluster)
```
`r msmbstyle::solution_end()`
### How does k-means work {-}
The algorithm starts from a matrix of *n* observations described by
*p* features and proceeds as follows:

**Initialisation**: randomly assign the *n* observations to *k* clusters.
```{r kmworksinit, fig.cap="k-means random initialisation", echo = FALSE, fig.width = 9, fig.height = 4}
set.seed(12)
x <- data.frame(pca$x[, 1:2])
x$init <- sample(LETTERS[1:3], nrow(x), replace = TRUE)
p1 <- ggplot(x, aes(x = PC1, y = PC2, colour = init)) +
geom_point() +
labs(title = "Initialisation")
p1
```
**Iteration**:
1. Calculate the centre of each subgroup as the average position of
   all observations in that subgroup.
2. Each observation is then assigned to the group of its nearest
   centre.
It's also possible to stop the algorithm after a certain number of
iterations, or once the centres move less than a certain distance.
```{r kmworksiter, fig.cap="k-means iteration: calculate centres (top) and assign new cluster membership (bottom)", fig.width = 10, fig.height = 8, fig.fullwidth = TRUE, echo = FALSE}
library("tidyverse")
centres <- x %>%
group_by(init) %>%
summarise(PC1 = mean(PC1),
PC2 = mean(PC2)) %>%
dplyr::select(PC1, PC2, init)
p2 <- p1 +
labs(title = "Initialisation with centres") +
geom_point(data = centres, aes(x = PC1, y = PC2, col = init),
shape = 1, size = 5) +
theme(legend.position = "none")
tmp <- as.matrix((rbind(centres[, 1:2], x[, 1:2])))
tmp <- dist(tmp)
tmp <- as.matrix(tmp)[, 1:3]
ki <- apply(tmp, 1, which.min)
ki <- LETTERS[1:3][ki]
x$iter <- ki[-(1:3)]
p3 <- ggplot(x, aes(x = PC1, y = PC2, colour = iter)) +
labs(title = "Iteration 1") +
geom_point() +
geom_point(data = centres, aes(x = PC1, y = PC2, col = init),
shape = 1, size = 5) +
theme(legend.position = "none")
p2 / p3
```
**Termination**: Repeat the iteration steps until no point changes its
cluster membership.
```{r, results='markup', fig.cap="k-means convergence (credit Wikipedia).", echo=FALSE, purl=FALSE, out.width='100%', fig.align='center'}
knitr::include_graphics("https://upload.wikimedia.org/wikipedia/commons/e/ea/K-means_convergence.gif")
```
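To make these steps concrete, here is a minimal, illustrative
implementation of the iteration above. It is only a sketch (the
function name `naive_kmeans` is made up and the empty-cluster edge
case is ignored); in practice, use the optimised `stats::kmeans()`.

```{r, eval = FALSE}
## A bare-bones sketch of the k-means algorithm described above.
naive_kmeans <- function(x, k, max_iter = 100) {
    x <- as.matrix(x)
    ## Initialisation: randomly assign the n observations to k clusters
    cl <- sample(k, nrow(x), replace = TRUE)
    for (i in seq_len(max_iter)) {
        ## 1. Calculate the centre of each cluster
        centres <- apply(x, 2, tapply, cl, mean)
        ## 2. Assign each observation to its nearest centre
        d <- as.matrix(dist(rbind(centres, x)))[-(1:k), 1:k]
        new_cl <- apply(d, 1, which.min)
        ## Termination: stop when no point changes its cluster
        if (all(new_cl == cl)) break
        cl <- new_cl
    }
    cl
}
## Example: table(naive_kmeans(giris[, -5], k = 3), giris$GRADE)
```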
### Model selection
Due to the random initialisation, one can obtain different clustering
results.
```{r, fig.cap = "Different k-means results on the same `girsi` data", fig.width = 10, fig.height = 8, fig.fullwidth = TRUE, echo = FALSE}
set.seed(211)
cl1 <- kmeans(giris[, -5], centers = 3, nstart = 1)
cl2 <- kmeans(giris[, -5], centers = 3, nstart = 1)
## table(cl1$cluster, cl2$cluster)
x <- data.frame(pca$x[, 1:2])
x$kmeans_1 <- LETTERS[1:3][cl1$cluster]
x$kmeans_2 <- LETTERS[1:3][cl2$cluster]
p1 <- ggplot(x, aes(x = PC1, PC2, colour = kmeans_1)) +
geom_point()
p2 <- ggplot(x, aes(x = PC1, PC2, colour = kmeans_2)) +
geom_point()
p1 / p2
```
When k-means is run multiple times (as set by `nstart`), the best
outcome, i.e. the one that generates the smallest *total within
cluster sum of squares (SS)*, is selected. For each clustering
result, the total within SS is calculated as follows:

- for each observation, determine the squared Euclidean distance to
  the centre of its cluster;
- sum all these distances.

Note that k-means only reaches a **local minimum**; there is no
guarantee of obtaining the global minimum.
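The sketch below (an addition to the text, assuming the `cl` result
computed earlier) recomputes the total within SS by hand and compares
it to the value that `kmeans()` returns in its `tot.withinss` element.

```{r}
## Recompute the total within-cluster sum of squares by hand.
tot_ss <- sum(sapply(1:3, function(i) {
    xi <- giris[cl$cluster == i, -5] ## members of cluster i
    centred <- scale(xi, center = colMeans(xi), scale = FALSE)
    sum(centred^2)                   ## within SS of cluster i
}))
tot_ss
cl$tot.withinss ## same value
```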
### How to determine the number of clusters
1. Run k-means with `k=1`, `k=2`, ..., `k=n`
2. Record total within SS for each value of k.
3. Choose k at the *elbow* position, as illustrated below.
```{r kmelbow, echo=FALSE, fig.cap = "Total within sum of squared distances for different values of *k*."}
ks <- data.frame(k = 1:5)
ks$tot_within_ss <- sapply(ks$k, function(k) {
cl <- kmeans(giris[, -5], k, nstart = 10)
cl$tot.withinss
})
ggplot(ks, aes(x = k, y = tot_within_ss)) +
geom_line() + geom_point()
```
`r msmbstyle::question_begin()`
Calculate the total within sum of squares for k from 1 to 5 for our
`giris` test data, and reproduce the figure above.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r solkmelbow}
ks <- 1:5
tot_within_ss <- sapply(ks, function(k) {
    cl <- kmeans(giris[, -5], k, nstart = 10)
    cl$tot.withinss
})
plot(ks, tot_within_ss, type = "b")
```
`r msmbstyle::solution_end()`
See also this interactive [K-Means Clustering Explorable
Explainer](https://k-means-explorable.vercel.app/) to help you
understand the algorithm, its parameters and dig into some additional
details.
## Hierarchical clustering
### How does hierarchical clustering work
**Initialisation**: start by assigning each of the *n* points to its own cluster.

**Iteration**:

1. Find the two nearest clusters and join them together, leading to
   *n*-1 clusters.
2. Continue the cluster merging process until all points are grouped
   into a single cluster.

**Termination:** all observations are grouped within a single cluster.
```{r hcldata, fig.width = 12, echo=FALSE, fig.cap = "Hierarchical clustering: initialisation (left) and colour-coded results after iteration (right)."}
set.seed(42)
xr <- data.frame(x = rnorm(5),
                 y = rnorm(5))
cls <- c("red", "blue", "orange", "blue", "orange")
cls <- scales::col2hcl(cls, alpha = 0.5)
par(mfrow = c(1, 2))
plot(xr, cex = 3)
text(xr$x, xr$y, 1:5)
plot(xr, cex = 3, col = cls, pch = 19)
text(xr$x, xr$y, 1:5)
```
The results of hierarchical clustering are typically visualised along
a **dendrogram**[^dendro], where the distance between the clusters is
proportional to the branch lengths.
[^dendro]: Note that dendrograms, or trees in general, are used in
evolutionary biology to visualise the evolutionary history of
taxa. In this course, we use them to represent similarities (based
on distances) and don't assess any evolutionary relations between
features or samples.
```{r hcldendro, echo=FALSE, fig.cap = "Visualisation of the hierarchical clustering results on a dendrogram"}
plot(hcr <- hclust(dist(xr)))
```
In R:
- Calculate the distance using `dist`, typically the Euclidean
distance.
- Apply hierarchical clustering on this distance matrix using
`hclust`.
`r msmbstyle::question_begin()`
Apply hierarchical clustering on the `giris` data and generate a
dendrogram using the dedicated `plot` method.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r hclsol, fig.cap = ""}
d <- dist(giris[, -5])
hcl <- hclust(d)
hcl
plot(hcl)
```
`r msmbstyle::solution_end()`
The `method` argument to `hclust` defines how the clusters are grouped
together, which depends on how distances between a cluster and an
observation (for example *(3, 5)* and *1* above), or between two
clusters (for example *(2, 4)* and *(1, (3, 5))* above), are computed.
The **complete linkage** (default) method defines the distance
between two clusters as the largest distance between any two
observations in them, while the **single linkage** (also called
nearest neighbour) method uses the smallest distance between
observations. The **average linkage** method is halfway between the
two. **Ward's** method takes an analysis of variance approach, and
aims at minimising the variance within clusters.
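As an illustration (this chunk is an addition, reusing the small `xr`
example above), the complete and single linkage distances between the
cluster *(3, 5)* and observation *1* can be computed by hand:

```{r}
dm <- as.matrix(dist(xr))
max(dm[1, c(3, 5)]) ## complete linkage: largest pairwise distance
min(dm[1, c(3, 5)]) ## single linkage: smallest pairwise distance
```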
See [section 5.6](https://www.huber.embl.de/msmb/Chap-Clustering.html#hierarchical-clustering)
of *Modern Statistics for Modern Biology* [@MSMB] for further details.
`r msmbstyle::question_begin()`
Using the `giris` example, compare the linkage methods presented above.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r linkcomp, fig.cap = "Effect of using different linkage methods.", fig.fullwidth = TRUE, fig.width = 8, fig.height = 8}
d <- dist(giris[, -5])
hcl_complete <- hclust(d, method = "complete")
hcl_single <- hclust(d, method = "single")
hcl_average <- hclust(d, method = "average")
hcl_ward <- hclust(d, method = "ward.D")
par(mfrow = c(2, 2))
plot(hcl_complete)
plot(hcl_single)
plot(hcl_average)
plot(hcl_ward)
```
`r msmbstyle::solution_end()`
### Defining clusters
After producing the hierarchical clustering result, we need to *cut
the tree (dendrogram)* at a specific height to define the
clusters. For example, on our test dataset above, we could decide to
cut it at a height of around 1.5, which would produce 2 clusters.
```{r cuthcl, echo=FALSE, fig.cap = "Cutting the dendrogram at height 1.5."}
plot(hcl)
abline(h = 1.5, col = "red")
```
In R we can use the `cutree` function to
- cut the tree at a specific height: `cutree(hcl, h = 1.5)`
- cut the tree to get a certain number of clusters: `cutree(hcl, k = 3)`
`r msmbstyle::question_begin()`
- Cut the `giris` hierarchical clustering result at a height that
  produces 3 clusters by setting `h`.
- Cut the `giris` hierarchical clustering result to obtain 3 clusters
  by setting `k` directly, and verify that both approaches provide
  the same results.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r cuthclsol}
plot(hcl)
abline(h = 3.9, col = "red")
cutree(hcl, k = 3)
cutree(hcl, h = 3.9)
identical(cutree(hcl, k = 3), cutree(hcl, h = 3.9))
```
`r msmbstyle::solution_end()`
## Pre-processing
Many of the machine learning methods that are regularly used are
sensitive to differences in scale. This applies to unsupervised
methods as well as supervised methods.
A typical way to pre-process the data prior to learning is to scale
the data (see chapter \@ref(sec-norm)), or apply principal component
analysis (see chapter \@ref(sec-dimred)).
In R, scaling is done with the `scale` function.
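The short chunk below (an addition to the text, using a small made-up
matrix) shows what `scale` does by default: it centres each column to
mean 0 and divides it by its standard deviation.

```{r}
m <- cbind(a = rnorm(10, mean = 5, sd = 1),
           b = rnorm(10, mean = 100, sd = 50))
s <- scale(m)
round(colMeans(s), 10) ## column means are (close to) 0
apply(s, 2, sd)        ## column standard deviations are 1
```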
`r msmbstyle::question_begin()`
Using the `mtcars` data as an example, verify that the variables are
of different scales, then scale the data. To observe the effect of
the different scales, compare the hierarchical clusters obtained on
the original and scaled data.
`r msmbstyle::question_end()`
`r msmbstyle::solution_begin()`
```{r scalesol, fig.width=12, fig.cap=""}
boxplot(mtcars)
hcl1 <- hclust(dist(mtcars))
hcl2 <- hclust(dist(scale(mtcars)))
par(mfrow = c(1, 2))
plot(hcl1, main = "original data")
plot(hcl2, main = "scaled data")
```
`r msmbstyle::solution_end()`
## Additional exercises
`r msmbstyle::question_begin()`
- Load the small `g3` data set that provides expression profiles of
genes 1 to 3 in samples 1 to 5 and visualise these.
- Compute and compare the distances between the genes using the
  Euclidean and correlation distances (these can be calculated, for
  example, with the `euc` and `cor.dist` functions from the `bioDist`
  package). Interpret these distances based on the visualisation
  above.
- Compare the effect of scaling on these distances. Visualise the
scaled and unscaled profiles to help interpret the distances.
`r msmbstyle::question_end()`
```{r disttoyexample, include = FALSE, fig.cap = "Expression profiles of genes 1 to 3 in samples 1 to 5."}
library("rWSBIM1322")
data(g3)
matplot(t(g3), type = "b", xlab = "Samples", ylab = "Gene expression")
```
```{r disttoy2, include = FALSE}
library("bioDist")
ds <- rbind(euc = euc(g3),
cor = cor.dist(g3))
colnames(ds) <- c("1-2", "1-3", "2-3")
ds
```
```{r disttoy3, include = FALSE, fig.cap = "Expression profiles without (left) and with (right) scaling."}
g3_scaled <- scale(g3) ## scale
ds <- rbind(euc = euc(g3),
euc_scaled = euc(g3_scaled),
cor = cor.dist(g3),
cor_scaled = cor.dist(g3_scaled))
colnames(ds) <- c("1-2", "1-3", "2-3")
ds
par(mfrow = c(1, 2))
matplot(t(g3), type = "b", xlab = "Samples", ylab = "Gene expression")
matplot(t(g3_scaled), type = "b", xlab = "Samples", ylab = "Gene expression")
```
`r msmbstyle::question_begin()`
Following up on the exercise above, produce and interpret the four
hierarchical clusterings built using the Euclidean and correlation
distances (these can be calculated, for example, with the `euc` and
`cor.dist` functions from the `bioDist` package) on the unscaled and
scaled data.
`r msmbstyle::question_end()`
```{r hclscale, include = FALSE, fig.cap = ""}
par(mfrow = c(2, 2))
plot(hclust(euc(g3)), main = "Euclidean distance")
plot(hclust(euc(g3_scaled)), main = "Euclidean distance (scaled/centred)")
plot(hclust(cor.dist(g3)), main = "Pearson correlational distance")
plot(hclust(cor.dist(g3_scaled)), main = "Pearson correlational distance (scaled/centred)")
```
`r msmbstyle::question_begin()`
An important caveat of clustering is that a clustering algorithm such
as *k-means* will always find clusters, even when there are
none. Illustrate this by creating 2-dimensional random data:
generate a matrix of 100 rows and 2 columns from *N(0, 1)* and
search for 2, 3, 4, 5 and 6 clusters. Visualise these results and
assess whether they look convincing or not.
`r msmbstyle::question_end()`
```{r, include = FALSE}
set.seed(123)
xy <- matrix(rnorm(200), ncol = 2)
kcl2 <- kmeans(xy, centers = 2)
kcl3 <- kmeans(xy, centers = 3)
kcl4 <- kmeans(xy, centers = 4)
kcl5 <- kmeans(xy, centers = 5)
kcl6 <- kmeans(xy, centers = 6)
par(mfrow = c(2, 3))
plot(xy)
plot(xy, col = kcl2$cluster)
plot(xy, col = kcl3$cluster)
plot(xy, col = kcl4$cluster)
plot(xy, col = kcl5$cluster)
plot(xy, col = kcl6$cluster)
```
`r msmbstyle::question_begin()`
Using the same value `k = 3`, verify whether k-means and hierarchical
clustering produce similar or identical results on the `giris`
data. Visualise the two sets of results on PCA plots.

Which one, if any, is correct?
`r msmbstyle::question_end()`
```{r iris2algs, fig.width = 12, echo = FALSE, eval = FALSE}
km <- kmeans(giris[, -5], centers = 3, nstart = 10)
hcl <- hclust(dist(giris[, -5]))
hcl_res <- cutree(hcl, k = 3)
table(km$cluster, hcl_res)
pca <- prcomp(giris[, -5], scale = TRUE, center = TRUE)
library("factoextra")
library("patchwork")
p1 <- fviz_pca_ind(pca, habillage = km$cluster) +
labs(title = "kmeans")
p2 <- fviz_pca_ind(pca, habillage = hcl_res) +
labs(title = "hclust")
p1 + p2
## comparison with expected results
table(km$cluster, giris$GRADE)
table(hcl_res, giris$GRADE)
p3 <- fviz_pca_ind(pca, habillage = giris$GRADE) +
labs(title = "GRADE")
p1 + p2 + p3
```
`r msmbstyle::question_begin()`
- Load the `mulvey2015_se` data from the `pRolocdata` package. The
  goal of this exercise is to identify proteins that have a specific
  expression pattern over the cell development time. To do so,
  perform a k-means clustering setting k = 12.
- Visualise the expression profiles over the development time for the
12 clusters identified above. See below for an example of such a
visualisation.
`r msmbstyle::question_end()`
```{r, echo = FALSE, fig.fullwidth = TRUE, fig.width = 12, fig.height = 9}
library("pRolocdata")
data(mulvey2015_se)
cl <- kmeans(assay(mulvey2015_se), centers = 12, nstart = 10)
rowData(mulvey2015_se)$cl <- factor(cl$cluster)
data.frame(assay(mulvey2015_se),
cl = rowData(mulvey2015_se)[["cl"]]) %>%
rownames_to_column(var = "protein") %>%
pivot_longer(names_to = "sample",
values_to = "expression", c(-cl, -protein)) %>%
mutate(time = sub("hr", "", sub("^.+_", "", sample))) %>%
mutate(rep = sub("_.+$", "", sub("rep", "", sample))) %>%
mutate(sample = paste(time, rep, sep = "_")) %>%
ggplot(aes(x = sample, y = expression, group = protein, colour = cl)) +
geom_line() +
facet_wrap(. ~ cl)
```
`r msmbstyle::question_begin()`
Using the `hyperLOPIT2015_se` data from the `pRolocdata` package:
- Filter out all features that have an `unknown` under the feature
variable `markers`.
- Then, for all the marker classes (organelles), calculate an average
profile by computing the mean value for each column.
- Perform a hierarchical clustering showing the relation between the
average marker profiles, such as shown below.
`r msmbstyle::question_end()`
```{r, echo = FALSE, message = FALSE, warning = FALSE}
library("pRolocdata")
data(hyperLOPIT2015)
suppressPackageStartupMessages(library("pRoloc"))
par(oma = c(0, 0, 0, 0),
mar = c(15, 4, 1, 0))
pRoloc::mrkHClust(hyperLOPIT2015)
```