-
Notifications
You must be signed in to change notification settings - Fork 9
/
95-ccl.Rmd
351 lines (267 loc) · 11.9 KB
/
95-ccl.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
# Conclusions {#sec-ccl}
In this course, we have consolidated our skills in data analysis and
visualisation using R. In particular, this course has focused on
interpretation and understanding of the outputs.
We have also learned and applied new tools, in particular how to
manipulate sequence data using `Biostrings` and how to use statistical
learning tools to explore data and identify patterns of biological
interest. In terms of statistical testing and machine learning, it can
be useful to summarise the different classes of techniques we have
touch on using the figures below. The grid on each of these figures
represents a matrix with quantitative values with features (genes,
proteins, transcripts, ...) along the rows and samples along the
columns. Annotations of features or samples are presented as coloured
boxes on the right or top if the matrices.
In hypothesis test, we start with quantitative data and an
experimental design, i.e. sample annotation that group samples in
biologically relevant groups. The output of hypothesis testing is a
set of metrics for each features (a p-value, an adjusted p-value, a
fold-change, ...) that informs whether that feature shows any
difference between the biological groups of interest.
```{r, echo = FALSE, fig.cap = "Hypothesis testing.", fig.with = 5, fig.height = 3}
par(oma = c(0, 0, 0, 0), mar = c(0, 0, 2, 0))
rWSBIM1322:::ht_summary()
```
When performing clustering (unsupervised machine learning), we only have our
quantitative data as input, and the clustering algorithm (whether
k-means, hierarchical clustering among many others) suggests a set of
groups to cluster the features (as shown below) or samples.
```{r, echo = FALSE, fig.cap = "Clustering of features.", fig.with = 5, fig.height = 3}
par(oma = c(0, 0, 0, 0), mar = c(0, 0, 2, 0))
rWSBIM1322:::ul_summary()
```
When performing dimensionality reduction with, for example PCA, one
starts with a *n* by *m* data set as input to reduce the number of
dimensions in either direction. On the figure below, the number of
features *n* was reduced to 2 to, typically, visualise the sample
along a scatter plot.
```{r, echo = FALSE, fig.cap = "Dimensionality reduction.", fig.with = 5, fig.height = 3}
par(oma = c(0, 0, 0, 0), mar = c(0, 0, 2, 0))
rWSBIM1322:::dr_summary()
```
In classification (supervised machine learning), we need labelled
data, i.e. a set of sample (top on the figures below) or features
(bottom on the figure below). The classifier uses the data to learn from
assigned labels and applies that learnt model to infer the most likely
label of the unlabelled data.
```{r, echo = FALSE, fig.cap = "Classification of samples (top) or features (bottom).", fig.with = 5, fig.height = 6}
par(oma = c(0, 0, 0, 0), mar = c(0, 0, 2, 0))
rWSBIM1322:::sl_summary()
```
The next steps of the curriculum (course
[WSBIM2122](https://uclouvain-cbio.github.io/WSBIM2122/])) will build
upon the skills gained in this course to fully analyse complete
datasets from omics technologies, using state-of-the-art statistical
and machine learning methods and software. The course will be project
based: each experiment and associated technologies will be introduced,
the analysis pipeline will be explained, and the students will then
implement and present the data analysis and critical interpretation of
the results.
## Additional exercise
These final exercises integrate several techniques seen throughout
this course.
### Exercise 1 {-}
`r msmbstyle::question_begin()`
The the RNA-Seq data and sample annotation in the files returned by
the `rWSBIM1207::kem2.tsv()` function. How many genes have been
assayed? Describe the experimental design at hand.
`r msmbstyle::question_end()`
```{r, include = FALSE}
e <- as.matrix(read.delim(rWSBIM1207::kem2.tsv()[1],
row.names = 1))
dim(e)
```
`r msmbstyle::question_begin()`
Print and describe the experimental design.
`r msmbstyle::question_end()`
```{r, include = FALSE}
## Below, we harmonise the samples names to have dots in the
## experimental design dataframe (pd: phenotypic data) and expression
## matix.
pd <- readr::read_tsv(rWSBIM1207::kem2.tsv()[2]) %>%
mutate(sample_id = sub("-", ".", sample_id)) %>%
mutate(group = paste(cell_type, treatment, sep = "."))
pd
```
<!-- Cell types A and B have been assay in a stimulted and non-stimulated -->
<!-- (none) conditions. There are 4 replicates in each of the 4 groups. -->
`r msmbstyle::question_begin()`
Before performing log-transformation of the data, check if there are
any 0-expression values. Indeed, these would be converted to `-Inf`
after log-transformation. Are there any 0-values in the data. If so,
how many are there.
`r msmbstyle::question_end()`
```{r, include = FALSE}
table(e == 0)
```
`r msmbstyle::question_begin()`
Add 1 to all expression values, then log-2 transform the expression
data. Visualise the distributions of the expression values in each
samples before and after transformation.
`r msmbstyle::question_end()`
```{r, fig.cap = "Expression values in cell types A and B without (black and gree) and with stimulation (red and blue).", warning = FALSE, include = FALSE}
boxplot(e, col = as.factor(pd$group), main = "Raw values")
e <- log2(e + 1)
boxplot(e, col = as.factor(pd$group), main = "Log-transformed")
```
Continue with log-transformed data, without preforming any normalisation.
`r msmbstyle::question_begin()`
Compute, for each gene, the number of 0-expression values and
visualise these as a table showing the number of genes with 0, 1, 2,
... zero values.
`r msmbstyle::question_end()`
```{r, include = FALSE}
table(row_zero <- apply(e, 1, function(x) sum(x == 0)))
```
`r msmbstyle::question_begin()`
Visualise and interpret the experiment design (cell type and
treatment) using a principal component analyis.
`r msmbstyle::question_end()`
```{r, fig.cap = "PCA of *kem2* samples.", include = FALSE}
library("factoextra")
pca_samples <- prcomp(t(e), scale = TRUE, center = TRUE)
fviz_pca_ind(pca_samples, habillage = pd$group)
```
<!-- The PCA shows that samples group by treatment along the second PC -->
<!-- (22.3% of variance explained) and that close to 50% of the variance is -->
<!-- explained by variability along all samples. -->
`r msmbstyle::question_begin()`
Use a t test to identify the genes that are differentially expressed
between the stimulated and non-stimulated samples of cell type
A. Visualise the results on a volcano plot.
`r msmbstyle::question_end()`
<!-- We start by selecting the relevant samples and create a new expression -->
<!-- matrix matching cell type A. -->
```{r, include = FALSE}
cell_type_a <- pd %>% filter(cell_type == "A")
cell_type_a
e_a <- e[, cell_type_a[[1]]]
e_a[1:10, 1:4]
```
<!-- Below, we perform a t test on each of the `r nrow(e_a)` genes. -->
```{r, fig.cap = "Histogram of p-values, showing an enrichment of small values.", include = FALSE}
my_t_test <- function(x)
t.test(x[1:4], x[5:8])$p.value
res <- data.frame(row.names = rownames(e))
res$pv <- apply(e_a, 1, my_t_test)
hist(res$pv)
```
```{r, include = FALSE}
res$adjp <- p.adjust(res$pv, method = "BH")
my_lfc <- function(x)
mean(x[1:4]) - mean(x[5:8])
res$lfc <- apply(e_a, 1, my_lfc)
res %>%
arrange(adjp) %>%
head()
```
```{r, fig.cap = "Volcano plot illustrating how stimulation down-regulates the expression of numerous genes.", include = FALSE}
with(res, plot(lfc, -log10(adjp)))
abline(v = c(-1, 1))
abline(h = -log10(0.001))
```
`r msmbstyle::question_begin()`
Considering that a gene is called differentially expressed if it has
an absolute log2 fold-change > 1 and an adjusted p-value < 0.001, how
many differentially genes are there?
`r msmbstyle::question_end()`
```{r, include = FALSE}
res %>%
filter(adjp < 0.001, abs(lfc) > 1) %>%
nrow
```
`r msmbstyle::question_begin()`
Visualise the expression of these differentially expressed genes in
all sampes using a heatmap and interpret the figure.
`r msmbstyle::question_end()`
```{r, fig.cap = "Heatmap of the 73 differentially expressed genes.", include = FALSE}
k_de <- res %>%
rownames_to_column() %>%
filter(adjp < 0.001, abs(lfc) > 1) %>%
dplyr::select(rowname) %>%
unlist()
heatmap(e[k_de, ])
```
```{r, message = FALSE, fig.cap = "Annotated heatmap of the 73 differentially expressed genes.", include = FALSE}
library("ComplexHeatmap")
col_annot <- HeatmapAnnotation(cell_type = pd$cell_type, treatment = pd$treatment)
Heatmap(e[k_de, ], top_annotation = col_annot)
```
<!-- We see that most genes that are differentially expressed in cell type -->
<!-- A are show a consistent pattern in cell type B, except for the botton -->
<!-- 5 genes, that show a low expression accross both conditions. -->
`r msmbstyle::question_begin()`
Select the 3 genes with the smallest p-values and visualise their
expression in all samples (cell types A and B, both stimulated and
un-stimulated) using boxplots.
`r msmbstyle::question_end()`
```{r, fig.cap = "Expression of the three most differentially expressed genes.", include = FALSE}
k_3 <- res %>%
rownames_to_column() %>%
arrange(adjp) %>%
head(3) %>%
dplyr::select(rowname) %>%
unlist()
e[k_3, ] %>%
as.data.frame() %>%
rownames_to_column() %>%
pivot_longer(names_to = "sample_id",
values_to = "expression",
-rowname) %>%
full_join(pd) %>%
ggplot(aes(x = treatment, y = expression)) +
geom_boxplot(outlier.size = -1) +
geom_jitter() +
facet_grid(rowname ~ cell_type)
```
`r msmbstyle::question_begin()`
Perform a k-means clustering of all genes and samples using k = 5 and
visualise these clusters on a PCA plot. Interprete the figure.
`r msmbstyle::question_end()`
```{r, fig.cap = "PCA of the *kem2* genes displaying the 5 arbitrary clusters.", include = FALSE}
set.seed(1)
res$cl <- kmeans(e, centers = 5, nstart = 10)$cl
pca <- prcomp(e, scale = TRUE, center = TRUE)
fviz_pca_ind(pca, habillage = res$cl, geom = "point")
```
<!-- The clusters represent the diversity of gene expression along -->
<!-- PC1. Cluster 3, on the left of the figure, seems to be the most -->
<!-- variable and different, while to others represent a more continuous -->
<!-- gradient along of gene expression variation. -->
`r msmbstyle::question_begin()`
Repeat the volcano plot above, colouring the genes based on their
respective clusters. Refine your interpretation of the gene-level PCA
above.
`r msmbstyle::question_end()`
```{r, fig.cap = "Volcano plot displaying the five clusters", include = FALSE}
with(res, plot(lfc, -log10(adjp), col = res$cl))
legend("bottomleft", legend = 1:5, pch = 1, col = 1:5)
```
<!-- We see that the most diverse cluster corresponds roughly to the genes -->
<!-- that show significant down regulation upon stimulation. -->
### Exercise 2 {-}
`r msmbstyle::question_begin()`
Load the `recapSE1` data available in the `rWSBIM1322` package
(version >= 0.3.1) with `data(recapSE1)`. The data contains
quantitative proteomics data. You are asked to:
- Familiarise yourself with the experimental design.
- Verify if the data need to be transformed and or normalised.
- Perform an principal component analyse and interpret it.
- Identify differentially abundant proteins for each condition
compared to the reference group `CTRL`.
- Are there any significantly up- or down-regulated proteins shared
between all treatments?
`r msmbstyle::question_end()`
### Exercise 3 {-}
`r msmbstyle::question_begin()`
Load the `recapSE2` data available in the `rWSBIM1322` package
(version >= 0.3.2) with `data(recapSE2)`. The data contains
quantitative proteomics data. You are asked to:
- Familiarise yourself with the experimental design.
- Verify if the data need to be transformed and or normalised.
- Perform an principal component analyse and interpret it.
- Identify differentially abundant proteins for each condition
compared to the reference group `CTRL`.
- Repeat the analyses above after removing the outlier. What is the
effect of that outlier of the differential analysis.
`r msmbstyle::question_end()`