-
Notifications
You must be signed in to change notification settings - Fork 14
/
Copy pathlab_functional.Rmd
285 lines (220 loc) · 11.8 KB
/
lab_functional.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
---
title: "Gene Function Annotation"
subtitle: Workshop on RNA-Seq
editor_options:
chunk_output_type: console
output:
bookdown::html_document2:
highlight: textmate
toc: true
toc_float:
collapsed: true
smooth_scroll: true
print: false
toc_depth: 4
number_sections: true
df_print: default
code_folding: none
self_contained: false
keep_md: false
encoding: 'UTF-8'
css: "assets/lab.css"
include:
after_body: assets/footer-lab.html
---
```{r,child="assets/header-lab.Rmd"}
```
# Loading data
<div class="boxy boxy-exclamation boxy-yellow">
Set `~/RNAseq/labs/` as your working directory (i.e. your working directory should be the directory where you saved this .Rmd file). Create a subdirectory named `data` in this directory (`~/RNAseq/labs/data/`) for input and output files.
</div>
Loading packages and data.
```{r}
library(pheatmap)
library(rafalib)
library(DESeq2)
library(pvclust)
library(biomaRt)
library(enrichR)
library(fgsea)
# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
# read data
download_data("data/gene_counts.csv")
data <- read.csv("data/gene_counts.csv",row.names = 1)
dim(data)
# read metadata
download_data("data/metadata_raw.csv")
metadata <- read.csv("data/metadata_raw.csv",row.names = 1,stringsAsFactors=T)
```
Here you would select genes based on p-values and logFC you obtained from differential expression.
```{r,results='hide',chunk.title=TRUE,fig.height=5,fig.width=10}
res <- read.csv("data/dge_results.csv",row.names=1)
res1 <- na.omit(res)
selected_genes <- rownames(res1)[(res1$pvalue < 0.001) & (abs(res1$log2FoldChange) > 1)]
```
<div class="boxy boxy-lightbulb">
There are many ways of selecting gene lists. One could for example select only the top 500 variable genes. CPM counts are converted to log scale and row-wise variance is computed.
```{r,eval=F,results='hide',chunk.title=TRUE,fig.height=5,fig.width=10}
# Gene selection
vars <- apply(data,1,var)
vars <- sort(vars,decreasing=T)
top_var <- names(vars)[1:500]
```
</div>
## Clustering on Heatmap
Now that you understand the concepts of hierarchical clustering both at the sample and at the gene level, we can use a heatmap function to explore the visual consequences of clustering. Here, we can make use of the `pheatmap()` function, which by default will do the clustering of the rows and columns.
```{r,results='hide',chunk.title=TRUE,fig.height=7,fig.width=5}
cl <- pheatmap( data[rownames(data) %in% selected_genes,],scale="row",color=colorRampPalette(c("navy","white","firebrick"))(90),border_color=NA,cluster_cols=F,cutree_rows=2)
gene_clusters <- cutree(cl$tree_row,k=2)
```
# Functional Analysis
R-packages | Comments
--- | ---
topGO | GO
goana | GO
GOseq | GO
topKEGG | KEGG
kegga | KEGG
enrichR | GO, KEGG, many others
piano | GO, KEGG, GSEA, many others, enrichment consensus
ClusterProfiler | GO, KEGG, GSEA, many others, nice plots!
Pathview | Nice visualization for KEGG pathways
fgsea | GSEA
## enrichR
You can list all available databases by using the command `listEnrichrDbs()` function.
```{r,results='hide',chunk.title=TRUE,fig.height=5,fig.width=10}
genes_cluster1 <- names(gene_clusters)[gene_clusters == 1]
genes_cluster2 <- names(gene_clusters)[gene_clusters == 2]
head(data[genes_cluster1, ])
head(data[genes_cluster2, ])
```
### GO enrichment
```{r,results='hide',chunk.title=TRUE,fig.height=5,fig.width=10}
go_cluster <- enrichr(genes = genes_cluster2,databases = "GO_Biological_Process_2023")
go_cluster <- go_cluster$GO_Biological_Process_2023
go_cluster <- go_cluster[order(go_cluster$P.value),]
go_cluster[1:5,]
{
mypar(1,1,mar=c(2,25,2,2))
barplot(-log10(go_cluster$P.value[15:1]),horiz=T,border=F,yaxs="i",
names.arg=go_cluster$Term[15:1],las=1,cex.names=0.8)
abline(v=0,lwd=2)
}
```
<i class="fas fa-comments"></i> **Exploratory Questions**
* Which is the most enriched GO term for cluster1?
* How many genes from your data set were detected in this most GO term?
What is the percentage of genes out of the total list that defines the GO term?
* Which genes from your data set belong to this most enriched GO term?
* Some genes make part of several GO terms, can you list some?
### KEGG enrichment
```{r,results='hide',chunk.title=TRUE,fig.height=5,fig.width=10}
kegg_cluster <- enrichr(genes=genes_cluster2,databases="KEGG_2019_Human")
kegg_cluster <- kegg_cluster$KEGG_2019_Human
kegg_cluster <- kegg_cluster[order(kegg_cluster$P.value),]
kegg_cluster[1:5,]
{
mypar(1,1,mar=c(2,25,2,2))
barplot(-log10(kegg_cluster$P.value[15:1]),horiz=T,border=F,yaxs="i",
names.arg=kegg_cluster$Term[15:1],las=1,cex.names=0.8)
abline(v=0,lwd=2)
}
```
<i class="fas fa-comments"></i> **Exploratory Questions**
* Which is the most enriched KEGG pathway for cluster1?
* How many genes from your data set were detected in this most enriched KEGG pathway ?
* What is the percentage of genes out of the total list that defines the KEGG pathway ?
* Which genes from your data set belong to this most enriched KEGG pathway ?
* Some genes make part of several KEGG pathways, can you list some?
* Taking the results from GO term enrichment and the KEGG enrichment, what is a most likely function happening in your data set?
## GSEA
MSigDB is one of the largest curated databases of signatures and pathways created by the BROAD INSTITUTE. The Molecular Signatures Database (MSigDB) is accessible [here](http://software.broadinstitute.org/gsea/msigdb/index.jsp).
For this exercise, you can download a couple of data sets:
Database | Link to Download
--- | ---
KEGG | [c2.cp.kegg.v6.2.symbols.gmt](http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/6.2/c2.cp.kegg.v6.2.symbols.gmt)
GO_BP | [c5.bp.v6.2.symbols.gmt](http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/6.2/c5.bp.v6.2.symbols.gmt)
HALLMARK | [h.all.v6.2.symbols.gmt](http://software.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/6.2/h.all.v6.2.symbols.gmt)
As you could already notice, the differences in gene expression between days 0 to 7 are very clear. Here we will illustrate a gene set enrichment to explore which pathways are UP or DOWN regulated.
<div class="boxy boxy-exclamation boxy-yellow">
Here you would use the logFC values you obtained from differential expression. But for simplicity and reduce dependency on other parts of the course, we are using a simple difference in mean expression.
</div>
```{r}
# if folder and files doesn't exist, download it
if(!dir.exists("data/MSigDB_files/")) {
dir.create("data/MSigDB_files")
download.file(url="https://raw.github.com/NBISweden/workshop-RNAseq/master/data/MSigDB_files/c2.cp.kegg.v6.2.symbols.gmt.txt","./data/MSigDB_files/c2.cp.kegg.v6.2.symbols.gmt.txt")
download.file(url="https://raw.github.com/NBISweden/workshop-RNAseq/master/data/MSigDB_files/c3.tft.v6.2.symbols.gmt.txt","./data/MSigDB_files/c3.tft.v6.2.symbols.gmt.txt")
download.file(url="https://raw.github.com/NBISweden/workshop-RNAseq/master/data/MSigDB_files/h.all.v6.2.symbols.gmt.txt","./data/MSigDB_files/h.all.v6.2.symbols.gmt.txt")
download.file(url="https://raw.github.com/NBISweden/workshop-RNAseq/master/data/MSigDB_files/c2.cp.v6.2.symbols.gmt.txt","./data/MSigDB_files/c2.cp.v6.2.symbols.gmt.txt")
download.file(url="https://raw.github.com/NBISweden/workshop-RNAseq/master/data/MSigDB_files/c5.bp.v6.2.symbols.gmt.txt","./data/MSigDB_files/c5.bp.v6.2.symbols.gmt.txt")
}
```
```{r,results='hide',chunk.title=TRUE,fig.height=5,fig.width=10}
# Create a gene rank based on the gene expression
gene_rank <- setNames( res$log2FoldChange,casefold(rownames(res),upper=T) )
# Load hallmark pathways
hallmark_pathways <- gmtPathways("data/MSigDB_files/h.all.v6.2.symbols.gmt.txt")
```
Once our list of genes are sorted (from highest to lowest expression in day 7), we can proceed with the enrichment itself. Here, we will be using the `fgsea()` function. This will result in a table containing information for several pathways. We can then sort and filter those pathways to visualize only the top ones. You can select/filter them by either `p-value` or normalized enrichment score (`NES`).
```{r,results='hide',chunk.title=TRUE,fig.height=5,fig.width=10}
# Perform enrichemnt analysis
fgseaRes <- fgsea( pathways=hallmark_pathways, stats=gene_rank, minSize=15, maxSize=500, nperm=10000)
# Filter the results table to show only the top 10 UP and DOWN regulated processes (optional)
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
# Nice summary table (shown as a plot)
plotGseaTable(hallmark_pathways[topPathways], gene_rank, fgseaRes, gseaParam = 0.5)
```
<i class="fas fa-comments"></i> **Exploratory Questions**
* Which is the most significantly up-regulated pathway at day 7?
* What do the length of the black bars mean?
* Is there any pathway that is significantly down-regulated at 0?
* Which is the most enriched pathway at day 0?
* What happens if we use `-res$log2FoldChange` as gene rank ? What changes and why?
Checking for individual enrichment:
```{r,results='hide',chunk.title=TRUE,fig.height=5,fig.width=10}
topPathways
source("https://raw.githubusercontent.com/czarnewski/niceRplots/master/R/helper_functions.R")
mypar(2,3)
# Nice plot to see enrichment of one specific pathway
p <- list()
for (i in c(topPathwaysUp[1:3],topPathwaysDown[3:1])){
plot_enrich(i,hallmark_pathways,gene_rank)
# p[[i]] <- plotEnrichment(hallmark_pathways[[i]], gene_rank) + ggplot2::ggtitle(i)
}
# p
```
<i class="fas fa-comments"></i> **Exploratory Questions**
* What do the black lines in the X-axis mean?
* By looking at the green line distribution, and the comparison made (day00 - day07), at which time point is this pathways enriched?
<i class="fas fa-comments"></i> **Exploratory Questions (optional)**
* What if instead of the hallmark gene set, you use the GO? Try that.
* How different are the results from EnrichR and GSEA using GO annotation? Remember to use the same comparison (i.e. day00 vs day07).
## Online enrichment tools:
Below you can find a list with the most commonly used online enrichment tools. Each one differs a bit on what they can do. The packages are sorted in order, we would like you to work on the workshop.
Package | link | Database | Comment
--- | --- | --- | ---
Enrichr | http://amp.pharm.mssm.edu/Enrichr/ | GO, KEGG, TF, many others | Extensive libraries
GOrilla | http://cbl-gorilla.cs.technion.ac.il | GO | Support for REVIGO
REVIGO | http://revigo.irb.hr | GO | Summarises redundancy
DAVID | https://david.ncifcrf.gov | GO, KEGG, TF, many others | *Not updated
KEGG | https://www.genome.jp/kegg/ | KEGG | Shows the pathways
Reactome | https://www.reactome.org | KEGG-like | Shows the pathways/reactions
Panther | http://www.pantherdb.org/about.jsp | GO | Evolutionary conserved GO annotation
In case you want to test the online tools above, you can use the code below to copy the gene vector into memory, and then paste it to the webtools. Should work for both Mac and Windows users.
```{r,eval=F,results='hide',chunk.title=TRUE,fig.height=5,fig.width=10}
clip <- pipe("pbcopy","w")
write.table(genes_cluster1, file=clip, sep = '\t', row.names = FALSE)
close(clip)
```
## Protein-protein interactions
In the course we will not have much time to work on gene expression networks. However, there are some nice databases that allow you to easily visualize protein-protein interactions. Please paste the list of genes using the command above into those sites to have a visual insight on how the Up- or Down- regulated genes interact with each other.
Database | Link
--- | ---
GeneMANIA | https://genemania.org
STRING-DB | https://string-db.org
MIST | https://fgrtools.hms.harvard.edu/MIST/help.jsp
***