-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbootcamp_intro.Rmd
773 lines (572 loc) · 25.8 KB
/
bootcamp_intro.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
---
title: SPEAQeasy bootcamp
author:
- name: Leonardo Collado-Torres
affiliation:
- &libd Lieber Institute for Brain Development, Johns Hopkins Medical Campus
- &ccb Center for Computational Biology, Johns Hopkins University
email: [email protected]
output:
BiocStyle::html_document:
toc: true
toc_float: true
toc_depth: 2
code_folding: show
---
This document is part of the _R/Bioconductor-powered Team Data Science_ [LIBD bootcamps](https://lcolladotor.github.io/bioc_team_ds/rbioconductor-data-science-bootcamps.html#libd-bootcamps).
You can download the [**latest version of this Rmd document here**](https://github.com/LieberInstitute/SPEAQeasy-example/blob/master/bootcamp_intro.Rmd). ^[This was useful particularly during October 5-7 while I was updating this document frequently.]
# Setup
Please install the latest:
* R version from [CRAN](https://cran.r-project.org/)
* [RStudio Destkop](https://rstudio.com/products/rstudio/download/) ^[You might want to try out [RStudio Desktop Preview](https://rstudio.com/products/rstudio/download/preview/) as discussed on the [2020-10-02 LIBD rstats club session](https://twitter.com/LIBDrstats/status/1312187753371561984?s=20). ]
Once you have installed R and RStudio, please install the following R packages:
```{r "install_packages", eval = FALSE}
## For installing R packages from CRAN and GitHub
if (!requireNamespace("remotes", quietly = TRUE)) {
install.packages("remotes")
}
## For installing Bioconductor packages
remotes::install_cran("BiocManager")
## Workflow-related R packages
remotes::install_cran(c(
"rmarkdown",
"sessioninfo",
"getopt",
"here",
"tidyr",
"usethis"
))
remotes::install_github("LieberInstitute/jaffelab")
## Differential-expression related R packages
BiocManager::install(c("limma", "edgeR", "SummarizedExperiment", "recount", "ExploreModelMatrix"))
## Genomics-related R packages
BiocManager::install(c(
"clusterProfiler", "org.Hs.eg.db",
"VariantAnnotation"
))
## Visualization-related R packages
remotes::install_cran(c(
"pheatmap",
"RColorBrewer",
"ggplot2",
"plotly"
))
BiocManager::install(c("iSEE", "enrichplot"))
## Extra ones
remotes::install_cran(c("knitcitations", "statmod"))
BiocManager::install(c("BiocStyle"))
```
# Bootcamp plan
* Day 1: Monday October 5th 2020, 3-5 pm
- RNA sequencing primer
- LIBD's RNA-seq processing pipeline: SPEAQeasy
- `r BiocStyle::Biocpkg("SummarizedExperiment")` overview
- Exploring quality control metrics
* Day 2: Tuesday October 6th 2020, 3-5 pm
- Exploring gene expression
- Identifying sample swaps
- Statistical modeling
* Day 3: Wednesday October 7th 2020, 1-3 pm
- Identifying differentially expressed genes
- Visualizing differentially expressed genes
- Gene ontology enrichment analyses
- Beyond SPEAQeasy
# Day 1
## RNA sequencing primer
[Slides](https://docs.google.com/presentation/d/1X3U2lcf75Yh9BK_Scs-O3M8b9-JauUNN8ZO_xhlKs7w/edit?usp=sharing)
For more details, I have some older slides on the technology for RNA-seq from [PDCB-HTS 2010](http://lcolladotor.github.io/courses/PDCB-HTS.html).
## SPEAQeasy overview
Let's go to the [SPEAQeasy](http://research.libd.org/SPEAQeasy/) documentation website.
We'll mostly be using the [SPEAQeasy main output files](http://research.libd.org/SPEAQeasy/outputs.html).
### SPEAQeasy-example
Example data using [SPEAQeasy-example](http://research.libd.org/SPEAQeasy-example/) with 40 samples from the Bipolar project with Peter Zandi et al.
It shows:
1. How to download the example data
1. How to run SPEAQeasy with this data
1. How to identify sample swaps
1. How to run differential expression analyses with this data
We'll view these links, but we'll start with:
* [Home](http://research.libd.org/SPEAQeasy-example/index.html)
* [Download data](http://research.libd.org/SPEAQeasy-example/prepare_data.html)
## SummarizedExperiment overview
### General
In general:
* `r BiocStyle::Biocpkg("SummarizedExperiment")`
- [Introduction vignette](http://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html)
* Bioconductor 2015 paper
- DOI: [10.1038/nmeth.3252](https://doi.org/10.1038/nmeth.3252)
### Example data
Let's look at the example data
```{r "load_data"}
## For loading the data
library("SummarizedExperiment")
library("here")
## Load the example data
load(here("rse_speaqeasy.RData"), verbose = TRUE)
```
Let's look at the sample metadata we have in this example data. Some of these variables are defined in the [SPEAQeasy documentation](http://research.libd.org/SPEAQeasy/outputs.html#quality-metrics) that we'll be using shortly.
```{r "metadata"}
colData(rse_gene)
```
SPEAQeasy can align data to the human or mouse genomes using specific Gencode annotations. The example data we are working with is from human using Gencode v32.
```{r "gene data"}
rowData(rse_gene)
```
The bulk of the data is stored in the gene count matrix generated using `featureCounts` by `SPEAQeasy`. This is a large matrix with one row per gene and one column per sample. As such, we'll just explore a tiny piece of it.
```{r "count data"}
## The names of the count matrices we have
assayNames(rse_gene)
## The full "counts" matrix
dim(assay(rse_gene, "counts"))
## The top left corner of the matrix counts
jaffelab::corner(assay(rse_gene, "counts"))
```
## Exploring quality control metrics
The [SPEAQeasy documentation](http://research.libd.org/SPEAQeasy/outputs.html#quality-metrics) explains the meaning of some of the quality metrics that are included in the RSE files generated by this processing pipeline. Let's explore these metrics and their relationship to some of our sample metadata we manually included, such as the age at time of death and primary diagnosis.
We can start with some boxplots comparing some metrics against primary diagnosis and brain region, as shown in the [SPEAQeasy-example differential expression script](http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#14_Stats_vs_diagnosis_and_brain_region).
We can also use tools like `r BiocStyle::CRANpkg("ggplot2")` and `r BiocStyle::CRANpkg("plotly")` to interactively explore some metrics. For doing so, it'll be useful to check some of our LIBD rstats club sessions on:
* [High quality graphics in R](https://twitter.com/LIBDrstats/status/1308934265267195905?s=20)
* [Interactive graphics with plotly](https://twitter.com/LIBDrstats/status/1304799631386308608?s=20)
### Exercise
Adapt the interactive graphics code from the LIBD rstats club session available [here](https://gist.github.com/lcolladotor/350ac8843e153d135f64880a1d029b14) for our data.
### Exercise hints
```{r "interactive_exploration"}
## First we load the required packages
library("ggplot2")
library("plotly")
## Next, we convert the sample metadata to a data.frame
## which will make it easier to work with
pd_df <- as.data.frame(colData(rse_gene))
## We need a "key" variable that is unique
## Let's make the key
## Let's make a "highlighted" table
## Make a plot using the highlight table
## Make a second plot
## Convert them to interactive plots
## Now group them together
## Fancy one
```
### Exercise solution
#### Step 1
As a first step, let's add the code from the gist that is listed below each of the comments from the previous [code](https://gist.github.com/lcolladotor/350ac8843e153d135f64880a1d029b14).
```{r "interactive_exploration_step1", eval = FALSE}
## First we load the required packages
library("ggplot2")
library("plotly")
## Next, we convert the sample metadata to a data.frame
## which will make it easier to work with
pd_df <- as.data.frame(colData(rse_gene))
## We need a "key" variable that is unique
length(unique(pd_df$RNum))
nrow(pd_df)
## Let's make the key
pd_df$key <- pd_df$RNum
## Let's make a "highlighted" table
pd_key <- highlight_key(pd_df, ~key)
## Make a plot using the highlight table
gg_mean_mito_vs_mean_gene <-
ggplot(
pd_key,
aes(x = mean_mitoRate, y = mean_totalAssignedGene, color = Region)
) +
geom_point()
gg_mean_mito_vs_mean_gene
## Make a second plot
gg_mean_mito_vs_mean_RIN <-
ggplot(pd_key, aes(x = mean_mitoRate, y = mean_RIN, color = Region)) +
geom_point()
gg_mean_mito_vs_mean_RIN
## Convert them to interactive plots
p_mean_mito_vs_mean_gene <- ggplotly(gg_mean_mito_vs_mean_gene)
p_mean_mito_vs_mean_RIN <- ggplotly(gg_mean_mito_vs_mean_RIN)
## Now group them together
p_merged <- subplot(
p_mean_mito_vs_mean_gene,
p_mean_mito_vs_mean_RIN,
nrows = 1,
shareX = TRUE,
shareY = FALSE,
which_layout = 2
)
## Fancy one
p_fancy <- highlight(
p_merged,
on = "plotly_click",
selectize = TRUE,
dynamic = TRUE,
persistent = TRUE
)
```
#### Step 2
Now that we have the code, we need to adapt it. Some variable names are slightly different, some don't exist. This process will typically involve several error messages while we figure out what's there and what's different. The end result will look like the following code.
```{r "interactive_exploration_answer"}
## First we load the required packages
library("ggplot2")
library("plotly")
## Next, we convert the sample metadata to a data.frame
## which will make it easier to work with
pd_df <- as.data.frame(colData(rse_gene))
## We need a "key" variable that is unique
## We replaced RNum by SAMPLE_ID
length(unique(pd_df$SAMPLE_ID))
nrow(pd_df)
## Let's make the key
pd_df$key <- pd_df$SAMPLE_ID
## Let's make a "highlighted" table
pd_key <- highlight_key(pd_df, ~key)
## Make a plot using the highlight table
## Here we remoted "mean_" from the variable and object names
## and we changed Region to BrainRegion
gg_mito_vs_gene <-
ggplot(
pd_key,
aes(x = mitoRate, y = totalAssignedGene, color = BrainRegion)
) +
geom_point()
## Make a second plot
## We don't have RIN, so we'll use rRNA_rate here
gg_mito_vs_rRNA_rate <-
ggplot(pd_key, aes(x = mitoRate, y = rRNA_rate, color = BrainRegion)) +
geom_point()
## Convert them to interactive plots
p_mito_vs_gene <- ggplotly(gg_mito_vs_gene)
p_mito_vs_rRNA_rate <- ggplotly(gg_mito_vs_rRNA_rate)
## Now group them together
p_merged <- subplot(
p_mito_vs_gene,
p_mito_vs_rRNA_rate,
nrows = 1,
shareX = TRUE,
shareY = FALSE,
which_layout = 2
)
## Fancy one
p_fancy <- highlight(
p_merged,
on = "plotly_click",
selectize = TRUE,
dynamic = TRUE,
persistent = TRUE
)
## End result
p_fancy
```
# Day 2
You can download the [**latest version of this Rmd document here**](https://github.com/LieberInstitute/SPEAQeasy-example/blob/master/bootcamp_intro.Rmd).
## Exploring gene expression
One way to explore gene expression data as a whole is to use a dimension reduction technique such as principal component analysis (PCA) or multidimensional scaling (MDS) ^[For details about them, check [this thread on CrossValidated](https://stats.stackexchange.com/questions/14002/whats-the-difference-between-principal-component-analysis-and-multidimensional).]. This is typically done with a subset of highly variable genes for computational speed, though with smaller datasets, you can do it with all the data.
In the [SPEAQeasy-example](http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#15_Explore_and_visualize_gene_expression), PCA was computed on the `log2(RPKM + 1)` data.
Once we have the top principal components (PCs), we can then make scatterplots or boxplots comparing them against some known covariates from our study as well as quality metrics generated by SPEAQeasy. It is also useful to compare the top PCs (the ones that explain the most variance) and color the samples (dots) by some grouping variables we are interested in; check [this example](http://bioconductor.org/packages/release/workflows/vignettes/rnaseqGene/inst/doc/rnaseqGene.html#pca-plot) where the main covariate is associated with PC1 that explains ~50% of the variance. This process can reveal potential batch effects that we need to consider in our statistical models.
### Exercise
* Compute the PCs just like in the [SPEAQeasy-example](http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#15_Explore_and_visualize_gene_expression).
* Adapt the PC1 vs PC2 code to make three new scatterplots:
- PC1 vs `mitoRate`,
- PC1 vs `totalAssignedGene`,
- PC1 vs `rRNA_rate`
* Optional: make interactive versions of these graphs.
Here's some code to get you started.
```{r "pcs"}
## Load some required packages
library("recount")
library("jaffelab")
# Filter for expressed
rse_gene <- rse_gene[rowMeans(getRPKM(rse_gene, "Length")) > 0.2, ]
# Explore gene expression
geneExprs <- log2(getRPKM(rse_gene, "Length") + 1)
set.seed(20201006)
pca <- prcomp(t(geneExprs))
pca_vars <- getPcaVars(pca)
pca_vars_lab <- paste0(
"PC", seq(along = pca_vars), ": ",
pca_vars, "% Var Expl"
)
```
## Identifying sample swaps
One of the main outputs from SPEAQeasy is a VCF file with ~740 common coding variants that you can compare across RNA-seq samples to identify issues with sample identity, or even better, against DNA genotyping data. [SPEAQeasy-example](http://research.libd.org/SPEAQeasy-example/swap_speaqeasy.html) includes an example where we do the comparison and identify a few samples that are problematic.
Resolving sample identities is a headache-inducing process, since some issues are more complex than others to resolve and the decisions you make in how to prioritize the evidence will lead to different conclusions.
## Statistical modeling
There are many Bioconductor packages that enable us to perform the statistical tests for differential expression analysis. Some of the main ones are `r BiocStyle::Biocpkg("limma")`, `r BiocStyle::Biocpkg("edgeR")`, and `r BiocStyle::Biocpkg("DESeq2")`. In our recent projects we have mostly used `r BiocStyle::Biocpkg("limma")` due to its performance and speed, which matters most when analyzing exon or exon-exon junction expression data (count matrices with thousands of rows more than gene expression data).
### Normalization
Prior to running a differential expression analysis, we first have to make the data comparable. This is done through a process called _normalization_. Many methods exist across Bioconductor packages for doing this. Given that we like to use `limma-voom` for the modeling, we'll use the companion normalization methods from `r BiocStyle::Biocpkg("edgeR")`. The `RNAseq123` Bioconductor workflow explains why normalization is important, so let's [take a look](http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html#normalising-gene-expression-distributions).
Let's use `edgeR::calcNormFactors()` with the SPEAQeasy-example data.
```{r "calc_norm_factors"}
## For the normalization functions
library("edgeR")
## From
## http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#16_Modeling
dge <- DGEList(
counts = assays(rse_gene)$counts,
genes = rowData(rse_gene)
)
dge <- calcNormFactors(dge)
```
### Design matrix
Differential expression analysis methods rely on the user specifying a _design matrix_. The simplest way to think about it is in terms of a linear regression, although some models use other distributions. R provides a useful function for making design matrices called `model.matrix()` that can take a _formula_ input such as `~ group + lane` or `~ 0 + group + lane` as shown [in RNAseq123](http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html#differential-expression-analysis).
We can spend more time learning about regression starting with [linear regression](https://lcolladotor.github.io/bioc_team_ds/helping-others.html#linear-regression-example). However a user-friendly way that is quite powerful is implemented in the `r BiocStyle::Biocpkg("ExploreModelMatrix")` package. Let's look at the [first example](http://bioconductor.org/packages/release/bioc/vignettes/ExploreModelMatrix/inst/doc/ExploreModelMatrix.html) from their vignette.
#### Exercise
1. Create the following model matrix.
```{r "model"}
## Clean up the PrimaryDx variable
rse_gene$PrimaryDx <- factor(rse_gene$PrimaryDx)
## Modified from
## http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#16_Modeling
mod <- model.matrix(~ PrimaryDx + BrainRegion,
data = colData(rse_gene)
)
```
2. Use `r BiocStyle::Biocpkg("ExploreModelMatrix")` to visualize the design matrix.
You might need to install `r BiocStyle::Biocpkg("ExploreModelMatrix")` first as shown below.
```{r "install_explore_model_matrix", eval = FALSE}
## Install a new package
BiocManager::install("ExploreModelMatrix")
```
Then adapt the code from the [first example](http://bioconductor.org/packages/release/bioc/vignettes/ExploreModelMatrix/inst/doc/ExploreModelMatrix.html) in their vignette documentation.
```{r "use_explore_model_matrix"}
library("ExploreModelMatrix")
## Adapt code from
## http://bioconductor.org/packages/release/bioc/vignettes/ExploreModelMatrix/inst/doc/ExploreModelMatrix.html
```
Note that `sampleData` cannot take `NA` values, so you'll have to choose some columns to work with. For example:
* `PrimaryDx`
* `BrainRegion`
* `Sex`
* `AgeDeath`
* `mitoRate`
* `totalAssignedGene`
* `rRNA_rate`
#### Solution
Here's one solution to the exercise.
```{r "explore_model_solution"}
## Here we select manually a few columns to work with
## and we use the same formula we used for making our model matrix
app <-
ExploreModelMatrix(
sampleData = colData(rse_gene)[, c(
"PrimaryDx",
"BrainRegion",
"Sex",
"AgeDeath",
"mitoRate",
"totalAssignedGene",
"rRNA_rate"
)],
designFormula = ~ PrimaryDx + BrainRegion
)
## We can the explore the result interactively through a nice
## shiny web application created by ExploreModelMatrix
if (interactive()) {
shiny::runApp(app)
}
```
# Day 3
You can download the [**latest version of this Rmd document here**](https://github.com/LieberInstitute/SPEAQeasy-example/blob/master/bootcamp_intro.Rmd).
## Identifying DEGs
Let's take a look at the `RNAseq123` workflow to learn about how to [remove heteroscedascity from count data](http://bioconductor.org/packages/release/workflows/vignettes/RNAseq123/inst/doc/limmaWorkflow.html#removing-heteroscedascity-from-count-data).
```{r "run_voom"}
## Load limma
library("limma")
## Let's run limma-voom
vGene <- invisible(voom(dge, mod, plot = TRUE))
```
Once we have removed the heteroscedascity from the data, we can proceed with the statistical tests. However, note that we have some repeated brains which means that we have non-independent data and need to take this into account.
```{r "dup_brains"}
table(table(rse_gene$BrNum))
```
We can do so using the function `limma::duplicateCorrelation()`.
```{r "dup_corr", warning=FALSE}
# Get duplicate correlation
gene_dupCorr <- duplicateCorrelation(vGene$E, mod,
block = colData(rse_gene)$BrNum
)
gene_dupCorr$consensus.correlation
```
In this particular case, `duplicateCorrelation()` had several warnings, but in larger studies where we have more samples across individuals it will be more precise.
In any case, we can next finally perform the statistical tests using a series of `limma` functions: `lmFit()`, `eBayes()` and `topTable()`. Some key arguments are:
* `lmFit(correlation, block)` which you won't need if you don't have a blocking covariate (duplicated brains or some other technical covariate like batch).
* `topTable(coef)` which specifies which coefficient we want to test. If we specify more than one, we will compute an F-statistic instead of a t-statistic. The other arguments we specify below allow us to get the full table of results instead of just the very first.
```{r "limma"}
# Fit linear model
fitGeneDupl <- lmFit(
vGene,
correlation = gene_dupCorr$consensus.correlation,
block = colData(rse_gene)$BrNum
)
# Here we perform an empirical Bayesian calculation to obtain
# our significant genes
ebGeneDupl <- eBayes(fitGeneDupl)
outGeneDupl <- topTable(
ebGeneDupl,
coef = 2,
p.value = 1,
number = nrow(rse_gene),
sort.by = "none"
)
dim(outGeneDupl)
## Compare against the default topTable() output
## (we'll skip a few columns of output)
topTable(ebGeneDupl, coef = 2)[, -c(1, 3, 6:10)]
```
## Visualizing DEGs
Once we have our list of differentially expressed genes (DEGs), we can make some boxplots as in [SPEAQeasy-example](http://research.libd.org/SPEAQeasy-example/de_analysis_speaqeasy.html#17_Check_plots) to check that the model results make sense. We frequently use [`jaffelab::cleaningY()`](http://research.libd.org/jaffelab/reference/cleaningY.html) to remove (regress out) the effect of covariates in our model.
Another way to visualize DEGs is to make a heatmap which is where `r BiocStyle::CRANpkg("pheatmap")` comes into play. It can be useful to see how the top genes are related.
```{r "pheatmap"}
## For making a heatmap
library("pheatmap")
## Find the genes with a p-value less than 0.005
## (normally we would look at the FDR < 0.05 genes or 0.1)
sigGene <- outGeneDupl[outGeneDupl$P.Value < 0.005, ]
## Extract the normalized expression
exprs_heatmap <- vGene$E[rownames(sigGene), ]
## Build an annotation data.frame for our heatmap
df <- as.data.frame(colData(rse_gene)[, c("PrimaryDx", "Sex", "BrainRegion")])
rownames(df) <- colnames(exprs_heatmap) <- gsub("_.*", "", colnames(exprs_heatmap))
colnames(df) <- c("Diagnosis", "Sex", "Region")
# Manually determine coloring for plot annotation
palette_names = c('Dark2', 'Paired', 'YlOrRd')
ann_colors = list()
for (i in 1:ncol(df)) {
col_name = colnames(df)[i]
n_uniq_colors = length(unique(df[,col_name]))
# Use a unique palette with the correct number of levels, named with
# those levels
ann_colors[[col_name]] = RColorBrewer::brewer.pal(n_uniq_colors, palette_names[i])[1:n_uniq_colors]
names(ann_colors[[col_name]]) = unique(df[,col_name])
}
# Display heatmap
pheatmap(
exprs_heatmap,
cluster_rows = TRUE,
show_rownames = FALSE,
cluster_cols = TRUE,
annotation_col = df,
annotation_colors = ann_colors
)
```
```{r pdf_heatmap, eval=FALSE, echo=FALSE}
## For some reason, I need to run this manually to save the PDF
# Write heatmap to PDF
pdf(file = here("intro_de_heatmap.pdf"))
pheatmap(
exprs_heatmap,
cluster_rows = TRUE,
show_rownames = FALSE,
cluster_cols = TRUE,
annotation_col = df,
annotation_colors = ann_colors
)
dev.off()
```
### Exercise
* Make a plot with 4 boxplots (brain region by primary diagnosis) for the top differentially expressed gene.
### Solution
```{r "top_gene_solution"}
## Let's find that top gene
i <- which.min(outGeneDupl$P.Value)
## We can use boxplot() from base R plots
boxplot(
vGene$E[i,] ~ rse_gene$PrimaryDx + rse_gene$BrainRegion,
las = 2,
ylab = "Expression",
xlab = ""
)
## Or we can use ggplot2
## First we need to build a temporary data frame with
## the data for ggplot2
df_temp <- data.frame(
Expression = vGene$E[i,],
Region = rse_gene$BrainRegion,
Diagnosis = rse_gene$PrimaryDx
)
## Next we can make the boxplot, we'll use "fill" to color
## the boxes by the primary diagnosis variable
ggplot(df_temp, aes(y = Expression, x = Region, fill = Diagnosis)) +
geom_boxplot() +
theme_dark(base_size = 20)
```
## GO enrichment analyses
Josh Stolz will show us some [slides](https://docs.google.com/presentation/d/1sQdOThDTq1iu8SV8H-HOJ8nGwgTxd5NSXnyfR1w6c28/edit?usp=sharing).
### Demo
You might need to install `r BiocStyle::Biocpkg("enrichplot")` first as shown below.
```{r "install_explore_enrichplot", eval = FALSE}
## Install a new package
BiocManager::install("enrichplot")
```
Here's code for a GO demonstration by Josh.
```{r "go_demo"}
## Load required R packages
library("enrichplot")
library("clusterProfiler")
library("org.Hs.eg.db")
## Sample Data
data(geneList, package = "DOSE")
## make cutoff log2 fold change
gene <- names(geneList)[abs(geneList) > 2]
## We need get different kinds of gene labels
gene.df <- bitr(
gene,
fromType = "ENTREZID",
toType = c("ENSEMBL", "SYMBOL"),
OrgDb = org.Hs.eg.db
)
## do gene ontology enrichment for the CC ontology
ego <- enrichGO(
gene = gene,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE
)
## Visualize the GO enrichment results in different ways
barplot(ego, showCategory = 20)
dotplot(ego, showCategory = 30)
edox <- setReadable(ego, "org.Hs.eg.db", "ENTREZID")
## cnetplot() and heatplot() show the actual genes
## that are present in our data (geneList) for each
## of the GO terms that were significantly enriched
cnetplot(edox,
foldChange = geneList,
circular = TRUE,
colorEdge = TRUE)
heatplot(edox, foldChange = geneList)
```
### Exercise
* Which genes from our dataset are involved in the _chromosome, centromeric region_ ontology?
* Make a `heatplot()` using the results of enrichment analysis from the biological process ontology.
### Solution
```{r "go_solution"}
## Finding the genes for _chromosome, centromeric region_
ego_df <- as.data.frame(ego)
strsplit(
ego_df$geneID[
ego_df$Description == "chromosome, centromeric region"
],
"/")[[1]]
## Run the enrichment with the BP ontology
ego_bp <- enrichGO(
gene = gene,
universe = names(geneList),
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE
)
## Then make the heatplot
ego_bp_x <- setReadable(ego_bp, "org.Hs.eg.db", "ENTREZID")
heatplot(ego_bp_x, foldChange = geneList)
```
## Beyond SPEAQeasy
* [recount3](http://research.libd.org/recount3/): over 750,000 human and mouse public RNA-seq processed with another aligner from the one used in SPEAQeasy: STAR instead of HISAT2.
* Other tools like Salmon and kallisto.
# Reproducibility
```{r reproducibility}
# Date this report was generated
message(Sys.time())
# Reproducibility info
options(width = 120)
sessioninfo::session_info()
```