forked from bioinformatics-core-shared-training/RNAseq-R
-
Notifications
You must be signed in to change notification settings - Fork 2
/
rna-seq-de.Rmd
260 lines (172 loc) · 10.1 KB
/
rna-seq-de.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
---
title: "RNA-seq analysis in R"
subtitle: "Differential Expression of RNA-seq data"
author: "Mark Dunning"
date: "February 2020"
output:
html_notebook:
toc: yes
toc_float: yes
html_document:
toc: yes
toc_float: yes
minutes: 300
layout: page
bibliography: ref.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
**Original Authors: Belinda Phipson, Anna Trigos, Matt Ritchie, Maria Doyle, Harriet Dashnow, Charity Law**, **Stephane Ballereau, Oscar Rueda, Ashley Sawle**
Based on the course [RNAseq analysis in R](http://combine-australia.github.io/2016-05-11-RNAseq/) delivered on May 11/12th 2016 and modified by Cancer Research Uk Cambridge Centre for the [Functional Genomics Autumn School 2017](https://bioinformatics-core-shared-training.github.io/cruk-autumn-school-2017/)
## Resources and data files
This material has been created using the following resources:
- http://www.statsci.org/smyth/pubs/QLedgeRPreprint.pdf [@Lun2016]
- http://monashbioinformaticsplatform.github.io/RNAseq-DE-analysis-with-R/99-RNAseq_DE_analysis_with_R.html
- http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
- https://bioconductor.github.io/BiocWorkshops/rna-seq-data-analysis-with-deseq2.html
## Differential expression with `DESeq2`
Now that we are happy that we have normalised the data and that the quality looks good, we can continue to testing for differentially expressed genes. There are a number of packages to analyse RNA-Seq data. Most people use `DESeq2` or `edgeR`. We will use `DESeq2` for the rest of this practical.
**First make sure we have all the objects and libraries loaded*
```{r}
library(DESeq2)
library(tximport)
```
### Recap of pre-processing
The previous section walked-through the pre-processing and transformation of the count data. Here, for completeness, we list the minimal steps required to process the data prior to differential expression analysis.
Note that although we spent some time looking at the quality of our data , these steps are not required prior to performing differential expression so are not shown here. Remember, `DESeq2` [requires raw counts](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-un-normalized-counts) so the `vst` transformation is not shown as part of this basic protocol.
```{r eval=FALSE}
dirs <- list.files(file.path("salmon_quant/"))
quant_files <- paste0("salmon_quant/",dirs,"/quant.sf.gz")
names(quant_files) <- dirs
tx2gene <- read.csv("tx2gene.csv")
txi <- tximport(quant_files,type="salmon",tx2gene = tx2gene,ignoreTxVersion = TRUE)
sampleinfo <- read.delim("meta_data/sampleInfo_corrected.txt")
rownames(sampleinfo) <- sampleinfo$run
dds <- DESeqDataSetFromTximport(txi,
colData = sampleinfo,
design <- ~CellType)
keep <- rowSums(assay(dds) >= 5) >= 2
dds <- dds[keep,]
```
We also have the output of the pre-processing section saved as an R object if you didn't manage to complete these steps.
```{r}
## Only run if you didn't complete the previous section on pre-processing
load("Robjects/preprocessing.Rdata")
```
## Differential Expression with DESeq2
We have previously defined the test condition using the `design` argument when we created the object. This can be checked using the `design` function.
Typically we decide the design for the analysis when we create the DESeq2 objects, but it can be modified prior to the differential expression analysis
```{r}
colData(dds)
design(dds) <- ~CellType
```
The function runs a couple of processing steps automatically to adjust for different library size and gene-wise variabiliy, which you can read about in the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#the-deseq2-model).
Firstly apply the median ratio normalisation method to compensate for differences in library sizes
```{r eval=FALSE}
dds <- estimateSizeFactors(dds)
```
estimate the dispersion for each gene
```{r eval=FALSE}
dds <- estimateDispersions(dds)
```
Apply statistical testing based on the negative binomial distribution.
```{r eval=FALSE}
dds <- nbinomWaldTest(dds)
```
Fortunately, there is one convenient function that will apply the three steps
```{r}
de.cellType <- DESeq(dds)
```
The results of the analysis can be obtained using the `results` function and displayed to the screen. Each row is a particular gene measured in the study (i.e. all genes in the organism being studied) and each column reports some aspect of the differential expression analysis for that gene. Note that all genes are reported. At this stage the gene identifiers are not very informative, something we will fix in the next section.
```{r}
results(de.cellType)
```
The output can be converted into a data frame and manipulated in the usual manner. It is recommended to use `dplyr` to manipulate the data frames with the standard set of operations detailed on the [dplyr cheatsheet](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)
- `select` to pick which columns to display
- `filter` to restrict the rows
- `mutate` to add new variables to the data frame
- `arrange` to order the data frame according to values of a column
However, `dpylr` does not like data frame that have rownames. We can use the `rownames_to_column` function from the `tibble` package to add an extra column that contains the Ensembl gene IDs.
The `%>%` symbol refers to the piping operation in R, which is a way of chaining operations together.
```{r}
library(dplyr)
library(tibble)
results.cellType <- as.data.frame(results(de.cellType)) %>%
rownames_to_column("GeneID")
results.cellType
```
We can sort the rows by adjusted p-value and then print the first 10 rows.
```{r}
arrange(results.cellType, padj) %>%
head(n=10)
```
Or we can sort the rows and then write the resulting data frame to a file.
```{r}
arrange(results.cellType, padj) %>%
write.csv("basal_vs_luminal_DESeq_all.csv")
```
```{r}
arrange(results.cellType, padj) %>%
filter(padj < 0.05) %>%
write.csv("basal_vs_luminal_DESeq_DE.csv")
```
> ## Challenge 1 {.challenge}
>
> 1. Re-run the analysis to find differentially-expressed genes between the developmental stages *virgin* and *lactation*
> 2. Write a csv file that contains results for the genes that have a p-value less than 0.05 and a log2 fold change more than 1, or less than -1.
> HINT: So that we don't overwrite our results so far, it may be convenient to create a new `DESeqDataSet` object for the new differential expression analysis.
```{r eval=FALSE}
dds.status <- dds
design(dds.status) <- ......
```
### Changing the direction of the contrast
In this initial analyis `DESeq2` has automatically decided which member of our sample groups to use as our baseline (`basal` in this case) so that the log2 fold changes are reported with a positve value meaning higher expression in `luminal`. If we want to change this behaviour we can change the `contrast` argument in the `results` function
```{r eval=FALSE}
## This should give the same as the table above
results(de.cellType, contrast=c("CellType","luminal","basal"))
## Changing the direction of the contrast
results(de.cellType, contrast=c("CellType","basal","luminal"))
```
If we change to performing differential expression analysis on the `Status` variable then there are various contrasts that can be made; `pregnant` vs `lactation`, `lactation` vs `virgin` etc. When the `results` function is run the table that is displayed is for the contrast `virgin vs lactate`. The `resultsNames` function can tell us which other contrasts we can access.
```{r}
dds.status <- dds
design(dds.status) <- ~Status
de.status <- DESeq(dds.status)
resultsNames(de.status)
results.status <- data.frame(results(de.status))
```
### Intersecting gene lists
A venn diagram is a common way of visualising the overlap between two genelists. We need to create a data frame where each column indicates whether each gene is differentially expressed in a particular contrast or not. To create such columns we can do a logical test on the adjusted p-values from our results tables.
```{r message=FALSE}
venn_data <- data.frame(CellType = results.cellType$padj<0.05,
Status = results.status$padj < 0.05)
library(limma)
vennDiagram(venn_data)
```
> ## Challenge 2 {.challenge}
>
> 1. Use a venn diagram to visualise the overlap in the genes found to be differentially expressed in the `pregnant vs virgin` and ` lactation vs virgin` contrasts.
> 2. How many genes are in common?
### Fitting alternative models to the data
`DESEq2` allows for more complicated models to be fit to the data. For guidance on how to fit more complicated models you can consult the [DESeq2 vignette](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html), the [limma user guide](https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf) or the Bioconductor mailing list.
In particular, DESeq2 allows [multi-factor models](http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#multi-factor-designs) which can account for other sources of variation in the data such as batches or gender.
Lets suppose that we wanted the different between virgin and lactatin individuals, but controlling for `CellType`. The main assumption being that the effect of `Status` is the same regardless of `CellType` The design for such an analysis would be:-
```{r}
dds.mf <- dds
design(dds.mf) <- ~CellType+Status
de.mf <- DESeq(dds.mf)
results.mf <- results(de.mf,contrast=c("Status","lactation","virgin"))
results.mf
```
### Exporting normalized counts
The `DESeq` workflow applies *median of ratios normalization* that accounts for differences in sequencing depth between samples. The user does not usually need to run this step. However, if you want a matrix of counts for some application outside of Bioconductor the values can be extracted from the `dds` object.
```{r}
dds <- estimateSizeFactors(dds)
countMatrix <-counts(dds, normalized=TRUE)
head(countMatrix)
write.csv(countMatrix,file="normalized_counts.csv")
```
```{r eval=FALSE}
save(de.cellType,de.status,de.mf, file="Robjects/DE.Rdata")
```