forked from bioinformatics-core-shared-training/RNAseq-R
-
Notifications
You must be signed in to change notification settings - Fork 2
/
getting-raw-reads.Rmd
222 lines (163 loc) · 7.99 KB
/
getting-raw-reads.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
---
title: "RNA-seq analysis in R"
author: "Stephane Ballereau, Mark Dunning, Oscar Rueda, Ashley Sawle"
date: "February 2020"
output:
html_notebook:
toc: yes
toc_float: yes
html_document:
toc: yes
toc_float: yes
layout: page
subtitle: Obtaining and aligning RNA-seq reads from public repositories
bibliography: ref.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,eval=FALSE)
```
## Set up a database of SRA runs
Raw reads from NGS experiments tend to be distributed through the Short Read Archive (SRA). The `SRAdb` Bioconductor package can be used to query and download files that are hosted in SRA. More information can be found in the [package vignette](http://bioconductor.org/packages/release/bioc/vignettes/SRAdb/inst/doc/SRAdb.pdf).
```{r eval=FALSE}
vignette("SRAdb")
```
Firstly, we need to download a database file. This is a large file and could take some time (`r round(file.size("SRAmetadb.sqlite")/1000000000,2)` )Gb.
```{r}
library(SRAdb)
sqlfile <-'SRAmetadb.sqlite'
if(!file.exists('SRAmetadb.sqlite')) sqlfile <<- getSRAdbFile()
sra_con <- dbConnect(SQLite(),sqlfile)
```
## Obtain information for a particular experiment
We can now query what information is available for a particular experiment; in this case `SRP045534`. This should list the samples that are available and their respective identifiers.
```{r}
sraInf <- getSRAinfo("SRP045534",sra_con, sraType="sra")
sraInf
```
## Download the set of sra files
We can now directly download each `sra` file. The `sra` file is SRA's own archive format, but we can extract the raw reads in the more common `.fastq` format in the next step.
Here, `sapply` is a convenient way of repeating the same operation for a vector of arguments. In this case we want to run the `getSRAfile` function with different file names.
```{r}
sapply(sraInf$run, function(x) try(getSRAfile(x,sra_con, fileType="sra"),silent=TRUE))
```
## Extracting fastq files
Using the [sra-toolkit](https://www.ncbi.nlm.nih.gov/sra) command-line utility from NCBI we can generate the `fastq` files from these archive files. We can do this within a Terminal (i.e. not within RStudio) with the following, making sure your working directory contains the `.sra` files.
```{bash}
for sra in *.sra
do
fastq-dump $sra
done
```
After each fastq file has been extracted, you should see a message to report have many reads (spots) are contained in the file
## Quality assessment of reads
The [fastqc](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) is recommened for a preliminary assessment of the read quality. However, caution should be exercised when interpreting the results as the reports are not specifically-tailored for RNA-seq. Some sections are know to flag-up warning or error messages for perfectly fine RNA-seq experiments.
We can run this at the command line:-
```{bash cache=TRUE,eval=FALSE}
for fq in *.fastq
do
fastqc $fq
done
```
## Downloading the reference genome
To align to the `mm10` genome, we will first download the reference genome from UCSC. We have to download each individual chromosome separately, and then join together into a single file.
```{bash eval=FALSE}
wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz' -O chromFa.tar.gz
gunzip chromFa.tar.gz
tar xvf chromFa.tar
cat *.fa > mm10.fa
rm chr*.fa
rm chromFa.tar.gz
```
## Alignment using bowtie
Firstly, we need to build an *index* file from the reference genome that we have downloaded:-
```{bash eval=FALSE}
bowtie2-build mm10.fa mm10
```
In practice, we would probably run the alignment of each sample in *parallel* using the high-performance cluster. However, for illustration purposes, we give the script that will align each sample individually.
```{bash eval=FALSE}
bowtie2 -x mm10 -U SRR1552444.fastq -S SRR1552444.sam
samtools view -bS SRR1552444.sam > SRR1552444.bam
samtools sort SRR1552444.bam -o SRR1552444.sorted.bam
samtools index SRR1552444.sorted.bam
bowtie2 -x mm10 -U SRR1552445.fastq -S SRR1552445.sam
samtools view -bS SRR1552445.sam > SRR1552445.bam
samtools sort SRR1552445.bam -o SRR1552445.sorted.bam
samtools index SRR1552445.sorted.bam
bowtie2 -x mm10 -U SRR1552446.fastq -S SRR1552446.sam
samtools view -bS SRR1552446.sam > SRR1552446.bam
samtools sort SRR1552446.bam -o SRR1552446.sorted.bam
samtools index SRR1552446.sorted.bam
bowtie2 -x mm10 -U SRR1552447.fastq -S SRR1552447.sam
samtools view -bS SRR1552447.sam > SRR1552447.bam
samtools sort SRR1552447.bam -o SRR1552447.sorted.bam
samtools index SRR1552447.sorted.bam
bowtie2 -x mm10 -U SRR1552448.fastq -S SRR1552448.sam
samtools view -bS SRR1552448.sam > SRR1552448.bam
samtools sort SRR1552448.bam -o SRR1552448.sorted.bam
samtools index SRR1552448.sorted.bam
bowtie2 -x mm10 -U SRR1552449.fastq -S SRR1552449.sam
samtools view -bS SRR1552449.sam > SRR1552449.bam
samtools sort SRR1552449.bam -o SRR1552449.sorted.bam
samtools index SRR1552449.sorted.bam
bowtie2 -x mm10 -U SRR1552450.fastq -S SRR1552450.sam
samtools view -bS SRR1552450.sam > SRR1552450.bam
samtools sort SRR1552450.bam -o SRR1552450.sorted.bam
samtools index SRR1552450.sorted.bam
bowtie2 -x mm10 -U SRR1552451.fastq -S SRR1552451.sam
samtools view -bS SRR1552451.sam > SRR1552451.bam
samtools sort SRR1552451.bam -o SRR1552451.sorted.bam
samtools index SRR1552451.sorted.bam
bowtie2 -x mm10 -U SRR1552452.fastq -S SRR1552452.sam
samtools view -bS SRR1552452.sam > SRR1552452.bam
samtools sort SRR1552452.bam -o SRR1552452.sorted.bam
samtools index SRR1552452.sorted.bam
bowtie2 -x mm10 -U SRR1552453.fastq -S SRR1552453.sam
samtools view -bS SRR1552453.sam > SRR1552453.bam
samtools sort SRR1552453.bam -o SRR1552453.sorted.bam
samtools index SRR1552453.sorted.bam
bowtie2 -x mm10 -U SRR1552454.fastq -S SRR1552454.sam
samtools view -bS SRR1552454.sam SRR1552454.bam
samtools sort SRR1552454.bam -o SRR1552454.sorted.bam
samtools index SRR1552454.sorted.bam
bowtie2 -x mm10 -U SRR1552455.fastq -S SRR1552455.sam
samtools view -bS SRR1552455.sam > SRR1552455.bam
samtools sort SRR1552455.bam -o SRR1552455.sorted.bam
samtools index SRR1552455.sorted.bam
```
## Renaming to be consistent with GEO
The files we have just created are named according to their SRA identifier. However, these names are not very useful for analysis. The Gene Expression Omnibus (GEO) entry for the dataset has the mapping information between SRA and sample identifers.
```{r message=FALSE}
library(GEOquery)
tmp <- getGEO("GSE60450")
gseInf <- pData(tmp[[1]])
gseInf
```
We obtain a new name for each bam file by joining the metadata from SRA and GEO.
```{r}
library(dplyr)
sraInf <- mutate(sraInf, bam=paste0(run, ".sorted.bam"))
gseInf <- mutate(gseInf, experiment = basename(as.character(supplementary_file_2)),
newbam = gsub("Sample name: ","", description),
newbam = gsub("-",".",newbam,fixed=TRUE),
newbam = paste0(newbam, ".bam"))
gseInf
combinedInf <- left_join(gseInf, sraInf, by="experiment")
combinedInf %>% select(description,description.1,experiment,bam,newbam)
combinedInf
```
The base R function `file.symblink` can be used to create *symbolic links* from one file to another; thus retaining the original file name and avoid creating a complete copy of each file. Such links are often used in NGS data when we don't want to create copies of files that are potentially rather large. With this approach, when we want to access `MCL1.LA.bam` (for example), the file system will know to actually access `SRR1552444.sorted.bam`.
```{r}
for(i in seq_along(combinedInf$bam)){
file.symlink(combinedInf$bam[i], combinedInf$newbam[i])
file.symlink(paste0(combinedInf$bam[i],".bai"), paste0(combinedInf$newbam[i],".bai"))
}
list.files()
```
## Alignment using Rsubread
Alignment could also be performed using `Rsubread` as we did in the first practical session. The only difference is to use the entire `mm10` genome and point to the newly-downloaded fastq files.
```{r eval=FALSE}
library(Rsubread)
buildindex("mm10",reference="mm10.fa")
fastqfiles <- list.files(pattern=".fastq")
align("mm10",readfile1=fastqfiles)
```