forked from bioinformatics-core-shared-training/RNAseq-R
-
Notifications
You must be signed in to change notification settings - Fork 2
/
setup.Rmd
189 lines (117 loc) · 6.98 KB
/
setup.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
---
title: "Course Setup"
author: "Mark Dunning"
date: "February 2020"
output:
html_notebook:
toc: yes
toc_float: yes
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Course Setup for Day 1
In this course we will demonstrate how to run some standard analysis tools for RNA-seq data. The majority of Bioinformatics tools are built with a *command-line* environment in mind, rather than Windows or Mac OSX. To simplify the installation of these tools, we are providing some resources on the *cloud* that you can log into for the duration of the course.
## 1. Create an account at InstanceHub
*InstanceHub* is a tool created at The University of Sheffield for creating cloud resources for computing practicals. You will need to go to [instancehub.com](instancehub.com) and create an account. **Make sure that you specify the same email address that you signed-up to the course**.
## 2. Launch the Lab
Choose the menu option *View my labs* on the left-hand menu. The lab **Introduction to the Command Line for Bioinformatics** should be visible. Click the *Participate* button.
![](images/instance-hub1.png)
## 3. Connect to the lab
Press the *Start Lab* (green) button and wait whilst the lab loads...
![](images/instance-hub2.png)
Once *Start Lab* has been replaced by *Disconnect*, the *Connection Information* tab will be updated with an IP address etc.
![](images/instance-hub3.png)
Enter the following address in your web browser
Replacing **IP_ADDRESS** with the numbers next to **Instance IP** in the *Connection Information* box.
```
http://IP_ADDRESS:6080
```
e.g.
```
http://3.8.149.23:6080
```
**Do not click Disconnect**
## 4. Open a termninal window
You should be presented with a unix desktop environment that we can use to learn about the command-line and tools for processing RNA-seq data.
To open a *Terminal* (in order to enter Unix commands), there is a "Start menu" in the bottom-left corner from which you can select *System Tools* and then *LXTerminal*
![](images/terminal-launch.png)
## 5. Change user
Finally, in order to follow the materials we need to change the user that we logged in as from `root` to `dcuser`.
Type the command into the terminal window **exactly** as it appears below
```
su - dcuser
```
![](images/change-user.png)
# Setup on your own machine
Both Mac OSX and Windows 10 have the ability to run some of the commands presented in this course to navigate around a file system, copy files and list directories. However, you may prefer to practice in a "safe" environment, such as that used during the workshop. Furthermore, the NGS tools presented may be difficult to install.
You can launch the same computing environment on your own machine using a tool called *Docker*.
Docker is an open platform for developers to build and ship applications, whether on laptops, servers in a data center, or the cloud.
- Or, it is a (relatively) painless way for you to install and try out Bioinformatics software.
- You can think of it as an isolated environment inside your exising operating system where you can install and run software without messing with the main OS
+ Really useful for testing software
+ Clear benefits for working reproducibly
- Instead of just distributing the code used for a paper, you can effectively share the computer you did the analysis on
- For those of you that have used Virtual Machines, it is a similar concept
## Installing Docker
### Mac
- [Mac OSX - 10.10.3 or newer](https://www.docker.com/docker-mac)
- [Older Macs](https://download.docker.com/mac/stable/DockerToolbox.pkg)
### Windows
- [Windows 10 Professional](https://www.docker.com/docker-windows)
- [Other Windows](https://download.docker.com/win/stable/DockerToolbox.exe)
Once you have installed Docker using the insructions above, you can open a terminal (Mac) or command prompt (Windows) and type the following to run the Day 1 environment
```
docker run --rm -p 6080:80 markdunning/rnaseq-toolbox
```
Entering the address in your web browser should display the environment
```
http://localhost:6080
```
### Using the environment to analyse your own data
With the default settings, the computing environment is isolated from your own laptop; we can neither bring files that we create back to our own OS, or analyse our own data.
However, adding an `-v` argument allows certain folders on your own OS to be visible within the enviroment.
Assuming the files I want to analyse are to be found in the folder `PATH_TO_FASTQ`, the following command would map that directory to the folder `/home/dcuser/rnaseq_data`
```
docker run --rm -p 6080:80 -v /PATH_TO_FASTQ/:/home/dcuser/rnaseq_data markdunning/rnaseq-toolbox
```
At the terminal, we should be able to see our files with the `ls` command
```
ls /home/dcuser/rnaseq_data
```
### Analysis on the UoS cluster
As described above, `docker` can be used to run the environment presented in this course on a personal laptop. However, there are some security implications of docker that prohibits it being installed on a high-performance computing system. An alternative is `singularity`, which allows computing environments to be distributed as a single image file. We have created such a singularity image and made it available on sharc (The University of Sheffield's HPC).
Firstly, login to sharc in the usual manner and then open an interactive shell.
```{bash eval=FALSE}
qrsh
```
The singularity image is available at the following location:-
```{bash eval=FALSE}
ls /shared/bioinformatics_core1/Shared/software/singularity/command_line_20200212.sif
```
**The commands presented in the Introduction to Shell section (e.g. `ls`, `cd`, `pwd`...) are already available when you first login to `sharc`**. However, the tools specific to NGS analysis will require you to add the path to the singularity image before the command.
For example, to run `fastqc` you can navigate to your directory containing your fastq files with the usual `cd` command.
```{bash eval=FALSE}
cd /path/to/your/fastq/files
singularity exec /shared/bioinformatics_core1/Shared/software/singularity/command_line_20200212.sif fastqc *.fastq.gz
```
Other commands that require you to list give the path to the singularity image are as follows:-
```{bash eval=FALSE}
singularity exec /shared/bioinformatics_core1/Shared/software/singularity/command_line_20200212.sif multiqc
```
```{bash eval=FALSE}
singularity exec /shared/bioinformatics_core1/Shared/software/singularity/command_line_20200212.sif hisat2-build
```
```{bash eval=FALSE}
singularity exec /shared/bioinformatics_core1/Shared/software/singularity/command_line_20200212.sif hisat2
```
```{bash eval=FALSE}
singularity exec /shared/bioinformatics_core1/Shared/software/singularity/command_line_20200212.sif samtools
```
```{bash eval=FALSE}
singularity exec /shared/bioinformatics_core1/Shared/software/singularity/command_line_20200212.sif featureCounts
```
```{bash eval=FALSE}
singularity exec /shared/bioinformatics_core1/Shared/software/singularity/command_line_20200212.sif salmon
```