# UEA BCRE pipelines
![logo](misc/logo.png)
<br />
<!-- TABLE OF CONTENTS -->
## Table of Contents
<br />
* [Introduction](#Introduction)
- [Quality Control](#Quality-Control)
- [Pipelines](#Pipelines-available)
* [Usage](#Usage)
- [Software Requirements](#Software-Requirements)
- [Example](#Example)
<br />
## Introduction
<br />
At UEA Bob Champion genomics we have developed a number of pipelines specialised for processing sequencing data. To use them, download this GitHub repository, move your fastq files into the input directory, and choose which analyses to perform by editing the "XXX.config" file. The pipelines are written using the Nextflow workflow manager.
<br />
Irrespective of the pipelines chosen, all workflows perform similar analyses:
<br />
![figure-1](misc/figure1.png)
<br />
### Quality Control
<br />
The quality control pipeline contains a static backbone that is present for all data types (black). Depending on the type of data, extra tools are added onto this backbone (shown in figure 2 below). For example, if you are planning to process exome DNA-seq data for somatic mutations, select "exome-somatic_QC" in the config file so that the pipeline knows to include tools that measure hybridisation stats.
<br />
![figure-2](misc/figure2.png)
### Pipelines available
After alignment and QC, the following pipelines are available to process your samples. For more information about the steps involved in each, click on the links below.
DNA-seq:
- Exome-germline (Freebayes and GATK HaplotypeCaller)
- Exome-somatic (Sanger-cgpWXS and GATK Mutect2)
- Whole-genome germline (Freebayes and GATK HaplotypeCaller)
- Whole-genome somatic (Sanger-cgpWGS and GATK Mutect2)
- Whole-genome structural variants (Sanger-XXX)
RNA-seq:
- mRNA RNA-seq (Hisat2)
## Usage
### Software Requirements
As a minimum you will require the following dependencies:
- Singularity (v3+)
- Nextflow (v19+)
- Python 3+ (with pyYAML)
- git and git account
- An account on the HPC
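The command-line dependencies above can be checked before a first run. The following is a minimal sketch, not part of the repository: it assumes the executables are named `singularity`, `nextflow`, `python3` and `git` on your system, and it only checks that they are present, not that they meet the minimum versions listed.

```python
import importlib.util
import shutil

# Executable names are assumptions; the list above names the tools, not the commands.
REQUIRED_TOOLS = ["singularity", "nextflow", "python3", "git"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def has_pyyaml():
    """Return True if the PyYAML package (imported as `yaml`) is installed."""
    return importlib.util.find_spec("yaml") is not None


if __name__ == "__main__":
    for tool in missing_tools():
        print(f"missing: {tool}")
    if not has_pyyaml():
        print("missing: pyYAML (Python package)")
```

On an HPC, some of these tools may only become available after loading the relevant environment modules, so run the check in the same session you intend to launch the pipeline from.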
### Input names
Currently, the pipeline only accepts fastq.gz input files, and their names must follow a set format for the pipeline to run smoothly. The correct format is shown below with an example:
![figure-3](misc/figure3.png)
The entries can be named anything; it is the dashes that matter, as they separate the fields in the pipeline. Also, if one sample was sequenced on different lanes, ensure that the sample field is the same in both file names.
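To illustrate the dash-separated naming, here is a small sketch that splits file names into fields and groups files that share a sample field. It is not the pipeline's own parser. The position of the sample field (`sample_index`) and the example file names in the usage note are hypothetical; figure 3 defines the real field order.

```python
from collections import defaultdict
from pathlib import Path


def split_fields(filename):
    """Strip the .fastq.gz suffix and split the remaining name on dashes."""
    name = Path(filename).name
    suffix = ".fastq.gz"
    if not name.endswith(suffix):
        raise ValueError(f"not a fastq.gz file: {filename}")
    return name[: -len(suffix)].split("-")


def group_by_sample(filenames, sample_index=0):
    """Group file names by the field at `sample_index` (an assumed position)."""
    groups = defaultdict(list)
    for filename in filenames:
        fields = split_fields(filename)
        groups[fields[sample_index]].append(filename)
    return dict(groups)
```

For example, with the made-up names `sampleA-L001-R1.fastq.gz` and `sampleA-L002-R1.fastq.gz`, both files fall into the same `sampleA` group. If a sample is spelled differently across lanes, the files land in separate groups, which is a quick way to spot naming mistakes before running the pipeline.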
### Tutorial
#### 1. Download repo
To test the pipeline, clone this repository to your working directory using:
```
git clone https://github.com/R-Cardenas/pipelines_clean.git
```
This will download all the scripts, along with some test data placed directly in the input folder so that the pipeline can be tested. (The input folder is also where your own fastq files should go.)
#### 2. Modify the master config file
In the home directory of the repo there is a file called master_user_config.yaml. Edit this file to configure the pipeline.
Below is a simplified version of the config file.
Note: hpc 'no' has not yet been configured. Pipelines can only be run on the HPC for now.
```
samples: "dna-exome-germline"
genome_assemble: "hg38"
hpc: "yes"
merged_lanes: "no"
pipelines: "freebayes gatk_haplotypecaller"
```
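Before launching a run, the edited config can be sanity-checked with pyYAML. This is a sketch under assumptions rather than the pipeline's own validation: the required keys are taken from the simplified example above (the full file may contain more), and the function names are hypothetical.

```python
# Keys taken from the simplified example config; the real file may have more.
REQUIRED_KEYS = ["samples", "genome_assemble", "hpc", "merged_lanes", "pipelines"]


def load_config(path="master_user_config.yaml"):
    """Read the YAML config into a dict (requires the pyYAML package)."""
    import yaml  # imported here so the checks below work without pyYAML

    with open(path) as handle:
        return yaml.safe_load(handle)


def check_config(config):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    if "hpc" in config and config["hpc"] != "yes":
        problems.append("hpc must be 'yes': running off the HPC is not configured yet")
    return problems


def selected_pipelines(config):
    """Split the space-separated `pipelines` value into a list of names."""
    return config.get("pipelines", "").split()


if __name__ == "__main__":
    config = load_config()
    for problem in check_config(config):
        print(problem)
    print("pipelines:", selected_pipelines(config))
```

Keeping the values quoted, as in the example, matters here: an unquoted `yes` is read by pyYAML as the boolean `True` rather than the string `"yes"`.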
# Technical Information
First, it is recommended to read how a basic Nextflow pipeline is built (https://www.nextflow.io/docs/latest/getstarted.html).
The pipelines we currently have are shown below. Click on each link to access the corresponding README. For the user README, which explains how to use the pipelines, please click here (link).
DNA-seq:
Mapping:
- cgpMAP ([link](DNAseq/DNAseq_README.md))
Somatic variant discovery Exome:
- cgpWXS (link)
- GATK mutect2 (link)
Somatic variant discovery WGS:
- cgpWGS (link)
- GATK mutect2 (link)
Germline variant discovery:
- Nextflow Sarek (exome/WGS; link)
RNA-seq:
- Nextflow RNA-seq pipeline (link)