Aloft is a Bash script for running the excellent Arriba RNA-Seq fusion detector in parallel. It takes charge of:
- Downloading and compiling an Arriba release (if required).
- Downloading references and annotation (via Arriba's `download_references.sh`).
- Producing STAR indexes if need be.
- Creating the known fusion events file from the COSMIC [Complete Fusion Export](https://cancer.sanger.ac.uk/cosmic/download), and fixing non-HGNC-compliant gene symbols.
- Running the STAR aligner in parallel on a set of samples via GNU parallel.
- Running Arriba in parallel - as above.
- Running SAMtools in parallel - enables viewing of STAR derived BAM with IGV.
- Running Arriba's outstanding plotting script `draw_fusions.R` in parallel over all samples, as with previous stages.
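
The parallel stages listed above can be pictured as one command line per sample handed to GNU parallel. The sketch below is illustrative only and is not taken from `aloft.sh`: the function name `emit_jobs`, the placeholder command `run_stage` and the variable `JOBS` are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build one command line per sample, then let GNU
# parallel execute them. `run_stage` is a placeholder, not a real tool.

# Print one job line per sample ID found in column 1 of a tab delimited sheet.
emit_jobs() {
    local sheet=$1 outdir=$2
    local id rest
    while IFS=$'\t' read -r id rest || [ -n "$id" ]; do
        [ -z "$id" ] && continue            # skip blank lines
        printf 'run_stage %s %s/%s\n' "$id" "$outdir" "$id"
    done < "$sheet"
}

# Usage (assumes GNU parallel is installed; when given no command it
# treats each line of its input as a command to run):
#   JOBS=4
#   emit_jobs samples.tsv out | parallel --jobs "$JOBS"
```

Generating the job lines separately from running them makes it easy to inspect what would be executed before committing cores to it.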
This script was inspired by the demo script `run_arriba.sh` supplied with Arriba. Aloft implements the recommended Arriba workflow. The only difference is that the STAR alignment is written to disk rather than piped into Arriba, so that it can subsequently be sorted, indexed and saved for manual inspection of fusions in, say, IGV.
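
Writing the alignment to disk is what makes the later sort and index step possible. A rough sketch of that step for a single sample follows. The file names, the `THREADS` variable and the `DRY_RUN` switch are assumptions made for illustration; only the standard `samtools sort` and `samtools index` subcommands are used.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of coordinate-sorting and indexing one STAR BAM so
# that it can be opened in IGV. File names here are illustrative only.

sort_and_index() {
    local in_bam=$1 out_bam=$2
    local threads=${THREADS:-4}
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show what would be run.
        printf 'samtools sort -@ %s -o %s %s\n' "$threads" "$out_bam" "$in_bam"
        printf 'samtools index %s\n' "$out_bam"
    else
        samtools sort -@ "$threads" -o "$out_bam" "$in_bam" &&
            samtools index "$out_bam"
    fi
}

# Usage: DRY_RUN=1 sort_and_index sample_1.bam sample_1.sorted.bam
```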
- Linux or *nix like OS, with working make, Wget, Bash, GNU sed, GNU gawk, GNU grep, gzip and Perl.
  - Tested on Ubuntu 18.04 LTS
- STAR aligner
- SAMtools
- GNU Parallel
- R and Bioconductor, specifically the packages GenomicRanges, circlize and GenomicAlignments.
- The COSMIC Complete Fusion Export file.
- Some RNA-Seq FASTQ files to analyse.
`aloftConfig.sh` contains the various settings used during execution, along with comments. This Bash script is sourced by the main `aloft.sh`, so `aloft.sh` inherits the variables defined there. Please review it before launching an analysis run.
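
Sourcing means the config file is executed in the same shell as the caller, so plain variable assignments become visible without being exported. Below is a minimal sketch of the mechanism, using a throwaway config file and made-up variable names (`THREADS`, `GENOME_BUILD`) that are not necessarily the ones `aloftConfig.sh` defines.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: a main script picking up settings from a sourced file.

# Write a small stand-in config file (the variable names are illustrative).
config=$(mktemp)
cat > "$config" <<'EOF'
THREADS=16
GENOME_BUILD="GRCh38"
EOF

# `.` (also spelled `source` in Bash) runs the file in the current shell,
# so the variables it assigns are now defined here.
. "$config"
echo "Using $THREADS threads on $GENOME_BUILD"
rm -f "$config"
```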
Samples are defined in a tab delimited flat file taking the form of:
sample_1 sample_1_R1.fastq.gz sample_1_R2.fastq.gz
sample_2 sample_2_R1.fastq.gz sample_2_R2.fastq.gz
sample_3 sample_3_R1.fastq.gz sample_3_R2.fastq.gz
Here the first column defines the sample ID. If more than one pair of FASTQ files exists for a sample, simply add them as extra tab delimited columns. The sample sheet should contain only the file names, not their path, as the path is given separately on the command line (see below).
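
One way to read such a row is to take the first column as the ID and walk the remaining columns two at a time, prefixing each file name with the FASTQ directory. The sketch below is an assumption about how this could be done, not Aloft's actual parser; `parse_row` and its output format are invented for illustration. It joins multiple files with commas because STAR's `--readFilesIn` accepts comma separated lists, with the mate 1 files first and the mate 2 files second.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: turn one sample sheet row into an ID plus
# comma separated R1 and R2 lists rooted at a FASTQ directory.

parse_row() {
    local row=$1 fq_dir=$2
    local -a cols
    local id r1="" r2="" i
    IFS=$'\t' read -r -a cols <<< "$row"
    id=${cols[0]}
    # Columns 1 and 2 are the first pair, 3 and 4 the second, and so on.
    for (( i = 1; i + 1 < ${#cols[@]}; i += 2 )); do
        r1+="${r1:+,}$fq_dir/${cols[i]}"
        r2+="${r2:+,}$fq_dir/${cols[i+1]}"
    done
    printf '%s %s %s\n' "$id" "$r1" "$r2"
}

# Usage: parse_row "$(printf 's1\ta_R1.fastq.gz\ta_R2.fastq.gz')" /data/fastq
```

The second and third output fields could then be passed to STAR as the two arguments of `--readFilesIn`.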
Having made such a file and reviewed the settings in `aloftConfig.sh`, you can run the pipeline like so:
aloft.sh <tab delimited sample sheet> <input FASTQ path> <output dir for run>
Out of the box, Aloft is configured to use 16 cores and will consume about 64 GB of RAM during execution. The core count of the various stages, the number of concurrent jobs, the RAM allocated and so on can be adjusted in `aloftConfig.sh`.
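
The trade-off between cores, threads per job and memory can be worked through with simple integer arithmetic. The sketch below is illustrative: the function name, its parameters and the per-job RAM figure are assumptions, not values read from `aloftConfig.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: how many concurrent jobs fit a core and RAM budget.

# Print the largest job count allowed by both the core and the RAM limit.
max_jobs() {
    local total_cores=$1 threads_per_job=$2 total_ram_gb=$3 ram_per_job_gb=$4
    local by_cores=$(( total_cores / threads_per_job ))
    local by_ram=$(( total_ram_gb / ram_per_job_gb ))
    local jobs=$(( by_cores < by_ram ? by_cores : by_ram ))
    (( jobs < 1 )) && jobs=1        # always allow at least one job
    echo "$jobs"
}

# 16 cores and 64 GB in total, 4 threads and an assumed 32 GB per job
max_jobs 16 4 64 32    # prints 2
```

In this example memory, not core count, is the limiting factor: the cores alone would allow four jobs, but only two fit in RAM.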