Skip to content
This repository has been archived by the owner on Apr 6, 2021. It is now read-only.

Dr. Gabor Marth Lab Rotation - Proposed RNA Sequencing Analysis Pipeline

Notifications You must be signed in to change notification settings

jkunisak/rna_sequencing_workflow_marthlab_rotation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

39 Commits
 
 
 
 
 
 

Repository files navigation

RNA Sequencing Pipeline for Bulk and Single Cell Data (Marth Lab Rotation)

This is a pipeline that can use raw fastq files form either cohort-level bulk or single cell rna sequencing. The outputs will include a gene-level differential expression matrix (DESeq2) as well as a list of differentially activated pathways identified through a Gene Set Enrichment Analysis (GSEA). The pipeline also includes steps for removal of adapter features (trimmomatic), demultiplexing in the case of single cell sequencing, genomic/transcriptomic alignment (Hisat2/Salmon respectively), filtering of mapped reads (samtools), and normalization (Salmon/NOISeq/edgeR). Visualizations will be produced to illustrate data quality, verification of proper normalization, differential expression, and GSEA.

The basic workflow of the pipeline is depicted below: alt text

Getting Started

Installing Necessary Software

To get started (assuming the user works in the Marth lab), a few tools need to be installed. The pipeline requires that the tools be installed in a software folder that exists in the home directory (~). The fastqc, hisat2, and samtoolsmodules will be used and loaded automatically when running the pipeline.

The first tool that needs to be installed is trimmomatic to remove adapter sequences from single cell rna sequencing data. To install trimmomatic, please use the following command:

First make the software directory if it doesn't exist:

mkdir ~/software

Next, install trimmomatic (v0.38) in the software directory. Use the following commands to download the program from source and unzip the file:

cd ~/software
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.38.zip
unzip Trimmomatic-0.38.zip

The second tool that needs to be installed is the most up-to-date version of salmon. Please refer to https://salmon.readthedocs.io/en/latest/building.html#installation for more specific installation instructions.

cd ~/software
wget https://github.com/COMBINE-lab/salmon/releases/download/v0.11.2/salmon-0.11.2-linux_x86_64.tar.gz
tar -xvf salmon-0.11.2-linux_x86_64.tar.gz
cd salmon_reference
mkdir build
cd build
make
make install

Defining the R library

Follow the instructions provided here to create a R library folder. Once you create the Rlibs folder, copy the contents of the Rlibs directory provided on this GitHub page to that newly created directory.

cp $github_Rlibs_directory $new_Rlibs_directory

A list of the necessary packages include:

  1. data.table
  2. DESeq2 - differential expression
  3. stringr
  4. Rsubread - GSEA
  5. edgeR - normalization, differential expression
  6. tximport
  7. GSEABase - GSEA
  8. GSVA - GSEA, differential expression
  9. genefilter - normalization
  10. limma - GSEA, differential expression
  11. MSigDB ==> requires dplyr to install - GSEA
  12. biomaRt
  13. org.Hs.eg.db
  14. NOISeq - normalization

Establish Directory Structure

The next step is to orgnaize the directories. Within the home directory ($homeDIR), there must exist a parent directory that contains a directory of fastq files, an output directory, a directory for the reference files, and a scripts directory that houses the scripts used in the pipeline.

  1. Copy the raw fastq files into the fastq_files directory (see below)
  2. Subdirectories in the output folder will be made automatically by the pipeline
  3. Files in the reference directory include the reference genome, transcriptome, and index files for alignment. Please see commands below to obtain and index these files.
  4. scripts directory contains all of the R scripts that the pipeline uses. Get these files from the GitHub repo.

See the commands below to make and supply the necessary files to the directory:

mkdir ~/$homeDIR/fastq_files
mkdir ~/$homeDIR/output
mkdir ~/$homeDIR/reference
mkdir ~/$homeDIR/scripts

## Get the reference genome file (build 37)
mkdir ~/$homeDIR/reference/hg19_fasta
cd ~/$homeDIR/reference/hg19_fasta
echo {1..22} | parallel --eta -j+0 'rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr{}.fa.gz ~/$homeDIR/reference/hg19_fasta/'
cat
 ~/$homeDIR/reference/hg19_fasta/*.fa.gz > ~/$homeDIR/reference/hg19_fasta/hg19.fa
rm -rf ~/$homeDIR/reference/hg19_fasta/*.fa.gz

## Index the reference genome file
samtools faidx ~/$homeDIR/reference/hg19_fasta/hg19.fa

## Get the reference transcriptome file (build 37: gtf and cdna fasta)
mkdir ~/$homeDIR/reference/cdna_fasta
cd ~/$homeDIR/reference/cdna_fasta
rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-75/fasta/homo_sapiens/cdna/*
gunzip *.gz

cd ~/$homeDIR/reference/GRCH37_gtf
rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

## Build the hisat2 index file
module load hisat2
mkdir $homeDIR/reference/hisat2
hisat2_extract_splice_sites.py $homeDIR/reference/GRCh37_gtf/Homo_sapiens.GRCh37.75.gtf > $homeDIR/reference/hisat2/GRCh37.splicesite
hisat2_extract_exons.py $homeDIR/reference/GRCh37_gtf/Homo_sapiens.GRCh37.75.gtf > $homeDIR/reference/hisat2/GRCh37.exon
hisat2-build --ss $homeDIR/reference/hisat2/GRCh37.splicesite --exon $homeDIR/reference/hisat2/GRCh37.exon ~/scRNA-seq_Expression_Analysis/reference/hg19_fasta/hg19.fa $homeDIR/reference/hisat2/test

## Build the salmon index file
mkdir $homeDIR/reference/salmon_reference
~/software/salmon-0.11.1-linux_x86_64/bin/salmon index -t $homeDIR/reference/cdna_fasta/Homo_sapiens.GRCh37.75.cdna.all.fa -i $homeDIR/reference/salmon_reference/salmon_transcript_index --type quasi -k 31

Directory Structure:
~/
|-- software (extracted files)
|-------|-- Trimmomatic-0.38
|-------|-- salmon-0.11.1-linux_x86_64
|-- scRNA-seq_Expression_Analysis (Set $homeDIR to this directory)
|-------|-- fastq_files
|-------|-------|-- This directory contains all of the raw fastq files with ${sampleName}_1.fq.gz format
|-------|-- output
|-------|-------|-- fastqc_output
|-------|-------|-- trimmed_fastq_files
|-------|-------|-- trimmed_fastqc
|-------|-------|-- hisat2_alignment
|-------|-------|-- hisat2_alignment_coverage_metrics
|-------|-------|-- samtools_flagstat
|-------|-------|-- featureCounts
|-------|-------|-- salmon
|-------|-------|-- DESeq2
|-------|-- reference
|-------|-------|-- cdna_fasta
|-------|-------|-- GRCH37_gtf
|-------|-------|-- hg19_fasta
|-------|-------|-- hisat2
|-------|-------|-- salmon_reference
|-------|-- scripts
|-------|-------|-- adapter_detection.R
|-------|-------|-- create_batch_file.R
|-------|-------|-- Rsubread_featureCounts.R
|-------|-------|-- NOISeq.R
|-------|-------|-- Rlibs (R library directory with all of the installed packages)

Required Input Parameters

  1. homeDIR: defines the main directory that contains the fastq_files, output, reference, and scripts folders. The parent directory to the homeDIR should contain the software folder.
  2. trimmomaticFastaPath: defines the fasta file that trimmomatic uses to remove adapter sequences (if necessary). Defaults to NULL. Options for this parameter include the fasta files provided in the trimmomatic/adapters/ directory within the software folder. The available files (n=6) are listed below:
ls ~/software/Trimmomatic-0.38/adapters
NexteraPE-PE.fa  TruSeq2-PE.fa  TruSeq2-SE.fa  TruSeq3-PE-2.fa  TruSeq3-PE.fa  TruSeq3-SE.fa

Running the Tests

Example command:

bash ~/$homeDIR/scripts/draft1.sh $homeDIR NULL

Useful Papers and Resources to Better Understand RNA Sequencing

https://hemberg-lab.github.io/scRNA.seq.course/tabula-muris.html
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0190152
https://github.com/griffithlab/rnaseq_tutorial
https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

About

Dr. Gabor Marth Lab Rotation - Proposed RNA Sequencing Analysis Pipeline

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published