By the end of this class, you should be able to:
- identify appropriate methods and quality assessment for mapping RNAseq reads
- select appropriate assemblies and annotations for RNAseq experiments
- differentiate between gene and transcript level analyses
- identify approaches to quantifying gene expression from mapped reads
FIXME: how do some of the different read mapping tools differ? are there other tools we should mention?
-
TopHat
- TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
- Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks
- EMBL-EBI bioinformatics training
- TopHat aligns in two steps: 1) unspliced reads are mapped to locate exons (with Bowtie) 2) unmapped reads are then split and aligned independently to identify exon junctions
-
BWA
- Fast and accurate short read alignment with Burrows–Wheeler transform
- BWA GitHub:BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the support of long reads and chimeric alignment, but BWA-MEM, which is the latest, is generally recommended as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads
-
STAR2
- STAR: ultrafast universal RNA-seq aligner
- Mapping RNA-seq Reads with STAR:
- One of the foundational steps in the RNA-seq data analysis is mapping (alignment) of the large sets of sequenced reads to a reference genome. This task presents more challenges than alignment of genomic DNA reads because RNA sequences are often spliced, i.e. derived from the non-contiguous regions of the genome.
- In addition to detecting of annotated and novel splice junctions, STAR is capable of discovering more complex RNA sequence arrangements, such as chimeric and circular RNA. STAR can align spliced sequences of any length with moderate error rates providing scalability for emerging sequencing technologies
-
Salmon (and other pseudoalignment methods)
- Combine lab - Salmon Overview
- Salmon uses new algorithms (specifically, coupling the concept of quasi-mapping with a two-phase inference procedure) to provide accurate expression estimates very quickly (i.e. wicked-fast) and while using little memory
- Combine lab Salmon GitHub Repo
- Salmon achieves its accuracy and speed via a number of different innovations, including the use of selective-alignment (accurate but fast-to-compute proxies for traditional read alignments), and massively-parallel stochastic collapsed variational inference
- Combine lab - Salmon Overview
-
Other pseudoaligners
- Kallisto
- program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads
- Kallisto
-
Comparing read mapping tools
- Salmon & kallisto: Rapid Transcript Quantification for RNA-Seq Data - NYU Genomics core
- Benchmarking of RNA-sequencing analysis workflows using whole-transcriptome RT-qPCR expression data
- In contrast to the microarray field, where data processing converged over the years into a well-defined set of broadly accepted workflows, the number of RNA-seq data processing workflows is still increasing, with none accepted as the standard so far. RNA-seq data processing workflows typically come in two different flavours.
- there are methods that align reads directly to a reference genome, followed by quantification of mapped reads (e.g. Tophat-Cufflinks5, Tophat-HTSeq6, 7 and STAR-HTSeq7, 8)
- there are the so-called pseudoalignment methods (e.g. Salmon9 and Kallisto10) that break up reads into k-mers before assigning them to transcripts. This results in a substantial gain in speed compared to the alignment based workflows.
- The workflows also differ in how they estimate expression abundance, with some enabling quantification on transcript level (i.e. Cufflinks, Salmon and Kallisto) while others are restricted to gene level quantification
- All workflows show a good concordance with RT-qPCR expression measurements and no workflow outperforms the others. Of note, each workflow revealed a small but specific set of genes with inconsistent expression measurements, reproducibly identified in independent datasets. These genes were typically smaller, had fewer exons and were lower expressed compared to genes with consistent expression measurements. Careful validation is warranted when evaluating RNA-seq based expression profiles for this specific set of genes. -Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis
RNAseQC https://software.broadinstitute.org/cancer/cga/rna-seqc
FIXME: what do each of these quality assessments tell us?
- percent alignment
- From EMBL-EBI online training: The percentage of mapped reads is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA.
- paired alignment (vs singletons): For paired alignment we expect both strands to align to the genome. R1 is the forward direction and R2 is the reverse direction. Sometimes only one of the pairs will align. We call the result a singleton. (sourced from various biostar threads)
- strandedness: to assess the performance of strand-specific library construction methods, the percentage of sense-derived reads is given for each end of the read pair. Whereas a non-strand-specific protocol would give values of 50%/50%, strand-specific protocols typically yield 99%/1% or 1%/99% for this metric.
- gene body coverage: Calculate the RNA-seq reads coverage over gene body
- marks PCR duplicates (but this can be misleading because this varies for RNAseq experiments)
FIXME: do we count genes or transcripts? What does this tell us biologically?
- A gene is a fuzzy, partially defined concept of a sequence of nucleotides that encode RNA or protein
- Transcripts are specific nucelotide sequences that encode RNA/protein
- Each transcript maps to a gene, genes often have multiple transcipts due to alternative splicing
- Usually genes are counted (does not take multiple isoforms into account) because this is easier and more straight-forward
are read counts from different algorithms/software consistent? do any differences matter? (multimapping differences are main source of disagreement in results)
IMAGE: PCA plots for replicates
- The Harvard training has a series of PCA plots with various different factors colored to show how things cluster
FIXME: main methods of quantifying expression (mostly in humans, but also other biomedically-relevant model systems)
EXERCISE: assess quality of mapped data give some images/summary stats
EXERCISE: choose reference genome appropriate for a tricky analysis (data combination?)
EXERCISE: choose whether gene or transcript level is more appropriate, interpreting read quantification?
review objectives
preview next class's objectives