Skip to content

Latest commit

 

History

History
executable file
·
98 lines (75 loc) · 8.93 KB

class2.md

File metadata and controls

executable file
·
98 lines (75 loc) · 8.93 KB

Bulk RNAseq analysis

Class 2: Read mapping and quantification

Objectives

By the end of this class, you should be able to:

  • identify appropriate methods and quality assessment for mapping RNAseq reads
  • select appropriate assemblies and annotations for RNAseq experiments
  • differentiate between gene and transcript level analyses
  • identify approaches to quantifying gene expression from mapped reads

Methods of mapping reads

IMAGE: https://www.ebi.ac.uk/training/online/courses/functional-genomics-ii-common-technologies-and-data-analysis-methods/wp-content/uploads/sites/70/2020/05/Figure19.png

FIXME: how do some of the different read mapping tools differ? are there other tools we should mention?

  • TopHat

  • BWA

    • Fast and accurate short read alignment with Burrows–Wheeler transform
    • BWA GitHub:BWA is a software package for mapping DNA sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM. The first algorithm is designed for Illumina sequence reads up to 100bp, while the rest two for longer sequences ranged from 70bp to a few megabases. BWA-MEM and BWA-SW share similar features such as the support of long reads and chimeric alignment, but BWA-MEM, which is the latest, is generally recommended as it is faster and more accurate. BWA-MEM also has better performance than BWA-backtrack for 70-100bp Illumina reads
  • STAR2

    • STAR: ultrafast universal RNA-seq aligner
    • Mapping RNA-seq Reads with STAR:
      • One of the foundational steps in the RNA-seq data analysis is mapping (alignment) of the large sets of sequenced reads to a reference genome. This task presents more challenges than alignment of genomic DNA reads because RNA sequences are often spliced, i.e. derived from the non-contiguous regions of the genome.
      • In addition to detecting of annotated and novel splice junctions, STAR is capable of discovering more complex RNA sequence arrangements, such as chimeric and circular RNA. STAR can align spliced sequences of any length with moderate error rates providing scalability for emerging sequencing technologies
  • Salmon (and other pseudoalignment methods)

    • Combine lab - Salmon Overview
      • Salmon uses new algorithms (specifically, coupling the concept of quasi-mapping with a two-phase inference procedure) to provide accurate expression estimates very quickly (i.e. wicked-fast) and while using little memory
    • Combine lab Salmon GitHub Repo
      • Salmon achieves its accuracy and speed via a number of different innovations, including the use of selective-alignment (accurate but fast-to-compute proxies for traditional read alignments), and massively-parallel stochastic collapsed variational inference
  • Other pseudoaligners

    • Kallisto
      • program for quantifying abundances of transcripts from bulk and single-cell RNA-Seq data, or more generally of target sequences using high-throughput sequencing reads
  • Comparing read mapping tools

    • Salmon & kallisto: Rapid Transcript Quantification for RNA-Seq Data - NYU Genomics core
    • Benchmarking of RNA-sequencing analysis workflows using whole-transcriptome RT-qPCR expression data
      • In contrast to the microarray field, where data processing converged over the years into a well-defined set of broadly accepted workflows, the number of RNA-seq data processing workflows is still increasing, with none accepted as the standard so far. RNA-seq data processing workflows typically come in two different flavours.
      • there are methods that align reads directly to a reference genome, followed by quantification of mapped reads (e.g. Tophat-Cufflinks5, Tophat-HTSeq6, 7 and STAR-HTSeq7, 8)
      • there are the so-called pseudoalignment methods (e.g. Salmon9 and Kallisto10) that break up reads into k-mers before assigning them to transcripts. This results in a substantial gain in speed compared to the alignment based workflows.
      • The workflows also differ in how they estimate expression abundance, with some enabling quantification on transcript level (i.e. Cufflinks, Salmon and Kallisto) while others are restricted to gene level quantification
      • All workflows show a good concordance with RT-qPCR expression measurements and no workflow outperforms the others. Of note, each workflow revealed a small but specific set of genes with inconsistent expression measurements, reproducibly identified in independent datasets. These genes were typically smaller, had fewer exons and were lower expressed compared to genes with consistent expression measurements. Careful validation is warranted when evaluating RNA-seq based expression profiles for this specific set of genes. -Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis

Assessing quality of mapped reads

RNAseQC https://software.broadinstitute.org/cancer/cga/rna-seqc

FIXME: what do each of these quality assessments tell us?

  • percent alignment
    • From EMBL-EBI online training: The percentage of mapped reads is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA.
  • paired alignment (vs singletons): For paired alignment we expect both strands to align to the genome. R1 is the forward direction and R2 is the reverse direction. Sometimes only one of the pairs will align. We call the result a singleton. (sourced from various biostar threads)
  • strandedness: to assess the performance of strand-specific library construction methods, the percentage of sense-derived reads is given for each end of the read pair. Whereas a non-strand-specific protocol would give values of 50%/50%, strand-specific protocols typically yield 99%/1% or 1%/99% for this metric.
  • gene body coverage: Calculate the RNA-seq reads coverage over gene body
  • marks PCR duplicates (but this can be misleading because this varies for RNAseq experiments)

Reference genomes, assemblies, and annotations

Quantifying gene expression

FIXME: do we count genes or transcripts? What does this tell us biologically?

  • A gene is a fuzzy, partially defined concept of a sequence of nucleotides that encode RNA or protein
  • Transcripts are specific nucelotide sequences that encode RNA/protein
    • Each transcript maps to a gene, genes often have multiple transcipts due to alternative splicing
  • Usually genes are counted (does not take multiple isoforms into account) because this is easier and more straight-forward

are read counts from different algorithms/software consistent? do any differences matter? (multimapping differences are main source of disagreement in results)

IMAGE: PCA plots for replicates

  • The Harvard training has a series of PCA plots with various different factors colored to show how things cluster

FIXME: main methods of quantifying expression (mostly in humans, but also other biomedically-relevant model systems)

Putting it together

EXERCISE: assess quality of mapped data give some images/summary stats

EXERCISE: choose reference genome appropriate for a tricky analysis (data combination?)

EXERCISE: choose whether gene or transcript level is more appropriate, interpreting read quantification?

Wrapping up

review objectives

preview next class's objectives