From d9b409dca3f380e96de24e282f0bcbacbc14799b Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Sat, 25 Jan 2025 22:43:28 +0100 Subject: [PATCH 01/29] clean sentences until summary of 3.1(toggle) --- jupyter-book/glossary.md | 5 +++ .../introduction/raw_data_processing.md | 34 ++++++++++++++----- 2 files changed, 30 insertions(+), 9 deletions(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index 34887c0f..41d0d9ea 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -55,6 +55,10 @@ DNA Doublets Reads obtained from droplet based assays might be mistakenly associated to a single cell while the RNA expression origins from two or more cells (a doublet). +Downstream analysis +downstream analyses + A phase of data analysis that follows the initial processing of raw data. In the context of scRNA-seq, this includes tasks such as normalization, integration, filtering, cell type identification, trajectory inference, and studying expression dynamics. + Dropout A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type`. The reason for dropouts are commonly low amounts of mRNA expression in cells and the general stochasticity of mRNA expression. Dropouts are one of the reasons why scRNA-seq data is sparse. @@ -119,5 +123,6 @@ Trajectory inference Also known as pseudotemporal ordering. The computational recovery of dynamic processes by ordering cells by similarity or other means. Unique Molecular Identifier (UMI) +unique molecular identifiers (UMIs) Specific type of molecular barcodes aiding with error correction and increased accuracy during sequencing. UMIs unique tag molecules in sample libraries enabling estimation of PCR duplication rates. ``` diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 2e21cba1..5d917a93 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -2,8 +2,7 @@ # Raw data processing -In this section, we discuss some of the fundamental issues surrounding what is commonly called "preprocessing" of single-cell and single-nucleus RNA-sequencing (sc/snRNA-seq) data. Though this is common terminology, it seems a bit of a misnomer, as this process involves several -steps that make important decisions about how to deal with and represent the data that can enable or preclude subsequent analyses. Here, we will primarily refer to this phase of processing as "raw data processing", and our focus will be on the phase of data analysis that begins with lane-demultiplexed FASTQ files, and ends with a count matrix representing the estimated number of distinct molecules arising from each gene within each quantified cell, potentially stratified by the inferred splicing status of each molecule ({numref}`raw-proc-fig-overview`). +Here, we discuss the fundamental aspects of "preprocessing" for single-cell and single-nucleus RNA-sequencing (sc/snRNA-seq) data. While "preprocessing" is a common terminology, it can be misleading. The process involves critical decisions about handling and representing the data that directly impact subsequent analyses. For clarity, we will refer to this phase of processing as "raw data processing", and our focus will be on the phase of data analysis that begins with lane-demultiplexed FASTQ files and ends with a count matrix. This matrix represents the estimated number of distinct molecules derived from each gene per quantified cell, sometimes categorized by the inferred splicing status of each molecule ({numref}`raw-proc-fig-overview`). :::{figure-md} raw-proc-fig-overview Chapter Overview @@ -11,19 +10,36 @@ steps that make important decisions about how to deal with and represent the dat An overview of the topics discussed in this chapter. In the plot, "txome" stands for transcriptome. ::: -This count matrix then serves as the input for the myriad methods that have been developed for various analyses carried out with scRNA-seq data {cite}`raw:Zappia2021`, from methods for normalization, integration, and filtering through methods for inferring cell types, developmental trajectories, and expression dynamics. Given that it serves as the starting point for all of these analyses, a robust and accurate estimation of this matrix is a foundational and critical step to support and enable accurate and reliable subsequent analyses. Fundamental misestimation in raw data processing can contribute to invalid inferences in higher-level analyses and can preclude discoveries based on the signal present in the raw data, which has been missed, diminished, or distorted by raw data processing. As we will see in this section, despite the intuitive nature of the input and output we seek from this step in the processing pipeline, several important and difficult challenges arise during raw data processing, and improved computational methodology for dealing with these challenges is still being actively developed. In particular, we will cover the fundamental steps in raw data processing, including read alignment/mapping, cell barcode (CB) identification and correction, and the estimation of molecule counts through the resolution of unique molecular identifiers (UMIs). We will also mention the choices and challenges that surface when performing these processing steps. +The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`raw:Zappia2021`, including normalization, integration, and filtering through methods for cell type identification, developmental trajectory inference, and expression dynamics studies. A robust and accurate count matrix is essential for reliable {term}`downstream analyses`. Errors at this stage can lead to invalid conclusions and discoveries based on missed insights, or distorted signals in the data. Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges, which are active areas of computational development. -```{admonition} A note on what precedes raw data processing -The decision of where to begin discussing raw data processing is somewhat arbitrary. We note that while, here, we consider starting with lane-demultiplexed FASTQ files as the _raw_ input, even this already represents data that has been processed -and transformed from raw measurements. Further, some of the processing that precedes the generation of the FASTQ files is relevant to challenges that may arise in subsequent processing. For example, the process of base calling and base quality estimation affects the errors that arise in the FASTQ representation of the sequenced fragments, as well as the instrument's estimation of the confidence of each called nucleotide. Further, issues can arise upstream of FASTQ generation, such as index hopping {cite}`farouni2020model`, though these issues can be largely mitigated both with _in silico_ approaches {cite}`farouni2020model` and through enhanced protocols such as [dual indexing](https://www.10xgenomics.com/blog/sequence-with-confidence-understand-index-hopping-and-how-to-resolve-it). In this section, however, we will not consider upstream effects such as these, and instead will consider the FASTQ files, derived from, e.g., BCL files via [appropriate tools](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/bcl2fastq-direct), as the raw input under consideration. +In this section, we focus on key steps of raw data processing: + +1. Read alignment/mapping +2. Cell barcode (CB) identification and correction +3. Estimation of molecule counts through {term}`unique molecular identifiers (UMIs)` + +We also discuss the challenges and trade-offs involved in each step. + +```{admonition} A note on Preceding Steps + +The starting point for raw data processing is somewhat arbitrary. For this discussion, we treat lane-demultiplexed FASTQ files as the _raw_ input. However, these files are derived from earlier steps, such as base calling and base quality estimation, which can influence downstream processing. For example, base-calling errors and index hopping {cite}`farouni2020model` can introduce inaccuracies in FASTQ data. These issues can be mitigated with computational approaches {cite}farouni2020model or experimental enhancements like [dual indexing](https://www.10xgenomics.com/blog/sequence-with-confidence-understand-index-hopping-and-how-to-resolve-it). + +Here, we do not delve into the upstream processes, but consider the FASTQ files, derived from, e.g., BCL files via [appropriate tools](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/bcl2fastq-direct), as the raw input under consideration. ``` ## Raw data quality control -Once raw FASTQ files have been obtained, the quality of the reads themselves can be quickly diagnosed by running a quality control (QC) tool, such as [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), to assess read quality, sequence content, etc. If run successfully, `FastQC` generates a QC report for each input FASTQ file. Overall, this report summarizes the quality score, base content, and certain relevant statistics to spot potential problems originating from library preparation or sequencing. +After obtaining raw FASTQ files, it is important to evaluate the quality of the sequencing reads. A quick and effective way to perform this is by using quality control (QC) tools like [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). `FastQC` generates a detailed report for each FASTQ file, summarizing key metrics such as quality scores, base content, and other statistics that help identify potential issues arising from library preparation or sequencing. + +While many modern single-cell data processing tools include some built-in quality checks—such as evaluating the N content of sequences or the fraction of mapped reads—it is still good practice to run an independent QC check. This provides additional metrics that are often useful for identifying broader quality issues. + +For readers interested in what a typical `FastQC` report looks like, in the following toggle content, example reports for both [high-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [low-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the [`FastQC` manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) are used to demonstrate the modules in the `FastQC` report. Although these tutorials are not explicitly made for single-cell data, many of the results are still relevant for single-cell data, with a few caveats described below. + +In the toggle section, all graphs, except specifically mentioned, are taken from the example reports on the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). + +It is important to note that many QC metrics in FastQC reports are most meaningful only for biological reads—those derived from gene transcripts. For single-cell datasets, such as 10x Chromium v2 and v3, this typically corresponds to read 2 (the files containing `R2` in their filename), which contain transcript-derived sequences. In contrast, technical reads, which contain barcode and UMI sequences, often do not exhibit biologically typical sequence or GC content. However, certain metrics, like the fraction of `N` base calls, are still relevant for all reads. -Although nowadays, single-cell raw data processing tools internally evaluate some quality checks that are important for single-cell data, such as the N content of sequences and the fraction of mapped reads, it is still a good habit to run a quick quality check before processing the data, as it evaluates other useful QC metrics. -For readers who are interested in knowing what the `FastQC` report for a FASTQ file from an RNA-seq sample looks like, in the following toggle content, we use the example `FastQC` report of [good](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) together with [bad](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the [`FastQC` manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) to demonstrate the modules in the `FastQC` report. The detailed description of these modules can be found on the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Although these tutorials are not explicitly made for single-cell data, many of the results are still relevant for single-cell data, with a few caveats described below. In the toggle section, all graphs, except specifically mentioned, are taken from the example reports on the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Finally, most of the metrics reported in such quality control reports make sense only for the "biological" reads of a single-cell dataset (i.e. the read being drawn from gene transcripts). For example, for 10x Chromium v2 and v3 datasets, this would be read 2 (the files containing `R2` in their filename). This is because the technical reads are comprised primarily of barcode and UMI sequences which are not drawn from the underlying transcriptome and which we do not expect to have typical biologically plausible sequence or GC content, though certain metrics like the fraction of `N` base calls are still relevant. +By running an initial quality check using tools like `FastQC`, researchers can identify potential problems early and ensure the raw data is suitable for subsequent processing and analysis. ```{toggle} From 88970426e9c063971f240430196eaf7879eb1627 Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Mon, 27 Jan 2025 17:25:38 +0100 Subject: [PATCH 02/29] change until 3.2.1 and in glossary --- jupyter-book/glossary.md | 35 +++++++-- .../introduction/raw_data_processing.md | 71 +++++++++++++------ 2 files changed, 78 insertions(+), 28 deletions(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index 41d0d9ea..b49f5ab0 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -1,6 +1,10 @@ # Glossary ```{glossary} +Adapter sequences +adapter sequences + Short, synthetic DNA or RNA sequences that are ligated to the ends of DNA or RNA fragments during library preparation for sequencing. These adapters are essential for binding the fragments to the flowcell and enabling amplification and sequencing. However, if adapters are not properly removed or trimmed after sequencing, they can appear in the reads, potentially interfering with alignment and downstream analyses. + Algorithm Algorithms A pre-defined set of instructions to solve a problem. @@ -30,10 +34,6 @@ Cell Cell barcode See {term}`barcode` -Cluster -Clusters - A group of a population or data points that share similarities. In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`). - Cell type annotation The process of labeling groups of {term}`clusters` of cells by {term}`cell type`. Commonly done based on {term}`cell type` specific markers, automatically with classifiers or by mapping against a reference. @@ -46,6 +46,14 @@ Cell state Chromatin The complex of DNA and proteins efficiently packaging the DNA inside the nucleus and involved in regulating gene expression. +Cluster +Clusters + A group of a population or data points that share similarities. In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`). + +Complementary DNA (cDNA) +cDNA + DNA synthesized from an RNA template by the enzyme reverse transcriptase. cDNA is commonly used in RNA-seq library preparation because it is more stable than RNA and allows the captured transcripts to be amplified and sequenced for gene expression analysis. + Demultiplexing The process of determining which sequencing reads belong to which cell using {term}`barcodes`. @@ -68,6 +76,10 @@ Drop-seq FASTQ reads Sequencing reads that are saved in the FASTQ format. FASTQ files are then used to map against the reference genome of interest to obtain gene counts for cells. +Flowcell +flowcell + A consumable device used in sequencing platforms, such as those from Illumina, where DNA or RNA fragments are sequenced. It consists of a glass or polymer surface with lanes or channels coated with oligonucleotides, which capture and anchor DNA or RNA fragments. During sequencing, these fragments are amplified into clusters, and their sequences are determined by detecting fluorescent signals emitted during nucleotide incorporation. The flowcell enables high-throughput sequencing by allowing millions of fragments to be sequenced simultaneously. + Gene expression matrix A cell (barcode) by gene (scverse ecosystem) or gene by cell (barcode) matrix storing counts in the cell values. @@ -80,18 +92,21 @@ Indrop Library Also known as sequencing library. A pool of DNA fragments with attached sequencing adapters. +Locus +Loci +loci + Specific position or region on a genome or transcriptome where a particular sequence or genetic feature is located. In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region. Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data. + MuData A Python package for multimodal annotated data matrices. The primary data structure in the scverse ecosystem for multimodal data. +Muon muon A Python package for multi-modal single-cell analysis in Python by scverse. Negative binomial distribution A discrete probability distribution that models the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified number of failures. -Pipeline - Also often times denoted as workflow. A pre-specified selection of steps that are commonly executed in order. - RNA Ribonucleic acid. Single-stranded nucleic acid present in all living cells that encodes and regulates gene expression. @@ -101,6 +116,9 @@ RT-qPCR PCR Polymercase chain reaction (PCR) is a method to amplify sequences to create billions of copies. PCR requires primers, which are short synthetic {term}`DNA` fragments, to select the genome segments to be amplified and subsequently multiple rounds of {term}`DNA` synthesis to amplify the targeted segments. +Pipeline + Also often times denoted as workflow. A pre-specified selection of steps that are commonly executed in order. + Poisson distribution Discrete probability distribution denoting the probability of a specified number of events occurring in a fixed interval of time or space with the events occurring independently at a known constant mean rate. @@ -116,6 +134,9 @@ scanpy scverse A consortium for fundamental single-cell tools in the life sciences that are maintaining computational analysis tools like scanpy, muon and scvi-tools. See: https://scverse.org/ +signal-to-noise ratio + A measure of the clarity of a signal relative to background noise. In sequencing, the signal represents the detectable information derived from the DNA or RNA molecules being sequenced, while the noise includes random errors or unwanted signals that can obscure or distort the true data. A high SNR indicates that the signal is strong and reliable compared to the noise, resulting in better data quality. Conversely, a low SNR means the noise may interfere with or reduce the accuracy of the sequencing results. + Spike-in RNA RNA transcripts of known sequence and quantity to calibrate measurements in RNA hybridization steps for RNA-seq. diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 5d917a93..34f98b79 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -41,11 +41,11 @@ It is important to note that many QC metrics in FastQC reports are most meaningf By running an initial quality check using tools like `FastQC`, researchers can identify potential problems early and ensure the raw data is suitable for subsequent processing and analysis. -```{toggle} +```{dropdown} Example FastQC Reports and Tutorials **0. Summary** -The summary panel on the left-hand side of the HTML report shows the module names included in the report, together with a sign used for quick evaluation of the module results. However, because `FastQC` uses uniform thresholds for files generated from all kinds of sequencing platforms and underlying biological material, we sometimes see warnings (orange exclamation) and failures(red crosses) for good data and passes (green ticks) for questionable data. So, all modules should be carefully evaluated before concluding about the data quality. +The summary panel on the left side of the HTML report displays the module names along with symbols that provide a quick assessment of the module results. However, `FastQC` applies uniform thresholds across all sequencing platforms and biological materials. As a result, warnings (orange exclamation marks) or failures (red crosses) may appear for high-quality data, while questionable data might receive passes (green ticks). Therefore, each module should be carefully reviewed before drawing conclusions about data quality. :::{figure-md} raw-proc-fig-fastqc-summary Summary @@ -55,7 +55,7 @@ The summary panel of a bad example. **1. Basic Statistics** -The basic statistics module contains the overview information and statistics of the reads in the input FASTQ file, such as the filename, total number of sequences, number of poor quality sequences, sequence length, and the overall %GC of all bases in all sequences. Good single-cell data usually have a very little number of poor quality sequences and most often have uniform sequence length. The GC content should match the overall GC content of the genome or transcriptome of the sequenced species. +The basic statistics module provides an overview of key information and statistics for the input FASTQ file, including the filename, total number of sequences, number of poor-quality sequences, sequence length, and the overall GC content (%GC) across all bases in all sequences. High-quality single-cell data typically have very few poor-quality sequences and exhibit a uniform sequence length. Additionally, the GC content should align with the expected GC content of the genome or transcriptome of the sequenced species. :::{figure-md} raw-proc-fig-fastqc-basic-statistics Basic Statistics @@ -65,7 +65,11 @@ A good basic statistics report example. **2. Per Base Sequence Quality** -The per base sequence quality view contains a BoxWhisker type plot for each position in the read, which shows the range of quality scores across all bases at each position along all reads in the file. The x-axis represents the positions in the read, and the y-axis shows the quality scores. For single-cell data of good quality, all yellow boxes in the view, which represent the inter-quantile range of the quality scores of positions, should fall into the green (calls of good quality) area. So do all the whiskers, which represent the 10\% and 90\% of the distribution. It is not unexpected to see that the quality scores drop along the body of the reads, and some base calls of the last positions fall into the orange (calls of reasonable quality) area because of the decrease in the signal-to-noise ratio that is common in most sequencing-by-synthesis methods. Still, boxes should fall outside of the red (calls of poor quality) area. If poor quality calls are observed, one may consider performing quality trimming of their reads. [A detailed explanation](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html) of the sequencing error profiles is provided by the [HBC training program](https://hbctraining.github.io/main/). +The per-base sequence quality view displays a box-and-whisker plot for each position in the read, illustrating the range of quality scores across all bases at each position. The x-axis represents the positions within the read, while the y-axis shows the quality scores. + +For high-quality single-cell data, the yellow boxes—representing the interquartile range of quality scores—should fall within the green area (indicating good quality calls). Similarly, the whiskers, which represent the 10th and 90th percentiles of the distribution, should also remain within the green area. It is common to observe a gradual drop in quality scores along the length of the read, with some base calls at the last positions falling into the orange area (reasonable quality) due to a decreasing {term}`signal-to-noise ratio`, a characteristic of sequencing-by-synthesis methods. However, the boxes should not extend into the red area (poor quality calls). + +If poor-quality calls are observed, quality trimming may be necessary. [A more detailed explanation](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html) of sequencing error profiles can be found in the [HBC training program](https://hbctraining.github.io/main/). :::{figure-md} raw-proc-fig-fastqc-per-read-sequence-quality per read sequence quality @@ -75,7 +79,9 @@ A good (left) and a bad (right) per-read sequence quality graph. **3. Per Tile Sequence Quality** -Using an Illumina library, the per tile sequence quality plot shows the deviation from the average quality for the reads in each flowcell tile that was sequenced. The "hotter" the color, the larger the deviation. For high-quality data, we should see blue over the entire plot, meaning that the flowcell has consistent quality in all tiles. If only part of a flowcell has poor quality, some hot colors will appear in the plot. In that case, the sequencing step might have encountered transient problems, such as bubbles going through the flowcell or smudges and debris inside the flowcell lane. For further diagnoses, please refer to [QC Fail](https://sequencing.qcfail.com/articles/position-specific-failures-of-flowcells/) and [common reasons for warnings](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html) from `FastQC`. +Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell` tile. The plot uses a color gradient to represent deviations, where “hotter” colors indicate larger deviations. High-quality data typically display a uniform blue color across the plot, indicating consistent quality across all tiles of the flowcell. + +If hot colors appear in certain areas, it suggests that only part of the flowcell experienced poor quality. This could result from transient issues during sequencing, such as bubbles passing through the flowcell or smudges and debris within the flowcell lane. For further investigation, consult resources like [QC Fail](https://sequencing.qcfail.com/articles/position-specific-failures-of-flowcells/) and the [common reasons for warnings](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html) provided in the `FastQC` manual. :::{figure-md} raw-proc-fig-fastqc-per-tile-sequence-quality per tile sequence quality @@ -85,7 +91,7 @@ A good (left) and a bad (right) per tile sequence quality view. **4. Per Sequence Quality Scores** -The per sequence quality score plot shows the distribution of the average quality score of each read in the file. The x-axis shows the average quality scores, and the y-axis represents the frequency of each quality score. For good data, this plot should have only one peak at the tail. If some other peaks show up, a subset of reads might have some quality issue. +The per-sequence quality score plot displays the distribution of average quality scores for each read in the file. The x-axis represents the average quality scores, while the y-axis shows the frequency of each score. For high-quality data, the plot should have a single peak near the high-quality end of the scale. If additional peaks appear, it may indicate a subset of reads with quality issues. :::{figure-md} raw-proc-fig-fastqc-per-sequence-quality-scores per sequence quality scores @@ -95,7 +101,7 @@ A good (left) and a bad (right) per sequence quality score plot. **5. Per Base Sequence Content** -The per base sequence content plot reports the percent of each base position of all reads in the file for which each of the four nucleotides has been called. For single-cell data, it is not uncommon to see fluctuations at the start of the lines because the first bases of reads represent the sequence of the priming sites, which may not be perfectly random, as explained on the [QC Fail website](https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/). This happens frequently in RNA-seq libraries, even though `FastQC` will call a warning or failure. +The per-base sequence content plot shows the percentage of each nucleotide (A, T, G, C) called at each base position across all reads in the file. For single-cell data, it is common to observe fluctuations at the start of the reads. This occurs because the initial bases represent the sequence of the priming sites, which are often not perfectly random. This is a frequent occurrence in RNA-seq libraries, even though `FastQC` may flag it with a warning or failure, as noted on the [QC Fail website](https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/). :::{figure-md} raw-proc-fig-fastqc-per-base-sequence-content per base sequence content @@ -105,7 +111,9 @@ A good (left) and bad (right) per base sequence content plot. **6. Per Sequence GC Content** -The per sequence GC content plot shows the distribution of GC content over all of the reads in red and a theoretical (expected) distribution in blue. The central peak of the observed distribution should correspond to the overall GC content of the underlying transcriptome. Sometimes we see a wider or narrower distribution than the theoretical distribution because the GC content of the transcriptome might differ from the genome, which, in theory, should follow the distribution shown in blue. Such differences in the spread of the distributions are not uncommon, even though it may trigger a warning or failure. However, a complex distribution usually indicates a contaminated library. It is worth noting, however, that a GC content plot can be somewhat complicated to interpret in transcriptomics experiments, as the expected GC content distribution depends not only on the sequence content of the underlying transcriptome, but also on the expression of the genes in the sample, which are unknown. +The per-sequence GC content plot displays the GC content distribution across all reads (in red) compared to a theoretical distribution (in blue). The central peak of the observed distribution should align with the overall GC content of the transcriptome. However, the observed distribution may appear wider or narrower than the theoretical one due to differences between the transcriptome's GC content and the genome's expected GC distribution. Such variations are common and may trigger a warning or failure in `FastQC`, even if the data is acceptable. + +A complex or irregular distribution in this plot, however, often indicates contamination in the library. It is also important to note that interpreting GC content in transcriptomics can be challenging. The expected GC distribution depends not only on the sequence composition of the transcriptome but also on gene expression levels in the sample, which are typically unknown beforehand. As a result, some deviation from the theoretical distribution is not unusual in RNA-seq data. :::{figure-md} raw-proc-fig-fastqc-per-sequence-gc-content Per Sequence GC Content @@ -115,7 +123,7 @@ A good (left) and a bad (right) per sequence GC content plot. The plot on the le **7. Per Base N Content** -The per base N content plot shows the percent of bases at each position for which an ``N`` was called, which will occur when the sequencer has insufficient confidence to make a base call. In a high-quality library, we should not expect any noticeable non-zero ``N`` content throughout the line. +The per-base N content plot displays the percentage of bases at each position that were called as ``N``, indicating that the sequencer lacked sufficient confidence to assign a specific nucleotide. In a high-quality library, the ``N`` content should remain consistently at or near zero across the entire length of the reads. Any noticeable non-zero ``N`` content may indicate issues with sequencing quality or library preparation. :::{figure-md} raw-proc-fig-fastqc-per-base-n-content @@ -126,7 +134,7 @@ A good (left) and a bad (right) per base N content plot. **8. Sequence Length Distribution** -The Sequence length distribution graph shows the distribution of the read lengths. In most single-cell chemistries, all reads will be of the same length. If trimming was performed before quality assessment, there may be some small variation in read lengths. +The sequence length distribution graph displays the distribution of read lengths across all sequences in the file. For most single-cell sequencing chemistries, all reads are expected to have the same length, resulting in a single peak in the graph. However, if quality trimming was applied before the quality assessment, some variation in read lengths may be observed. Small differences in read lengths due to trimming are normal and should not be a cause for concern if expected. :::{figure-md} raw-proc-fig-fastqc-sequence-length-distribution Sequence Length Distribution @@ -136,7 +144,9 @@ A good (left) and a bad (right) sequence length distribution plot. **9. Sequence Duplication Levels** -The sequence duplication level plot shows the distribution of the degree of duplication for read sequences (the blue line) and those after deduplication. As most single-cell platform requires multiple rounds of PCR, highly expressed genes usually express a large number of transcripts, and FastQC itself is not UMI aware, it is common that a small subset of sequences may have a large number of duplications. This may trigger a warning or failure for this module, but it does not necessarily represent a quality problem with the data. Still, the majority of sequences should have a low duplication level. +The sequence duplication level plot illustrates the distribution of duplication levels for read sequences, represented by the blue line, both before and after deduplication. In single-cell platforms, multiple rounds of {term}`PCR` are typically required, and highly expressed genes naturally produce a large number of transcripts. Additionally, since `FastQC` is not UMI-aware (i.e., it does not account for unique molecular identifiers), it is common for a small subset of sequences to show high duplication levels. + +While this may trigger a warning or failure in this module, it does not necessarily indicate a quality issue with the data. However, the majority of sequences should still exhibit low duplication levels, reflecting a diverse and well-prepared library. :::{figure-md} raw-proc-fig-fastqc-sequence-duplication-levels Sequence Duplication Levels @@ -146,7 +156,9 @@ A good (left) and a bad (right) per sequence duplication levels plot. **10. Overrepresented Sequences** -The overrepresented sequences module lists all read sequences that make up more than $0.1\%$ of the total number of sequences. We might see some overrepresented sequences from the highly expressed genes after PCR amplification, but most sequences should not be overrepresented. Note that if the possible source of the overrepresented sequences is not "No Hit", it might indicate that the library is contaminated by the corresponding type of source. +The overrepresented sequences module identifies read sequences that constitute more than 0.1% of the total reads. In single-cell sequencing, some overrepresented sequences may arise from highly expressed genes amplified during PCR. However, the majority of sequences should not be overrepresented. + +If the source of an overrepresented sequence is identified (i.e., not listed as "No Hit"), it could indicate potential contamination in the library from the corresponding source. Such cases warrant further investigation to ensure data quality. :::{figure-md} raw-proc-fig-fastqc-overrepresented-sequences Overrepresented Sequences @@ -156,7 +168,7 @@ An overrepresented sequence table. **11. Adapter Content** -The adapter content module shows the cumulative percentage of the fraction of reads in which each of the adapter sequences has been observed at each base position. Ideally, we should not see any abundant adapter sequences in the data. +The adapter content module displays the cumulative percentage of reads containing {term}`adapter sequences` at each base position. High levels of adapter sequences indicate incomplete removal of adapters during library preparation, which can interfere with downstream analyses. Ideally, no significant adapter content should be present in the data. If adapter sequences are abundant, additional trimming may be necessary to improve data quality. :::{figure-md} raw-proc-fig-fastqc-adapter-content Adapter Content @@ -172,19 +184,36 @@ Multiple FastQC reports can be combined into a single report using the tool [`Mu ## Alignment and mapping -Mapping or alignment is a fundamental step in single-cell raw data processing. It refers to the process of determining the potential loci of origin of each sequenced fragment (e.g., the set of genomic or transcriptomic loci that are similar to the read). Depending on the sequencing protocol, the resulting raw sequence file contains the cell-level information, commonly known as cell barcodes (CB), the unique molecule identifier (UMI), and the raw cDNA sequence (read sequence) generated from the molecule. As the first step in the raw data processing of a single-cell sample ({numref}`raw-proc-fig-overview`), executing mapping or alignment accurately is instrumental for all downstream analyses, since incorrect read-to-transcript/gene mapping can lead to wrong or inaccurate matrices. +Mapping or Alignment is a critical step in single-cell raw data processing. It involves determining the potential {term}`loci` of origin for each sequenced fragment, such as the genomic or transcriptomic locations that closely match the read sequence. This step is essential for correctly assigning reads to their source regions. + +In single-cell sequencing protocols, the raw sequence files typically include: + +- Cell {term}`Barcodes` (CB): Unique identifiers for individual cells. +- Unique Molecular Identifiers (UMIs): Tags that distinguish individual molecules to account for amplification bias. +- Raw {term}`cDNA` Sequences: The actual read sequences generated from the molecules. + +As the first step in raw data processing ({numref}`raw-proc-fig-overview`), accurate mapping or alignment is crucial for reliable downstream analyses. Errors during this step, such as incorrect mapping of reads to transcripts or genes, can result in inaccurate or misleading count matrices, ultimately compromising the quality of subsequent analyses. + +While mapping read sequences to reference sequences _far_ predates the development of scRNA-seq, the sheer scale of modern scRNA-seq datasets—often involving hundreds of millions to billions of reads—makes this step particularly computationally intensive. Many existing RNA-seq aligners are protocol-agnostic and do not inherently account for features specific to scRNA-seq, such as cell barcodes, UMIs, or their positions and lengths. As a result, additional tools are often required for steps like demultiplexing and UMI resolution {cite}`Smith2017`. + +This reliance on separate tools introduces additional computational overhead. It often necessitates storing large intermediate files, which significantly increases disk space usage. Moreover, the extra input and output operations required to process these files further contribute to longer runtime requirements, making the mapping stage both resource-intensive and time-consuming. -While mapping read sequences to reference sequences _far_ predates the advent of scRNA-seq, the current and quickly growing scale of scRNA-seq samples (typically hundreds of millions to billions of reads) makes this stage particularly computationally intensive. Additionally, using pre-existing RNA-seq aligners that are agnostic to any specific scRNA-seq protocol -- like the length and position of cell barcodes, UMI, etc. -- requires making use of separate tools for performing other steps such as demultiplexing and UMI resolutions {cite}`Smith2017`. This added overhead can be computationally cumbersome. Further, it typically carries a substantial disk space requirement for storing intermediate files, and the extra input and output operations required for processing such intermediate files further increase runtime requirements. +To address the challenges of aligning and mapping scRNA-seq data, several specialized tools have been developed that handle the additional processing requirements automatically or internally. These tools include: -To this end, several dedicated tools have been built specifically for aligning or mapping scRNA-seq data, which handle these additional processing requirements automatically and/or internally. Tools such as `Cell Ranger` (commercial software from 10x Genomics) {cite}`raw:Zheng2017`, `zUMIs` {cite}`zumis`, `alevin` {cite}`Srivastava2019`, `RainDrop` {cite}`niebler2020raindrop`, `kallisto|bustools` {cite}`Melsted2021`, `STARsolo` {cite}`Kaminow2021` and `alevin-fry` {cite}`raw:He2022` provide dedicated treatment for aligning scRNA-seq reads along with parsing of technical read content (CBs and UMIs), as well as methods for demultiplexing and UMI resolution. -While all provide relatively simplified interfaces to the user, these tools use a variety of different approaches internally, with some generating traditional intermediate files (e.g., BAM files) and subsequently processing them, and others either working entirely in memory or by making use of reduced intermediate representations to reduce the input/output burden. +- `Cell Ranger` (commercial software from 10x Genomics) {cite}`raw:Zheng2017` +- `zUMIs` {cite}`zumis` +- `alevin` {cite}`Srivastava2019` +- `RainDrop` {cite}`niebler2020raindrop` +- `kallisto|bustools` {cite}`Melsted2021` +- `STARsolo` {cite}`Kaminow2021` +- `alevin-fry` {cite}`raw:He2022` -%A further benefit of all these tools is their reliance on pre-existing mapping/alignment engine that the tool relies on. As a result, the pre-existing user base and the tutorials make these tools more amenable to new users. +These tools provide specialized capabilities for aligning scRNA-seq reads, parsing technical read content (e.g., cell barcodes and UMIs), demultiplexing, and UMI resolution. Although they offer simplified user interfaces, their internal methodologies differ significantly. Some tools generate traditional intermediate files, such as BAM files, which are processed further, while others operate entirely in memory or use compact intermediate representations to minimize input/output operations and reduce computational overhead. -While the specific algorithms, data structures, and time and space trade-offs made by different alignment and mapping approaches can vary greatly, we can roughly categorize existing approaches along two axes: +While these tools vary in their specific algorithms, data structures, and trade-offs in time and space complexity, their approaches can generally be categorized along two axes: -- The type of mapping they perform and -- The type of reference sequence against which they map reads. +1. **The type of mapping they perform**, and +2. **The type of reference sequence against which they map reads**. (raw-proc:types-of-mapping)= From b9d813df3b48a977281a7a478b3e10b0a5409d61 Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Tue, 28 Jan 2025 14:57:18 +0100 Subject: [PATCH 03/29] add newlines --- .../introduction/raw_data_processing.md | 153 +++++++++++++----- 1 file changed, 117 insertions(+), 36 deletions(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 34f98b79..7b18b36e 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -2,7 +2,10 @@ # Raw data processing -Here, we discuss the fundamental aspects of "preprocessing" for single-cell and single-nucleus RNA-sequencing (sc/snRNA-seq) data. While "preprocessing" is a common terminology, it can be misleading. The process involves critical decisions about handling and representing the data that directly impact subsequent analyses. For clarity, we will refer to this phase of processing as "raw data processing", and our focus will be on the phase of data analysis that begins with lane-demultiplexed FASTQ files and ends with a count matrix. This matrix represents the estimated number of distinct molecules derived from each gene per quantified cell, sometimes categorized by the inferred splicing status of each molecule ({numref}`raw-proc-fig-overview`). +Here, we discuss the fundamental aspects of "preprocessing" for single-cell and single-nucleus RNA-sequencing (sc/snRNA-seq) data. While "preprocessing" is a common terminology, it can be misleading. +The process involves critical decisions about handling and representing the data that directly impact subsequent analyses. +For clarity, we will refer to this phase of processing as "raw data processing", and our focus will be on the phase of data analysis that begins with lane-demultiplexed FASTQ files and ends with a count matrix. +This matrix represents the estimated number of distinct molecules derived from each gene per quantified cell, sometimes categorized by the inferred splicing status of each molecule ({numref}`raw-proc-fig-overview`). :::{figure-md} raw-proc-fig-overview Chapter Overview @@ -10,7 +13,9 @@ Here, we discuss the fundamental aspects of "preprocessing" for single-cell and An overview of the topics discussed in this chapter. In the plot, "txome" stands for transcriptome. ::: -The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`raw:Zappia2021`, including normalization, integration, and filtering through methods for cell type identification, developmental trajectory inference, and expression dynamics studies. A robust and accurate count matrix is essential for reliable {term}`downstream analyses`. Errors at this stage can lead to invalid conclusions and discoveries based on missed insights, or distorted signals in the data. Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges, which are active areas of computational development. +The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`raw:Zappia2021`, including normalization, integration, and filtering through methods for cell type identification, developmental trajectory inference, and expression dynamics studies. +A robust and accurate count matrix is essential for reliable {term}`downstream analyses`. Errors at this stage can lead to invalid conclusions and discoveries based on missed insights, or distorted signals in the data. +Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges, which are active areas of computational development. In this section, we focus on key steps of raw data processing: @@ -22,22 +27,32 @@ We also discuss the challenges and trade-offs involved in each step. ```{admonition} A note on Preceding Steps -The starting point for raw data processing is somewhat arbitrary. For this discussion, we treat lane-demultiplexed FASTQ files as the _raw_ input. However, these files are derived from earlier steps, such as base calling and base quality estimation, which can influence downstream processing. For example, base-calling errors and index hopping {cite}`farouni2020model` can introduce inaccuracies in FASTQ data. These issues can be mitigated with computational approaches {cite}farouni2020model or experimental enhancements like [dual indexing](https://www.10xgenomics.com/blog/sequence-with-confidence-understand-index-hopping-and-how-to-resolve-it). +The starting point for raw data processing is somewhat arbitrary. For this discussion, we treat lane-demultiplexed FASTQ files as the _raw_ input. +However, these files are derived from earlier steps, such as base calling and base quality estimation, which can influence downstream processing. +For example, base-calling errors and index hopping {cite}`farouni2020model` can introduce inaccuracies in FASTQ data. +These issues can be mitigated with computational approaches {cite}farouni2020model or experimental enhancements like [dual indexing](https://www.10xgenomics.com/blog/sequence-with-confidence-understand-index-hopping-and-how-to-resolve-it). Here, we do not delve into the upstream processes, but consider the FASTQ files, derived from, e.g., BCL files via [appropriate tools](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/bcl2fastq-direct), as the raw input under consideration. ``` ## Raw data quality control -After obtaining raw FASTQ files, it is important to evaluate the quality of the sequencing reads. A quick and effective way to perform this is by using quality control (QC) tools like [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). `FastQC` generates a detailed report for each FASTQ file, summarizing key metrics such as quality scores, base content, and other statistics that help identify potential issues arising from library preparation or sequencing. +After obtaining raw FASTQ files, it is important to evaluate the quality of the sequencing reads. +A quick and effective way to perform this is by using quality control (QC) tools like [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). +`FastQC` generates a detailed report for each FASTQ file, summarizing key metrics such as quality scores, base content, and other statistics that help identify potential issues arising from library preparation or sequencing. -While many modern single-cell data processing tools include some built-in quality checks—such as evaluating the N content of sequences or the fraction of mapped reads—it is still good practice to run an independent QC check. This provides additional metrics that are often useful for identifying broader quality issues. +While many modern single-cell data processing tools include some built-in quality checks—such as evaluating the N content of sequences or the fraction of mapped reads—it is still good practice to run an independent QC check. +This provides additional metrics that are often useful for identifying broader quality issues. -For readers interested in what a typical `FastQC` report looks like, in the following toggle content, example reports for both [high-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [low-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the [`FastQC` manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) are used to demonstrate the modules in the `FastQC` report. Although these tutorials are not explicitly made for single-cell data, many of the results are still relevant for single-cell data, with a few caveats described below. +For readers interested in what a typical `FastQC` report looks like, in the following toggle content, example reports for both [high-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [low-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the [`FastQC` manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) are used to demonstrate the modules in the `FastQC` report. +Although these tutorials are not explicitly made for single-cell data, many of the results are still relevant for single-cell data, with a few caveats described below. In the toggle section, all graphs, except specifically mentioned, are taken from the example reports on the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). -It is important to note that many QC metrics in FastQC reports are most meaningful only for biological reads—those derived from gene transcripts. For single-cell datasets, such as 10x Chromium v2 and v3, this typically corresponds to read 2 (the files containing `R2` in their filename), which contain transcript-derived sequences. In contrast, technical reads, which contain barcode and UMI sequences, often do not exhibit biologically typical sequence or GC content. However, certain metrics, like the fraction of `N` base calls, are still relevant for all reads. +It is important to note that many QC metrics in FastQC reports are most meaningful only for biological reads—those derived from gene transcripts. +For single-cell datasets, such as 10x Chromium v2 and v3, this typically corresponds to read 2 (the files containing `R2` in their filename), which contain transcript-derived sequences. +In contrast, technical reads, which contain barcode and UMI sequences, often do not exhibit biologically typical sequence or GC content. +However, certain metrics, like the fraction of `N` base calls, are still relevant for all reads. By running an initial quality check using tools like `FastQC`, researchers can identify potential problems early and ensure the raw data is suitable for subsequent processing and analysis. @@ -45,7 +60,10 @@ By running an initial quality check using tools like `FastQC`, researchers can i **0. Summary** -The summary panel on the left side of the HTML report displays the module names along with symbols that provide a quick assessment of the module results. However, `FastQC` applies uniform thresholds across all sequencing platforms and biological materials. As a result, warnings (orange exclamation marks) or failures (red crosses) may appear for high-quality data, while questionable data might receive passes (green ticks). Therefore, each module should be carefully reviewed before drawing conclusions about data quality. +The summary panel on the left side of the HTML report displays the module names along with symbols that provide a quick assessment of the module results. +However, `FastQC` applies uniform thresholds across all sequencing platforms and biological materials. +As a result, warnings (orange exclamation marks) or failures (red crosses) may appear for high-quality data, while questionable data might receive passes (green ticks). +Therefore, each module should be carefully reviewed before drawing conclusions about data quality. :::{figure-md} raw-proc-fig-fastqc-summary Summary @@ -55,7 +73,9 @@ The summary panel of a bad example. **1. Basic Statistics** -The basic statistics module provides an overview of key information and statistics for the input FASTQ file, including the filename, total number of sequences, number of poor-quality sequences, sequence length, and the overall GC content (%GC) across all bases in all sequences. High-quality single-cell data typically have very few poor-quality sequences and exhibit a uniform sequence length. Additionally, the GC content should align with the expected GC content of the genome or transcriptome of the sequenced species. +The basic statistics module provides an overview of key information and statistics for the input FASTQ file, including the filename, total number of sequences, number of poor-quality sequences, sequence length, and the overall GC content (%GC) across all bases in all sequences. +High-quality single-cell data typically have very few poor-quality sequences and exhibit a uniform sequence length. +Additionally, the GC content should align with the expected GC content of the genome or transcriptome of the sequenced species. :::{figure-md} raw-proc-fig-fastqc-basic-statistics Basic Statistics @@ -65,9 +85,13 @@ A good basic statistics report example. **2. Per Base Sequence Quality** -The per-base sequence quality view displays a box-and-whisker plot for each position in the read, illustrating the range of quality scores across all bases at each position. The x-axis represents the positions within the read, while the y-axis shows the quality scores. +The per-base sequence quality view displays a box-and-whisker plot for each position in the read, illustrating the range of quality scores across all bases at each position. +The x-axis represents the positions within the read, while the y-axis shows the quality scores. -For high-quality single-cell data, the yellow boxes—representing the interquartile range of quality scores—should fall within the green area (indicating good quality calls). Similarly, the whiskers, which represent the 10th and 90th percentiles of the distribution, should also remain within the green area. It is common to observe a gradual drop in quality scores along the length of the read, with some base calls at the last positions falling into the orange area (reasonable quality) due to a decreasing {term}`signal-to-noise ratio`, a characteristic of sequencing-by-synthesis methods. However, the boxes should not extend into the red area (poor quality calls). +For high-quality single-cell data, the yellow boxes—representing the interquartile range of quality scores—should fall within the green area (indicating good quality calls). +Similarly, the whiskers, which represent the 10th and 90th percentiles of the distribution, should also remain within the green area. +It is common to observe a gradual drop in quality scores along the length of the read, with some base calls at the last positions falling into the orange area (reasonable quality) due to a decreasing {term}`signal-to-noise ratio`, a characteristic of sequencing-by-synthesis methods. +However, the boxes should not extend into the red area (poor quality calls). If poor-quality calls are observed, quality trimming may be necessary. [A more detailed explanation](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html) of sequencing error profiles can be found in the [HBC training program](https://hbctraining.github.io/main/). @@ -79,9 +103,13 @@ A good (left) and a bad (right) per-read sequence quality graph. **3. Per Tile Sequence Quality** -Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell` tile. The plot uses a color gradient to represent deviations, where “hotter” colors indicate larger deviations. High-quality data typically display a uniform blue color across the plot, indicating consistent quality across all tiles of the flowcell. +Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell` tile. +The plot uses a color gradient to represent deviations, where “hotter” colors indicate larger deviations. +High-quality data typically display a uniform blue color across the plot, indicating consistent quality across all tiles of the flowcell. -If hot colors appear in certain areas, it suggests that only part of the flowcell experienced poor quality. This could result from transient issues during sequencing, such as bubbles passing through the flowcell or smudges and debris within the flowcell lane. For further investigation, consult resources like [QC Fail](https://sequencing.qcfail.com/articles/position-specific-failures-of-flowcells/) and the [common reasons for warnings](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html) provided in the `FastQC` manual. +If hot colors appear in certain areas, it suggests that only part of the flowcell experienced poor quality. +This could result from transient issues during sequencing, such as bubbles passing through the flowcell or smudges and debris within the flowcell lane. +For further investigation, consult resources like [QC Fail](https://sequencing.qcfail.com/articles/position-specific-failures-of-flowcells/) and the [common reasons for warnings](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html) provided in the `FastQC` manual. :::{figure-md} raw-proc-fig-fastqc-per-tile-sequence-quality per tile sequence quality @@ -91,7 +119,10 @@ A good (left) and a bad (right) per tile sequence quality view. **4. Per Sequence Quality Scores** -The per-sequence quality score plot displays the distribution of average quality scores for each read in the file. The x-axis represents the average quality scores, while the y-axis shows the frequency of each score. For high-quality data, the plot should have a single peak near the high-quality end of the scale. If additional peaks appear, it may indicate a subset of reads with quality issues. +The per-sequence quality score plot displays the distribution of average quality scores for each read in the file. +The x-axis represents the average quality scores, while the y-axis shows the frequency of each score. +For high-quality data, the plot should have a single peak near the high-quality end of the scale. +If additional peaks appear, it may indicate a subset of reads with quality issues. :::{figure-md} raw-proc-fig-fastqc-per-sequence-quality-scores per sequence quality scores @@ -101,7 +132,10 @@ A good (left) and a bad (right) per sequence quality score plot. **5. Per Base Sequence Content** -The per-base sequence content plot shows the percentage of each nucleotide (A, T, G, C) called at each base position across all reads in the file. For single-cell data, it is common to observe fluctuations at the start of the reads. This occurs because the initial bases represent the sequence of the priming sites, which are often not perfectly random. This is a frequent occurrence in RNA-seq libraries, even though `FastQC` may flag it with a warning or failure, as noted on the [QC Fail website](https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/). +The per-base sequence content plot shows the percentage of each nucleotide (A, T, G, C) called at each base position across all reads in the file. +For single-cell data, it is common to observe fluctuations at the start of the reads. +This occurs because the initial bases represent the sequence of the priming sites, which are often not perfectly random. +This is a frequent occurrence in RNA-seq libraries, even though `FastQC` may flag it with a warning or failure, as noted on the [QC Fail website](https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/). :::{figure-md} raw-proc-fig-fastqc-per-base-sequence-content per base sequence content @@ -111,19 +145,29 @@ A good (left) and bad (right) per base sequence content plot. **6. Per Sequence GC Content** -The per-sequence GC content plot displays the GC content distribution across all reads (in red) compared to a theoretical distribution (in blue). The central peak of the observed distribution should align with the overall GC content of the transcriptome. However, the observed distribution may appear wider or narrower than the theoretical one due to differences between the transcriptome's GC content and the genome's expected GC distribution. Such variations are common and may trigger a warning or failure in `FastQC`, even if the data is acceptable. +The per-sequence GC content plot displays the GC content distribution across all reads (in red) compared to a theoretical distribution (in blue). +The central peak of the observed distribution should align with the overall GC content of the transcriptome. +However, the observed distribution may appear wider or narrower than the theoretical one due to differences between the transcriptome's GC content and the genome's expected GC distribution. +Such variations are common and may trigger a warning or failure in `FastQC`, even if the data is acceptable. -A complex or irregular distribution in this plot, however, often indicates contamination in the library. It is also important to note that interpreting GC content in transcriptomics can be challenging. The expected GC distribution depends not only on the sequence composition of the transcriptome but also on gene expression levels in the sample, which are typically unknown beforehand. As a result, some deviation from the theoretical distribution is not unusual in RNA-seq data. +A complex or irregular distribution in this plot, however, often indicates contamination in the library. +It is also important to note that interpreting GC content in transcriptomics can be challenging. +The expected GC distribution depends not only on the sequence composition of the transcriptome but also on gene expression levels in the sample, which are typically unknown beforehand. +As a result, some deviation from the theoretical distribution is not unusual in RNA-seq data. :::{figure-md} raw-proc-fig-fastqc-per-sequence-gc-content Per Sequence GC Content -A good (left) and a bad (right) per sequence GC content plot. The plot on the left is from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/). The plot on the right is taken from [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html). +A good (left) and a bad (right) per sequence GC content plot. +The plot on the left is from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/). +The plot on the right is taken from [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html). ::: **7. Per Base N Content** -The per-base N content plot displays the percentage of bases at each position that were called as ``N``, indicating that the sequencer lacked sufficient confidence to assign a specific nucleotide. In a high-quality library, the ``N`` content should remain consistently at or near zero across the entire length of the reads. Any noticeable non-zero ``N`` content may indicate issues with sequencing quality or library preparation. +The per-base N content plot displays the percentage of bases at each position that were called as ``N``, indicating that the sequencer lacked sufficient confidence to assign a specific nucleotide. +In a high-quality library, the ``N`` content should remain consistently at or near zero across the entire length of the reads. +Any noticeable non-zero ``N`` content may indicate issues with sequencing quality or library preparation. :::{figure-md} raw-proc-fig-fastqc-per-base-n-content @@ -134,7 +178,10 @@ A good (left) and a bad (right) per base N content plot. **8. Sequence Length Distribution** -The sequence length distribution graph displays the distribution of read lengths across all sequences in the file. For most single-cell sequencing chemistries, all reads are expected to have the same length, resulting in a single peak in the graph. However, if quality trimming was applied before the quality assessment, some variation in read lengths may be observed. Small differences in read lengths due to trimming are normal and should not be a cause for concern if expected. +The sequence length distribution graph displays the distribution of read lengths across all sequences in the file. +For most single-cell sequencing chemistries, all reads are expected to have the same length, resulting in a single peak in the graph. +However, if quality trimming was applied before the quality assessment, some variation in read lengths may be observed. +Small differences in read lengths due to trimming are normal and should not be a cause for concern if expected. :::{figure-md} raw-proc-fig-fastqc-sequence-length-distribution Sequence Length Distribution @@ -144,9 +191,11 @@ A good (left) and a bad (right) sequence length distribution plot. **9. Sequence Duplication Levels** -The sequence duplication level plot illustrates the distribution of duplication levels for read sequences, represented by the blue line, both before and after deduplication. In single-cell platforms, multiple rounds of {term}`PCR` are typically required, and highly expressed genes naturally produce a large number of transcripts. Additionally, since `FastQC` is not UMI-aware (i.e., it does not account for unique molecular identifiers), it is common for a small subset of sequences to show high duplication levels. +The sequence duplication level plot illustrates the distribution of duplication levels for read sequences, represented by the blue line, both before and after deduplication. In single-cell platforms, multiple rounds of {term}`PCR` are typically required, and highly expressed genes naturally produce a large number of transcripts. +Additionally, since `FastQC` is not UMI-aware (i.e., it does not account for unique molecular identifiers), it is common for a small subset of sequences to show high duplication levels. -While this may trigger a warning or failure in this module, it does not necessarily indicate a quality issue with the data. However, the majority of sequences should still exhibit low duplication levels, reflecting a diverse and well-prepared library. +While this may trigger a warning or failure in this module, it does not necessarily indicate a quality issue with the data. +However, the majority of sequences should still exhibit low duplication levels, reflecting a diverse and well-prepared library. :::{figure-md} raw-proc-fig-fastqc-sequence-duplication-levels Sequence Duplication Levels @@ -156,9 +205,12 @@ A good (left) and a bad (right) per sequence duplication levels plot. **10. Overrepresented Sequences** -The overrepresented sequences module identifies read sequences that constitute more than 0.1% of the total reads. In single-cell sequencing, some overrepresented sequences may arise from highly expressed genes amplified during PCR. However, the majority of sequences should not be overrepresented. +The overrepresented sequences module identifies read sequences that constitute more than 0.1% of the total reads. +In single-cell sequencing, some overrepresented sequences may arise from highly expressed genes amplified during PCR. +However, the majority of sequences should not be overrepresented. -If the source of an overrepresented sequence is identified (i.e., not listed as "No Hit"), it could indicate potential contamination in the library from the corresponding source. Such cases warrant further investigation to ensure data quality. +If the source of an overrepresented sequence is identified (i.e., not listed as "No Hit"), it could indicate potential contamination in the library from the corresponding source. +Such cases warrant further investigation to ensure data quality. :::{figure-md} raw-proc-fig-fastqc-overrepresented-sequences Overrepresented Sequences @@ -168,7 +220,10 @@ An overrepresented sequence table. **11. Adapter Content** -The adapter content module displays the cumulative percentage of reads containing {term}`adapter sequences` at each base position. High levels of adapter sequences indicate incomplete removal of adapters during library preparation, which can interfere with downstream analyses. Ideally, no significant adapter content should be present in the data. If adapter sequences are abundant, additional trimming may be necessary to improve data quality. +The adapter content module displays the cumulative percentage of reads containing {term}`adapter sequences` at each base position. +High levels of adapter sequences indicate incomplete removal of adapters during library preparation, which can interfere with downstream analyses. +Ideally, no significant adapter content should be present in the data. +If adapter sequences are abundant, additional trimming may be necessary to improve data quality. :::{figure-md} raw-proc-fig-fastqc-adapter-content Adapter Content @@ -184,7 +239,9 @@ Multiple FastQC reports can be combined into a single report using the tool [`Mu ## Alignment and mapping -Mapping or Alignment is a critical step in single-cell raw data processing. It involves determining the potential {term}`loci` of origin for each sequenced fragment, such as the genomic or transcriptomic locations that closely match the read sequence. This step is essential for correctly assigning reads to their source regions. +Mapping or Alignment is a critical step in single-cell raw data processing. +It involves determining the potential {term}`loci` of origin for each sequenced fragment, such as the genomic or transcriptomic locations that closely match the read sequence. +This step is essential for correctly assigning reads to their source regions. In single-cell sequencing protocols, the raw sequence files typically include: @@ -192,13 +249,19 @@ In single-cell sequencing protocols, the raw sequence files typically include: - Unique Molecular Identifiers (UMIs): Tags that distinguish individual molecules to account for amplification bias. - Raw {term}`cDNA` Sequences: The actual read sequences generated from the molecules. -As the first step in raw data processing ({numref}`raw-proc-fig-overview`), accurate mapping or alignment is crucial for reliable downstream analyses. Errors during this step, such as incorrect mapping of reads to transcripts or genes, can result in inaccurate or misleading count matrices, ultimately compromising the quality of subsequent analyses. +As the first step in raw data processing ({numref}`raw-proc-fig-overview`), accurate mapping or alignment is crucial for reliable downstream analyses. +Errors during this step, such as incorrect mapping of reads to transcripts or genes, can result in inaccurate or misleading count matrices, ultimately compromising the quality of subsequent analyses. -While mapping read sequences to reference sequences _far_ predates the development of scRNA-seq, the sheer scale of modern scRNA-seq datasets—often involving hundreds of millions to billions of reads—makes this step particularly computationally intensive. Many existing RNA-seq aligners are protocol-agnostic and do not inherently account for features specific to scRNA-seq, such as cell barcodes, UMIs, or their positions and lengths. As a result, additional tools are often required for steps like demultiplexing and UMI resolution {cite}`Smith2017`. +While mapping read sequences to reference sequences _far_ predates the development of scRNA-seq, the sheer scale of modern scRNA-seq datasets—often involving hundreds of millions to billions of reads—makes this step particularly computationally intensive. +Many existing RNA-seq aligners are protocol-agnostic and do not inherently account for features specific to scRNA-seq, such as cell barcodes, UMIs, or their positions and lengths. +As a result, additional tools are often required for steps like demultiplexing and UMI resolution {cite}`Smith2017`. -This reliance on separate tools introduces additional computational overhead. It often necessitates storing large intermediate files, which significantly increases disk space usage. Moreover, the extra input and output operations required to process these files further contribute to longer runtime requirements, making the mapping stage both resource-intensive and time-consuming. +This reliance on separate tools introduces additional computational overhead. +It often necessitates storing large intermediate files, which significantly increases disk space usage. +Moreover, the extra input and output operations required to process these files further contribute to longer runtime requirements, making the mapping stage both resource-intensive and time-consuming. -To address the challenges of aligning and mapping scRNA-seq data, several specialized tools have been developed that handle the additional processing requirements automatically or internally. These tools include: +To address the challenges of aligning and mapping scRNA-seq data, several specialized tools have been developed that handle the additional processing requirements automatically or internally. +These tools include: - `Cell Ranger` (commercial software from 10x Genomics) {cite}`raw:Zheng2017` - `zUMIs` {cite}`zumis` @@ -208,7 +271,9 @@ To address the challenges of aligning and mapping scRNA-seq data, several specia - `STARsolo` {cite}`Kaminow2021` - `alevin-fry` {cite}`raw:He2022` -These tools provide specialized capabilities for aligning scRNA-seq reads, parsing technical read content (e.g., cell barcodes and UMIs), demultiplexing, and UMI resolution. Although they offer simplified user interfaces, their internal methodologies differ significantly. Some tools generate traditional intermediate files, such as BAM files, which are processed further, while others operate entirely in memory or use compact intermediate representations to minimize input/output operations and reduce computational overhead. +These tools provide specialized capabilities for aligning scRNA-seq reads, parsing technical read content (e.g., cell barcodes and UMIs), demultiplexing, and UMI resolution. +Although they offer simplified user interfaces, their internal methodologies differ significantly. +Some tools generate traditional intermediate files, such as BAM files, which are processed further, while others operate entirely in memory or use compact intermediate representations to minimize input/output operations and reduce computational overhead. While these tools vary in their specific algorithms, data structures, and trade-offs in time and space complexity, their approaches can generally be categorized along two axes: @@ -219,15 +284,31 @@ While these tools vary in their specific algorithms, data structures, and trade- ### Types of mapping -Here we consider three main types of mapping algorithms that are most commonly applied in the context of mapping sc/snRNA-seq data: spliced alignment, contiguous alignment, and variations of lightweight mapping. +We focus on three main types of mapping algorithms commonly used for mapping sc/snRNA-seq data: spliced alignment, contiguous alignment, and variations of lightweight mapping. -First, let us draw a distinction here between alignment-based approaches and lightweight mapping-based approaches ({numref}`raw-proc-fig-alignment-mapping`). Alignment approaches apply a range of different heuristics to determine the potential loci from which reads may arise and then subsequently score, at each putative locus, the best nucleotide-level alignment between the read and reference, typically using dynamic programming-based approaches. Using dynamic programming algorithms to solve the alignment problem has a long and rich history. The exact type of dynamic programming algorithm used depends on the type of alignment being sought. [Global alignment](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) is applicable for the case where the entirety of the query and reference sequence are to be aligned. On the other hand, [local alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) is applicable when, possibly, only a subsequence of the query is expected to match a subsequence of the reference. Typically, the models most applicable for short-read alignment are neither fully global nor fully local, but fall into a category of semi-global alignment where the majority of the query is expected to align to some substring of the reference (this is often called "fitting" alignment). Moreover, it is sometimes common to allow soft-clipping of the alignment so that the penalties for mismatches, insertions, or deletions appearing at the very start or end of the read are ignored or down-weighted. This is commonly done using ["extension" alignment](https://github.com/smarco/WFA2-lib#-33-alignment-span). Though these modifications change the specific rules used in the dynamic programming recurrence and traceback, they do not fundamentally change its overall complexity. +First, we distinguish between alignment-based approaches and lightweight mapping-based approaches ({numref}`raw-proc-fig-alignment-mapping`). +Alignment-based methods use various heuristics to identify potential loci from which reads may originate and then score the best nucleotide-level alignment between the read and reference, typically using dynamic programming algorithms. +These algorithms have a long and rich history, with the specific algorithm used depending on the type of alignment required. -Apart from these general alignment techniques, a number of more sophisticated modifications and heuristics have been designed to make the alignment process more practical and efficient in the context of aligning genomic sequencing reads. For example, `banded alignment` {cite}`chao1992aligning` is a popular heuristic included in the alignment implementation of many popular tools, which avoids computing large parts of the dynamic programming table if the user is uninterested in alignment scores below a certain threshold. Likewise, other heuristics like X-drop {cite}`zhang2000` and Z-drop {cite}`li2018minimap2` are popular for pruning un-promising alignments early. Recent algorithmic progress, such as wavefront alignment {cite}`marco2021fast,marco2022optimal`, allows for determining optimal alignments in asymptotically (and practically) shorter time and smaller space if high-scoring (low divergence) alignments exist. Finally, considerable effort has also been devoted to optimizing data layout and computation in a way that takes advantage of instruction-level parallelism {cite}`wozniak1997using, rognes2000six, farrar2007striped`,and to expressing the alignment dynamic programming recurrences in a manner that is highly amenable to data parallelism and vectorization (e.g., as in the difference encoding of {cite:t}`Suzuki2018`). Most widely-used alignment tools make use of highly-optimized and vectorized alignment implementations. +For example, [global alignment](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) aligns the entirety of the query and reference sequences, while [local alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) focuses on aligning subsequences. Short-read alignment often employs a semi-global approach, also known as "fitting" alignment, where most of the query aligns to a substring of the reference. +Additionally, "soft-clipping" may be used to reduce penalties for mismatches, insertions, or deletions at the start or end of the read, achieved through ["extension" alignment](https://github.com/smarco/WFA2-lib#-33-alignment-span). +While these variations modify the rules of the dynamic programming recurrence and traceback, they do not fundamentally alter its overall complexity. -In addition to the alignment score, a backtrace of the actual alignment that yields this score may be obtained. Such information is typically encoded as a `CIGAR` string (short for "Concise Idiosyncratic Gapped Alignment Report"), a compressed alphanumeric representation of the alignment, within the SAM or BAM file that is output. An example of a `CIGAR` string is `3M2D4M`, which represents that the alignment has three matches or mismatches, followed by a deletion of length two from the read (bases present in the reference but not the read), followed by four more matches or mismatches. Other variants of the `CIGAR` string can also delineate between matches or mismatches. For example, `3=2D2=2X` is a valid extended `CIGAR` string encoding the same alignment as just described, except that it makes clear that the three bases before the deletion constitute matches and that after the deletion, there are two matched bases followed by two mismatched bases. A detailed description of the `CIGAR` string format can be found in [the SAMtools manual](https://samtools.github.io/hts-specs/SAMv1.pdf) or [the SAM wiki page of UMICH](https://genome.sph.umich.edu/wiki/SAM#What_is_a_CIGAR.3F). +In addition to general alignment techniques, several sophisticated modifications and heuristics have been developed to enhance the practical efficiency of aligning genomic sequencing reads. For example,`banded alignment` {cite}`chao1992aligning` is a popular heuristic used by many tools to avoid computing large portions of the dynamic programming table when alignment scores below a threshold are not of interest. +Other heuristics, like X-drop {cite}`zhang2000` and Z-drop {cite}`li2018minimap2`, efficiently prune unpromising alignments early in the process. Recent advances, such as wavefront alignment {cite}`marco2021fast`, marco2022optimal, enable the determination of optimal alignments in significantly reduced time and space, particularly when high-scoring alignments are present. +Additionally, much work has focused on optimizing data layout and computation to leverage instruction-level parallelism {cite}`wozniak1997using, rognes2000six, farrar2007striped`, and expressing dynamic programming recurrences in ways that facilitate data parallelism and vectorization, such as through difference encoding {cite:t}`Suzuki2018`. +Most widely-used alignment tools incorporate these highly optimized, vectorized implementations. -At the cost of computing such scores, alignment-based approaches know the quality of each potential mapping of a read, which they can then use to filter reads that align well to the reference from mappings that arise as the result of low complexity or "spurious" matches between the read and reference. Alignment-based approaches include both traditional "full-alignment" approaches, as implemented in tools such as `STAR`{cite}`dobin2013star` and `STARsolo`{cite}`Kaminow2021` as well as approaches like _selective-alignment_ implemented in `salmon`{cite}`Srivastava2020Alignment` and `alevin`{cite}`Srivastava2019`, which score mappings but omit the computation of the optimal alignment's backtrace. +In addition to the alignment score, the backtrace of the actual alignment that produces this score is often encoded as a `CIGAR` string (short for "Concise Idiosyncratic Gapped Alignment Report"). +This alphanumeric representation is typically stored in the SAM or BAM file output. +For example, the `CIGAR` string `3M2D4M` indicates that the alignment has three matches or mismatches, followed by a deletion of length two (representing bases present in the reference but not the read), and then four more matches or mismatches. +Extended `CIGAR` strings can provide additional details, such as distinguishing between matches, mismatches, or insertions. +For instance, `3=2D2=2X` encodes the same alignment as the previous example but specifies that the three bases before the deletion are matches, followed by two matched bases and two mismatched bases after the deletion. +A detailed description of the `CIGAR` string format can be found in [the SAMtools manual](https://samtools.github.io/hts-specs/SAMv1.pdf) or [the SAM wiki page of UMICH](https://genome.sph.umich.edu/wiki/SAM#What_is_a_CIGAR.3F). + +Alignment-based approaches, though computationally expensive, provide a quality score for each potential mapping of a read. +This score allows them to distinguish between high-quality alignments and low-complexity or "spurious" matches between the read and reference. +These approaches include traditional "full-alignment" methods, such as those implemented in tools like `STAR` {cite}`dobin2013star` and `STARsolo` {cite}`Kaminow2021`, as well as _selective-alignment_ methods, like those in `salmon` {cite}`Srivastava2020Alignment` and `alevin` {cite}`Srivastava2019`, which score mappings but skip the computation of the optimal alignment’s backtrace. :::{figure-md} raw-proc-fig-alignment-mapping Alignment vs Mapping From 00a8b55c201e95ec0e1f59e4bdf69568849e53fc Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Fri, 31 Jan 2025 13:10:05 +0100 Subject: [PATCH 04/29] clean up to cell barcode correction --- jupyter-book/glossary.md | 105 ++++++++--- .../introduction/raw_data_processing.md | 167 ++++++++++++++---- 2 files changed, 210 insertions(+), 62 deletions(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index b49f5ab0..681f6374 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -3,7 +3,9 @@ ```{glossary} Adapter sequences adapter sequences - Short, synthetic DNA or RNA sequences that are ligated to the ends of DNA or RNA fragments during library preparation for sequencing. These adapters are essential for binding the fragments to the flowcell and enabling amplification and sequencing. However, if adapters are not properly removed or trimmed after sequencing, they can appear in the reads, potentially interfering with alignment and downstream analyses. + Short, synthetic DNA or RNA sequences that are ligated to the ends of DNA or RNA fragments during library preparation for sequencing. + These adapters are essential for binding the fragments to the flowcell and enabling amplification and sequencing. + However, if adapters are not properly removed or trimmed after sequencing, they can appear in the reads, potentially interfering with alignment and downstream analyses. Algorithm Algorithms @@ -11,31 +13,44 @@ Algorithms AnnData AnnDatas - A Python package for annotated data matrices. The primary data structure used in the scverse ecosystem. + A Python package for annotated data matrices. + The primary data structure used in the scverse ecosystem. + +Backtrace +backtrace + In the context of sequence alignment, a backtrace refers to the step-by-step reconstruction of the alignment path that produces the optimal alignment score. + This path shows how the sequences are aligned, including matches, mismatches, insertions, deletions, and gaps. Barcode Barcodes Bar code Bar codes - Short DNA barcode fragments ("tags") that are used to identify reads originating from the same cell. Reads are later grouped by their barcode during raw data processing steps. + Short DNA barcode fragments ("tags") that are used to identify reads originating from the same cell. + Reads are later grouped by their barcode during raw data processing steps. Batch effect - Technical confounding factors in an experiment that cause dataset distribution shifts. Usually lead to inaccurate conclusions if the causes of the batch effects are correlated with outcomes of interest in an experiment and should be accounted for (usually removed). + Technical confounding factors in an experiment that cause dataset distribution shifts. + Usually lead to inaccurate conclusions if the causes of the batch effects are correlated with outcomes of interest in an experiment and should be accounted for (usually removed). Benchmark An (independent) comparison of performance of several tools with respect to pre-defined metrics. Bulk RNA sequencing - Contrary to single-cell sequencing, bulk sequencing measures the average expression values of several cells. Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze. + Contrary to single-cell sequencing, bulk sequencing measures the average expression values of several cells. + Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze. Cell - The fundamental unit of life. Consists of cytoplasm enclosed within a membrane that contains many biomolecules such as proteins and nucleic acids. Cells acquire specific functions, transition to cell types, divide, communicate and keep the organism going. Learning about the structure, activity and communication of cells helps deciphering biology. + The fundamental unit of life. + Consists of cytoplasm enclosed within a membrane that contains many biomolecules such as proteins and nucleic acids. + Cells acquire specific functions, transition to cell types, divide, communicate and keep the organism going. + Learning about the structure, activity and communication of cells helps deciphering biology. Cell barcode See {term}`barcode` Cell type annotation - The process of labeling groups of {term}`clusters` of cells by {term}`cell type`. Commonly done based on {term}`cell type` specific markers, automatically with classifiers or by mapping against a reference. + The process of labeling groups of {term}`clusters` of cells by {term}`cell type`. + Commonly done based on {term}`cell type` specific markers, automatically with classifiers or by mapping against a reference. Cell type Cells that share common morphological or phenotypic features. @@ -48,37 +63,48 @@ Chromatin Cluster Clusters - A group of a population or data points that share similarities. In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`). + A group of a population or data points that share similarities. + In single-cell, clusters usually share a common function or marker gene expression that is used for annotation (see {term}`cell type annotation`). Complementary DNA (cDNA) cDNA - DNA synthesized from an RNA template by the enzyme reverse transcriptase. cDNA is commonly used in RNA-seq library preparation because it is more stable than RNA and allows the captured transcripts to be amplified and sequenced for gene expression analysis. + DNA synthesized from an RNA template by the enzyme reverse transcriptase. + cDNA is commonly used in RNA-seq library preparation because it is more stable than RNA and allows the captured transcripts to be amplified and sequenced for gene expression analysis. Demultiplexing The process of determining which sequencing reads belong to which cell using {term}`barcodes`. DNA - DNA is the acronym of Deoxyribonucleic acid. It is the organic chemical storing hereditary information and instructions for protein synthesis. DNA gets transcribed into {term}`RNA`. + DNA is the acronym of Deoxyribonucleic acid. + It is the organic chemical storing hereditary information and instructions for protein synthesis. + DNA gets transcribed into {term}`RNA`. Doublets Reads obtained from droplet based assays might be mistakenly associated to a single cell while the RNA expression origins from two or more cells (a doublet). Downstream analysis downstream analyses - A phase of data analysis that follows the initial processing of raw data. In the context of scRNA-seq, this includes tasks such as normalization, integration, filtering, cell type identification, trajectory inference, and studying expression dynamics. + A phase of data analysis that follows the initial processing of raw data. + In the context of scRNA-seq, this includes tasks such as normalization, integration, filtering, cell type identification, trajectory inference, and studying expression dynamics. Dropout - A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type`. The reason for dropouts are commonly low amounts of mRNA expression in cells and the general stochasticity of mRNA expression. Dropouts are one of the reasons why scRNA-seq data is sparse. + A gene with low expression that is observed in one cell, but not in other cells of the same {term}`cell type`. + The reason for dropouts are commonly low amounts of mRNA expression in cells and the general stochasticity of mRNA expression. + Dropouts are one of the reasons why scRNA-seq data is sparse. Drop-seq A protocol for scRNA-seq that separates cells into nano-liter sized aqueous droplets enabling large-scale profiling. FASTQ reads - Sequencing reads that are saved in the FASTQ format. FASTQ files are then used to map against the reference genome of interest to obtain gene counts for cells. + Sequencing reads that are saved in the FASTQ format. + FASTQ files are then used to map against the reference genome of interest to obtain gene counts for cells. Flowcell flowcell - A consumable device used in sequencing platforms, such as those from Illumina, where DNA or RNA fragments are sequenced. It consists of a glass or polymer surface with lanes or channels coated with oligonucleotides, which capture and anchor DNA or RNA fragments. During sequencing, these fragments are amplified into clusters, and their sequences are determined by detecting fluorescent signals emitted during nucleotide incorporation. The flowcell enables high-throughput sequencing by allowing millions of fragments to be sequenced simultaneously. + A consumable device used in sequencing platforms, such as those from Illumina, where DNA or RNA fragments are sequenced. + It consists of a glass or polymer surface with lanes or channels coated with oligonucleotides, which capture and anchor DNA or RNA fragments. + During sequencing, these fragments are amplified into clusters, and their sequences are determined by detecting fluorescent signals emitted during nucleotide incorporation. + The flowcell enables high-throughput sequencing by allowing millions of fragments to be sequenced simultaneously. Gene expression matrix A cell (barcode) by gene (scverse ecosystem) or gene by cell (barcode) matrix storing counts in the cell values. @@ -95,10 +121,13 @@ Library Locus Loci loci - Specific position or region on a genome or transcriptome where a particular sequence or genetic feature is located. In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region. Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data. + Specific position or region on a genome or transcriptome where a particular sequence or genetic feature is located. + In sequencing, loci refer to the potential origins of a read or fragment, such as a gene, exon, or intergenic region. + Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data. MuData - A Python package for multimodal annotated data matrices. The primary data structure in the scverse ecosystem for multimodal data. + A Python package for multimodal annotated data matrices. + The primary data structure in the scverse ecosystem for multimodal data. Muon muon @@ -107,17 +136,13 @@ muon Negative binomial distribution A discrete probability distribution that models the number of successes in a sequence of independent and identically distributed Bernoulli trials before a specified number of failures. -RNA - Ribonucleic acid. Single-stranded nucleic acid present in all living cells that encodes and regulates gene expression. - -RT-qPCR - Quantitative reverse transcription PCR (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR. - PCR - Polymercase chain reaction (PCR) is a method to amplify sequences to create billions of copies. PCR requires primers, which are short synthetic {term}`DNA` fragments, to select the genome segments to be amplified and subsequently multiple rounds of {term}`DNA` synthesis to amplify the targeted segments. + Polymercase chain reaction (PCR) is a method to amplify sequences to create billions of copies. + PCR requires primers, which are short synthetic {term}`DNA` fragments, to select the genome segments to be amplified and subsequently multiple rounds of {term}`DNA` synthesis to amplify the targeted segments. Pipeline - Also often times denoted as workflow. A pre-specified selection of steps that are commonly executed in order. + Also often times denoted as workflow. + A pre-specified selection of steps that are commonly executed in order. Poisson distribution Discrete probability distribution denoting the probability of a specified number of events occurring in a fixed interval of time or space with the events occurring independently at a known constant mean rate. @@ -126,24 +151,46 @@ Promoter Sequence of DNA to which proteins bind to initiate and control transcription. Pseudotime - Latent and therefore unobserved dimension reflecting cells' progression through transitions. Pseudotime is usually related to real time events, but not necessarily the same. + Latent and therefore unobserved dimension reflecting cells' progression through transitions. + Pseudotime is usually related to real time events, but not necessarily the same. + +RNA + Ribonucleic acid. Single-stranded nucleic acid present in all living cells that encodes and regulates gene expression. + +RT-qPCR + Quantitative reverse transcription PCR (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR. scanpy A Python package for single-cell analysis in Python by scverse. scverse - A consortium for fundamental single-cell tools in the life sciences that are maintaining computational analysis tools like scanpy, muon and scvi-tools. See: https://scverse.org/ + A consortium for fundamental single-cell tools in the life sciences that are maintaining computational analysis tools like scanpy, muon and scvi-tools. + See: https://scverse.org/ signal-to-noise ratio - A measure of the clarity of a signal relative to background noise. In sequencing, the signal represents the detectable information derived from the DNA or RNA molecules being sequenced, while the noise includes random errors or unwanted signals that can obscure or distort the true data. A high SNR indicates that the signal is strong and reliable compared to the noise, resulting in better data quality. Conversely, a low SNR means the noise may interfere with or reduce the accuracy of the sequencing results. + A measure of the clarity of a signal relative to background noise. + In sequencing, the signal represents the detectable information derived from the DNA or RNA molecules being sequenced, while the noise includes random errors or unwanted signals that can obscure or distort the true data. + A high SNR indicates that the signal is strong and reliable compared to the noise, resulting in better data quality. + Conversely, a low SNR means the noise may interfere with or reduce the accuracy of the sequencing results. Spike-in RNA RNA transcripts of known sequence and quantity to calibrate measurements in RNA hybridization steps for RNA-seq. +Splice Junctions +splice junctions + Locations where introns are removed, and exons are joined together in a mature RNA transcript during RNA splicing. These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA. + Trajectory inference - Also known as pseudotemporal ordering. The computational recovery of dynamic processes by ordering cells by similarity or other means. + Also known as pseudotemporal ordering. + The computational recovery of dynamic processes by ordering cells by similarity or other means. Unique Molecular Identifier (UMI) unique molecular identifiers (UMIs) - Specific type of molecular barcodes aiding with error correction and increased accuracy during sequencing. UMIs unique tag molecules in sample libraries enabling estimation of PCR duplication rates. + Specific type of molecular barcodes aiding with error correction and increased accuracy during sequencing. + UMIs unique tag molecules in sample libraries enabling estimation of PCR duplication rates. + +Untranslated Region (UTR) +UTR + A segment of an mRNA transcript that is transcribed but not translated into protein. + UTRs are located at both ends of the coding sequence. ``` diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 7b18b36e..ff54993a 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -294,12 +294,13 @@ For example, [global alignment](https://en.wikipedia.org/wiki/Needleman%E2%80%93 Additionally, "soft-clipping" may be used to reduce penalties for mismatches, insertions, or deletions at the start or end of the read, achieved through ["extension" alignment](https://github.com/smarco/WFA2-lib#-33-alignment-span). While these variations modify the rules of the dynamic programming recurrence and traceback, they do not fundamentally alter its overall complexity. -In addition to general alignment techniques, several sophisticated modifications and heuristics have been developed to enhance the practical efficiency of aligning genomic sequencing reads. For example,`banded alignment` {cite}`chao1992aligning` is a popular heuristic used by many tools to avoid computing large portions of the dynamic programming table when alignment scores below a threshold are not of interest. +In addition to general alignment techniques, several sophisticated modifications and heuristics have been developed to enhance the practical efficiency of aligning genomic sequencing reads. +For example,`banded alignment` {cite}`chao1992aligning` is a popular heuristic used by many tools to avoid computing large portions of the dynamic programming table when alignment scores below a threshold are not of interest. Other heuristics, like X-drop {cite}`zhang2000` and Z-drop {cite}`li2018minimap2`, efficiently prune unpromising alignments early in the process. Recent advances, such as wavefront alignment {cite}`marco2021fast`, marco2022optimal, enable the determination of optimal alignments in significantly reduced time and space, particularly when high-scoring alignments are present. Additionally, much work has focused on optimizing data layout and computation to leverage instruction-level parallelism {cite}`wozniak1997using, rognes2000six, farrar2007striped`, and expressing dynamic programming recurrences in ways that facilitate data parallelism and vectorization, such as through difference encoding {cite:t}`Suzuki2018`. Most widely-used alignment tools incorporate these highly optimized, vectorized implementations. -In addition to the alignment score, the backtrace of the actual alignment that produces this score is often encoded as a `CIGAR` string (short for "Concise Idiosyncratic Gapped Alignment Report"). +In addition to the alignment score, the {term}`backtrace` of the actual alignment that produces this score is often encoded as a `CIGAR` string (short for "Concise Idiosyncratic Gapped Alignment Report"). This alphanumeric representation is typically stored in the SAM or BAM file output. For example, the `CIGAR` string `3M2D4M` indicates that the alignment has three matches or mismatches, followed by a deletion of length two (representing bases present in the reference but not the read), and then four more matches or mismatches. Extended `CIGAR` strings can provide additional details, such as distinguishing between matches, mismatches, or insertions. @@ -316,85 +317,185 @@ These approaches include traditional "full-alignment" methods, such as those imp An abstract overview of the alignment-based method and lightweight mapping-based method. ::: -Alignment-based approaches can be further categorized into spliced-alignment and contiguous-alignment approaches (currently, there are no lightweight-mapping approaches that perform spliced mapping). Spliced-alignment approaches are capable of aligning a sequence read over several distinct segments of a reference, allowing potentially large gaps between the regions of the reference that align well with the corresponding parts of the read. These alignment approaches are typically applied when aligning RNA-seq reads to the genome, since the alignment procedure must be able to align reads that span across one or more splice junctions of the transcript, where a sequence that is contiguous in the read may be separated by intron and exon subsequences (potentially spanning kilobases of sequence) in the reference. Spliced alignment is a challenging problem, particularly in cases where only a small region of the read spans a splicing junction, since very little informative sequence may be available to place the segment of the read that overhangs the splice junction by only a small amount. On the other hand, contiguous alignment seeks a contiguous substring of the reference that aligns well against the read. In such alignment problems, though small insertions and deletions may be allowed, large gaps such as those observed in spliced alignments are typically not tolerated. +Alignment-based approaches can be categorized into spliced-alignment and contiguous-alignment methods. +Currently, no lightweight-mapping approaches perform spliced mapping. -Finally, we can draw a distinction between alignment-based methods such as those described above and lightweight-mapping methods, which include approaches such as pseudoalignment {cite}`Bray2016`, quasi-mapping {cite}`srivastava2016rapmap`, and pseudoalignment with structural constraints {cite}`raw:He2022`. Such approaches generally achieve superior speed by avoiding nucleotide-level alignment between the read and reference sequences. Instead, they base the determination of the reported mapping loci of a read on a separate set of rules and heuristics that may look only at the set of matching k-mers or other types of exact matches (e.g., via a consensus rule) and, potentially, their orientations and relative positions with respect to each other on both the read and reference (e.g., chaining). This can lead to substantially increased speed and mapping throughput, but also precludes easily-interpretable score-based assessments of whether or not the matches between the read and reference constitute a good match capable of supporting a high-quality alignment. +**Spliced-alignment methods** allow a sequence read to align across multiple distinct segments of a reference, allowing potentially large gaps between aligned regions. +These approaches are particularly useful for aligning RNA-seq reads to the genome, where reads may span {term}`splice junctions`. +In such cases, a contiguous sequence in the read may be separated by intron and exon subsequence in the reference, potentially spanning kilobases of sequence. +Spliced alignment is especially challenging when only a small portion of a read overlaps a splice junction, as limited sequence information is available to accurately place the overhanging segment. + +**Contiguous-alignment methods**, in contrast, require a continuous substring of the reference to align well with the read. +While small insertions and deletions may be tolerated, large gaps—such as those in spliced alignments—are generally not allowed. + +Alignment-based methods, such as spliced and contiguous alignment, can be distinguished from **lightweight-mapping methods**, which include approaches like **pseudoalignment** {cite}`Bray2016`, **quasi-mapping** {cite}`srivastava2016rapmap`, and **pseudoalignment with structural constraints** {cite}`raw:He2022`. + +Lightweight-mapping methods achieve significantly higher speed by bypassing nucleotide-level alignment between the read and reference sequences. +Instead, they determine mapping loci based on alternative rules and heuristics, such as identifying matching k-mers or other exact matches. +These methods may also consider the orientation and relative positions of these matches on both the read and reference (e.g., through chaining). + +While this approach greatly improves speed and throughput, it does not provide easily-interpretable score-based assessments to determine the quality of a match, making it more difficult to assess alignment confidence. (raw-proc:mapping-references)= ### Mapping against different reference sequences -While different choices can be made among the mapping algorithms themselves, different choices can _also_ be made about the reference against which the read is mapped. We consider three main categories of reference sequence against which a method might map: +In addition to selecting a mapping algorithm, choices can _also_ be made regarding the reference sequence against which the reads are mapped. +There are three main categories of reference sequences: -- The full (likely annotated) reference genome -- The annotated transcriptome -- An augmented transcriptome +- Full reference genome (typically annotated) +- Annotated transcriptome +- Augmented transcriptome -It is also worth noting that, currently, not all combinations of mapping algorithms and reference sequences are possible. For example, lightweight mapping-based algorithms do not currently exist that can perform spliced mapping of reads against a reference genome. +Currently, not all combinations of mapping algorithms and reference sequences are possible. For instance, lightweight-mapping algorithms do not yet support spliced mapping of reads against a reference genome. (raw-proc:genome-mapping)= #### Mapping to the full genome -The first type of reference, against which a method may map, consists of the entire genome of the target organism, usually with the annotated transcripts considered during mapping. Tools like `zUMIs` {cite}`zumis`, `Cell Ranger` {cite}`raw:Zheng2017` and `STARsolo` {cite}`Kaminow2021` take this approach. Since many of the reads arise from spliced transcripts, such an approach also requires the use of a splice-aware alignment algorithm where the alignment for a read can be split across one or more splicing junctions. The main benefits of this approach are that reads arising from any location in the genome, not just from annotated transcripts, can be accounted for. Since these approaches require constructing a genome-wide index, there is little marginal cost for reporting not only the reads that map to known spliced transcripts but also reads that overlap or align within introns, making the alignment cost when using such approaches very similar for both single-cell and single-nucleus data. A final benefit is that even reads residing outside of the annotated transcripts, exons, and introns can be accounted for by such methods, which can enable _post hoc_ augmentation of the set of quantified loci (e.g., as is done by {cite:t}`Pool2022` by adding expressed UTR extensions to transcript annotations in a sample-specific and data-driven manner) and potentially increase gene detection and quantification sensitivity. +The first type of reference used for mapping is the **entire genome** of the target organism, typically with annotated transcripts considered during mapping. +Tools such as `zUMIs` {cite}`zumis`, `Cell Ranger` {cite}`raw:Zheng2017`, and `STARsolo` {cite}`Kaminow2021` follow this approach. +Since many reads originate from **spliced transcripts**, this method requires a **splice-aware alignment algorithm** capable of splitting alignments across one or more splice junctions. + +A key advantage of this approach is that it accounts for reads arising from any location in the genome, not just those from annotated transcripts. +Additionally, because a **genome-wide index** is constructed, there is minimal additional cost in reporting not only reads that map to known spliced transcripts but also those that overlap introns or align within non-coding regions, making this method equally effective for **single-cell** and **single-nucleus** data. +Another benefit is that even reads mapping outside annotated transcripts, exons, or introns can still be accounted for, enabling **_post hoc_ augmentation** of the quantified loci. +For instance, methods such as those described by {cite:t}`Pool2022` incorporate expressed {term}`UTR` extensions in a sample-specific, data-driven manner, potentially increasing gene detection and improving quantification sensitivity. + +While spliced alignment against the full genome offers versatility, it also comes with certain trade-offs. +One major limitation is the high memory requirements of commonly used alignment tools in the single-cell space. +Many of these tools are based on the **STAR** aligner {cite}`dobin2013star`, due to its speed and versatillity, and require substantial computational resources. +For a human-scale genome, constructing and storing the index can demand over $32$ GB of memory. +Using a sparse [suffix array](https://en.wikipedia.org/wiki/Suffix_array) can nearly halve the final index size, but this comes at the cost of reduced alignment speed and still requires significant memory for initial construction. + +Additionally, spliced alignment is inherently more complex than contiguous alignment. +Because current spliced-alignment tools must explicitly compute a score for each read, this approach has a higher computational cost compared to alternatives. -On the other hand, the versatility of the strategy of performing spliced alignment against the full genome comes with certain trade-offs. First, the most commonly-used alignment tools that adopt this strategy in the single-cell space have substantial memory requirements. Due to its speed and versatility, most of these tools are based upon the STAR {cite}`dobin2013star` aligner. Yet, for a human-scale genome, constructing and storing the index can require upwards of $32$ GB of memory. The use of a sparse [suffix array](https://en.wikipedia.org/wiki/Suffix_array) can reduce the final index size by close to a factor of $2$, but this comes at a reduction in alignment speed, and it still requires larger memory for the initial construction. Second, the increased difficulty of spliced alignment compared to contiguous alignment and the fact that current spliced-alignment tools must explicitly compute a score for each read alignment, means that this approach comes at an increased computational cost compared to the alternatives. Finally, such an approach requires the genome of the organism under study is available. While this is not problematic for the most commonly-studied reference organisms and is rarely an issue, it can make using such tools difficult for non-model organisms where only a transcriptome assembly may be available. +Finally, spliced alignment requires an available reference genome for the organism under study. +While this is rarely an issue for well-characterized model organisms, it can pose challenges when working with non-model organisms, where only a transcriptome assembly may be available. (raw-proc:txome-mapping)= #### Mapping to the spliced transcriptome -To reduce the computational overhead of spliced alignment to a genome, the widely-adopted alternative is to use just the sequences of the annotated transcripts themselves as a reference. Since the majority of single-cell experiments are generated from model organisms (such as mouse or human), which have well-annotated transcriptomes, such transcriptome-based quantification methods may provide similar read coverage to their genome-based counterparts. Compared to the genome, transcriptome sequences are much smaller in size, minimizing the computational resources required for running the mapping pipeline. Additionally, since the relevant splicing patterns are already represented in the transcript sequences themselves, this approach dispenses with the need to solve the difficult spliced-alignment problem. Instead, one can simply search for contiguous alignments or mappings for the read. Since only contiguous mappings need to be discovered, both alignment and lightweight mapping techniques are amenable to transcriptome references, and both approaches have been used in popular tools that adopt the spliced transcriptome as the target of reference mapping. +To reduce the computational overhead of spliced alignment to a genome, a widely adopted alternative is to use only the annotated transcript sequences as the reference. +Since most single-cell experiments are conducted on model organisms like mouse or human, which have well-annotated transcriptomes, transcriptome-based quantification can achieve similar read coverage to genome-based methods. -However, while such approaches can greatly reduce the memory and time required for alignment and mapping, they will fail to capture reads that arise from outside of the spliced transcriptome. Such an approach is therefore not adequate when processing single-nucleus data. Even in single-cell experiments, reads arising from outside of the spliced transcriptome can constitute a substantial fraction of all data, and there is growing evidence that such reads should be incorporated into subsequent analysis {cite}`technote_10x_intronic_reads,Pool2022`. Further, when paired with lightweight-mapping approaches, short matches shared between the spliced transcriptome and the reference sequences that truly give rise to a read may lead to spurious read mappings, which, in turn, can lead to spurious and even biologically implausible estimates of the expression of certain genes {cite}`Kaminow2021,Bruning2022Comparative,raw:He2022`. +Compared to the genome, transcriptome sequences are much smaller, significantly reducing the computational resources needed for mapping. +Additionally, because splicing patterns are already represented in transcript sequences, this approach eliminates the need for complex spliced alignment. +Instead, one can simply search for contiguous alignments or mappings for the read. +Instead, reads can be mapped using contiguous alignments, making both alignment-based and lightweight-mapping techniques suitable for transcriptome references. +As a result, both approaches are commonly used in popular tools that perform reference mapping against the spliced transcriptome. + +While these approaches significantly reduce the memory and time required for alignment and mapping, they fail to capture reads that arise from outside the spliced transcriptome. +As a result, they are not suitable for processing single-nucleus data. +Even in single-cell experiments, reads arising from outside of the spliced transcriptome can constitute a substantial fraction of all data, and there is growing evidence that such reads should be incorporated into subsequent analysis {cite}`technote_10x_intronic_reads,Pool2022`. +Even in single-cell experiments, a substantial fraction of reads may arise from regions outside the spliced transcriptome, and increasing evidence suggests that incorporating these reads into downstream analyses can be beneficial {cite}`technote_10x_intronic_reads,Pool2022`. +Additionally, when paired with lightweight-mapping methods, short sequences shared between the spliced transcriptome and the actual genomic regions that generated a read can lead to spurious mappings. This, in turn, may result in misleading and even biologically implausible gene expression estimates {cite}`Kaminow2021,Bruning2022Comparative,raw:He2022`. (raw-proc:aug-txome-mapping)= #### Mapping to an augmented transcriptome -To deal with the fact that sequenced reads may arise from outside of spliced transcripts, it is possible to augment the spliced transcript sequences with other reference sequences that may be expected to give rise to reads (e.g., full-length unspliced transcripts or excised intronic sequences). Transcriptome references, when augmented with further sequences such as introns, allow reference indices typically smaller than those required for the full genome while simultaneously retaining the ability to search only for contiguous read alignments. This means they can potentially enable both faster and less memory-hungry alignment than when mapping against the full genome while still accounting for many of the reads that arise from outside of the spliced transcriptome. Finally, by mapping to an expanded collection of reference sequences, not only are the mapping locations of more reads recovered compared to mapping against just the spliced transcriptome, but when using lightweight mapping-based approaches, spurious mappings can be greatly reduced {cite}`raw:He2022`. Such an expanded or augmented transcriptome is commonly used among approaches (those that do not map to the full genome) when they need to quantify single-nucleus data or prepare data for RNA-velocity analysis {cite}`Soneson2021Preprocessing`. Therefore, such augmented references can be constructed for all of the common methods that don't make use of spliced alignment to the full genome {cite}`Srivastava2019,Melsted2021,raw:He2022`. +To account for reads originating outside spliced transcripts, the spliced transcript sequences can be augmented with additional reference sequences, such as full-length unspliced transcripts or excised intronic sequences. +By incorporating these elements, augmented transcriptome references maintain a smaller index than the full genome while still allowing for contiguous read alignments. +This enables faster and more memory-efficient mapping compared to full-genome alignment, while still capturing many reads that would otherwise be missed when mapping solely to the spliced transcriptome. + +Additionally, expanding the reference set improves mapping accuracy. +More reads can be confidently assigned compared to using only the spliced transcriptome, and when combined with lightweight mapping approaches, spurious mappings can be significantly reduced {cite}`raw:He2022`. +Augmented transcriptomes are widely used in methods that do not map to the full genome, particularly for single-nucleus data processing and RNA velocity analysis {cite}`Soneson2021Preprocessing` (see {doc}`../trajectories/rna_velocity`). +These augmented references can be constructed for all common methods that do not rely on spliced alignment to the full genome {cite}`Srivastava2019,Melsted2021,raw:He2022`. + +{cite:t}`raw:He2022` argue that this approach is valuable even for standard single-cell RNA-seq data and recommend constructing an augmented _splici_ reference (spliced + intronic) for mapping and quantification. +The _splici_ reference is built using the spliced transcriptome sequence alongside sequences representing the merged intronic intervals of annotated genes. +Each reference sequence is labeled with its splicing status, and mapping results are processed using splicing status-aware methods for {ref}`raw-proc:umi-resolution`. -{cite:t}`raw:He2022` argue that such an approach is valuable even when processing standard single-cell RNA-seq data and recommend constructing an augmented _splici_ (meaning spliced + intronic) reference for mapping and quantification. The _splici_ reference is constructed using the spliced transcriptome sequence along with the sequences containing the merged intervals corresponding to the introns of the annotated genes. Each reference is then labeled with its annotated splicing status, and the mapping to this reference is subsequently paired with splicing status-aware methods for {ref}`raw-proc:umi-resolution`. The main benefits of this approach are that it admits the use of lightweight-mapping methods while substantially reducing the prevalence of spurious mappings. It enables the detection of reads of both spliced and unspliced origin, which can increase the sensitivity of subsequent analysis {cite}`technote_10x_intronic_reads,Pool2022`, and, since splicing status is tracked during quantification and reported separately in the output, it unifies the processing pipeline for single-cell, single-nucleus, and RNA-velocity preprocessing. +This approach offers several key benefits. +It allows the use of lightweight mapping methods while significantly reducing spurious mappings. +Additionally, it enables the detection of both spliced and unspliced reads, improving sensitivity in downstream analyses {cite}`technote_10x_intronic_reads,Pool2022`. +Since splicing status is tracked and reported separately, this method also unifies the preprocessing pipeline across single-cell, single-nucleus, and RNA velocity analyses, making it a versatile solution for transcript quantification. (raw-proc:cb-correction)= ## Cell barcode correction -Droplet-based single-cell segregation systems, such as those provided by 10x Genomics, have become an important tool for studying the cause and consequences of cellular heterogeneity. In this segregation system, the RNA material of each captured cell is extracted within a water-based droplet encapsulation along with a barcoded bead. These beads tag the RNA content of individual cells with unique oligonucleotides, called cell barcodes (CBs), that are later sequenced along with the fragments of the cDNAs that are reversely transcribed from the RNA content. The beads contain high-diversity DNA barcodes enabling parallel barcoding of the cell's molecular content and _in silico_ demultiplexing of the sequencing reads into individual cellular bins. +Droplet-based single-cell segregation systems, such as those provided by 10x Genomics, have become an important tool for studying the cause and consequences of cellular heterogeneity. +In this segregation system, the RNA material of each captured cell is extracted within a water-based droplet encapsulation along with a **barcoded bead**. +These beads tag the RNA content of individual cells with unique oligonucleotides, called cell barcodes (CBs), that are later sequenced along with the fragments of the cDNAs that are reversely transcribed from the RNA content. +The beads contain high-diversity DNA barcodes, allowing for parallel barcoding of a cell’s molecular content and _in silico_ demultiplexing of sequencing reads into individual cellular bins. ```{admonition} A note on alignment orientation -Depending on the chemistry of the sample being analyzed and the processing options provided by the user, it is not necessarily the case that all sequenced fragments aligning to the reference will be considered for quantification and barcode correction. -One commonly-applied criterion for filtering is alignment orientation. Specifically, certain chemistries specify protocols such that the aligned reads should only derive from (i.e. map back to) the underlying transcripts in a specific orientation. -For example, in 10x Genomics 3' Chromium chemistries, we expect the biological read to align to the underlying transcript's forward strand, though anti-sense reads do exist {cite}`technote_10x_intronic_reads`. Thus, mappings of the reads in the reverse-complement orientation to the underlying sequences may be ignored or filtered out at the user's discretion. If a chemistry follows such a so-called "stranded" protocol, this should be documented. +Depending on the sample chemistry and user-defined processing options, not all sequenced fragments that align to the reference are necessarily considered for quantification and barcode correction. +One commonly-applied criterion for filtering is alignment orientation. +Specifically, certain chemistries specify protocols such that the aligned reads should only derive from (i.e. map back to) the underlying transcripts in a specific orientation. +For example, in 10x Genomics 3' Chromium chemistries, we expect the biological read to align to the underlying transcript's forward strand, though anti-sense reads do exist {cite}`technote_10x_intronic_reads`. +As a result, reads mapped in the reverse-complement orientation to the reference sequences may be ignored or filtered out based on user-defined settings. +If a chemistry follows such a so-called "stranded" protocol, this should be documented. ``` ### Type of errors in barcoding -The tag, sequence, and demultiplex method for single-cell profiling generally works well. However, the number of observed cell barcodes (CBs) in a droplet-based library can significantly differ from the number of originally encapsulated cells by several fold. A few main sources of error can lead to such observation: +The tag, sequence, and demultiplexing method used for single-cell profiling is generally effective. +However, in droplet-based libraries, the number of observed cell barcodes (CBs) can differ significantly—often by several fold—from the number of originally encapsulated cells. +This discrepancy arises from several key sources of error: -- Doublet / Multiplet: A single barcode can be associated with two or more cells and lead to undercounting of cells. -- Empty Droplet: A droplet can be captured with no encapsulated cell, where the ambient RNA is tagged with the barcode and can be sequenced; this leads to overcounting of cells. -- Sequence error: Errors can arise during the PCR amplification or sequencing process and can contribute to both under and over-counting of the cells. +- Doublets/Multiplets: A single barcode may be associated with multiple cells, leading to an undercounting of cells. +- Empty Droplets: Some droplets contain no encapsulated cells, and ambient RNA can become tagged with a barcode and sequenced, resulting in overcounting of cells. +- Sequence Errors: Errors introduced during PCR amplification or sequencing can distort barcode counts, contributing to both under- and over-counting. -Computational tools for demultiplexing the {term}`RNA`-seq reads into cell-specific bins use a wide range of diagnostic indicators to filter out artefactual or low-quality data. For example, numerous methods exist for the removal of ambient {term}`RNA` contamination {cite}`raw:Young2020,Muskovic2021,Lun2019`, doublet detection {cite}`DePasquale2019,McGinnis2019,Wolock2019,Bais2019` and cell barcodes correction for sequence errors based on nucleotide sequence similarity. +To address these issues, computational tools for demultiplexing RNA-seq reads into cell-specific bins use various diagnostic indicators to filter out artefactual or low-quality data. +Numerous methods exist for removing ambient RNA contamination {cite}`raw:Young2020,Muskovic2021,Lun2019`, detecting doublets {cite}`DePasquale2019,McGinnis2019,Wolock2019,Bais2019`, and correcting cell barcode errors based on nucleotide sequence similarity. Several common strategies are used for cell barcode identification and correction. -- Correction against a known list of _potential_ barcodes: Certain chemistries, such as 10x Chromium, draw CBs from a known pool of potential barcode sequences. Thus, the set of barcodes observed in any sample is expected to be a subset of this known list, often called a "whitelist", "permit list", or "pass list". In this case, a common strategy is to assume each barcode that exactly matches some element from the known list is correct and for all other barcodes to be correct against the known list of barcodes (i.e., to find barcodes from the permit list that are some small Hamming distance or edit distance away from the observed barcodes). This approach leverages the known permit list to allow efficient correction of many barcodes that have been potentially corrupted. However, one difficulty with this approach is that a given corrupted barcode may have multiple possible corrections in the permit list (i.e., its correction may be ambiguous). In fact, if one considers a barcode that is taken from the [10x Chromium v3 permit list](https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/3M-february-2018.txt.gz) and mutated at a single position to a barcode not in the list, there is a $\sim 81\%$ chance that it sits at Hamming distance $1$ from two or more barcodes in the permit list. The probability of such collisions can be reduced by only considering correcting against barcodes from the known permit list, which, themselves, occur exactly in the given sample (or even only those that occur exactly in the given sample above some nominal frequency threshold). Also, information such as the base quality at the "corrected" position can be used to potentially break ties in the case of ambiguous corrections. Yet, as the number of assayed cells increases, insufficient sequence diversity in the set of potential cell barcodes increases the frequency of ambiguous corrections, and reads tagged with barcodes having ambiguous corrections are most commonly discarded. - -- Knee or elbow-based methods: If a set of potential barcodes is unknown — or even if it is known, but one wishes to correct directly from the observed data itself without consulting an external list — one can adopt a method based on the observation that the list of "true" or high-quality barcodes in a sample is likely those associated with the greatest number of reads. - To do this, one can construct the cumulative frequency plot of the barcodes, in which barcodes are sorted in descending order of the number of distinct reads or UMIs with which they are associated. Often, this ranked cumulative frequency plot will contain a "knee" or "elbow" – an inflection point that can be used to characterize frequently occurring barcodes from infrequent (and therefore likely erroneous) barcodes. Many methods exist for attempting to identify such an inflection point {cite}`Smith2017,Lun2019,raw:He2022` as a likely point of discrimination between properly captured cells and empty droplets. Subsequently, the set of barcodes that appear "above" the knee can be treated as a permit list against which the rest of the barcodes may be corrected, as in the first method list above. Such an approach is flexible as it can be applied in chemistries that have an external permit list and those that don't. Further parameters of the knee-finding algorithms can be altered to yield more or less restrictive selected barcode sets. Yet, such an approach can have certain drawbacks, like a tendency to be overly conservative and sometimes failing to work robustly in samples where no clear knee is present. - -- Filtering and correction based on an expected cell count provided by the user: These approaches seek to estimate a robust list of high-quality or present barcodes in the cases where the CB frequency distribution may not have a clear knee or exhibit bimodality due to technical artifacts. In such an approach, the user provides an estimate of the expected number of assayed cells. Then, the barcodes are ordered by descending frequency, the frequency $f$ at a robust quantile index near the expected cell count is obtained, and all cells having a frequency within a small constant fraction of $f$ (e.g., $\ge \frac{f}{10}$) are considered as valid barcodes. Again, the remaining barcodes are corrected against this valid list by attempting to correct uniquely to one of these valid barcodes based on sequence similarity. - -- Filtering based on a forced number of valid cells: Perhaps the simplest approach, although potentially problematic, is for the user to directly provide the index in the sorted frequency plot above which barcodes will be considered valid. All barcodes with a frequency greater than or equal to the frequency at the selected index are considered valid and treated as constituting the permit list. The remaining set of barcodes is then corrected against this list using the same approach described in the other methods above. If there are at least as many distinct barcodes as the number of cells the user requests, then this many cells will always be selected. Of course, such an approach is only reasonable when the user has a good reason to believe that the threshold frequency should be set around the provided index. +1. **Correction against a known list of _potential_ barcodes**: + Certain chemistries, such as 10x Chromium, draw CBs from a known pool of potential barcode sequences. + Thus, the set of barcodes observed in any sample is expected to be a subset of this known list, often called a "whitelist", "permit list", or "pass list". + In this case, the standard approach assumes that: + +- Any barcode matching an entry in the known list is correct. +- Any barcode not in the list is corrected by finding the closest match from the permit list, typically using {term}`Hamming distance` or {term}`edit distance`. + This strategy allows for efficient barcode correction but has limitations. + If a corrupted barcode closely resembles multiple barcodes in the permit list, its correction becomes ambiguous. + For example, for a barcode taken from the [10x Chromium v3 permit list](https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/3M-february-2018.txt.gz) and mutated at a single position to a barcode not in the list, there is an $\sim 81\%$ probability that it sits at hamming distance $1$ from two or more barcodes in the permit list. + The probability of such collisions can be reduced by considering correcting _only_ against barcodes from the known permit list, which, themselves, occur exactly in the given sample (or even only those that occur exactly in the given sample above some nominal frequency threshold). + Also, information such as the base quality at the "corrected" position can be used to potentially break ties in the case of ambiguous corrections. + Yet, as the number of assayed cells increases, insufficient sequence diversity in the set of potential cell barcodes increases the frequency of ambiguous corrections, and reads tagged with barcodes having ambiguous corrections are most commonly discarded. + +##TODO## 2. **Knee or elbow-based methods**: +If a set of potential barcodes is unknown — or even if it is known, but one wishes to correct directly from the observed data itself without consulting an external list — one can use a method based on the observation that high-quality barcodes are those associated with the highest number of reads in the sample. +To achieve this, one can construct a cumulative frequency plot where barcodes are sorted in descending order based on the number of distinct reads or UMIs they are associated with. +Often, this ranked cumulative frequency plot will contain a "knee" or "elbow" – an inflection point that can be used to characterize frequently occurring barcodes from infrequent (and therefore likely erroneous) barcodes. +Many methods exist for attempting to identify such an inflection point {cite}`Smith2017,Lun2019,raw:He2022` as a likely point of discrimination between properly captured cells and empty droplets. +Subsequently, the set of barcodes that appear "above" the knee can be treated as a permit list against which the rest of the barcodes may be corrected, as in the first method list above. +Such an approach is flexible as it can be applied in chemistries that have an external permit list and those that don't. +Further parameters of the knee-finding algorithms can be altered to yield more or less restrictive selected barcode sets. +Yet, such an approach can have certain drawbacks, like a tendency to be overly conservative and sometimes failing to work robustly in samples where no clear knee is present. + +3. **Filtering and Correction Based on an Expected Cell Count**: + When barcode frequency distributions lack a clear knee or show bimodal patterns due to technical artifacts, barcode correction can be guided by a user-provided expected cell count. + In such an approach, the user provides an estimate of the expected number of assayed cells. + Then, the barcodes are ordered by descending frequency, the frequency $f$ at a robust quantile index near the expected cell count is obtained, and all cells having a frequency within a small constant fraction of $f$ (e.g., $\ge \frac{f}{10}$) are considered as valid barcodes. + Again, the remaining barcodes are corrected against this valid list by attempting to correct uniquely to one of these valid barcodes based on sequence similarity. + +4. **Filtering Based on a Forced Number of Valid Cells**: + The simplest approach, although potentially problematic, is for the user to manually specify the number of valid barcodes. + +- The user chooses an index in the sorted barcode frequency list. +- All barcodes above this threshold are considered valid. +- Remaining barcodes are corrected against this list using standard similarity-based correction methods. + While this guarantees selection of at least n cells, it assumes that the chosen threshold accurately reflects the number of real cells. + It is only reasonable if he user has a good reason to believe that the threshold frequency should be set around the provided index. %In the `alevin-fry` framework, the frequency of every observed cell barcode is generated, and there are four customizable options to select the high-quality cell barcodes for downstream analysis: ### Future challenges -While cellular barcoding of high-throughput single-cell profiling has been a tremendously successful approach, some challenges still remain, especially as the scale of experiments continues to grow. For example, the design of a robust method for selecting high-quality cell barcodes from the set of all the observations is still an active area of research, with distinct challenges arising, e.g., between single-cell and single-nucleus experiments. Also, as single-cell technologies have advanced to profile increasing numbers of cells, insufficient sequence diversity in the CB sequence can result in sequence corrections leading to CB collision. Addressing this latter problem may require more intelligent barcode design methods and continuing increases in the lengths of oligonucleotides used for cell barcoding. +While cellular barcoding of high-throughput single-cell profiling has been a tremendously successful approach, some challenges still remain, especially as the scale of experiments continues to grow. +For example, the design of a robust method for selecting high-quality cell barcodes from the set of all the observations is still an active area of research, with distinct challenges arising, e.g., between single-cell and single-nucleus experiments. +Also, as single-cell technologies have advanced to profile increasing numbers of cells, insufficient sequence diversity in the CB sequence can result in sequence corrections leading to CB collision. +Addressing this latter problem may require more intelligent barcode design methods and continuing increases in the lengths of oligonucleotides used for cell barcoding. (raw-proc:umi-resolution)= From 2ea159a56783495c41fc7c4df5db625c31617477 Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Fri, 31 Jan 2025 17:37:46 +0100 Subject: [PATCH 05/29] clean the rest of the chapter --- jupyter-book/glossary.md | 10 + .../introduction/raw_data_processing.md | 268 ++++++++++++++---- 2 files changed, 215 insertions(+), 63 deletions(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index 681f6374..90b8b797 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -74,6 +74,9 @@ cDNA Demultiplexing The process of determining which sequencing reads belong to which cell using {term}`barcodes`. +directed graph + A directed graph (or digraph) is a graph consisting of a set of nodes (vertices) connected by edges (arcs), where each edge has a direction indicating a one-way relationship between nodes. + DNA DNA is the acronym of Deoxyribonucleic acid. It is the organic chemical storing hereditary information and instructions for protein synthesis. @@ -95,6 +98,9 @@ Dropout Drop-seq A protocol for scRNA-seq that separates cells into nano-liter sized aqueous droplets enabling large-scale profiling. +Edit distance + Edit distance (often referring to Levenshtein distance) measures the minimum number of operations (Substitution, Insertion, Deletion) required to transform one string into another. + FASTQ reads Sequencing reads that are saved in the FASTQ format. FASTQ files are then used to map against the reference genome of interest to obtain gene counts for cells. @@ -109,6 +115,10 @@ flowcell Gene expression matrix A cell (barcode) by gene (scverse ecosystem) or gene by cell (barcode) matrix storing counts in the cell values. +Hamming distance + A measure of the number of positions at which two strings of equal length differ. + It is commonly used in error detection and correction, including barcode correction in sequencing data. + Imputation The replacement of missing values with usually artificial values. diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index ff54993a..67642221 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -463,15 +463,15 @@ Several common strategies are used for cell barcode identification and correctio Also, information such as the base quality at the "corrected" position can be used to potentially break ties in the case of ambiguous corrections. Yet, as the number of assayed cells increases, insufficient sequence diversity in the set of potential cell barcodes increases the frequency of ambiguous corrections, and reads tagged with barcodes having ambiguous corrections are most commonly discarded. -##TODO## 2. **Knee or elbow-based methods**: -If a set of potential barcodes is unknown — or even if it is known, but one wishes to correct directly from the observed data itself without consulting an external list — one can use a method based on the observation that high-quality barcodes are those associated with the highest number of reads in the sample. -To achieve this, one can construct a cumulative frequency plot where barcodes are sorted in descending order based on the number of distinct reads or UMIs they are associated with. -Often, this ranked cumulative frequency plot will contain a "knee" or "elbow" – an inflection point that can be used to characterize frequently occurring barcodes from infrequent (and therefore likely erroneous) barcodes. -Many methods exist for attempting to identify such an inflection point {cite}`Smith2017,Lun2019,raw:He2022` as a likely point of discrimination between properly captured cells and empty droplets. -Subsequently, the set of barcodes that appear "above" the knee can be treated as a permit list against which the rest of the barcodes may be corrected, as in the first method list above. -Such an approach is flexible as it can be applied in chemistries that have an external permit list and those that don't. -Further parameters of the knee-finding algorithms can be altered to yield more or less restrictive selected barcode sets. -Yet, such an approach can have certain drawbacks, like a tendency to be overly conservative and sometimes failing to work robustly in samples where no clear knee is present. +2. **Knee or elbow-based methods**: + If a set of potential barcodes is unknown — or even if it is known, but one wishes to correct directly from the observed data itself without consulting an external list — one can use a method based on the observation that high-quality barcodes are those associated with the highest number of reads in the sample. + To achieve this, one can construct a cumulative frequency plot where barcodes are sorted in descending order based on the number of distinct reads or UMIs they are associated with. + Often, this ranked cumulative frequency plot will contain a "knee" or "elbow" – an inflection point that can be used to characterize frequently occurring barcodes from infrequent (and therefore likely erroneous) barcodes. + Many methods exist for attempting to identify such an inflection point {cite}`Smith2017,Lun2019,raw:He2022` as a likely point of discrimination between properly captured cells and empty droplets. + Subsequently, the set of barcodes that appear "above" the knee can be treated as a permit list against which the rest of the barcodes may be corrected, as in the first method list above. + Such an approach is flexible as it can be applied in chemistries that have an external permit list and those that don't. + Further parameters of the knee-finding algorithms can be altered to yield more or less restrictive selected barcode sets. + Yet, such an approach can have certain drawbacks, like a tendency to be overly conservative and sometimes failing to work robustly in samples where no clear knee is present. 3. **Filtering and Correction Based on an Expected Cell Count**: When barcode frequency distributions lack a clear knee or show bimodal patterns due to technical artifacts, barcode correction can be guided by a user-provided expected cell count. @@ -501,31 +501,57 @@ Addressing this latter problem may require more intelligent barcode design metho ## UMI resolution -After cell barcode (CB) correction, reads have either been discarded or assigned to a corrected CB. Subsequently, we wish to quantify the abundance of each gene within each corrected CB. +After cell barcode (CB) correction, reads have either been discarded or assigned to a corrected CB. +Subsequently, we wish to quantify the abundance of each gene within each corrected CB. -Because of the amplification bias as discussed in {ref}`exp-data:transcript-quantification`, reads must be deduplicated, based upon their UMI, to assess the true count of sampled molecules. Additionally, several other complicating factors present challenges when attempting to perform this estimation. +Because of the amplification bias as discussed in {ref}`exp-data:transcript-quantification`, reads must be deduplicated, based upon their UMI, to assess the true count of sampled molecules. +Additionally, several other complicating factors present challenges when attempting to perform this estimation. -The UMI deduplication step aims to identify the set of reads and UMIs derived from each original, pre-PCR molecule in each cell captured and sequenced in the experiment. The result of this process is to allocate a molecule count to each gene in each cell, which is subsequently used in the downstream analysis as the raw expression estimate for this gene. We refer to this process of looking at the collection of observed UMIs and their associated mapped reads and attempting to infer the original number of observed molecules arising from each gene as the process of _UMI resolution_. +The UMI deduplication step aims to identify the set of reads and UMIs derived from each original, pre-PCR molecule in each cell captured and sequenced in the experiment. +The result of this process is to allocate a molecule count to each gene in each cell, which is subsequently used in the downstream analysis as the raw expression estimate for this gene. +We refer to this process of looking at the collection of observed UMIs and their associated mapped reads and attempting to infer the original number of observed molecules arising from each gene as the process of _UMI resolution_. -To simplify the explanation, in the following text, the reads that map to a reference, for example, a genomic locus of the gene, are called the reads of that reference, and their UMI tags are called the UMIs of that reference; the set of reads that are tagged by a UMI is called the reads of that UMI. A read can only be tagged by one UMI but can belong to multiple references if it maps to multiple references. Furthermore, as the molecule barcoding process for each cell in scRNA-seq is usually isolated and independent (apart from the issues related to accurately resolving cell barcodes raised earlier), without loss of generality, _UMI resolution_ will be explained for a specific cell. The same procedure will usually be applied to all cells independently. +To simplify the explanation, reads that map to a reference (e.g., a genomic locus of a gene) are referred to as the reads of that reference, and their UMI tags are called the UMIs of that reference. +The set of reads associated with a specific UMI is referred to as the reads of that UMI. + +A read can be tagged by only one UMI but may belong to multiple references if it maps to more than one. +Additionally, since molecule barcoding in scRNA-seq is typically isolated and independent for each cell (aside from the previously discussed challenges in resolving cell barcodes), _UMI resolution_ will be explained for a single cell without loss of generality. +This same procedure is generally applied to all cells independently. (raw-proc:need-for-umi-resolution)= ### The need for UMI resolution -In the ideal case, where the correct (unaltered) UMIs tag reads, the reads of each UMI uniquely map to a common reference gene, and there is a bijection between UMIs and pre-PCR molecules. Consequently, the UMI deduplication procedure is conceptually straightforward: the reads of a UMI are the PCR duplicates from a single pre-PCR molecule. The number of captured and sequenced molecules of each gene is the number of distinct UMIs observed for this gene. +In the ideal case, where the correct (unaltered) UMIs tag reads, the reads of each UMI uniquely map to a common reference gene, and there is a bijection between UMIs and pre-PCR molecules. +Consequently, the UMI deduplication procedure is conceptually straightforward: the reads of a UMI are the PCR duplicates from a single pre-PCR molecule. +The number of captured and sequenced molecules of each gene is the number of distinct UMIs observed for this gene. -However, the problems encountered in practice make the simple rules described above insufficient for identifying the gene origin of UMIs in general and necessitate the development of more sophisticated models. Here, we concern ourselves primarily with two challenges. +However, the problems encountered in practice make the simple rules described above insufficient for identifying the gene origin of UMIs in general and necessitate the development of more sophisticated models. +Here, we concern ourselves primarily with two challenges. -- The first problem we discuss is errors in UMIs. These occur when the sequenced UMI tag of reads contains errors introduced during PCR or the sequencing process. Common UMI errors include nucleotide substitutions during PCR and read errors during sequencing. Failing to address such UMI errors can inflate the estimated number of molecules {cite}`Smith2017,ziegenhain2022molecular`. +- **Errors in UMIs**: + These occur when the sequenced UMI tag of reads contains errors introduced during PCR or the sequencing process. + Common UMI errors include nucleotide substitutions during PCR and read errors during sequencing. + Failing to address such UMI errors can inflate the estimated number of molecules {cite}`Smith2017,ziegenhain2022molecular`. -- The second issue we discuss is multimapping, which arises in cases where a read or UMI belongs to multiple references, for example, multi-gene reads/UMIs. This happens if different reads of a UMI map to different genes, a read maps to multiple genes, or both. The consequence of this issue is that the gene origin of the multi-gene reads/UMIs is ambiguous, which results in uncertainty about the sampled pre-PCR molecule count of those genes. Simply discarding multi-gene reads/UMIs can lead to a loss of data or a biased estimate among genes that tend to produce multimapping reads, such as sequence-similar gene families{cite}`Srivastava2019`. +- **Multimapping**: + This issue arises in cases where a read or UMI belongs to multiple references(e.g., multi-gene reads/UMIs). This happens when different reads of a UMI map to different genes, when a read maps to multiple genes, or both. + The consequence of this issue is that the gene origin of the multi-gene reads/UMIs is ambiguous, which results in uncertainty about the sampled pre-PCR molecule count of those genes. + Simply discarding multi-gene reads/UMIs can lead to a loss of data or a biased estimate among genes that tend to produce multimapping reads, such as sequence-similar gene families {cite}`Srivastava2019`. -```{admonition} A note on UMI errors -UMI errors, especially those due to nucleotide substitutions and miscallings, are prevalent in single-cell experiments. {cite:t}`Smith2017` establish that the average number of bases different (edit distance) between the observed UMI sequences in the tested single-cell experiments is lower than randomly sampled UMI sequences, and the enrichment of low edit distances is well correlated with the degree of PCR amplification. Multimapping also exists in single-cell data and, depending upon the gene being considered, can occur at a non-trivial rate. {cite:t}`Srivastava2019` show that discarding the multimapping reads can negatively bias the predicted molecule counts. +```{admonition} A Note on UMI Errors +UMI errors, especially those due to nucleotide substitutions and miscallings, are prevalent in single-cell experiments. +{cite:t}`Smith2017` establish that the average number of bases different (edit distance) between the observed UMI sequences in the tested single-cell experiments is lower than randomly sampled UMI sequences, and the enrichment of low edit distances is well correlated with the degree of PCR amplification. +Multimapping also exists in single-cell data and, depending upon the gene being considered, can occur at a non-trivial rate. +{cite:t}`Srivastava2019` show that discarding the multimapping reads can negatively bias the predicted molecule counts. ``` -There exist other challenges that we do not focus upon here, such as "convergent" and "divergent" UMI collisions. We consider the case where the same UMI is used to tag two different pre-PCR molecules arising from the same gene, in the same cell, as a convergent collision. When two or more distinct UMIs arise from the same pre-PCR molecule, e.g., due to the sampling of multiple priming sites from this molecule, we consider this a divergent collision. We expect convergent UMI collisions to be rare and, therefore, their effect typically small. Further, transcript-level mapping information can sometimes be used to resolve such collisions{cite}`Srivastava2019`. Divergent UMI collisions occur primarily among introns of unspliced transcripts{cite}`technote_10x_intronic_reads`, and approaches to addressing the issues they raise are an area of active research{cite}`technote_10x_intronic_reads,Gorin2021`. +There exist other challenges that we do not focus upon here, such as "convergent" and "divergent" UMI collisions. +We consider the case where the same UMI is used to tag two different pre-PCR molecules arising from the same gene, in the same cell, as a convergent collision. +When two or more distinct UMIs arise from the same pre-PCR molecule, e.g., due to the sampling of multiple priming sites from this molecule, we consider this a divergent collision. +We expect convergent UMI collisions to be rare and, therefore, their effect typically small. +Further, transcript-level mapping information can sometimes be used to resolve such collisions{cite}`Srivastava2019`. +Divergent UMI collisions occur primarily among introns of unspliced transcripts {cite}`technote_10x_intronic_reads`, and approaches to addressing the issues they raise are an area of active research{cite}`technote_10x_intronic_reads,Gorin2021`. Given that the use of UMIs is near ubiquitous in high-throughput scRNA-seq protocols and the fact that addressing these errors improves the estimation of gene abundances, there has been much attention paid to the problem of UMI resolution in recent literature {cite}`Islam2013,Bose2015,raw:Macosko2015,Smith2017,Srivastava2019,Kaminow2021,Melsted2021,raw:He2022,calib,umic,zumis`. @@ -533,17 +559,32 @@ Given that the use of UMIs is near ubiquitous in high-throughput scRNA-seq proto ### Graph-based UMI resolution -As a result of the problems that arise when attempting to resolve UMIs, many methods have been developed to address the problem of UMI resolution. While there are a host of different approaches for UMI resolution, we will focus on a framework for representing problem instances, modified from a framework initially proposed by {cite:t}`Smith2017`, that relies upon the notion of a _UMI graph_. Each connected component of this graph represents a sub-problem wherein certain subsets of UMIs are collapsed (i.e., resolved as evidence of the same pre-PCR molecule). Many popular UMI resolution approaches can be interpreted in this framework by simply modifying precisely how the graph is refined and how the collapse or resolution procedure carried out over this graph works. +As a result of the problems that arise when attempting to resolve UMIs, many methods have been developed to address the problem of UMI resolution. +While there are a host of different approaches for UMI resolution, we will focus on a framework for representing problem instances, modified from a framework initially proposed by {cite:t}`Smith2017`, that relies upon the notion of a _UMI graph_. +Each connected component of this graph represents a sub-problem wherein certain subsets of UMIs are collapsed (i.e., resolved as evidence of the same pre-PCR molecule). +Many popular UMI resolution approaches can be interpreted in this framework by simply modifying precisely how the graph is refined and how the collapse or resolution procedure carried out over this graph works. -In the context of single-cell data, a UMI graph $G(V,E)$ is a directed graph with a node set $V$ and an edge set $E$. Each node $v_i \in V$ represents an equivalence class (EC) of reads, and the edge set $E$ encodes the relationship between the ECs. The equivalence relation $\sim_r$ defined on reads is based on their UMI and mapping information. We say reads $r_x$ and $r_y$ are equivalent, $r_x \sim_r r_y$, if and only if they have identical UMI tags and map to the same set of references. UMI resolution approaches may define a "reference" as a genomic locus{cite}`Smith2017`, transcript{cite}`Srivastava2019,raw:He2022` or gene{cite}`raw:Zheng2017,Kaminow2021`. Other UMI resolution approaches exist, for example, the reference-free model{cite}`umic` and the method of moments{cite}`Melsted2021`, but they may not be easily represented in this framework and are not discussed in further detail here. +In the context of single-cell data, a UMI graph $G(V,E)$ is a {term}`directed graph` with a node set $V$ and an edge set $E$. +Each node $v_i \in V$ represents an equivalence class (EC) of reads, and the edge set $E$ encodes the relationship between the ECs. +The equivalence relation $\sim_r$ defined on reads is based on their UMI and mapping information. +We say reads $r_x$ and $r_y$ are equivalent, $r_x \sim_r r_y$, if and only if they have identical UMI tags and map to the same set of references. +UMI resolution approaches may define a "reference" as a genomic locus{cite}`Smith2017`, transcript{cite}`Srivastava2019,raw:He2022` or gene{cite}`raw:Zheng2017,Kaminow2021`. +Other UMI resolution approaches exist, for example, the reference-free model{cite}`umic` and the method of moments{cite}`Melsted2021`, but they may not be easily represented in this framework and are not discussed in further detail here. -In the UMI graph framework, a UMI resolution approach can be divided into three major steps, each of which has different options that can be modularly composed by different approaches. The three steps are defining nodes, defining adjacency relationships, and resolving components. Additionally, these steps may sometimes be preceded (and/or followed) by filtering steps designed to discard or heuristically assign (by modifying the set of reference mappings reported) reads and UMIs exhibiting certain types of mapping ambiguity. +In the UMI graph framework, a UMI resolution approach can be divided into three major steps: +**defining nodes**, **defining adjacency relationships**, and **resolving components**. +Each of these steps has different options that can be modularly composed by different approaches. +Additionally, these steps may sometimes be preceded (and/or followed) by filtering steps designed to discard or heuristically assign (by modifying the set of reference mappings reported) reads and UMIs exhibiting certain types of mapping ambiguity. (raw-proc:umi-graph-node-def)= #### Defining nodes -As described above, a node $v_i \in V$ is an equivalence class of reads. Therefore, $V$ can be defined based on the full or filtered set of mapped reads and their associated _uncorrected_ UMIs. All reads that satisfy the equivalence relation $\sim_r$ based on their reference set and UMI tag are associated with the same vertex $v \in V$. An EC is a multi-gene EC if its UMI is a multi-gene UMI. Some approaches will avoid the creation of such ECs by filtering or heuristically assigning reads prior to node creation, while other approaches will retain and process these ambiguous vertices and attempt and resolve their gene origin via parsimony, probabilistic assignment, or based on a related rule or model {cite}`Srivastava2019,Kaminow2021,raw:He2022`. +As described above, a node $v_i \in V$ is an equivalence class of reads. +Therefore, $V$ can be defined based on the full or filtered set of mapped reads and their associated _uncorrected_ UMIs. +All reads that satisfy the equivalence relation $\sim_r$ based on their reference set and UMI tag are associated with the same vertex $v \in V$. +An EC is a multi-gene EC if its UMI is a multi-gene UMI. +Some approaches will avoid the creation of such ECs by filtering or heuristically assigning reads prior to node creation, while other approaches will retain and process these ambiguous vertices and attempt and resolve their gene origin via parsimony, probabilistic assignment, or based on a related rule or model {cite}`Srivastava2019,Kaminow2021,raw:He2022`. (raw-proc:umi-graph-edge-def)= @@ -558,24 +599,38 @@ Here we define the following functions on the node $v_i \in V$: - $m(v_i)$ is the reference set encoded in the mapping information, for $v_i$. - $D(v_i, v_j)$ is the distance between $u(v_i)$ and $u(v_j)$, where $v_j \in V$. -Given these function definitions, any two nodes $v_i, v_j \in V$ will be incident with a bi-directed edge if and only if $m(v_i) \cap m(v_j) \ne \emptyset$ and $D(v_i,v_j) \le \theta$, where $\theta$ is a distance threshold and is often set as $\theta=1${cite}`Smith2017,Kaminow2021,Srivastava2019`. Additionally, the bi-directed edge might be replaced by a directed edge incident from $v_i$ to $v_j$ if $c(v_i) \ge 2c(v_j) -1$ or vice versa{cite}`Smith2017,Srivastava2019`. Though these edge definitions are among the most common, others are possible, so long as they are completely defined by the $u$, $c$, $m$, and $D$ functions. With $V$ and $E$ in hand, the UMI graph $G = (V,E)$ is now defined. +Given these function definitions, any two nodes $v_i, v_j \in V$ will be incident with a bi-directed edge if and only if $m(v_i) \cap m(v_j) \ne \emptyset$ and $D(v_i,v_j) \le \theta$, where $\theta$ is a distance threshold and is often set as $\theta=1${cite}`Smith2017,Kaminow2021,Srivastava2019`. +Additionally, the bi-directed edge might be replaced by a directed edge incident from $v_i$ to $v_j$ if $c(v_i) \ge 2c(v_j) -1$ or vice versa{cite}`Smith2017,Srivastava2019`. +Though these edge definitions are among the most common, others are possible, so long as they are completely defined by the $u$, $c$, $m$, and $D$ functions. With $V$ and $E$ in hand, the UMI graph $G = (V,E)$ is now defined. (raw-proc:umi-graph-resolution-def)= #### Defining the graph resolution approach -Given the defined UMI graph, many different resolution approaches may be applied. A resolution method may be as simple as finding the set of connected components, clustering the graph, greedily collapsing nodes or contracting edges{cite}`Smith2017`, or searching for a cover of the graph by structures following certain rules (e.g., monochromatic arboresences{cite}`Srivastava2019`) to reduce the graph. As a result, each node in the reduced UMI graph, or each element in the cover in the case that the graph is not modified dynamically, represents a pre-PCR molecule. The collapsed nodes or covering sets are regarded as the PCR duplicates of that molecule. +Given the defined UMI graph, many different resolution approaches may be applied. +A resolution method may be as simple as finding the set of connected components, clustering the graph, greedily collapsing nodes or contracting edges{cite}`Smith2017`, or searching for a cover of the graph by structures following certain rules (e.g., monochromatic arboresences{cite}`Srivastava2019`) to reduce the graph. +As a result, each node in the reduced UMI graph, or each element in the cover in the case that the graph is not modified dynamically, represents a pre-PCR molecule. +The collapsed nodes or covering sets are regarded as the PCR duplicates of that molecule. -Different rules for defining the adjacency relationship and different approaches for graph resolution itself can seek to preserve different properties and can define a wide variety of distinct overall UMI resolution approaches. Note that for the approaches that aim to probabilistically resolve the ambiguity caused by multimapping, the resolved UMI graph might contain multi-gene ECs, and their gene origin will be resolved in the following step. +Different rules for defining the adjacency relationship and different approaches for graph resolution itself can seek to preserve different properties and can define a wide variety of distinct overall UMI resolution approaches. +For approaches that probabilistically resolve ambiguity caused by multimapping, the resolved UMI graph may contain multi-gene equivalence classes (ECs), with their gene origins determined in the next step. (raw-proc:umi-graph-quantification)= #### Quantification -The last step in UMI resolution is quantifying the abundance of each gene using the resolved UMI graph. For approaches that discard multi-gene ECs, the molecule count vector for the genes in the current cell being processed (or count vector for short) is generated by counting the number of ECs labeled with each gene. On the other hand, approaches that process, rather than discard, multi-gene ECs usually resolve the ambiguity by applying some statistical inference procedure. For example, {cite:t}`Srivastava2019` introduce an expectation-maximization (EM) approach for probabilistically assigning multi-gene UMIs, and related EM algorithms have also been introduced as optional steps in subsequent tools{cite}`Melsted2021,Kaminow2021,raw:He2022`. In this model, the collapsed-EC-to-gene assignments are latent variables, and the deduplicated molecule count of genes are the main parameters. Intuitively, evidence from gene-unique ECs will be used to help probabilistically apportion the multi-gene ECs. The EM algorithm seeks the parameters that together have the (locally) highest likelihood of generating the observed ECs. +The last step in UMI resolution is quantifying the abundance of each gene using the resolved UMI graph. +For approaches that discard multi-gene ECs, the molecule count vector for the genes in the current cell being processed (or count vector for short) is generated by counting the number of ECs labeled with each gene. +On the other hand, approaches that process, rather than discard, multi-gene ECs usually resolve the ambiguity by applying some statistical inference procedure. +For example, {cite:t}`Srivastava2019` introduce an expectation-maximization (EM) approach for probabilistically assigning multi-gene UMIs, and related EM algorithms have also been introduced as optional steps in subsequent tools{cite}`Melsted2021,Kaminow2021,raw:He2022`. +In this model, the collapsed-EC-to-gene assignments are latent variables, and the deduplicated molecule count of genes are the main parameters. +Intuitively, evidence from gene-unique ECs will be used to help probabilistically apportion the multi-gene ECs. +The EM algorithm seeks the parameters that together have the (locally) highest likelihood of generating the observed ECs. -Usually, the UMI resolution and quantification process described above will be performed separately for each cell, represented by a corrected CB, to create a complete count matrix for all genes in all cells. However, the relative paucity of per-cell information in high-throughput single-cell samples limits the evidence available when performing UMI resolution, which in turn limits the potential efficacy of model-based solutions like the statistical inference procedure described above. -Thus, further research here is certainly warranted. For example, {cite:t}`Srivastava2020-lf` introduced an approach that allows sharing information among transcriptionally similar cells to improve the quantification result further. +Usually, the UMI resolution and quantification process described above will be performed separately for each cell, represented by a corrected CB, to create a complete count matrix for all genes in all cells. +However, the relative paucity of per-cell information in high-throughput single-cell samples limits the evidence available when performing UMI resolution, which in turn limits the potential efficacy of model-based solutions like the statistical inference procedure described above. +Thus, further research here is certainly warranted. +For example, {cite:t}`Srivastava2020-lf` introduced an approach that allows sharing information among transcriptionally similar cells to improve the quantification result further. (raw-proc:count-qc)= @@ -583,13 +638,18 @@ Thus, further research here is certainly warranted. For example, {cite:t}`Srivas Once a count matrix has been generated, it is important to perform a quality control (QC) assessment. There are several distinct assessments that generally fall under the rubric of quality control. -Basic global metrics are often recorded and reported to help assess the overall quality of the sequencing measurement itself. These metrics consist of quantities such as the total fraction of mapped reads, the distribution of distinct UMIs observed per cell, the distribution of UMI deduplication rates, the distribution of detected genes per cell, etc. These and similar metrics are often recorded by the quantification tools themselves{cite}`raw:Zheng2017,Kaminow2021,Melsted2021,raw:He2022` since they arise naturally and can be computed during the process of read mapping, cell barcode correction, and UMI resolution. Likewise, there exist several tools to help organize and visualize these basic metrics, such as the [Loupe browser](https://support.10xgenomics.com/single-cell-gene-expression/software/visualization/latest/what-is-loupe-cell-browser), [alevinQC](https://github.com/csoneson/alevinQC), or a [kb_python report](https://github.com/pachterlab/kb_python), depending upon the quantification pipeline being used. Beyond these basic global metrics, at this stage of analysis, QC metrics are designed primarily to help determine which cells (CBs) have been sequenced "successfully", and which exhibit artifacts that warrant filtering or correction. +Basic global metrics are often recorded and reported to help assess the overall quality of the sequencing measurement itself. +These metrics consist of quantities such as the total fraction of mapped reads, the distribution of distinct UMIs observed per cell, the distribution of UMI deduplication rates, the distribution of detected genes per cell, etc. +These and similar metrics are often recorded by the quantification tools themselves{cite}`raw:Zheng2017,Kaminow2021,Melsted2021,raw:He2022` since they arise naturally and can be computed during the process of read mapping, cell barcode correction, and UMI resolution. +Likewise, there exist several tools to help organize and visualize these basic metrics, such as the [Loupe browser](https://support.10xgenomics.com/single-cell-gene-expression/software/visualization/latest/what-is-loupe-cell-browser), [alevinQC](https://github.com/csoneson/alevinQC), or a [kb_python report](https://github.com/pachterlab/kb_python), depending upon the quantification pipeline being used. +Beyond these basic global metrics, at this stage of analysis, QC metrics are designed primarily to help determine which cells (CBs) have been sequenced "successfully", and which exhibit artifacts that warrant filtering or correction. In the following toggle section, we discuss an example alevinQC report taken from the `alevinQC` [manual webpage](https://github.com/csoneson/alevinQC). ```{toggle} -Once `alevin` or `alevin-fry` quantifies the single-cell data, the quality of the data can be assessed through the R package [`alevinQC`](https://github.com/csoneson/alevinQC). The alevinQC report can be generated in PDF format or as R/Shiny applications, which summarizes various components of the single-cell library, such as reads, CBs, and UMIs. +Once `alevin` or `alevin-fry` quantifies the single-cell data, the quality of the data can be assessed through the R package [`alevinQC`](https://github.com/csoneson/alevinQC). +The alevinQC report can be generated in PDF format or as R/Shiny applications, which summarizes various components of the single-cell library, such as reads, CBs, and UMIs. **1. Metadata and summary tables** @@ -599,74 +659,140 @@ Once `alevin` or `alevin-fry` quantifies the single-cell data, the quality of th An example of the summary section of an alevinQC report. ::: -The first section of an alevinQC report shows a summary of the input files and the processing result, among which, the top left table displays the metadata provided by `alevin` (or `alevin-fry`) for the quantification results. For example, this includes the time of the run, the version of the tool, and the path to the input FASTQ and index files. The top right summary table provides the summary statistics for various components of the single-cell library, for example, the number of sequencing reads, the number of selected cell barcodes at various levels of filtering, and the total number of deduplicated UMIs. +The first section of an alevinQC report shows a summary of the input files and the processing result, among which, the top left table displays the metadata provided by `alevin` (or `alevin-fry`) for the quantification results. +For example, this includes the time of the run, the version of the tool, and the path to the input FASTQ and index files. +The top right summary table provides the summary statistics for various components of the single-cell library, for example, the number of sequencing reads, the number of selected cell barcodes at various levels of filtering, and the total number of deduplicated UMIs. **2. Knee plot, initial whitelist determination** :::{figure-md} raw-proc-fig-alevinqc-plots AlevinQC Plots -The figure shows the plots in the alevinQC report of an example single-cell dataset, of which the cells are filtered using the "knee" finding method. Each dot represents a corrected cell barcode with its corrected profile. +The figure shows the plots in the alevinQC report of an example single-cell dataset, of which the cells are filtered using the "knee" finding method. +Each dot represents a corrected cell barcode with its corrected profile. ::: -The first(top left) view in {numref}`raw-proc-fig-alevinqc-plots` shows the distribution of cell barcode frequency in decreasing order. In all plots shown above, each point represents a corrected cell barcode, with its x-coordinate corresponding to its cell barcode frequency rank. In the top left plot, the y-coordinate corresponds to the observed frequency of the corrected barcode. Generally, this plot shows a "knee"-like pattern, which can be used to identify the initial list of high-quality barcodes. The red dots in the plot represent the cell barcodes selected as the high-quality cell barcodes in the case that "knee"-based filtering was applied. In other words, these cell barcodes contain a sufficient number of reads to be deemed high-quality and likely derived from truly present cells. Suppose an external permit list is passed in the CB correction step, which implies no internal algorithm was used to distinguish high-quality cell barcodes. In that case, all dots in the plot will be colored red, as all these corrected cell barcodes are processed throughout the raw data processing pipeline and reported in the gene count matrix. One should be skeptical of the data quality if the frequency is consistently low across all cell barcodes. +The first(top left) view in {numref}`raw-proc-fig-alevinqc-plots` shows the distribution of cell barcode frequency in decreasing order. +In all plots shown above, each point represents a corrected cell barcode, with its x-coordinate corresponding to its cell barcode frequency rank. +In the top left plot, the y-coordinate corresponds to the observed frequency of the corrected barcode. +Generally, this plot shows a "knee"-like pattern, which can be used to identify the initial list of high-quality barcodes. +The red dots in the plot represent the cell barcodes selected as the high-quality cell barcodes in the case that "knee"-based filtering was applied. +In other words, these cell barcodes contain a sufficient number of reads to be deemed high-quality and likely derived from truly present cells. +Suppose an external permit list is passed in the CB correction step, which implies no internal algorithm was used to distinguish high-quality cell barcodes. +In that case, all dots in the plot will be colored red, as all these corrected cell barcodes are processed throughout the raw data processing pipeline and reported in the gene count matrix. +One should be skeptical of the data quality if the frequency is consistently low across all cell barcodes. **3. Barcode collapsing** -After identification of the barcodes that will be processed, either through an internal threshold (e.g., from the "knee"-based method) or through external whitelisting, `alevin` (or `alevin-fry`) performs cell barcode sequence correction. The barcode collapsing plot, the upper middle plot in the {numref}`raw-proc-fig-alevinqc-plots`, shows the number of reads assigned to a cell barcode after sequence correction of the cell barcodes versus prior to correction. Generally, we would see that all points fall close to the line representing $x = y$, which means that the reassignments in CB correction usually do not drastically change the profile of the cell barcodes. +After identification of the barcodes that will be processed, either through an internal threshold (e.g., from the "knee"-based method) or through external whitelisting, `alevin` (or `alevin-fry`) performs cell barcode sequence correction. +The barcode collapsing plot, the upper middle plot in the {numref}`raw-proc-fig-alevinqc-plots`, shows the number of reads assigned to a cell barcode after sequence correction of the cell barcodes versus prior to correction. +Generally, we would see that all points fall close to the line representing $x = y$, which means that the reassignments in CB correction usually do not drastically change the profile of the cell barcodes. **4. Knee Plot, number of genes per cell** -The upper right plot in {numref}`raw-proc-fig-alevinqc-plots` shows the distribution of the number of observed genes of all processed cell barcodes. Generally, a mean of $2,000$ genes per cell is considered modest but reasonable for the downstream analyses. One should double-check the quality of the data if all cells have a low number of observed genes. +The upper right plot in {numref}`raw-proc-fig-alevinqc-plots` shows the distribution of the number of observed genes of all processed cell barcodes. +Generally, a mean of $2,000$ genes per cell is considered modest but reasonable for the downstream analyses. +One should double-check the quality of the data if all cells have a low number of observed genes. **5. Quantification summary** -Finally, a series of quantification summary plots, the bottom plots in {numref}`raw-proc-fig-alevinqc-plots`, compare the cell barcode frequency, the total number of UMIs after deduplication and the total number of non-zero genes using scatter plots. In general, in each plot, the plotted data should demonstrate a positive correlation, and, if high-quality filtering (e.g., knee filtering) has been performed, the high-quality cell barcodes should be well separated from the rest. Moreover, one should expect all three plots to convey similar trends. If using an external permit list, all the dots in the plots will be colored red, as all these cell barcodes are processed and reported in the gene count matrix. Still, we should see the correlation between the plots and the separation of the dots representing high-quality cells to others. If all of these metrics are consistently low across cells or if these plots convey substantially different trends, then one should be concerned about the data quality. +Finally, a series of quantification summary plots, the bottom plots in {numref}`raw-proc-fig-alevinqc-plots`, compare the cell barcode frequency, the total number of UMIs after deduplication and the total number of non-zero genes using scatter plots. +In general, in each plot, the plotted data should demonstrate a positive correlation, and, if high-quality filtering (e.g., knee filtering) has been performed, the high-quality cell barcodes should be well separated from the rest. +Moreover, one should expect all three plots to convey similar trends. +If using an external permit list, all the dots in the plots will be colored red, as all these cell barcodes are processed and reported in the gene count matrix. +Still, we should see the correlation between the plots and the separation of the dots representing high-quality cells to others. +If all of these metrics are consistently low across cells or if these plots convey substantially different trends, then one should be concerned about the data quality. ``` ### Empty droplet detection -One of the first QC steps is determining which cell barcodes correspond to "high-confidence" sequenced cells. It is common in droplet-based protocols{cite}`raw:Macosko2015` that certain barcodes are associated with ambient {term}`RNA` instead of the {term}`RNA` of a captured cell. This happens when droplets fail to capture a cell. These empty droplets still tend to produce sequenced reads, though the characteristics of these reads look markedly different from those associated with barcodes corresponding to properly captured cells. Many approaches exist to assess whether a barcode likely corresponds to an empty droplet or not. One simple method is to examine the cumulative frequency plot of the barcodes, in which barcodes are sorted in descending order of the number of distinct UMIs with which they are associated. This plot often contains a "knee" that can be identified as a likely point of discrimination between properly captured cells and empty droplets{cite}`Smith2017,raw:He2022`. -While this "knee" method is intuitive and can often estimate a reasonable threshold, it has several drawbacks. For example, not all cumulative histograms display an obvious knee, and it is notoriously difficult to design algorithms that can robustly and automatically detect such knees. Finally, the total UMI count associated with a barcode may not, alone, be the best signal to determine if the barcode was associated with an empty or damaged cell. - -This led to the development of several tools specifically designed to detect empty or damaged droplets, or cells generally deemed to be of "low quality" {cite}`Lun2019,Heiser2021,Hippen2021,Muskovic2021,Alvarez2020,raw:Young2020`. These tools incorporate a variety of different measures of cell quality, including the frequencies of distinct UMIs, the number of detected genes, and the fraction of mitochondrial {term}`RNA`, and typically work by applying a statistical model to these features to classify high-quality cells from putative empty droplets or damaged cells. This means that cells can typically be scored, and a final filtering can be selected based on an estimated posterior probability that cells are not empty or compromised. While these models generally work well for single-cell {term}`RNA`-seq data, one may have to apply several additional filters or heuristics to obtain robust filtering in single-nucleus {term}`RNA`-seq data{cite}`Kaminow2021,raw:He2022`, like those exposed in the [`emptyDropsCellRanger`](https://github.com/MarioniLab/DropletUtils/blob/master/R/emptyDropsCellRanger.R) function of `DropletUtils`{cite}`Lun2019`. +One of the first QC steps is determining which cell barcodes correspond to "high-confidence" sequenced cells. +It is common in droplet-based protocols{cite}`raw:Macosko2015` that certain barcodes are associated with ambient {term}`RNA` instead of the RNA of a captured cell. +This happens when droplets fail to capture a cell. +These empty droplets still tend to produce sequenced reads, although the characteristics of these reads look markedly different from those associated with barcodes corresponding to properly captured cells. +Many approaches exist to assess whether a barcode likely corresponds to an empty droplet or not. +One simple method is to examine the cumulative frequency plot of the barcodes, in which barcodes are sorted in descending order of the number of distinct UMIs with which they are associated. +This plot often contains a "knee" that can be identified as a likely point of discrimination between properly captured cells and empty droplets{cite}`Smith2017,raw:He2022`. +While this "knee" method is intuitive and can often estimate a reasonable threshold, it has several drawbacks. +For example, not all cumulative histograms display an obvious knee, and it is notoriously difficult to design algorithms that can robustly and automatically detect such knees. +Finally, the total UMI count associated with a barcode may not, alone, be the best signal to determine if the barcode was associated with an empty or damaged cell. + +This led to the development of several tools specifically designed to detect empty or damaged droplets, or cells generally deemed to be of "low quality" {cite}`Lun2019,Heiser2021,Hippen2021,Muskovic2021,Alvarez2020,raw:Young2020`. +These tools incorporate a variety of different measures of cell quality, including the frequencies of distinct UMIs, the number of detected genes, and the fraction of mitochondrial RNA, and typically work by applying a statistical model to these features to classify high-quality cells from putative empty droplets or damaged cells. +This means that cells can typically be scored, and a final filtering can be selected based on an estimated posterior probability that cells are not empty or compromised. +While these models generally work well for single-cell {term}`RNA`-seq data, one may have to apply several additional filters or heuristics to obtain robust filtering in single-nucleus {term}`RNA`-seq data{cite}`Kaminow2021,raw:He2022`, like those exposed in the [`emptyDropsCellRanger`](https://github.com/MarioniLab/DropletUtils/blob/master/R/emptyDropsCellRanger.R) function of `DropletUtils`{cite}`Lun2019`. ### Doublet detection -In addition to determining which cell barcodes correspond to empty droplets or damaged cells, one may also wish to identify those cell barcodes that correspond to doublets or multiplets. When a given droplet captures two (doublets) or more (multiplets) cells, this can result in a skewed distribution for these cell barcodes in terms of quantities like the number of reads and UMIs they represent, as well as gene expression profiles they display. Many tools have also been developed to predict the doublet status of cell barcodes{cite}`DePasquale2019,McGinnis2019,Wolock2019,Bais2019,Bernstein2020`. Once detected, cells determined to likely be doublets and multiplets can be removed or otherwise adjusted for in the subsequent analysis. +In addition to determining which cell barcodes correspond to empty droplets or damaged cells, one may also wish to identify those cell barcodes that correspond to doublets or multiplets. +When a given droplet captures two (doublets) or more (multiplets) cells, this can result in a skewed distribution for these cell barcodes in terms of quantities like the number of reads and UMIs they represent, as well as gene expression profiles they display. +Many tools have also been developed to predict the doublet status of cell barcodes{cite}`DePasquale2019,McGinnis2019,Wolock2019,Bais2019,Bernstein2020`. +Once detected, cells determined to likely be doublets and multiplets can be removed or otherwise adjusted for in the subsequent analysis. (raw-proc:output-representation)= ## Count data representation -As one completes initial raw data processing and quality control and moves on to subsequent analyses, it is important to acknowledge and remember that the cell-by-gene count matrix is, at best, an approximation of the sequenced molecules in the original sample. At several stages of the raw data processing pipeline, heuristics are applied, and simplifications are made to enable the generation of this count matrix. For example, read mapping is imperfect, as is cell barcode correction. The accurate resolution of UMIs is particularly challenging, and issues related to UMIs attached to multimapping reads are often ignored, as is the fact that multiple priming sites, particularly among unspliced molecules, can violate the one molecule-to-one UMI relationship that is often assumed. - -In light of these challenges and the simplifications adopted to address them, there remains active research as to how best to represent the preprocessed data to downstream tools. For example, several quantification tools{cite}`Srivastava2019,Melsted2021,Kaminow2021,raw:He2022` implement an _optional_ EM algorithm, initially introduced in this context by {cite:t}`Srivastava2019`, that probabilistically apportions UMIs associated with reads that map to more than one gene. This, however, can result in non-integer count matrices that may be unexpected by certain downstream tools. Alternatively, one may choose to resolve UMIs to _gene groups_ instead of individual genes, retaining the multimapping information in the preprocessed output (it is worth noting that a similar representation, but at the transcript level, has been used for over a decade as a succinct internal representation in bulk RNA-seq transcript quantification tools{cite}`Turro2011,Nicolae2011,Patro2014,Bray2016,Patro2017,Ju2017`, and such a transcript-level representation has even been proposed for use in the clustering and dimensionality reduction of full-length single-cell RNA-seq data{cite}`Ntranos2016` and differential expression analysis of single-cell RNA-seq data{cite}`Ntranos2019`). In this case, instead of the resulting count matrix having dimensions $C \times G$, where $G$ is the number of genes in the quantified annotation, it will have dimension $C \times E$, where $E$ is the number of distinct _gene groups_ (commonly called equivalence class labels) across all cells in the given sample. By propagating this information to the output count matrix, one can avoid the application of heuristic or probabilistic UMI resolution methods that can result in loss of data, or bias, in the counts used in downstream analyses. Of course, to make use of this information, downstream analysis methods must expect the information in this format. Further, those downstream methods must typically have a way to resolve these counts, eventually, to the level of genes, as the abundance of _gene groups_ lacks the intuitive biological interpretability of gene abundance. Nonetheless, the benefits provided by such representations, in terms of conveying more complete and accurate information to downstream analysis tools, can be substantial, and tools taking advantage of such representations are being developed (e.g. [DifferentialRegulation](https://github.com/SimoneTiberi/DifferentialRegulation)); this is still an active area of research. +As one completes initial raw data processing and quality control and moves on to subsequent analyses, it is important to acknowledge and remember that the cell-by-gene count matrix is, at best, an approximation of the sequenced molecules in the original sample. +At several stages of the raw data processing pipeline, heuristics are applied, and simplifications are made to enable the generation of this count matrix. +For example, read mapping is imperfect, as is cell barcode correction. +Accurately resolving UMIs is particularly challenging, and issues related to UMIs attached to multimapping reads are often overlooked. +Additionally, multiple priming sites, especially in unspliced molecules, can violate the commonly assumed one molecule-to-one UMI relationship. + +In light of these challenges and the simplifications adopted to address them, there remains active research as to how best to represent the preprocessed data to downstream tools. +For example, several quantification tools{cite}`Srivastava2019,Melsted2021,Kaminow2021,raw:He2022` implement an _optional_ EM algorithm, initially introduced in this context by {cite:t}`Srivastava2019`, that probabilistically apportions UMIs associated with reads that map to more than one gene. +This, however, can result in non-integer count matrices that may be unexpected by certain downstream tools. +Alternatively, UMIs can be resolved to _gene groups_ instead of individual genes, preserving multimapping information in the preprocessed output. +Notably, a similar approach has been used at the transcript level for over a decade as a succinct internal representation in bulk RNA-seq transcript quantification tools {cite}`Turro2011,Nicolae2011,Patro2014,Bray2016,Patro2017,Ju2017`. +Additionally, transcript-level representations have been proposed for clustering and dimensionality reduction in full-length single-cell RNA-seq data {cite}`Ntranos2016` and for differential expression analysis in single-cell RNA-seq {cite}`Ntranos2019`. +In this case, instead of the resulting count matrix having dimensions $C \times G$, where $G$ is the number of genes in the quantified annotation, it will have dimension $C \times E$, where $E$ is the number of distinct _gene groups_ (commonly called equivalence class labels) across all cells in the given sample. +By propagating this information to the output count matrix, one can avoid the application of heuristic or probabilistic UMI resolution methods that can result in loss of data, or bias, in the counts used in downstream analyses. +Of course, to make use of this information, downstream analysis methods must expect the information in this format. +Further, those downstream methods must typically have a way to resolve these counts, eventually, to the level of genes, as the abundance of _gene groups_ lacks the intuitive biological interpretability of gene abundance. +Nonetheless, the benefits provided by such representations, in terms of conveying more complete and accurate information to downstream analysis tools, can be substantial, and tools taking advantage of such representations are being developed (e.g. [DifferentialRegulation](https://github.com/SimoneTiberi/DifferentialRegulation)); this is still an active area of research. ## Brief discussion -To close this chapter, we convey some observations and suggestions that have arisen from recent benchmarking and review studies surrounding some of the common preprocessing tools described above {cite}`You_2021,Bruning_2022`. It is, of course, important to note that the development of methods and tools for single-cell and single-nucleus RNA-seq raw data processing, as well as the continual evaluation of such methods, is an ongoing community effort. It is therefore often useful and reasonable, when performing your own analyses, to experiment with several different tools. +To close this chapter, we convey some observations and suggestions that have arisen from recent benchmarking and review studies surrounding some of the common preprocessing tools described above {cite}`You_2021,Bruning_2022`. +It is, of course, important to note that the development of methods and tools for single-cell and single-nucleus RNA-seq raw data processing, as well as the continual evaluation of such methods, is an ongoing community effort. +It is therefore often useful and reasonable, when performing your own analyses, to experiment with several different tools. -At the coarsest level, the most common tools can process data robustly and accurately. It has been suggested that with many common downstream analyses like clustering, and the methods used to perform them, the choice of preprocessing tool typically makes less difference than other steps in the analysis process {cite}`You_2021`. Nonetheless, it has also been observed that applying lightweight mapping restricted to the spliced transcriptome can increase the probability of spurious mapping and gene expression {cite}`Bruning_2022`. +At the coarsest level, the most common tools can process data robustly and accurately. +It has been suggested that with many common downstream analyses like clustering, and the methods used to perform them, the choice of preprocessing tool typically makes less difference than other steps in the analysis process {cite}`You_2021` +Nonetheless, it has also been observed that applying lightweight mapping restricted to the spliced transcriptome can increase the probability of spurious mapping and gene expression {cite}`Bruning_2022`. -Ultimately, the choice of a specific tool largely depends on the task at hand, and the constraints on available computational resources. If performing a standard single-cell analysis, lightweight mapping-based methods are a good choice since they are faster (often considerably so) and more memory frugal than existing alignment-based tools. If performing single-nucleus RNA-seq analysis, `alevin-fry` is an attractive option in particular, as it remains memory frugal and its index remains relatively small even as the transcriptome reference is expanded to include unspliced reference sequence. On the other hand, alignment-based methods are recommended if it is important to recover reads that map outside of the (extended) transcriptome or if the genomic mapping sites of reads are necessary for use in the relevant downstream tools or analyses (e.g., such as for differential transcript usage analysis with a tool like `sierra` {cite}`sierra`). Among the alignment-based pipelines, according to {cite:t}`Bruning_2022`, `STARsolo` should be favored over `Cell Ranger` because the former is much faster than the latter, and requires less memory, while it is also capable of producing almost identical results. +Ultimately, the choice of a specific tool largely depends on the task at hand, and the constraints on available computational resources. +If performing a standard single-cell analysis, lightweight mapping-based methods are a good choice since they are faster (often considerably so) and more memory frugal than existing alignment-based tools. +If performing single-nucleus RNA-seq analysis, `alevin-fry` is an attractive option in particular, as it remains memory frugal and its index remains relatively small even as the transcriptome reference is expanded to include unspliced reference sequence. +On the other hand, alignment-based methods are recommended when recovering reads that map outside the (extended) transcriptome is important or when genomic mapping sites are required for downstream analyses. +This is particularly relevant for tasks such as differential transcript usage analysis using tools like `sierra` {cite}`sierra`. +Among the alignment-based pipelines, according to {cite:t}`Bruning_2022`, `STARsolo` should be favored over `Cell Ranger` because the former is much faster than the latter, and requires less memory, while it is also capable of producing almost identical results. (raw-proc:example-workflow)= ## A real-world example -Given that we have covered the concepts underlying various approaches for raw data processing, we now turn our attention to demonstrating how a specific tool (in this case, `alevin-fry`) can be used to process a small example dataset. To start, we need the sequenced reads from a single-cell experiment in [FASTQ format](https://en.wikipedia.org/wiki/FASTQ_format) and the reference (e.g., transcriptome) against which the reads will be mapped. Usually, a reference includes the genome sequences and the corresponding gene annotations of the sequenced species in the [FASTA](https://en.wikipedia.org/wiki/FASTA_format) and [GTF](https://useast.ensembl.org/info/website/upload/gff.html) format, respectively. +Given that we have covered the concepts underlying various approaches for raw data processing, we now turn our attention to demonstrating how a specific tool (in this case, `alevin-fry`) can be used to process a small example dataset. +To start, we need the sequenced reads from a single-cell experiment in [FASTQ format](https://en.wikipedia.org/wiki/FASTQ_format) and the reference (e.g., transcriptome) against which the reads will be mapped. +Usually, a reference includes the genome sequences and the corresponding gene annotations of the sequenced species in the [FASTA](https://en.wikipedia.org/wiki/FASTA_format) and [GTF](https://useast.ensembl.org/info/website/upload/gff.html) format, respectively. -In this example, we will use _chromosome 5_ of the human genome and its related gene annotations as the reference, which is a subset of the Human reference, [GRCh38 (GENCODE v32/Ensembl 98) reference](https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build#GRCh38_2020A) from the 10x Genomics reference build. Correspondingly, we extract the subset of reads that map to the generated reference from a [human brain tumor dataset](https://www.10xgenomics.com/resources/datasets/200-sorted-cells-from-human-glioblastoma-multiforme-3-lt-v-3-1-3-1-low-6-0-0) from 10x Genomics. +In this example, we will use _chromosome 5_ of the human genome and its related gene annotations as the reference, which is a subset of the Human reference, [GRCh38 (GENCODE v32/Ensembl 98) reference](https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build#GRCh38_2020A) from the 10x Genomics reference build. +Correspondingly, we extract the subset of reads that map to the generated reference from a [human brain tumor dataset](https://www.10xgenomics.com/resources/datasets/200-sorted-cells-from-human-glioblastoma-multiforme-3-lt-v-3-1-3-1-low-6-0-0) from 10x Genomics. -[`Alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/){cite}`raw:He2022` is a fast, accurate, and memory-frugal single-cell and single-nucleus data processing tool. [Simpleaf](https://github.com/COMBINE-lab/simpleaf) is a program, written in [rust](https://www.rust-lang.org/), that exposes a unified and simplified interface for processing some of the most common protocols and data types using the `alevin-fry` pipeline. A nextflow-based [workflow](https://github.com/COMBINE-lab/quantaf) tool also exists to process extensive collections of single-cell data. Here we will first show how to process single-cell raw data using two `simpleaf` commands. Then, we describe the complete set of `salmon alevin` and `alevin-fry` commands to which these `simpleaf` commands correspond, to outline where the steps described in this section occur and to convey the possible different processing options. These commands will be run from the command line, and [`conda`](https://docs.conda.io/en/latest/) will be used for installing all of the software required for running this example. +[`Alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/){cite}`raw:He2022` is a fast, accurate, and memory-frugal single-cell and single-nucleus data processing tool. +[Simpleaf](https://github.com/COMBINE-lab/simpleaf) is a program, written in [rust](https://www.rust-lang.org/), that exposes a unified and simplified interface for processing some of the most common protocols and data types using the `alevin-fry` pipeline. +A nextflow-based [workflow](https://github.com/COMBINE-lab/quantaf) tool also exists to process extensive collections of single-cell data. +Here we will first show how to process single-cell raw data using two `simpleaf` commands. Then, we describe the complete set of `salmon alevin` and `alevin-fry` commands to which these `simpleaf` commands correspond, to outline where the steps described in this section occur and to convey the possible different processing options. +These commands will be run from the command line, and [`conda`](https://docs.conda.io/en/latest/) will be used for installing all of the software required for running this example. (raw-proc:example-prep)= ### Preparation -Before we start, we create a conda environment in the terminal and install the required package. `Simpleaf` depends on [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), [`salmon`](https://salmon.readthedocs.io/en/latest/) and [`pyroe`](https://github.com/COMBINE-lab/pyroe). They are all available on `bioconda` and will be automatically installed when installing `simpleaf`. +Before we start, we create a conda environment in the terminal and install the required package. +`Simpleaf` depends on [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/), [`salmon`](https://salmon.readthedocs.io/en/latest/) and [`pyroe`](https://github.com/COMBINE-lab/pyroe). +They are all available on `bioconda` and will be automatically installed when installing `simpleaf`. ```bash conda create -n af -y -c bioconda simpleaf @@ -675,7 +801,8 @@ conda activate af ````{admonition} Note on using an Apple silicon-based device -Conda does not currently build most packages natively for Apple silicon. Therefore, if you +Conda does not currently build most packages natively for Apple silicon. +Therefore, if you are using a non-Intel-based Apple computer (e.g., with an M1(Pro/Max/Ultra) or M2 chip), you should make sure to specify that your environment uses the Rosetta2 translation layer. To do this, you can replace the above commands with the following (instructions adopted @@ -773,7 +900,9 @@ simpleaf quant \ -o simpleaf_quant ``` -After running these commands, the resulting quantification information can be found in the `simpleaf_quant/af_quant/alevin` folder. Within this directory, there are three files: `quants_mat.mtx`, `quants_mat_cols.txt`, and `quants_mat_rows.txt`, which correspond, respectively, to the count matrix, the gene names for each column of this matrix, and the corrected, filtered cell barcodes for each row of this matrix. The tail lines of these files are shown below. Of note here is the fact that `alevin-fry` was run in the USA-mode (unspliced, spliced, and ambiguous mode), and so quantification was performed for both the spliced and unspliced status of each gene — the resulting `quants_mat_cols.txt` file will then have a number of rows equal to 3 times the number of annotated genes which correspond, to the names used for the spliced (S), unspliced (U), and splicing-ambiguous variants (A) of each gene. +After running these commands, the resulting quantification information can be found in the `simpleaf_quant/af_quant/alevin` folder. +Within this directory, there are three files: `quants_mat.mtx`, `quants_mat_cols.txt`, and `quants_mat_rows.txt`, which correspond, respectively, to the count matrix, the gene names for each column of this matrix, and the corrected, filtered cell barcodes for each row of this matrix. The tail lines of these files are shown below. +Of note here is the fact that `alevin-fry` was run in the USA-mode (unspliced, spliced, and ambiguous mode), and so quantification was performed for both the spliced and unspliced status of each gene — the resulting `quants_mat_cols.txt` file will then have a number of rows equal to 3 times the number of annotated genes which correspond, to the names used for the spliced (S), unspliced (U), and splicing-ambiguous variants (A) of each gene. ```bash # Each line in `quants_mat.mtx` represents @@ -798,7 +927,8 @@ TGCTCGTGTTCGAAGG ACTGTGAAGAAATTGC ``` -We can load the count matrix into Python as an [`AnnData`](https://anndata.readthedocs.io/en/latest/) object using the `load_fry` function from [`pyroe`](https://github.com/COMBINE-lab/pyroe). A similar function, [loadFry](https://rdrr.io/github/mikelove/fishpond/man/loadFry.html), has been implemented in the [`fishpond`](https://github.com/mikelove/fishpond) R package. +We can load the count matrix into Python as an [`AnnData`](https://anndata.readthedocs.io/en/latest/) object using the `load_fry` function from [`pyroe`](https://github.com/COMBINE-lab/pyroe). +A similar function, [loadFry](https://rdrr.io/github/mikelove/fishpond/man/loadFry.html), has been implemented in the [`fishpond`](https://github.com/mikelove/fishpond) R package. ```python import pyroe @@ -807,7 +937,9 @@ quant_dir = 'simpleaf_quant/af_quant' adata_sa = pyroe.load_fry(quant_dir) ``` -The default behavior loads the `X` layer of the `Anndata` object as the sum of the spliced and ambiguous counts for each gene. However, recent work{cite}`Pool2022` and [updated practices](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/release-notes) suggest that the inclusion of intronic counts, even in single-cell RNA-seq data, may increase sensitivity and benefit downstream analyses. While the best way to make use of this information is the subject of ongoing research, since `alevin-fry` automatically quantifies spliced, unspliced, and ambiguous reads in each sample, the count matrix containing the total counts for each gene can be simply obtained as follows: +The default behavior loads the `X` layer of the `Anndata` object as the sum of the spliced and ambiguous counts for each gene. +However, recent work{cite}`Pool2022` and [updated practices](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/release-notes) suggest that the inclusion of intronic counts, even in single-cell RNA-seq data, may increase sensitivity and benefit downstream analyses. +While the best way to make use of this information is the subject of ongoing research, since `alevin-fry` automatically quantifies spliced, unspliced, and ambiguous reads in each sample, the count matrix containing the total counts for each gene can be simply obtained as follows: ```python import pyroe @@ -820,13 +952,17 @@ adata_usa = pyroe.load_fry(quant_dir, output_format={'X' : ['U','S','A']}) ### The complete alevin-fry pipeline -`Simpleaf` makes it possible to process single-cell raw data in the "standard" way with a few commands. Next, we will show how to generate the identical quantification result by explicitly calling the `pyroe`, `salmon`, and `alevin-fry` commands. On top of the pedagogical value, knowing the exact command of each step will be helpful if only a part of the pipeline needs to be rerun or if some parameters not currently exposed by `simpleaf` need to be specified. +`Simpleaf` makes it possible to process single-cell raw data in the "standard" way with a few commands. +Next, we will show how to generate the identical quantification result by explicitly calling the `pyroe`, `salmon`, and `alevin-fry` commands. +On top of the pedagogical value, knowing the exact command of each step will be helpful if only a part of the pipeline needs to be rerun or if some parameters not currently exposed by `simpleaf` need to be specified. -Please note that the commands in the {ref}`raw-proc:example-prep` section should be executed in advance. All the tools called in the following commands, `pyroe`, `salmon`, and `alevin-fry`, have already been installed when installing `simpleaf`. +Please note that the commands in the {ref}`raw-proc:example-prep` section should be executed in advance. +All the tools called in the following commands, `pyroe`, `salmon`, and `alevin-fry`, have already been installed when installing `simpleaf`. #### Building the index -First, we process the genome FASTA file and gene annotation GTF file to obtain the _splici_ index. The commands in the following code chunk are analogous to the `simpleaf index` command discussed above. This includes two steps: +First, we process the genome FASTA file and gene annotation GTF file to obtain the _splici_ index. +The commands in the following code chunk are analogous to the `simpleaf index` command discussed above. This includes two steps: 1. Building the _splici_ reference (spliced transcripts + introns) by calling `pyroe make-splici`, using the genome and gene annotation file 2. Indexing the _splici_ reference by calling `salmon index` @@ -915,11 +1051,16 @@ alevin-fry quant -r cr-like \ -t 8 ``` -After running these commands, the resulting quantification information can be found in `alevin_fry_quant/alevin`. Other relevant information concerning the mapping, CB correction, and UMI resolution steps can be found in the `salmon_alevin`, `alevin_fry_gpl`, and `alevin_fry_quant` folders, respectively. +After running these commands, the resulting quantification information can be found in `alevin_fry_quant/alevin`. +Other relevant information concerning the mapping, CB correction, and UMI resolution steps can be found in the `salmon_alevin`, `alevin_fry_gpl`, and `alevin_fry_quant` folders, respectively. -In the example given here, we demonstrate using `simpleaf` and `alevin-fry` to process a 10x Chromium 3' v3 dataset. `Alevin-fry` and `simpleaf` provide many other options for processing different single-cell protocols, including but not limited to Dropseq{cite}`raw:Macosko2015`, sci-RNA-seq3{cite}`raw:Cao2019` and other 10x Chromium platforms. A more comprehensive list and description of available options for different stages of processing can be found in the [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/) and [`simpleaf`](https://github.com/COMBINE-lab/simpleaf) documentation. `alevin-fry` also provides a [nextflow](https://www.nextflow.io/docs/latest/)-based workflow, called [quantaf](https://github.com/COMBINE-lab/quantaf), for conveniently processing many samples from a simply-defined sample sheet. +In the example given here, we demonstrate using `simpleaf` and `alevin-fry` to process a 10x Chromium 3' v3 dataset. +`Alevin-fry` and `simpleaf` provide many other options for processing different single-cell protocols, including but not limited to Dropseq{cite}`raw:Macosko2015`, sci-RNA-seq3{cite}`raw:Cao2019` and other 10x Chromium platforms. +A more comprehensive list and description of available options for different stages of processing can be found in the [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/) and [`simpleaf`](https://github.com/COMBINE-lab/simpleaf) documentation. +`alevin-fry` also provides a [nextflow](https://www.nextflow.io/docs/latest/)-based workflow, called [quantaf](https://github.com/COMBINE-lab/quantaf), for conveniently processing many samples from a simply-defined sample sheet. -Of course, similar resources exist for many of the other raw data processing tools referenced and described throughout this section, including [`zUMIs`](https://github.com/sdparekh/zUMIs/wiki){cite}`zumis`, [`alevin`](https://salmon.readthedocs.io/en/latest/alevin.html){cite}`Srivastava2019`, [`kallisto|bustools`](https://www.kallistobus.tools/){cite}`Melsted2021`, [`STARsolo`](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md){cite}`Kaminow2021` and [`CellRanger`](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger). The [`scrnaseq`](https://nf-co.re/scrnaseq) pipeline from [`nf-core`](https://nf-co.re/) also provides a nextflow-based pipeline for processing single-cell RNA-seq data generated using a range of different chemistries and integrates several of the tools described in this section. +Of course, similar resources exist for many of the other raw data processing tools referenced and described throughout this section, including [`zUMIs`](https://github.com/sdparekh/zUMIs/wiki){cite}`zumis`, [`alevin`](https://salmon.readthedocs.io/en/latest/alevin.html){cite}`Srivastava2019`, [`kallisto|bustools`](https://www.kallistobus.tools/){cite}`Melsted2021`, [`STARsolo`](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md){cite}`Kaminow2021` and [`CellRanger`](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger). +The [`scrnaseq`](https://nf-co.re/scrnaseq) pipeline from [`nf-core`](https://nf-co.re/) also provides a nextflow-based pipeline for processing single-cell RNA-seq data generated using a range of different chemistries and integrates several of the tools described in this section. (raw-proc:useful-links)= @@ -956,6 +1097,7 @@ We gratefully acknowledge the contributions of: - Avi Srivastava - Hirak Sarkar - Rob Patro +- Seo H. Kim ### Reviewers From 6055e0b11edfeee8ef3f75e70bf81cdc040e407a Mon Sep 17 00:00:00 2001 From: Luis Date: Sat, 1 Feb 2025 15:17:09 +0100 Subject: [PATCH 06/29] reviewing until 3.2.2.1. Mapping to the full genome (exclusive) --- .../introduction/raw_data_processing.md | 70 +++++++++---------- 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 67642221..cd42ffa7 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -25,12 +25,12 @@ In this section, we focus on key steps of raw data processing: We also discuss the challenges and trade-offs involved in each step. -```{admonition} A note on Preceding Steps +```{admonition} A note on preceding steps The starting point for raw data processing is somewhat arbitrary. For this discussion, we treat lane-demultiplexed FASTQ files as the _raw_ input. However, these files are derived from earlier steps, such as base calling and base quality estimation, which can influence downstream processing. For example, base-calling errors and index hopping {cite}`farouni2020model` can introduce inaccuracies in FASTQ data. -These issues can be mitigated with computational approaches {cite}farouni2020model or experimental enhancements like [dual indexing](https://www.10xgenomics.com/blog/sequence-with-confidence-understand-index-hopping-and-how-to-resolve-it). +These issues can be mitigated with computational approaches {cite}`farouni2020model` or experimental enhancements like [dual indexing](https://www.10xgenomics.com/blog/sequence-with-confidence-understand-index-hopping-and-how-to-resolve-it). Here, we do not delve into the upstream processes, but consider the FASTQ files, derived from, e.g., BCL files via [appropriate tools](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/bcl2fastq-direct), as the raw input under consideration. ``` @@ -38,13 +38,13 @@ Here, we do not delve into the upstream processes, but consider the FASTQ files, ## Raw data quality control After obtaining raw FASTQ files, it is important to evaluate the quality of the sequencing reads. -A quick and effective way to perform this is by using quality control (QC) tools like [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). +A quick and effective way to perform this is by using quality control (QC) tools like `FastQC`. `FastQC` generates a detailed report for each FASTQ file, summarizing key metrics such as quality scores, base content, and other statistics that help identify potential issues arising from library preparation or sequencing. While many modern single-cell data processing tools include some built-in quality checks—such as evaluating the N content of sequences or the fraction of mapped reads—it is still good practice to run an independent QC check. This provides additional metrics that are often useful for identifying broader quality issues. -For readers interested in what a typical `FastQC` report looks like, in the following toggle content, example reports for both [high-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [low-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the [`FastQC` manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) are used to demonstrate the modules in the `FastQC` report. +For readers interested in what a typical `FastQC` report looks like, in the following toggle content, example reports for both [high-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [low-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the `FastQC`[manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) are used to demonstrate the modules in the `FastQC` report. Although these tutorials are not explicitly made for single-cell data, many of the results are still relevant for single-cell data, with a few caveats described below. In the toggle section, all graphs, except specifically mentioned, are taken from the example reports on the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). @@ -71,7 +71,7 @@ Therefore, each module should be carefully reviewed before drawing conclusions a The summary panel of a bad example. ::: -**1. Basic Statistics** +**1. Basic statistics** The basic statistics module provides an overview of key information and statistics for the input FASTQ file, including the filename, total number of sequences, number of poor-quality sequences, sequence length, and the overall GC content (%GC) across all bases in all sequences. High-quality single-cell data typically have very few poor-quality sequences and exhibit a uniform sequence length. @@ -83,7 +83,7 @@ Additionally, the GC content should align with the expected GC content of the ge A good basic statistics report example. ::: -**2. Per Base Sequence Quality** +**2. Per base sequence quality** The per-base sequence quality view displays a box-and-whisker plot for each position in the read, illustrating the range of quality scores across all bases at each position. The x-axis represents the positions within the read, while the y-axis shows the quality scores. @@ -101,9 +101,9 @@ If poor-quality calls are observed, quality trimming may be necessary. [A more d A good (left) and a bad (right) per-read sequence quality graph. ::: -**3. Per Tile Sequence Quality** +**3. Per tile sequence quality** -Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell` tile. +Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell` [tile](https://www.biostars.org/p/9461090/)(miniature imaging areas of the {term}`flowcell`). The plot uses a color gradient to represent deviations, where “hotter” colors indicate larger deviations. High-quality data typically display a uniform blue color across the plot, indicating consistent quality across all tiles of the flowcell. @@ -117,7 +117,7 @@ For further investigation, consult resources like [QC Fail](https://sequencing.q A good (left) and a bad (right) per tile sequence quality view. ::: -**4. Per Sequence Quality Scores** +**4. Per sequence quality scores** The per-sequence quality score plot displays the distribution of average quality scores for each read in the file. The x-axis represents the average quality scores, while the y-axis shows the frequency of each score. @@ -130,9 +130,9 @@ If additional peaks appear, it may indicate a subset of reads with quality issue A good (left) and a bad (right) per sequence quality score plot. ::: -**5. Per Base Sequence Content** +**5. Per base sequence content** -The per-base sequence content plot shows the percentage of each nucleotide (A, T, G, C) called at each base position across all reads in the file. +The per-base sequence content plot shows the percentage of each nucleotide (A, T, G, and C) called at each base position across all reads in the file. For single-cell data, it is common to observe fluctuations at the start of the reads. This occurs because the initial bases represent the sequence of the priming sites, which are often not perfectly random. This is a frequent occurrence in RNA-seq libraries, even though `FastQC` may flag it with a warning or failure, as noted on the [QC Fail website](https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/). @@ -143,7 +143,7 @@ This is a frequent occurrence in RNA-seq libraries, even though `FastQC` may fla A good (left) and bad (right) per base sequence content plot. ::: -**6. Per Sequence GC Content** +**6. Per sequence GC content** The per-sequence GC content plot displays the GC content distribution across all reads (in red) compared to a theoretical distribution (in blue). The central peak of the observed distribution should align with the overall GC content of the transcriptome. @@ -163,7 +163,7 @@ The plot on the left is from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genom The plot on the right is taken from [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html). ::: -**7. Per Base N Content** +**7. Per base N content** The per-base N content plot displays the percentage of bases at each position that were called as ``N``, indicating that the sequencer lacked sufficient confidence to assign a specific nucleotide. In a high-quality library, the ``N`` content should remain consistently at or near zero across the entire length of the reads. @@ -176,7 +176,7 @@ Any noticeable non-zero ``N`` content may indicate issues with sequencing qualit A good (left) and a bad (right) per base N content plot. ::: -**8. Sequence Length Distribution** +**8. Sequence length distribution** The sequence length distribution graph displays the distribution of read lengths across all sequences in the file. For most single-cell sequencing chemistries, all reads are expected to have the same length, resulting in a single peak in the graph. @@ -189,7 +189,7 @@ Small differences in read lengths due to trimming are normal and should not be a A good (left) and a bad (right) sequence length distribution plot. ::: -**9. Sequence Duplication Levels** +**9. Sequence duplication levels** The sequence duplication level plot illustrates the distribution of duplication levels for read sequences, represented by the blue line, both before and after deduplication. In single-cell platforms, multiple rounds of {term}`PCR` are typically required, and highly expressed genes naturally produce a large number of transcripts. Additionally, since `FastQC` is not UMI-aware (i.e., it does not account for unique molecular identifiers), it is common for a small subset of sequences to show high duplication levels. @@ -203,7 +203,7 @@ However, the majority of sequences should still exhibit low duplication levels, A good (left) and a bad (right) per sequence duplication levels plot. ::: -**10. Overrepresented Sequences** +**10. Overrepresented sequences** The overrepresented sequences module identifies read sequences that constitute more than 0.1% of the total reads. In single-cell sequencing, some overrepresented sequences may arise from highly expressed genes amplified during PCR. @@ -218,7 +218,7 @@ Such cases warrant further investigation to ensure data quality. An overrepresented sequence table. ::: -**11. Adapter Content** +**11. Adapter content** The adapter content module displays the cumulative percentage of reads containing {term}`adapter sequences` at each base position. High levels of adapter sequences indicate incomplete removal of adapters during library preparation, which can interfere with downstream analyses. @@ -295,7 +295,7 @@ Additionally, "soft-clipping" may be used to reduce penalties for mismatches, in While these variations modify the rules of the dynamic programming recurrence and traceback, they do not fundamentally alter its overall complexity. In addition to general alignment techniques, several sophisticated modifications and heuristics have been developed to enhance the practical efficiency of aligning genomic sequencing reads. -For example,`banded alignment` {cite}`chao1992aligning` is a popular heuristic used by many tools to avoid computing large portions of the dynamic programming table when alignment scores below a threshold are not of interest. +For example, `banded alignment` {cite}`chao1992aligning` is a popular heuristic used by many tools to avoid computing large portions of the dynamic programming table when alignment scores below a threshold are not of interest. Other heuristics, like X-drop {cite}`zhang2000` and Z-drop {cite}`li2018minimap2`, efficiently prune unpromising alignments early in the process. Recent advances, such as wavefront alignment {cite}`marco2021fast`, marco2022optimal, enable the determination of optimal alignments in significantly reduced time and space, particularly when high-scoring alignments are present. Additionally, much work has focused on optimizing data layout and computation to leverage instruction-level parallelism {cite}`wozniak1997using, rognes2000six, farrar2007striped`, and expressing dynamic programming recurrences in ways that facilitate data parallelism and vectorization, such as through difference encoding {cite:t}`Suzuki2018`. Most widely-used alignment tools incorporate these highly optimized, vectorized implementations. @@ -550,8 +550,8 @@ There exist other challenges that we do not focus upon here, such as "convergent We consider the case where the same UMI is used to tag two different pre-PCR molecules arising from the same gene, in the same cell, as a convergent collision. When two or more distinct UMIs arise from the same pre-PCR molecule, e.g., due to the sampling of multiple priming sites from this molecule, we consider this a divergent collision. We expect convergent UMI collisions to be rare and, therefore, their effect typically small. -Further, transcript-level mapping information can sometimes be used to resolve such collisions{cite}`Srivastava2019`. -Divergent UMI collisions occur primarily among introns of unspliced transcripts {cite}`technote_10x_intronic_reads`, and approaches to addressing the issues they raise are an area of active research{cite}`technote_10x_intronic_reads,Gorin2021`. +Further, transcript-level mapping information can sometimes be used to resolve such collisions {cite}`Srivastava2019`. +Divergent UMI collisions occur primarily among introns of unspliced transcripts {cite}`technote_10x_intronic_reads`, and approaches to addressing the issues they raise are an area of active research {cite}`technote_10x_intronic_reads,Gorin2021`. Given that the use of UMIs is near ubiquitous in high-throughput scRNA-seq protocols and the fact that addressing these errors improves the estimation of gene abundances, there has been much attention paid to the problem of UMI resolution in recent literature {cite}`Islam2013,Bose2015,raw:Macosko2015,Smith2017,Srivastava2019,Kaminow2021,Melsted2021,raw:He2022,calib,umic,zumis`. @@ -568,8 +568,8 @@ In the context of single-cell data, a UMI graph $G(V,E)$ is a {term}`directed gr Each node $v_i \in V$ represents an equivalence class (EC) of reads, and the edge set $E$ encodes the relationship between the ECs. The equivalence relation $\sim_r$ defined on reads is based on their UMI and mapping information. We say reads $r_x$ and $r_y$ are equivalent, $r_x \sim_r r_y$, if and only if they have identical UMI tags and map to the same set of references. -UMI resolution approaches may define a "reference" as a genomic locus{cite}`Smith2017`, transcript{cite}`Srivastava2019,raw:He2022` or gene{cite}`raw:Zheng2017,Kaminow2021`. -Other UMI resolution approaches exist, for example, the reference-free model{cite}`umic` and the method of moments{cite}`Melsted2021`, but they may not be easily represented in this framework and are not discussed in further detail here. +UMI resolution approaches may define a "reference" as a genomic locus {cite}`Smith2017`, transcript {cite}`Srivastava2019,raw:He2022` or gene {cite}`raw:Zheng2017,Kaminow2021`. +Other UMI resolution approaches exist, for example, the reference-free model {cite}`umic` and the method of moments {cite}`Melsted2021`, but they may not be easily represented in this framework and are not discussed in further detail here. In the UMI graph framework, a UMI resolution approach can be divided into three major steps: **defining nodes**, **defining adjacency relationships**, and **resolving components**. @@ -599,8 +599,8 @@ Here we define the following functions on the node $v_i \in V$: - $m(v_i)$ is the reference set encoded in the mapping information, for $v_i$. - $D(v_i, v_j)$ is the distance between $u(v_i)$ and $u(v_j)$, where $v_j \in V$. -Given these function definitions, any two nodes $v_i, v_j \in V$ will be incident with a bi-directed edge if and only if $m(v_i) \cap m(v_j) \ne \emptyset$ and $D(v_i,v_j) \le \theta$, where $\theta$ is a distance threshold and is often set as $\theta=1${cite}`Smith2017,Kaminow2021,Srivastava2019`. -Additionally, the bi-directed edge might be replaced by a directed edge incident from $v_i$ to $v_j$ if $c(v_i) \ge 2c(v_j) -1$ or vice versa{cite}`Smith2017,Srivastava2019`. +Given these function definitions, any two nodes $v_i, v_j \in V$ will be incident with a bi-directed edge if and only if $m(v_i) \cap m(v_j) \ne \emptyset$ and $D(v_i,v_j) \le \theta$, where $\theta$ is a distance threshold and is often set as $\theta=1$ {cite}`Smith2017,Kaminow2021,Srivastava2019`. +Additionally, the bi-directed edge might be replaced by a directed edge incident from $v_i$ to $v_j$ if $c(v_i) \ge 2c(v_j) -1$ or vice versa {cite}`Smith2017,Srivastava2019`. Though these edge definitions are among the most common, others are possible, so long as they are completely defined by the $u$, $c$, $m$, and $D$ functions. With $V$ and $E$ in hand, the UMI graph $G = (V,E)$ is now defined. (raw-proc:umi-graph-resolution-def)= @@ -608,7 +608,7 @@ Though these edge definitions are among the most common, others are possible, so #### Defining the graph resolution approach Given the defined UMI graph, many different resolution approaches may be applied. -A resolution method may be as simple as finding the set of connected components, clustering the graph, greedily collapsing nodes or contracting edges{cite}`Smith2017`, or searching for a cover of the graph by structures following certain rules (e.g., monochromatic arboresences{cite}`Srivastava2019`) to reduce the graph. +A resolution method may be as simple as finding the set of connected components, clustering the graph, greedily collapsing nodes or contracting edges {cite}`Smith2017`, or searching for a cover of the graph by structures following certain rules (e.g., monochromatic arboresences {cite}`Srivastava2019`) to reduce the graph. As a result, each node in the reduced UMI graph, or each element in the cover in the case that the graph is not modified dynamically, represents a pre-PCR molecule. The collapsed nodes or covering sets are regarded as the PCR duplicates of that molecule. @@ -622,7 +622,7 @@ For approaches that probabilistically resolve ambiguity caused by multimapping, The last step in UMI resolution is quantifying the abundance of each gene using the resolved UMI graph. For approaches that discard multi-gene ECs, the molecule count vector for the genes in the current cell being processed (or count vector for short) is generated by counting the number of ECs labeled with each gene. On the other hand, approaches that process, rather than discard, multi-gene ECs usually resolve the ambiguity by applying some statistical inference procedure. -For example, {cite:t}`Srivastava2019` introduce an expectation-maximization (EM) approach for probabilistically assigning multi-gene UMIs, and related EM algorithms have also been introduced as optional steps in subsequent tools{cite}`Melsted2021,Kaminow2021,raw:He2022`. +For example, {cite:t}`Srivastava2019` introduce an expectation-maximization (EM) approach for probabilistically assigning multi-gene UMIs, and related EM algorithms have also been introduced as optional steps in subsequent tools {cite}`Melsted2021,Kaminow2021,raw:He2022`. In this model, the collapsed-EC-to-gene assignments are latent variables, and the deduplicated molecule count of genes are the main parameters. Intuitively, evidence from gene-unique ECs will be used to help probabilistically apportion the multi-gene ECs. The EM algorithm seeks the parameters that together have the (locally) highest likelihood of generating the observed ECs. @@ -640,7 +640,7 @@ Once a count matrix has been generated, it is important to perform a quality con There are several distinct assessments that generally fall under the rubric of quality control. Basic global metrics are often recorded and reported to help assess the overall quality of the sequencing measurement itself. These metrics consist of quantities such as the total fraction of mapped reads, the distribution of distinct UMIs observed per cell, the distribution of UMI deduplication rates, the distribution of detected genes per cell, etc. -These and similar metrics are often recorded by the quantification tools themselves{cite}`raw:Zheng2017,Kaminow2021,Melsted2021,raw:He2022` since they arise naturally and can be computed during the process of read mapping, cell barcode correction, and UMI resolution. +These and similar metrics are often recorded by the quantification tools themselves {cite}`raw:Zheng2017,Kaminow2021,Melsted2021,raw:He2022` since they arise naturally and can be computed during the process of read mapping, cell barcode correction, and UMI resolution. Likewise, there exist several tools to help organize and visualize these basic metrics, such as the [Loupe browser](https://support.10xgenomics.com/single-cell-gene-expression/software/visualization/latest/what-is-loupe-cell-browser), [alevinQC](https://github.com/csoneson/alevinQC), or a [kb_python report](https://github.com/pachterlab/kb_python), depending upon the quantification pipeline being used. Beyond these basic global metrics, at this stage of analysis, QC metrics are designed primarily to help determine which cells (CBs) have been sequenced "successfully", and which exhibit artifacts that warrant filtering or correction. @@ -708,12 +708,12 @@ If all of these metrics are consistently low across cells or if these plots conv ### Empty droplet detection One of the first QC steps is determining which cell barcodes correspond to "high-confidence" sequenced cells. -It is common in droplet-based protocols{cite}`raw:Macosko2015` that certain barcodes are associated with ambient {term}`RNA` instead of the RNA of a captured cell. +It is common in droplet-based protocols {cite}`raw:Macosko2015` that certain barcodes are associated with ambient {term}`RNA` instead of the RNA of a captured cell. This happens when droplets fail to capture a cell. These empty droplets still tend to produce sequenced reads, although the characteristics of these reads look markedly different from those associated with barcodes corresponding to properly captured cells. Many approaches exist to assess whether a barcode likely corresponds to an empty droplet or not. One simple method is to examine the cumulative frequency plot of the barcodes, in which barcodes are sorted in descending order of the number of distinct UMIs with which they are associated. -This plot often contains a "knee" that can be identified as a likely point of discrimination between properly captured cells and empty droplets{cite}`Smith2017,raw:He2022`. +This plot often contains a "knee" that can be identified as a likely point of discrimination between properly captured cells and empty droplets {cite}`Smith2017,raw:He2022`. While this "knee" method is intuitive and can often estimate a reasonable threshold, it has several drawbacks. For example, not all cumulative histograms display an obvious knee, and it is notoriously difficult to design algorithms that can robustly and automatically detect such knees. Finally, the total UMI count associated with a barcode may not, alone, be the best signal to determine if the barcode was associated with an empty or damaged cell. @@ -721,13 +721,13 @@ Finally, the total UMI count associated with a barcode may not, alone, be the be This led to the development of several tools specifically designed to detect empty or damaged droplets, or cells generally deemed to be of "low quality" {cite}`Lun2019,Heiser2021,Hippen2021,Muskovic2021,Alvarez2020,raw:Young2020`. These tools incorporate a variety of different measures of cell quality, including the frequencies of distinct UMIs, the number of detected genes, and the fraction of mitochondrial RNA, and typically work by applying a statistical model to these features to classify high-quality cells from putative empty droplets or damaged cells. This means that cells can typically be scored, and a final filtering can be selected based on an estimated posterior probability that cells are not empty or compromised. -While these models generally work well for single-cell {term}`RNA`-seq data, one may have to apply several additional filters or heuristics to obtain robust filtering in single-nucleus {term}`RNA`-seq data{cite}`Kaminow2021,raw:He2022`, like those exposed in the [`emptyDropsCellRanger`](https://github.com/MarioniLab/DropletUtils/blob/master/R/emptyDropsCellRanger.R) function of `DropletUtils`{cite}`Lun2019`. +While these models generally work well for single-cell {term}`RNA`-seq data, one may have to apply several additional filters or heuristics to obtain robust filtering in single-nucleus {term}`RNA`-seq data {cite}`Kaminow2021,raw:He2022`, like those exposed in the [`emptyDropsCellRanger`](https://github.com/MarioniLab/DropletUtils/blob/master/R/emptyDropsCellRanger.R) function of `DropletUtils` {cite}`Lun2019`. ### Doublet detection In addition to determining which cell barcodes correspond to empty droplets or damaged cells, one may also wish to identify those cell barcodes that correspond to doublets or multiplets. When a given droplet captures two (doublets) or more (multiplets) cells, this can result in a skewed distribution for these cell barcodes in terms of quantities like the number of reads and UMIs they represent, as well as gene expression profiles they display. -Many tools have also been developed to predict the doublet status of cell barcodes{cite}`DePasquale2019,McGinnis2019,Wolock2019,Bais2019,Bernstein2020`. +Many tools have also been developed to predict the doublet status of cell barcodes {cite}`DePasquale2019,McGinnis2019,Wolock2019,Bais2019,Bernstein2020`. Once detected, cells determined to likely be doublets and multiplets can be removed or otherwise adjusted for in the subsequent analysis. (raw-proc:output-representation)= @@ -741,7 +741,7 @@ Accurately resolving UMIs is particularly challenging, and issues related to UMI Additionally, multiple priming sites, especially in unspliced molecules, can violate the commonly assumed one molecule-to-one UMI relationship. In light of these challenges and the simplifications adopted to address them, there remains active research as to how best to represent the preprocessed data to downstream tools. -For example, several quantification tools{cite}`Srivastava2019,Melsted2021,Kaminow2021,raw:He2022` implement an _optional_ EM algorithm, initially introduced in this context by {cite:t}`Srivastava2019`, that probabilistically apportions UMIs associated with reads that map to more than one gene. +For example, several quantification tools {cite}`Srivastava2019,Melsted2021,Kaminow2021,raw:He2022` implement an _optional_ EM algorithm, initially introduced in this context by {cite:t}`Srivastava2019`, that probabilistically apportions UMIs associated with reads that map to more than one gene. This, however, can result in non-integer count matrices that may be unexpected by certain downstream tools. Alternatively, UMIs can be resolved to _gene groups_ instead of individual genes, preserving multimapping information in the preprocessed output. Notably, a similar approach has been used at the transcript level for over a decade as a succinct internal representation in bulk RNA-seq transcript quantification tools {cite}`Turro2011,Nicolae2011,Patro2014,Bray2016,Patro2017,Ju2017`. @@ -938,7 +938,7 @@ adata_sa = pyroe.load_fry(quant_dir) ``` The default behavior loads the `X` layer of the `Anndata` object as the sum of the spliced and ambiguous counts for each gene. -However, recent work{cite}`Pool2022` and [updated practices](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/release-notes) suggest that the inclusion of intronic counts, even in single-cell RNA-seq data, may increase sensitivity and benefit downstream analyses. +However, recent work {cite}`Pool2022` and [updated practices](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/release-notes) suggest that the inclusion of intronic counts, even in single-cell RNA-seq data, may increase sensitivity and benefit downstream analyses. While the best way to make use of this information is the subject of ongoing research, since `alevin-fry` automatically quantifies spliced, unspliced, and ambiguous reads in each sample, the count matrix containing the total counts for each gene can be simply obtained as follows: ```python @@ -1055,11 +1055,11 @@ After running these commands, the resulting quantification information can be fo Other relevant information concerning the mapping, CB correction, and UMI resolution steps can be found in the `salmon_alevin`, `alevin_fry_gpl`, and `alevin_fry_quant` folders, respectively. In the example given here, we demonstrate using `simpleaf` and `alevin-fry` to process a 10x Chromium 3' v3 dataset. -`Alevin-fry` and `simpleaf` provide many other options for processing different single-cell protocols, including but not limited to Dropseq{cite}`raw:Macosko2015`, sci-RNA-seq3{cite}`raw:Cao2019` and other 10x Chromium platforms. +`Alevin-fry` and `simpleaf` provide many other options for processing different single-cell protocols, including but not limited to Dropseq {cite}`raw:Macosko2015`, sci-RNA-seq3 {cite}`raw:Cao2019` and other 10x Chromium platforms. A more comprehensive list and description of available options for different stages of processing can be found in the [`alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/) and [`simpleaf`](https://github.com/COMBINE-lab/simpleaf) documentation. `alevin-fry` also provides a [nextflow](https://www.nextflow.io/docs/latest/)-based workflow, called [quantaf](https://github.com/COMBINE-lab/quantaf), for conveniently processing many samples from a simply-defined sample sheet. -Of course, similar resources exist for many of the other raw data processing tools referenced and described throughout this section, including [`zUMIs`](https://github.com/sdparekh/zUMIs/wiki){cite}`zumis`, [`alevin`](https://salmon.readthedocs.io/en/latest/alevin.html){cite}`Srivastava2019`, [`kallisto|bustools`](https://www.kallistobus.tools/){cite}`Melsted2021`, [`STARsolo`](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md){cite}`Kaminow2021` and [`CellRanger`](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger). +Of course, similar resources exist for many of the other raw data processing tools referenced and described throughout this section, including [`zUMIs`](https://github.com/sdparekh/zUMIs/wiki) {cite}`zumis`, [`alevin`](https://salmon.readthedocs.io/en/latest/alevin.html) {cite}`Srivastava2019`, [`kallisto|bustools`](https://www.kallistobus.tools/) {cite}`Melsted2021`, [`STARsolo`](https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md) {cite}`Kaminow2021` and [`CellRanger`](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger). The [`scrnaseq`](https://nf-co.re/scrnaseq) pipeline from [`nf-core`](https://nf-co.re/) also provides a nextflow-based pipeline for processing single-cell RNA-seq data generated using a range of different chemistries and integrates several of the tools described in this section. (raw-proc:useful-links)= From 2968e03af7b8c9fdd4a8513084c075074428ff16 Mon Sep 17 00:00:00 2001 From: Luis Date: Sat, 1 Feb 2025 18:02:58 +0100 Subject: [PATCH 07/29] reviewing until 3.3.2. Future challenges (inclusive) --- jupyter-book/introduction/raw_data_processing.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index cd42ffa7..a037bd3c 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -44,7 +44,7 @@ A quick and effective way to perform this is by using quality control (QC) tools While many modern single-cell data processing tools include some built-in quality checks—such as evaluating the N content of sequences or the fraction of mapped reads—it is still good practice to run an independent QC check. This provides additional metrics that are often useful for identifying broader quality issues. -For readers interested in what a typical `FastQC` report looks like, in the following toggle content, example reports for both [high-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [low-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the `FastQC`[manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) are used to demonstrate the modules in the `FastQC` report. +For readers interested in what a typical `FastQC` report looks like, in the following toggle content, example reports for both [high-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [low-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) are used to demonstrate the modules in the `FastQC` report. Although these tutorials are not explicitly made for single-cell data, many of the results are still relevant for single-cell data, with a few caveats described below. In the toggle section, all graphs, except specifically mentioned, are taken from the example reports on the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). @@ -365,7 +365,7 @@ For instance, methods such as those described by {cite:t}`Pool2022` incorporate While spliced alignment against the full genome offers versatility, it also comes with certain trade-offs. One major limitation is the high memory requirements of commonly used alignment tools in the single-cell space. Many of these tools are based on the **STAR** aligner {cite}`dobin2013star`, due to its speed and versatillity, and require substantial computational resources. -For a human-scale genome, constructing and storing the index can demand over $32$ GB of memory. +For a human-scale genome, constructing and storing the index can demand over 32 GB of memory. Using a sparse [suffix array](https://en.wikipedia.org/wiki/Suffix_array) can nearly halve the final index size, but this comes at the cost of reduced alignment speed and still requires significant memory for initial construction. Additionally, spliced alignment is inherently more complex than contiguous alignment. @@ -458,7 +458,7 @@ Several common strategies are used for cell barcode identification and correctio - Any barcode not in the list is corrected by finding the closest match from the permit list, typically using {term}`Hamming distance` or {term}`edit distance`. This strategy allows for efficient barcode correction but has limitations. If a corrupted barcode closely resembles multiple barcodes in the permit list, its correction becomes ambiguous. - For example, for a barcode taken from the [10x Chromium v3 permit list](https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/3M-february-2018.txt.gz) and mutated at a single position to a barcode not in the list, there is an $\sim 81\%$ probability that it sits at hamming distance $1$ from two or more barcodes in the permit list. + For example, for a barcode taken from the [10x Chromium v3 permit list](https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/3M-february-2018.txt.gz) and mutated at a single position to a barcode not in the list, there is an 81\% probability that it sits at hamming distance 1 from two or more barcodes in the permit list. The probability of such collisions can be reduced by considering correcting _only_ against barcodes from the known permit list, which, themselves, occur exactly in the given sample (or even only those that occur exactly in the given sample above some nominal frequency threshold). Also, information such as the base quality at the "corrected" position can be used to potentially break ties in the case of ambiguous corrections. Yet, as the number of assayed cells increases, insufficient sequence diversity in the set of potential cell barcodes increases the frequency of ambiguous corrections, and reads tagged with barcodes having ambiguous corrections are most commonly discarded. @@ -473,20 +473,20 @@ Several common strategies are used for cell barcode identification and correctio Further parameters of the knee-finding algorithms can be altered to yield more or less restrictive selected barcode sets. Yet, such an approach can have certain drawbacks, like a tendency to be overly conservative and sometimes failing to work robustly in samples where no clear knee is present. -3. **Filtering and Correction Based on an Expected Cell Count**: +3. **Filtering and correction based on an expected cell count**: When barcode frequency distributions lack a clear knee or show bimodal patterns due to technical artifacts, barcode correction can be guided by a user-provided expected cell count. In such an approach, the user provides an estimate of the expected number of assayed cells. Then, the barcodes are ordered by descending frequency, the frequency $f$ at a robust quantile index near the expected cell count is obtained, and all cells having a frequency within a small constant fraction of $f$ (e.g., $\ge \frac{f}{10}$) are considered as valid barcodes. Again, the remaining barcodes are corrected against this valid list by attempting to correct uniquely to one of these valid barcodes based on sequence similarity. -4. **Filtering Based on a Forced Number of Valid Cells**: +4. **Filtering based on a forced number of valid cells**: The simplest approach, although potentially problematic, is for the user to manually specify the number of valid barcodes. - The user chooses an index in the sorted barcode frequency list. - All barcodes above this threshold are considered valid. - Remaining barcodes are corrected against this list using standard similarity-based correction methods. While this guarantees selection of at least n cells, it assumes that the chosen threshold accurately reflects the number of real cells. - It is only reasonable if he user has a good reason to believe that the threshold frequency should be set around the provided index. + It is only reasonable if the user has a good reason to believe that the threshold frequency should be set around the provided index. %In the `alevin-fry` framework, the frequency of every observed cell barcode is generated, and there are four customizable options to select the high-quality cell barcodes for downstream analysis: From e19bbbdf0046e222c2c975f0cacb88d879a61368 Mon Sep 17 00:00:00 2001 From: Luis Date: Sun, 2 Feb 2025 18:57:46 +0100 Subject: [PATCH 08/29] reviewing until 3.8. A real-world example (exclusive) --- jupyter-book/introduction/raw_data_processing.md | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index a037bd3c..a739bd77 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -44,7 +44,7 @@ A quick and effective way to perform this is by using quality control (QC) tools While many modern single-cell data processing tools include some built-in quality checks—such as evaluating the N content of sequences or the fraction of mapped reads—it is still good practice to run an independent QC check. This provides additional metrics that are often useful for identifying broader quality issues. -For readers interested in what a typical `FastQC` report looks like, in the following toggle content, example reports for both [high-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [low-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) are used to demonstrate the modules in the `FastQC` report. +For readers interested in what a typical `FastQC` report looks like, in the following toggle content, example reports for both [high-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [low-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/technical-documents/fastqc-tutorial-and-faq.aspx), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) are used to demonstrate the modules in the `FastQC` report. Although these tutorials are not explicitly made for single-cell data, many of the results are still relevant for single-cell data, with a few caveats described below. In the toggle section, all graphs, except specifically mentioned, are taken from the example reports on the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). @@ -159,7 +159,7 @@ As a result, some deviation from the theoretical distribution is not unusual in Per Sequence GC Content A good (left) and a bad (right) per sequence GC content plot. -The plot on the left is from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/). +The plot on the left is from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/technical-documents/fastqc-tutorial-and-faq.aspx). The plot on the right is taken from [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html). ::: @@ -365,7 +365,7 @@ For instance, methods such as those described by {cite:t}`Pool2022` incorporate While spliced alignment against the full genome offers versatility, it also comes with certain trade-offs. One major limitation is the high memory requirements of commonly used alignment tools in the single-cell space. Many of these tools are based on the **STAR** aligner {cite}`dobin2013star`, due to its speed and versatillity, and require substantial computational resources. -For a human-scale genome, constructing and storing the index can demand over 32 GB of memory. +For a human-scale genome, constructing and storing the index can demand over $32$ GB of memory. Using a sparse [suffix array](https://en.wikipedia.org/wiki/Suffix_array) can nearly halve the final index size, but this comes at the cost of reduced alignment speed and still requires significant memory for initial construction. Additionally, spliced alignment is inherently more complex than contiguous alignment. @@ -458,7 +458,7 @@ Several common strategies are used for cell barcode identification and correctio - Any barcode not in the list is corrected by finding the closest match from the permit list, typically using {term}`Hamming distance` or {term}`edit distance`. This strategy allows for efficient barcode correction but has limitations. If a corrupted barcode closely resembles multiple barcodes in the permit list, its correction becomes ambiguous. - For example, for a barcode taken from the [10x Chromium v3 permit list](https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/3M-february-2018.txt.gz) and mutated at a single position to a barcode not in the list, there is an 81\% probability that it sits at hamming distance 1 from two or more barcodes in the permit list. + For example, for a barcode taken from the [10x Chromium v3 permit list](https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/3M-february-2018.txt.gz) and mutated at a single position to a barcode not in the list, there is an $\sim 81\%$ probability that it sits at hamming distance $1$ from two or more barcodes in the permit list. The probability of such collisions can be reduced by considering correcting _only_ against barcodes from the known permit list, which, themselves, occur exactly in the given sample (or even only those that occur exactly in the given sample above some nominal frequency threshold). Also, information such as the base quality at the "corrected" position can be used to potentially break ties in the case of ambiguous corrections. Yet, as the number of assayed cells increases, insufficient sequence diversity in the set of potential cell barcodes increases the frequency of ambiguous corrections, and reads tagged with barcodes having ambiguous corrections are most commonly discarded. @@ -535,7 +535,7 @@ Here, we concern ourselves primarily with two challenges. Failing to address such UMI errors can inflate the estimated number of molecules {cite}`Smith2017,ziegenhain2022molecular`. - **Multimapping**: - This issue arises in cases where a read or UMI belongs to multiple references(e.g., multi-gene reads/UMIs). This happens when different reads of a UMI map to different genes, when a read maps to multiple genes, or both. + This issue arises in cases where a read or UMI belongs to multiple references (e.g., multi-gene reads/UMIs). This happens when different reads of a UMI map to different genes, when a read maps to multiple genes, or both. The consequence of this issue is that the gene origin of the multi-gene reads/UMIs is ambiguous, which results in uncertainty about the sampled pre-PCR molecule count of those genes. Simply discarding multi-gene reads/UMIs can lead to a loss of data or a biased estimate among genes that tend to produce multimapping reads, such as sequence-similar gene families {cite}`Srivastava2019`. @@ -672,7 +672,7 @@ The figure shows the plots in the alevinQC report of an example single-cell data Each dot represents a corrected cell barcode with its corrected profile. ::: -The first(top left) view in {numref}`raw-proc-fig-alevinqc-plots` shows the distribution of cell barcode frequency in decreasing order. +The first (top left) view in {numref}`raw-proc-fig-alevinqc-plots` shows the distribution of cell barcode frequency in decreasing order. In all plots shown above, each point represents a corrected cell barcode, with its x-coordinate corresponding to its cell barcode frequency rank. In the top left plot, the y-coordinate corresponds to the observed frequency of the corrected barcode. Generally, this plot shows a "knee"-like pattern, which can be used to identify the initial list of high-quality barcodes. @@ -759,7 +759,7 @@ It is, of course, important to note that the development of methods and tools fo It is therefore often useful and reasonable, when performing your own analyses, to experiment with several different tools. At the coarsest level, the most common tools can process data robustly and accurately. -It has been suggested that with many common downstream analyses like clustering, and the methods used to perform them, the choice of preprocessing tool typically makes less difference than other steps in the analysis process {cite}`You_2021` +It has been suggested that with many common downstream analyses like clustering, and the methods used to perform them, the choice of preprocessing tool typically makes less difference than other steps in the analysis process {cite}`You_2021`. Nonetheless, it has also been observed that applying lightweight mapping restricted to the spliced transcriptome can increase the probability of spurious mapping and gene expression {cite}`Bruning_2022`. Ultimately, the choice of a specific tool largely depends on the task at hand, and the constraints on available computational resources. @@ -1076,7 +1076,7 @@ The [`scrnaseq`](https://nf-co.re/scrnaseq) pipeline from [`nf-core`](https://nf Tutorials for processing scRNA-seq raw data from [the galaxy project](https://galaxyproject.org/) can be found at [here](https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/scrna-preprocessing-tenx/tutorial.html) and [here](https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/scrna-preprocessing/tutorial.html). -Tutorials for explaining and evaluating FastQC report are available from [MSU](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), [Galaxy Training](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html) and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/). +Tutorials for explaining and evaluating FastQC report are available from [MSU](https://rtsf.natsci.msu.edu/genomics/technical-documents/fastqc-tutorial-and-faq.aspx), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), [Galaxy Training](https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html) and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/). (raw-proc:references)= From a4172806ab045224bc3f17157721632dd732a15b Mon Sep 17 00:00:00 2001 From: Luis Date: Mon, 3 Feb 2025 17:58:36 +0100 Subject: [PATCH 09/29] reviewing done --- jupyter-book/introduction/raw_data_processing.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index a739bd77..0b40a7a8 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -780,7 +780,7 @@ Usually, a reference includes the genome sequences and the corresponding gene an In this example, we will use _chromosome 5_ of the human genome and its related gene annotations as the reference, which is a subset of the Human reference, [GRCh38 (GENCODE v32/Ensembl 98) reference](https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/build#GRCh38_2020A) from the 10x Genomics reference build. Correspondingly, we extract the subset of reads that map to the generated reference from a [human brain tumor dataset](https://www.10xgenomics.com/resources/datasets/200-sorted-cells-from-human-glioblastoma-multiforme-3-lt-v-3-1-3-1-low-6-0-0) from 10x Genomics. -[`Alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/){cite}`raw:He2022` is a fast, accurate, and memory-frugal single-cell and single-nucleus data processing tool. +[`Alevin-fry`](https://alevin-fry.readthedocs.io/en/latest/) {cite}`raw:He2022` is a fast, accurate, and memory-frugal single-cell and single-nucleus data processing tool. [Simpleaf](https://github.com/COMBINE-lab/simpleaf) is a program, written in [rust](https://www.rust-lang.org/), that exposes a unified and simplified interface for processing some of the most common protocols and data types using the `alevin-fry` pipeline. A nextflow-based [workflow](https://github.com/COMBINE-lab/quantaf) tool also exists to process extensive collections of single-cell data. Here we will first show how to process single-cell raw data using two `simpleaf` commands. Then, we describe the complete set of `salmon alevin` and `alevin-fry` commands to which these `simpleaf` commands correspond, to outline where the steps described in this section occur and to convey the possible different processing options. @@ -803,7 +803,7 @@ conda activate af Conda does not currently build most packages natively for Apple silicon. Therefore, if you -are using a non-Intel-based Apple computer (e.g., with an M1(Pro/Max/Ultra) or M2 chip), you +are using a non-Intel-based Apple computer (e.g., with a M1 (Pro/Max/Ultra) or M2 chip), you should make sure to specify that your environment uses the Rosetta2 translation layer. To do this, you can replace the above commands with the following (instructions adopted from [here](https://github.com/Haydnspass/miniforge#rosetta-on-mac-with-apple-silicon-hardware)): From 917e40f816bad870c6b3ef0ce5ca86ef375dd1bc Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Mon, 3 Feb 2025 21:38:18 +0100 Subject: [PATCH 10/29] final touch, ready for PR --- jupyter-book/introduction/raw_data_processing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 0b40a7a8..a2212c03 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -803,7 +803,7 @@ conda activate af Conda does not currently build most packages natively for Apple silicon. Therefore, if you -are using a non-Intel-based Apple computer (e.g., with a M1 (Pro/Max/Ultra) or M2 chip), you +are using a non-Intel-based Apple computer (e.g., with an M1 (Pro/Max/Ultra) or M2 chip), you should make sure to specify that your environment uses the Rosetta2 translation layer. To do this, you can replace the above commands with the following (instructions adopted from [here](https://github.com/Haydnspass/miniforge#rosetta-on-mac-with-apple-silicon-hardware)): From e9acdfc2f9293e48cd3eac47560334d06607601d Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Thu, 6 Feb 2025 19:58:24 +0100 Subject: [PATCH 11/29] Update jupyter-book/introduction/raw_data_processing.md Co-authored-by: Lukas Heumos --- jupyter-book/introduction/raw_data_processing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index a2212c03..9d90985d 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -406,7 +406,7 @@ More reads can be confidently assigned compared to using only the spliced transc Augmented transcriptomes are widely used in methods that do not map to the full genome, particularly for single-nucleus data processing and RNA velocity analysis {cite}`Soneson2021Preprocessing` (see {doc}`../trajectories/rna_velocity`). These augmented references can be constructed for all common methods that do not rely on spliced alignment to the full genome {cite}`Srivastava2019,Melsted2021,raw:He2022`. -{cite:t}`raw:He2022` argue that this approach is valuable even for standard single-cell RNA-seq data and recommend constructing an augmented _splici_ reference (spliced + intronic) for mapping and quantification. +{cite}`raw:He2022` argue that this approach is valuable even for standard single-cell RNA-seq data and recommend constructing an augmented _splici_ reference (spliced + intronic) for mapping and quantification. The _splici_ reference is built using the spliced transcriptome sequence alongside sequences representing the merged intronic intervals of annotated genes. Each reference sequence is labeled with its splicing status, and mapping results are processed using splicing status-aware methods for {ref}`raw-proc:umi-resolution`. From 715cf5561d8f5ab1e61bd29aadf05bdd331e46a6 Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Thu, 6 Feb 2025 20:30:33 +0100 Subject: [PATCH 12/29] Revert "Update jupyter-book/introduction/raw_data_processing.md" This reverts commit e9acdfc2f9293e48cd3eac47560334d06607601d. --- jupyter-book/introduction/raw_data_processing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 9d90985d..a2212c03 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -406,7 +406,7 @@ More reads can be confidently assigned compared to using only the spliced transc Augmented transcriptomes are widely used in methods that do not map to the full genome, particularly for single-nucleus data processing and RNA velocity analysis {cite}`Soneson2021Preprocessing` (see {doc}`../trajectories/rna_velocity`). These augmented references can be constructed for all common methods that do not rely on spliced alignment to the full genome {cite}`Srivastava2019,Melsted2021,raw:He2022`. -{cite}`raw:He2022` argue that this approach is valuable even for standard single-cell RNA-seq data and recommend constructing an augmented _splici_ reference (spliced + intronic) for mapping and quantification. +{cite:t}`raw:He2022` argue that this approach is valuable even for standard single-cell RNA-seq data and recommend constructing an augmented _splici_ reference (spliced + intronic) for mapping and quantification. The _splici_ reference is built using the spliced transcriptome sequence alongside sequences representing the merged intronic intervals of annotated genes. Each reference sequence is labeled with its splicing status, and mapping results are processed using splicing status-aware methods for {ref}`raw-proc:umi-resolution`. From a01a0460be75f051c9d7aff9b1f61e03523f7323 Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Fri, 7 Feb 2025 13:29:05 +0100 Subject: [PATCH 13/29] Update jupyter-book/glossary.md Co-authored-by: Lukas Heumos --- jupyter-book/glossary.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index 90b8b797..6a9f4851 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -99,7 +99,7 @@ Drop-seq A protocol for scRNA-seq that separates cells into nano-liter sized aqueous droplets enabling large-scale profiling. Edit distance - Edit distance (often referring to Levenshtein distance) measures the minimum number of operations (Substitution, Insertion, Deletion) required to transform one string into another. + Edit distance (often referred to as Levenshtein distance) measures the minimum number of operations (Substitution, Insertion, Deletion) required to transform one string into another. FASTQ reads Sequencing reads that are saved in the FASTQ format. From cb85bb03ea540bcac5513e2051c79d92e9276cb7 Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Fri, 7 Feb 2025 13:52:09 +0100 Subject: [PATCH 14/29] fix a bit of glossary and chapter 3 --- jupyter-book/glossary.md | 26 +++++++++---------- .../introduction/raw_data_processing.md | 2 +- 2 files changed, 13 insertions(+), 15 deletions(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index 90b8b797..635df6dd 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -13,13 +13,9 @@ Algorithms AnnData AnnDatas - A Python package for annotated data matrices. - The primary data structure used in the scverse ecosystem. - -Backtrace -backtrace - In the context of sequence alignment, a backtrace refers to the step-by-step reconstruction of the alignment path that produces the optimal alignment score. - This path shows how the sequences are aligned, including matches, mismatches, insertions, deletions, and gaps. + A Python package for handling annotated data matrices, commonly used in single-cell and other omics analyses. + It provides an efficient way to store data as a matrix where rows (observations) and columns (features) can have associated metadata. + [AnnData](https://anndata.readthedocs.io/en/latest/index.html) supports slicing, subsetting, and saving to disk in formats like H5AD and Zarr. Barcode Barcodes @@ -40,11 +36,9 @@ Bulk RNA sequencing Therefore, resolution is lost, but bulk sequencing is usually cheaper, less laborious and faster to analyze. Cell - The fundamental unit of life. - Consists of cytoplasm enclosed within a membrane that contains many biomolecules such as proteins and nucleic acids. - Cells acquire specific functions, transition to cell types, divide, communicate and keep the organism going. - Learning about the structure, activity and communication of cells helps deciphering biology. - + The fundamental unit of life, consisting of cytoplasm enclosed within a membrane, containing biomolecules such as proteins and nucleic acids. + Cells acquire specific functions, transition into different types, divide, and communicate to sustain an organism. + Studying cell structure, activity, and interactions enables insights into gene expression dynamics, cellular trajectories, developmental lineages, and disease mechanisms. Cell barcode See {term}`barcode` @@ -165,10 +159,14 @@ Pseudotime Pseudotime is usually related to real time events, but not necessarily the same. RNA - Ribonucleic acid. Single-stranded nucleic acid present in all living cells that encodes and regulates gene expression. + Ribonucleic acid (RNA) is a single-stranded nucleic acid present in all living cells that encodes and regulates gene expression. + Unlike DNA, RNA can be highly dynamic, acting as a messenger (mRNA) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA). + RNA plays a central role in transcription, translation, and cellular responses, making it essential for understanding gene regulation, development, and disease. + + RT-qPCR - Quantitative reverse transcription PCR (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR. + Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR. scanpy A Python package for single-cell analysis in Python by scverse. diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index a2212c03..cf7443c5 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -750,7 +750,7 @@ In this case, instead of the resulting count matrix having dimensions $C \times By propagating this information to the output count matrix, one can avoid the application of heuristic or probabilistic UMI resolution methods that can result in loss of data, or bias, in the counts used in downstream analyses. Of course, to make use of this information, downstream analysis methods must expect the information in this format. Further, those downstream methods must typically have a way to resolve these counts, eventually, to the level of genes, as the abundance of _gene groups_ lacks the intuitive biological interpretability of gene abundance. -Nonetheless, the benefits provided by such representations, in terms of conveying more complete and accurate information to downstream analysis tools, can be substantial, and tools taking advantage of such representations are being developed (e.g. [DifferentialRegulation](https://github.com/SimoneTiberi/DifferentialRegulation)); this is still an active area of research. +Nonetheless, the benefits provided by such representations, in terms of conveying more complete and accurate information to downstream analysis tools, can be substantial, and developing tools taking advantage of such representations is still an active area of research. ## Brief discussion From aa2d3ddf2e435fa618e41f70aa22faac88f7cbb2 Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Fri, 7 Feb 2025 13:52:34 +0100 Subject: [PATCH 15/29] Update jupyter-book/glossary.md Co-authored-by: Lukas Heumos --- jupyter-book/glossary.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index 6a9f4851..ac7c5c58 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -107,7 +107,7 @@ FASTQ reads Flowcell flowcell - A consumable device used in sequencing platforms, such as those from Illumina, where DNA or RNA fragments are sequenced. + A consumable device used in sequencing platforms where DNA or RNA fragments are sequenced. It consists of a glass or polymer surface with lanes or channels coated with oligonucleotides, which capture and anchor DNA or RNA fragments. During sequencing, these fragments are amplified into clusters, and their sequences are determined by detecting fluorescent signals emitted during nucleotide incorporation. The flowcell enables high-throughput sequencing by allowing millions of fragments to be sequenced simultaneously. From 20824480a74bbe692e5d332c9e6a63427a3cce3f Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Fri, 7 Feb 2025 14:11:10 +0100 Subject: [PATCH 16/29] Update jupyter-book/introduction/raw_data_processing.md Co-authored-by: Lukas Heumos --- jupyter-book/introduction/raw_data_processing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index a2212c03..d00fb321 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -13,7 +13,7 @@ This matrix represents the estimated number of distinct molecules derived from e An overview of the topics discussed in this chapter. In the plot, "txome" stands for transcriptome. ::: -The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`raw:Zappia2021`, including normalization, integration, and filtering through methods for cell type identification, developmental trajectory inference, and expression dynamics studies. +The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`raw:Zappia2021`, including cell type identification or developmental trajectory inference. A robust and accurate count matrix is essential for reliable {term}`downstream analyses`. Errors at this stage can lead to invalid conclusions and discoveries based on missed insights, or distorted signals in the data. Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges, which are active areas of computational development. From 1f7af203331fbd29844a792ef97b321bde5918fb Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Fri, 7 Feb 2025 14:12:46 +0100 Subject: [PATCH 17/29] Update jupyter-book/introduction/raw_data_processing.md Co-authored-by: Lukas Heumos --- jupyter-book/introduction/raw_data_processing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index d00fb321..cd7eeb90 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -440,7 +440,7 @@ The tag, sequence, and demultiplexing method used for single-cell profiling is g However, in droplet-based libraries, the number of observed cell barcodes (CBs) can differ significantly—often by several fold—from the number of originally encapsulated cells. This discrepancy arises from several key sources of error: -- Doublets/Multiplets: A single barcode may be associated with multiple cells, leading to an undercounting of cells. +- Doublets/multiplets: A single barcode may be associated with multiple cells, leading to an undercounting of cells. - Empty Droplets: Some droplets contain no encapsulated cells, and ambient RNA can become tagged with a barcode and sequenced, resulting in overcounting of cells. - Sequence Errors: Errors introduced during PCR amplification or sequencing can distort barcode counts, contributing to both under- and over-counting. From e0314d3d1f2fff2bffc0d715e4a5d47a6fefa543 Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Fri, 7 Feb 2025 14:17:15 +0100 Subject: [PATCH 18/29] Update jupyter-book/introduction/raw_data_processing.md Co-authored-by: Lukas Heumos --- jupyter-book/introduction/raw_data_processing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index cd7eeb90..90766786 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -441,7 +441,7 @@ However, in droplet-based libraries, the number of observed cell barcodes (CBs) This discrepancy arises from several key sources of error: - Doublets/multiplets: A single barcode may be associated with multiple cells, leading to an undercounting of cells. -- Empty Droplets: Some droplets contain no encapsulated cells, and ambient RNA can become tagged with a barcode and sequenced, resulting in overcounting of cells. +- Empty droplets: Some droplets contain no encapsulated cells, and ambient RNA can become tagged with a barcode and sequenced, resulting in overcounting of cells. - Sequence Errors: Errors introduced during PCR amplification or sequencing can distort barcode counts, contributing to both under- and over-counting. To address these issues, computational tools for demultiplexing RNA-seq reads into cell-specific bins use various diagnostic indicators to filter out artefactual or low-quality data. From c85dd4ce1c6a7db5f0b1f62295196983e6764583 Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Fri, 7 Feb 2025 14:17:29 +0100 Subject: [PATCH 19/29] Update jupyter-book/introduction/raw_data_processing.md Co-authored-by: Lukas Heumos --- jupyter-book/introduction/raw_data_processing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 90766786..6b840e09 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -442,7 +442,7 @@ This discrepancy arises from several key sources of error: - Doublets/multiplets: A single barcode may be associated with multiple cells, leading to an undercounting of cells. - Empty droplets: Some droplets contain no encapsulated cells, and ambient RNA can become tagged with a barcode and sequenced, resulting in overcounting of cells. -- Sequence Errors: Errors introduced during PCR amplification or sequencing can distort barcode counts, contributing to both under- and over-counting. +- Sequence errors: Errors introduced during PCR amplification or sequencing can distort barcode counts, contributing to both under- and over-counting. To address these issues, computational tools for demultiplexing RNA-seq reads into cell-specific bins use various diagnostic indicators to filter out artefactual or low-quality data. Numerous methods exist for removing ambient RNA contamination {cite}`raw:Young2020,Muskovic2021,Lun2019`, detecting doublets {cite}`DePasquale2019,McGinnis2019,Wolock2019,Bais2019`, and correcting cell barcode errors based on nucleotide sequence similarity. From 275228cfbff9543cf7095a41d4b9e65885636381 Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Fri, 7 Feb 2025 14:22:53 +0100 Subject: [PATCH 20/29] Update jupyter-book/introduction/raw_data_processing.md Co-authored-by: Lukas Heumos --- jupyter-book/introduction/raw_data_processing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 6b840e09..6639384f 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -384,7 +384,7 @@ Since most single-cell experiments are conducted on model organisms like mouse o Compared to the genome, transcriptome sequences are much smaller, significantly reducing the computational resources needed for mapping. Additionally, because splicing patterns are already represented in transcript sequences, this approach eliminates the need for complex spliced alignment. Instead, one can simply search for contiguous alignments or mappings for the read. -Instead, reads can be mapped using contiguous alignments, making both alignment-based and lightweight-mapping techniques suitable for transcriptome references. +Alternatively, reads can be mapped using contiguous alignments, making both alignment-based and lightweight-mapping techniques suitable for transcriptome references. As a result, both approaches are commonly used in popular tools that perform reference mapping against the spliced transcriptome. While these approaches significantly reduce the memory and time required for alignment and mapping, they fail to capture reads that arise from outside the spliced transcriptome. From 7f5886f9fa6eb0c132bfbc091d6c4ec0aa821de0 Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Fri, 7 Feb 2025 14:28:17 +0100 Subject: [PATCH 21/29] Update jupyter-book/introduction/raw_data_processing.md Co-authored-by: Lukas Heumos --- jupyter-book/introduction/raw_data_processing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 6639384f..9973254f 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -104,7 +104,7 @@ A good (left) and a bad (right) per-read sequence quality graph. **3. Per tile sequence quality** Using an Illumina library, the per-tile sequence quality plot highlights deviations from the average quality for reads across each {term}`flowcell` [tile](https://www.biostars.org/p/9461090/)(miniature imaging areas of the {term}`flowcell`). -The plot uses a color gradient to represent deviations, where “hotter” colors indicate larger deviations. +The plot uses a color gradient to represent deviations, where warmer colors indicate larger deviations. High-quality data typically display a uniform blue color across the plot, indicating consistent quality across all tiles of the flowcell. If hot colors appear in certain areas, it suggests that only part of the flowcell experienced poor quality. From 823bc0191bf96e912ae1a26e92c1d28e3e6a6077 Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Fri, 7 Feb 2025 14:28:31 +0100 Subject: [PATCH 22/29] Update jupyter-book/introduction/raw_data_processing.md Co-authored-by: Lukas Heumos --- jupyter-book/introduction/raw_data_processing.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 9973254f..c844b554 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -107,7 +107,7 @@ Using an Illumina library, the per-tile sequence quality plot highlights deviati The plot uses a color gradient to represent deviations, where warmer colors indicate larger deviations. High-quality data typically display a uniform blue color across the plot, indicating consistent quality across all tiles of the flowcell. -If hot colors appear in certain areas, it suggests that only part of the flowcell experienced poor quality. +If warm colors appear in certain areas, it suggests that only part of the flowcell experienced poor quality. This could result from transient issues during sequencing, such as bubbles passing through the flowcell or smudges and debris within the flowcell lane. For further investigation, consult resources like [QC Fail](https://sequencing.qcfail.com/articles/position-specific-failures-of-flowcells/) and the [common reasons for warnings](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/12%20Per%20Tile%20Sequence%20Quality.html) provided in the `FastQC` manual. From e4e41a3cfa5889358cc3d8382c587a1665f5a323 Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Fri, 7 Feb 2025 14:46:59 +0100 Subject: [PATCH 23/29] changes from feedbacks --- jupyter-book/glossary.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index c43cc252..15c0997f 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -163,8 +163,6 @@ RNA Unlike DNA, RNA can be highly dynamic, acting as a messenger (mRNA) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA). RNA plays a central role in transcription, translation, and cellular responses, making it essential for understanding gene regulation, development, and disease. - - RT-qPCR Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR. From 0ef17da8235b74f1f6a51431d692a1b585e6ff59 Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Tue, 11 Feb 2025 14:25:40 +0100 Subject: [PATCH 24/29] fix the problem with Zappia link --- .../introduction/raw_data_processing.bib | 2 +- jupyter-book/introduction/raw_data_processing.md | 2 +- jupyter-book/preamble.bib | 16 ++++++++++++++++ jupyter-book/preamble.md | 2 +- 4 files changed, 19 insertions(+), 3 deletions(-) diff --git a/jupyter-book/introduction/raw_data_processing.bib b/jupyter-book/introduction/raw_data_processing.bib index 40de2f76..d1a30bbb 100644 --- a/jupyter-book/introduction/raw_data_processing.bib +++ b/jupyter-book/introduction/raw_data_processing.bib @@ -1,4 +1,4 @@ -@Article{raw:Zappia2021, +@Article{Zappia2021_raw, author={Zappia, Luke and Theis, Fabian J.}, title={Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape}, diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 5721b535..f7484e85 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -13,7 +13,7 @@ This matrix represents the estimated number of distinct molecules derived from e An overview of the topics discussed in this chapter. In the plot, "txome" stands for transcriptome. ::: -The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`raw:Zappia2021`, including cell type identification or developmental trajectory inference. +The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`Zappia2021_raw`, including cell type identification or developmental trajectory inference. A robust and accurate count matrix is essential for reliable {term}`downstream analyses`. Errors at this stage can lead to invalid conclusions and discoveries based on missed insights, or distorted signals in the data. Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges, which are active areas of computational development. diff --git a/jupyter-book/preamble.bib b/jupyter-book/preamble.bib index f747c686..dd71637e 100644 --- a/jupyter-book/preamble.bib +++ b/jupyter-book/preamble.bib @@ -18,3 +18,19 @@ @Article{Macaulay2017 url={https://doi.org/10.1016/j.tig.2016.12.003}, language={eng} } + +@Article{Zappia2021_pre, +author={Zappia, Luke +and Theis, Fabian J.}, +title={Over 1000 tools reveal trends in the single-cell RNA-seq analysis landscape}, +journal={Genome Biology}, +year={2021}, +month={Oct}, +day={29}, +volume={22}, +number={1}, +pages={301}, +issn={1474-760X}, +doi={10.1186/s13059-021-02519-4}, +url={https://doi.org/10.1186/s13059-021-02519-4} +} \ No newline at end of file diff --git a/jupyter-book/preamble.md b/jupyter-book/preamble.md index 6e87786f..cf8dae8e 100644 --- a/jupyter-book/preamble.md +++ b/jupyter-book/preamble.md @@ -22,7 +22,7 @@ To address this challenge, researchers employ a variety of strategies, with one Traditionally, each cells' transcriptome was primarily examined in a process known as single-cell RNA sequencing. However, recent advancements in single-cell genomics now enable the integration of transcriptome data with spatial, chromatin accessibility, or protein-level information. These developments not only enhance our understanding of complex regulatory mechanisms but also introduce additional challenges in data analysis. -Currently, analysts are faced with an overwhelming array of computational tools - over 1,700 methods dedicated to single-cell RNA-seq alone {cite}`raw:Zappia2021`. +Currently, analysts are faced with an overwhelming array of computational tools - over 1,700 methods dedicated to single-cell RNA-seq alone {cite}`Zappia2021_pre`. Navigating this extensive landscape to produce reliable, cutting-edge results poses a significant challenge. ## What this book covers From 8c2de032611ba8ad203d8702216fa91f260e681d Mon Sep 17 00:00:00 2001 From: Lukas Heumos Date: Tue, 11 Feb 2025 20:52:44 +0100 Subject: [PATCH 25/29] Raw review Signed-off-by: Lukas Heumos --- jupyter-book/glossary.md | 17 +++ .../introduction/raw_data_processing.md | 136 ++++++------------ 2 files changed, 59 insertions(+), 94 deletions(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index 15c0997f..b3ace574 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -17,6 +17,11 @@ AnnDatas It provides an efficient way to store data as a matrix where rows (observations) and columns (features) can have associated metadata. [AnnData](https://anndata.readthedocs.io/en/latest/index.html) supports slicing, subsetting, and saving to disk in formats like H5AD and Zarr. +BAM +BAM files + BAM files are binary, compressed versions of SAM (Sequence Alignment/Map) files that store sequencing read alignments to a reference genome. + They contain the same information as {term}`SAM` files - including read sequences, quality scores, and alignment positions - but in a more space-efficient format that enables faster processing and reduced storage requirements. + Barcode Barcodes Bar code @@ -95,8 +100,10 @@ Drop-seq Edit distance Edit distance (often referred to as Levenshtein distance) measures the minimum number of operations (Substitution, Insertion, Deletion) required to transform one string into another. +FASTQ FASTQ reads Sequencing reads that are saved in the FASTQ format. + A FASTQ file stores DNA/RNA sequences and their corresponding quality scores in a 4-line format: identifier, sequence, optional description, and quality scores encoded in ASCII characters. FASTQ files are then used to map against the reference genome of interest to obtain gene counts for cells. Flowcell @@ -163,9 +170,19 @@ RNA Unlike DNA, RNA can be highly dynamic, acting as a messenger (mRNA) to carry genetic instructions, a structural or catalytic component (rRNA, snRNA), or a regulator of gene expression (miRNA, siRNA, lncRNA). RNA plays a central role in transcription, translation, and cellular responses, making it essential for understanding gene regulation, development, and disease. +RNA velocity + RNA velocity measures the rate of change in gene expression by comparing the ratio of unspliced (pre-mRNA) to spliced (mature) mRNA transcripts in single-cell RNA sequencing data. + This ratio provides insight into whether genes are being actively transcribed (increasing expression) or degraded (decreasing expression), allowing researchers to predict the future state of cells. + The concept leverages the fact that pre-mRNA signals indicate new transcription while mature mRNA levels reflect steady-state expression, enabling inference of cellular trajectory and developmental dynamics. + RT-qPCR Quantitative reverse transcription {term}`PCR` (RT-qPCR) monitors the amplification of a targeted {term}`DNA` molecule during the PCR. +SAM +SAM files + SAM (Sequence Alignment/Map) files are tab-delimited text files that store sequencing alignment data, showing how sequencing reads map to a reference genome. + Each line in a SAM file contains information about a single read alignment, including the read sequence, base quality scores, mapping position, and mapping quality. + scanpy A Python package for single-cell analysis in Python by scverse. diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index f7484e85..14fbc0cd 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -2,9 +2,7 @@ # Raw data processing -Here, we discuss the fundamental aspects of "preprocessing" for single-cell and single-nucleus RNA-sequencing (sc/snRNA-seq) data. While "preprocessing" is a common terminology, it can be misleading. -The process involves critical decisions about handling and representing the data that directly impact subsequent analyses. -For clarity, we will refer to this phase of processing as "raw data processing", and our focus will be on the phase of data analysis that begins with lane-demultiplexed FASTQ files and ends with a count matrix. +Raw data processing in single-cell sequencing converts sequencing machine output (so-called lane-demultiplexed {term}`FASTQ` files) into readily analyzable representations such as a count matrix. This matrix represents the estimated number of distinct molecules derived from each gene per quantified cell, sometimes categorized by the inferred splicing status of each molecule ({numref}`raw-proc-fig-overview`). :::{figure-md} raw-proc-fig-overview @@ -13,9 +11,9 @@ This matrix represents the estimated number of distinct molecules derived from e An overview of the topics discussed in this chapter. In the plot, "txome" stands for transcriptome. ::: -The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`Zappia2021_raw`, including cell type identification or developmental trajectory inference. -A robust and accurate count matrix is essential for reliable {term}`downstream analyses`. Errors at this stage can lead to invalid conclusions and discoveries based on missed insights, or distorted signals in the data. -Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges, which are active areas of computational development. +A robust and accurate count matrix is essential for reliable {term}`downstream analyses`. +Errors at this stage can lead to invalid conclusions and discoveries based on missed insights, or distorted signals in the data. +Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges. In this section, we focus on key steps of raw data processing: @@ -41,8 +39,7 @@ After obtaining raw FASTQ files, it is important to evaluate the quality of the A quick and effective way to perform this is by using quality control (QC) tools like `FastQC`. `FastQC` generates a detailed report for each FASTQ file, summarizing key metrics such as quality scores, base content, and other statistics that help identify potential issues arising from library preparation or sequencing. -While many modern single-cell data processing tools include some built-in quality checks—such as evaluating the N content of sequences or the fraction of mapped reads—it is still good practice to run an independent QC check. -This provides additional metrics that are often useful for identifying broader quality issues. +While many modern single-cell data processing tools include some built-in quality checks—such as evaluating the N content of sequences or the fraction of mapped reads - it is still good practice to run an independent QC check. For readers interested in what a typical `FastQC` report looks like, in the following toggle content, example reports for both [high-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html) and [low-quality](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html) Illumina data provided by the `FastQC` [manual webpage](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), along with the tutorials and descriptions from [the RTSF at MSU](https://rtsf.natsci.msu.edu/genomics/technical-documents/fastqc-tutorial-and-faq.aspx), [the HBC training program](https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html), and [the QC Fail website](https://sequencing.qcfail.com/software/fastqc/) are used to demonstrate the modules in the `FastQC` report. Although these tutorials are not explicitly made for single-cell data, many of the results are still relevant for single-cell data, with a few caveats described below. @@ -54,8 +51,6 @@ For single-cell datasets, such as 10x Chromium v2 and v3, this typically corresp In contrast, technical reads, which contain barcode and UMI sequences, often do not exhibit biologically typical sequence or GC content. However, certain metrics, like the fraction of `N` base calls, are still relevant for all reads. -By running an initial quality check using tools like `FastQC`, researchers can identify potential problems early and ensure the raw data is suitable for subsequent processing and analysis. - ```{dropdown} Example FastQC Reports and Tutorials **0. Summary** @@ -85,7 +80,7 @@ A good basic statistics report example. **2. Per base sequence quality** -The per-base sequence quality view displays a box-and-whisker plot for each position in the read, illustrating the range of quality scores across all bases at each position. +The per-base sequence quality view displays a box-and-whisker plot for each position in the read. The x-axis represents the positions within the read, while the y-axis shows the quality scores. For high-quality single-cell data, the yellow boxes—representing the interquartile range of quality scores—should fall within the green area (indicating good quality calls). @@ -191,7 +186,8 @@ A good (left) and a bad (right) sequence length distribution plot. **9. Sequence duplication levels** -The sequence duplication level plot illustrates the distribution of duplication levels for read sequences, represented by the blue line, both before and after deduplication. In single-cell platforms, multiple rounds of {term}`PCR` are typically required, and highly expressed genes naturally produce a large number of transcripts. +The sequence duplication level plot illustrates the distribution of duplication levels for read sequences, represented by the blue line, both before and after deduplication. +In single-cell platforms, multiple rounds of {term}`PCR` are typically required, and highly expressed genes naturally produce a large number of transcripts. Additionally, since `FastQC` is not UMI-aware (i.e., it does not account for unique molecular identifiers), it is common for a small subset of sequences to show high duplication levels. While this may trigger a warning or failure in this module, it does not necessarily indicate a quality issue with the data. @@ -249,17 +245,13 @@ In single-cell sequencing protocols, the raw sequence files typically include: - Unique Molecular Identifiers (UMIs): Tags that distinguish individual molecules to account for amplification bias. - Raw {term}`cDNA` Sequences: The actual read sequences generated from the molecules. -As the first step in raw data processing ({numref}`raw-proc-fig-overview`), accurate mapping or alignment is crucial for reliable downstream analyses. -Errors during this step, such as incorrect mapping of reads to transcripts or genes, can result in inaccurate or misleading count matrices, ultimately compromising the quality of subsequent analyses. +As the first step ({numref}`raw-proc-fig-overview`), accurate mapping or alignment is crucial for reliable downstream analyses. +Errors during this step, such as incorrect mapping of reads to transcripts or genes, can result in inaccurate or misleading count matrices. While mapping read sequences to reference sequences _far_ predates the development of scRNA-seq, the sheer scale of modern scRNA-seq datasets—often involving hundreds of millions to billions of reads—makes this step particularly computationally intensive. Many existing RNA-seq aligners are protocol-agnostic and do not inherently account for features specific to scRNA-seq, such as cell barcodes, UMIs, or their positions and lengths. As a result, additional tools are often required for steps like demultiplexing and UMI resolution {cite}`Smith2017`. -This reliance on separate tools introduces additional computational overhead. -It often necessitates storing large intermediate files, which significantly increases disk space usage. -Moreover, the extra input and output operations required to process these files further contribute to longer runtime requirements, making the mapping stage both resource-intensive and time-consuming. - To address the challenges of aligning and mapping scRNA-seq data, several specialized tools have been developed that handle the additional processing requirements automatically or internally. These tools include: @@ -273,7 +265,7 @@ These tools include: These tools provide specialized capabilities for aligning scRNA-seq reads, parsing technical read content (e.g., cell barcodes and UMIs), demultiplexing, and UMI resolution. Although they offer simplified user interfaces, their internal methodologies differ significantly. -Some tools generate traditional intermediate files, such as BAM files, which are processed further, while others operate entirely in memory or use compact intermediate representations to minimize input/output operations and reduce computational overhead. +Some tools generate traditional intermediate files, such as {term}`BAM` files, which are processed further, while others operate entirely in memory or use compact intermediate representations to minimize input/output operations and reduce computational overhead. While these tools vary in their specific algorithms, data structures, and trade-offs in time and space complexity, their approaches can generally be categorized along two axes: @@ -288,15 +280,16 @@ We focus on three main types of mapping algorithms commonly used for mapping sc/ First, we distinguish between alignment-based approaches and lightweight mapping-based approaches ({numref}`raw-proc-fig-alignment-mapping`). Alignment-based methods use various heuristics to identify potential loci from which reads may originate and then score the best nucleotide-level alignment between the read and reference, typically using dynamic programming algorithms. -These algorithms have a long and rich history, with the specific algorithm used depending on the type of alignment required. -For example, [global alignment](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) aligns the entirety of the query and reference sequences, while [local alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) focuses on aligning subsequences. Short-read alignment often employs a semi-global approach, also known as "fitting" alignment, where most of the query aligns to a substring of the reference. +[global alignment](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) aligns the entirety of the query and reference sequences, while [local alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm) focuses on aligning subsequences. +Short-read alignment often employs a semi-global approach, also known as "fitting" alignment, where most of the query aligns to a substring of the reference. Additionally, "soft-clipping" may be used to reduce penalties for mismatches, insertions, or deletions at the start or end of the read, achieved through ["extension" alignment](https://github.com/smarco/WFA2-lib#-33-alignment-span). While these variations modify the rules of the dynamic programming recurrence and traceback, they do not fundamentally alter its overall complexity. -In addition to general alignment techniques, several sophisticated modifications and heuristics have been developed to enhance the practical efficiency of aligning genomic sequencing reads. +Several sophisticated modifications and heuristics have been developed to enhance the practical efficiency of aligning genomic sequencing reads. For example, `banded alignment` {cite}`chao1992aligning` is a popular heuristic used by many tools to avoid computing large portions of the dynamic programming table when alignment scores below a threshold are not of interest. -Other heuristics, like X-drop {cite}`zhang2000` and Z-drop {cite}`li2018minimap2`, efficiently prune unpromising alignments early in the process. Recent advances, such as wavefront alignment {cite}`marco2021fast`, marco2022optimal, enable the determination of optimal alignments in significantly reduced time and space, particularly when high-scoring alignments are present. +Other heuristics, like X-drop {cite}`zhang2000` and Z-drop {cite}`li2018minimap2`, efficiently prune unpromising alignments early in the process. +Recent advances, such as wavefront alignment {cite}`marco2021fast`, marco2022optimal, enable the determination of optimal alignments in significantly reduced time and space, particularly when high-scoring alignments are present. Additionally, much work has focused on optimizing data layout and computation to leverage instruction-level parallelism {cite}`wozniak1997using, rognes2000six, farrar2007striped`, and expressing dynamic programming recurrences in ways that facilitate data parallelism and vectorization, such as through difference encoding {cite:t}`Suzuki2018`. Most widely-used alignment tools incorporate these highly optimized, vectorized implementations. @@ -318,23 +311,23 @@ An abstract overview of the alignment-based method and lightweight mapping-based ::: Alignment-based approaches can be categorized into spliced-alignment and contiguous-alignment methods. -Currently, no lightweight-mapping approaches perform spliced mapping. -**Spliced-alignment methods** allow a sequence read to align across multiple distinct segments of a reference, allowing potentially large gaps between aligned regions. +```{dropdown} Spliced-alignment methods +Spliced-alignment methods allow a sequence read to align across multiple distinct segments of a reference, allowing potentially large gaps between aligned regions. These approaches are particularly useful for aligning RNA-seq reads to the genome, where reads may span {term}`splice junctions`. In such cases, a contiguous sequence in the read may be separated by intron and exon subsequence in the reference, potentially spanning kilobases of sequence. Spliced alignment is especially challenging when only a small portion of a read overlaps a splice junction, as limited sequence information is available to accurately place the overhanging segment. +``` -**Contiguous-alignment methods**, in contrast, require a continuous substring of the reference to align well with the read. +```{dropdown} Contiguous-alignment methods +Contiguous-alignment methods require a continuous substring of the reference to align well with the read. While small insertions and deletions may be tolerated, large gaps—such as those in spliced alignments—are generally not allowed. +``` Alignment-based methods, such as spliced and contiguous alignment, can be distinguished from **lightweight-mapping methods**, which include approaches like **pseudoalignment** {cite}`Bray2016`, **quasi-mapping** {cite}`srivastava2016rapmap`, and **pseudoalignment with structural constraints** {cite}`raw:He2022`. -Lightweight-mapping methods achieve significantly higher speed by bypassing nucleotide-level alignment between the read and reference sequences. -Instead, they determine mapping loci based on alternative rules and heuristics, such as identifying matching k-mers or other exact matches. -These methods may also consider the orientation and relative positions of these matches on both the read and reference (e.g., through chaining). - -While this approach greatly improves speed and throughput, it does not provide easily-interpretable score-based assessments to determine the quality of a match, making it more difficult to assess alignment confidence. +Lightweight-mapping methods achieve significantly higher speed. +However, they do not provide easily-interpretable score-based assessments to determine the quality of a match, making it more difficult to assess alignment confidence. (raw-proc:mapping-references)= @@ -347,7 +340,8 @@ There are three main categories of reference sequences: - Annotated transcriptome - Augmented transcriptome -Currently, not all combinations of mapping algorithms and reference sequences are possible. For instance, lightweight-mapping algorithms do not yet support spliced mapping of reads against a reference genome. +Currently, not all combinations of mapping algorithms and reference sequences are possible. +For instance, lightweight-mapping algorithms do not yet support spliced mapping of reads against a reference genome. (raw-proc:genome-mapping)= @@ -360,19 +354,6 @@ Since many reads originate from **spliced transcripts**, this method requires a A key advantage of this approach is that it accounts for reads arising from any location in the genome, not just those from annotated transcripts. Additionally, because a **genome-wide index** is constructed, there is minimal additional cost in reporting not only reads that map to known spliced transcripts but also those that overlap introns or align within non-coding regions, making this method equally effective for **single-cell** and **single-nucleus** data. Another benefit is that even reads mapping outside annotated transcripts, exons, or introns can still be accounted for, enabling **_post hoc_ augmentation** of the quantified loci. -For instance, methods such as those described by {cite:t}`Pool2022` incorporate expressed {term}`UTR` extensions in a sample-specific, data-driven manner, potentially increasing gene detection and improving quantification sensitivity. - -While spliced alignment against the full genome offers versatility, it also comes with certain trade-offs. -One major limitation is the high memory requirements of commonly used alignment tools in the single-cell space. -Many of these tools are based on the **STAR** aligner {cite}`dobin2013star`, due to its speed and versatillity, and require substantial computational resources. -For a human-scale genome, constructing and storing the index can demand over $32$ GB of memory. -Using a sparse [suffix array](https://en.wikipedia.org/wiki/Suffix_array) can nearly halve the final index size, but this comes at the cost of reduced alignment speed and still requires significant memory for initial construction. - -Additionally, spliced alignment is inherently more complex than contiguous alignment. -Because current spliced-alignment tools must explicitly compute a score for each read, this approach has a higher computational cost compared to alternatives. - -Finally, spliced alignment requires an available reference genome for the organism under study. -While this is rarely an issue for well-characterized model organisms, it can pose challenges when working with non-model organisms, where only a transcriptome assembly may be available. (raw-proc:txome-mapping)= @@ -385,36 +366,24 @@ Compared to the genome, transcriptome sequences are much smaller, significantly Additionally, because splicing patterns are already represented in transcript sequences, this approach eliminates the need for complex spliced alignment. Instead, one can simply search for contiguous alignments or mappings for the read. Alternatively, reads can be mapped using contiguous alignments, making both alignment-based and lightweight-mapping techniques suitable for transcriptome references. -As a result, both approaches are commonly used in popular tools that perform reference mapping against the spliced transcriptome. While these approaches significantly reduce the memory and time required for alignment and mapping, they fail to capture reads that arise from outside the spliced transcriptome. As a result, they are not suitable for processing single-nucleus data. Even in single-cell experiments, reads arising from outside of the spliced transcriptome can constitute a substantial fraction of all data, and there is growing evidence that such reads should be incorporated into subsequent analysis {cite}`technote_10x_intronic_reads,Pool2022`. Even in single-cell experiments, a substantial fraction of reads may arise from regions outside the spliced transcriptome, and increasing evidence suggests that incorporating these reads into downstream analyses can be beneficial {cite}`technote_10x_intronic_reads,Pool2022`. -Additionally, when paired with lightweight-mapping methods, short sequences shared between the spliced transcriptome and the actual genomic regions that generated a read can lead to spurious mappings. This, in turn, may result in misleading and even biologically implausible gene expression estimates {cite}`Kaminow2021,Bruning2022Comparative,raw:He2022`. +Additionally, when paired with lightweight-mapping methods, short sequences shared between the spliced transcriptome and the actual genomic regions that generated a read can lead to spurious mappings. +This, in turn, may result in misleading and even biologically implausible gene expression estimates {cite}`Kaminow2021,Bruning2022Comparative,raw:He2022`. (raw-proc:aug-txome-mapping)= #### Mapping to an augmented transcriptome To account for reads originating outside spliced transcripts, the spliced transcript sequences can be augmented with additional reference sequences, such as full-length unspliced transcripts or excised intronic sequences. -By incorporating these elements, augmented transcriptome references maintain a smaller index than the full genome while still allowing for contiguous read alignments. -This enables faster and more memory-efficient mapping compared to full-genome alignment, while still capturing many reads that would otherwise be missed when mapping solely to the spliced transcriptome. - -Additionally, expanding the reference set improves mapping accuracy. +This enables better, faster, and more memory-efficient mapping compared to full-genome alignment, while still capturing many reads that would otherwise be missed. More reads can be confidently assigned compared to using only the spliced transcriptome, and when combined with lightweight mapping approaches, spurious mappings can be significantly reduced {cite}`raw:He2022`. -Augmented transcriptomes are widely used in methods that do not map to the full genome, particularly for single-nucleus data processing and RNA velocity analysis {cite}`Soneson2021Preprocessing` (see {doc}`../trajectories/rna_velocity`). +Augmented transcriptomes are widely used in methods that do not map to the full genome, particularly for single-nucleus data processing and {term}`RNA velocity` analysis {cite}`Soneson2021Preprocessing` (see {doc}`../trajectories/rna_velocity`). These augmented references can be constructed for all common methods that do not rely on spliced alignment to the full genome {cite}`Srivastava2019,Melsted2021,raw:He2022`. -{cite:t}`raw:He2022` argue that this approach is valuable even for standard single-cell RNA-seq data and recommend constructing an augmented _splici_ reference (spliced + intronic) for mapping and quantification. -The _splici_ reference is built using the spliced transcriptome sequence alongside sequences representing the merged intronic intervals of annotated genes. -Each reference sequence is labeled with its splicing status, and mapping results are processed using splicing status-aware methods for {ref}`raw-proc:umi-resolution`. - -This approach offers several key benefits. -It allows the use of lightweight mapping methods while significantly reducing spurious mappings. -Additionally, it enables the detection of both spliced and unspliced reads, improving sensitivity in downstream analyses {cite}`technote_10x_intronic_reads,Pool2022`. -Since splicing status is tracked and reported separately, this method also unifies the preprocessing pipeline across single-cell, single-nucleus, and RNA velocity analyses, making it a versatile solution for transcript quantification. - (raw-proc:cb-correction)= ## Cell barcode correction @@ -451,7 +420,7 @@ Several common strategies are used for cell barcode identification and correctio 1. **Correction against a known list of _potential_ barcodes**: Certain chemistries, such as 10x Chromium, draw CBs from a known pool of potential barcode sequences. - Thus, the set of barcodes observed in any sample is expected to be a subset of this known list, often called a "whitelist", "permit list", or "pass list". + Thus, the set of barcodes observed in any sample is expected to be a subset of this known list, often called a "whitelist". In this case, the standard approach assumes that: - Any barcode matching an entry in the known list is correct. @@ -464,7 +433,7 @@ Several common strategies are used for cell barcode identification and correctio Yet, as the number of assayed cells increases, insufficient sequence diversity in the set of potential cell barcodes increases the frequency of ambiguous corrections, and reads tagged with barcodes having ambiguous corrections are most commonly discarded. 2. **Knee or elbow-based methods**: - If a set of potential barcodes is unknown — or even if it is known, but one wishes to correct directly from the observed data itself without consulting an external list — one can use a method based on the observation that high-quality barcodes are those associated with the highest number of reads in the sample. + If a set of potential barcodes is unknown - or even if it is known, but one wishes to correct directly from the observed data itself without consulting an external list - one can use a method based on the observation that high-quality barcodes are those associated with the highest number of reads in the sample. To achieve this, one can construct a cumulative frequency plot where barcodes are sorted in descending order based on the number of distinct reads or UMIs they are associated with. Often, this ranked cumulative frequency plot will contain a "knee" or "elbow" – an inflection point that can be used to characterize frequently occurring barcodes from infrequent (and therefore likely erroneous) barcodes. Many methods exist for attempting to identify such an inflection point {cite}`Smith2017,Lun2019,raw:He2022` as a likely point of discrimination between properly captured cells and empty droplets. @@ -488,15 +457,6 @@ Several common strategies are used for cell barcode identification and correctio While this guarantees selection of at least n cells, it assumes that the chosen threshold accurately reflects the number of real cells. It is only reasonable if the user has a good reason to believe that the threshold frequency should be set around the provided index. -%In the `alevin-fry` framework, the frequency of every observed cell barcode is generated, and there are four customizable options to select the high-quality cell barcodes for downstream analysis: - -### Future challenges - -While cellular barcoding of high-throughput single-cell profiling has been a tremendously successful approach, some challenges still remain, especially as the scale of experiments continues to grow. -For example, the design of a robust method for selecting high-quality cell barcodes from the set of all the observations is still an active area of research, with distinct challenges arising, e.g., between single-cell and single-nucleus experiments. -Also, as single-cell technologies have advanced to profile increasing numbers of cells, insufficient sequence diversity in the CB sequence can result in sequence corrections leading to CB collision. -Addressing this latter problem may require more intelligent barcode design methods and continuing increases in the lengths of oligonucleotides used for cell barcoding. - (raw-proc:umi-resolution)= ## UMI resolution @@ -526,8 +486,7 @@ In the ideal case, where the correct (unaltered) UMIs tag reads, the reads of ea Consequently, the UMI deduplication procedure is conceptually straightforward: the reads of a UMI are the PCR duplicates from a single pre-PCR molecule. The number of captured and sequenced molecules of each gene is the number of distinct UMIs observed for this gene. -However, the problems encountered in practice make the simple rules described above insufficient for identifying the gene origin of UMIs in general and necessitate the development of more sophisticated models. -Here, we concern ourselves primarily with two challenges. +However, the problems encountered in practice make the simple rules described above insufficient for identifying the gene origin of UMIs in general and necessitate the development of more sophisticated models: - **Errors in UMIs**: These occur when the sequenced UMI tag of reads contains errors introduced during PCR or the sequencing process. @@ -535,7 +494,8 @@ Here, we concern ourselves primarily with two challenges. Failing to address such UMI errors can inflate the estimated number of molecules {cite}`Smith2017,ziegenhain2022molecular`. - **Multimapping**: - This issue arises in cases where a read or UMI belongs to multiple references (e.g., multi-gene reads/UMIs). This happens when different reads of a UMI map to different genes, when a read maps to multiple genes, or both. + This issue arises in cases where a read or UMI belongs to multiple references (e.g., multi-gene reads/UMIs). + This happens when different reads of a UMI map to different genes, when a read maps to multiple genes, or both. The consequence of this issue is that the gene origin of the multi-gene reads/UMIs is ambiguous, which results in uncertainty about the sampled pre-PCR molecule count of those genes. Simply discarding multi-gene reads/UMIs can lead to a loss of data or a biased estimate among genes that tend to produce multimapping reads, such as sequence-similar gene families {cite}`Srivastava2019`. @@ -555,11 +515,13 @@ Divergent UMI collisions occur primarily among introns of unspliced transcripts Given that the use of UMIs is near ubiquitous in high-throughput scRNA-seq protocols and the fact that addressing these errors improves the estimation of gene abundances, there has been much attention paid to the problem of UMI resolution in recent literature {cite}`Islam2013,Bose2015,raw:Macosko2015,Smith2017,Srivastava2019,Kaminow2021,Melsted2021,raw:He2022,calib,umic,zumis`. +```{dropdown} Graph-based UMI resolution + (raw-proc:graph-based-umi-resolution)= ### Graph-based UMI resolution -As a result of the problems that arise when attempting to resolve UMIs, many methods have been developed to address the problem of UMI resolution. +As a result of the problems that ariOther UMI resolution approaches exist, for example, the reference-free model {cite}`umic` and the method of moments {cite}`Melsted2021`, but they may not be easily represented in this framework and are not discussed in further detail here.se when attempting to resolve UMIs, many methods have been developed to address the problem of UMI resolution. While there are a host of different approaches for UMI resolution, we will focus on a framework for representing problem instances, modified from a framework initially proposed by {cite:t}`Smith2017`, that relies upon the notion of a _UMI graph_. Each connected component of this graph represents a sub-problem wherein certain subsets of UMIs are collapsed (i.e., resolved as evidence of the same pre-PCR molecule). Many popular UMI resolution approaches can be interpreted in this framework by simply modifying precisely how the graph is refined and how the collapse or resolution procedure carried out over this graph works. @@ -569,7 +531,6 @@ Each node $v_i \in V$ represents an equivalence class (EC) of reads, and the edg The equivalence relation $\sim_r$ defined on reads is based on their UMI and mapping information. We say reads $r_x$ and $r_y$ are equivalent, $r_x \sim_r r_y$, if and only if they have identical UMI tags and map to the same set of references. UMI resolution approaches may define a "reference" as a genomic locus {cite}`Smith2017`, transcript {cite}`Srivastava2019,raw:He2022` or gene {cite}`raw:Zheng2017,Kaminow2021`. -Other UMI resolution approaches exist, for example, the reference-free model {cite}`umic` and the method of moments {cite}`Melsted2021`, but they may not be easily represented in this framework and are not discussed in further detail here. In the UMI graph framework, a UMI resolution approach can be divided into three major steps: **defining nodes**, **defining adjacency relationships**, and **resolving components**. @@ -615,6 +576,8 @@ The collapsed nodes or covering sets are regarded as the PCR duplicates of that Different rules for defining the adjacency relationship and different approaches for graph resolution itself can seek to preserve different properties and can define a wide variety of distinct overall UMI resolution approaches. For approaches that probabilistically resolve ambiguity caused by multimapping, the resolved UMI graph may contain multi-gene equivalence classes (ECs), with their gene origins determined in the next step. +``` + (raw-proc:umi-graph-quantification)= #### Quantification @@ -629,8 +592,6 @@ The EM algorithm seeks the parameters that together have the (locally) highest l Usually, the UMI resolution and quantification process described above will be performed separately for each cell, represented by a corrected CB, to create a complete count matrix for all genes in all cells. However, the relative paucity of per-cell information in high-throughput single-cell samples limits the evidence available when performing UMI resolution, which in turn limits the potential efficacy of model-based solutions like the statistical inference procedure described above. -Thus, further research here is certainly warranted. -For example, {cite:t}`Srivastava2020-lf` introduced an approach that allows sharing information among transcriptionally similar cells to improve the quantification result further. (raw-proc:count-qc)= @@ -740,18 +701,6 @@ For example, read mapping is imperfect, as is cell barcode correction. Accurately resolving UMIs is particularly challenging, and issues related to UMIs attached to multimapping reads are often overlooked. Additionally, multiple priming sites, especially in unspliced molecules, can violate the commonly assumed one molecule-to-one UMI relationship. -In light of these challenges and the simplifications adopted to address them, there remains active research as to how best to represent the preprocessed data to downstream tools. -For example, several quantification tools {cite}`Srivastava2019,Melsted2021,Kaminow2021,raw:He2022` implement an _optional_ EM algorithm, initially introduced in this context by {cite:t}`Srivastava2019`, that probabilistically apportions UMIs associated with reads that map to more than one gene. -This, however, can result in non-integer count matrices that may be unexpected by certain downstream tools. -Alternatively, UMIs can be resolved to _gene groups_ instead of individual genes, preserving multimapping information in the preprocessed output. -Notably, a similar approach has been used at the transcript level for over a decade as a succinct internal representation in bulk RNA-seq transcript quantification tools {cite}`Turro2011,Nicolae2011,Patro2014,Bray2016,Patro2017,Ju2017`. -Additionally, transcript-level representations have been proposed for clustering and dimensionality reduction in full-length single-cell RNA-seq data {cite}`Ntranos2016` and for differential expression analysis in single-cell RNA-seq {cite}`Ntranos2019`. -In this case, instead of the resulting count matrix having dimensions $C \times G$, where $G$ is the number of genes in the quantified annotation, it will have dimension $C \times E$, where $E$ is the number of distinct _gene groups_ (commonly called equivalence class labels) across all cells in the given sample. -By propagating this information to the output count matrix, one can avoid the application of heuristic or probabilistic UMI resolution methods that can result in loss of data, or bias, in the counts used in downstream analyses. -Of course, to make use of this information, downstream analysis methods must expect the information in this format. -Further, those downstream methods must typically have a way to resolve these counts, eventually, to the level of genes, as the abundance of _gene groups_ lacks the intuitive biological interpretability of gene abundance. -Nonetheless, the benefits provided by such representations, in terms of conveying more complete and accurate information to downstream analysis tools, can be substantial, and developing tools taking advantage of such representations is still an active area of research. - ## Brief discussion To close this chapter, we convey some observations and suggestions that have arisen from recent benchmarking and review studies surrounding some of the common preprocessing tools described above {cite}`You_2021,Bruning_2022`. @@ -802,9 +751,8 @@ conda activate af ````{admonition} Note on using an Apple silicon-based device Conda does not currently build most packages natively for Apple silicon. -Therefore, if you -are using a non-Intel-based Apple computer (e.g., with an M1 (Pro/Max/Ultra) or M2 chip), you -should make sure to specify that your environment uses the Rosetta2 translation layer. +Therefore, if you are using a non-Intel-based Apple computer (e.g., with an M1 (Pro/Max/Ultra) or M2 chip), +you should make sure to specify that your environment uses the Rosetta2 translation layer. To do this, you can replace the above commands with the following (instructions adopted from [here](https://github.com/Haydnspass/miniforge#rosetta-on-mac-with-apple-silicon-hardware)): From fff13c97fefcad02b38594e532fc4f5618688593 Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Wed, 12 Feb 2025 11:45:50 +0100 Subject: [PATCH 26/29] Update jupyter-book/glossary.md Co-authored-by: Lukas Heumos --- jupyter-book/glossary.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index b3ace574..199ba08c 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -5,7 +5,7 @@ Adapter sequences adapter sequences Short, synthetic DNA or RNA sequences that are ligated to the ends of DNA or RNA fragments during library preparation for sequencing. These adapters are essential for binding the fragments to the flowcell and enabling amplification and sequencing. - However, if adapters are not properly removed or trimmed after sequencing, they can appear in the reads, potentially interfering with alignment and downstream analyses. + However, if adapters are not trimmed after sequencing, they can appear in the reads, potentially interfering with alignment and downstream analyses. Algorithm Algorithms From 314cc2e9f2bf15ca22511d8b57146382743a5732 Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Wed, 12 Feb 2025 11:46:06 +0100 Subject: [PATCH 27/29] Update jupyter-book/glossary.md Co-authored-by: Lukas Heumos --- jupyter-book/glossary.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index 199ba08c..4f3508a7 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -137,7 +137,7 @@ loci Accurate identification of loci is critical for mapping reads and understanding the genomic or transcriptomic context of the data. MuData - A Python package for multimodal annotated data matrices. + A Python package for multimodal annotated data matrices that builds on {term}`AnnData`. The primary data structure in the scverse ecosystem for multimodal data. Muon From a318e15d19b9eee685bb377db483aded1d3b4d71 Mon Sep 17 00:00:00 2001 From: seo <159482645+seohyonkim@users.noreply.github.com> Date: Wed, 12 Feb 2025 11:48:26 +0100 Subject: [PATCH 28/29] Update jupyter-book/glossary.md Co-authored-by: Lukas Heumos --- jupyter-book/glossary.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index 4f3508a7..4b9bc4e4 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -201,7 +201,8 @@ Spike-in RNA Splice Junctions splice junctions - Locations where introns are removed, and exons are joined together in a mature RNA transcript during RNA splicing. These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA. + Locations where introns are removed, and exons are joined together in a mature RNA transcript during RNA splicing. + These junctions occur at specific nucleotide sequences and are critical for the proper assembly of functional mRNA. Trajectory inference Also known as pseudotemporal ordering. From dea6a5f5a8fc1876a29890f8e20140047c35db7a Mon Sep 17 00:00:00 2001 From: seohyonkim Date: Wed, 12 Feb 2025 12:32:47 +0100 Subject: [PATCH 29/29] updates after feedback --- jupyter-book/glossary.md | 2 +- jupyter-book/introduction/raw_data_processing.md | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/jupyter-book/glossary.md b/jupyter-book/glossary.md index 4b9bc4e4..f8012365 100644 --- a/jupyter-book/glossary.md +++ b/jupyter-book/glossary.md @@ -193,7 +193,7 @@ scverse signal-to-noise ratio A measure of the clarity of a signal relative to background noise. In sequencing, the signal represents the detectable information derived from the DNA or RNA molecules being sequenced, while the noise includes random errors or unwanted signals that can obscure or distort the true data. - A high SNR indicates that the signal is strong and reliable compared to the noise, resulting in better data quality. + A high signal-to-noise ratio (SNR) indicates that the signal is strong and reliable compared to the noise, resulting in better data quality. Conversely, a low SNR means the noise may interfere with or reduce the accuracy of the sequencing results. Spike-in RNA diff --git a/jupyter-book/introduction/raw_data_processing.md b/jupyter-book/introduction/raw_data_processing.md index 14fbc0cd..76af466d 100644 --- a/jupyter-book/introduction/raw_data_processing.md +++ b/jupyter-book/introduction/raw_data_processing.md @@ -11,6 +11,7 @@ This matrix represents the estimated number of distinct molecules derived from e An overview of the topics discussed in this chapter. In the plot, "txome" stands for transcriptome. ::: +The count matrix is the foundation for a wide range of scRNA-seq analyses {cite}`Zappia2021_raw`, including cell type identification or developmental trajectory inference. A robust and accurate count matrix is essential for reliable {term}`downstream analyses`. Errors at this stage can lead to invalid conclusions and discoveries based on missed insights, or distorted signals in the data. Despite the straightforward nature of the input (FASTQ files) and the desired output (count matrix), raw data processing presents several technical challenges.