Single-Cell Omics for Transcriptome CHaracterization (SCOTCH): isoform-level characterization of gene expression through long-read single-cell RNA sequencing. See our pre-print at here. All codes and analysis results in the manuscript can be found at here
Recent development involving long-read single-cell transcriptome sequencing (lr-scRNA-Seq) represents a significant leap forward in single-cell genomics. With the recent introduction of R10 flowcells by Oxford Nanopore, computational methods should now shift focus on harnessing the unique benefits of long reads to analyze transcriptome complexity. In this context, we introduce a comprehensive suite of computational methods named Single-Cell Omics for Transcriptome CHaracterization (SCOTCH). Our method is compatible with the single-cell library preparation platform from 10X Genomics, Pacbio Biosciences, and Parse Biosciences, facilitating the analysis of special cell populations, such as neurons, hepatocytes and developing cardiomyocytes.
SCOTCH offers both a preprocessing and a statistical pipeline. The preprocessing pipeline accepts BAM files with tagged barcodes generated by vendor-supplied pipelines (e.g., wf-single-cell) and aligns reads to both known and novel isoforms. SCOTCH can leverage existing high-quality gene annotations, update these annotations using the data to enhance novel isoform identification, or operate entirely annotation-free. SCOTCH outputs count matrices at both the gene and transcript levels. The statistical pipeline supports the analysis of differential transcript usage (DTU), which is determined by the relative abundances of transcripts within the same gene.
To avoid package version conflicts, we strongly recommand to use conda to set up the environment. If you don't have conda installed, you can run codes below in linux to install.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
After installing conda, SCOTCH sources can be downloaded.
git clone https://github.com/WGLab/SCOTCH.git
cd SCOTCH
conda env create --name SCOTCH --file scotch.yml
conda activate SCOTCH
main_preprocessing.py
is the primary function for running the SCOTCH preprocessing pipeline. This pipeline consists of four steps: (1) generating gene annotation files, (2) generating the compatibility matrix, (3) summarizing novel gene annotations, and (4) generating the count matrix. We will cover more details below.
In general, SCOTCH accepts BAM file(s) tagged by vendor-supplied pipelines for its preprocessing pipeline (For ONT data, please see here for more details), and output a series of information including read-isoform mappings, gene count matrix, isoform count matrix, and updated gene annotations. The --bam
argument is used to assign input file(s), and --target
is used to assign the output folder(s).
The SCOTCH pipeline allows users to process one sample at a time or multiple samples simultaneously depending on study designs. When preprocessing multiple samples together, SCOTCH generates a unified gene/isoform annotation based on all BAM files. This approach is particularly beneficial in studies involving multiple samples, as it ensures consistent identification and consolidation of novel isoforms across different samples. Some usage examples are below:
Example1: --bam
accepts a single bam file
--bam sample1.bam --target path/to/sample1/output
Example2: --bam
accepts multiple bam files, each from one sample
--bam sample1.bam sample2.bam --target path/to/sample1/output path/to/sample2/output
Example3: --bam
accepts a folder of bam files from one sample
--bam path/to/sample1/bamfiles/folder --target path/to/sample1/output
Example4: --bam
accepts multiple bam folders, each from one sample
--bam path/to/sample1/bamfiles/folder path/to/sample2/bamfiles/folder --target path/to/sample1/output path/to/sample2/output
Below is the directory structure of output under each target
folder you can expect:
-
auxiliary/
: This folder contains information of read-isoform mappings, cell barcodes, UMI, and exon coordinates of Isoforms. -
bam/
: Information of BAM files preprocessed, including read names, barcodes, read length etc. -
compatible_matrix/
: This directory holds the compatibility matrices of read - isoform alignments. -
reference/
: This folder includes the reference files used or generated by SCOTCH. -
count_matrix/
: This folder holds the count matrices on both gene and isoform levels generated by SCOTCH.
In this step, SCOTCH will generate annotation files for the reference genome and tagged BAM files. Set --tast annotation
to run this step.
SCOTCH offers three modes for generating gene annotations:
- Annotation-Only Mode: SCOTCH can rely entirely on existing gene annotations. This mode allows for the discovery of novel isoforms defined by combinations of known exons. This mode will fail to identify novel isoforms involving unknown exons, such as intron retention. Set
--reference
as path to gene annotation .gtf file. To save time, users can also set--reference_pkl
as the path to SCOTCH generated annotation based on given gtf file. SCOTCH has pre-computated this file based on this (human hg38) provided by 10X genome. In addition, set--update_gtf_off
. - Enhanced-Annotation Mode: SCOTCH can use BAM files from one or multiple samples to update and refine existing gene annotations. This mode allows for the discovery of de novo (sub)exons with more types of novel isoforms than annotation-only mode. Set
--reference
as path to gene annotation .gtf file. To save time, users can also set--reference_pkl
as the path to SCOTCH generated annotation based on given gtf file. SCOTCH has pre-computated this file based on this (human hg38) provided by 10X genome. In addition, set--update_gtf
. - Annotation-Free Mode: SCOTCH can generate gene and isoform annotations based solely on BAM files, allowing for the discovery of novel genes and isoforms. Set
--reference None
and--reference_pkl None
.
--bam
: path(s) to the bam file(s)/folder(s)--workers
: number of threads for parallel computing.--coverage_threshold_exon
: coverage threshold to support exon discovery, percentage to the maximum coverage, larger values will be more conservative, default is 0.02.--coverage_threshold_splicing
: threshold to support splicing discovery, percentage to the maximum splicing junctions, larger values will be more conservative, default is 0.02.--z_score_threshold
: z score threshold to discovery sharp changes of read coverage, larger values will be more conservative, default is 10.
Below is an example of generating annotation in the Enhanced-Annotation Mode for two samples simultanuously. See example/annotation.sh
for an implementation in slurm.
python3 src/main_preprocessing.py \
--task annotation \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--bam path/to/bam/file/or/bamfolder/sample1 path/to/bam/file/or/bamfolder/sample2 \
--reference path/to/genes.gtf \
--workers 30
In this step, SCOTCH aligns reads to existing gene isoforms and identifies novel isoforms based on the annotation files generated in step 1. Set --tast 'compatible matrix'
to run this step. A compatibility matrix is generated for each gene, where each row represents a read and each column represents a gene isoform. Detailed information on read-isoform mappings can be found in the auxiliary
folder. To speed up this step, job arrays can be submitted in SLURM to process genes in parallel. An example implementation for 100 job arrays is provided in example/compatible.sh
.
--target
: the same with step1--bam
the same with step1--match_low
exon percentage lower threshold to call a read-exon unmapping. For example, setting 0.2 means reads covers <20% of the exon length as unmapped. (default is 0.2)--match_low
exon percentage upper threshold to call a read-exon mapping. For example, setting 0.6 means reads covers <20% of the exon length as mapped. (default is 0.6)--total_jobs
: the number of batches to split genes and run in parallel--job_index
: the batch/job index current task to run
python3 src/main_preprocessing.py \
--task 'compatible matrix' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--bam path/to/bam/file/or/bamfolder/sample1 path/to/bam/file/or/bamfolder/sample2
In this step, SCOTCH will generate a new gene annotation file, including all novel isoforms identified. Set --tast summary
to run this step. See example/summary.sh
for an implementation in SLURM.
--reference
: set to None if using the annotation-free mode; set to known annotation gtf file path if using enhanced-annotation or annotation-only mode.
python3 src/main_preprocessing.py \
--task 'summary' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2
--reference path/to/reference/genes.gtf
Finally, SCOTCH will generate gene- and isoform-level copunt matrix. Set --tast 'count matrix'
to run this step. See example/count.sh
for an implementation in SLURM.
--target
: the same with step1- '--novel_read_n' filter out novel isoform is the number of reads mapped for the sample less than
novel_read_n
, and reads mapped to this novel isoform will be treated as uncategorized. Default is 20. --workers
: number of threads for parallel computing.--save_csv
/--save_mtx
: these settings are used to set up output format. Saving csv files takes some time.(default is to save in both csv and mtx formats)
python3 src/main_preprocessing.py \
--task 'count matrix' \
--target path/to/output/folder/of/sample1 path/to/output/folder/of/sample2 \
--workers 20 --save_csv_false --save_mtx
In R, run below codes to install SCOTCH statistical pipeline.
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
library("devtools")
install_github("WGLab/SCOTCH")
Below is sample codes for DTU analysis. Example data can be found in /data
folder.
library(SCOTCH)
#----read gene-level count matrix-----#
sample8_CD4_gene=as.matrix(read.csv('gene_count_1.csv',row.names = 'X'))
sample8_CD8_gene=as.matrix(read.csv('gene_count_2.csv',row.names = 'X'))
#----read transcript-level count matrix-----#
sample8_CD4_transcript=as.matrix(read.csv("transcript_count_1.csv",row.names = 'X'))
gene_transcript_CD4_df = data.frame(genes=str_remove(colnames(sample8_CD4_transcript),"\\.(ENST|novel|uncategorized).+"),
transcripts=colnames(sample8_CD4_transcript))
sample8_CD8_transcript=as.matrix(read.csv("transcript_count_2.csv",row.names = 'X'))
gene_transcript_CD8_df = data.frame(genes=str_remove(colnames(sample8_CD8_transcript),"\\.(ENST|novel|uncategorized).+"),
transcripts=colnames(sample8_CD8_transcript))
#----gene-level analysis-----#
df_gene = scotch_gene(sample8_CD4_gene, sample8_CD8_gene), epsilon=0.01,ncores=10)%>%
filter(pct1>=0.01|pct2>=0.01)
#----transcript-level analysis-----#
df_transcript = scotch_transcript(gene_transcript_CD4_df,gene_transcript_CD8_df,
sample8_CD4_transcript, sample8_CD8_transcript, ncores=10)
df_scotch = df_gene%>%left_join(df_transcript,by=c('genes'='gene'))
SCOTCH will output the following statistics in results:
genes: gene name
pct1: percent of cells expression the gene in cell population 1
pct2: percent of cells expression the gene in cell population 2
logFC: fold change in log scale
p_gene: p value on the gene level
pct_diff: difference between pct1 and pct2
p_gene_adj: adjusted p value of differential gene expression
isoforms: isoform name
alpha1: estimated parameter of the dirichlet distribution for cell population 1
alpha2: estimated parameter of the dirichlet distribution for cell population 2
TU1: relative transcript usage of cell population 1
TU2: relative transcript usage of cell population 2
TU_diff: difference of TU1 and TU2
TU_var1: variance of TU1
TU_var2: variance of TU2
dispersion1: dispersion parameter for cell population 1
dispersion2: dispersion parameter for cell population 2
isoform_switch: whether there is isoform switching event
isoform_switch_ES: effect size of isoform switching event
p_DTU_gene: p value for differential transcript usage analysis for the whole gene
p_transcript: p value for differential transcript usage analysis for the specific transcript
p_transcript_adj: adjusted p value for differential transcript usage analysis for the specific transcript
p_DTU_gene_adj: adjusted p value for differential transcript usage analysis for the whole gene