SICILIAN is a statistical method for identifying high confidence RNA splice junctions from a spliced aligner (currently based on STAR).
Dehghannasiri*, R., Olivieri*, J. E., Damljanovic, A., and Salzman, J. "Specific splice junction detection in single cells with SICILIAN", Genome Biology, August 2021. (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02434-8)
Please contact Roozbeh Dehghannasiri ([email protected]) or Julia Eve Olivieri ([email protected]).
There are three options for running SICILIAN. All three implementations can be used for any scRNA-Seq (and bulk RNA) datasets but each one is better suited for specific sequencing technologies and dataset size):
- Installing using the scripts in this repository and running SICILIAN after installing required libraries (recommended for SmartSeq2 and datasets with large number of samples).
- Using the
Nextflow
/Docker
-based pipeline from (https://github.com/czbiohub/nf-sicilian) with minimal installation (recommended for 10x datasets) - Using the cloud-based online tool on the Cancer Genomics Cloud (CGC) platform (recommended for running on the datasets available on CGC i.e., TCGA, CCLE, ...).
The cloud-based online tool for SICILIAN is available on the Seven Bridges Cancer Genomics Cloud platform sponsored by the National Cancer Institute. The workflow is fully dockerized and has a user-friendly interface, making it easy to run particularly for users with little bioinformatics expertise. Users only need to upload their fastq files or use the datasets publicly-available on CGC (i.e., TCGA, CCLE, TARGET, ...) and select correct annotation and index files to run SICILIAN. The online tool can be accessed via this link.
- Download the latest version of SICILIAN codes by cloning its github repository
git clone https://github.com/salzmanlab/SICILIAN.git
- Intall needed packages for
R
andPython
- Build annotation pickle files and STAR index files for a genome assembly and annotation using
create_annotator.py
script - Set the input variables in
SICILIAN.py
which is the main script that submits all necessary jobs for running SICILIAN on input RNA-Seq data. - For running SICILIAN on 10x scRNA-Seq data, it needs to first demultiplex 10x fastq file based on cell barcode and UMI information stored in R1. For 10x analysis, SICILIAN executes
whitelise
andextract
commands from UMI_tools software . Therefore,UMI_tools
should be preinstalled on the local cluster for running SICILIAN on 10x samples.
-
SICILIAN has been developed using
Python 3.6.1
. The following python packages are needed to be installed (they can be installed using therequirements.txt
file in the SICILIAN repository and commandpip3 install -r requirements.txt
):- argparse
- numpy
- pandas
- pyarrow
- pysam
-
SICILIAN has been developed with
R 3.6.1
and requires the following packages to be installed inR
(The R script in SICILIAN automatically checks if a package is already installed and will install any that are needed, so no manual preinstallation is needed!):- data.table
- glmnet
- tictoc
- dplyr
- stringr
- GenomicAlignments
- cutpointr
SICLIAN uses STAR as the aligner and therefore it needs STAR index files. SICILIAN also needs annotator pickle files for pulling gene names and adding them to the detected junctions. There are two options for the index and annotator files required for running SICILIAN: 1- download pre-built ready-to-use files or 2- build these files using the instructions given below.
All STAR index and annotator pickle files needed for running SICILIAN can be downloaded using the following links for human
, mouse
, mouse lemur
, and COVID-19
genomes. All you need to run SICILIAN is to download these files. Each file can be downloaded directly from Google Drive or Zenodo using the following links. The folder which should be untarred after downloading contains a subfolder for STAR index files (built based on STAR 2.7.5) and five other folders containing the GTF and the annotator pickle files needed for annotating the splice junctions called by SICILIAN:
-
Human hg38:
-
Mouse mm10:https://doi.org/10.5281/zenodo.6462332
-
Mouse Lemur Mmur3:
-
SARS-COV-2:
The other option is to build these files manually by using the following instructions. STAR index files can be built using the following command:
STAR --runThreadN 4 --runMode genomeGenerate --genomeDir Genome_data/star --genomeFastaFiles fasta_file.fa --sjdbGTFfile gtf_file.gtf
The annotator pickle file for a GTF file needs to be made only once by the create_annotator.py
script in the SICILIAN scripts directory, using the following command:
python3 create_annotator.py -g gtf_file.gtf -a annotation_name
annotation_name
can be set to any arbitrary name but we recommend it contains the name and the version of the annotation (i.e., hg38_gencode_v33
).
After running the above command,create_annotator.py
will create 3 different pickle files in the current working directory (determined by os.getcwd()
in python): annotation_name.pkl
, annotation_name_exon_bounds.pkl
, and annotation_name_splices.pkl
.
annotation_name.pkl
: is a required input for SICILIAN and is used to add gene names to junction IDsannotation_name_exon_bounds.pkl
: is an optional input for SICILIAN and is used to determine whether or not each splice site (5' and 3') of a splice junction is an annotated exon boundaryannotation_name_splices.pkl
: is an optional input for SICILIAN and is used to determine whether or not the splice junction is annotated in the annotation gtf file
Update the input parameters in SICILIAN.py
script with information about your sample, genome assembly and annotations, and STAR alignment.
data_path
: specifies the path to the directory that contains the fastq files for the input RNA-Seq data.names
: specifies the name of the fastq file for the input RNA-Seq data (without suffix)r_ends
: list of unique endings for the file names of R1 and R2 fastq files. Example:r_ends = ["_1.fastq.gz", "_2.fastq.gz"]
out_dir
: specifies the path to the directory that will contain the folder specified byrun_name
for SICILIAN output files.run_name
: folder name for the SICILIAN output files.star_path
: the path to the STAR executable filestar_ref_path
: the path to t:q he STAR index filesgtf_file
: the path to the GTF file used as the reference annotation file for the genome assembly.annotator_file
: the path to theannotation_name.pkl
fileexon_pickle_file
: the path to theannotation_name_exon_bounds.pkl
file (this is an OPTIONAL input)splice_pickle_file
: the path to theannotation_name_splices.pkl
file (this is an OPTIONAL input)domain_file
: the path to the reference file for annotated protein domains downloaded from UCSC used for finding missing and inserted protein domains in the splice junction (this is an OPTIONAL input)single
: set equal toTrue
if the data is single-end, andFalse
if it is paired-end. Note that currently ifsingle = True
it is assumed that the single read to be aligned is in the second fastq file (this aligns with the setup of SICILIAN for droplet (10x) single-cell protocols, whereR1
contains the cell barcode and UMI information, and R2 contains the actual cDNA information). This also causes the files to be demultiplexed to create a new fastq file before they're mapped.tenX
: set equal toTrue
if the input RNA-Seq data is 10x andFalse
otherwise.stranded_library
: set equal toTrue
if input RNA-Seq data is based on a stranded library andFalse
otherwise. (for stranded libraries such as 10x,stranded_library
should be set toTrue
). Whenstranded_library
is set toTrue
, strand orientations from the alignment bam file will be used as the strand orientation of the junction. For unstranded libraries, SICILIAN uses gene strand information from the GTF file as the read strand is ambiguous.bc_pattern
: this parameter is needed only for 10x data and determines the barcode/UMI pattern in R1. For V3 chemistry in which UMI has 12 bps,bc_pattern
should be set to"C"*16 + "N"*12
and for 10x data based on V2 chemistry it should be set to"C"*16 + "N"*10
.bc_pattern
is needed forUMI_tools
steps before STAR alignment on input 10x data.
STAR alignment parameters can be adjusted in the STAR_map
function in SICILIAN.py
. By default, SICILIAN runs STAR with default parameters.
These parameters let you determine which steps of the SICILIAN pipeline will be run (for example, if you have already run STAR alignment for a sample but SICILIAN has failed at one of the subsequent steps, you can skip over the STAR alignment step by setting run_map
to False
) when rerunning SICILIAN:
run_whitelist
: Set equal toTrue
if your input data is 10X want to run UMI-tools whitelist script to extract cell barcodes and identify the most likely true cell barcodes (will be run only for 10X)run_extract
: Set equal toTrue
if you want to run UMI-tools extract script which removes UMIs from fastq reads and append them to read name (will be ron only for 10x)run_map
: Set equal toTrue
if you want to run the STAR alignment, andFalse
otherwiserun_class
: Set equal toTrue
if you want to run the class input job, andFalse
otherwiserun_GLM
: Set equal toTrue
if you want to run the GLM step and assign statistical scores to each junction in the class input file. The output of this step is a file namedGLM_output.txt
.
After assigning these variables, run python3 SICILIAN.py
on the command line to submit the SICILIAN jobs for the input data.
The output files by SICILIAN will be written in the output folder specified by out_dir/run_name
and will contain both the output files generated by STAR and also output files generated by STAR postsprocessing modeling.
We currently run STAR with --outSAMtype BAM Unsorted
and --chimOutType WithinBAM SoftClip Junctions
; therefore, STAR will produce a single BAM file that will contain both Aligned and Chiemric alignments. In addition to the alignment BAM file and STAR log files, there will be Chimeric.out.junction
, SJ.out.tab
, and Unmapped.out.mate
files, containing spliced junctions, summary of all chimeric alignments, and unmapped reads, respectively. For paired-end data as SICILIAN runs STAR independently for R1
and R2
reads, STAR output file names for R1
begin with 1 (i.e., 1Aligned.out.bam
) and those for R2
begin with 2 (i.e., 1Aligned.out.bam
). For single-end reads, all STAR output files withh begin with 2. A comprehensive description of the STAR output files can be found in STAR manual
Class input file is a file created by the run_class
step in SICILIAN and contains the information for all chimeric and spliced alignments extracted from the STAR BAM file. The class input file is saved in both parquet and tsv formats (class_input.tsv
and class_input.pq
; same for class_input_secondary
which contains only secondary alignments). Each row in the class input file contains the alignment positions, read id, and alignment features needed for statistical modeling for a spliced/chimeric alignment. Class input file is the input file for SICILIAN statistical modeling step activated by the run_GLM
flag.
Junction classification There are four categories for junctions in SICILIAN, depending on the relative positions of acceptor and donor sites and how far apart they are from each other.
linear
: acceptor and donor are on the same chromosome and strand, closer than 1 MB to each other, and are based on the reference genome canonical orderingrev
: acceptor and donor are on the same chromosome and strand, closer than 1 MB to each other, and are ordered opposite of the reference genome canonical ordering (evidence for circRNAs)sc
: local strandcrosses in which acceptor and donor are on the same chromosome but opposite strands.fusion
: fusion junctions in which acceptor and donor are on different chromosomes, or on the same chromosome and strand but farther than 1MB from each other.
This file is built by the last step in SICILIAN which performs the actual statistical modeling to assign a statistical score to each candidate junction (activated by run_GLM
flag) and contains all junctions that exist in class_input.tsv
along with their statistical scores and junction-level alignment summaries.
The following columns exist in the GLM_output.txt
file:
refName_newR1
: the junction id that contains gene name, positions, and strands for both acceptor and donor sides of the junctionfileTypeR1
: equalsAligned
if the junction comes fromAligned
alignments in the BAM file and equalsChimeric
if it comes fromChimeric
alignments in the BAM filejuncPosR1A
: position of the donor (5') side of the junctionjuncPosR1B
: position of the acceptor (3') side of the junctionchrR1A
: chromosome of the donor (5') side of the junctionchrR1B
: chromosome of the acceptor (5') side of the junctionread_strandR1A
:read_strandR1B
:gene_strandR1A
: The strand that the gene atjuncPosR1A
is on (+
or-
) based on the GTF filegene_strandR1B
: The strand that the gene atjuncPosR1B
is on (+
or-
) based on the GTF file *geneR1A_uniq
: gene name for the donor (5') side of the junctiongeneR1B_uniq
: gene name for the acceptor (3') side of the junctiongeneR1A_ensembl
: the gene ensembl id forgeneR1A_uniq
geneR1B_ensembl
: the gene ensembl id forgeneR1B_uniq
numReads
: number of reads that have been aligned to the junctiongeneR1A_expression
: the gene counts (HTseq counts) forgeneR1A_uniq
according to column V3 in1ReadsPerGene.out.tab
for PE data (2ReadsPerGene.out.tab
for SE data)geneR1B_expression
: the gene counts (HTseq counts) forgeneR1B_uniq
according to column V3 in1ReadsPerGene.out.tab
for PE data (2ReadsPerGene.out.tab
for SE data)median_overlap_R1
: median of junction overlaps across reads aligned to the junctionsd_overlap
: the standard deviation of junction overlap across reads aligned to the junctionthreeprime_partner_number_R1
: number of distinct 3' splice sites across the class input file for the 5' side of the junctionfiveprime_partner_number_R1
: number of distinct 5' splice sites across the class input file for the 3' side of the junctionp_predicted_glmnet_constrained
: aggregated score for the junctionp_predicted_glmnet_corrected_constrained
: aggregated score for the junction with correction for anomalous reads (calculated only for paired-end data)junc_cdf_glmnet_constrained
: cdf ofp_predicted_glmnet_constrained
relative to its null distribution by randomly assigning reads to junctionsjunc_cdf_glmnet_corrected_constrained
: cdf ofp_predicted_glmnet_corrected_constrained
relative to its null distribution by randomly assigning reads to junctionsfrac_genomic_reads
: fraction of aligned reads to the junction that have been also mapped to a genomic regionave_min_junc_14mer
: the average ofmin_junc_14mer
across the reads aligned to the junctionave_max_junc_14mer
: the average ofmax_junc_14mer
across the reads aligned to the junctionave_AT_run_R1
: the average ofAT_run_R1
across the reads aligned to the junction, whereAT_run_R1
is the length of the longest run of A's (or T's) for a read alignment in R1ave_GC_run_R1
: the average ofGC_run_R1
across the reads aligned to the junction, whereGC_run_R1
is the length of the longest run of G's (or C's) for a read alignment in R1ave_max_run_R1
: the average ofmax_run_R1
or max(AT_run_R1
,GC_run_R1
) across the reads aligned to the junction in R1ave_entropy_R1
: the average of read sequence entropy calculated based on 5-mers.frac_anomaly
: the fraction of the aligned reads for the junction that are anamolousp_val_median_overlap_R1
: the p-value of the statistical test for comparing the median overlaps of the aligned reads to the junction against the null of randomly aligned reads and small p-values are desired as they indicate that the median_overlap of the junction is large enough.emp.p_glmnet_constrained
: tme empirical p-value obtained based onjunc_cdf_glmnet_constrained
(the statistical score used for calling junctions in SE data)emp.p_glmnet_corrected_constrained
: tme empirical p-value obtained obtained based onjunc_cdf_glmnet_corrected_constrained
(the statistical score used for calling junctions in SE data)seqR1
: a representative sequence for the junction (the sequence of one of the reads aligned to the junction)
sicilian_called_splice_juncs.tsv
is the final output file by SICILIAN that contains the splice junctions called based upon SICILIAN statistical modeling. This file also includes extra annotation information (for splice junction, exon boundaries, and protein domain) for the called junctions. In addition to the columns that are also present in GLM_output.txt
, sicilian_called_splice_juncs.tsv
contains the following columns for the annotation of the called junctions:
splice_ann
: both sides of the junction are annotated as exon boundaries and the junction itself is found in the GTF; subset ofboth_ann
both_ann
: both sides of the junction are annotated as exon boundaries; this is equivalent toexon_annR1A
ANDexon_annR1B
exon_annR1A
: the exon on the first half of the junction is at an annotated boundaryexon_annR1B
: the exon on the second half of the junction is at an annotated boundarymissing_domains
: the protein domains that have been missed (spliced out) due to splicingdomain_insertions
: the protein domains to which the splice junction adds amino acid sequence