Tutorial: running SQANTI3 on an example dataset

Video tutorial

Watch Tutorial Video on YouTube

Step-by-Step tutorial

The document is structured to guide users through various aspects of the SQANTI3 tool.

SQANTI3 Installation: This section will guide users on how to install the SQANTI3 tool. It may include prerequisites, dependencies, and step-by-step instructions for a successful installation.
Example Dataset Overview: Before diving into the execution of SQANTI3, users will be introduced to a sample dataset. This dataset includes directories and files that encompass the output from various phases of the SQANTI3 pipeline and the necessary input files to operate the software. Organized under directories and files, users will find outputs from machine learning filters, quality control, and rules-based filtering processes. In addition to the genomic and annotation files for chromosome 22, there are paired-end sequencing read files for two replicates of the Universal Human Reference (UHR) sample and scripts demonstrating how to run different SQANTI3 modules on the provided data. This comprehensive example dataset ensures users have a foundational understanding of the expected data structure and the processes involved in SQANTI3 operations.
Executing and Understanding SQANTI3 QC: Here, users will learn how to run the quality control (QC) component of SQANTI3 which focuses on characterizing transcriptomes generated from lrRNA-Seq data, comparing them to reference annotations, and leveraging various datasets including short-read sequencing, CAGE, and Quant-Seq for a comprehensive evaluation. The guide will delve into the details of FSM and ISM subcategories, the integration of TSS and TTS evidence, and the internal handling of short-read data using tools like STAR and Kallisto.
Executing and Understanding SQANTI3 Filter: This segment introduces the filtering capabilities of SQANTI3. Users will be guided on how to utilize both the machine learning-based and rules-based methods to efficiently differentiate between genuine isoforms and potential artifacts, ensuring a high-quality transcriptome tailored to their research needs.
Executing and Understanding SQANTI3 Rescue: The final section will focus on the 'rescue' functionality of SQANTI3. Users will learn the process of executing this feature and, just as importantly, understand the results and the rationale behind 'rescuing'. The SQANTI3 rescue algorithm aims to retrieve potentially discarded transcripts during long-read data processing by identifying appropriate reference transcripts, ensuring they meet quality criteria and are non-redundant, and then incorporating them into the final transcriptome.

1. Installation

1.1 Dependencies

General: Perl, Minimap2, Python (3.7), R (>= 3.4.0), kallisto, samtools, STAR, uLTRA, deSALT, pip
Python Libraries: bx-python, BioPython, BCBioGFF, cython, NumPy, pysam, pybedtools, psutil, pandas, scipy
R Libraries: R packages for sqanti3_qc.py and sqanti3_filter.py (installed with conda environment)

1.2 Installation

1.2.1. Anaconda: Install/update Anaconda. Add to PATH and update if needed: bash export PATH=$HOME/anacondaPy37/bin:$PATH conda -V conda update conda

1.2.2 SQANTI3: Download SQANTI3 v5.2.2 and extract: bash wget https://github.com/ConesaLab/SQANTI3/archive/refs/tags/v5.2.2.tar.gz tar -xvf v5.2.2.tar.gz cd SQANTI3

1.2.3 Conda Environment: Create and activate environment: bash conda env create -f SQANTI3.conda_env.yml source activate SQANTI3.env

1.2.4. gtfToGenePred: Download gtfToGenePred to SQANTI3/utilities and make executable: bash wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred -P <path_to>/SQANTI3/utilities/ chmod +x <path_to>/SQANTI3/utilities/gtfToGenePred

2. Example Dataset Overview

This section provides a brief overview of the example dataset to be utilized throughout this folder. The dataset consists of various files and directories that represent the output from different stages of the SQANTI3 pipeline and the requisite input files for running SQANTI3.

2.1 Directories

SQANTI3_QC_output: Contains output files generated by the SQANTI3 Quality Control (QC) process.
MLfilter_output: This directory contains output files from the machine learning filter process in SQANTI3.
rules_filter_output: Houses output files from the rules-based filtering process in SQANTI3.

2.2 Files

Genome and Annotation Files:
- GRCh38.p13_chr22.fasta: A FASTA file containing the sequence of chromosome 22 from the GRCh38.p13 version of the human genome.
- GRCh38.p13_chr22.fasta.fai: Index file for the FASTA file, facilitating quick access to different regions within the file.
- Homo_sapiens_GRCh38_Ensembl_86.chr22.gff3: A GFF3 file containing an annotation for chromosome 22 from the Ensembl 86 release of the human genome.
- gencode.v38.basic_chr22.gtf: A GTF file containing GENCODE basic gene annotation for chromosome 22 from the GENCODE v38 release.
Sample Data:
- UHR_Rep1_chr22.R1.fastq.gz and UHR_Rep1_chr22.R2.fastq.gz: Paired-end sequencing read files for replicate 1 of a Universal Human Reference (UHR) sample, focused on chromosome 22.
- UHR_Rep2_chr22.R1.fastq.gz and UHR_Rep2_chr22.R2.fastq.gz: Paired-end sequencing read files for replicate 2 of a UHR sample, focused on chromosome 22.
- UHR_abundance.tsv: A tab-separated file containing transcript abundance estimates for the UHR sample.
- UHR_chr22.gtf: A GTF file containing transcript annotations for the UHR sample, focused on chromosome 22.
- UHR_chr22_short_reads.fofn: A file of file names (FOFN) listing short-read data files related to the UHR sample on chromosome 22.
Example Run Scripts:
- run_SQANTI3_QC.sh: A shell script demonstrating how to run the SQANTI3 Quality Control (QC) process on the example data.
- run_SQANTI3_MLfilter.sh: A shell script demonstrating how to run the SQANTI3 machine learning filter.
- run_SQANTI3_rules_filter.sh: A shell script demonstrating how to run the SQANTI3 rules-based filtering process on the example data.

3. Executing and Understanding SQANTI3 QC (Video for this part)

SQANTI3 is the updated iteration of the SQANTI tool designed for quality control and characterization of transcriptomes built from lrRNA-Seq data. Central to SQANTI3 is its Quality Control (QC) module, which evaluates the de novo transcriptome against a reference annotation and outputs a classification file. This classification not only delves deep into transcript variations like Full Splice Matches (FSM) and Incomplete Splice Matches (ISM) but also introduces new subcategories for them. Additionally, SQANTI3 QC integrates evidence from both short and long-read data, even accepting orthogonal data types like CAGE and Quant-Seq, to support and validate transcript start and end sites. The software has also been enhanced to handle short-read data internally, incorporating tools like STAR and Kallisto for mapping and quantification.

3.1 Running SQANTI3 QC with example command:

The provided command runs the sqanti3_qc.py script which is designed for structural and quality annotation of novel transcript isoforms.

3.1.1 Example script

python sqanti3_qc.py example/UHR_chr22.gtf example/gencode.v38.basic_chr22.gtf example/GRCh38.p13_chr22.fasta    \
                     --CAGE_peak data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed    \
                     --polyA_motif_list data/polyA_motifs/mouse_and_human.polyA_motif.txt    \
                     -o UHR_chr22 -d example/SQANTI3_output -fl example/UHR_abundance.tsv    \
                     --short_reads example/UHR_chr22_short_reads.fofn --cpus 4 --report both

3.1.2 Positional Arguments:

example/UHR_chr22.gtf:
- Input isoforms file in GTF format.
example/gencode.v38.basic_chr22.gtf:
- Reference annotation file in GTF format.
example/GRCh38.p13_chr22.fasta:
- Reference genome file in FASTA format.

3.1.3 Optional Arguments:

--CAGE_peak data/ref_TSS_annotation/human.refTSS_v3.1.hg38.bed:
- Specifies a BED format file with FANTOM5 CAGE Peak data for annotating transcription start sites (TSS).
--polyA_motif_list data/polyA_motifs/mouse_and_human.polyA_motif.txt:
- Provides a ranked list of polyA motifs that are sequences often found near polyadenylation sites.
-o UHR_chr22:
- Sets the prefix for the output files to UHR_chr22.
-d example/SQANTI3_output:
- Designates the directory where output files will be saved: example/SQANTI3_output.
-fl example/UHR_abundance.tsv:
- Specifies a file containing full-length PacBio abundance data.
--short_reads example/UHR_chr22_short_reads.fofn:
- Provides a File Of File Names (fofn) with paths to FASTA or FASTQ files from short-read RNA-Seq.
--cpus 4:
- Sets the number of threads (or CPU cores) to be used during alignment by aligners to 4.
--report both:
- Specifies the report format, producing outputs in both HTML and PDF formats.

Note: The backslashes (\) at the end of lines in the command are used to split a lengthy command into multiple readable lines. Each backslash indicates the continuation of the command on the next line.

3.2 Understanding SQANTI3 QC:

SQANTI3 QC produces a series of output files following its execution, using flags like --dir or -d and --output or -o to specify the directory and file prefix, respectively. This tool also relies on other software, including STAR, kallisto, and GMST, which means there will be subfolders in the output directory. For an example output, refer to the example/SQANTI3_QC_output subfolder in the main SQANTI3 directory.

3.2.1 SQANTI3 transcriptome classification (Video link for this part)

SQANTI3 sorts each isoform by comparing it to a known reference transcript. Here's how it categorizes them:

FSM (Full Splice Match) Subcategories:

The reference and query isoform both have the same exon count, with each internal junction aligning to the reference. However, the exact start (5') and end (3') can vary within the first and last exons.

Rrefernece Match: The query isoform matches the reference isoform exactly (less than 50bp), including both internal junctions and exact 5' start and 3' end.
5' Mismatch: The internal junctions match, but the 5' start is different (difference higher than 50bp).
3' Mismatch: The internal junctions match, but the 3' end is different (difference higher than 50bp).
5' and 3' Mismatch: The internal junctions match, but both the 5' and 3' ends are different (difference higher than 50bp).

ISM (Incomplete Splice Match)

The query isoform has fewer outer exons compared to the reference, but all internal junctions align with the reference's positions. The exact start (5') and end (3') can vary within the initial/final exons.

5' fragments: The query isoform is missing some 5' exons compared to the reference.
3' fragments: The query isoform is missing some 3' exons compared to the reference.
Internal fragments: Missing some start and end exons but retains the internal exons.
intron retention: Contains an intron that is usually spliced out in the reference.

NIC (Novel In Catalog) Subcategories & NNC (Novel Not in Catalog) Subcategories:

NIC (Novel In Catalog): The term "Novel In Catalog" (NIC) refers to a subtype of novel isoforms that are characterized by their utilization of known splice junctions. While the junctions, which are pairs of donor-acceptor sites, are previously identified and cataloged, the specific combinations in which these junctions are assembled in the isoform are novel. Essentially, NIC represents isoforms that employ previously recognized junctions in a unique or previously undocumented manner.

NNC (Novel Not in Catalog): The "Novel Not in Catalog" (NNC) denotes a subtype of novel isoforms where at least one of the splice sites, either donor or acceptor, is entirely novel and has not been cataloged previously. Even if the other splice sites involved are known, the presence of a novel splice site means that the isoform falls under the NNC category. This subtype signifies isoforms that introduce new splicing patterns to the catalog due to the presence of previously unidentified splice sites.

NIC Combination of known SJ: In this category, the query isoform is formed by combining known splice junctions. In essence, even though the particular arrangement of splice junctions in the isoform is novel, all the individual splice junctions themselves have been previously cataloged.
NIC Combination of known splice site: This refers to a query isoform composed of known splice sites. Splice sites are specific nucleotide sequences at the ends of introns that signal the splicing machinery. A splice site can be either a donor (at the start of an intron) or an acceptor (at the end of an intron). Even if the precise configuration or combination in the isoform hasn't been seen before, the individual splice sites have been documented previously.
NIC Intron Retention: This scenario involves a query NIC isoform that retains an intron that is usually spliced out in other known isoforms.
NIC Mono-exon: This scenario involves a NIC isoform that consists of a single exon without any introns.

3.2.2 TSS (Transcription Start Site) Evaluation in SQANTI3 (Video link for this part)

The accurate identification of Transcription Start Sites (TSS) stands as a cornerstone in the realm of molecular biology and genomics. TSSs are the specific locations on the DNA where transcription of a particular gene commences. Recognizing these sites with precision is pivotal.

Distance from the Reference TSS:

First, it assesses how far a detected TSS falls from the reference TSS.
The proximity of your TSS to the reference can indicate its accuracy. If it's too far off, it may be a sign of an inaccurate TSS detection.

Using CAGE Peaks Data:

Cap Analysis Gene Expression (CAGE) is a technique to identify the TSS of genes. SQANTI uses CAGE peaks data to help distinguish between true TSS and potential false positives. If the detected TSS aligns with a CAGE peak, it's more likely to be a true TSS. In contrast, if it doesn't align with any CAGE peaks, it might be a false detection.
The TSS ratio is computed for each transcript by determining the ratio between short-read coverage downstream and upstream of the TSS. A true TSS is expected to have much lower upstream coverage, resulting in a TSS ratio greater than one. On the other hand, 5’ end-degraded transcripts are expected to have uniform coverage on both sides of the TSS, leading to a TSS ratio approximately equal to one

3.2.3 SQANTI3 Transcriptional termination site quality control (Video link for this part)

SQANTI3 uses three quality control metrics for evaluating transcription termination sites (TTS) in transcriptome:

Distance to Annotated TTS: Measures the gap between a transcript's TTS and the nearest known TTS from reference databases to spot novel or existing termination sites.
Polyadenylation (polyA) Motif Detection: Scans for a specific sequence near the 3' end of transcripts that signals for polyadenylation, serving as a biological marker for genuine TTS.
Quant-seq Data Support: If available, this data provides additional validation for the 3' ends, enhancing confidence in the identified TTS.

Check different TTS:

Distance to Annotated TTS:

What: SQANTI3 calculates the physical distance between the TTS of the newly sequenced transcript and the closest annotated TTS belonging to the same gene in reference databases.
Why: By doing this, SQANTI3 can evaluate if a transcript's TTS is novel or if it matches (or is proximate to) a previously known termination site. If a transcript's TTS is very distant from any annotated TTS, it may be a novel site or an artifact, and further validation may be needed. Polyadenylation (polyA) Motif Detection:

Polyadenylation (polyA) Motif Detection:

What: Transcripts typically have a sequence motif near their 3' end that signals for polyadenylation (addition of a polyA tail). SQANTI3 scans the transcript sequence to detect this motif.
Why: The presence of a polyA motif can indicate a genuine TTS, as the motif is a biological marker for transcript termination and subsequent addition of a polyA tail. The text also mentions that these detected motifs are commonly found 16-18 base pairs from the end of the transcript, consistent with known experimental evidence. This adds an extra layer of validation. Absence or unusual positioning of this motif might suggest the 3' end of the transcript was inaccurately determined or that alternative polyadenylation sites exist.

Quant-seq Data Support (polyA site data):

What: Quant-seq is a method to sequence the 3' ends of RNAs and is used to validate the 3' ends obtained from other sequencing methods. If Quant-seq data is available, SQANTI3 considers it as orthogonal evidence to validate the TTS of a transcript.
Why: Quant-seq specifically targets the 3' end of transcripts, making it a reliable method to validate TTS. If both long-read sequencing (like from PacBio) and Quant-seq point to the same TTS, confidence in that TTS's accuracy increases.

3.2.3 SQANTI3 Splicing junction classification (Video link for this part)

When it comes to splice junctions, here's how SQANTI approaches the task:

Transcript Classification Based on Splice Junctions:

SQANTI classifies transcripts by comparing their splice junctions to those in a provided reference transcriptome.

Full Splice Match (FSM): Transcripts that match a reference transcript at all splice junctions.
Incomplete Splice Match (ISM): Transcripts that match consecutive, but not all, splice junctions of the reference transcripts. Those matching mostly the UTR3 sequence of their reference are labeled UTR3 Fragment.
Novel Transcripts: Transcripts introducing new splice patterns are divided into:
- Novel in Catalog (NIC): Contains either new combinations of already annotated splice junctions or novel ones formed from previously annotated donors and acceptors.
- Novel Not in Catalog (NNC): Utilizes novel donors and/or acceptors.

Canonical vs Noncanonical Splicing:

Splice junctions are categorized into canonical and noncanonical based on the dinucleotide pairs at their start and end.

Canonical splicing includes the combinations GT-AG, GC-AG, and AT-AC, found in over 99.9% of human introns.
All other combinations are considered noncanonical splicing.
Users can also supply their own definitions of canonical junctions.

Splice Junction Coverage with Short Reads:

SQANTI3 incorporates short-read data by internally running STAR and Kallisto for mapping and quantification of splice junction. The combination of both long and short reads provides a dual layer of evidence. If a novel splice junction is detected by both read types, there's increased confidence in its authenticity. Unlike SQANTI, SQANTI3 can process raw FASTQ files from Illumina directly, producing short read-related metrics. Here are the details of how SQANTI3 processes short reads data:

A genome index is prepared and short reads are mapped using STAR.
The mapping does not use a reference annotation, ensuring that SJ identification is not influenced by previous annotations.
STAR's parameters follow the ENCODE-DCC RNA-seq protocol. To enhance novel SJ detection, the --twopassMode option is turned on.
Post STAR execution, two files are generated per replicate: SJ.out.tab, which provides quantification of long-read-defined SJs by short reads, and a BAM file, utilized for the novel TSS ratio metric computation.

3.2.4 SQANTI3 Intra-priming (Video link for this part)

Researchers use specific oligos, anywhere between 20-35 bases long, called primers, to initiate the copying or amplifying process of DNA. Typically, an 'oligodT' primer is custom-designed to match the target sequence at the end of a transcript, which often has a long string of 'A's known as the polyA tail. However, given that there are also A-rich regions within the transcripts (not just at the end), the oligodT primer might inadvertently begin amplification from these internal regions instead. If these transcripts don't have a typical ending (consensus poly Adenilation site), SQANTI thinks they might be mistakes from internal priming and removes them.

Mechanisms Intra-priming:

Imagine your DNA is a long string of letters, with 'A' being one of them. Sometimes, there's a region in the DNA that has lots of 'A's, either 6 of them in a row or 12 out of a group of 20 letters.
When a small piece of DNA (or a primer) finds this A-rich area, it can stick to it and start copying the DNA from there.
This is called "internal priming" because it's happening inside the DNA sequence and not where it's supposed to be.

How SQANTI3 Check Intra-priming

Some tools use an 'oligodT' to start the DNA copying process at the end of a transcript, which typically has a long string of 'A's called the polyA tail. However, since there are A-rich regions inside transcripts too (not just at the end), the 'oligodT' might mistakenly start copying from there.SQANTI checks for this by looking at transcripts and seeing if there's a very A-rich region (>= 80% A's in the last 20 letters) at their end.

3.2.5 SQANTI3 RT-switching (Video link for this part)

RT-switching refers to the "jumping" or switching of the reverse transcriptase enzyme between repeated regions of an RNA molecule due to its secondary structure. This can lead to the formation of cDNA molecules that don't accurately represent the original RNA sequence and can introduce artifacts when analyzing RNA sequences or structures.

Here's a step-by-step explanation of RT-switching:

Reverse Transcription Start: The process begins with the initiation of reverse transcription. This is where the reverse transcriptase enzyme (RT) starts transcribing RNA into complementary DNA (cDNA). The RNA molecule has a secondary structure and repeated regions which play a role in RT-switching.
RNA Secondary Structure and Repeated Regions: RNA molecules can form secondary structures due to complementary base pairing within the molecule itself. These structures may bring repeated regions into close proximity. When the RT enzyme encounters these structures, it might jump from one repeated region to another.
Formation of cDNA with Skipped Repeats: When the RT enzyme switches or "jumps" between repeated regions, the resulting cDNA molecule may not have all the repeated regions that were originally present in the RNA. Instead, the cDNA will possess only one copy of the direct repeats.
Novel Splice Junctions Appear: When the cDNA, which has undergone RT-switching, is sequenced or mapped, it might appear as though it has novel splice junctions. These are not true biological splice junctions but are artifacts caused by the RT-switching event.

4. Executing and Understanding SQANTI3 Filter (Video link for this part)

SQANTI3 has enhanced its filtering capabilities by improving its machine learning-based filter and introducing a flexible rules-based strategy, allowing users to discern between true isoforms and artifacts more accurately.

4.1 SQANTI3 rules filter (Video link for this part)

SQANTI3 has introduced a rules-based strategy for filtering, especially when defining TP and TN sets is challenging. Here, a JSON file is used to specify the characteristics of a reliable isoform. The file has rules and requisites, with each rule consisting of multiple requisites. All requisites of a rule must be met for a transcript to be deemed true. Multiple rules for the same category are evaluated independently, and a transcript only needs to pass one of these rules. The default filter has specific sets of rules for different transcript categories, but users can define their criteria for more tailored results.

4.1.1 Rules filter example:

To filter a classification file and generate outputs with the prefix "filtered_results", you can use:

python sqanti3_filter.py rules path/to/classification.txt -o filtered_results

Note: this will apply default filters and produce filtered files with the "filtered_results" prefix in the directory where the script is run.

4.1,2 Positional Argument:

sqanti_class: Path to the SQANTI3 QC classification file.

4.1.3 Optional Arguments:

-h, --help: Displays help message and exits.
--isoAnnotGFF3 ISOANNOTGFF3: Path to the isoAnnotLite GFF3 file that is to be filtered.
--isoforms ISOFORMS: Path to the fasta/fastq isoform file that is to be filtered.
--gtf GTF: Path to the GTF file that is to be filtered.
--sam SAM: Path to the SAM alignment of the input fasta/fastq.
--faa FAA: Path to the ORF prediction faa file to be filtered by SQANTI3.
-o OUTPUT, --output OUTPUT: Specifies the prefix for output files.
-d DIR, --dir DIR: Sets the directory for output files. Default is the directory where the script was run.
-v, --version: Displays the program version number.
--skip_report: When supplied, the filter will not generate a report.
-j JSON_FILTER, --json_filter JSON_FILTER: Specifies the JSON file where filtering rules are expressed. By default, the filter uses utilities/filter/filter_default.json.

4.1.4 Understand how to create SQANTI3 rules

Creating Custom Filters:

Starting with SQANTI3 version 5.0, users can specify any number of filtering rules in a JSON file.
The structure involves defining rules for each structural category, with each rule made up of various requisites.
- Rules for a category are treated as logical OR.
- Requisites within a rule are treated as logical AND.
The "rest" category can be used to specify rules for all other unspecified categories.

Rule Structure in JSON:

Rules can be set for any numeric or character column in the classification file:
- Numeric columns: Define an interval ([X,Y]) or just the lower limit.
- Character/logical columns: Define which terms or values are accepted.

Example of User-defined Filter (Video link for this part):

Full Splice Match Category:
- Isoforms must not be a potential intra-priming product.They should have less than 60 percent of genomic adenines in the downstream 20 BP window.
Incomplete Splice Match Category:
- Isoform must be larger than two kilobases and shorter than 15 kilobases.
- Must be cataloged within the three prime fragments, five Prime fragments, or internal fragment subcategories.
Novel In the Catalog Category:
- All splice junctions should be canonical or covered by at least 10 short reads.
Novel Not in the Catalog Category:
- All splice junctions should be canonical.
- The absolute distance to the closest annotated transcriptional start sites and transcriptional termination sites must be 50 BP or less or covered by at least 10 short reads and the absolute distance to the closest annotated transcriptional start sites and transcriptional termination sites must be 50 BP or less.
Rest of the Categories:
- Transcripts will be kept if:
  - They are not suspected of being an RT switching artifact.
  - All of their splice junctions are canonical.
  - The transcript is coding.
  - It is not an intra-priming product.
  - It is not a mono exon transcript.

An example JSON filter is provided that defines rules for different isoform categories, including "full-splice_match", "incomplete-splice_match", "novel_in_catalog", "novel_not_in_catalog", and "rest". This filter specifies various conditions for keeping isoforms, such as avoiding potential intrapriming products, setting length thresholds, and requiring canonical splice junctions.

In summary, SQANTI3 provides a flexible way for users to filter their transcriptomes based on specific criteria to curate high-quality transcript sets. This is facilitated through easy-to-define JSON-based rule sets.

4.2 SQANTI3 Machine Learning Filter (Video link for this part)

SQANTI3 has enhanced its machine learning-based filter (ML filter) from its predecessors. The filter, rooted in a random forest classifier, discerns between true isoforms and artifacts. The training relies on SQANTI3's QC attributes and the selection of true positive (TP) and true negative (TN) isoform sets. Users can now define these sets or let the software automatically pick them based on SQANTI3's classifications. By default, the filter considers NNC non-canonical isoforms as TN and reference match (RM) subcategories as TP. There's a balance in set size for training, and a default threshold is set where the transcript's likelihood of being a genuine isoform must be ≥0.7. Users can adjust this threshold and specify which isoforms are included or excluded. The filter now benefits from new validation metrics and additional data like CAGE-seq peaks, the short read-based TSS ratio metric, polyA motif details, and Quant-seq peaks. This allows for better artifact detection. Rules for model training can be customized to avoid overfitting. The classifier's output modifies the original *_classification.txt file to produce a *_MLresult_classification.txt file.

4.2.1 Example command:

python sqanti3_filter.py ml path/to/classification.txt

4.2.2 Positional Argument:

sqanti_class: SQANTI3 QC classification file.

4.2.3 Optional Arguments:

--isoAnnotGFF3: Specifies the isoAnnotLite GFF3 file to be filtered.
--isoforms: Designates the fasta/fastq isoform file for filtering.
--gtf: Indicates the GTF file for filtering.
--sam: Signifies the SAM alignment of the input fasta/fastq.
--faa: Represents the ORF prediction faa file for SQANTI3 filtering.
-o, --output: Prefix for output files.
-d, --dir: Output directory (default is the directory where the script was run).
-e, --filter_mono_exonic: Filters all mono-exonic transcripts by default.
-v, --version: Shows program version number.
--skip_report: Skips creating a filtering report.
-t, --percent_training: Data proportion for training (default is 0.8 or 80%).
-p, --TP: Path to file listing TP transcripts.
-n, --TN: Path to file listing TN transcripts.
-j, --threshold: Probability threshold to classify as positive isoforms (default is 0.7).
-f, --force_fsm_in: Forces the inclusion of FMS transcripts regardless of ML results.
--intermediate_files: Outputs intermediate files from ML filter.
-r, --remove_columns: Provides path to file with names of columns to be excluded during training.
-z, --max_class_size: Max number of isoforms for TP and TN sets (default is 3000).
-i, --intrapriming: Sets adenine percentage to flag an isoform as intra-priming (default is 60).

For an in-depth understanding of each parameter, you can delve into the detailed descriptions above. The ML filter allows users to define TP and TN sets or utilize the built-in sets. After defining the sets, you'll split them for model training/testing. The ML filter will utilize the probability threshold to label transcripts as Isoforms or Artifacts.

Additionally, certain columns from the classification table are omitted before running the SQANTI3 ML filter. Users have the liberty to exclude more columns if needed. There's an intra-priming filter included in the script and options to enforce the removal or retention of specific isoform groups.

4.2.4 Understand how the machine learning filter works

In machine learning, especially for supervised learning tasks, training data is pivotal. The model's ability to accurately predict or classify unseen data depends largely on the quality and quantity of the training data. In the context of the SQANTI3's machine learning filter, let's delve deeper into the nuances of training data and the model.

1. True Positive (TP) and True Negative (TN) Sets

Selection Criteria for TP and TN sets:

True Positive (TP): Reliable isoforms from which the model learns the features of a genuine isoform. The built-in selection includes Reference Match (RM) isoforms, and if there are insufficient RM isoforms, Full-Splice Match (FSM) isoforms are taken as TP.
True Negative (TN): These are isoforms considered as low-quality or artifacts. The built-in selection encompasses Novel Not in Catalog (NNC) isoforms with at least one non-canonical junction. If this group is small, all NNC isoforms are used as TN.

User-defined Training Sets: SQANTI3 offers flexibility. Users can supply their own TP and TN sets, tailoring the training data to their specific requirements. If the user has prior knowledge about certain transcripts being genuine isoforms or artifacts, this functionality becomes valuable.

Balancing the Training Sets: If there's a significant disparity between the sizes of the TP and TN sets, the larger set is downsized through random sampling to match the smaller set size. This ensures balanced training and avoids bias towards the over-represented class.

2. Partitioning the Training Data

Data Splitting: It's crucial to separate the data into training and testing sets. By default, 80% of the data is allocated for training, and 20% for testing. The rationale is to train the model on a majority of the data and validate its performance on unseen data.

Cross-validation: SQANTI3 employs a 10-fold cross-validation strategy during model training. This involves splitting the training data into 10 subsets; the model is trained on 9 subsets and validated on the 1 remaining subset. This process is repeated 10 times. Cross-validation helps in reducing overfitting and offers a more generalized model.

3. Random Forest Classifier

Random Forest is an ensemble learning method that combines multiple decision trees' outputs. It's known for its versatility, accuracy, and ability to handle large datasets with higher dimensionality.

Model Training: The training process utilizes functions from the caret R package. The model is trained using various features derived from the transcripts, such as length, exon count, junction characteristics, and more.

Prediction Threshold: Post-training, the model generates a probability score for a transcript being a genuine isoform. By default, a threshold of 0.7 is set, meaning transcripts with a probability score greater than this are labeled as isoforms.

Model Testing: The remaining 20% of the data, not used in training, is employed to evaluate the model's performance. This ensures the model's reliability in real-world scenarios.

Model Storage: Once trained, the model is saved as an .RData object. If the filter is rerun in the same directory, it will use the existing model, skipping the training step. This is handy for applying a pre-existing model to new datasets, but if a new model is desired, the previous model needs to be moved or deleted.

4. Feature Importance and Exclusion

Certain features may not contribute significantly to the model or might even introduce bias. SQANTI3 allows users to exclude specific features/columns from the training process. This flexibility ensures the model focuses on the most relevant features and is not distracted or biased by less pertinent information.

Excluding Data Columns: For preventing overfitting during model training, users can define and exclude specific columns from the training process. This ensures that the model doesn't unduly rely on certain features, which may not be generally applicable outside the training data.

Note The random forest classifier has specific scenarios where it might not run, so be mindful of exceptions.

5. Executing and Understanding SQANTI3 Rescue (Video link for this part)

The SQANTI3 rescue module is designed to prevent the loss of transcripts from long-read data that could not be accurately processed. It aims to identify valid transcript models for transcripts initially discarded due to potential artifacts after filtering using the SQ3 filter module. Full documentation can be found in the (wiki)

5.1 Example command:

5.1.1 Rules-based rescue

For this example, let's assume the following:

We're going to use the rules-based rescue (rules).
The path to the FASTA file generated by SQANTI3 QC is /path/to/isoforms_corrected.fasta.
The path to the GTF file output by SQANTI3 filter is /path/to/filtered.gtf.
The path to the reference genome FASTA used when running SQANTI3 QC is /path/to/reference.fasta.
The path to the reference transcriptome GTF used to run SQANTI3 QC is /path/to/reference.gtf.
The path to the reference transcriptome classification file is /path/to/reference_classif.txt (Explained in (wiki)).
We're going to use the default mode (automatic).
We will consider all mono-exon transcripts for the rescue.
The output directory will be /path/to/output_dir.
The prefix for the output files will be sqanti3_rescue_output.
The path to the JSON file including the rules used when running the SQANTI3 rules filter is /path/to/rules.json.
The SQANTI filter (ML or rules) output classification file is /path/to/filter_classif.txt.

sqanti3_rescue.py rules\
--isoforms /path/to/isoforms_corrected.fasta\
--gtf /path/to/filtered.gtf\
-f /path/to/reference.fasta
-g /path/to/reference.gtf\
-k /path/to/reference_classif.txt\
--mode automatic\
-e all\
-o sqanti3_rescue_output\
-d /path/to/output_dir\
-j /path/to/rules.json\
/path/to/filter_classif.txt`

5.1.2 Machine learning-based rescue

For this example, let's assume the following for machine learning rescue:

The path to the FASTA file generated by SQANTI3 QC is /path/to/isoforms_corrected.fasta.
The path to the GTF file output by SQANTI3 filter is /path/to/filtered.gtf.
The path to the reference genome FASTA used when running SQANTI3 QC is /path/to/reference.fasta.
The path to the reference transcriptome GTF used to run SQANTI3 QC is /path/to/reference.gtf.
The path to the reference genome FASTA used when running SQANTI3 QC is /path/to/reference_genome.fasta.
The path to the reference transcriptome classification file is /path/to/reference_classif.txt.
We're going to use the default mode (automatic).
We will consider all mono-exon transcripts for the rescue.
The output directory will be /path/to/output_dir.
The prefix for the output files will be sqanti3_ml_rescue_output.
The path to the randomforest.RData object obtained when running the SQANTI3 ML filter is /path/to/randomforest.RData.
The default machine learning probability threshold will be used (0.7).
The SQANTI filter (ML or rules) output classification file is /path/to/ml_filter_classif.txt.

The command to run sqanti3_rescue.py using the machine learning-based rescue would be:

sqanti3_rescue.py ml \
--isoforms /path/to/isoforms_corrected.fasta \
--gtf /path/to/filtered.gtf \
-f /path/to/reference.fasta
-g /path/to/reference.gtf \
-f /path/to/reference_genome.fasta \
-k /path/to/reference_classif.txt \
--mode automatic \
-e all \
-o sqanti3_ml_rescue_output \
-d /path/to/output_dir \
-r /path/to/randomforest.RData \
-j 0.7 \
/path/to/ml_filter_classif.txt

5.2 The Main Principles of SQANTI3 rescue:

Consistent Quality: Rescued transcript models should satisfy the quality control (QC) standards set during the filtering phase.
Non-redundancy: If the replacement transcript (identified for an artifact) already exists in the filtered transcriptome, it isn't added again.

5.3 Understand the Process of SQANTI3 rescue:

Automatic Rescue:
- Targets FSM (Full Splice Match) artifacts arising when there's insufficient confidence in validating Transcription Start Sites (TSS) or Transcription Termination Sites (TTS).
- If a reference transcript relates to a removed FSM, it's automatically added to the transcriptome. However, if multiple FSMs point to the same reference but have varied TSS, the reference is added just once.
- Artifacts from ISM, NIC, and NNC categories are seen as potential rescue candidates. ISM artifacts are considered only if they don't have an associated FSM with the same reference transcript.
Rescue Target Identification:
- Defines a group of "rescue targets" or potential replacements for the rescue candidates. This encompasses both long read-defined and reference isoforms that have at least one same-gene rescue target.
- Candidates are mapped to the targets in a "rescue-by-mapping" process. The tool used here is minimap2, set for long-read alignment.
Quality Control:
- Ensures the reference transcriptome targets used in the rescue adhere to the same QC standards as the long-read targets.
- The SQANTI3 QC and Filter modules (machine learning or rules) are applied on the reference transcriptome, ensuring the same data and quality standards as were used for the long-read transcript models.
Final Selection:
- Filters out mapping hits if the rescue target didn't pass the established filters.
- If multiple valid mapping hits exist per candidate, the one with the highest probability (from either long-read or reference) will be chosen. If more than one hit passes the rules filter, all targets are considered.
- If the best matching target is a long read-defined isoform already in the transcriptome, nothing changes since the artifact is already represented.
- The remaining reference targets are checked for redundancy and added only if they aren't present in the already curated transcriptome.

In essence, the SQANTI3 rescue module seeks to ensure no valuable transcript data is lost while maintaining stringent quality and non-redundancy principles.

Wiki index