Vignette Long Read Proteogenomics Workflow with Test Data
Here we repeat the steps found in the README at the top of this repository, but we walk through all of the processes in the workflow and describe the input files, the processing steps and the outputs produced, so that visitors can reproduce the results and see them for themselves. Throughout, the emphasis is on which specific files are required so that a visitor can use this repository to analyze their own data.
Please note that machines need to be sized to match the size and number of samples to be processed. This vignette uses an artificially small sample, in general limited to one chromosome (Chr 22) and one mass spec file, but it is to illustrate the inputs, processing and output that is performed by this Nextflow workflow.
This quick start and its steps were performed on a MacBook Pro running Big Sur Version 11.4 with 16 GB 2667 MHz DDR4 RAM and a 2.3 GHz 8-Core Intel Core i9 processor.
The test with this data completed as follows:
Duration : 39m 40s
CPU hours : 1.4
Succeeded : 37
The visitor will be walked through the prerequisites, cloning the repository, and executing the workflow with the same demonstration data used in the GitHub Actions.
In this quick start, the Docker Desktop application for a Mac with an Intel chip was used. Follow the instructions there to install it. On the MacBook Pro running Big Sur Version 11.4 with 16 GB RAM, it was necessary to configure the Docker Desktop resources to use 6 GB of RAM.
![](https://github.com/sheynkman-lab/Long-Read-Proteogenomics/raw/main/docs/images/DockerHubDesktopResourceConfigurationMacWithIntelChip.png)
On the MacBook Pro, the 64-bit version of Miniconda was downloaded and installed; follow its installation instructions. Then create and activate an environment:
conda create -n lrp
conda activate lrp
Install and set the Nextflow version.
conda install -c bioconda nextflow -y
export NXF_VER=20.01.0
Now with the environment ready, we can clone.
git clone https://github.com/sheynkman-lab/Long-Read-Proteogenomics
cd Long-Read-Proteogenomics
The input parameters are specified within the configuration file conf/test_with_sqanti.config. Inspecting this file, we see reduced file sizes, which serve the dual purpose of supporting this vignette as well as the GitHub Actions. The GitHub Actions permit automatic testing of the containerized processes used by the Nextflow workflow in this repository.
This configuration file may be run directly from the repository:
nextflow run sheynkman-lab/Long-Read-Proteogenomics --config conf/test_with_sqanti.config
or, if the user has cloned this repository, it may be run from the command line:
nextflow run main.nf --config conf/test_with_sqanti.config
This execution runs all of the steps and, unlike the quick start, executes SQANTI3, which provides additional protein annotation features that will be discussed within this vignette.
In the Nextflow setup, in the Main arguments portion of the configuration file, the name parameter sets the publishDir output directory name:
name = 'jurkat_chr22'
A Nextflow process follows the pattern input, process, output, where the process body is a script. Additionally, there is an option to publish the results; where those results go is a relative directory defined by the user. In our case, the results of each process are published under params.outdir/params.name, plus a further subdirectory, keeping the relative path information easy for the user to follow and understand.
Using the tree command on macOS (install it with brew install tree from a terminal window; this requires Homebrew), we see one level down the two directories jurkat_chr22 and pipeline_info.
results
├── jurkat_chr22
└── pipeline_info
In our case the output directory is called results and the name provided in the configuration file is jurkat_chr22.
Using the name specified in the main arguments, the output of each task/process executed by Nextflow is produced by its script and published to these subdirectories via the publishDir directive within the process definition.
results/jurkat_chr22/
├── accession_mapping
├── cpat
├── gencode_db
├── hybrid_protein_database
├── isoseq3
├── metamorpheus
├── novel_peptides
├── orf_calling
├── pacbio_6frm_gene_grouped
├── pacbio_cds
├── peptide_analysis
├── protein_classification
├── protein_filter
├── protein_gene_rename
├── protein_group_compare
├── raw_convert
├── reference_tables
├── refined_database
├── rename_cds
├── sqanti3
├── sqanti3-filtered
├── sqanti_protein
├── star
├── track_visualization
└── transcriptome_summary
The pipeline_info directory contains Nextflow-generated files that are helpful for sizing, running and determining scaling options for this workflow.
results/pipeline_info/
├── execution_report.html
├── execution_timeline.html
├── execution_trace.txt
├── pipeline_dag.svg
└── test_with_sqanti_pipeline_dag.svg
We can inspect the execution_report.html from a previous run with this configuration file by downloading it from Zenodo and opening it in a browser.
wget https://zenodo.org/record/5234651/files/execution_report.html
Nextflow generates this report, and inspecting it in a browser reveals many runtime details useful for reviewing the execution and planning future executions. It contains selectable Summary, Resources and Tasks tabs. The Tasks tab shows information about each task in the workflow. You can search for specific values; the filters include Metrics, Metadata and All, or you can look at all of the tasks together. All details are captured, including which container was used for each process, the run status, and a hash to guide the user to the work directory under which the task was run, which is useful for debugging.
Inspecting the execution_timeline.html, also downloaded from Zenodo:
wget https://zenodo.org/record/5234651/files/execution_timeline.html
Opening this file in a browser, we can see each process as it executed and how long it took, which is very useful for understanding what occurs when and how much CPU time each process consumed.
This workflow uses Established Tools and Custom Tools to bring together Transcriptomic, Proteogenomic and Proteomic information to make sample-specific annotated databases, permitting analysis and visualization of sample-specific results.
Before we go through the Transcriptomic, Proteogenomic and Proteomic processes, we divert into a little background regarding our input data and Nextflow workflow channels.
In the Input data portion of the configuration file test_with_sqanti.config, the following parameters are defined:
// Input data
gencode_gtf = 'https://zenodo.org/record/5234651/files/gencode.v35.annotation.chr22.gtf'
gencode_transcript_fasta = 'https://zenodo.org/record/5234651/files/gencode.v35.pc_transcripts.chr22.fa'
gencode_translation_fasta = 'https://zenodo.org/record/5234651/files/gencode_protein.chr22.fasta'
genome_fasta = 'https://zenodo.org/record/5234651/files/GRCh38.primary_assembly.genome.chr22.fa'
hexamer = 'https://zenodo.org/record/5234651/files/Human_Hexamer.tsv'
logit_model = 'https://zenodo.org/record/5234651/files/Human_logitModel.RData.gz'
normalized_ribo_kallisto = 'https://zenodo.org/record/5234651/files/kallist_table_rdeplete_jurkat.tsv'
primers_fasta = 'https://zenodo.org/record/5234651/files/NEB_primers.fasta'
sample_ccs = 'https://zenodo.org/record/5234651/files/jurkat.codethon_toy.ccs.bam'
sample_kallisto_tpm = 'https://zenodo.org/record/5234651/files/jurkat_gene_kallisto.tsv'
star_genome_dir = 'https://zenodo.org/record/5234651/files/star_genome.tar.gz'
uniprot_protein_fasta = 'https://zenodo.org/record/5234651/files/uniprot_protein.chr22.fasta'
The gencode_gtf, gencode_transcript_fasta, gencode_translation_fasta and genome_fasta have all been reduced to just chromosome 22 of the human genome. Users can use the version of GENCODE and compatible assembly of their genome of interest. Further details of the input parameters are found here.
These data are then mapped, in the course of the execution of the Nextflow workflow, to channels. Channels are used by Nextflow to enable parallel processing. There are two types of Nextflow channels: a queue channel cannot be used more than once as a process input or output, while a value channel can be read an unlimited number of times without consuming its content.
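As a loose analogy (this is Python, not Nextflow), a queue channel behaves like a generator that is drained as it is read, while a value channel behaves like a plain variable that can be read repeatedly:

```python
# Analogy only: Nextflow queue channels are consumed once, like a
# Python generator; value channels can be read again and again.
queue_like = (f for f in ["jurkat.ccs.bam"])   # generator: one pass
value_like = "gencode.v35.annotation.chr22.gtf"

first_read = list(queue_like)    # drains the "channel"
second_read = list(queue_like)   # nothing left on a second read
repeat_reads = [value_like, value_like]  # still available
```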
gencode_gtf gets mapped to the [value channel](https://www.nextflow.io/docs/latest/channel.html#value-channel) ch_gencode_gtf.
ch_gencode_gtf = Channel.value(file(params.gencode_gtf))
gencode_transcript_fasta is mapped to either the channel ch_gencode_transcript_fasta or the channel ch_gencode_transcript_fasta_uncompressed, depending upon whether or not it is compressed.
if (params.gencode_transcript_fasta.endsWith('.gz')){
ch_gencode_transcript_fasta = Channel.value(file(params.gencode_transcript_fasta))
} else {
ch_gencode_transcript_fasta_uncompressed = Channel.value(file(params.gencode_transcript_fasta))
}
This permits handling the file whether it is compressed (filename ends with gz) or not. Regardless, note that gencode_transcript_fasta gets mapped to a value channel, which can be read an unlimited number of times without consuming its content.
There are several processes that offer file decompression, you may read about them in the Utility Processes
section. Now we begin with Transcriptomic Processes
description, then follow with Proteogenomic Processes
, Protein Processes
and finish with Utility Processes
.
Beginning with the circular consensus reads, the IsoSeq3 process is executed to produce polished transcripts. These are input to SQANTI3 for isoform and junction annotation and isoform structures, to CPAT, which produces the initial input required for the proteogenomic processes of ORF prediction, ORF calling and database annotation, and to the 6-frame translation databases.
The IsoSeq3 process uses the ch_sample_ccs, ch_genome_fasta_isoseq and ch_primers_fasta files as input and produces its output in the results/jurkat_chr22/isoseq3 directory.
results/jurkat_chr22/isoseq3/
├── jurkat_chr22.collapsed.abundance.txt
├── jurkat_chr22.collapsed.fasta
├── jurkat_chr22.collapsed.gff
├── jurkat_chr22.collapsed.report.json
├── jurkat_chr22.demult.lima.summary
├── jurkat_chr22.flnc.bam
├── jurkat_chr22.flnc.bam.pbi
└── jurkat_chr22.flnc.filter_summary.json
Only two outputs are mapped to queue channels for subsequent processing by other processes:
- ch_isoseq_gtf is mapped from jurkat_chr22.collapsed.gff
- ch_fl_count is mapped from jurkat_chr22.collapsed.abundance.txt
sqanti3: SQANTI3 classifies IsoSeq3 isoforms and, if short-read fastq files are provided, corrects isoforms using the STAR junction details relative to the reference genome.
results/jurkat_chr22/sqanti3
├── jurkat_chr22.params.txt
├── jurkat_chr22_classification.txt
├── jurkat_chr22_corrected.fasta
├── jurkat_chr22_corrected.gtf
├── jurkat_chr22_junctions.txt
└── jurkat_chr22_sqanti_report.pdf
Outputs mapped to queue channels for subsequent processing by other processes are:
- ch_sample_unfiltered_classification from jurkat_chr22_classification.txt
- ch_sample_unfiltered_fasta from jurkat_chr22_corrected.fasta
- ch_sample_unfiltered_gtf from jurkat_chr22_corrected.gtf
CPAT is a bioinformatics tool that predicts an RNA's coding probability based on the RNA sequence characteristics. To achieve this goal, CPAT calculates scores for sequence-based features from a set of known protein-coding genes and a background set of non-coding genes:
- ORF size
- ORF coverage
- Fickett score
- Hexamer usage bias
CPAT builds a logistic regression model using these 4 features as predictor variables and the “protein-coding status” as the response variable. After evaluating the performance and determining the probability cutoff, the model can be used to predict new RNA sequences.
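The scoring idea can be sketched as a logistic function over the four features. The coefficients and intercept below are invented for illustration (CPAT's real coefficients are fit from training data); only the 0.364 cutoff, used later in ORF calling, comes from this vignette.

```python
import math

# Illustrative only: hypothetical coefficients standing in for the
# fitted CPAT logistic-regression model over the four features.
WEIGHTS = {"orf_size": 0.001, "orf_coverage": 2.0,
           "fickett": 3.0, "hexamer": 4.0}
INTERCEPT = -5.0

def coding_probability(features):
    z = INTERCEPT + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic link

p = coding_probability({"orf_size": 900, "orf_coverage": 0.8,
                        "fickett": 1.1, "hexamer": 0.4})
is_coding = p > 0.364  # cutoff reused by the ORF-calling step
```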
Output produced is as follows:
results/jurkat_chr22/cpat/
├── CPAT_run_info.log
├── jurkat_chr22.ORF_prob.best.tsv
├── jurkat_chr22.ORF_prob.tsv
├── jurkat_chr22.ORF_seqs.fa
├── jurkat_chr22.no_ORF.txt
├── jurkat_chr22.r
├── jurkat_chr22_cpat.error
└── jurkat_chr22_cpat.output
Three of these are passed to downstream processes:
- ch_cpat_all_orfs comes from jurkat_chr22.ORF_prob.tsv
- ch_cpat_best_orf comes from jurkat_chr22.ORF_prob.best.tsv
- ch_cpat_protein_fasta comes from jurkat_chr22.ORF_seqs.fa
transcriptome_summary compares the abundances (CPM) based upon long-read sequencing to the abundances (TPM) inferred from short-read sequencing as computed by Kallisto (analyzed outside of this pipeline), and additionally produces a pacbio_gene reference table.
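A minimal sketch of the gene-level comparison, with invented gene names and numbers (the real process reads the PacBio and Kallisto tables):

```python
# Hypothetical abundances: long-read CPM vs. short-read Kallisto TPM.
long_read_cpm = {"APOL1": 12.5, "MICAL3": 3.1}
short_read_tpm = {"APOL1": 10.2, "SEZ6L": 7.7}

# Outer join by gene, keeping None where a technology saw no signal.
gene_level = {
    g: (long_read_cpm.get(g), short_read_tpm.get(g))
    for g in sorted(set(long_read_cpm) | set(short_read_tpm))
}
```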
results/jurkat_chr22/transcriptome_summary/
├── gene_level_tab.tsv
├── pb_gene.tsv
└── sqanti_isoform_info.tsv
The three outputs are passed to downstream processes as follows:
- ch_gene_level comes from gene_level_tab.tsv
- ch_sqanti_isoform_info comes from sqanti_isoform_info.tsv
- ch_pb_gene comes from pb_gene.tsv
six_frame_translation generates a fasta file of all possible protein sequences derivable from each PacBio transcript by translating the fasta file in all six frames (3+, 3-). This is used to examine which peptides could theoretically match the peptides found via a mass spectrometry search against GENCODE.
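Six-frame translation itself can be sketched in a few lines; the codon table below is abridged to keep the example short (a real implementation uses the full standard table):

```python
# Abridged codon table for the example only; unknown codons become 'X'.
CODON = {"ATG": "M", "GCT": "A", "TTT": "F", "AAA": "K",
         "AGC": "S", "CAT": "H", "TTA": "L"}
COMP = str.maketrans("ACGT", "TGCA")

def translate(seq):
    return "".join(CODON.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frames(seq):
    rc = seq.translate(COMP)[::-1]       # reverse complement
    # three forward frames, then three reverse frames
    return [translate(s[f:]) for s in (seq, rc) for f in (0, 1, 2)]

frames = six_frames("ATGGCTTTTAAA")
```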
Output is just one file
results/jurkat_chr22/pacbio_6frm_gene_grouped/
└── jurkat_chr22.6frame.fasta
This file is mapped to the following queue channel:
- ch_six_frame from jurkat_chr22.6frame.fasta
orf_calling selects the most plausible ORF from each PacBio flnc transcript, using the following information:
- comparison of the ATG start in the transcript to the reference (GENCODE), selecting the ORF whose ATG start matches the one in the reference if it exists
- the coding probability score from CPAT
- the number of upstream ATGs for the candidate ORF, decreasing the score as the number of upstream ATGs increases using a sigmoid function
Additionally, it reports the calling confidence of each ORF called:
- Clear Best ORF: best score and fewest upstream ATGs of all called ORFs
- Plausible ORF: not clearly the best, but with a decent CPAT coding score (> 0.364)
- Low quality ORF: low CPAT coding score (< 0.364)
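The upstream-ATG penalty and the confidence buckets can be sketched as follows; the sigmoid midpoint and steepness are invented for illustration, while the 0.364 cutoff comes from the list above:

```python
import math

def penalized_score(cpat_score, upstream_atgs,
                    midpoint=3.0, steepness=1.5):
    # Sigmoid penalty: more upstream ATGs shrink the score toward 0.
    # midpoint/steepness are illustrative, not the pipeline's values.
    penalty = 1.0 / (1.0 + math.exp(steepness * (upstream_atgs - midpoint)))
    return cpat_score * penalty

def confidence(cpat_score):
    # The Clear Best ORF tier also requires comparing all candidate
    # ORFs of a transcript, which is omitted in this sketch.
    return "Plausible ORF" if cpat_score > 0.364 else "Low quality ORF"
```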
The output is a single file:
results/jurkat_chr22/orf_calling/
└── jurkat_chr22_best_orf.tsv
- ch_best_orf comes from jurkat_chr22_best_orf.tsv

In this case, this queue channel is split to be consumed by 4 downstream processes:
ch_best_orf.into{
ch_best_orf_refine
ch_best_orf_cds
ch_best_orf_sqanti_protein
ch_best_orf_pclass
}
- filters the ORF database to include only accessions with a CPAT coding score above a threshold (default 0.0)
- filters ORFs to include only those that have a stop codon
- collapses transcripts that produce the same protein into one entry, keeping a base accession (the first alphanumerically)
- collapses the abundances (CPM) of those transcripts during this process
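The collapse of identical proteins can be sketched like this (accessions and sequences are invented; the base accession is taken as the alphanumerically first one, per the description above):

```python
from collections import defaultdict

def collapse(entries):
    """entries: iterable of (accession, protein_sequence, cpm)."""
    groups = defaultdict(list)
    for acc, seq, cpm in entries:
        groups[seq].append((acc, cpm))
    # one entry per distinct protein: first accession, summed CPM
    return {min(a for a, _ in members): (seq, sum(c for _, c in members))
            for seq, members in groups.items()}

refined = collapse([("PB.2.1", "MKVL", 1.0),
                    ("PB.1.1", "MKVL", 2.5),
                    ("PB.3.1", "MA", 0.5)])
```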
Output is:
results/jurkat_chr22/refined_database/
├── jurkat_chr22_orf_refined.fasta
└── jurkat_chr22_orf_refined.tsv
These outputs are mapped to the following queue channels:
- ch_refined_info is mapped from jurkat_chr22_orf_refined.tsv
- ch_sample_fasta_refine is mapped from jurkat_chr22_orf_refined.fasta
This routine performs the following functions:
- a preprocessing step for SQANTI Protein
- the CDS is renamed to exon, and transcript start and stop locations are updated to reflect the CDS start and stop
The output produced is:
results/jurkat_chr22/rename_cds/
├── gencode.cds_renamed_exon.gtf
├── gencode.transcript_exons_only.gtf
├── jurkat_chr22.cds_renamed_exon.gtf
└── jurkat_chr22.transcript_exons_only.gtf
These output files are mapped to the following queue channels:
- ch_sample_cds_renamed is mapped from jurkat_chr22.cds_renamed_exon.gtf
- ch_sample_transcript_exon_only is mapped from jurkat_chr22.transcript_exons_only.gtf
- ch_ref_cds_renamed is mapped from gencode.cds_renamed_exon.gtf
- ch_ref_transcript_exon_only is mapped from gencode.transcript_exons_only.gtf
The sqanti_protein routine classifies protein splice sites and calculates additional statistics for the start and stop of an ORF.
The output produced is captured in a single file:
results/jurkat_chr22/sqanti_protein/
└── jurkat_chr22.sqanti_protein_classification.tsv
This file is mapped to the following queue channel:
- ch_sqanti_protein_classification is mapped from jurkat_chr22.sqanti_protein_classification.tsv
The routine five_prime_utr determines the 5' UTR status and is an intermediate step for protein classification; the 5' UTR status is necessary for the later protein category classification step.
The output produced is captured in a single file:
results/jurkat_chr22/5p_utr/
└── jurkat_chr22.sqanti_protein_classification_w_5utr_info.tsv
This file is mapped to the following queue channel:
- ch_sqanti_protein_classification_w_5utr is mapped from jurkat_chr22.sqanti_protein_classification_w_5utr_info.tsv
This routine classifies proteins based upon splicing and start site. It is one of the most important routines, and many of the previous routines feed into it. protein_classification has the following main classifications:
- pFSM: full-protein-match. The protein fully matches a GENCODE protein.
- pISM: incomplete-protein-match. The protein partially matches a GENCODE protein and is considered an N- or C-terminus truncation artifact.
- pNIC: novel-in-catalog. The protein is composed of a known N-terminus, known exons, and/or a known C-terminus in a new combination resulting from a not-yet-seen splicing event.
- pNNC: novel-not-in-catalog. The protein contains a novel N-terminus, novel exon, and/or novel C-terminus. Rather than being a recombination of known elements, it contains novel exon ends arising from a novel splicing event and is therefore not found in the GENCODE database.
These are the results derived from the transcriptomic portion of the experimental data, as measured by long-read RNA sequencing. Now we enter the proteomic portion of this proteogenomic pipeline, where these events are or are not confirmed. It is important to note that these confirmations are meaningful because they are derived from separate experiments and separate library constructions.
The output of this routine is found in the sqanti_protein subdirectory of results/jurkat_chr22:
results/jurkat_chr22/sqanti_protein/
└── jurkat_chr22.sqanti_protein_classification.tsv
└── jurkat_chr22.sqanti_genes.tsv
These files are then mapped to the following Nextflow queue channels:
- ch_protein_classification_unfiltered is mapped from the jurkat_chr22.sqanti_protein_classification.tsv file
- ch_pr_genes is mapped from the jurkat_chr22.sqanti_genes.tsv file
This step uses the output of sqanti_protein and filters out proteins that are not classified as pFSM, pNIC or pNNC. This removes pISMs (either N-terminus or C-terminus truncations) and removes pNNCs with junctions after the stop codon (default 2).
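The filter logic amounts to a small predicate; how the junction threshold is applied (strictly greater than the default of 2) is an assumption of this sketch:

```python
KEEP_CLASSES = {"pFSM", "pNIC", "pNNC"}

def keep_protein(pclass, junctions_after_stop=0, max_after_stop=2):
    # drop pISMs (and anything else outside the kept classes)
    if pclass not in KEEP_CLASSES:
        return False
    # drop pNNCs with too many junctions after the stop codon
    if pclass == "pNNC" and junctions_after_stop > max_after_stop:
        return False
    return True
```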
The output is published to results/jurkat_chr22/protein_filter
subdirectory:
results/jurkat_chr22/protein_filter/
├── jurkat_chr22.classification_filtered.tsv
├── jurkat_chr22.filtered_protein.fasta
└── jurkat_chr22_with_cds_filtered.gtf
These files are then mapped to the following Nextflow queue channels:
- ch_filtered_protein_classification is mapped from jurkat_chr22.classification_filtered.tsv
- ch_filtered_protein_fasta is mapped from jurkat_chr22.filtered_protein.fasta
- ch_filtered_cds is mapped from jurkat_chr22_with_cds_filtered.gtf
This process makes a hybrid database composed of high-confidence PacBio proteins and GENCODE proteins. High-confidence is defined as genes for which the PacBio sampling is adequate (average transcript length of 1-4 kb and a total of 3 counts per million (CPM), derived from Kallisto, per gene). The Kallisto CPMs were computed outside of this Nextflow workflow using short-read RNA sequencing.
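The high-confidence test described above reduces to two conditions per gene; the function below is a sketch of that rule (1-4 kb average transcript length and at least 3 CPM), with the fallback-to-GENCODE behavior stated in the text:

```python
def is_high_confidence(avg_transcript_len_bp, gene_cpm):
    # adequate PacBio sampling: 1-4 kb average length and >= 3 CPM
    return 1000 <= avg_transcript_len_bp <= 4000 and gene_cpm >= 3.0

def database_source(avg_len, cpm):
    # high-confidence genes contribute PacBio proteins to the hybrid
    # database; all other genes fall back to GENCODE proteins
    return "PacBio" if is_high_confidence(avg_len, cpm) else "GENCODE"
```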
The output from this process is published to results/jurkat_chr22/hybrid_protein_database
results/jurkat_chr22/hybrid_protein_database/
├── jurkat_chr22_cds_high_confidence.gtf
├── jurkat_chr22_high_confidence_genes.tsv
├── jurkat_chr22_hybrid.fasta
└── jurkat_chr22_refined_high_confidence.tsv
The following Nextflow queue channels are used in downstream processes:
- ch_high_confidence_cds is mapped from the jurkat_chr22_cds_high_confidence.gtf file
- ch_sample_hybrid_fasta is mapped from the jurkat_chr22_hybrid.fasta file
- ch_refined_into_high_conf is mapped from the jurkat_chr22_refined_high_confidence.tsv file
These data will be used in visualization of the results.
Mass spec results were obtained from instrumentation that measures a properly prepared aliquot of the same sample from which the long-read sequencing library was constructed. Often the mass spec data needs to be converted into a format that can be used by the main proteomics tool in this workflow, MetaMorpheus. This is done by the following process:
In our toy example, we used only one mass spec file, so the output produced is a single file; if we had more mass spec files, there would be an output file for each of the provided raw files.
results/jurkat_chr22/raw_convert/
└── 120426_Jurkat_highLC_Frac28.mzML
The Nextflow queue channel mapped from this output (which flexes to as many files as match the provided pattern) is as follows:
- ch_mass_spec_converted is mapped from any output file in the directory, as indicated by the file(*) pattern.
It is important to note that the same is true of the input: the pattern sought is any file within the directory. This is honestly a utility function; we could have described it in the utility processes section. Now we describe the heart of the Proteomic Processes section, where we see three different databases used, GENCODE, UniProt and a PacBio sample-specific refined database, so that the paper, and the user, can compare the results.
This routine, metamorpheus_with_gencode_database, runs MetaMorpheus, which performs a mass spectrum search using a database derived from GENCODE data. The results are published to the results/jurkat_chr22/metamorpheus/gencode subdirectory:
results/jurkat_chr22/metamorpheus/gencode/
├── search_results
│ ├── Task\ Settings
│ │ └── Task1SearchTaskconfig.toml
│ ├── Task1SearchTask
│ │ ├── 120426_Jurkat_highLC_Frac28.mzID
│ │ ├── AllPSMs.psmtsv
│ │ ├── AllPSMs_FormattedForPercolator.tab
│ │ ├── AllPeptides.Gencode.psmtsv
│ │ ├── AllQuantifiedPeaks.tsv
│ │ ├── AllQuantifiedPeptides.tsv
│ │ ├── AllQuantifiedProteinGroups.Gencode.tsv
│ │ ├── model.zip
│ │ ├── prose.txt
│ │ └── results.txt
│ └── allResults.txt
└── toml
├── CalibrationTask.toml
├── GlycoSearchTask.toml
├── GptmdTask.toml
├── SearchTask.toml
└── XLSearchTask.toml
The Nextflow queue channels mapped from the following files are used in subsequent processes:
- ch_gencode_peptides is mapped from the file search_results/Task1SearchTask/AllPeptides.Gencode.psmtsv
- ch_gencode_protein_group is mapped from the file search_results/Task1SearchTask/AllQuantifiedProteinGroups.Gencode.tsv
The metamorpheus_with_uniprot_database process runs a MetaMorpheus mass spectrum search using a database derived from UniProt data. It publishes its results to the results/jurkat_chr22/metamorpheus/uniprot subdirectory:
results/jurkat_chr22/metamorpheus/uniprot/
├── search_results
│ ├── Task\ Settings
│ │ └── Task1SearchTaskconfig.toml
│ ├── Task1SearchTask
│ │ └── AllPeptides.UniProt.psmtsv
│ └── allResults.txt
└── toml
├── CalibrationTask.toml
├── GlycoSearchTask.toml
├── GptmdTask.toml
├── SearchTask.toml
└── XLSearchTask.toml
The Nextflow queue channels mapped from the following files are used in subsequent processes:
- ch_uniprot_peptides is mapped from the file search_results/Task1SearchTask/AllPeptides.UniProt.psmtsv
- ch_uniprot_protein_group is mapped from the file search_results/Task1SearchTask/AllQuantifiedProteinGroups.UniProt.tsv
This routine, metamorpheus_with_sample_specific_database_refined, runs a MetaMorpheus mass spectrum search with the PacBio refined database and publishes its results to the results/jurkat_chr22/metamorpheus/pacbio/refined subdirectory:
results/jurkat_chr22/metamorpheus/pacbio/refined/
├── search_results
│ ├── Task\ Settings
│ │ └── Task1SearchTaskconfig.toml
│ ├── Task1SearchTask
│ │ ├── 120426_Jurkat_highLC_Frac28.mzID
│ │ ├── AllPSMs.psmtsv
│ │ ├── AllPSMs_FormattedForPercolator.tab
│ │ ├── AllPeptides.jurkat_chr22.refined.psmtsv
│ │ ├── AllQuantifiedPeaks.tsv
│ │ ├── AllQuantifiedPeptides.tsv
│ │ ├── AllQuantifiedProteinGroups.jurkat_chr22.refined.tsv
│ │ ├── model.zip
│ │ ├── prose.txt
│ │ └── results.txt
│ └── allResults.txt
└── toml
├── CalibrationTask.toml
├── GlycoSearchTask.toml
├── GptmdTask.toml
├── SearchTask.toml
└── XLSearchTask.toml
Two Nextflow queue channels are defined and mapped from the following two files:
- ch_pacbio_peptides_refined is mapped from search_results/Task1SearchTask/AllPeptides.jurkat_chr22.refined.psmtsv
- ch_pacbio_protein_groups_refined is mapped from search_results/Task1SearchTask/AllQuantifiedProteinGroups.jurkat_chr22.refined.tsv
This runs the MetaMorpheus mass spectrum search using the PacBio filtered database; metamorpheus_with_sample_specific_database_filtered publishes its results to results/jurkat_chr22/metamorpheus/pacbio/filtered:
results/jurkat_chr22/metamorpheus/pacbio/filtered/
├── search_results
│ ├── Task\ Settings
│ │ └── Task1SearchTaskconfig.toml
│ ├── Task1SearchTask
│ │ └── AllPeptides.jurkat_chr22.filtered.psmtsv
│ └── allResults.txt
└── toml
├── CalibrationTask.toml
├── GlycoSearchTask.toml
├── GptmdTask.toml
├── SearchTask.toml
└── XLSearchTask.toml
The following Nextflow queue channels are mapped from the published output files:
- ch_pacbio_peptides_filtered is mapped from the file search_results/Task1SearchTask/AllPeptides.jurkat_chr22.filtered.psmtsv
- ch_pacbio_protein_groups_filtered is mapped from the file search_results/Task1SearchTask/AllQuantifiedProteinGroups.jurkat_chr22.filtered.tsv
This run of the MetaMorpheus mass spectrum search, metamorpheus_with_sample_specific_database_hybrid, searches against the hybrid database constructed for the PacBio transcripts. The results from this run are published in the results/jurkat_chr22/metamorpheus/pacbio/hybrid subdirectory:
results/jurkat_chr22/metamorpheus/pacbio/hybrid/
├── search_results
│ ├── Task\ Settings
│ │ └── Task1SearchTaskconfig.toml
│ ├── Task1SearchTask
│ │ ├── 120426_Jurkat_highLC_Frac28.mzID
│ │ ├── AllPSMs.psmtsv
│ │ ├── AllPSMs_FormattedForPercolator.tab
│ │ ├── AllPeptides.jurkat_chr22.hybrid.psmtsv
│ │ ├── AllQuantifiedPeaks.tsv
│ │ ├── AllQuantifiedPeptides.tsv
│ │ ├── AllQuantifiedProteinGroups.jurkat_chr22.hybrid.tsv
│ │ ├── model.zip
│ │ ├── prose.txt
│ │ └── results.txt
│ └── allResults.txt
└── toml
├── CalibrationTask.toml
├── GlycoSearchTask.toml
├── GptmdTask.toml
├── SearchTask.toml
└── XLSearchTask.toml
The following Nextflow queue channels are derived from the indicated published files:
- ch_pacbio_peptides_hybrid is mapped from search_results/Task1SearchTask/AllPeptides.jurkat_chr22.hybrid.psmtsv
- ch_pacbio_protein_groups_hybrid is mapped from search_results/Task1SearchTask/AllQuantifiedProteinGroups.jurkat_chr22.hybrid.tsv
This runs the MetaMorpheus mass spectrum search using the hybrid database and the version of MetaMorpheus that contains the "Rescue and Resolve" algorithm. The "Rescue and Resolve" algorithm allows for the "Rescue" of protein isoforms that are normally discarded during protein parsimony, given that the protein isoform has strong evidence from transcriptional expression.
metamorpheus_with_sample_specific_database_rescue_resolve.
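A heavily simplified sketch of the "Rescue" idea described above. The CPM threshold and data shapes are invented for illustration; the real algorithm operates inside MetaMorpheus's protein parsimony step:

```python
def rescue(parsimony_kept, all_isoforms, transcript_cpm, min_cpm=3.0):
    kept = set(parsimony_kept)
    for iso in all_isoforms:
        # rescue isoforms that parsimony dropped when their transcript
        # shows strong expression evidence
        if iso not in kept and transcript_cpm.get(iso, 0.0) >= min_cpm:
            kept.add(iso)
    return kept

kept = rescue(["PB.1.1"], ["PB.1.1", "PB.1.2", "PB.2.1"],
              {"PB.1.2": 12.0, "PB.2.1": 0.2})
```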
The output is published in the results/jurkat_chr22/metamorpheus/pacbio/rescue_resolve
subdirectory.
results/jurkat_chr22/metamorpheus/pacbio/rescue_resolve/
├── search_results
│ ├── Task\ Settings
│ │ └── Task1SearchTaskconfig.toml
│ ├── Task1SearchTask
│ │ ├── 120426_Jurkat_highLC_Frac28.mzID
│ │ ├── AllPSMs.psmtsv
│ │ ├── AllPSMs_FormattedForPercolator.tab
│ │ ├── AllPeptides.jurkat_chr22.rescue_resolve.psmtsv
│ │ ├── AllQuantifiedPeaks.tsv
│ │ ├── AllQuantifiedPeptides.tsv
│ │ ├── AllQuantifiedProteinGroups.jurkat_chr22.rescue_resolve.tsv
│ │ ├── model.zip
│ │ ├── prose.txt
│ │ └── results.txt
│ └── allResults.txt
└── toml
├── CalibrationTask.toml
├── GlycoSearchTask.toml
├── GptmdTask.toml
├── SearchTask.toml
└── XLSearchTask.toml
There are no Nextflow queue channels defined from these output files. We have reached the end of the journey, with only visualization and peptide analysis remaining.
- gunzip_gencode_transcript_fasta: produces ch_gencode_transcript_fasta_uncompressed.
- gunzip_logit_model: produces ch_logit_model_uncompressed.
- untar_mass_spec: produces two optional sets of file types, [*.raw] and [*.mzML | *.mzml], mapped to the channels ch_mass_spec_raw and ch_mass_spec_mzml respectively.
- gunzip_translation_fasta: produces uncompressed files mapped to the ch_gencode_translation_fasta_uncompressed channel.
- gunzip_genome_fasta: produces the uncompressed files mapped to the ch_genome_fasta_uncompressed channel.
- gunzip_uniprot_protein_fasta: produces the uncompressed files mapped to the ch_uniprot_protein_fasta_uncompressed channel.
generate_reference_tables: the ch_gencode_transcript_fasta_uncompressed and ch_gencode_gtf channels are used together only in this process. This setup process generates tabular data sets needed in downstream processing, including ensg_gene.tsv, enst_isoname.tsv, gene_ensp.tsv, gene_isoname.tsv, isoname_lens.tsv, gene_lens.tsv and protein_coding_genes.txt, from gencode_gtf and gencode_transcript_fasta.
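The kind of extraction involved can be sketched for one of these tables, ensg_gene.tsv; the GTF row below is illustrative:

```python
import re

def ensg_gene_pairs(gtf_lines):
    """Map gene_id -> gene_name from GTF attribute fields."""
    pairs = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip GTF header lines
        gid = re.search(r'gene_id "([^"]+)"', line)
        name = re.search(r'gene_name "([^"]+)"', line)
        if gid and name:
            pairs[gid.group(1)] = name.group(1)
    return pairs

row = ('chr22\tHAVANA\tgene\t1\t100\t.\t+\t.\t'
       'gene_id "ENSG00000100342"; gene_name "APOL1";')
table = ensg_gene_pairs([row])
```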
This process creates several queue channels
. These are split as follows:
-
ch_protein_coding_genes
is split into [ch_protein_coding_genes_gencode_fasta
] to be consumed by processmake_gencode_database
and [ch_protein_coding_genes_filter_sqanti
] to be consumed by processfilter_sqanti
. -
ch_ensg_gene
is split into [ch_ensg_gene_filter
] to be consumed by processfilter_sqanti
, [ch_ensg_gene_six_frame
] to be consumed by processsix_frame_translation
and [ch_ensg_gene_pclass
] to be consumed by processprotein_classification
. -
ch_gene_lens
is split into [ch_gene_lens_transcriptome
] to be consumed by [...
], [ch_gene_lens_aggregate
] to be consumed by [...
]. -
ch_gene_isoname
is split into [ch_gene_isoname_pep_viz
] to be consumed by process [...
] and [ch_gene_isoname_pep_analysis
] to be consumed by process [...
].
Using the output of the generate_reference_tables process, the ch_protein_coding_genes_gencode_fasta channel, together with gencode_translation_fasta_uncompressed, is input to make_gencode_database. This process clusters GENCODE entries with identical proteins and produces a non-redundant gencode_protein.fasta file and a gencode_isoname_clusters file containing the transcript accessions that were clustered.
process_generate_star_genome: STAR alignment is run only if SQANTI has not previously been run and fastq (short-read RNA-seq) files have been provided. Running STAR produces sample junction alignments that are fed to SQANTI3 for classification filtering.
gencode_track_visualization: generates the tracks that can be used in the UCSC Genome Browser for the GENCODE database.
peptide_track_visualization: makes peptide tracks for use in the UCSC Genome Browser for the refined, filtered and hybrid databases.
To visualize results, please see the visualization capabilities of the final annotated results.
There are 37 separate processes executed with this configuration file, some specific to running this test vignette, such as untarring the compressed star_genome directory. During the course of execution, Nextflow generates a diagram of the connectivity between channels and their processes, as shown here.
Though there are 37 processes defined in the Nextflow main.nf, only 25 produce output to this results directory.
This concludes the overview of running the Long Read Proteogenomics Workflow with Test Data vignette. If you have any questions or comments, please write to us or open an issue, and we will endeavour to answer as soon as possible.
Sheynkman-Lab