Best practice for workflow annotation field length #19735

Open
mvdbeek opened this issue Mar 3, 2025 · 0 comments

mvdbeek commented Mar 3, 2025

I explored what that would mean for a few workflows. I think 250 might be a little strict?

This is the 250 char limit:

⏺ Here are the proposed changes for all the workflow files (.ga files):

  1. BREW3R.ga

  Current:
  - Name: BREW3R
  - Annotation: This workflow takes a collection of BAM (output of STAR) and a gtf. It extends the input gtf using de novo annotation.

  Proposed Change:
  - Name: BREW3R - De Novo Transcriptome Annotation
  - Annotation: Extends 3' ends of gene annotations using BAM files (from STAR alignments) and a reference GTF. Specifically designed for 3'-biased
  sequencing techniques like 10X scRNA-seq or BRB-seq that primarily capture transcript 3' ends.

  2. goseq-go-kegg-enrichment-analsis.ga

  Current:
  - Name: Goseq GO-KEGG Enrichment Analysis
  - Annotation: This workflow is used for GO and KEGG enrichment analysis using GOseq tools.

  Proposed Change:
  - Name: Gene Ontology and KEGG Pathway Enrichment Analysis
  - Annotation: Performs functional enrichment analysis of gene sets using GOseq, identifying over-represented Gene Ontology terms and KEGG
  pathways. Helps interpret biological significance of differentially expressed genes.

  3. rnaseq-de-filtering-plotting.ga

  Current:
  - Name: RNAseq_DE_filtering_plotting
  - Annotation: This workflow can only work on an experimental setup with exactly 2 conditions. It takes two collections of count tables as input
  and performs differential expression analysis. Additionally it filters for DE genes based on adjusted p-value and log2 fold changes thresholds. It
  also generates informative plots.

  Proposed Change:
  - Name: RNA-Seq Differential Expression Analysis with Visualization
  - Annotation: Identifies differentially expressed genes between two conditions from count tables, applying statistical filters (adjusted p-value
  and log2 fold change thresholds) and generating publication-quality visualizations.

  4. rnaseq-pe.ga

  Current:
  - Name: RNA-seq for Paired-end fastqs
  - Annotation: This workflow takes as input a list of paired-end fastqs. Adapters and bad quality bases are removed with fastp. Reads are mapped
  with STAR with ENCODE parameters and genes are counted simultaneously as well as normalized coverage (per million mapped reads) on uniquely mapped
  reads. The counts are reprocessed to be similar to HTSeq-count output. Alternatively, featureCounts can be used to count the reads/fragments per
  gene. FPKM are computed with cufflinks and/or with StringTie. The unstranded normalized coverage is computed with bedtools.

  Proposed Change:
  - Name: RNA-Seq Analysis: Paired-End Read Processing and Quantification
  - Annotation: Complete RNA-Seq analysis for paired-end data: quality control with fastp, alignment with STAR, quantification via multiple methods
  (STAR, featureCounts), and expression calculation (FPKM with Cufflinks/StringTie, normalized coverage with bedtools).

  5. rnaseq-sr.ga

  Current:
  - Name: RNA-seq for Single-read fastqs
  - Annotation: This workflow takes as input a list of single-end fastqs. Adapters and bad quality bases are removed with fastp. Reads are mapped
  with STAR with ENCODE parameters and genes are counted simultaneously as well as normalized coverage (per million mapped reads) on uniquely mapped
  reads. The counts are reprocessed to be similar to HTSeq-count output. Alternatively, featureCounts can be used to count the reads/fragments per
  gene. FPKM are computed with cufflinks and/or with StringTie. The unstranded normalized coverage is computed with bedtools.

  Proposed Change:
  - Name: RNA-Seq Analysis: Single-End Read Processing and Quantification
  - Annotation: Complete RNA-Seq analysis for single-end data: quality control with fastp, alignment with STAR, quantification via multiple methods
  (STAR, featureCounts), and expression calculation (FPKM with Cufflinks/StringTie, normalized coverage with bedtools).

  These changes will make the workflow names and annotations more descriptive, consistent, and user-friendly while staying within the 250-character
  limit for annotations.

This is the 500 char limit:

⏺ With a 500 character limit, I can provide more detailed annotations for each workflow. Here are my suggested improvements to make the most of the
  500-character limit:

  1. BREW3R

  Enhanced Annotation (500 char limit):
  "Extends 3' ends of gene annotations using BAM files (from STAR alignments) and a reference GTF. Specifically designed for 3'-biased sequencing
  techniques like 10X scRNA-seq or BRB-seq that primarily capture transcript 3' ends. The BREW3R tool enhances annotations by using evidence from
  RNA-seq data to improve 3' UTR definitions, which is particularly important for accurate quantification in single-cell and bulk RNA-seq
  experiments."

  2. Gene Ontology and KEGG Pathway Enrichment Analysis

  Enhanced Annotation (500 char limit):
  "Performs functional enrichment analysis of gene sets using GOseq, identifying over-represented Gene Ontology terms and KEGG pathways. This
  workflow accounts for gene length bias in RNA-seq data when calculating enrichment statistics. Input requires differentially expressed genes and
  gene lengths. Generates comprehensive tables and visualizations of enriched GO terms across all three ontologies (Biological Process, Molecular
  Function, Cellular Component) as well as KEGG pathway enrichment results."

  3. RNA-Seq Differential Expression Analysis with Visualization

  Enhanced Annotation (500 char limit):
  "Identifies differentially expressed genes between exactly two experimental conditions from count tables. The workflow performs statistical
  testing, applies filters based on adjusted p-value and log2 fold change thresholds, and generates publication-quality visualizations including
  volcano plots, MA plots, and heatmaps. Takes two collections of count tables as input and produces filtered gene lists and interactive plots for
  interpreting expression differences. Optimal for simple two-condition experimental designs."

  4. RNA-Seq Analysis: Paired-End Read Processing and Quantification

  Enhanced Annotation (500 char limit):
  "Complete RNA-Seq analysis for paired-end data: quality control with fastp to remove adapters and low-quality bases, alignment with STAR using
  ENCODE parameters, gene quantification via multiple methods (STAR and featureCounts), and expression calculation (FPKM with Cufflinks/StringTie,
  normalized coverage with bedtools). Processes raw FASTQ files to produce count tables, normalized expression values, and genomic coverage tracks.
  Supports stranded and unstranded libraries, generating both HTSeq-compatible counts and normalized measures for downstream analysis."

  5. RNA-Seq Analysis: Single-End Read Processing and Quantification

  Enhanced Annotation (500 char limit):
  "Complete RNA-Seq analysis for single-end data: quality control with fastp to remove adapters and low-quality bases, alignment with STAR using
  ENCODE parameters, gene quantification via multiple methods (STAR and featureCounts), and expression calculation (FPKM with Cufflinks/StringTie,
  normalized coverage with bedtools). Processes raw FASTQ files to produce count tables, normalized expression values, and genomic coverage tracks.
  Supports stranded and unstranded libraries, generating both HTSeq-compatible counts and normalized measures for downstream analysis."

  These enhanced annotations maintain all the essential information from the shorter versions while adding valuable context about inputs, outputs,
  and specific features that would help users understand the workflows better.

The 250 char limit forces some information to be condensed to the point that it isn't entirely correct anymore (see fastp, where adapter trimming isn't exactly "QC"). The 500 char version seems much better to me, and it is what we should show in the workflow cards.
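
For anyone who wants to see where their own annotations fall relative to the two candidate limits, here is a rough sketch. It assumes the `.ga` file is JSON with the annotation stored under a top-level `annotation` key; the function name `annotation_length_report` is made up for illustration and is not part of Galaxy or any linter. Note that `len()` counts Unicode code points, which may differ from how a real check would count.

```python
import json


def annotation_length_report(workflow_json: str, limits=(250, 500)) -> dict:
    """Report the annotation length of a workflow and which limits it fits.

    Assumption: the workflow JSON keeps its description under a
    top-level "annotation" key (hypothetical helper, not a Galaxy API).
    """
    workflow = json.loads(workflow_json)
    # Treat a missing or null annotation as an empty string.
    annotation = workflow.get("annotation") or ""
    length = len(annotation)
    return {
        "name": workflow.get("name"),
        "length": length,
        # Map each candidate limit to True if the annotation fits within it.
        "fits": {limit: length <= limit for limit in limits},
    }
```

Calling it on the text of a workflow file, e.g. `annotation_length_report(open("rnaseq-pe.ga").read())`, returns the length plus a per-limit pass/fail flag, which makes it easy to see which annotations would need trimming under each proposal.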

Originally posted by @mvdbeek in #19591 (comment)

mvdbeek changed the title from "Best practice for workflow annotation field" to "Best practice for workflow annotation field length" on Mar 3, 2025
mvdbeek self-assigned this on Mar 4, 2025