naobservatory/mgs-workflow

Nucleic Acid Observatory Viral Metagenomics Pipeline

This Nextflow pipeline is designed to process metagenomic sequencing data, characterize overall taxonomic composition, and identify and quantify reads mapping to viruses infecting certain host taxa of interest. It was developed as part of the Nucleic Acid Observatory project.

The pipeline currently consists of four workflows:

  • INDEX: Creates indices and reference files used by the RUN and RUN_VALIDATION workflows [1].
  • RUN: Performs the main analysis, including QC, viral identification, taxonomic profiling, and optional BLAST validation.
  • RUN_VALIDATION: Performs the part of the RUN workflow dedicated to validating taxonomic classification with BLAST [2].
  • DOWNSTREAM: Performs downstream analysis of the results from the RUN workflow, currently limited to marking duplicate reads [3].

Documentation

Footnotes

  1. The INDEX workflow is intended to be run first, after which many instantiations of the RUN workflow can use the same index output files.

  2. The RUN_VALIDATION workflow is intended to be run after the RUN workflow if the optional BLAST validation was not selected during the RUN workflow. It is typically run on a subset of the host-infecting viral reads identified by the RUN workflow, to evaluate the sensitivity and specificity of the viral identification process.

  3. The DOWNSTREAM workflow is designed to handle tasks that require cross-read comparisons, including potentially across multiple runs.
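
The ordering described in these footnotes can be sketched in a few lines of Python. This is only an illustration of the dependencies between workflows, not how the pipeline is launched: the function `plan_workflows` and its parameters are hypothetical, and only the workflow names come from this README. The sketch assumes one INDEX invocation shared by every RUN, an optional RUN_VALIDATION after each RUN that skipped BLAST validation, and a single DOWNSTREAM pass over all runs.

```python
def plan_workflows(n_runs, blast_in_run, validate_separately=True, index_built=False):
    """Return workflow invocations in an order consistent with the footnotes.

    Hypothetical helper for illustration only; it does not invoke Nextflow.
    """
    if n_runs < 0:
        raise ValueError("n_runs must be non-negative")

    steps = []

    # INDEX runs first, once; its outputs are reused by every RUN.
    if not index_built:
        steps.append("INDEX")

    for i in range(1, n_runs + 1):
        steps.append(f"RUN[{i}]")
        # RUN_VALIDATION only makes sense when RUN skipped BLAST validation.
        if validate_separately and not blast_in_run:
            steps.append(f"RUN_VALIDATION[{i}]")

    # DOWNSTREAM compares reads across runs, so it goes last.
    if n_runs > 0:
        steps.append("DOWNSTREAM")

    return steps


if __name__ == "__main__":
    for step in plan_workflows(n_runs=2, blast_in_run=False):
        print(step)
```

Passing `index_built=True` models the case where an existing index is reused for additional RUN invocations.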