diff --git a/CITATIONS/index.html b/CITATIONS/index.html
index 1e62130a..e6ab813f 100644
--- a/CITATIONS/index.html
+++ b/CITATIONS/index.html
@@ -193,19 +193,48 @@
Alcock et al. 2020. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, Volume 48, Issue D1, Pages D517-525 [PMID 31665441]
+++Néron, Bertrand, Eloi Littner, Matthieu Haudiquet, Amandine Perrin, Jean Cury, and Eduardo P.C. Rocha. 2022. IntegronFinder 2.0: Identification and Analysis of Integrons across Bacteria, with a Focus on Antibiotic Resistance in Klebsiella. Microorganisms 10, no. 4: 700. https://doi.org/10.3390/microorganisms10040700
+
Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21, 180 (2020). https://doi.org/10.1186/s13059-020-02090-4
++Gautreau G et al. (2020) PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLOS Computational Biology 16(3): e1007732. https://doi.org/10.1371/journal.pcbi.1007732
+
+Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 2019 Feb;29(2):304-316. doi: 10.1101/gr.241455.118. Epub 2019 Jan 24. PMID: 30679308; PMCID: PMC6360808.
++Harris SR. 2018. SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 453142. doi: https://doi.org/10.1101/453142
+
++Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. "Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins". doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014.
+
Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom. 2016 Apr 29;2(4):e000056. doi: 10.1099/mgen.0.000056. PMID: 28348851; PMCID: PMC5320690.
diff --git a/index.html b/index.html
index f8f045a1..203c9a55 100644
--- a/index.html
+++ b/index.html
@@ -345,5 +345,5 @@ Citing ARETE
diff --git a/search/search_index.json b/search/search_index.json
index 403b0ecd..959d4da8 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"What is ARETE? ARETE (Antimicrobial Resistance: Emergence, Transmission, and Ecology) is a bioinformatics best-practice analysis pipeline for profiling the genomic repertoire and evolutionary dynamics of microorganisms with a particular focus on pathogens. We use ARETE to identify important genes (e.g., those that confer antimicrobial resistance or contribute to virulence) and mobile genetic elements such as plasmids and genomic islands, and infer important routes by which these are transmitted using evidence from recombination, coevolution, and phylogenetic tree comparisons. ARETE produces a range of useful outputs (see outputs ), including those generated by each tool integrated into the pipeline, as well as summaries across the entire dataset such as phylogenetic profiles. Outputs from ARETE can also be fed into packages such as Coeus and MicroReact. Although ARETE was primarily developed with pathogens in mind, inference of pan-genomes, mobilomes, and phylogenomic histories can be performed for any set of microbial genomes, with the proviso that reference databases are much more complete for some groups than others! The tools in ARETE work best at the species and genus level of relatedness. A key design principle of ARETE is finding the right choice of software packages and parameter settings to support datasets of different sizes, introducing heuristics and swapping out tools as necessary. ARETE has been benchmarked on datasets ranging in size from fewer than ten to over 10,000 genomes from a multitude of species and genera including Enterococcus faecium , Escherichia coli , Listeria , and Salmonella . Another key principle is letting the user choose which subsets of the pipeline they wish to run; you may already have assembled genomes, or you may not care about, say, recombination detection. There are also cases where it is useful to manually review the outputs from a particular step before moving on to the next one. ARETE makes this easy to do. Table of Contents About the pipeline Quick start A couple of examples Credits Contributing to ARETE Citing ARETE About the pipeline The pipeline is built using Nextflow , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers making installation trivial and results highly reproducible. Like other workflow languages it provides useful features like -resume to only rerun tasks that haven't already been completed (e.g., allowing editing of inputs/tasks and recovery from crashes without a full re-run). The nf-core project provided the overall project template, pre-written software modules when available, and general best practice recommendations. ARETE is organized as a series of subworkflows, each of which executes a different conceptual step of the pipeline. The subworkflow organization provides suitable entry and exit points for users who want to run only a portion of the full pipeline. Genome subsetting: The user can optionally subdivide their set of genomes into lineages as defined by PopPUNK ( See documentation ). PopPUNK quickly subdivides a set of genomes into 'lineages' based on core and accessory genome identity.
If this option is selected, all genomes will still be annotated, but cross-genome comparisons (e.g., pan-genome inference and phylogenomics) will use only a single representative genome. The user can run PopPUNK with a spread of different thresholds and decide how to proceed based on the number of lineages produced. Short-read processing and assembly: Raw Read QC ( FastQC ) Read Trimming ( fastp ) Trimmed Read QC ( FastQC ) Taxonomic Profiling ( kraken2 ) Unicycler ( unicycler ) QUAST QC ( quast ) CheckM QC ( checkm ) Annotation: Genome annotation with Bakta ( bakta ) or Prokka ( prokka ) Feature prediction: AMR genes with the Resistance Gene Identifier ( RGI ) Plasmids with MOB-Suite ( mob_suite ) Genomic Islands with IslandPath ( IslandPath ) Phages with PhiSpy ( PhiSpy ) ( optionally ) Integrons with IntegronFinder Specialized databases: CAZy, VFDB, BacMet and ICEberg2 using DIAMOND homology search ( diamond ) Phylogenomics: ( optionally ) Genome subsetting with PopPUNK ( See documentation ) Pan-genome inference using PPanGGOLiN ( PPanGGOLiN ) or Panaroo ( panaroo ) Reference and gene tree inference using FastTree ( fasttree ) or IQTree ( iqtree ) ( optionally ) SNP-sites ( SNPsites ) Recombination detection: Recombination detection is performed within lineages identified by PopPUNK ( poppunk ). Note that this application of PopPUNK is different from the subsetting described above. Genome alignment using SKA2 ( ska2 ) Recombination detection using Verticall ( verticall ) and/or Gubbins ( gubbins ) Coevolution: Identification of coordinated gain and loss of features using EvolCCM (to add) Lateral gene transfer: Phylogenetic inference of LGT using rSPR (to add) Gene order: Comparison of genomic neighbourhoods using the Gene Order Workflow (to add) See our roadmap for a full list of future development targets. Quick Start Install nextflow Install Docker , Singularity , or, as a last resort, Conda . Also ensure you have a working curl installed (should be present on almost all systems). 2.1. Note: this workflow should also support Podman , Shifter or Charliecloud execution for full pipeline reproducibility. We have minimized reliance on conda and suggest using it only as a last resort (see docs ). Configure mail on your system to send an email on workflow success/failure (without this you may get a small error at the end Failed to invoke workflow.onComplete event handler but this doesn't mean the workflow didn't finish successfully). Download the pipeline and test with a stub-run . The stub-run will ensure that the pipeline is able to download and use containers as well as execute the pipeline logic correctly. nextflow run beiko-lab/ARETE -profile test,<docker/singularity/conda> -stub-run 3.1. Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment. 3.2. If you are using singularity then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. Start running your own analysis (ideally using -profile docker or -profile singularity for stability)!
nextflow run beiko-lab/ARETE \\ -profile <docker/singularity/conda> \\ --input_sample_table samplesheet.csv \\ --poppunk_model bgmm samplesheet.csv must be formatted sample,fastq_1,fastq_2 Note : If you get this error at the end Failed to invoke `workflow.onComplete` event handler it isn't a problem, it just means you don't have a sendmail configured and it can't send an email report saying it finished correctly, i.e., it's not that the workflow failed. See usage docs for all of the available options when running the pipeline. See the parameter docs for a list of all params currently implemented in the pipeline and which ones are required. Testing To test the workflow on a minimal dataset you can use the test configuration (with either docker, conda, or singularity - replace docker below as appropriate): nextflow run beiko-lab/ARETE -profile test,docker Due to the download speed of the Kraken2, Bakta and CAZy databases this will take ~35 minutes. However, to accelerate it, you can download/cache the database files to a folder (e.g., test/db_cache ) and provide a database cache parameter, as well as set --bakta_db to the directory containing the Bakta database. nextflow run beiko-lab/ARETE \\ -profile test,docker \\ --db_cache $PWD/test/db_cache \\ --bakta_db $PWD/baktadb/db-light Examples The fine details of how to run ARETE are described in the command reference and documentation, but here are a couple of illustrative examples: Assembly, annotation, and pan-genome inference from a modestly sized dataset (50 or so genomes) from paired-end reads nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --annotation_tools 'mobsuite,rgi,vfdb,bacmet,islandpath,phispy,report' \\ --poppunk_model bgmm \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --annotation_tools - Select the annotation tools and modules to be executed (See the parameter documentation for defaults) --poppunk_model - Model to be used by PopPUNK -profile docker - Run tools in docker containers. Annotation to evolutionary dynamics on 300-ish genomes nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --run_recombination \\ --run_gubbins \\ --use_ppanggolin \\ -entry annotation \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --run_gubbins - Run Gubbins as part of the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile docker - Run tools in docker containers. Annotation to evolutionary dynamics on 10,000 genomes nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --use_ppanggolin \\ --run_recombination \\ --enable_subsetting \\ -entry annotation \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. --enable_subsetting - Enable subsetting workflow based on genome similarity (See subsetting documentation ) -entry annotation - Run annotation subworkflow and further steps (See usage ).
-profile docker - Run tools in docker containers. Credits The ARETE software was originally written and developed by Finlay Maguire and Alex Manuele , and is currently developed by Jo\u00e3o Cavalcante . Rob Beiko is the PI of the ARETE project. The project Co-PI is Fiona Brinkman. Other project leads include Andrew MacArthur, Cedric Chauve, Chris Whidden, Gary van Domselaar, John Nash, Rahat Zaheer, and Tim McAllister. Many students, postdocs, developers, and staff scientists have made invaluable contributions to the design and application of ARETE and its components, including Haley Sanderson, Kristen Gray, Julia Lewandowski, Chaoyue Liu, Kartik Kakadiya, Bryan Alcock, Amos Raphenya, Amjad Khan, Ryan Fink, Aniket Mane, Chandana Navanekere Rudrappa, Kyrylo Bessonov, James Robertson, Jee In Kim, and Nolan Woods. ARETE development has been supported by many sources, including Genome Canada, ResearchNS, Genome Atlantic, Genome British Columbia, The Canadian Institutes for Health Research, The Natural Sciences and Engineering Research Council of Canada, and Dalhousie University's Faculty of Computer Science. We have received tremendous support from federal agencies, most notably the Public Health Agency of Canada and Agriculture / Agri-Food Canada. Contributing to ARETE Thank you for your interest in contributing to ARETE. We are currently in the process of formalizing contribution guidelines. In the meantime, please feel free to open an issue describing your suggested changes. Citing ARETE Please cite the tools used in your ARETE run: A comprehensive list can be found in the CITATIONS.md file. An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038. This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, and reused here under the MIT license . The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.","title":"Home"},{"location":"#what-is-arete","text":"ARETE (Antimicrobial Resistance: Emergence, Transmission, and Ecology) is a bioinformatics best-practice analysis pipeline for profiling the genomic repertoire and evolutionary dynamics of microorganisms with a particular focus on pathogens. We use ARETE to identify important genes (e.g., those that confer antimicrobial resistance or contribute to virulence) and mobile genetic elements such as plasmids and genomic islands, and infer important routes by which these are transmitted using evidence from recombination, coevolution, and phylogenetic tree comparisons. ARETE produces a range of useful outputs (see outputs ), including those generated by each tool integrated into the pipeline, as well as summaries across the entire dataset such as phylogenetic profiles. Outputs from ARETE can also be fed into packages such as Coeus and MicroReact.
Although ARETE was primarily developed with pathogens in mind, inference of pan-genomes, mobilomes, and phylogenomic histories can be performed for any set of microbial genomes, with the proviso that reference databases are much more complete for some groups than others! The tools in ARETE work best at the species and genus level of relatedness. A key design principle of ARETE is finding the right choice of software packages and parameter settings to support datasets of different sizes, introducing heuristics and swapping out tools as necessary. ARETE has been benchmarked on datasets ranging in size from fewer than ten to over 10,000 genomes from a multitude of species and genera including Enterococcus faecium , Escherichia coli , Listeria , and Salmonella . Another key principle is letting the user choose which subsets of the pipeline they wish to run; you may already have assembled genomes, or you may not care about, say, recombination detection. There are also cases where it is useful to manually review the outputs from a particular step before moving on to the next one. ARETE makes this easy to do.","title":"What is ARETE?"},{"location":"#table-of-contents","text":"About the pipeline Quick start A couple of examples Credits Contributing to ARETE Citing ARETE","title":"Table of Contents"},{"location":"#about-the-pipeline","text":"The pipeline is built using Nextflow , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers making installation trivial and results highly reproducible. Like other workflow languages it provides useful features like -resume to only rerun tasks that haven't already been completed (e.g., allowing editing of inputs/tasks and recovery from crashes without a full re-run). The nf-core project provided the overall project template, pre-written software modules when available, and general best practice recommendations. ARETE is organized as a series of subworkflows, each of which executes a different conceptual step of the pipeline. The subworkflow organization provides suitable entry and exit points for users who want to run only a portion of the full pipeline. Genome subsetting: The user can optionally subdivide their set of genomes into lineages as defined by PopPUNK ( See documentation ). PopPUNK quickly subdivides a set of genomes into 'lineages' based on core and accessory genome identity. If this option is selected, all genomes will still be annotated, but cross-genome comparisons (e.g., pan-genome inference and phylogenomics) will use only a single representative genome. The user can run PopPUNK with a spread of different thresholds and decide how to proceed based on the number of lineages produced.
Short-read processing and assembly: Raw Read QC ( FastQC ) Read Trimming ( fastp ) Trimmed Read QC ( FastQC ) Taxonomic Profiling ( kraken2 ) Unicycler ( unicycler ) QUAST QC ( quast ) CheckM QC ( checkm ) Annotation: Genome annotation with Bakta ( bakta ) or Prokka ( prokka ) Feature prediction: AMR genes with the Resistance Gene Identifier ( RGI ) Plasmids with MOB-Suite ( mob_suite ) Genomic Islands with IslandPath ( IslandPath ) Phages with PhiSpy ( PhiSpy ) ( optionally ) Integrons with IntegronFinder Specialized databases: CAZy, VFDB, BacMet and ICEberg2 using DIAMOND homology search ( diamond ) Phylogenomics: ( optionally ) Genome subsetting with PopPUNK ( See documentation ) Pan-genome inference using PPanGGOLiN ( PPanGGOLiN ) or Panaroo ( panaroo ) Reference and gene tree inference using FastTree ( fasttree ) or IQTree ( iqtree ) ( optionally ) SNP-sites ( SNPsites ) Recombination detection: Recombination detection is performed within lineages identified by PopPUNK ( poppunk ). Note that this application of PopPUNK is different from the subsetting described above. Genome alignment using SKA2 ( ska2 ) Recombination detection using Verticall ( verticall ) and/or Gubbins ( gubbins ) Coevolution: Identification of coordinated gain and loss of features using EvolCCM (to add) Lateral gene transfer: Phylogenetic inference of LGT using rSPR (to add) Gene order: Comparison of genomic neighbourhoods using the Gene Order Workflow (to add) See our roadmap for a full list of future development targets.","title":"About the pipeline "},{"location":"#quick-start","text":"Install nextflow Install Docker , Singularity , or, as a last resort, Conda . Also ensure you have a working curl installed (should be present on almost all systems). 2.1. Note: this workflow should also support Podman , Shifter or Charliecloud execution for full pipeline reproducibility. We have minimized reliance on conda and suggest using it only as a last resort (see docs ). Configure mail on your system to send an email on workflow success/failure (without this you may get a small error at the end Failed to invoke workflow.onComplete event handler but this doesn't mean the workflow didn't finish successfully). Download the pipeline and test with a stub-run . The stub-run will ensure that the pipeline is able to download and use containers as well as execute the pipeline logic correctly. nextflow run beiko-lab/ARETE -profile test,<docker/singularity/conda> -stub-run 3.1. Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile <institute> in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment. 3.2. If you are using singularity then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. Start running your own analysis (ideally using -profile docker or -profile singularity for stability)!
nextflow run beiko-lab/ARETE \\ -profile <docker/singularity/conda> \\ --input_sample_table samplesheet.csv \\ --poppunk_model bgmm samplesheet.csv must be formatted sample,fastq_1,fastq_2 Note : If you get this error at the end Failed to invoke `workflow.onComplete` event handler it isn't a problem, it just means you don't have a sendmail configured and it can't send an email report saying it finished correctly, i.e., it's not that the workflow failed. See usage docs for all of the available options when running the pipeline. See the parameter docs for a list of all params currently implemented in the pipeline and which ones are required.","title":"Quick Start "},{"location":"#testing","text":"To test the workflow on a minimal dataset you can use the test configuration (with either docker, conda, or singularity - replace docker below as appropriate): nextflow run beiko-lab/ARETE -profile test,docker Due to the download speed of the Kraken2, Bakta and CAZy databases this will take ~35 minutes. However, to accelerate it, you can download/cache the database files to a folder (e.g., test/db_cache ) and provide a database cache parameter, as well as set --bakta_db to the directory containing the Bakta database. nextflow run beiko-lab/ARETE \\ -profile test,docker \\ --db_cache $PWD/test/db_cache \\ --bakta_db $PWD/baktadb/db-light","title":"Testing"},{"location":"#examples","text":"The fine details of how to run ARETE are described in the command reference and documentation, but here are a couple of illustrative examples:","title":"Examples "},{"location":"#assembly-annotation-and-pan-genome-inference-from-a-modestly-sized-dataset-50-or-so-genomes-from-paired-end-reads","text":"nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --annotation_tools 'mobsuite,rgi,vfdb,bacmet,islandpath,phispy,report' \\ --poppunk_model bgmm \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --annotation_tools - Select the annotation tools and modules to be executed (See the parameter documentation for defaults) --poppunk_model - Model to be used by PopPUNK -profile docker - Run tools in docker containers.","title":"Assembly, annotation, and pan-genome inference from a modestly sized dataset (50 or so genomes) from paired-end reads"},{"location":"#annotation-to-evolutionary-dynamics-on-300-ish-genomes","text":"nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --run_recombination \\ --run_gubbins \\ --use_ppanggolin \\ -entry annotation \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --run_gubbins - Run Gubbins as part of the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile docker - Run tools in docker containers.","title":"Annotation to evolutionary dynamics on 300-ish genomes"},{"location":"#annotation-to-evolutionary-dynamics-on-10000-genomes","text":"nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --use_ppanggolin \\ --run_recombination \\ --enable_subsetting \\ -entry annotation \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK .
--run_recombination - Run the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. --enable_subsetting - Enable subsetting workflow based on genome similarity (See subsetting documentation ) -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile docker - Run tools in docker containers.","title":"Annotation to evolutionary dynamics on 10,000 genomes"},{"location":"#credits","text":"The ARETE software was originally written and developed by Finlay Maguire and Alex Manuele , and is currently developed by Jo\u00e3o Cavalcante . Rob Beiko is the PI of the ARETE project. The project Co-PI is Fiona Brinkman. Other project leads include Andrew MacArthur, Cedric Chauve, Chris Whidden, Gary van Domselaar, John Nash, Rahat Zaheer, and Tim McAllister. Many students, postdocs, developers, and staff scientists have made invaluable contributions to the design and application of ARETE and its components, including Haley Sanderson, Kristen Gray, Julia Lewandowski, Chaoyue Liu, Kartik Kakadiya, Bryan Alcock, Amos Raphenya, Amjad Khan, Ryan Fink, Aniket Mane, Chandana Navanekere Rudrappa, Kyrylo Bessonov, James Robertson, Jee In Kim, and Nolan Woods. ARETE development has been supported by many sources, including Genome Canada, ResearchNS, Genome Atlantic, Genome British Columbia, The Canadian Institutes for Health Research, The Natural Sciences and Engineering Research Council of Canada, and Dalhousie University's Faculty of Computer Science. We have received tremendous support from federal agencies, most notably the Public Health Agency of Canada and Agriculture / Agri-Food Canada.","title":"Credits "},{"location":"#contributing-to-arete","text":"Thank you for your interest in contributing to ARETE. We are currently in the process of formalizing contribution guidelines. In the meantime, please feel free to open an issue describing your suggested changes.","title":"Contributing to ARETE "},{"location":"#citing-arete","text":"Please cite the tools used in your ARETE run: A comprehensive list can be found in the CITATIONS.md file. An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038. This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, and reused here under the MIT license . The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.","title":"Citing ARETE "},{"location":"CITATIONS/","text":"beiko-lab/ARETE: Citations nf-core Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.
Nextflow Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311. Pipeline tools CheckM Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes . Genome Research, 25: 1043\u20131055. DIAMOND Buchfink B, Xie C, Huson DH Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 12, 59\u201360 (2015) FastQC FastP Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281. FastTree Morgan N. Price, Paramvir S. Dehal, Adam P. Arkin, FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix, Molecular Biology and Evolution, Volume 26, Issue 7, July 2009, Pages 1641\u20131650, https://doi.org/10.1093/molbev/msp077 IQ-TREE2 Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020 May 1;37(5):1530-1534. doi: 10.1093/molbev/msaa015. Erratum in: Mol Biol Evol. 2020 Aug 1;37(8):2461. PMID: 32011700; PMCID: PMC7182206. Kraken2 Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. doi: 10.1186/s13059-019-1891-0. MOB-SUITE Robertson, James, and John H E Nash. \u201cMOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies.\u201d Microbial genomics vol. 4,8 (2018): e000206. doi:10.1099/mgen.0.000206 Robertson, James et al. \u201cUniversal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance.\u201d Microbial genomics vol. 6,10 (2020): mgen000435. doi:10.1099/mgen.0.000435 MultiQC Ewels P, Magnusson M, Lundin S, K\u00e4ller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924. Bakta Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685 Prokka Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. doi: 10.1093/bioinformatics/btu153. Epub 2014 Mar 18. PMID: 24642063. QUAST Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806. RGI Alcock et al. 2020. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, Volume 48, Issue D1, Pages D517-525 [PMID 31665441] Panaroo Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21, 180 (2020). https://doi.org/10.1186/s13059-020-02090-4 PopPUNK Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. 
Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 2019 Feb;29(2):304-316. doi: 10.1101/gr.241455.118. Epub 2019 Jan 24. PMID: 30679308; PMCID: PMC6360808. SNP-sites Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom. 2016 Apr 29;2(4):e000056. doi: 10.1099/mgen.0.000056. PMID: 28348851; PMCID: PMC5320690. Unicycler Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017 Jun 8;13(6):e1005595. doi: 10.1371/journal.pcbi.1005595. PMID: 28594827; PMCID: PMC5481147. IslandPath Claire Bertelli, Fiona S L Brinkman, Improved genomic island predictions with IslandPath-DIMOB, Bioinformatics, Volume 34, Issue 13, 01 July 2018, Pages 2161\u20132167, https://doi.org/10.1093/bioinformatics/bty095 PhiSpy Sajia Akhter, Ramy K. Aziz, Robert A. Edwards; PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucl Acids Res 2012; 40 (16): e126. doi: 10.1093/nar/gks406 Software packaging/containerisation tools Anaconda Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web. Bioconda Gr\u00fcning B, Dale R, Sj\u00f6din A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, K\u00f6ster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506. BioContainers da Veiga Leprevost F, Gr\u00fcning B, Aflitos SA, R\u00f6st HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671. Docker Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 239, Article 2 (March 2014). Singularity Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.","title":"Citations"},{"location":"CITATIONS/#beiko-labarete-citations","text":"","title":"beiko-lab/ARETE: Citations"},{"location":"CITATIONS/#nf-core","text":"Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.","title":"nf-core"},{"location":"CITATIONS/#nextflow","text":"Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.","title":"Nextflow"},{"location":"CITATIONS/#pipeline-tools","text":"CheckM Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes . Genome Research, 25: 1043\u20131055. 
DIAMOND Buchfink B, Xie C, Huson DH Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 12, 59\u201360 (2015) FastQC FastP Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281. FastTree Morgan N. Price, Paramvir S. Dehal, Adam P. Arkin, FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix, Molecular Biology and Evolution, Volume 26, Issue 7, July 2009, Pages 1641\u20131650, https://doi.org/10.1093/molbev/msp077 IQ-TREE2 Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020 May 1;37(5):1530-1534. doi: 10.1093/molbev/msaa015. Erratum in: Mol Biol Evol. 2020 Aug 1;37(8):2461. PMID: 32011700; PMCID: PMC7182206. Kraken2 Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. doi: 10.1186/s13059-019-1891-0. MOB-SUITE Robertson, James, and John H E Nash. \u201cMOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies.\u201d Microbial genomics vol. 4,8 (2018): e000206. doi:10.1099/mgen.0.000206 Robertson, James et al. \u201cUniversal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance.\u201d Microbial genomics vol. 6,10 (2020): mgen000435. doi:10.1099/mgen.0.000435 MultiQC Ewels P, Magnusson M, Lundin S, K\u00e4ller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924. Bakta Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685 Prokka Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. doi: 10.1093/bioinformatics/btu153. Epub 2014 Mar 18. PMID: 24642063. QUAST Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806. RGI Alcock et al. 2020. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, Volume 48, Issue D1, Pages D517-525 [PMID 31665441] Panaroo Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21, 180 (2020). https://doi.org/10.1186/s13059-020-02090-4 PopPUNK Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 2019 Feb;29(2):304-316. doi: 10.1101/gr.241455.118. Epub 2019 Jan 24. PMID: 30679308; PMCID: PMC6360808. SNP-sites Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom. 2016 Apr 29;2(4):e000056. doi: 10.1099/mgen.0.000056. PMID: 28348851; PMCID: PMC5320690. 
Unicycler Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017 Jun 8;13(6):e1005595. doi: 10.1371/journal.pcbi.1005595. PMID: 28594827; PMCID: PMC5481147. IslandPath Claire Bertelli, Fiona S L Brinkman, Improved genomic island predictions with IslandPath-DIMOB, Bioinformatics, Volume 34, Issue 13, 01 July 2018, Pages 2161\u20132167, https://doi.org/10.1093/bioinformatics/bty095 PhiSpy Sajia Akhter, Ramy K. Aziz, Robert A. Edwards; PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucl Acids Res 2012; 40 (16): e126. doi: 10.1093/nar/gks406","title":"Pipeline tools"},{"location":"CITATIONS/#software-packagingcontainerisation-tools","text":"Anaconda Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web. Bioconda Gr\u00fcning B, Dale R, Sj\u00f6din A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, K\u00f6ster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506. BioContainers da Veiga Leprevost F, Gr\u00fcning B, Aflitos SA, R\u00f6st HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671. Docker Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 239, Article 2 (March 2014). Singularity Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.","title":"Software packaging/containerisation tools"},{"location":"ROADMAP/","text":"A list in no particular order of outstanding development features, both in-progress and planned: Sensible default QC parameters to allow automated end-to-end execution with little-to-no required user intervention Integration of additional tools and scripts: Phylogenetic inference of lateral gene transfer events using rspr Inference of concerted gain and loss of genes and mobile genetic elements using the Community Coevolution Model Partner applications for analysis and visualization of phylogenetic distributions of genes and MGEs and gene-order clustering (For example, Coeus ).","title":"Roadmap"},{"location":"faq/","text":"Frequently Asked Questions How do I run ARETE in a Slurm HPC environment? Set a config file under ~/.nextflow/config to use the slurm executor: process { executor = 'slurm' pollInterval = '60 sec' submitRateLimit = '60/1min' queueSize = 100 // If an account is necessary: clusterOptions = '--account= ' } See the Nextflow documentation for a description of these options. Now, when running ARETE, you'll need to set additional options if your compute nodes don't have network access - as is common for most Slurm clusters. The example below uses the default test data, i.e. the test profile, for demonstration purposes only. 
nextflow run beiko-lab/ARETE \\ --db_cache path/to/db_cache \\ --bakta_db path/to/baktadb \\ -profile test,singularity Apart from -profile singularity , which just makes ARETE use Singularity/Apptainer containers for running the tools, there are two additional parameters: --db_cache should be the location for the pre-downloaded databases used in the DIAMOND alignments (i.e. Bacmet, VFDB, ICEberg2 and CAZy FASTA files) and in the Kraken2 taxonomic read classification. Although these tools run by default, you can change the selection of annotation tools by changing --annotation_tools and skip Kraken2 by adding --skip_kraken . See the parameter documentation for a full list of parameters and their defaults. --bakta_db should be the location of the pre-downloaded Bakta database Alternatively, you can use Prokka for annotating your assemblies, since it doesn't require a downloaded database ( --use_prokka ).","title":"FAQ"},{"location":"faq/#frequently-asked-questions","text":"","title":"Frequently Asked Questions"},{"location":"faq/#how-do-i-run-arete-in-a-slurm-hpc-environment","text":"Set a config file under ~/.nextflow/config to use the slurm executor: process { executor = 'slurm' pollInterval = '60 sec' submitRateLimit = '60/1min' queueSize = 100 // If an account is necessary: clusterOptions = '--account= ' } See the Nextflow documentation for a description of these options. Now, when running ARETE, you'll need to set additional options if your compute nodes don't have network access - as is common for most Slurm clusters. The example below uses the default test data, i.e. the test profile, for demonstration purposes only. nextflow run beiko-lab/ARETE \\ --db_cache path/to/db_cache \\ --bakta_db path/to/baktadb \\ -profile test,singularity Apart from -profile singularity , which just makes ARETE use Singularity/Apptainer containers for running the tools, there are two additional parameters: --db_cache should be the location for the pre-downloaded databases used in the DIAMOND alignments (i.e. Bacmet, VFDB, ICEberg2 and CAZy FASTA files) and in the Kraken2 taxonomic read classification. Although these tools run by default, you can change the selection of annotation tools by changing --annotation_tools and skip Kraken2 by adding --skip_kraken . See the parameter documentation for a full list of parameters and their defaults. --bakta_db should be the location of the pre-downloaded Bakta database Alternatively, you can use Prokka for annotating your assemblies, since it doesn't require a downloaded database ( --use_prokka ).","title":"How do I run ARETE in a Slurm HPC environment?"},{"location":"output/","text":"beiko-lab/ARETE: Output Introduction The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. Pipeline overview The pipeline is built using Nextflow and processes data using the following steps (steps in italics don't run by default): Short-read processing and assembly FastQC - Raw and trimmed read QC FastP - Read trimming Kraken2 - Taxonomic assignment Unicycler - Short read assembly Quast - Assembly quality score Annotation Bakta or Prokka - Gene detection and annotation MobRecon - Reconstruction and typing of plasmids RGI - Detection and annotation of AMR determinants IslandPath - Predicts genomic islands in bacterial and archaeal genomes. 
PhiSpy - Prediction of prophages from bacterial genomes IntegronFinder - Finds integrons in DNA sequences Diamond - Detection and annotation of genes using external databases. CAZy: Carbohydrate metabolism VFDB: Virulence factors BacMet: Metal resistance determinants ICEberg: Integrative and conjugative elements PopPUNK Subworkflow PopPUNK - Genome clustering Recombination Verticall - Conduct pairwise assembly comparisons between genomes in the same PopPUNK cluster SKA2 - Generate a whole-genome FASTA alignment for each genome within a cluster. Gubbins - Detection of recombination events within genomes of the same cluster. Phylogenomics and Pangenomics Panaroo or PPanGGoLiN - Pangenome alignment FastTree or IQTree - Maximum likelihood core genome phylogenetic tree SNPsites - Extracts SNPs from a multi-FASTA alignment Pipeline information Report metrics generated during the workflow execution MultiQC - Aggregate report describing results and QC from the whole pipeline Assembly FastQC read_processing/*_fastqc/ *_fastqc.html : FastQC report containing quality metrics for your untrimmed raw fastq files. *_fastqc.zip : Zip archive containing the FastQC report, tab-delimited data file and plot images. NB: The FastQC plots in this directory are generated relative to the raw input reads. They may contain adapter sequence and regions of low quality. To see how your reads look after adapter and quality trimming please refer to the FastQC reports in the trimgalore/fastqc/ directory. FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages . NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality. fastp read_processing/fastp/ ${meta.id} : Trimmed files and trimming reports for each input sample. fastp is an all-in-one fastq preprocessor for read/adapter trimming and quality control. It is used in this pipeline for trimming adapter sequences and discarding low-quality reads. Kraken2 read_processing/kraken2/ *.kraken2.report.txt : Text file containing genome-wise information of Kraken2 findings. See here for details. *.classified(_(1|2))?.fastq.gz : FASTQ file containing classified reads. If paired-end, one file per end. *.unclassified(_(1|2))?.fastq.gz : FASTQ file containing unclassified reads. If paired-end, one file per end. Kraken2 is read classification software which will assign taxonomy to each read comprising a sample. These results may be analyzed as an indicator of contamination. Unicycler assembly/unicycler/ *.assembly.gfa *.scaffolds.fa *.unicycler.log Short/hybrid read assembler. For now only handles short reads in ARETE. Quast assembly/quast/ report.tsv : A tab-separated report compiling all QC metrics recorded over all genomes quast/ report.(html|tex|pdf|tsv|txt) : The Quast report in different file formats transposed_report.(tsv|txt) : Transpose of the Quast report quast.log : Log file of all Quast runs icarus_viewers/ contig_size_viewer.html basic_stats/ : Directory containing various summary plots generated by Quast. Annotation Bakta annotation/bakta/ ${sample_id}/ : Bakta results will be in one directory per genome.
${sample_id}.tsv : annotations as simple human-readable TSV ${sample_id}.gff3 : annotations & sequences in GFF3 format ${sample_id}.gbff : annotations & sequences in (multi) GenBank format ${sample_id}.embl : annotations & sequences in (multi) EMBL format ${sample_id}.fna : replicon/contig DNA sequences as FASTA ${sample_id}.ffn : feature nucleotide sequences as FASTA ${sample_id}.faa : CDS/sORF amino acid sequences as FASTA ${sample_id}.hypotheticals.tsv : further information on hypothetical protein CDS as simple human-readable tab-separated values ${sample_id}.hypotheticals.faa : hypothetical protein CDS amino acid sequences as FASTA ${sample_id}.txt : summary as TXT ${sample_id}.png : circular genome annotation plot as PNG ${sample_id}.svg : circular genome annotation plot as SVG Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs Prokka annotation/prokka/ ${sample_id}/ : Prokka results will be in one directory per genome. ${sample_id}.err : Unacceptable annotations ${sample_id}.faa : Protein FASTA file of translated CDS sequences ${sample_id}.ffn : Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) ${sample_id}.fna : Nucleotide FASTA file of input contig sequences ${sample_id}.fsa : Nucleotide FASTA file of the input contig sequences, used by \"tbl2asn\" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. ${sample_id}.gff : This is the master annotation in GFF3 format, containing both sequences and annotations. ${sample_id}.gbk : This is a standard Genbank file derived from the master .gff. ${sample_id}.log : Contains all the output that Prokka produced during its run. This is a record of what settings were used, even if the --quiet option was enabled. ${sample_id}.sqn : An ASN1 format \"Sequin\" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. ${sample_id}.tbl : Feature Table file, used by \"tbl2asn\" to create the .sqn file. ${sample_id}.tsv : Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product ${sample_id}.txt : Statistics relating to the annotated features found. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files. RGI annotation/rgi/ ${sample_id}_rgi.txt : A TSV report containing all AMR predictions for a given genome. For more info see here RGI predicts AMR determinants using the CARD ontology and various trained models. MobRecon annotation/mob_recon ${sample_id}_mob_recon/ : MobRecon results will be in one directory per genome. contig_report.txt - This file describes the assignment of the contig to chromosome or a particular plasmid grouping. mge.report.txt - Blast HSP of detected MGEs/repetitive elements with contextual information. chromosome.fasta - Fasta file of all contigs found to belong to the chromosome. plasmid_*.fasta - Each plasmid group is written to an individual fasta file which contains the assigned contigs. mobtyper_results - Aggregate MOB-typer report files for all identified plasmids.
MobRecon reconstructs individual plasmid sequences from draft genome assemblies using the clustered plasmid reference databases DIAMOND annotation/(vfdb|bacmet|cazy|iceberg2)/ ${sample_id}/${sample_id}_(VFDB|BACMET|CAZYDB|ICEberg2).txt : Blast6 formatted TSVs indicating BlastX results of the genes from each genome against VFDB, BacMet, and CAZy databases. (VFDB|BACMET|CAZYDB|ICEberg2).txt : Table with all hits to this database, with a column describing which genome the match originates from. Sorted and filtered by the match's coverage. Diamond is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data. We use DIAMOND to predict the presence of virulence factors, heavy metal resistance determinants, carbohydrate-active enzymes, and integrative and conjugative elements using VFDB , BacMet , CAZy , and ICEberg2 respectively. IslandPath annotation/islandpath/ ${sample_id}/ : IslandPath results will be in one directory per genome. ${sample_id}.tsv : IslandPath results Dimob.log : IslandPath execution log IslandPath is standalone software to predict genomic islands in bacterial and archaeal genomes based on the presence of dinucleotide biases and mobility genes. IntegronFinder Disabled by default. Enable by adding --run_integronfinder to your command. annotation/integron_finder/ Results_Integron_Finder_${sample_id}/ : IntegronFinder results will be in one directory per genome. Integron Finder is a bioinformatics tool to find integrons in bacterial genomes. PhiSpy annotation/phispy/ ${sample_id}/ : PhiSpy results will be in one directory per genome. See the PhiSpy documentation for an extensive description of the output. PhiSpy is a tool for identification of prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome it will use several approaches to identify the most likely prophage regions. PopPUNK poppunk_results/ poppunk_${poppunk_model}/ - Results from PopPUNK's fit-model command poppunk_visualizations/ - Results from the poppunk_visualise command PopPUNK is a tool for clustering genomes. Recombination Verticall recombination/verticall/ verticall_cluster*.tsv - Verticall results for the genomes within this PopPUNK cluster. Verticall is a tool to help produce bacterial genome phylogenies which are not influenced by horizontally acquired sequences such as recombination. SKA2 recombination/ska2/ cluster_*.aln - SKA2 results for the genomes within this PopPUNK cluster. SKA2 (Split Kmer Analysis) is a toolkit for prokaryotic (and any other small, haploid) DNA sequence analysis using split kmers. Gubbins recombination/gubbins/ cluster_*/ - Gubbins results for the genomes within this PopPUNK cluster. Gubbins is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions. Phylogenomics and Pangenomics Panaroo pangenomics/panaroo/results/ See the panaroo documentation for an extensive description of output provided. Panaroo is a Bacterial Pangenome Analysis Pipeline. PPanGGoLiN pangenomics/ppanggolin/ See the PPanGGoLiN documentation for an extensive description of output provided. PPanGGoLiN is a tool to build a partitioned pangenome graph from microbial genomes FastTree phylogenomics/fasttree/ *.tre : Newick formatted maximum likelihood tree of core-genome alignment.
FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences IQTree phylogenomics/iqtree/ *.treefile : Newick formatted maximum likelihood tree of core-genome alignment. IQTree is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood. SNPsites phylogenomics/snpsites/ filtered_alignment.fas : Variant fasta file. constant.sites.txt : Text file containing counts of constant sites. SNPsites is a tool to rapidly extract SNPs from a multi-FASTA alignment. Pipeline information pipeline_info/ Reports generated by Nextflow: execution_report.html , execution_timeline.html , execution_trace.txt and pipeline_dag.dot / pipeline_dag.svg . Reports generated by the pipeline: pipeline_report.html , pipeline_report.txt and software_versions.csv . Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv . Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. MultiQC multiqc/ multiqc_report.html : a standalone HTML file that can be viewed in your web browser. multiqc_data/ : directory containing parsed statistics from the different tools used in the pipeline. multiqc_plots/ : directory containing static images from the report in various formats. MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info .","title":"Output"},{"location":"output/#beiko-labarete-output","text":"","title":"beiko-lab/ARETE: Output"},{"location":"output/#introduction","text":"The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.","title":"Introduction"},{"location":"output/#pipeline-overview","text":"The pipeline is built using Nextflow and processes data using the following steps (steps in italics don't run by default): Short-read processing and assembly FastQC - Raw and trimmed read QC FastP - Read trimming Kraken2 - Taxonomic assignment Unicycler - Short read assembly Quast - Assembly quality score Annotation Bakta or Prokka - Gene detection and annotation MobRecon - Reconstruction and typing of plasmids RGI - Detection and annotation of AMR determinants IslandPath - Predicts genomic islands in bacterial and archaeal genomes. PhiSpy - Prediction of prophages from bacterial genomes IntegronFinder - Finds integrons in DNA sequences Diamond - Detection and annotation of genes using external databases. CAZy: Carbohydrate metabolism VFDB: Virulence factors BacMet: Metal resistance determinants ICEberg: Integrative and conjugative elements PopPUNK Subworkflow PopPUNK - Genome clustering Recombination Verticall - Conduct pairwise assembly comparisons between genomes in the same PopPUNK cluster SKA2 - Generate a whole-genome FASTA alignment for each genome within a cluster.
Gubbins - Detection of recombination events within genomes of the same cluster. Phylogenomics and Pangenomics Panaroo or PPanGGoLiN - Pangenome alignment FastTree or IQTree - Maximum likelihood core genome phylogenetic tree SNPsites - Extracts SNPs from a multi-FASTA alignment Pipeline information Report metrics generated during the workflow execution MultiQC - Aggregate report describing results and QC from the whole pipeline","title":"Pipeline overview"},{"location":"output/#assembly","text":"","title":"Assembly"},{"location":"output/#fastqc","text":"read_processing/*_fastqc/ *_fastqc.html : FastQC report containing quality metrics for your untrimmed raw fastq files. *_fastqc.zip : Zip archive containing the FastQC report, tab-delimited data file and plot images. NB: The FastQC plots in this directory are generated relative to the raw input reads. They may contain adapter sequence and regions of low quality. To see how your reads look after adapter and quality trimming please refer to the FastQC reports in the trimgalore/fastqc/ directory. FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages . NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality.","title":"FastQC"},{"location":"output/#fastp","text":"read_processing/fastp/ ${meta.id} : Trimmed files and trimming reports for each input sample. fastp is an all-in-one FASTQ preprocessor for read/adapter trimming and quality control. It is used in this pipeline to trim adapter sequences and discard low-quality reads.","title":"fastp"},{"location":"output/#kraken2","text":"read_processing/kraken2/ *.kraken2.report.txt : Text file containing genome-wise information of Kraken2 findings. See here for details. *.classified(_(1|2))?.fastq.gz : FastQ file containing classified reads. If paired-end, one file per end. *.unclassified(_(1|2))?.fastq.gz : FastQ file containing unclassified reads. If paired-end, one file per end. Kraken2 is read classification software which assigns a taxonomic label to each read in a sample. These results may be analyzed as an indicator of contamination.","title":"Kraken2"},{"location":"output/#unicycler","text":"assembly/unicycler/ *.assembly.gfa *.scaffolds.fa *.unicycler.log Short/hybrid read assembler. For now it only handles short reads in ARETE.","title":"Unicycler"},{"location":"output/#quast","text":"assembly/quast/ report.tsv : A tab-separated report compiling all QC metrics recorded over all genomes quast/ report.(html|tex|pdf|tsv|txt) : The Quast report in different file formats transposed_report.(tsv|txt) : Transpose of the Quast report quast.log : Log file of all Quast runs icarus_viewers/ contig_size_viewer.html basic_stats/ : Directory containing various summary plots generated by Quast.","title":"Quast"},{"location":"output/#annotation","text":"","title":"Annotation"},{"location":"output/#bakta","text":"annotation/bakta/ ${sample_id}/ : Bakta results will be in one directory per genome. 
${sample_id}.tsv : annotations as a simple, human-readable TSV ${sample_id}.gff3 : annotations & sequences in GFF3 format ${sample_id}.gbff : annotations & sequences in (multi) GenBank format ${sample_id}.embl : annotations & sequences in (multi) EMBL format ${sample_id}.fna : replicon/contig DNA sequences as FASTA ${sample_id}.ffn : feature nucleotide sequences as FASTA ${sample_id}.faa : CDS/sORF amino acid sequences as FASTA ${sample_id}.hypotheticals.tsv : further information on hypothetical protein CDS as simple, human-readable tab-separated values ${sample_id}.hypotheticals.faa : hypothetical protein CDS amino acid sequences as FASTA ${sample_id}.txt : summary as TXT ${sample_id}.png : circular genome annotation plot as PNG ${sample_id}.svg : circular genome annotation plot as SVG Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs.","title":"Bakta"},{"location":"output/#prokka","text":"annotation/prokka/ ${sample_id}/ : Prokka results will be in one directory per genome. ${sample_id}.err : Unacceptable annotations ${sample_id}.faa : Protein FASTA file of translated CDS sequences ${sample_id}.ffn : Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) ${sample_id}.fna : Nucleotide FASTA file of input contig sequences ${sample_id}.fsa : Nucleotide FASTA file of the input contig sequences, used by \"tbl2asn\" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. ${sample_id}.gff : This is the master annotation in GFF3 format, containing both sequences and annotations. ${sample_id}.gbk : This is a standard Genbank file derived from the master .gff. ${sample_id}.log : Contains all the output that Prokka produced during its run. This is a record of what settings were used, even if the --quiet option was enabled. ${sample_id}.sqn : An ASN1 format \"Sequin\" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. ${sample_id}.tbl : Feature Table file, used by \"tbl2asn\" to create the .sqn file. ${sample_id}.tsv : Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product ${sample_id}.txt : Statistics relating to the annotated features found. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.","title":"Prokka"},{"location":"output/#rgi","text":"annotation/rgi/ ${sample_id}_rgi.txt : A TSV report containing all AMR predictions for a given genome. For more info see here RGI predicts AMR determinants using the CARD ontology and various trained models.","title":"RGI"},{"location":"output/#mobrecon","text":"annotation/mob_recon ${sample_id}_mob_recon/ : MobRecon results will be in one directory per genome. contig_report.txt - This file describes the assignment of each contig to the chromosome or a particular plasmid grouping. mge.report.txt - Blast HSPs of detected MGEs/repetitive elements with contextual information. chromosome.fasta - Fasta file of all contigs found to belong to the chromosome. plasmid_*.fasta - Each plasmid group is written to an individual fasta file which contains the assigned contigs. mobtyper_results - Aggregate MOB-typer report files for all identified plasmids. 
MobRecon reconstructs individual plasmid sequences from draft genome assemblies using the clustered plasmid reference databases.","title":"MobRecon"},{"location":"output/#diamond","text":"annotation/(vfdb|bacmet|cazy|iceberg2)/ ${sample_id}/${sample_id}_(VFDB|BACMET|CAZYDB|ICEberg2).txt : Blast6 formatted TSVs containing BlastX results of the genes from each genome against the VFDB, BacMet, CAZy, and ICEberg2 databases. (VFDB|BACMET|CAZYDB|ICEberg2).txt : Table with all hits to this database, with a column describing which genome the match originates from. Sorted and filtered by the match's coverage. Diamond is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data. We use DIAMOND to predict the presence of virulence factors, heavy metal resistance determinants, carbohydrate-active enzymes, and integrative and conjugative elements using VFDB , BacMet , CAZy , and ICEberg2 , respectively.","title":"DIAMOND"},{"location":"output/#islandpath","text":"annotation/islandpath/ ${sample_id}/ : IslandPath results will be in one directory per genome. ${sample_id}.tsv : IslandPath results Dimob.log : IslandPath execution log IslandPath is standalone software to predict genomic islands in bacterial and archaeal genomes based on the presence of dinucleotide biases and mobility genes.","title":"IslandPath"},{"location":"output/#integronfinder","text":"Disabled by default. Enable by adding --run_integronfinder to your command. annotation/integron_finder/ Results_Integron_Finder_${sample_id}/ : IntegronFinder results will be in one directory per genome. Integron Finder is a bioinformatics tool to find integrons in bacterial genomes.","title":"IntegronFinder"},{"location":"output/#phispy","text":"annotation/phispy/ ${sample_id}/ : PhiSpy results will be in one directory per genome. See the PhiSpy documentation for an extensive description of the output. PhiSpy is a tool for identification of prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome it will use several approaches to identify the most likely prophage regions.","title":"PhiSpy"},{"location":"output/#poppunk","text":"poppunk_results/ poppunk_${poppunk_model}/ - Results from PopPUNK's fit-model command poppunk_visualizations/ - Results from the poppunk_visualise command PopPUNK is a tool for clustering genomes.","title":"PopPUNK"},{"location":"output/#recombination","text":"","title":"Recombination"},{"location":"output/#verticall","text":"recombination/verticall/ verticall_cluster*.tsv - Verticall results for the genomes within this PopPUNK cluster. Verticall is a tool to help produce bacterial genome phylogenies which are not influenced by horizontally acquired sequences such as recombination.","title":"Verticall"},{"location":"output/#ska2","text":"recombination/ska2/ cluster_*.aln - SKA2 results for the genomes within this PopPUNK cluster. SKA2 (Split Kmer Analysis) is a toolkit for prokaryotic (and any other small, haploid) DNA sequence analysis using split kmers.","title":"SKA2"},{"location":"output/#gubbins","text":"recombination/gubbins/ cluster_*/ - Gubbins results for the genomes within this PopPUNK cluster. 
Gubbins is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions.","title":"Gubbins"},{"location":"output/#phylogenomics-and-pangenomics","text":"","title":"Phylogenomics and Pangenomics"},{"location":"output/#panaroo","text":"pangenomics/panaroo/results/ See the panaroo documentation for an extensive description of output provided. Panaroo is a Bacterial Pangenome Analysis Pipeline.","title":"Panaroo"},{"location":"output/#ppanggolin","text":"pangenomics/ppanggolin/ See the PPanGGoLiN documentation for an extensive description of output provided. PPanGGoLiN is a tool to build a partitioned pangenome graph from microbial genomes","title":"PPanGGoLiN"},{"location":"output/#fasttree","text":"phylogenomics/fasttree/ *.tre : Newick formatted maximum likelihood tree of core-genome alignment. FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences","title":"FastTree"},{"location":"output/#iqtree","text":"phylogenomics/iqtree/ *.treefile : Newick formatted maximum likelihood tree of core-genome alignment. IQTree is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood.","title":"IQTree"},{"location":"output/#snpsites","text":"phylogenomics/snpsites/ filtered_alignment.fas : Variant fasta file. constant.sites.txt : Text file containing counts of constant sites. SNPsites is a tool to rapidly extract SNPs from a multi-FASTA alignment.","title":"SNPsites"},{"location":"output/#pipeline-information","text":"pipeline_info/ Reports generated by Nextflow: execution_report.html , execution_timeline.html , execution_trace.txt and pipeline_dag.dot / pipeline_dag.svg . Reports generated by the pipeline: pipeline_report.html , pipeline_report.txt and software_versions.csv . Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv . Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.","title":"Pipeline information"},{"location":"output/#multiqc","text":"multiqc/ multiqc_report.html : a standalone HTML file that can be viewed in your web browser. multiqc_data/ : directory containing parsed statistics from the different tools used in the pipeline. multiqc_plots/ : directory containing static images from the report in various formats. MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info .","title":"MultiQC"},{"location":"params/","text":"beiko-lab/ARETE pipeline parameters AMR/VF LGT-focused bacterial genomics workflow Input/output options Define where the pipeline should find input data and save output data. 
Parameter Description Type Default Required Hidden input_sample_table Path to comma-separated file containing information about the samples in the experiment. Help You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row. string True outdir Path to the output directory where the results will be saved. string ./results db_cache Directory where the databases are located string None email Email address for completion summary. Help Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file ( ~/.nextflow/config ) then you don't need to specify this on the command line for every run. string multiqc_title MultiQC report title. Printed as page header, used for filename if not otherwise specified. string Reference genome options Reference and outgroup genome fasta files required for the workflow. Parameter Description Type Default Required Hidden reference_genome Path to FASTA reference genome file. string Kraken2 Options for the Kraken2 taxonomic classification Parameter Description Type Default Required Hidden skip_kraken Don't run Kraken2 taxonomic classification boolean Annotation Parameters for the annotation subworkflow Parameter Description Type Default Required Hidden annotation_tools Comma-separated list of annotation tools to run string mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,report bakta_db Path to the BAKTA database string None use_prokka Use Prokka (not Bakta) for annotating assemblies boolean min_pident Minimum match identity percentage for filtering integer 60 min_qcover Minimum coverage of each match for filtering number 0.6 skip_profile_creation Skip annotation feature profile creation boolean feature_profile_columns Columns to include in the feature profile string mobsuite,rgi,cazy,vfdb,iceberg,bacmet Phylogenomics Parameters for the phylogenomics subworkflow Parameter Description Type Default Required Hidden skip_phylo Skip Pangenomics and Phylogenomics subworkflow boolean use_ppanggolin Use ppanggolin for calculating the pangenome boolean use_full_alignment Use full alignment boolean use_fasttree Use FastTree boolean True PopPUNK Parameters for the lineage subworkflow Parameter Description Type Default Required Hidden skip_poppunk Skip PopPunk boolean poppunk_model Which PopPunk model to use (bgmm, dbscan, refine, threshold or lineage) string None run_poppunk_qc Whether to run the QC step for PopPunk boolean enable_subsetting Enable subsetting workflow based on genome similarity boolean core_similarity Similarity threshold for core genomes number 99.99 accessory_similarity Similarity threshold for accessory genes number 99 Recombination Parameters for the recombination subworkflow Parameter Description Type Default Required Hidden run_recombination Run Recombination boolean run_verticall Run Verticall recombination tool boolean True run_gubbins Run Gubbins recombination tool boolean Institutional config options Parameters used to describe centralised config profiles. These should not be edited. Parameter Description Type Default Required Hidden custom_config_version Git commit id for Institutional configs. string master True custom_config_base Base directory for Institutional configs. 
Help If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter. string https://raw.githubusercontent.com/nf-core/configs/master True hostnames Institutional configs hostname. string True config_profile_name Institutional config name. string True config_profile_description Institutional config description. string True config_profile_contact Institutional config contact information. string True config_profile_url Institutional config URL link. string True Max job request options Set the top limit for requested resources for any single job. Parameter Description Type Default Required Hidden max_cpus Maximum number of CPUs that can be requested for any single job. Help Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1 integer 16 True max_memory Maximum amount of memory that can be requested for any single job. Help Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB' string 128.GB True max_time Maximum amount of time that can be requested for any single job. Help Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h' string 240.h True Generic options Less common options for the pipeline, typically set in a config file. Parameter Description Type Default Required Hidden help Display help text. boolean True publish_dir_mode Method used to save pipeline results to output directory. Help The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details. string copy True email_on_fail Email address for completion summary, only when pipeline fails. Help An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully. string True plaintext_email Send plain-text email instead of HTML. boolean True max_multiqc_email_size File size limit when attaching MultiQC reports to summary emails. string 25.MB True monochrome_logs Do not use coloured log outputs. boolean True multiqc_config Custom config file to supply to MultiQC. string True tracedir Directory to keep pipeline Nextflow logs and reports. string ${params.outdir}/pipeline_info True validate_params Boolean whether to validate parameters against the schema at runtime boolean True True show_hidden_params Show all params when using --help Help By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help . Specifying this option will tell the pipeline to show all parameters. boolean True enable_conda Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter. boolean True singularity_pull_docker_container Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead. Help This may be useful for example if you are unable to directly pull Singularity containers to run the pipeline due to http/https proxy issues. 
boolean True schema_ignore_params string genomes,modules multiqc_logo string None True","title":"Parameters"},{"location":"params/#beiko-labarete-pipeline-parameters","text":"AMR/VF LGT-focused bacterial genomics workflow","title":"beiko-lab/ARETE pipeline parameters"},{"location":"params/#inputoutput-options","text":"Define where the pipeline should find input data and save output data. Parameter Description Type Default Required Hidden input_sample_table Path to comma-separated file containing information about the samples in the experiment. Help You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row. string True outdir Path to the output directory where the results will be saved. string ./results db_cache Directory where the databases are located string None email Email address for completion summary. Help Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file ( ~/.nextflow/config ) then you don't need to specify this on the command line for every run. string multiqc_title MultiQC report title. Printed as page header, used for filename if not otherwise specified. string","title":"Input/output options"},{"location":"params/#reference-genome-options","text":"Reference and outgroup genome fasta files required for the workflow. Parameter Description Type Default Required Hidden reference_genome Path to FASTA reference genome file. string","title":"Reference genome options"},{"location":"params/#kraken2","text":"Options for the Kraken2 taxonomic classification Parameter Description Type Default Required Hidden skip_kraken Don't run Kraken2 taxonomic classification boolean","title":"Kraken2"},{"location":"params/#annotation","text":"Parameters for the annotation subworkflow Parameter Description Type Default Required Hidden annotation_tools Comma-separated list of annotation tools to run string mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,report bakta_db Path to the BAKTA database string None use_prokka Use Prokka (not Bakta) for annotating assemblies boolean min_pident Minimum match identity percentage for filtering integer 60 min_qcover Minimum coverage of each match for filtering number 0.6 skip_profile_creation Skip annotation feature profile creation boolean feature_profile_columns Columns to include in the feature profile string mobsuite,rgi,cazy,vfdb,iceberg,bacmet","title":"Annotation"},{"location":"params/#phylogenomics","text":"Parameters for the phylogenomics subworkflow Parameter Description Type Default Required Hidden skip_phylo Skip Pangenomics and Phylogenomics subworkflow boolean use_ppanggolin Use ppanggolin for calculating the pangenome boolean use_full_alignment Use full alignment boolean use_fasttree Use FastTree boolean True","title":"Phylogenomics"},{"location":"params/#poppunk","text":"Parameters for the lineage subworkflow Parameter Description Type Default Required Hidden skip_poppunk Skip PopPunk boolean poppunk_model Which PopPunk model to use (bgmm, dbscan, refine, threshold or lineage) string None run_poppunk_qc Whether to run the QC step for PopPunk boolean enable_subsetting Enable subsetting workflow based on genome similarity boolean core_similarity Similarity threshold for core genomes number 99.99 accessory_similarity Similarity threshold for accessory genes number 
99","title":"PopPUNK"},{"location":"params/#recombination","text":"Parameters for the recombination subworkflow Parameter Description Type Default Required Hidden run_recombination Run Recombination boolean run_verticall Run Verticall recombination tool boolean True run_gubbins Run Gubbins recombination tool boolean","title":"Recombination"},{"location":"params/#institutional-config-options","text":"Parameters used to describe centralised config profiles. These should not be edited. Parameter Description Type Default Required Hidden custom_config_version Git commit id for Institutional configs. string master True custom_config_base Base directory for Institutional configs. Help If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter. string https://raw.githubusercontent.com/nf-core/configs/master True hostnames Institutional configs hostname. string True config_profile_name Institutional config name. string True config_profile_description Institutional config description. string True config_profile_contact Institutional config contact information. string True config_profile_url Institutional config URL link. string True","title":"Institutional config options"},{"location":"params/#max-job-request-options","text":"Set the top limit for requested resources for any single job. Parameter Description Type Default Required Hidden max_cpus Maximum number of CPUs that can be requested for any single job. Help Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1 integer 16 True max_memory Maximum amount of memory that can be requested for any single job. Help Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB' string 128.GB True max_time Maximum amount of time that can be requested for any single job. Help Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h' string 240.h True","title":"Max job request options"},{"location":"params/#generic-options","text":"Less common options for the pipeline, typically set in a config file. Parameter Description Type Default Required Hidden help Display help text. boolean True publish_dir_mode Method used to save pipeline results to output directory. Help The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details. string copy True email_on_fail Email address for completion summary, only when pipeline fails. Help An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully. string True plaintext_email Send plain-text email instead of HTML. boolean True max_multiqc_email_size File size limit when attaching MultiQC reports to summary emails. string 25.MB True monochrome_logs Do not use coloured log outputs. boolean True multiqc_config Custom config file to supply to MultiQC. string True tracedir Directory to keep pipeline Nextflow logs and reports. 
string ${params.outdir}/pipeline_info True validate_params Boolean whether to validate parameters against the schema at runtime boolean True True show_hidden_params Show all params when using --help Help By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help . Specifying this option will tell the pipeline to show all parameters. boolean True enable_conda Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter. boolean True singularity_pull_docker_container Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead. Help This may be useful for example if you are unable to directly pull Singularity containers to run the pipeline due to http/https proxy issues. boolean True schema_ignore_params string genomes,modules multiqc_logo string None True","title":"Generic options"},{"location":"subsampling/","text":"PopPUNK subsetting The subsampling subworkflow is executed if you want to reduce the number of genomes that get added to the phylogenomics subworkflow. By reducing the number of genomes, you can potentially reduce resource requirements for the pangenomics and phylogenomics tools. To enable this subworkflow, add --enable_subsetting when running beiko-lab/ARETE. This will subset genomes based on their core genome similarity and accessory genome similarity, as calculated via their PopPUNK distances. By default, the thresholds are --core_similarity 99.9 and --accessory_similarity 99 . But these can be changed by adding these parameters to your execution. If any pair of genomes is at least this similar, only one genome from the pair will be included in the phylogenomics subworkflow. All of the removed genome IDs will be present under poppunk_results/removed_genomes.txt . By adding --enable_subsetting , you'll be adding two processes to the execution DAG: POPPUNK_EXTRACT_DISTANCES: This process will extract pair-wise distances between all genomes, returning a table under poppunk_results/distances/ . This table will be used to perform the subsetting. MAKE_HEATMAP: This process will create a heatmap showing different similarity thresholds and the number of genomes that'd be present in each of the possible subsets. It'll also be under poppunk_results/distances/ . Example command The command below will execute the 'annotation' ARETE entry with subsetting enabled, with a core similarity threshold of 99% and an accessory similarity of 95%. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --enable_subsetting \\ --core_similarity 99 \\ --accessory_similarity 95 \\ -profile docker \\ -entry annotation Be sure not to include --skip_poppunk in your command, because that will then disable all PopPUNK-related processes, including the subsetting subworkflow.","title":"Subsampling"},{"location":"subsampling/#poppunk-subsetting","text":"The subsampling subworkflow is executed if you want to reduce the number of genomes that get added to the phylogenomics subworkflow. By reducing the number of genomes, you can potentially reduce resource requirements for the pangenomics and phylogenomics tools. To enable this subworkflow, add --enable_subsetting when running beiko-lab/ARETE. This will subset genomes based on their core genome similarity and accessory genome similarity, as calculated via their PopPUNK distances. By default, the thresholds are --core_similarity 99.9 and --accessory_similarity 99 . 
But these can be changed by adding these parameters to your execution. If any pair of genomes is at least this similar, only one genome from the pair will be included in the phylogenomics subworkflow. All of the removed genome IDs will be present under poppunk_results/removed_genomes.txt . By adding --enable_subsetting , you'll be adding two processes to the execution DAG: POPPUNK_EXTRACT_DISTANCES: This process will extract pair-wise distances between all genomes, returning a table under poppunk_results/distances/ . This table will be used to perform the subsetting. MAKE_HEATMAP: This process will create a heatmap showing different similarity thresholds and the number of genomes that'd be present in each of the possible subsets. It'll also be under poppunk_results/distances/ .","title":"PopPUNK subsetting"},{"location":"subsampling/#example-command","text":"The command below will execute the 'annotation' ARETE entry with subsetting enabled, with a core similarity threshold of 99% and an accessory similarity of 95%. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --enable_subsetting \\ --core_similarity 99 \\ --accessory_similarity 95 \\ -profile docker \\ -entry annotation Be sure not to include --skip_poppunk in your command, because that will then disable all PopPUNK-related processes, including the subsetting subworkflow.","title":"Example command"},{"location":"usage/","text":"beiko-lab/ARETE: Usage Introduction The ARETE pipeline is designed as an end-to-end workflow manager for genome assembly, annotation, and phylogenetic analysis, beginning with read data. However, in some cases a user may wish to stop the pipeline prior to annotation or use the annotation features of the workflow with pre-existing assemblies. Therefore, ARETE allows for different use cases: Run the full pipeline end-to-end. Input a set of reads and stop after assembly. Input a set of assemblies and perform QC. Input a set of assemblies and perform annotation and taxonomic analyses. Input a set of assemblies and perform genome clustering with PopPUNK. Input a set of assemblies and perform phylogenomic and pangenomic analysis. This document will describe how to perform each workflow. \"Running the pipeline\" will show some example commands for using these different entries to ARETE. Samplesheet input No matter your use case, you will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. For full runs and assembly, it has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. --input_sample_table '[path to samplesheet file]' Full workflow or assembly samplesheet The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire; however, there is a strict requirement for the first 3 columns to match those defined in the table below. A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where TREATMENT_REP3 has been sequenced twice. 
sample,fastq_1,fastq_2 CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fastq_1 Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". fastq_2 Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". An example samplesheet has been provided with the pipeline. Annotation only samplesheet The ARETE pipeline allows users to provide pre-existing assemblies to make use of the annotation and reporting features of the workflow. Users may use the assembly_qc entry point to perform QC on the assemblies. Note that the QC workflow does not automatically filter low-quality assemblies; it simply generates QC reports! The annotation , assembly_qc and poppunk workflows accept the same format of sample sheet. The sample sheet must be a two-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fna_file_path Full path to fna file for assembly or genome. The file must have the .fna file extension. An example samplesheet has been provided with the pipeline. Phylogenomics and Pangenomics only samplesheet The ARETE pipeline allows users to provide pre-existing assemblies to make use of the phylogenomic and pangenomic features of the workflow. The sample sheet must be a two-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. gff_file_path Full path to GFF file for assembly or genome. The file must have the .gff or .gff3 file extension. These files can be the ones generated by Prokka or Bakta in ARETE's annotation subworkflow. Reference Genome For full workflow or assembly, users may provide a path to a reference genome in fasta format for use in assembly evaluation. --reference_genome ref.fasta Running the pipeline The typical command for running the pipeline is as follows: nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm -profile docker This will launch the pipeline with the docker configuration profile. See below for more information about profiles. Note that the pipeline will create the following files in your working directory: work # Directory containing the nextflow working files results # Finished results (configurable, see below) .nextflow_log # Log file from Nextflow # Other nextflow hidden files, e.g. history of pipeline runs and old logs. As written above, the pipeline also allows users to execute only assembly or only annotation. 
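As a minimal sketch combining parameters already documented on this page (file and directory names hypothetical), a full run that writes to a custom output directory and can be resumed after an interruption might look like:
nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm --outdir my_results -profile docker -resume
Here --outdir overrides the default ./results directory, and -resume (a core Nextflow option, note the single hyphen) reuses cached results from any earlier attempt.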
Assembly Entry To execute assembly (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker Assembly QC Entry To execute QC on pre-existing assemblies (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly_qc --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker Annotation Entry To execute annotation of pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry annotation --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker PopPUNK Entry To execute PopPUNK genome clustering on pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry poppunk --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker Phylogenomics and Pangenomics Entry To execute phylogenomic and pangenomic analysis on pre-existing assemblies: nextflow run beiko-lab/ARETE -entry phylogenomics --input_sample_table samplesheet.csv -profile docker Updating the pipeline When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: nextflow pull beiko-lab/ARETE Reproducibility It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. First, go to the ARETE releases page and find the latest version number - numeric only (e.g. 1.3.1 ). Then specify this when running the pipeline with -r (one hyphen) - e.g. -r 1.3.1 . This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. Core Nextflow arguments NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen). -profile Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud) - see below. We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility. The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation . Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles. If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH . This is not recommended. 
docker A generic configuration profile to be used with Docker singularity A generic configuration profile to be used with Singularity podman A generic configuration profile to be used with Podman shifter A generic configuration profile to be used with Shifter charliecloud A generic configuration profile to be used with Charliecloud conda Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud. A generic configuration profile to be used with Conda Pulls most software from Bioconda test A profile with a complete configuration for automated testing Can run on personal computers with at least 6GB of RAM and 2 CPUs Includes links to test data so needs no other parameters -resume Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: -resume [run-name] . Use the nextflow log command to show previous run names. -c Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information. Custom resource requests Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of 143 (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three times then the pipeline is stopped. Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process UNICYCLER 32GB of memory, you could use the following config: process { withName: UNICYCLER { memory = 32.GB } } To find the exact name of a process for which you wish to modify the compute resources, check the live-status of a nextflow run displayed on your terminal or check the nextflow error for a line like so: Error executing process > 'bwa' . In this case the name to specify in the custom config file is bwa . See the main Nextflow documentation for more information. Running in the background Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished. The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file. Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs). Nextflow memory requirements In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile ): NXF_OPTS='-Xms1g -Xmx4g'","title":"Usage"},{"location":"usage/#beiko-labarete-usage","text":"","title":"beiko-lab/ARETE: Usage"},{"location":"usage/#introduction","text":"The ARETE pipeline is designed as an end-to-end workflow manager for genome assembly, annotation, and phylogenetic analysis, beginning with read data. 
However, in some cases a user may wish to stop the pipeline prior to annotation or use the annotation features of the workflow with pre-existing assemblies. Therefore, ARETE allows for different use cases: Run the full pipeline end-to-end. Input a set of reads and stop after assembly. Input a set of assemblies and perform QC. Input a set of assemblies and perform annotation and taxonomic analyses. Input a set of assemblies and perform genome clustering with PopPUNK. Input a set of assemblies and perform phylogenomic and pangenomic analysis. This document will describe how to perform each workflow. \"Running the pipeline\" will show some example commands for using these different entries to ARETE.","title":"Introduction"},{"location":"usage/#samplesheet-input","text":"No matter your use case, you will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. For full runs and assembly, it has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. --input_sample_table '[path to samplesheet file]'","title":"Samplesheet input"},{"location":"usage/#full-workflow-or-assembly-samplesheet","text":"The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire; however, there is a strict requirement for the first 3 columns to match those defined in the table below. A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where TREATMENT_REP3 has been sequenced twice. sample,fastq_1,fastq_2 CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fastq_1 Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". fastq_2 Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". An example samplesheet has been provided with the pipeline.","title":"Full workflow or assembly samplesheet"},{"location":"usage/#annotation-only-samplesheet","text":"The ARETE pipeline allows users to provide pre-existing assemblies to make use of the annotation and reporting features of the workflow. Users may use the assembly_qc entry point to perform QC on the assemblies. Note that the QC workflow does not automatically filter low-quality assemblies; it simply generates QC reports! The annotation , assembly_qc and poppunk workflows accept the same format of sample sheet. The sample sheet must be a two-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fna_file_path Full path to fna file for assembly or genome. The file must have the .fna file extension. 
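As a sketch, an annotation-only samplesheet following the two-column format above might look like this (sample names and paths hypothetical): sample,fna_file_path ISOLATE_1,/path/to/assemblies/ISOLATE_1.fna ISOLATE_2,/path/to/assemblies/ISOLATE_2.fna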
An example samplesheet has been provided with the pipeline.","title":"Annotation only samplesheet"},{"location":"usage/#phylogenomics-and-pangenomics-only-samplesheet","text":"The ARETE pipeline allows users to provide pre-existing assemblies to make use of the phylogenomic and pangenomic features of the workflow. The sample sheet must be a two-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. gff_file_path Full path to GFF file for assembly or genome. The file must have the .gff or .gff3 file extension. These files can be the ones generated by Prokka or Bakta in ARETE's annotation subworkflow.","title":"Phylogenomics and Pangenomics only samplesheet"},{"location":"usage/#reference-genome","text":"For full workflow or assembly, users may provide a path to a reference genome in fasta format for use in assembly evaluation. --reference_genome ref.fasta","title":"Reference Genome"},{"location":"usage/#running-the-pipeline","text":"The typical command for running the pipeline is as follows: nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm -profile docker This will launch the pipeline with the docker configuration profile. See below for more information about profiles. Note that the pipeline will create the following files in your working directory: work # Directory containing the nextflow working files results # Finished results (configurable, see below) .nextflow_log # Log file from Nextflow # Other nextflow hidden files, e.g. history of pipeline runs and old logs. As written above, the pipeline also allows users to execute only assembly or only annotation.","title":"Running the pipeline"},{"location":"usage/#assembly-entry","text":"To execute assembly (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker","title":"Assembly Entry"},{"location":"usage/#assembly-qc-entry","text":"To execute QC on pre-existing assemblies (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly_qc --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker","title":"Assembly QC Entry"},{"location":"usage/#annotation-entry","text":"To execute annotation of pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry annotation --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker","title":"Annotation Entry"},{"location":"usage/#poppunk-entry","text":"To execute PopPUNK genome clustering on pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry poppunk --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker","title":"PopPUNK Entry"},{"location":"usage/#phylogenomics-and-pangenomics-entry","text":"To execute phylogenomic and pangenomic analysis on pre-existing assemblies: nextflow run beiko-lab/ARETE -entry phylogenomics --input_sample_table samplesheet.csv -profile docker","title":"Phylogenomics and Pangenomics Entry"},{"location":"usage/#updating-the-pipeline","text":"When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. 
When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: nextflow pull beiko-lab/ARETE","title":"Updating the pipeline"},{"location":"usage/#reproducibility","text":"It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. First, go to the ARETE releases page and find the latest version number - numeric only (e.g. 1.3.1 ). Then specify this when running the pipeline with -r (one hyphen) - e.g. -r 1.3.1 . This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.","title":"Reproducibility"},{"location":"usage/#core-nextflow-arguments","text":"NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).","title":"Core Nextflow arguments"},{"location":"usage/#-profile","text":"Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud) - see below. We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility. The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation . Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles. If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH . This is not recommended. docker A generic configuration profile to be used with Docker singularity A generic configuration profile to be used with Singularity podman A generic configuration profile to be used with Podman shifter A generic configuration profile to be used with Shifter charliecloud A generic configuration profile to be used with Charliecloud conda Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud. A generic configuration profile to be used with Conda Pulls most software from Bioconda test A profile with a complete configuration for automated testing Can run on personal computers with at least 6GB of RAM and 2 CPUs Includes links to test data so needs no other parameters","title":"-profile"},{"location":"usage/#-resume","text":"Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: -resume [run-name] . 
Use the nextflow log command to show previous run names.","title":"-resume"},{"location":"usage/#-c","text":"Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.","title":"-c"},{"location":"usage/#custom-resource-requests","text":"Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of 143 (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three times then the pipeline is stopped. Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process UNICYCLER 32GB of memory, you could use the following config: process { withName: UNICYCLER { memory = 32.GB } } To find the exact name of a process for which you wish to modify the compute resources, check the live-status of a nextflow run displayed on your terminal or check the nextflow error for a line like so: Error executing process > 'bwa' . In this case the name to specify in the custom config file is bwa . See the main Nextflow documentation for more information.","title":"Custom resource requests"},{"location":"usage/#running-in-the-background","text":"Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished. The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file. Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).","title":"Running in the background"},{"location":"usage/#nextflow-memory-requirements","text":"In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile ): NXF_OPTS='-Xms1g -Xmx4g'","title":"Nextflow memory requirements"}]} \ No newline at end of file +{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"What is ARETE? ARETE (Antimicrobial Resistance: Emergence, Transmission, and Ecology) is a bioinformatics best-practice analysis pipeline for profiling the genomic repertoire and evolutionary dynamics of microorganisms with a particular focus on pathogens. We use ARETE to identify important genes (e.g., those that confer antimicrobial resistance or contribute to virulence) and mobile genetic elements such as plasmids and genomic islands, and infer important routes by which these are transmitted using evidence from recombination, coevolution, and phylogenetic tree comparisons. ARETE produces a range of useful outputs (see outputs ), including those generated by each tool integrated into the pipeline, as well as summaries across the entire dataset such as phylogenetic profiles. Outputs from ARETE can also be fed into packages such as Coeus and MicroReact. 
Although ARETE was primarily developed with pathogens in mind, inference of pan-genomes, mobilomes, and phylogenomic histories can be performed for any set of microbial genomes, with the proviso that reference databases are much more complete for some groups than others! The tools in ARETE work best at the species and genus level of relatedness. A key design principle of ARETE is finding the right choice of software packages and parameter settings to support datasets of different sizes, introducing heuristics and swapping out tools as necessary. ARETE has been benchmarked on datasets ranging in size from fewer than ten to over 10,000 genomes from a multitude of species and genera including Enterococcus faecium , Escherichia coli , Listeria , and Salmonella . Another key principle is letting the user choose which subsets of the pipeline they wish to run; you may already have assembled genomes, or you may not care about, say, recombination detection. There are also cases where it is useful to manually review the outputs from a particular step before moving on to the next one. ARETE makes this easy to do. Table of Contents About the pipeline Quick start A couple of examples Credits Contributing to ARETE Citing ARETE About the pipeline The pipeline is built using Nextflow , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers making installation trivial and results highly reproducible. Like other workflow languages it provides useful features like -resume to only rerun tasks that haven't already been completed (e.g., allowing editing of inputs/tasks and recovery from crashes without a full re-run). The nf-core project provided the overall project template, pre-written software modules when available, and general best practice recommendations. ARETE is organized as a series of subworkflows, each of which executes a different conceptual step of the pipeline. The subworkflow organization provides suitable entry and exit points for users who want to run only a portion of the full pipeline. Genome subsetting: The user can optionally subdivide their set of genomes into lineages as defined by PopPUNK ( See documentation ). PopPUNK quickly subdivides a set of genomes into 'lineages' based on core and accessory genome identity. If this option is selected, all genomes will still be annotated, but cross-genome comparisons (e.g., pan-genome inference and phylogenomics) will use only a single representative genome. The user can run PopPUNK with a spread of different thresholds and decide how to proceed based on the number of lineages produced.
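As an illustration of the subsetting principle just described, here is a hedged sketch of a subsetting-enabled run, using only flags documented later on this page (--poppunk_model and --enable_subsetting); the samplesheet name is a placeholder:

```bash
# Cluster genomes with PopPUNK's dbscan model and restrict cross-genome
# comparisons (pan-genome inference, phylogenomics) to lineage representatives
nextflow run beiko-lab/ARETE \
    --input_sample_table samplesheet.csv \
    --poppunk_model dbscan \
    --enable_subsetting \
    -profile docker
```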
Short-read processing and assembly: Raw Read QC ( FastQC ) Read Trimming ( fastp ) Trimmed Read QC ( FastQC ) Taxonomic Profiling ( kraken2 ) Unicycler ( unicycler ) QUAST QC ( quast ) CheckM QC ( checkm ) Annotation: Genome annotation with Bakta ( bakta ) or Prokka ( prokka ) Feature prediction: AMR genes with the Resistance Gene Identifier ( RGI ) Plasmids with MOB-Suite ( mob_suite ) Genomic Islands with IslandPath ( IslandPath ) Phages with PhiSpy ( PhiSpy ) ( optionally ) Integrons with IntegronFinder Specialized databases: CAZY, VFDB, BacMet and ICEberg2 using DIAMOND homology search ( diamond ) Phylogenomics: ( optionally ) Genome subsetting with PopPUNK ( See documentation ) Pan-genome inference using PPanGGOLiN ( PPanGGOLiN ) or Panaroo ( panaroo ) Reference and gene tree inference using FastTree ( fasttree ) or IQTree ( iqtree ) ( optionally ) SNP-sites ( SNPsites ) Recombination detection: Recombination detection is performed within lineages identified by PopPUNK ( poppunk ). Note that this application of PopPUNK is different from the subsetting described above. Genome alignment using SKA2 ( ska2 ) Recombination detection using Verticall ( verticall ) and/or Gubbins ( gubbins ) Coevolution: Identification of coordinated gain and loss of features using EvolCCM (to add) Lateral gene transfer: Phylogenetic inference of LGT using rSPR (to add) Gene order: Comparison of genomic neighbourhoods using the Gene Order Workflow (to add) See our roadmap for a full list of future development targets. Quick Start Install nextflow Install Docker , Singularity , or, as a last resort, Conda . Also ensure you have a working curl installed (should be present on almost all systems). 2.1. Note: this workflow should also support Podman , Shifter or Charliecloud execution for full pipeline reproducibility. We have minimized reliance on conda and suggest using it only as a last resort (see docs ). Configure mail on your system to send an email on workflow success/failure (without this you may get a small error at the end Failed to invoke workflow.onComplete event handler but this doesn't mean the workflow didn't finish successfully). Download the pipeline and test with a stub-run . The stub-run will ensure that the pipeline is able to download and use containers as well as execute in the proper logic. nextflow run beiko-lab/ARETE -profile test, -stub-run 3.1. Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment. 3.2. If you are using singularity then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. Start running your own analysis (ideally using -profile docker or -profile singularity for stability)! 
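Before launching, prepare the samplesheet.csv referenced in the command below; a minimal sketch of the required sample,fastq_1,fastq_2 layout (sample names and paths are hypothetical):

```csv
sample,fastq_1,fastq_2
isolate_01,/data/reads/isolate_01_R1.fastq.gz,/data/reads/isolate_01_R2.fastq.gz
isolate_02,/data/reads/isolate_02_R1.fastq.gz,/data/reads/isolate_02_R2.fastq.gz
```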
nextflow run beiko-lab/ARETE \\ -profile \\ --input_sample_table samplesheet.csv \\ --poppunk_model bgmm samplesheet.csv must be formatted sample,fastq_1,fastq_2 Note : If you get this error at the end Failed to invoke `workflow.onComplete` event handler it isn't a problem, it just means you don't have sendmail configured and it can't send an email report saying it finished correctly, i.e., it's not that the workflow failed. See usage docs for all of the available options when running the pipeline. See the parameter docs for a list of all params currently implemented in the pipeline and which ones are required. Testing To test the workflow on a minimal dataset you can use the test configuration (with either docker, conda, or singularity - replace docker below as appropriate): nextflow run beiko-lab/ARETE -profile test,docker Due to the download speed of the Kraken2, Bakta and CAZY databases this will take ~35 minutes. However, to accelerate it, you can download/cache the database files to a folder (e.g., test/db_cache ) and provide the database cache parameter, as well as set --bakta_db to the directory containing the Bakta database. nextflow run beiko-lab/ARETE \\ -profile test,docker \\ --db_cache $PWD/test/db_cache \\ --bakta_db $PWD/baktadb/db-light Examples The fine details of how to run ARETE are described in the command reference and documentation, but here are a couple of illustrative examples: Assembly, annotation, and pan-genome inference from a modestly sized dataset (50 or so genomes) from paired-end reads nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --annotation_tools 'mobsuite,rgi,vfdb,bacmet,islandpath,phispy,report' \\ --poppunk_model bgmm \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --annotation_tools - Select the annotation tools and modules to be executed (See the parameter documentation for defaults) --poppunk_model - Model to be used by PopPUNK -profile docker - Run tools in docker containers. Annotation to evolutionary dynamics on 300-ish genomes nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --run_recombination \\ --run_gubbins \\ --use_ppanggolin \\ -entry annotation \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --run_gubbins - Run Gubbins as part of the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile docker - Run tools in docker containers. Annotation to evolutionary dynamics on 10,000 genomes nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --use_ppanggolin \\ --run_recombination \\ --enable_subsetting \\ -entry annotation \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. --enable_subsetting - Enable subsetting workflow based on genome similarity (See subsetting documentation ) -entry annotation - Run annotation subworkflow and further steps (See usage ).
-profile docker - Run tools in docker containers. Credits The ARETE software was originally written and developed by Finlay Maguire and Alex Manuele , and is currently developed by Jo\u00e3o Cavalcante . Rob Beiko is the PI of the ARETE project. The project Co-PI is Fiona Brinkman. Other project leads include Andrew MacArthur, Cedric Chauve, Chris Whidden, Gary van Domselaar, John Nash, Rahat Zaheer, and Tim McAllister. Many students, postdocs, developers, and staff scientists have made invaluable contributions to the design and application of ARETE and its components, including Haley Sanderson, Kristen Gray, Julia Lewandowski, Chaoyue Liu, Kartik Kakadiya, Bryan Alcock, Amos Raphenya, Amjad Khan, Ryan Fink, Aniket Mane, Chandana Navanekere Rudrappa, Kyrylo Bessonov, James Robertson, Jee In Kim, and Nolan Woods. ARETE development has been supported by many sources, including Genome Canada, ResearchNS, Genome Atlantic, Genome British Columbia, The Canadian Institutes for Health Research, The Natural Sciences and Engineering Research Council of Canada, and Dalhousie University's Faculty of Computer Science. We have received tremendous support from federal agencies, most notably the Public Health Agency of Canada and Agriculture / Agri-Food Canada. Contributing to ARETE Thank you for your interest in contributing to ARETE. We are currently in the process of formalizing contribution guidelines. In the meantime, please feel free to open an issue describing your suggested changes. Citing ARETE Please cite the tools used in your ARETE run: A comprehensive list can be found in the CITATIONS.md file. An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038. This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, and reused here under the MIT license . The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.","title":"Home"},{"location":"#what-is-arete","text":"ARETE (Antimicrobial Resistance: Emergence, Transmission, and Ecology) is a bioinformatics best-practice analysis pipeline for profiling the genomic repertoire and evolutionary dynamics of microorganisms with a particular focus on pathogens. We use ARETE to identify important genes (e.g., those that confer antimicrobial resistance or contribute to virulence) and mobile genetic elements such as plasmids and genomic islands, and infer important routes by which these are transmitted using evidence from recombination, coevolution, and phylogenetic tree comparisons. ARETE produces a range of useful outputs (see outputs ), including those generated by each tool integrated into the pipeline, as well as summaries across the entire dataset such as phylogenetic profiles. Outputs from ARETE can also be fed into packages such as Coeus and MicroReact.
Although ARETE was primarily developed with pathogens in mind, inference of pan-genomes, mobilomes, and phylogenomic histories can be performed for any set of microbial genomes, with the proviso that reference databases are much more complete for some groups than others! The tools in ARETE work best at the species and genus level of relatedness. A key design principle of ARETE is finding the right choice of software packages and parameter settings to support datasets of different sizes, introducing heuristics and swapping out tools as necessary. ARETE has been benchmarked on datasets ranging in size from fewer than ten to over 10,000 genomes from a multitude of species and genera including Enterococcus faecium , Escherichia coli , Listeria , and Salmonella . Another key principle is letting the user choose which subsets of the pipeline they wish to run; you may already have assembled genomes, or you may not care about, say, recombination detection. There are also cases where it is useful to manually review the outputs from a particular step before moving on to the next one. ARETE makes this easy to do.","title":"What is ARETE?"},{"location":"#table-of-contents","text":"About the pipeline Quick start A couple of examples Credits Contributing to ARETE Citing ARETE","title":"Table of Contents"},{"location":"#about-the-pipeline","text":"The pipeline is built using Nextflow , a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker / Singularity containers making installation trivial and results highly reproducible. Like other workflow languages it provides useful features like -resume to only rerun tasks that haven't already been completed (e.g., allowing editing of inputs/tasks and recovery from crashes without a full re-run). The nf-core project provided the overall project template, pre-written software modules when available, and general best practice recommendations. ARETE is organized as a series of subworkflows, each of which executes a different conceptual step of the pipeline. The subworkflow organization provides suitable entry and exit points for users who want to run only a portion of the full pipeline. Genome subsetting: The user can optionally subdivide their set of genomes into lineages as defined by PopPUNK ( See documentation ). PopPUNK quickly subdivides a set of genomes into 'lineages' based on core and accessory genome identity. If this option is selected, all genomes will still be annotated, but cross-genome comparisons (e.g., pan-genome inference and phylogenomics) will use only a single representative genome. The user can run PopPUNK with a spread of different thresholds and decide how to proceed based on the number of lineages produced.
Short-read processing and assembly: Raw Read QC ( FastQC ) Read Trimming ( fastp ) Trimmed Read QC ( FastQC ) Taxonomic Profiling ( kraken2 ) Unicycler ( unicycler ) QUAST QC ( quast ) CheckM QC ( checkm ) Annotation: Genome annotation with Bakta ( bakta ) or Prokka ( prokka ) Feature prediction: AMR genes with the Resistance Gene Identifier ( RGI ) Plasmids with MOB-Suite ( mob_suite ) Genomic Islands with IslandPath ( IslandPath ) Phages with PhiSpy ( PhiSpy ) ( optionally ) Integrons with IntegronFinder Specialized databases: CAZY, VFDB, BacMet and ICEberg2 using DIAMOND homology search ( diamond ) Phylogenomics: ( optionally ) Genome subsetting with PopPUNK ( See documentation ) Pan-genome inference using PPanGGOLiN ( PPanGGOLiN ) or Panaroo ( panaroo ) Reference and gene tree inference using FastTree ( fasttree ) or IQTree ( iqtree ) ( optionally ) SNP-sites ( SNPsites ) Recombination detection: Recombination detection is performed within lineages identified by PopPUNK ( poppunk ). Note that this application of PopPUNK is different from the subsetting described above. Genome alignment using SKA2 ( ska2 ) Recombination detection using Verticall ( verticall ) and/or Gubbins ( gubbins ) Coevolution: Identification of coordinated gain and loss of features using EvolCCM (to add) Lateral gene transfer: Phylogenetic inference of LGT using rSPR (to add) Gene order: Comparison of genomic neighbourhoods using the Gene Order Workflow (to add) See our roadmap for a full list of future development targets.","title":"About the pipeline "},{"location":"#quick-start","text":"Install nextflow Install Docker , Singularity , or, as a last resort, Conda . Also ensure you have a working curl installed (should be present on almost all systems). 2.1. Note: this workflow should also support Podman , Shifter or Charliecloud execution for full pipeline reproducibility. We have minimized reliance on conda and suggest using it only as a last resort (see docs ). Configure mail on your system to send an email on workflow success/failure (without this you may get a small error at the end Failed to invoke workflow.onComplete event handler but this doesn't mean the workflow didn't finish successfully). Download the pipeline and test with a stub-run . The stub-run will ensure that the pipeline is able to download and use containers as well as execute in the proper logic. nextflow run beiko-lab/ARETE -profile test, -stub-run 3.1. Please check nf-core/configs to see if a custom config file to run nf-core pipelines already exists for your Institute. If so, you can simply use -profile in your command. This will enable either docker or singularity and set the appropriate execution settings for your local compute environment. 3.2. If you are using singularity then the pipeline will auto-detect this and attempt to download the Singularity images directly as opposed to performing a conversion from Docker images. If you are persistently observing issues downloading Singularity images directly due to timeout or network issues then please use the --singularity_pull_docker_container parameter to pull and convert the Docker image instead. Start running your own analysis (ideally using -profile docker or -profile singularity for stability)! 
nextflow run beiko-lab/ARETE \\ -profile \\ --input_sample_table samplesheet.csv \\ --poppunk_model bgmm samplesheet.csv must be formatted sample,fastq_1,fastq_2 Note : If you get this error at the end Failed to invoke `workflow.onComplete` event handler it isn't a problem, it just means you don't have sendmail configured and it can't send an email report saying it finished correctly, i.e., it's not that the workflow failed. See usage docs for all of the available options when running the pipeline. See the parameter docs for a list of all params currently implemented in the pipeline and which ones are required.","title":"Quick Start "},{"location":"#testing","text":"To test the workflow on a minimal dataset you can use the test configuration (with either docker, conda, or singularity - replace docker below as appropriate): nextflow run beiko-lab/ARETE -profile test,docker Due to the download speed of the Kraken2, Bakta and CAZY databases this will take ~35 minutes. However, to accelerate it, you can download/cache the database files to a folder (e.g., test/db_cache ) and provide the database cache parameter, as well as set --bakta_db to the directory containing the Bakta database. nextflow run beiko-lab/ARETE \\ -profile test,docker \\ --db_cache $PWD/test/db_cache \\ --bakta_db $PWD/baktadb/db-light","title":"Testing"},{"location":"#examples","text":"The fine details of how to run ARETE are described in the command reference and documentation, but here are a couple of illustrative examples:","title":"Examples "},{"location":"#assembly-annotation-and-pan-genome-inference-from-a-modestly-sized-dataset-50-or-so-genomes-from-paired-end-reads","text":"nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --annotation_tools 'mobsuite,rgi,vfdb,bacmet,islandpath,phispy,report' \\ --poppunk_model bgmm \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --annotation_tools - Select the annotation tools and modules to be executed (See the parameter documentation for defaults) --poppunk_model - Model to be used by PopPUNK -profile docker - Run tools in docker containers.","title":"Assembly, annotation, and pan-genome inference from a modestly sized dataset (50 or so genomes) from paired-end reads"},{"location":"#annotation-to-evolutionary-dynamics-on-300-ish-genomes","text":"nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --run_recombination \\ --run_gubbins \\ --use_ppanggolin \\ -entry annotation \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK . --run_recombination - Run the recombination subworkflow. --run_gubbins - Run Gubbins as part of the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile docker - Run tools in docker containers.","title":"Annotation to evolutionary dynamics on 300-ish genomes"},{"location":"#annotation-to-evolutionary-dynamics-on-10000-genomes","text":"nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --poppunk_model dbscan \\ --use_ppanggolin \\ --run_recombination \\ --enable_subsetting \\ -entry annotation \\ -profile docker Parameters used: --input_sample_table - Input dataset in samplesheet format (See usage ) --poppunk_model - Model to be used by PopPUNK .
--run_recombination - Run the recombination subworkflow. --use_ppanggolin - Use PPanGGOLiN for calculating the pangenome. Tends to perform better on larger input sets. --enable_subsetting - Enable subsetting workflow based on genome similarity (See subsetting documentation ) -entry annotation - Run annotation subworkflow and further steps (See usage ). -profile docker - Run tools in docker containers.","title":"Annotation to evolutionary dynamics on 10,000 genomes"},{"location":"#credits","text":"The ARETE software was originally written and developed by Finlay Maguire and Alex Manuele , and is currently developed by Jo\u00e3o Cavalcante . Rob Beiko is the PI of the ARETE project. The project Co-PI is Fiona Brinkman. Other project leads include Andrew MacArthur, Cedric Chauve, Chris Whidden, Gary van Domselaar, John Nash, Rahat Zaheer, and Tim McAllister. Many students, postdocs, developers, and staff scientists have made invaluable contributions to the design and application of ARETE and its components, including Haley Sanderson, Kristen Gray, Julia Lewandowski, Chaoyue Liu, Kartik Kakadiya, Bryan Alcock, Amos Raphenya, Amjad Khan, Ryan Fink, Aniket Mane, Chandana Navanekere Rudrappa, Kyrylo Bessonov, James Robertson, Jee In Kim, and Nolan Woods. ARETE development has been supported by many sources, including Genome Canada, ResearchNS, Genome Atlantic, Genome British Columbia, The Canadian Institutes for Health Research, The Natural Sciences and Engineering Research Council of Canada, and Dalhousie University's Faculty of Computer Science. We have received tremendous support from federal agencies, most notably the Public Health Agency of Canada and Agriculture / Agri-Food Canada.","title":"Credits "},{"location":"#contributing-to-arete","text":"Thank you for your interest in contributing to ARETE. We are currently in the process of formalizing contribution guidelines. In the meantime, please feel free to open an issue describing your suggested changes.","title":"Contributing to ARETE "},{"location":"#citing-arete","text":"Please cite the tools used in your ARETE run: A comprehensive list can be found in the CITATIONS.md file. An early version of ARETE was used for assembly and feature prediction in the following paper : Sanderson H, Gray KL, Manuele A, Maguire F, Khan A, Liu C, Navanekere Rudrappa C, Nash JHE, Robertson J, Bessonov K, Oloni M, Alcock BP, Raphenya AR, McAllister TA, Peacock SJ, Raven KE, Gouliouris T, McArthur AG, Brinkman FSL, Fink RC, Zaheer R, Beiko RG. Exploring the mobilome and resistome of Enterococcus faecium in a One Health context across two continents. Microb Genom. 2022 Sep;8(9):mgen000880. doi: 10.1099/mgen.0.000880. PMID: 36129737; PMCID: PMC9676038. This pipeline uses code and infrastructure developed and maintained by the nf-core initiative, and reused here under the MIT license . The nf-core framework for community-curated bioinformatics pipelines. Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.","title":"Citing ARETE "},{"location":"CITATIONS/","text":"beiko-lab/ARETE: Citations nf-core Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.
Nextflow Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311. Pipeline tools CheckM Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes . Genome Research, 25: 1043\u20131055. DIAMOND Buchfink B, Xie C, Huson DH Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 12, 59\u201360 (2015) FastQC FastP Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281. FastTree Morgan N. Price, Paramvir S. Dehal, Adam P. Arkin, FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix, Molecular Biology and Evolution, Volume 26, Issue 7, July 2009, Pages 1641\u20131650, https://doi.org/10.1093/molbev/msp077 IQ-TREE2 Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020 May 1;37(5):1530-1534. doi: 10.1093/molbev/msaa015. Erratum in: Mol Biol Evol. 2020 Aug 1;37(8):2461. PMID: 32011700; PMCID: PMC7182206. Kraken2 Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. doi: 10.1186/s13059-019-1891-0. MOB-SUITE Robertson, James, and John H E Nash. \u201cMOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies.\u201d Microbial genomics vol. 4,8 (2018): e000206. doi:10.1099/mgen.0.000206 Robertson, James et al. \u201cUniversal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance.\u201d Microbial genomics vol. 6,10 (2020): mgen000435. doi:10.1099/mgen.0.000435 MultiQC Ewels P, Magnusson M, Lundin S, K\u00e4ller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924. Bakta Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685 Prokka Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. doi: 10.1093/bioinformatics/btu153. Epub 2014 Mar 18. PMID: 24642063. QUAST Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806. RGI Alcock et al. 2020. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, Volume 48, Issue D1, Pages D517-525 [PMID 31665441] IntegronFinder N\u00e9ron, Bertrand, Eloi Littner, Matthieu Haudiquet, Amandine Perrin, Jean Cury, and Eduardo P.C. Rocha. 2022. IntegronFinder 2.0: Identification and Analysis of Integrons across Bacteria, with a Focus on Antibiotic Resistance in Klebsiella Microorganisms 10, no. 4: 700. 
https://doi.org/10.3390/microorganisms10040700 Panaroo Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21, 180 (2020). https://doi.org/10.1186/s13059-020-02090-4 PPanGGoLiN Gautreau G et al. (2020) PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLOS Computational Biology 16(3): e1007732. https://doi.org/10.1371/journal.pcbi.1007732 PopPUNK Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 2019 Feb;29(2):304-316. doi: 10.1101/gr.241455.118. Epub 2019 Jan 24. PMID: 30679308; PMCID: PMC6360808. SKA2 Harris SR. 2018. SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 453142 doi: https://doi.org/10.1101/453142 Gubbins Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. \"Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins\". doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014. Verticall SNP-sites Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom. 2016 Apr 29;2(4):e000056. doi: 10.1099/mgen.0.000056. PMID: 28348851; PMCID: PMC5320690. Unicycler Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017 Jun 8;13(6):e1005595. doi: 10.1371/journal.pcbi.1005595. PMID: 28594827; PMCID: PMC5481147. IslandPath Claire Bertelli, Fiona S L Brinkman, Improved genomic island predictions with IslandPath-DIMOB, Bioinformatics, Volume 34, Issue 13, 01 July 2018, Pages 2161\u20132167, https://doi.org/10.1093/bioinformatics/bty095 PhiSpy Sajia Akhter, Ramy K. Aziz, Robert A. Edwards; PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucl Acids Res 2012; 40 (16): e126. doi: 10.1093/nar/gks406 Software packaging/containerisation tools Anaconda Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web. Bioconda Gr\u00fcning B, Dale R, Sj\u00f6din A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, K\u00f6ster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506. BioContainers da Veiga Leprevost F, Gr\u00fcning B, Aflitos SA, R\u00f6st HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671. Docker Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 239, Article 2 (March 2014). Singularity Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. 
PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.","title":"Citations"},{"location":"CITATIONS/#beiko-labarete-citations","text":"","title":"beiko-lab/ARETE: Citations"},{"location":"CITATIONS/#nf-core","text":"Ewels PA, Peltzer A, Fillinger S, Patel H, Alneberg J, Wilm A, Garcia MU, Di Tommaso P, Nahnsen S. The nf-core framework for community-curated bioinformatics pipelines. Nat Biotechnol. 2020 Mar;38(3):276-278. doi: 10.1038/s41587-020-0439-x. PubMed PMID: 32055031.","title":"nf-core"},{"location":"CITATIONS/#nextflow","text":"Di Tommaso P, Chatzou M, Floden EW, Barja PP, Palumbo E, Notredame C. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017 Apr 11;35(4):316-319. doi: 10.1038/nbt.3820. PubMed PMID: 28398311.","title":"Nextflow"},{"location":"CITATIONS/#pipeline-tools","text":"CheckM Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes . Genome Research, 25: 1043\u20131055. DIAMOND Buchfink B, Xie C, Huson DH Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 12, 59\u201360 (2015) FastQC FastP Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281. FastTree Morgan N. Price, Paramvir S. Dehal, Adam P. Arkin, FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix, Molecular Biology and Evolution, Volume 26, Issue 7, July 2009, Pages 1641\u20131650, https://doi.org/10.1093/molbev/msp077 IQ-TREE2 Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, Lanfear R. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Mol Biol Evol. 2020 May 1;37(5):1530-1534. doi: 10.1093/molbev/msaa015. Erratum in: Mol Biol Evol. 2020 Aug 1;37(8):2461. PMID: 32011700; PMCID: PMC7182206. Kraken2 Wood, D et al., 2019. Improved metagenomic analysis with Kraken 2. Genome Biology volume 20, Article number: 257. doi: 10.1186/s13059-019-1891-0. MOB-SUITE Robertson, James, and John H E Nash. \u201cMOB-suite: software tools for clustering, reconstruction and typing of plasmids from draft assemblies.\u201d Microbial genomics vol. 4,8 (2018): e000206. doi:10.1099/mgen.0.000206 Robertson, James et al. \u201cUniversal whole-sequence-based plasmid typing and its utility to prediction of host range and epidemiological surveillance.\u201d Microbial genomics vol. 6,10 (2020): mgen000435. doi:10.1099/mgen.0.000435 MultiQC Ewels P, Magnusson M, Lundin S, K\u00e4ller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924. Bakta Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685 Prokka Seemann T. Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014 Jul 15;30(14):2068-9. doi: 10.1093/bioinformatics/btu153. Epub 2014 Mar 18. PMID: 24642063. QUAST Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013 Apr 15;29(8):1072-5. 
doi: 10.1093/bioinformatics/btt086. Epub 2013 Feb 19. PMID: 23422339; PMCID: PMC3624806. RGI Alcock et al. 2020. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, Volume 48, Issue D1, Pages D517-525 [PMID 31665441] IntegronFinder N\u00e9ron, Bertrand, Eloi Littner, Matthieu Haudiquet, Amandine Perrin, Jean Cury, and Eduardo P.C. Rocha. 2022. IntegronFinder 2.0: Identification and Analysis of Integrons across Bacteria, with a Focus on Antibiotic Resistance in Klebsiella Microorganisms 10, no. 4: 700. https://doi.org/10.3390/microorganisms10040700 Panaroo Tonkin-Hill, G., MacAlasdair, N., Ruis, C. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol 21, 180 (2020). https://doi.org/10.1186/s13059-020-02090-4 PPanGGoLiN Gautreau G et al. (2020) PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLOS Computational Biology 16(3): e1007732. https://doi.org/10.1371/journal.pcbi.1007732 PopPUNK Lees JA, Harris SR, Tonkin-Hill G, Gladstone RA, Lo SW, Weiser JN, Corander J, Bentley SD, Croucher NJ. Fast and flexible bacterial genomic epidemiology with PopPUNK. Genome Res. 2019 Feb;29(2):304-316. doi: 10.1101/gr.241455.118. Epub 2019 Jan 24. PMID: 30679308; PMCID: PMC6360808. SKA2 Harris SR. 2018. SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 453142 doi: https://doi.org/10.1101/453142 Gubbins Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. \"Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins\". doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014. Verticall SNP-sites Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, Harris SR. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom. 2016 Apr 29;2(4):e000056. doi: 10.1099/mgen.0.000056. PMID: 28348851; PMCID: PMC5320690. Unicycler Wick RR, Judd LM, Gorrie CL, Holt KE. Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads. PLoS Comput Biol. 2017 Jun 8;13(6):e1005595. doi: 10.1371/journal.pcbi.1005595. PMID: 28594827; PMCID: PMC5481147. IslandPath Claire Bertelli, Fiona S L Brinkman, Improved genomic island predictions with IslandPath-DIMOB, Bioinformatics, Volume 34, Issue 13, 01 July 2018, Pages 2161\u20132167, https://doi.org/10.1093/bioinformatics/bty095 PhiSpy Sajia Akhter, Ramy K. Aziz, Robert A. Edwards; PhiSpy: a novel algorithm for finding prophages in bacterial genomes that combines similarity- and composition-based strategies. Nucl Acids Res 2012; 40 (16): e126. doi: 10.1093/nar/gks406","title":"Pipeline tools"},{"location":"CITATIONS/#software-packagingcontainerisation-tools","text":"Anaconda Anaconda Software Distribution. Computer software. Vers. 2-2.4.0. Anaconda, Nov. 2016. Web. Bioconda Gr\u00fcning B, Dale R, Sj\u00f6din A, Chapman BA, Rowe J, Tomkins-Tinch CH, Valieris R, K\u00f6ster J; Bioconda Team. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018 Jul;15(7):475-476. doi: 10.1038/s41592-018-0046-7. PubMed PMID: 29967506. BioContainers da Veiga Leprevost F, Gr\u00fcning B, Aflitos SA, R\u00f6st HL, Uszkoreit J, Barsnes H, Vaudel M, Moreno P, Gatto L, Weber J, Bai M, Jimenez RC, Sachsenberg T, Pfeuffer J, Alvarez RV, Griss J, Nesvizhskii AI, Perez-Riverol Y. 
BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017 Aug 15;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192. PubMed PMID: 28379341; PubMed Central PMCID: PMC5870671. Docker Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 239, Article 2 (March 2014). Singularity Kurtzer GM, Sochat V, Bauer MW. Singularity: Scientific containers for mobility of compute. PLoS One. 2017 May 11;12(5):e0177459. doi: 10.1371/journal.pone.0177459. eCollection 2017. PubMed PMID: 28494014; PubMed Central PMCID: PMC5426675.","title":"Software packaging/containerisation tools"},{"location":"ROADMAP/","text":"A list in no particular order of outstanding development features, both in-progress and planned: Sensible default QC parameters to allow automated end-to-end execution with little-to-no required user intervention Integration of additional tools and scripts: Phylogenetic inference of lateral gene transfer events using rspr Inference of concerted gain and loss of genes and mobile genetic elements using the Community Coevolution Model Partner applications for analysis and visualization of phylogenetic distributions of genes and MGEs and gene-order clustering (For example, Coeus ).","title":"Roadmap"},{"location":"faq/","text":"Frequently Asked Questions How do I run ARETE in a Slurm HPC environment? Set a config file under ~/.nextflow/config to use the slurm executor: process { executor = 'slurm' pollInterval = '60 sec' submitRateLimit = '60/1min' queueSize = 100 // If an account is necessary: clusterOptions = '--account= ' } See the Nextflow documentation for a description of these options. Now, when running ARETE, you'll need to set additional options if your compute nodes don't have network access - as is common for most Slurm clusters. The example below uses the default test data, i.e. the test profile, for demonstration purposes only. nextflow run beiko-lab/ARETE \\ --db_cache path/to/db_cache \\ --bakta_db path/to/baktadb \\ -profile test,singularity Apart from -profile singularity , which just makes ARETE use Singularity/Apptainer containers for running the tools, there are two additional parameters: --db_cache should be the location for the pre-downloaded databases used in the DIAMOND alignments (i.e. Bacmet, VFDB, ICEberg2 and CAZy FASTA files) and in the Kraken2 taxonomic read classification. Although these tools run by default, you can change the selection of annotation tools by changing --annotation_tools and skip Kraken2 by adding --skip_kraken . See the parameter documentation for a full list of parameters and their defaults. --bakta_db should be the location of the pre-downloaded Bakta database Alternatively, you can use Prokka for annotating your assemblies, since it doesn't require a downloaded database ( --use_prokka ).","title":"FAQ"},{"location":"faq/#frequently-asked-questions","text":"","title":"Frequently Asked Questions"},{"location":"faq/#how-do-i-run-arete-in-a-slurm-hpc-environment","text":"Set a config file under ~/.nextflow/config to use the slurm executor: process { executor = 'slurm' pollInterval = '60 sec' submitRateLimit = '60/1min' queueSize = 100 // If an account is necessary: clusterOptions = '--account= ' } See the Nextflow documentation for a description of these options. Now, when running ARETE, you'll need to set additional options if your compute nodes don't have network access - as is common for most Slurm clusters. 
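One workaround is to pre-stage the container images from a node that does have network access (typically the login node) before submitting cluster jobs; a hedged sketch using Nextflow's standard NXF_SINGULARITY_CACHEDIR variable and the stub-run described in the Quick Start (the cache path is an example):

```bash
# Run from a network-connected node; pulled images land in the shared cache
export NXF_SINGULARITY_CACHEDIR=/scratch/$USER/singularity_images
nextflow run beiko-lab/ARETE -profile test,singularity -stub-run
```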
The example below uses the default test data, i.e. the test profile, for demonstration purposes only. nextflow run beiko-lab/ARETE \\ --db_cache path/to/db_cache \\ --bakta_db path/to/baktadb \\ -profile test,singularity Apart from -profile singularity , which just makes ARETE use Singularity/Apptainer containers for running the tools, there are two additional parameters: --db_cache should be the location for the pre-downloaded databases used in the DIAMOND alignments (i.e. Bacmet, VFDB, ICEberg2 and CAZy FASTA files) and in the Kraken2 taxonomic read classification. Although these tools run by default, you can change the selection of annotation tools by changing --annotation_tools and skip Kraken2 by adding --skip_kraken . See the parameter documentation for a full list of parameters and their defaults. --bakta_db should be the location of the pre-downloaded Bakta database Alternatively, you can use Prokka for annotating your assemblies, since it doesn't require a downloaded database ( --use_prokka ).","title":"How do I run ARETE in a Slurm HPC environment?"},{"location":"output/","text":"beiko-lab/ARETE: Output Introduction The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. Pipeline overview The pipeline is built using Nextflow and processes data using the following steps (steps in italics don't run by default): Short-read processing and assembly FastQC - Raw and trimmed read QC FastP - Read trimming Kraken2 - Taxonomic assignment Unicycler - Short read assembly Quast - Assembly quality score Annotation Bakta or Prokka - Gene detection and annotation MobRecon - Reconstruction and typing of plasmids RGI - Detection and annotation of AMR determinants IslandPath - Predicts genomic islands in bacterial and archaeal genomes. PhiSpy - Prediction of prophages from bacterial genomes IntegronFinder - Finds integrons in DNA sequences Diamond - Detection and annotation of genes using external databases. CAZy: Carbohydrate metabolism VFDB: Virulence factors BacMet: Metal resistance determinants ICEberg: Integrative and conjugative elements PopPUNK Subworkflow PopPUNK - Genome clustering Recombination Verticall - Conduct pairwise assembly comparisons between genomes in the same PopPUNK cluster SKA2 - Generate a whole-genome FASTA alignment for each genome within a cluster. Gubbins - Detection of recombination events within genomes of the same cluster. Phylogenomics and Pangenomics Panaroo or PPanGGoLiN - Pangenome alignment FastTree or IQTree - Maximum likelihood core genome phylogenetic tree SNPsites - Extracts SNPs from a multi-FASTA alignment Pipeline information Report metrics generated during the workflow execution MultiQC - Aggregate report describing results and QC from the whole pipeline Assembly FastQC read_processing/*_fastqc/ *_fastqc.html : FastQC report containing quality metrics for your untrimmed raw fastq files. *_fastqc.zip : Zip archive containing the FastQC report, tab-delimited data file and plot images. NB: The FastQC plots in this directory are generated relative to the raw input reads. They may contain adapter sequence and regions of low quality. To see how your reads look after adapter and quality trimming please refer to the FastQC reports in the trimgalore/fastqc/ directory. FastQC gives general quality metrics about your sequenced reads.
It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages . NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality. fastp read_processing/fastp/ ${meta.id} : Trimmed files and trimming reports for each input sample. fastp is an all-in-one fastq preprocessor for read/adapter trimming and quality control. It is used in this pipeline for trimming adapter sequences and discarding low-quality reads. Kraken2 read_processing/kraken2/ *.kraken2.report.txt : Text file containing genome-wise information of Kraken2 findings. See here for details. *.classified(_(1|2))?.fastq.gz : Fastq file containing classified reads. If paired-end, one file per end. *.unclassified(_(1|2))?.fastq.gz : Fastq file containing unclassified reads. If paired-end, one file per end. Kraken2 is a read classification software which will assign taxonomy to each read comprising a sample. These results may be analyzed as an indicator of contamination. Unicycler assembly/unicycler/ *.assembly.gfa *.scaffolds.fa *.unicycler.log Short/hybrid read assembler. For now only handles short reads in ARETE. Quast assembly/quast/ report.tsv : A tab-separated report compiling all QC metrics recorded over all genomes quast/ report.(html|tex|pdf|tsv|txt) : The Quast report in different file formats transposed_report.(tsv|txt) : Transpose of the Quast report quast.log : Log file of all Quast runs icarus_viewers/ contig_size_viewer.html basic_stats/ : Directory containing various summary plots generated by Quast. Annotation Bakta annotation/bakta/ ${sample_id}/ : Bakta results will be in one directory per genome. ${sample_id}.tsv : annotations as simple human readable TSV ${sample_id}.gff3 : annotations & sequences in GFF3 format ${sample_id}.gbff : annotations & sequences in (multi) GenBank format ${sample_id}.embl : annotations & sequences in (multi) EMBL format ${sample_id}.fna : replicon/contig DNA sequences as FASTA ${sample_id}.ffn : feature nucleotide sequences as FASTA ${sample_id}.faa : CDS/sORF amino acid sequences as FASTA ${sample_id}.hypotheticals.tsv : further information on hypothetical protein CDS as simple human readable tab separated values ${sample_id}.hypotheticals.faa : hypothetical protein CDS amino acid sequences as FASTA ${sample_id}.txt : summary as TXT ${sample_id}.png : circular genome annotation plot as PNG ${sample_id}.svg : circular genome annotation plot as SVG Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs Prokka annotation/prokka/ ${sample_id}/ : Prokka results will be in one directory per genome. ${sample_id}.err : Unacceptable annotations ${sample_id}.faa : Protein FASTA file of translated CDS sequences ${sample_id}.ffn : Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) ${sample_id}.fna : Nucleotide FASTA file of input contig sequences ${sample_id}.fsa : Nucleotide FASTA file of the input contig sequences, used by \"tbl2asn\" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. ${sample_id}.gff : This is the master annotation in GFF3 format, containing both sequences and annotations. ${sample_id}.gbk : This is a standard Genbank file derived from the master .gff.
${sample_id}.log : Contains all the output that Prokka produced during its run. This is a record of what settings were used, even if the --quiet option was enabled. ${sample_id}.sqn : An ASN1 format \"Sequin\" file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc. ${sample_id}.tbl : Feature Table file, used by \"tbl2asn\" to create the .sqn file. ${sample_id}.tsv : Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product ${sample_id}.txt : Statistics relating to the annotated features found. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files. RGI annotation/rgi/ ${sample_id}_rgi.txt : A TSV report containing all AMR predictions for a given genome. For more info see here RGI predicts AMR determinants using the CARD ontology and various trained models. MobRecon annotation/mob_recon ${sample_id}_mob_recon/ : MobRecon results will be in one directory per genome. contig_report.txt - This file describes the assignment of the contig to chromosome or a particular plasmid grouping. mge.report.txt - Blast HSP of detected MGEs/repetitive elements with contextual information. chromosome.fasta - Fasta file of all contigs found to belong to the chromosome. plasmid_*.fasta - Each plasmid group is written to an individual fasta file which contains the assigned contigs. mobtyper_results - Aggregate MOB-typer report files for all identified plasmids. MobRecon reconstructs individual plasmid sequences from draft genome assemblies using the clustered plasmid reference databases DIAMOND annotation/(vfdb|bacmet|cazy|iceberg2)/ ${sample_id}/${sample_id}_(VFDB|BACMET|CAZYDB|ICEberg2).txt : Blast6 formatted TSVs indicating BlastX results of the genes from each genome against VFDB, BacMet, and CAZy databases. (VFDB|BACMET|CAZYDB|ICEberg2).txt : Table with all hits to this database, with a column describing which genome the match originates from. Sorted and filtered by the match's coverage. Diamond is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data. We use DIAMOND to predict the presence of virulence factors, heavy metal resistance determinants, carbohydrate-active enzymes, and integrative and conjugative elements using VFDB , BacMet , CAZy , and ICEberg2 respectively. IslandPath annotation/islandpath/ ${sample_id}/ : IslandPath results will be in one directory per genome. ${sample_id}.tsv : IslandPath results Dimob.log : IslandPath execution log IslandPath is standalone software to predict genomic islands in bacterial and archaeal genomes based on the presence of dinucleotide biases and mobility genes. IntegronFinder Disabled by default. Enable by adding --run_integronfinder to your command. annotation/integron_finder/ Results_Integron_Finder_${sample_id}/ : IntegronFinder results will be in one directory per genome. Integron Finder is a bioinformatics tool to find integrons in bacterial genomes. PhiSpy annotation/phispy/ ${sample_id}/ : PhiSpy results will be in one directory per genome. See the PhiSpy documentation for an extensive description of the output. PhiSpy is a tool for identification of prophages in Bacterial (and probably Archaeal) genomes. Given an annotated genome it will use several approaches to identify the most likely prophage regions.
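Because the per-database DIAMOND hits above are plain Blast6 TSVs, they are easy to triage on the command line; a hedged sketch assuming DIAMOND's default BLAST6 column order (qseqid, sseqid, pident, ..., evalue, bitscore) and a hypothetical file name:

```bash
# Keep VFDB hits with at least 80% identity, ranked by bitscore (column 12)
awk -F'\t' '$3 >= 80' sample_01_VFDB.txt | sort -t$'\t' -k12,12 -rn | head
```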
PopPUNK poppunk_results/ poppunk_${poppunk_model}/ - Results from PopPUNK's fit-model command poppunk_visualizations/ - Results from the poppunk_visualise command PopPUNK is a tool for clustering genomes. Recombination Verticall recombination/verticall/ verticall_cluster*.tsv - Verticall results for the genomes within this PopPUNK cluster. Verticall is a tool to help produce bacterial genome phylogenies which are not influenced by horizontally acquired sequences such as recombination. SKA2 recombination/ska2/ cluster_*.aln - SKA2 results for the genomes within this PopPUNK cluster. SKA2 (Split Kmer Analysis) is a toolkit for prokaryotic (and any other small, haploid) DNA sequence analysis using split kmers. Gubbins recombination/gubbins/ cluster_*/ - Gubbins results for the genomes within this PopPUNK cluster. Gubbins is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions. Phylogenomics and Pangenomics Panaroo pangenomics/panaroo/results/ See the panaroo documentation for an extensive description of output provided. Panaroo is a Bacterial Pangenome Analysis Pipeline. PPanGGoLiN pangenomics/ppanggolin/ See the PPanGGoLiN documentation for an extensive description of output provided. PPanGGoLiN is a tool to build a partitioned pangenome graph from microbial genomes FastTree phylogenomics/fasttree/ *.tre : Newick formatted maximum likelihood tree of core-genome alignment. FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences IQTree phylogenomics/iqtree/ *.treefile : Newick formatted maximum likelihood tree of core-genome alignment. IQTree is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood. SNPsites phylogenomics/snpsites/ filtered_alignment.fas : Variant fasta file. constant.sites.txt : Text file containing counts of constant sites. SNPsites is a tool to rapidly extract SNPs from a multi-FASTA alignment. Pipeline information pipeline_info/ Reports generated by Nextflow: execution_report.html , execution_timeline.html , execution_trace.txt and pipeline_dag.dot / pipeline_dag.svg . Reports generated by the pipeline: pipeline_report.html , pipeline_report.txt and software_versions.csv . Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv . Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. MultiQC multiqc/ multiqc_report.html : a standalone HTML file that can be viewed in your web browser. multiqc_data/ : directory containing parsed statistics from the different tools used in the pipeline. multiqc_plots/ : directory containing static images from the report in various formats. MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. 
For more information about how to use MultiQC reports, see http://multiqc.info .","title":"Output"},{"location":"output/#beiko-labarete-output","text":"","title":"beiko-lab/ARETE: Output"},{"location":"output/#introduction","text":"The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.","title":"Introduction"},{"location":"output/#pipeline-overview","text":"The pipeline is built using Nextflow and processes data using the following steps (steps in italics don't run by default): Short-read processing and assembly FastQC - Raw and trimmed read QC FastP - Read trimming Kraken2 - Taxonomic assignment Unicycler - Short read assembly Quast - Assembly quality assessment Annotation Bakta or Prokka - Gene detection and annotation MobRecon - Reconstruction and typing of plasmids RGI - Detection and annotation of AMR determinants IslandPath - Predicts genomic islands in bacterial and archaeal genomes. PhiSpy - Prediction of prophages from bacterial genomes IntegronFinder - Finds integrons in DNA sequences Diamond - Detection and annotation of genes using external databases. CAZy: Carbohydrate metabolism VFDB: Virulence factors BacMet: Metal resistance determinants ICEberg: Integrative and conjugative elements PopPUNK Subworkflow PopPUNK - Genome clustering Recombination Verticall - Conduct pairwise assembly comparisons between genomes in the same PopPUNK cluster SKA2 - Generate a whole-genome FASTA alignment for each genome within a cluster. Gubbins - Detection of recombination events within genomes of the same cluster. Phylogenomics and Pangenomics Panaroo or PPanGGoLiN - Pangenome alignment FastTree or IQTree - Maximum likelihood core genome phylogenetic tree SNPsites - Extracts SNPs from a multi-FASTA alignment Pipeline information Report metrics generated during the workflow execution MultiQC - Aggregate report describing results and QC from the whole pipeline","title":"Pipeline overview"},{"location":"output/#assembly","text":"","title":"Assembly"},{"location":"output/#fastqc","text":"read_processing/*_fastqc/ *_fastqc.html : FastQC report containing quality metrics for your untrimmed raw fastq files. *_fastqc.zip : Zip archive containing the FastQC report, tab-delimited data file and plot images. NB: The FastQC plots in this directory are generated relative to the raw, input reads. They may contain adapter sequence and regions of low quality. To see how your reads look after adapter and quality trimming please refer to the FastQC reports generated from the trimmed reads. FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages . NB: The FastQC plots displayed in the MultiQC report show untrimmed reads. They may contain adapter sequence and potentially regions with low quality.","title":"FastQC"},{"location":"output/#fastp","text":"read_processing/fastp/ ${meta.id} : Trimmed files and trimming reports for each input sample. fastp is an all-in-one FASTQ preprocessor for read/adapter trimming and quality control. 
It is used in this pipeline for trimming adapter sequences and discarding low-quality reads.","title":"fastp"},{"location":"output/#kraken2","text":"read_processing/kraken2/ *.kraken2.report.txt : Text file containing genome-wise information of Kraken2 findings. See here for details. *.classified(_(1|2))?.fastq.gz : FASTQ file containing classified reads. If paired-end, one file per end. *.unclassified(_(1|2))?.fastq.gz : FASTQ file containing unclassified reads. If paired-end, one file per end. Kraken2 is read classification software which assigns taxonomy to each read comprising a sample. These results may be analyzed as an indicator of contamination.","title":"Kraken2"},{"location":"output/#unicycler","text":"assembly/unicycler/ *.assembly.gfa *.scaffolds.fa *.unicycler.log Short/hybrid read assembler. For now, ARETE only handles short reads.","title":"Unicycler"},{"location":"output/#quast","text":"assembly/quast/ report.tsv : A tab-separated report compiling all QC metrics recorded over all genomes quast/ report.(html|tex|pdf|tsv|txt) : The Quast report in different file formats transposed_report.(tsv|txt) : Transpose of the Quast report quast.log : Log file of all Quast runs icarus_viewers/ contig_size_viewer.html basic_stats/ : Directory containing various summary plots generated by Quast.","title":"Quast"},{"location":"output/#annotation","text":"","title":"Annotation"},{"location":"output/#bakta","text":"annotation/bakta/ ${sample_id}/ : Bakta results will be in one directory per genome. ${sample_id}.tsv : annotations as simple human readable TSV ${sample_id}.gff3 : annotations & sequences in GFF3 format ${sample_id}.gbff : annotations & sequences in (multi) GenBank format ${sample_id}.embl : annotations & sequences in (multi) EMBL format ${sample_id}.fna : replicon/contig DNA sequences as FASTA ${sample_id}.ffn : feature nucleotide sequences as FASTA ${sample_id}.faa : CDS/sORF amino acid sequences as FASTA ${sample_id}.hypotheticals.tsv : further information on hypothetical protein CDS as simple human readable tab-separated values ${sample_id}.hypotheticals.faa : hypothetical protein CDS amino acid sequences as FASTA ${sample_id}.txt : summary as TXT ${sample_id}.png : circular genome annotation plot as PNG ${sample_id}.svg : circular genome annotation plot as SVG Bakta is a tool for the rapid & standardized annotation of bacterial genomes and plasmids from both isolates and MAGs.","title":"Bakta"},{"location":"output/#prokka","text":"annotation/prokka/ ${sample_id}/ : Prokka results will be in one directory per genome. ${sample_id}.err : Unacceptable annotations ${sample_id}.faa : Protein FASTA file of translated CDS sequences ${sample_id}.ffn : Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA) ${sample_id}.fna : Nucleotide FASTA file of input contig sequences ${sample_id}.fsa : Nucleotide FASTA file of the input contig sequences, used by \"tbl2asn\" to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines. ${sample_id}.gff : This is the master annotation in GFF3 format, containing both sequences and annotations. ${sample_id}.gbk : This is a standard Genbank file derived from the master .gff. ${sample_id}.log : Contains all the output that Prokka produced during its run. This is a record of what settings were used, even if the --quiet option was enabled. ${sample_id}.sqn : An ASN1 format \"Sequin\" file for submission to Genbank. 
It needs to be edited to set the correct taxonomy, authors, related publication etc. ${sample_id}.tbl : Feature Table file, used by \"tbl2asn\" to create the .sqn file. ${sample_id}.tsv : Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product ${sample_id}.txt : Statistics relating to the annotated features found. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files.","title":"Prokka"},{"location":"output/#rgi","text":"annotation/rgi/ ${sample_id}_rgi.txt : A TSV report containing all AMR predictions for a given genome. For more info see here RGI predicts AMR determinants using the CARD ontology and various trained models.","title":"RGI"},{"location":"output/#mobrecon","text":"annotation/mob_recon ${sample_id}_mob_recon/ : MobRecon results will be in one directory per genome. contig_report.txt - This file describes the assignment of the contig to chromosome or a particular plasmid grouping. mge.report.txt - BLAST HSPs of detected MGEs/repetitive elements with contextual information. chromosome.fasta - Fasta file of all contigs found to belong to the chromosome. plasmid_*.fasta - Each plasmid group is written to an individual fasta file which contains the assigned contigs. mobtyper_results - Aggregate MOB-typer report files for all identified plasmids. MobRecon reconstructs individual plasmid sequences from draft genome assemblies using the clustered plasmid reference databases.","title":"MobRecon"},{"location":"output/#diamond","text":"annotation/(vfdb|bacmet|cazy|iceberg2)/ ${sample_id}/${sample_id}_(VFDB|BACMET|CAZYDB|ICEberg2).txt : Blast6 formatted TSVs indicating BlastX results of the genes from each genome against the VFDB, BacMet, CAZy, and ICEberg2 databases. (VFDB|BACMET|CAZYDB|ICEberg2).txt : Table with all hits to this database, with a column describing which genome the match originates from. Sorted and filtered by the match's coverage. Diamond is a sequence aligner for protein and translated DNA searches, designed for high performance analysis of big sequence data. We use DIAMOND to predict the presence of virulence factors, heavy metal resistance determinants, carbohydrate-active enzymes, and integrative and conjugative elements using VFDB , BacMet , CAZy , and ICEberg2 respectively.","title":"DIAMOND"},{"location":"output/#islandpath","text":"annotation/islandpath/ ${sample_id}/ : IslandPath results will be in one directory per genome. ${sample_id}.tsv : IslandPath results Dimob.log : IslandPath execution log IslandPath is standalone software to predict genomic islands in bacterial and archaeal genomes based on the presence of dinucleotide biases and mobility genes.","title":"IslandPath"},{"location":"output/#integronfinder","text":"Disabled by default. Enable by adding --run_integronfinder to your command. annotation/integron_finder/ Results_Integron_Finder_${sample_id}/ : IntegronFinder results will be in one directory per genome. Integron Finder is a bioinformatics tool to find integrons in bacterial genomes.","title":"IntegronFinder"},{"location":"output/#phispy","text":"annotation/phispy/ ${sample_id}/ : PhiSpy results will be in one directory per genome. See the PhiSpy documentation for an extensive description of the output. PhiSpy is a tool for identification of prophages in Bacterial (and probably Archaeal) genomes. 
Given an annotated genome, it will use several approaches to identify the most likely prophage regions.","title":"PhiSpy"},{"location":"output/#poppunk","text":"poppunk_results/ poppunk_${poppunk_model}/ - Results from PopPUNK's fit-model command poppunk_visualizations/ - Results from the poppunk_visualise command PopPUNK is a tool for clustering genomes.","title":"PopPUNK"},{"location":"output/#recombination","text":"","title":"Recombination"},{"location":"output/#verticall","text":"recombination/verticall/ verticall_cluster*.tsv - Verticall results for the genomes within this PopPUNK cluster. Verticall is a tool to help produce bacterial genome phylogenies which are not influenced by horizontally acquired sequences such as recombination.","title":"Verticall"},{"location":"output/#ska2","text":"recombination/ska2/ cluster_*.aln - SKA2 results for the genomes within this PopPUNK cluster. SKA2 (Split Kmer Analysis) is a toolkit for prokaryotic (and any other small, haploid) DNA sequence analysis using split kmers.","title":"SKA2"},{"location":"output/#gubbins","text":"recombination/gubbins/ cluster_*/ - Gubbins results for the genomes within this PopPUNK cluster. Gubbins is an algorithm that iteratively identifies loci containing elevated densities of base substitutions while concurrently constructing a phylogeny based on the putative point mutations outside of these regions.","title":"Gubbins"},{"location":"output/#phylogenomics-and-pangenomics","text":"","title":"Phylogenomics and Pangenomics"},{"location":"output/#panaroo","text":"pangenomics/panaroo/results/ See the panaroo documentation for an extensive description of output provided. Panaroo is a Bacterial Pangenome Analysis Pipeline.","title":"Panaroo"},{"location":"output/#ppanggolin","text":"pangenomics/ppanggolin/ See the PPanGGoLiN documentation for an extensive description of output provided. PPanGGoLiN is a tool to build a partitioned pangenome graph from microbial genomes.","title":"PPanGGoLiN"},{"location":"output/#fasttree","text":"phylogenomics/fasttree/ *.tre : Newick formatted maximum likelihood tree of core-genome alignment. FastTree infers approximately-maximum-likelihood phylogenetic trees from alignments of nucleotide or protein sequences.","title":"FastTree"},{"location":"output/#iqtree","text":"phylogenomics/iqtree/ *.treefile : Newick formatted maximum likelihood tree of core-genome alignment. IQTree is a fast and effective stochastic algorithm to infer phylogenetic trees by maximum likelihood.","title":"IQTree"},{"location":"output/#snpsites","text":"phylogenomics/snpsites/ filtered_alignment.fas : Variant fasta file. constant.sites.txt : Text file containing counts of constant sites. SNPsites is a tool to rapidly extract SNPs from a multi-FASTA alignment.","title":"SNPsites"},{"location":"output/#pipeline-information","text":"pipeline_info/ Reports generated by Nextflow: execution_report.html , execution_timeline.html , execution_trace.txt and pipeline_dag.dot / pipeline_dag.svg . Reports generated by the pipeline: pipeline_report.html , pipeline_report.txt and software_versions.csv . Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv . Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. 
This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.","title":"Pipeline information"},{"location":"output/#multiqc","text":"multiqc/ multiqc_report.html : a standalone HTML file that can be viewed in your web browser. multiqc_data/ : directory containing parsed statistics from the different tools used in the pipeline. multiqc_plots/ : directory containing static images from the report in various formats. MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory. Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info .","title":"MultiQC"},{"location":"params/","text":"beiko-lab/ARETE pipeline parameters AMR/VF LGT-focused bacterial genomics workflow Input/output options Define where the pipeline should find input data and save output data. Parameter Description Type Default Required Hidden input_sample_table Path to comma-separated file containing information about the samples in the experiment. Help You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row. string True outdir Path to the output directory where the results will be saved. string ./results db_cache Directory where the databases are located string None email Email address for completion summary. Help Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file ( ~/.nextflow/config ) then you don't need to specify this on the command line for every run. string multiqc_title MultiQC report title. Printed as page header, used for filename if not otherwise specified. string Reference genome options Reference and outgroup genome fasta files required for the workflow. Parameter Description Type Default Required Hidden reference_genome Path to FASTA reference genome file. 
string Kraken2 Options for the Kraken2 taxonomic classification Parameter Description Type Default Required Hidden skip_kraken Don't run Kraken2 taxonomic classification boolean Annotation Parameters for the annotation subworkflow Parameter Description Type Default Required Hidden annotation_tools Comma-separated list of annotation tools to run string mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,report bakta_db Path to the BAKTA database string None use_prokka Use Prokka (not Bakta) for annotating assemblies boolean min_pident Minimum match identity percentage for filtering integer 60 min_qcover Minimum coverage of each match for filtering number 0.6 skip_profile_creation Skip annotation feature profile creation boolean feature_profile_columns Columns to include in the feature profile string mobsuite,rgi,cazy,vfdb,iceberg,bacmet Phylogenomics Parameters for the phylogenomics subworkflow Parameter Description Type Default Required Hidden skip_phylo Skip Pangenomics and Phylogenomics subworkflow boolean use_ppanggolin Use ppanggolin for calculating the pangenome boolean use_full_alignment Use full alignment boolean use_fasttree Use FastTree boolean True PopPUNK Parameters for the lineage subworkflow Parameter Description Type Default Required Hidden skip_poppunk Skip PopPUNK boolean poppunk_model Which PopPUNK model to use (bgmm, dbscan, refine, threshold or lineage) string None run_poppunk_qc Whether to run the QC step for PopPUNK boolean enable_subsetting Enable subsetting workflow based on genome similarity boolean core_similarity Similarity threshold for core genomes number 99.99 accessory_similarity Similarity threshold for accessory genes number 99 Recombination Parameters for the recombination subworkflow Parameter Description Type Default Required Hidden run_recombination Run Recombination boolean run_verticall Run Verticall recombination tool boolean True run_gubbins Run Gubbins recombination tool boolean Institutional config options Parameters used to describe centralised config profiles. These should not be edited. Parameter Description Type Default Required Hidden custom_config_version Git commit id for Institutional configs. string master True custom_config_base Base directory for Institutional configs. Help If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter. string https://raw.githubusercontent.com/nf-core/configs/master True hostnames Institutional configs hostname. string True config_profile_name Institutional config name. string True config_profile_description Institutional config description. string True config_profile_contact Institutional config contact information. string True config_profile_url Institutional config URL link. string True Max job request options Set the top limit for requested resources for any single job. Parameter Description Type Default Required Hidden max_cpus Maximum number of CPUs that can be requested for any single job. Help Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1 integer 16 True max_memory Maximum amount of memory that can be requested for any single job. Help Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. 
--max_memory '8.GB' string 128.GB True max_time Maximum amount of time that can be requested for any single job. Help Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h' string 240.h True Generic options Less common options for the pipeline, typically set in a config file. Parameter Description Type Default Required Hidden help Display help text. boolean True publish_dir_mode Method used to save pipeline results to output directory. Help The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details. string copy True email_on_fail Email address for completion summary, only when pipeline fails. Help An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully. string True plaintext_email Send plain-text email instead of HTML. boolean True max_multiqc_email_size File size limit when attaching MultiQC reports to summary emails. string 25.MB True monochrome_logs Do not use coloured log outputs. boolean True multiqc_config Custom config file to supply to MultiQC. string True tracedir Directory to keep pipeline Nextflow logs and reports. string ${params.outdir}/pipeline_info True validate_params Boolean whether to validate parameters against the schema at runtime boolean True True show_hidden_params Show all params when using --help Help By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help . Specifying this option will tell the pipeline to show all parameters. boolean True enable_conda Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter. boolean True singularity_pull_docker_container Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead. Help This may be useful for example if you are unable to directly pull Singularity containers to run the pipeline due to http/https proxy issues. boolean True schema_ignore_params string genomes,modules multiqc_logo string None True","title":"Parameters"},{"location":"params/#beiko-labarete-pipeline-parameters","text":"AMR/VF LGT-focused bacterial genomics workflow","title":"beiko-lab/ARETE pipeline parameters"},{"location":"params/#inputoutput-options","text":"Define where the pipeline should find input data and save output data. Parameter Description Type Default Required Hidden input_sample_table Path to comma-separated file containing information about the samples in the experiment. Help You will need to create a design file with information about the samples in your experiment before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 3 columns, and a header row. string True outdir Path to the output directory where the results will be saved. string ./results db_cache Directory where the databases are located string None email Email address for completion summary. Help Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file ( ~/.nextflow/config ) then you don't need to specify this on the command line for every run. string multiqc_title MultiQC report title. 
Printed as page header, used for filename if not otherwise specified. string","title":"Input/output options"},{"location":"params/#reference-genome-options","text":"Reference and outgroup genome fasta files required for the workflow. Parameter Description Type Default Required Hidden reference_genome Path to FASTA reference genome file. string","title":"Reference genome options"},{"location":"params/#kraken2","text":"Options for the Kraken2 taxonomic classification Parameter Description Type Default Required Hidden skip_kraken Don't run Kraken2 taxonomic classification boolean","title":"Kraken2"},{"location":"params/#annotation","text":"Parameters for the annotation subworkflow Parameter Description Type Default Required Hidden annotation_tools Comma-separated list of annotation tools to run string mobsuite,rgi,cazy,vfdb,iceberg,bacmet,islandpath,phispy,report bakta_db Path to the BAKTA database string None use_prokka Use Prokka (not Bakta) for annotating assemblies boolean min_pident Minimum match identity percentage for filtering integer 60 min_qcover Minimum coverage of each match for filtering number 0.6 skip_profile_creation Skip annotation feature profile creation boolean feature_profile_columns Columns to include in the feature profile string mobsuite,rgi,cazy,vfdb,iceberg,bacmet","title":"Annotation"},{"location":"params/#phylogenomics","text":"Parameters for the phylogenomics subworkflow Parameter Description Type Default Required Hidden skip_phylo Skip Pangenomics and Phylogenomics subworkflow boolean use_ppanggolin Use ppanggolin for calculating the pangenome boolean use_full_alignment Use full alignment boolean use_fasttree Use FastTree boolean True","title":"Phylogenomics"},{"location":"params/#poppunk","text":"Parameters for the lineage subworkflow Parameter Description Type Default Required Hidden skip_poppunk Skip PopPUNK boolean poppunk_model Which PopPUNK model to use (bgmm, dbscan, refine, threshold or lineage) string None run_poppunk_qc Whether to run the QC step for PopPUNK boolean enable_subsetting Enable subsetting workflow based on genome similarity boolean core_similarity Similarity threshold for core genomes number 99.99 accessory_similarity Similarity threshold for accessory genes number 99","title":"PopPUNK"},{"location":"params/#recombination","text":"Parameters for the recombination subworkflow Parameter Description Type Default Required Hidden run_recombination Run Recombination boolean run_verticall Run Verticall recombination tool boolean True run_gubbins Run Gubbins recombination tool boolean","title":"Recombination"},{"location":"params/#institutional-config-options","text":"Parameters used to describe centralised config profiles. These should not be edited. Parameter Description Type Default Required Hidden custom_config_version Git commit id for Institutional configs. string master True custom_config_base Base directory for Institutional configs. Help If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter. string https://raw.githubusercontent.com/nf-core/configs/master True hostnames Institutional configs hostname. string True config_profile_name Institutional config name. string True config_profile_description Institutional config description. string True config_profile_contact Institutional config contact information. 
string True config_profile_url Institutional config URL link. string True","title":"Institutional config options"},{"location":"params/#max-job-request-options","text":"Set the top limit for requested resources for any single job. Parameter Description Type Default Required Hidden max_cpus Maximum number of CPUs that can be requested for any single job. Help Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1 integer 16 True max_memory Maximum amount of memory that can be requested for any single job. Help Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB' string 128.GB True max_time Maximum amount of time that can be requested for any single job. Help Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h' string 240.h True","title":"Max job request options"},{"location":"params/#generic-options","text":"Less common options for the pipeline, typically set in a config file. Parameter Description Type Default Required Hidden help Display help text. boolean True publish_dir_mode Method used to save pipeline results to output directory. Help The Nextflow publishDir option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details. string copy True email_on_fail Email address for completion summary, only when pipeline fails. Help An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully. string True plaintext_email Send plain-text email instead of HTML. boolean True max_multiqc_email_size File size limit when attaching MultiQC reports to summary emails. string 25.MB True monochrome_logs Do not use coloured log outputs. boolean True multiqc_config Custom config file to supply to MultiQC. string True tracedir Directory to keep pipeline Nextflow logs and reports. string ${params.outdir}/pipeline_info True validate_params Boolean whether to validate parameters against the schema at runtime boolean True True show_hidden_params Show all params when using --help Help By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help . Specifying this option will tell the pipeline to show all parameters. boolean True enable_conda Run this workflow with Conda. You can also use '-profile conda' instead of providing this parameter. boolean True singularity_pull_docker_container Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead. Help This may be useful for example if you are unable to directly pull Singularity containers to run the pipeline due to http/https proxy issues. boolean True schema_ignore_params string genomes,modules multiqc_logo string None True","title":"Generic options"},{"location":"subsampling/","text":"PopPUNK subsetting The subsampling subworkflow is executed if you want to reduce the number of genomes that get added to the phylogenomics subworkflow. By reducing the number of genomes, you can potentially reduce resource requirements for the pangenomics and phylogenomics tools. To enable this subworkflow, add --enable_subsetting when running beiko-lab/ARETE. 
This will subset genomes based on their core genome similarity and accessory genome similarity, as calculated via their PopPUNK distances. By default, the threshold is --core_similarity 99.9 and --accessory_similarity 99 . These can be changed by adding the parameters to your command. If any pair of genomes meets both similarity thresholds, only one genome from the pair is carried forward into the phylogenomics subworkflow. All of the removed genome IDs will be present under poppunk_results/removed_genomes.txt . By adding --enable_subsetting , you'll be adding two processes to the execution DAG: POPPUNK_EXTRACT_DISTANCES: This process will extract pair-wise distances between all genomes, returning a table under poppunk_results/distances/ . This table will be used to perform the subsetting. MAKE_HEATMAP: This process will create a heatmap showing different similarity thresholds and the number of genomes that'd be present in each of the possible subsets. It'll also be under poppunk_results/distances/ . Example command The command below will execute the 'annotation' ARETE entry with subsetting enabled, with a core similarity threshold of 99% and an accessory similarity of 95%. nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --enable_subsetting \\ --core_similarity 99 \\ --accessory_similarity 95 \\ -profile docker \\ -entry annotation Be sure not to include --skip_poppunk in your command, because that will then disable all PopPUNK-related processes, including the subsetting subworkflow.","title":"Subsampling"},{"location":"subsampling/#poppunk-subsetting","text":"The subsampling subworkflow is executed if you want to reduce the number of genomes that get added to the phylogenomics subworkflow. By reducing the number of genomes, you can potentially reduce resource requirements for the pangenomics and phylogenomics tools. To enable this subworkflow, add --enable_subsetting when running beiko-lab/ARETE. This will subset genomes based on their core genome similarity and accessory genome similarity, as calculated via their PopPUNK distances. By default, the threshold is --core_similarity 99.9 and --accessory_similarity 99 . These can be changed by adding the parameters to your command. If any pair of genomes meets both similarity thresholds, only one genome from the pair is carried forward into the phylogenomics subworkflow. All of the removed genome IDs will be present under poppunk_results/removed_genomes.txt . By adding --enable_subsetting , you'll be adding two processes to the execution DAG: POPPUNK_EXTRACT_DISTANCES: This process will extract pair-wise distances between all genomes, returning a table under poppunk_results/distances/ . This table will be used to perform the subsetting. MAKE_HEATMAP: This process will create a heatmap showing different similarity thresholds and the number of genomes that'd be present in each of the possible subsets. It'll also be under poppunk_results/distances/ .","title":"PopPUNK subsetting"},{"location":"subsampling/#example-command","text":"The command below will execute the 'annotation' ARETE entry with subsetting enabled, with a core similarity threshold of 99% and an accessory similarity of 95%. 
nextflow run beiko-lab/ARETE \\ --input_sample_table samplesheet.csv \\ --enable_subsetting \\ --core_similarity 99 \\ --accessory_similarity 95 \\ -profile docker \\ -entry annotation Be sure not to include --skip_poppunk in your command, because that will then disable all PopPUNK-related processes, including the subsetting subworkflow.","title":"Example command"},{"location":"usage/","text":"beiko-lab/ARETE: Usage Introduction The ARETE pipeline is designed as an end-to-end workflow manager for genome assembly, annotation, and phylogenetic analysis, beginning with read data. However, in some cases a user may wish to stop the pipeline prior to annotation or use the annotation features of the workflow with pre-existing assemblies. Therefore, ARETE supports several use cases: Run the full pipeline end-to-end. Input a set of reads and stop after assembly. Input a set of assemblies and perform QC. Input a set of assemblies and perform annotation and taxonomic analyses. Input a set of assemblies and perform genome clustering with PopPUNK. Input a set of assemblies and perform phylogenomic and pangenomic analysis. This document will describe how to perform each workflow. \"Running the pipeline\" will show some example commands on how to use these different entries to ARETE. Samplesheet input No matter your use case, you will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. For full runs and assembly, it has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. --input_sample_table '[path to samplesheet file]' Full workflow or assembly samplesheet The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below. A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where TREATMENT_REP3 has been sequenced twice. sample,fastq_1,fastq_2 CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fastq_1 Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". fastq_2 Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". An example samplesheet has been provided with the pipeline. Annotation only samplesheet The ARETE pipeline allows users to provide pre-existing assemblies to make use of the annotation and reporting features of the workflow. Users may use the assembly_qc entry point to perform QC on the assemblies. Note that the QC workflow does not automatically filter low quality assemblies, it simply generates QC reports! 
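One way to act on those QC reports is to rank assemblies yourself before building the samplesheet for the next stage. A minimal sketch, assuming the Quast layout described in the output documentation, where the full report lives at assembly/quast/quast/report.tsv with one metric per row (adjust the path to your run): head -n 1 assembly/quast/quast/report.tsv && grep -w '^N50' assembly/quast/quast/report.tsv Read together, the two lines list each assembly name and its N50; assemblies with unusually low values are candidates to leave out. 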
annotation , assembly_qc and poppunk workflows accept the same format of sample sheet. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fna_file_path Full path to fna file for assembly or genome. File must have .fna file extension. An example samplesheet has been provided with the pipeline. Phylogenomics and Pangenomics only samplesheet The ARETE pipeline allows users to provide pre-existing assemblies to make use of the phylogenomic and pangenomic features of the workflow. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. gff_file_path Full path to GFF file for assembly or genome. File must have .gff or .gff3 file extension. These files can be the ones generated by Prokka or Bakta in ARETE's annotation subworkflow. Reference Genome For full workflow or assembly, users may provide a path to a reference genome in fasta format for use in assembly evaluation. --reference_genome ref.fasta Running the pipeline The typical command for running the pipeline is as follows: nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm -profile docker This will launch the pipeline with the docker configuration profile. See below for more information about profiles. Note that the pipeline will create the following files in your working directory: work # Directory containing the nextflow working files results # Finished results (configurable, see below) .nextflow_log # Log file from Nextflow # Other nextflow hidden files, eg. history of pipeline runs and old logs. As written above, the pipeline also allows users to execute only assembly or only annotation. Assembly Entry To execute assembly (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker Assembly QC Entry To execute QC on pre-existing assemblies (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly_qc --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker Annotation Entry To execute annotation of pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry annotation --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker PopPUNK Entry To execute PopPUNK clustering of pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry poppunk --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker Phylogenomics and Pangenomics Entry To execute phylogenomic and pangenomics analysis on pre-existing assemblies: nextflow run beiko-lab/ARETE -entry phylogenomics --input_sample_table samplesheet.csv -profile docker Updating the pipeline When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. 
To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: nextflow pull beiko-lab/ARETE Reproducibility It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. First, go to the ARETE releases page and find the latest version number - numeric only (eg. 1.3.1 ). Then specify this when running the pipeline with -r (one hyphen) - eg. -r 1.3.1 . This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future. Core Nextflow arguments NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen). -profile Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud) - see below. We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility. The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation . Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles. If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH . This is not recommended. docker A generic configuration profile to be used with Docker singularity A generic configuration profile to be used with Singularity podman A generic configuration profile to be used with Podman shifter A generic configuration profile to be used with Shifter charliecloud A generic configuration profile to be used with Charliecloud conda Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud. A generic configuration profile to be used with Conda Pulls most software from Bioconda test A profile with a complete configuration for automated testing Can run on personal computers with at least 6GB of RAM and 2 CPUs Includes links to test data so needs no other parameters -resume Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: -resume [run-name] . Use the nextflow log command to show previous run names. -c Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information. Custom resource requests Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. 
For most of the steps in the pipeline, if the job exits with an error code of 143 (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline is stopped. Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process UNICYCLER 32GB of memory, you could use the following config: process { withName: UNICYCLER { memory = 32.GB } } To find the exact name of a process you wish to modify the compute resources, check the live-status of a nextflow run displayed on your terminal or check the nextflow error for a line like so: Error executing process > 'bwa' . In this case the name to specify in the custom config file is bwa . See the main Nextflow documentation for more information. Running in the background Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished. The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file. Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs). Nextflow memory requirements In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile ): NXF_OPTS='-Xms1g -Xmx4g'","title":"Usage"},{"location":"usage/#beiko-labarete-usage","text":"","title":"beiko-lab/ARETE: Usage"},{"location":"usage/#introduction","text":"The ARETE pipeline is designed as an end-to-end workflow manager for genome assembly, annotation, and phylogenetic analysis, beginning with read data. However, in some cases a user may wish to stop the pipeline prior to annotation or use the annotation features of the workflow with pre-existing assemblies. Therefore, ARETE supports several use cases: Run the full pipeline end-to-end. Input a set of reads and stop after assembly. Input a set of assemblies and perform QC. Input a set of assemblies and perform annotation and taxonomic analyses. Input a set of assemblies and perform genome clustering with PopPUNK. Input a set of assemblies and perform phylogenomic and pangenomic analysis. This document will describe how to perform each workflow. \"Running the pipeline\" will show some example commands on how to use these different entries to ARETE.","title":"Introduction"},{"location":"usage/#samplesheet-input","text":"No matter your use case, you will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. For full runs and assembly, it has to be a comma-separated file with 3 columns, and a header row as shown in the examples below. --input_sample_table '[path to samplesheet file]'","title":"Samplesheet input"},{"location":"usage/#full-workflow-or-assembly-samplesheet","text":"The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. 
The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below. A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where TREATMENT_REP3 has been sequenced twice. sample,fastq_1,fastq_2 CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fastq_1 Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". fastq_2 Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension \".fastq.gz\" or \".fq.gz\". An example samplesheet has been provided with the pipeline.","title":"Full workflow or assembly samplesheet"},{"location":"usage/#annotation-only-samplesheet","text":"The ARETE pipeline allows users to provide pre-existing assemblies to make use of the annotation and reporting features of the workflow. Users may use the assembly_qc entry point to perform QC on the assemblies. Note that the QC workflow does not automatically filter low quality assemblies, it simply generates QC reports! annotation , assembly_qc and poppunk workflows accept the same format of sample sheet. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. fna_file_path Full path to fna file for assembly or genome. File must have .fna file extension. An example samplesheet has been provided with the pipeline.","title":"Annotation only samplesheet"},{"location":"usage/#phylogenomics-and-pangenomics-only-samplesheet","text":"The ARETE pipeline allows users to provide pre-existing assemblies to make use of the phylogenomic and pangenomic features of the workflow. The sample sheet must be a 2-column, comma-separated CSV file with a header. Column Description sample Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. gff_file_path Full path to GFF file for assembly or genome. File must have .gff or .gff3 file extension. These files can be the ones generated by Prokka or Bakta in ARETE's annotation subworkflow.","title":"Phylogenomics and Pangenomics only samplesheet"},{"location":"usage/#reference-genome","text":"For full workflow or assembly, users may provide a path to a reference genome in fasta format for use in assembly evaluation. --reference_genome ref.fasta","title":"Reference Genome"},{"location":"usage/#running-the-pipeline","text":"The typical command for running the pipeline is as follows: nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --reference_genome ref.fasta --poppunk_model bgmm -profile docker This will launch the pipeline with the docker configuration profile. See below for more information about profiles. 
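If the command line grows unwieldy, the same parameters can be supplied through Nextflow's core -params-file option instead of individual flags. A minimal sketch (params.yaml is a file name of your choosing, and its keys simply mirror the pipeline parameter names listed on the Parameters page): nextflow run beiko-lab/ARETE -params-file params.yaml -profile docker where params.yaml contains, for example: input_sample_table: samplesheet.csv reference_genome: ref.fasta poppunk_model: bgmm 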
Note that the pipeline will create the following files in your working directory: work # Directory containing the nextflow working files results # Finished results (configurable, see below) .nextflow_log # Log file from Nextflow # Other nextflow hidden files, eg. history of pipeline runs and old logs. As written above, the pipeline also allows users to execute only assembly or only annotation.","title":"Running the pipeline"},{"location":"usage/#assembly-entry","text":"To execute assembly (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker","title":"Assembly Entry"},{"location":"usage/#assembly-qc-entry","text":"To execute QC on pre-existing assemblies (reference genome optional): nextflow run beiko-lab/ARETE -entry assembly_qc --input_sample_table samplesheet.csv --reference_genome ref.fasta -profile docker","title":"Assembly QC Entry"},{"location":"usage/#annotation-entry","text":"To execute annotation of pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry annotation --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker","title":"Annotation Entry"},{"location":"usage/#poppunk-entry","text":"To execute PopPUNK clustering of pre-existing assemblies (PopPUNK model can be either bgmm, dbscan, refine, threshold or lineage): nextflow run beiko-lab/ARETE -entry poppunk --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker","title":"PopPUNK Entry"},{"location":"usage/#phylogenomics-and-pangenomics-entry","text":"To execute phylogenomic and pangenomics analysis on pre-existing assemblies: nextflow run beiko-lab/ARETE -entry phylogenomics --input_sample_table samplesheet.csv -profile docker","title":"Phylogenomics and Pangenomics Entry"},{"location":"usage/#updating-the-pipeline","text":"When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: nextflow pull beiko-lab/ARETE","title":"Updating the pipeline"},{"location":"usage/#reproducibility","text":"It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since. First, go to the ARETE releases page and find the latest version number - numeric only (eg. 1.3.1 ). Then specify this when running the pipeline with -r (one hyphen) - eg. -r 1.3.1 . This version number will be logged in reports when you run the pipeline, so that you'll know what you used when you look back in the future.","title":"Reproducibility"},{"location":"usage/#core-nextflow-arguments","text":"NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).","title":"Core Nextflow arguments"},{"location":"usage/#-profile","text":"Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. 
Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud) - see below. We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility. The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation . Note that multiple profiles can be loaded, for example: -profile test,docker - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles. If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH . This is not recommended. docker A generic configuration profile to be used with Docker singularity A generic configuration profile to be used with Singularity podman A generic configuration profile to be used with Podman shifter A generic configuration profile to be used with Shifter charliecloud A generic configuration profile to be used with Charliecloud conda Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter or Charliecloud. A generic configuration profile to be used with Conda Pulls most software from Bioconda test A profile with a complete configuration for automated testing Can run on personal computers with at least 6GB of RAM and 2 CPUs Includes links to test data so needs no other parameters","title":"-profile"},{"location":"usage/#-resume","text":"Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. You can also supply a run name to resume a specific run: -resume [run-name] . Use the nextflow log command to show previous run names.","title":"-resume"},{"location":"usage/#-c","text":"Specify the path to a specific config file (this is a core Nextflow command). See the nf-core website documentation for more information.","title":"-c"},{"location":"usage/#custom-resource-requests","text":"Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of 143 (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline is stopped. Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process UNICYCLER 32GB of memory, you could use the following config: process { withName: UNICYCLER { memory = 32.GB } } To find the exact name of a process you wish to modify the compute resources, check the live-status of a nextflow run displayed on your terminal or check the nextflow error for a line like so: Error executing process > 'bwa' . In this case the name to specify in the custom config file is bwa . 
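Having saved the snippet above to a file (say, custom.config - the name is your choice), pass it to the run with the core Nextflow -c option described earlier: nextflow run beiko-lab/ARETE --input_sample_table samplesheet.csv --poppunk_model bgmm -profile docker -c custom.config 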
See the main Nextflow documentation for more information.","title":"Custom resource requests"},{"location":"usage/#running-in-the-background","text":"Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished. The Nextflow -bg flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file. Alternatively, you can use screen / tmux or a similar tool to create a detached session which you can log back into at a later time. Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).","title":"Running in the background"},{"location":"usage/#nextflow-memory-requirements","text":"In some cases, the Nextflow Java virtual machines can start to request a large amount of memory. We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile ): NXF_OPTS='-Xms1g -Xmx4g'","title":"Nextflow memory requirements"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 1cb7c0a6..5d1c9386 100644 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ