Revising trycycler and select assembly implementations #61

fredjaya · 2024-11-04T00:15:56Z

Tested and works on all Vibrio and Tenacibaculum barcodes. Currently on /scratch/er01/fj9712/2411_wholetest - to be moved.

Things to discuss and address either in this PR or later ones:

- How should the results directory be structured? What files should be output here? (e.g. Medaka output for consensus chromosomes is split into 2 subdirectories #28)
- Should the select assembly stats be reported?
- What should go in the MultiQC report?
- ~~Does it work on the cat and dog data? (Fred currently testing)~~ Yes, except for barcode24 - /scratch/tj48/fj9712/02_work/2411_catdogs

Implementation

Reference-free chromosome assembly selection

Addresses #54, #23

For the chromosomal assembly, every barcode is assembled by flye and unicycler, and each assembly is polished. The single "best" polished assembly out of flye, unicycler, and optionally trycycler (consensus assembly) is selected for downstream annotation and analyses.

To avoid biasing assemblies towards published references, the assembly with the most complete BUSCOs is considered the best one. This now allows unicycler assemblies to be considered too. QUAST is also run, but its output is not used for selecting the best assembly.

There is now only one implementation for chromosomal assembly selection, instead of two independent ones, which makes it easier to update the criteria for selecting an assembly, for example to incorporate QUAST outputs or to add tools like Merqury.
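The selection rule described above (most complete BUSCOs wins) can be sketched in a few lines of shell. This is only an illustration, not the pipeline's `select_assembly` module: the function name `select_best_assembly` and the two-column input table (assembly name, complete BUSCO count) are assumptions made for the example, and the table is not BUSCO's native output format.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the assembly with the most complete BUSCOs.
# Input (assumed format, NOT BUSCO's native output): a tab-separated file with
#   <assembly_name><TAB><complete_busco_count>
# one line per candidate assembly and no header line.
# Ties keep whichever assembly is listed first.
select_best_assembly() {
    local table="$1"
    awk -F'\t' '
        # Take the first valid row unconditionally, then only strictly better rows.
        NF >= 2 && (name == "" || $2 + 0 > best) { best = $2 + 0; name = $1 }
        END { if (name != "") print name }
    ' "$table"
}
```

Because a failed trycycler run would simply leave the consensus row out of the table, the same function covers both the two-candidate and three-candidate cases. Redirecting the result to a file, e.g. `select_best_assembly busco_counts.tsv > best_assembly.txt` (both file names hypothetical), gives a concrete output file rather than a value on `stdout`.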

Trycycler implementation

Addresses #43, #60

Trycycler processes are now self-contained. Additional assemblers can be added more easily to generate better consensus assemblies if required.

Added more error handling for the too-few-contigs case (trycycler cluster filters more contigs out). If trycycler correctly fails at any point, the pipeline still continues and selects either the flye or unicycler assembly for downstream processes.

Input/output process definitions are more explicit (i.e. specific files instead of globs) for better error handling. As a result, there are a lot more operators and Groovy in the workflow scope.

Modularising assembly steps for incoming revision of select_assembly. Current implementation branches channels according to the trycycler_cluster output. Move counting of contigs per assembly prior to trycycler so all trycycler (sub)processes can be run directly one after the other.
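Counting contigs before trycycler, as described above, could look roughly like the sketch below. The function names are hypothetical and this is not the code used in the pipeline. The 5000 nt default only mirrors the small-contig threshold mentioned in this PR; it is passed in as a parameter and is not read from trycycler itself.

```shell
#!/usr/bin/env bash
# Count contigs in a FASTA file that are at least min_len nucleotides long.
# min_len defaults to 5000 only to mirror the threshold mentioned in this PR;
# it is an assumption here, not a value queried from trycycler.
count_contigs() {
    local fasta="$1" min_len="${2:-5000}"
    awk -v min="$min_len" '
        # A header line closes the previous record and starts a new one.
        /^>/ { if (seen && len >= min) n++; seen = 1; len = 0; next }
        # Sequence line: strip whitespace and carriage returns, add its length.
             { gsub(/[ \t\r]/, ""); len += length($0) }
        # Close the final record and print the count (0 if none qualified).
        END  { if (seen && len >= min) n++; print n + 0 }
    ' "$fasta"
}

# Hypothetical gate: succeed only if at least two contigs pass the length filter,
# matching the "< 2 contigs" failure mode discussed in this PR.
enough_contigs_for_trycycler() {
    [ "$(count_contigs "$1" "${2:-5000}")" -ge 2 ]
}
```

A workflow could call `enough_contigs_for_trycycler assembly.fasta` (file name hypothetical) to decide up front whether a barcode goes through the trycycler steps or straight to selection between the de novo assemblies.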

Consider flye and unicycler assemblies as "de novo". This impacts which data are grouped together for processing and channelling. Previously, medaka was hardcoded to polish only trycycler and flye assemblies. This commit adds an assembly-agnostic module for medaka polishing (WIP). Introduce val(assembler_name) for tagging, reporting etc.

`select_assembly_new` didn't cache properly as it was outputting to `stdout`; the best assembly is now stored in a text file. Update bakta and amrfinderplus processes for chromosome annotation to handle new metadata, reduce `mkdir` and file movement within the script, and decouple output definitions from hardcoded paths etc. `helper.patch` is no longer needed.

fredjaya · 2024-11-04T07:51:56Z

This is what the current results/ folder looks like for a single barcode:

results
├── annotations
│   ├── barcode01
│   │   ├── abricate
│   │   │   └── barcode01_consensus_chr.txt
│   │   ├── amrfinderplus
│   │   │   └── barcode01_consensus_chr.tsv
│   │   ├── bakta
│   │   │   ├── barcode01_consensus_chr.faa
│   │   │   └── barcode01_consensus_chr.txt
│   │   └── plasmids
│   │       └── barcode01_bakta
├── assemblies
│   ├── barcode01_consensus
│   │   └── consensus.fasta
│   ├── barcode01_flye
│   │   └── consensus.fasta
│   ├── barcode01_plassembler
│   │   ├── flye_output
│   │   ├── logs
│   │   ├── plassembler_1730446699.3410256.log
│   │   ├── plassembler_plasmids.fasta
│   │   ├── plassembler_plasmids.gfa
│   │   ├── plassembler_summary.tsv
│   │   └── unicycler_output
│   ├── barcode01_unicycler
│   │   └── consensus.fasta
├── quality_control
│   ├── barcode01
│   │   ├── barcode01_consensus
│   │   ├── barcode01_consensus_busco
│   │   ├── barcode01_consensus.tsv
│   │   ├── barcode01_flye
│   │   ├── barcode01_flye_busco
│   │   ├── barcode01_flye.tsv
│   │   ├── barcode01_unicycler
│   │   ├── barcode01_unicycler_busco
│   │   └── barcode01_unicycler.tsv
│   ├── barcode01_kraken2
│   │   └── barcode01.k2report
├── report
│   ├── barcode01_consensus
│   ├── barcode01_flye
│   ├── barcode01_unicycler
├── run_info
│   ├── dag.svg
│   ├── gadi-nf-core-trace-*.txt
│   ├── report.html
│   └── timeline.html
├── taxonomy
│   ├── abricate_vfdb_output.txt
│   ├── amrfinderplus_output.txt
│   ├── barcode_species_table_mqc.txt
│   ├── combined_plot_mqc.png
│   └── phylogeny
└── tree

350 directories, 234 files

Suggestions:

publish only the selected chromosomal assembly, remove assembler from name
...

georgiesamaha · 2024-11-11T11:37:05Z

All works lovely until run_orthofinder:

Run script:

#!/bin/bash

#PBS -P er01
#PBS -l walltime=10:00:00
#PBS -l ncpus=1
#PBS -l mem=5GB
#PBS -W umask=022
#PBS -q copyq
#PBS -l wd
#PBS -l storage=scratch/tj48
#PBS -l jobfs=100GB

## RUN FROM PROJECT DIRECTORY WITH: bash test/run_test.sh

# Load version of nextflow with plug-in functionality enabled 
module load nextflow/24.04.1 
module load singularity 

# Define inputs 
samplesheet=/scratch/tj48/gs5517/ONT-bacpac-nf/samplesheet.csv 
k2db=/scratch/tj48/databases/kraken2_db/ 
sequencing_summary=/scratch/tj48/fj9712/00_raw/sequencing_summary.txt 
gadi_account=er01 #e.g. aa00
gadi_storage=scratch/tj48+scratch/er01 

# Unhash this command to run pipeline with samplesheet
nextflow run main.nf \
	--samplesheet ${samplesheet} \
	--kraken2_db ${k2db} \
	--sequencing_summary ${sequencing_summary} \
	--gadi_account ${gadi_account} \
	--gadi_storage ${gadi_storage} \
	-resume -profile gadi #you can remove ,high_accuracy if you want to run fast basecalling samples

Error message:

ERROR ~ Error executing process > 'run_orthofinder (GENERATE PHYLOGENY)'

Caused by:
  Process `run_orthofinder (GENERATE PHYLOGENY)` terminated with an error exit status (1)


Command executed:

  # Description: Generate a phylogeny tree with orthofinder tool 
  
  # Using mafft and fastree
   orthofinder \
        -f phylogeny \
        -o phylogeny_tree \
        -n tree \
        -t 16 \
        -a 16

Command exit status:
  1

Command output:
  
  OrthoFinder version 2.5.5 Copyright (C) 2014 David Emms
  
  2024-11-11 22:29:50 : Starting OrthoFinder 2.5.5
  16 thread(s) for highly parallel tasks (BLAST searches etc.)
  16 thread(s) for OrthoFinder algorithm
  
  Checking required programs are installed
  ----------------------------------------
  Test can run "mcl -h" - ok
  Test can run "fastme -i phylogeny_tree/Results_tree/WorkingDirectory/dependencies/SimpleTest.phy -o phylogeny_tree/Results_tree/WorkingDirectory/dependencies/SimpleTest.tre" - ok
  
  WARNING: Files have been ignored as they don't appear to be FASTA files:
  Escherichia_coli_REF_GCF_000005845.2_ASM584v2.fna
  OrthoFinder expects FASTA files to have one of the following extensions: fas, fasta, pep, fa, faa
  ERROR: At least two species are required
  ERROR: An error occurred, ***please review the error messages*** they may contain useful information about the problem.

Command error:
  /usr/local/bin/scripts_of/tree.py:367: SyntaxWarning: invalid escape sequence '\-'
    """
  /usr/local/bin/scripts_of/tree.py:1422: SyntaxWarning: invalid escape sequence '\-'
    """
  /usr/local/bin/scripts_of/newick.py:54: SyntaxWarning: invalid escape sequence '\['
    _ILEGAL_NEWICK_CHARS = ":;(),\[\]\t\n\r="
  /usr/local/bin/scripts_of/newick.py:57: SyntaxWarning: invalid escape sequence '\['
    _NHX_RE = "\[&&NHX:[^\]]*\]"
  /usr/local/bin/scripts_of/newick.py:58: SyntaxWarning: invalid escape sequence '\d'
    _FLOAT_RE = "[+-]?\d+\.?\d*(?:[eE][-+]\d+)?"
  /usr/local/bin/scripts_of/newick.py:60: SyntaxWarning: invalid escape sequence '\['
    _NAME_RE = "[^():,;\[\]]+"
  /usr/local/bin/scripts_of/newick.py:337: SyntaxWarning: invalid escape sequence '\s'
    MATCH = '%s\s*%s\s*(%s)?' % (FIRST_MATCH, SECOND_MATCH, _NHX_RE)
  /usr/local/bin/scripts_of/probroot.py:10: SyntaxWarning: invalid escape sequence '\i'
    """
  /usr/local/bin/scripts_of/probroot.py:201: SyntaxWarning: invalid escape sequence '\l'
    """
  /usr/local/bin/scripts_of/probroot.py:267: SyntaxWarning: invalid escape sequence '\l'
    """

Work dir:
  /scratch/tj48/gs5517/ONT-bacpac-nf/work/ee/062a5772a8ab3e36b326533776093a

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
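
The log above shows one `.fna` file being ignored, followed by "At least two species are required", which suggests that fewer than two files with an accepted extension were staged into the `phylogeny` input directory. A quick check along these lines could be run inside the work dir before digging further. This is a diagnostic sketch only: the function names are hypothetical, and the extension list is copied from the OrthoFinder message in the log rather than from its documentation.

```shell
#!/usr/bin/env bash
# Count files in a directory whose extension OrthoFinder accepts,
# using the list printed in the log above: fas, fasta, pep, fa, faa.
count_orthofinder_inputs() {
    local dir="$1" n=0 f base
    for f in "$dir"/*; do
        # Skip directories and unmatched globs; symlinks to files are followed.
        [ -f "$f" ] || continue
        base="${f##*/}"
        case "${base##*.}" in
            fas|fasta|pep|fa|faa) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}

# Succeed only if at least two usable input files are present.
has_enough_species() {
    [ "$(count_orthofinder_inputs "$1")" -ge 2 ]
}
```

For example, `count_orthofinder_inputs phylogeny` run from the work dir listed above would show how many inputs OrthoFinder could actually see. Note that this checks file names only; it says nothing about whether a file holds protein or nucleotide sequences.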

fredjaya added 30 commits October 3, 2024 13:54

MAINT: temp fix for trycycler_cluster input and ch

976a6ad

Refactoring iteratively so everything doesn't break

STY: groovy and comment formatting

efbe12a

MAINT: Re-add contig count after trycycler_cluster

ccc9f97

TIL the clustering step filters out small contigs (default < 5000 nt) and will run into < 2 contigs error in trycycler_classify again.

STY: rename "combined_assembly" to "denovo"

2d9ee91

Rename channels and comments to clarify differences between de novo vs. consensus assemblies. combined assemblies should be both de novo and consensus (all).

WIP: quast updates

cf58c6c

Add more flexible quast module, some temp medaka changes to keep old implementation running for now

DEV: assembly qc publishdir per barcode

4e4732e

ENH: add busco for assembly qc

3942568

DEV: Save progress for new trycycler partition

a88288a

MAINT: Improve trycycler_cluster and refactor outs

2ab75c9

To better align with best practices and readability. Be more explicit with error strategy and process script def. This required some additional groovy in workflow{} though

MAINT: Simplify reconcile

f5167cc

- Remove manual file moving or bash conditionals in process script def
- Remove if/else channel operators/groovy
- Output dir no longer has barcode, might re-add later, but might be ok because it's output with the barcode tuple

MAINT: Tidying reconcile

71b58a8

MAINT: Add tidied msa implementation

320ee92

MAINT: new trycycler partition WIP

f941394

MAINT: Fix partition and reconcile implementations

033be77

Change process outputs to recurse through barcode and cluster directories (e.g. **/out_file)

MAINT: add trycycler_consensus_new

f579462

MAINT: Update consensus and denovo polishing

6c985e5

MAINT: forgot module

99ad57a

DEV: Now outputs QC for all assemblies!

029a169

Mainly tidying medaka denovo and consensus implementations to look for the polished assembly in the process outputs.

DEV: Concat per-cluster consensus fastas + tidying

65a879c

MAINT: No longer output medaka subfolders

3a98a58

Fixes inconsistent publishing for assemblies and qc results

DEV: Add temp fix patch file

7b2c3a0

Diffs I have intentionally kept separate - a lot of things add temporary tweaks to get this current version running during development.

BUG: Fix concat due to file name collision

094789e

WIP: Add new select assembly based on busco

03bce1e

Some comment tidying

DEV: Update patch

17b8245

ENH: Selects assembly according to buscos

62f1fae

Also fix `trycycler_reconcile_new` inconsistent `2_all_seqs.fasta` process output

WIP/STY: Delete old implementations

29642b6

For trycycler and flye-specific downstream processes, modules, and config. Commenting out existing "chromosome" implementations and will re-add progressively.

MAINT: Tidy chrom. annotation modules and formats

39d69aa

Clarify map syntax, module tags, publishDir handling

fredjaya added 4 commits October 31, 2024 17:06

DEV: Replace trycycler and medaka impl.

a093eb9

Modules suffixed with `*_new` replace existing modules

MAINT: select assembly and multiqc fixes

13583fa

It works!

9a76a77

Merge branch 'main' into issue-54

2e9e869

fredjaya marked this pull request as ready for review November 4, 2024 00:18

georgiesamaha self-requested a review November 11, 2024 09:04

Change publish mode for results to copy

85bb83a
