Revising trycycler and select assembly implementations #61

Open: 35 commits to be merged into main


fredjaya commented Nov 4, 2024

Tested and working on all Vibrio and Tenacibaculum barcodes. Test outputs are currently in /scratch/er01/fj9712/2411_wholetest and will be moved.

Things to discuss and address either in this PR or later ones:

  1. How should the results directory be structured? What files should be output here? (e.g. Medaka output for consensus chromosomes is split into 2 subdirectories #28)
  2. Should the select assembly stats be reported?
  3. What should go in the MultiQC report?
  4. Does it work on the cat and dog data? Yes, except for barcode24 (see /scratch/tj48/fj9712/02_work/2411_catdogs).

Implementation

Reference-free chromosome assembly selection

Addresses #54, #23

For chromosomal assembly, every barcode is assembled with both flye and unicycler, and each assembly is polished. The single "best" polished assembly out of flye, unicycler, and optionally trycycler (the consensus assembly) is then selected for downstream annotation and analyses.

To avoid biasing selection towards published references, the assembly with the most complete BUSCOs is considered the best. This also allows unicycler assemblies to be considered. QUAST is still run, but its output is not used for selecting the best assembly.

There is now a single implementation for chromosomal assembly selection instead of two independent ones, which makes it easier to update the selection criteria, for example to incorporate QUAST outputs or add tools like Merqury.
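
The selection rule described above can be sketched in shell as follows. This is a minimal illustration, not the pipeline's actual select_assembly process: it assumes the number of complete BUSCOs per assembly has already been extracted into a hypothetical two-column, tab-separated file (assembler name, complete BUSCO count), and the function name `pick_best_assembly` is made up for this example. Ties go to the first row encountered, which is an arbitrary choice.

```shell
# Hypothetical helper: print the assembler with the most complete BUSCOs.
# Input: TSV with columns <assembler_name> <complete_busco_count> (assumed format).
pick_best_assembly() {
    awk -F'\t' '
        # Keep the first row that has the strictly highest count.
        NR == 1 || $2 + 0 > best { best = $2 + 0; name = $1 }
        END { if (NR > 0) print name }
    ' "$1"
}
```

Redirecting the result to a file, e.g. `pick_best_assembly counts.tsv > best.txt` (both file names are placeholders), keeps the choice in a text file rather than on stdout. Keeping the rule in one place means a different criterion only needs a change to the comparison.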

Trycycler implementation

Addresses #43, #60

Trycycler processes are now self-contained. Additional assemblers can be added more easily to improve the consensus assemblies if required.

Added more error handling for the too-few-contigs case (trycycler cluster filters out additional contigs). If trycycler fails as expected at any point, the pipeline still continues and selects either the flye or unicycler assembly for downstream processes.
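
The too-few-contigs guard can be illustrated with a small shell sketch. It assumes plain (uncompressed) FASTA input containing all candidate contigs, and it uses the 5000 nt minimum contig length and the two-contig minimum mentioned in the commit notes of this PR. The function names are hypothetical; the pipeline itself handles this case with Nextflow channel logic and error strategies rather than with this script.

```shell
# Count contigs in a FASTA file that are at least min_len nt long.
# $1 FASTA path, $2 minimum length (defaults to 5000)
count_long_contigs() {
    awk -v min="${2:-5000}" '
        # On each header, tally the previous record if it was long enough.
        /^>/ { if (seen && len >= min) n++; seen = 1; len = 0; next }
        # Sequence lines: strip whitespace and accumulate the length.
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { if (seen && len >= min) n++; print n + 0 }
    ' "$1"
}

# Succeed (exit status 0) only when at least two contigs would survive filtering.
has_enough_contigs() {
    [ "$(count_long_contigs "$1" "${2:-5000}")" -ge 2 ]
}
```

A caller could then branch on `has_enough_contigs assemblies.fasta` (placeholder file name) and skip the consensus step when it fails, falling back to the flye or unicycler assembly.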

Input/output process definitions are more explicit (i.e. specific files instead of globs) for better error handling. As a result, there are many more channel operators and more Groovy in the workflow scope.

Modularising assembly steps for incoming revision of select_assembly.

Current implementation branches channels according to the trycycler_cluster output. Move counting of contigs per assembly prior to trycycler so all trycycler (sub)processes can be run directly one after the other.
Refactoring iteratively so everything doesn't break
TIL the clustering step filters out small contigs (default < 5000 nt),
so the < 2 contigs error can occur again in trycycler_classify.
Consider flye and unicycler assemblies as "de novo". This impacts which
data are grouped together for processing and channelling.

Previously, medaka was hardcoded to polish only trycycler and flye
assemblies. This commit adds an assembly-agnostic module for medaka
polishing (WIP).

Introduce val(assembler_name) for tagging, reporting etc.
Rename channels and comments to clarify differences between de novo vs. consensus assemblies.

combined assemblies should be both de novo and consensus (all).
Add more flexible quast module, some temp medaka changes to keep old
implementation running for now
To better align with best practices and readability. Be more explicit
with error strategy and process script def. This required some
additional groovy in workflow{} though
- Remove manual file moving or bash conditionals in process script def
- Remove if/else channel operators/groovy
- Output dir no longer has barcode, might re-add later, but might be ok
because it's output with the barcode tuple
Change process outputs to recurse through barcode and cluster
directories (e.g. **/out_file)
Mainly tidying medaka denovo and consensus implementations to look for
the polished assembly in the process outputs.
Fixes inconsistent publishing for assemblies and qc results
Diffs I have intentionally kept separate - a lot of things add temporary
tweaks to get this current version running during development.
Also fix `trycycler_reconcile_new` inconsistent `2_all_seqs.fasta`
process output
For trycycler and flye-specific downstream processes, modules, and
config. Commenting out existing "chromosome" implementations and will re-add progressively.
`select_assembly_new` didn't cache properly as it was outputting to
`stdout` - best assembly now stored in a text file.

Update bakta and amrfinderplus processes for chromosome annotation to
handle new metadata, reduce `mkdir` and file movement within script, and
decouple output definitions from hardcoded paths etc.

No longer need `helper.patch`
Clarify map syntax, module tags, publishDir handling
fredjaya marked this pull request as ready for review November 4, 2024 00:18

fredjaya commented Nov 4, 2024

This is what the current results/ folder looks like for a single barcode:

results
├── annotations
│   ├── barcode01
│   │   ├── abricate
│   │   │   └── barcode01_consensus_chr.txt
│   │   ├── amrfinderplus
│   │   │   └── barcode01_consensus_chr.tsv
│   │   ├── bakta
│   │   │   ├── barcode01_consensus_chr.faa
│   │   │   └── barcode01_consensus_chr.txt
│   │   └── plasmids
│   │       └── barcode01_bakta
├── assemblies
│   ├── barcode01_consensus
│   │   └── consensus.fasta
│   ├── barcode01_flye
│   │   └── consensus.fasta
│   ├── barcode01_plassembler
│   │   ├── flye_output
│   │   ├── logs
│   │   ├── plassembler_1730446699.3410256.log
│   │   ├── plassembler_plasmids.fasta
│   │   ├── plassembler_plasmids.gfa
│   │   ├── plassembler_summary.tsv
│   │   └── unicycler_output
│   ├── barcode01_unicycler
│   │   └── consensus.fasta
├── quality_control
│   ├── barcode01
│   │   ├── barcode01_consensus
│   │   ├── barcode01_consensus_busco
│   │   ├── barcode01_consensus.tsv
│   │   ├── barcode01_flye
│   │   ├── barcode01_flye_busco
│   │   ├── barcode01_flye.tsv
│   │   ├── barcode01_unicycler
│   │   ├── barcode01_unicycler_busco
│   │   └── barcode01_unicycler.tsv
│   ├── barcode01_kraken2
│   │   └── barcode01.k2report
├── report
│   ├── barcode01_consensus
│   ├── barcode01_flye
│   ├── barcode01_unicycler
├── run_info
│   ├── dag.svg
│   ├── gadi-nf-core-trace-*.txt
│   ├── report.html
│   └── timeline.html
├── taxonomy
│   ├── abricate_vfdb_output.txt
│   ├── amrfinderplus_output.txt
│   ├── barcode_species_table_mqc.txt
│   ├── combined_plot_mqc.png
│   └── phylogeny
└── tree

350 directories, 234 files

Suggestions:

  • publish only the selected chromosomal assembly, remove assembler from name
  • ...
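
One way the first suggestion could look is sketched below in shell. The `assemblies/<barcode>_<assembler>/consensus.fasta` layout is taken from the tree above; the function name, the `<barcode>_chr.fasta` output name, and reading the chosen assembler from a one-line text file are assumptions made for illustration. In the pipeline this would more likely be handled through publishDir settings than a standalone script.

```shell
# Copy only the selected assembly to an output directory, dropping the
# assembler from the file name (hypothetical naming scheme).
# $1 barcode, $2 text file naming the selected assembler,
# $3 assemblies directory, $4 output directory
publish_selected_assembly() {
    barcode="$1"
    # First line of the text file, with any whitespace removed.
    best="$(head -n 1 "$2" | tr -d '[:space:]')"
    src="$3/${barcode}_${best}/consensus.fasta"
    if [ ! -f "$src" ]; then
        echo "missing assembly: $src" >&2
        return 1
    fi
    mkdir -p "$4" && cp "$src" "$4/${barcode}_chr.fasta"
}
```

For example, `publish_selected_assembly barcode01 best.txt results/assemblies results/chromosome` would leave a single `barcode01_chr.fasta` in the (hypothetical) `results/chromosome` directory.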

georgiesamaha commented

All works lovely until run_orthofinder:

Run script:

#!/bin/bash

#PBS -P er01
#PBS -l walltime=10:00:00
#PBS -l ncpus=1
#PBS -l mem=5GB
#PBS -W umask=022
#PBS -q copyq
#PBS -l wd
#PBS -l storage=scratch/tj48
#PBS -l jobfs=100GB

## RUN FROM PROJECT DIRECTORY WITH: bash test/run_test.sh

# Load version of nextflow with plug-in functionality enabled 
module load nextflow/24.04.1 
module load singularity 

# Define inputs 
samplesheet=/scratch/tj48/gs5517/ONT-bacpac-nf/samplesheet.csv 
k2db=/scratch/tj48/databases/kraken2_db/ 
sequencing_summary=/scratch/tj48/fj9712/00_raw/sequencing_summary.txt 
gadi_account=er01 #e.g. aa00
gadi_storage=scratch/tj48+scratch/er01 

# Unhash this command to run pipeline with samplesheet
nextflow run main.nf \
	--samplesheet ${samplesheet} \
	--kraken2_db ${k2db} \
	--sequencing_summary ${sequencing_summary} \
	--gadi_account ${gadi_account} \
	--gadi_storage ${gadi_storage} \
	-resume -profile gadi #you can remove ,high_accuracy if you want to run fast basecalling samples

Error message:

ERROR ~ Error executing process > 'run_orthofinder (GENERATE PHYLOGENY)'

Caused by:
  Process `run_orthofinder (GENERATE PHYLOGENY)` terminated with an error exit status (1)


Command executed:

  # Description: Generate a phylogeny tree with orthofinder tool 
  
  # Using mafft and fastree
   orthofinder \
        -f phylogeny \
        -o phylogeny_tree \
        -n tree \
        -t 16 \
        -a 16

Command exit status:
  1

Command output:
  
  OrthoFinder version 2.5.5 Copyright (C) 2014 David Emms
  
  2024-11-11 22:29:50 : Starting OrthoFinder 2.5.5
  16 thread(s) for highly parallel tasks (BLAST searches etc.)
  16 thread(s) for OrthoFinder algorithm
  
  Checking required programs are installed
  ----------------------------------------
  Test can run "mcl -h" - ok
  Test can run "fastme -i phylogeny_tree/Results_tree/WorkingDirectory/dependencies/SimpleTest.phy -o phylogeny_tree/Results_tree/WorkingDirectory/dependencies/SimpleTest.tre" - ok
  
  WARNING: Files have been ignored as they don't appear to be FASTA files:
  Escherichia_coli_REF_GCF_000005845.2_ASM584v2.fna
  OrthoFinder expects FASTA files to have one of the following extensions: fas, fasta, pep, fa, faa
  ERROR: At least two species are required
  ERROR: An error occurred, ***please review the error messages*** they may contain useful information about the problem.

Command error:
  /usr/local/bin/scripts_of/tree.py:367: SyntaxWarning: invalid escape sequence '\-'
    """
  /usr/local/bin/scripts_of/tree.py:1422: SyntaxWarning: invalid escape sequence '\-'
    """
  /usr/local/bin/scripts_of/newick.py:54: SyntaxWarning: invalid escape sequence '\['
    _ILEGAL_NEWICK_CHARS = ":;(),\[\]\t\n\r="
  /usr/local/bin/scripts_of/newick.py:57: SyntaxWarning: invalid escape sequence '\['
    _NHX_RE = "\[&&NHX:[^\]]*\]"
  /usr/local/bin/scripts_of/newick.py:58: SyntaxWarning: invalid escape sequence '\d'
    _FLOAT_RE = "[+-]?\d+\.?\d*(?:[eE][-+]\d+)?"
  /usr/local/bin/scripts_of/newick.py:60: SyntaxWarning: invalid escape sequence '\['
    _NAME_RE = "[^():,;\[\]]+"
  /usr/local/bin/scripts_of/newick.py:337: SyntaxWarning: invalid escape sequence '\s'
    MATCH = '%s\s*%s\s*(%s)?' % (FIRST_MATCH, SECOND_MATCH, _NHX_RE)
  /usr/local/bin/scripts_of/probroot.py:10: SyntaxWarning: invalid escape sequence '\i'
    """
  /usr/local/bin/scripts_of/probroot.py:201: SyntaxWarning: invalid escape sequence '\l'
    """
  /usr/local/bin/scripts_of/probroot.py:267: SyntaxWarning: invalid escape sequence '\l'
    """

Work dir:
  /scratch/tj48/gs5517/ONT-bacpac-nf/work/ee/062a5772a8ab3e36b326533776093a

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details
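
The log above reports two things: a `.fna` file was ignored because of its extension, and fewer than two usable inputs remained. A hedged shell sketch of a pre-flight check that mirrors those messages is shown below. The accepted extensions (fas, fasta, pep, fa, faa) are copied from the log; the function names and the idea of running this on the `phylogeny` directory before OrthoFinder are assumptions, and the check does not explain why the expected inputs were absent.

```shell
# Count files in a directory whose extension the OrthoFinder log lists as accepted.
# Files with any other extension are reported on stderr.
count_orthofinder_inputs() {
    n=0
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        case "${f##*.}" in
            fas|fasta|pep|fa|faa) n=$((n + 1)) ;;
            *) echo "would be ignored: $f" >&2 ;;
        esac
    done
    echo "$n"
}

# Succeed only if at least two usable input files are present.
check_orthofinder_inputs() {
    [ "$(count_orthofinder_inputs "$1")" -ge 2 ]
}
```

Running `check_orthofinder_inputs phylogeny` in the process work dir would show which staged files OrthoFinder is likely to skip and whether enough remain for it to start.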
