From 42fee8e6f38368674b915652f3e37df9ca89b0f6 Mon Sep 17 00:00:00 2001
From: Ksenia
Date: Thu, 5 Sep 2024 15:27:52 +0100
Subject: [PATCH] Update output.md

---
 docs/output.md | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/docs/output.md b/docs/output.md
index 5a92b0b..1da8d5c 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -4,7 +4,14 @@
 
 This document describes the output produced by the genomeassembly pipeline.
 
-The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
+The standard assembly pipeline consists of running hifiasm on the HiFi reads, purging the primary contigs with purge_dups, and scaffolding them up with YaHS.
+Optionally, if Illumina 10X data is provided, the purged contigs and haplotigs can be polished.
+
+For a diploid genome, when the HiFi and HiC data come from the same individual, hifiasm can additionally be run in HiC mode to produce a phased assembly. In that case the produced haplotypes are not purged but scaffolded up directly with YaHS.
+
+Optionally, organelle assembly can be triggered. The mitochondrion and (if relevant) plastid sequences are produced using MitoHiFi and OATK.
+
+The directories listed below will be created in the --outdir directory after the pipeline has finished. All paths are relative to the top-level --outdir directory.
 
 ## Subworkflows
@@ -43,13 +50,16 @@ This subworkflow generates a KMER database and coverage model used in [PURGE_DUP
     - primary assembly in GFA and FASTA format; for more details refer to [hifiasm output](https://hifiasm.readthedocs.io/en/latest/interpreting-output.html)
  - .\*hifiasm.\*/.*a_ctg.[g]fa
     - haplotigs in GFA and FASTA format; for more details refer to [hifiasm output](https://hifiasm.readthedocs.io/en/latest/interpreting-output.html)
+ - .\*hifiasm-hic.\*/.*hap1.p_ctg.[g]fa
+   - fully phased hap1 if hifiasm is run in HiC mode; for more details refer to [hifiasm output](https://hifiasm.readthedocs.io/en/latest/interpreting-output.html)
+ - .\*hifiasm-hic.\*/.*hap2.p_ctg.[g]fa
+   - fully phased hap2 if hifiasm is run in HiC mode; for more details refer to [hifiasm output](https://hifiasm.readthedocs.io/en/latest/interpreting-output.html)
  - .\*hifiasm.\*/.*bin
     - internal binary hifiasm files; for more details refer [here](https://hifiasm.readthedocs.io/en/latest/faq.html#id12)
 
 This subworkflow generates a raw assembly(-ies). First, hifiasm is run on the input HiFi reads then raw contigs are converted from GFA into FASTA format, this assembly is due to purging, polishing (optional) and scaffolding further down the pipeline.
-In case hifiasm HiC mode is switched on, it is performed as an extra step with results stored in hifiasm-hic folder.

 ![Raw assembly subworkflow](images/v1/raw_assembly.png)
@@ -68,6 +78,7 @@ In case hifiasm HiC mode is switched on, it is performed as an extra step with r
 Retained haplotype is identified in primary assembly. The alternate contigs are updated correspondingly.
 The subworkflow relies on kmer coverage model to identify coverage thresholds. For more details see [purge_dups](https://github.com/dfguan/purge_dups)
+The two haplotype assemblies produced by hifiasm in HiC mode are not purged.

@@ -98,9 +109,9 @@ This subworkflow uses read mapping of the Illumina 10X short read data to fix sh
 Output files
- - \*.hifiasm..\*/scaffolding/.*_merged_sorted.bed
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/.*_merged_sorted.bed
     - bed file obtained from merged mkdup bam
- - \*.hifiasm..\*/scaffolding/.*mkdup.bam
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/.*mkdup.bam
     - final read mapping bam with mapped reads
@@ -113,11 +124,11 @@ This subworkflow implements alignment of the Illumina HiC short reads to the pri
 Output files
- - \*.hifiasm..\*/scaffolding/.*.stats
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/.*.stats
     - output of samtools stats
- - \*.hifiasm..\*/scaffolding/.*.idxstats
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/.*.idxstats
     - output of samtools idxstats
- - \*.hifiasm..\*/scaffolding/.*.flagstat
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/.*.flagstat
     - output of samtools flagstat
@@ -128,17 +139,17 @@ This subworkflow produces statistcs for a bam file containing read mapping. It i
 Output files
- - \*.hifiasm..\*/scaffolding/yahs/out.break.yahs/out_scaffolds_final.fa
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/out_scaffolds_final.fa
     - scaffolds in FASTA format
- - \*.hifiasm..\*/scaffolding/yahs/out.break.yahs/out_scaffolds_final.agp
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/out_scaffolds_final.agp
     - coordinates of contigs relative to scaffolds
- - \*.hifiasm..\*/scaffolding/yahs/out.break.yahs/alignments_sorted.txt
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/alignments_sorted.txt
     - Alignments for Juicer in text format
- - \*.hifiasm..\*/scaffolding/yahs/out.break.yahs/yahs_scaffolds.hic
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/yahs_scaffolds.hic
     - Juicer HiC map
- - \*.hifiasm..\*/scaffolding/yahs/out.break.yahs/*cool
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/*cool
     - HiC map for cooler
- - \*.hifiasm..\*/scaffolding/yahs/out.break.yahs/*.FullMap.png
+ - \*.hifiasm.\*/scaffolding[_hap1/_hap2/^$]/yahs/out.break.yahs/*.FullMap.png
     - Pretext snapshot