Insights needed for comparison with other assemblers #46
Comments
Happy to see that PenguiN is being considered as your potential assembly tool. From the command calls you used, I guess the final contigs from PenguiN's default parameters are too short, so increasing the

Regarding 1) On the other hand, it must also be said that PenguiN's approach carries the risk of producing redundant contigs. To overcome the issue of dead ends in low-coverage regions during the greedy iterative assembly strategy, PenguiN (and Plass) re-uses reads. More precisely, different contigs can be extended with the same read, so in principle the same genomic region can be built multiple times in parallel. We introduced a few ideas to minimize this effect; however, it cannot be prevented completely. This is why we integrated the Linclust algorithm [Steinegger and Söding, 2018] in PenguiN as the last step and only output the cluster representatives as final contigs.

However, Linclust's speed comes at the expense of some loss in sensitivity. In cases where redundancy is problematic, I suggest using a more sensitive all-against-all clustering in a post-processing step after the assembly. In our paper's benchmarks, for example, we used an additional clustering step with the nucleotide clustering workflow of the MMseqs2 software suite.

Regarding 2)
Hi there,
Congratulations on your tool! I'm really excited about PenguiN, as it could be an interesting alternative to explore. I've set out to compare it with our group's gold standard for environmental metagenomics, metaSPAdes, and am getting some really interesting data. Maybe you could help me interpret it, so we can decide whether we should consider switching to PenguiN?
Here's how everything has been run so far on an example environmental metagenome:
metaSPAdes:
PenguiN:
PenguiN_wmods:
Using bowtie2, I mapped each assembly to its reads as below, after filtering each assembly to contain only scaffolds/contigs >1000 bp:
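The exact commands are not shown in this thread, so as a rough illustration only, the >1000 bp length filter could look like the sketch below. The function names `read_fasta` and `filter_by_length` are made up for this example, and the cutoff is assumed to be strict (a contig of exactly 1000 bp is dropped).

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_len: int = 1000
) -> Dict[str, str]:
    """Keep only records strictly longer than min_len bases (assumed cutoff)."""
    return {header: seq for header, seq in records if len(seq) > min_len}
```

With a file on disk this would be called as `filter_by_length(read_fasta(open(path)))`, where `path` points to the assembly FASTA.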
Looking at the log files here's what I see:
Is this something you see a lot in your experience? In principle, I'd say that higher percentages of 'aligned concordantly >1 times' should be indicative of multimapping and thus not a good sign?
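To compare that percentage across the three assemblies, the concordant-alignment lines of each bowtie2 summary can be pulled out with a small parser. This is only a sketch: it assumes the usual paired-end summary that bowtie2 prints to stderr, the helper name `concordant_fractions` is hypothetical, and the sample log uses illustrative numbers, not the values from this thread.

```python
import re
from typing import Dict

# Matches lines such as "    527 (5.27%) aligned concordantly >1 times".
_CONCORDANT = re.compile(
    r"^\s*(\d+) \(([\d.]+)%\) aligned concordantly "
    r"(0 times|exactly 1 time|>1 times)\s*$"
)


def concordant_fractions(log_text: str) -> Dict[str, float]:
    """Map each concordant-alignment category to its reported percentage."""
    fractions = {}
    for line in log_text.splitlines():
        match = _CONCORDANT.match(line)
        if match:
            fractions[match.group(3)] = float(match.group(2))
    return fractions


# Illustrative summary only; these numbers are NOT from the runs above.
SAMPLE_LOG = """10000 reads; of these:
  10000 (100.00%) were paired; of these:
    650 (6.50%) aligned concordantly 0 times
    8823 (88.23%) aligned concordantly exactly 1 time
    527 (5.27%) aligned concordantly >1 times
96.70% overall alignment rate
"""

if __name__ == "__main__":
    print(concordant_fractions(SAMPLE_LOG))
```

A larger ">1 times" share would be consistent with redundant contigs in the assembly, although genuine repeats and closely related strains can contribute as well.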
Here's a quick plot of median/mean contig lengths (scaffolds for metaSPAdes), with standard deviations as vertical lines from each point:
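For reference, the summary values behind such a plot can be computed as in the sketch below. It assumes the sample standard deviation (n - 1 in the denominator); the plot itself may have used a different convention, and `length_summary` is a made-up helper name.

```python
from statistics import mean, median, stdev
from typing import Dict, Sequence


def length_summary(lengths: Sequence[int]) -> Dict[str, float]:
    """Return count, mean, median and sample standard deviation of lengths."""
    if len(lengths) < 2:
        raise ValueError("need at least two contig lengths for a standard deviation")
    return {
        "n": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "stdev": stdev(lengths),  # sample standard deviation (assumption)
    }
```

Contig length distributions tend to be strongly right-skewed, so the mean and median can differ a lot, and the standard deviation can exceed the mean.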
This makes sense when looking at length frequency distributions for each assembly:
Do you contemplate adding a scaffolding module to PenguiN? I wonder how these values could change with that!
I think there's a lot of potential in PenguiN - I'm still reading up on it, but will take any insights you're willing to offer as you look at this data! I can also share rps3 taxonomic profiles I've run on each assembly if you'd want.
Thanks in advance for the attention!