Releases: Gaius-Augustus/GALBA
v1.0.11 - deleting transcripts with CDS features on opposite strands
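The release title names the check but not how it is done. As a rough illustration only, the following sketch flags transcripts whose CDS features occur on both strands. It assumes GTF input in which every CDS line carries a `transcript_id "..."` attribute; the function name `mixed_strand_transcripts` is hypothetical and this is not the code used in GALBA.

```python
import re
from collections import defaultdict

# Assumption: CDS lines carry a standard GTF attribute of the form transcript_id "ID";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def mixed_strand_transcripts(gtf_lines):
    """Return IDs of transcripts whose CDS features lie on more than one strand."""
    strands = defaultdict(set)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        # GTF has 9 tab-separated columns; column 3 is the feature, column 7 the strand
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            strands[match.group(1)].add(fields[6])
    return {tid for tid, seen in strands.items() if len(seen) > 1}


# Made-up example records: g1.t1 is consistent, g2.t1 mixes '+' and '-'
example = [
    'chr1\tAUGUSTUS\tCDS\t100\t200\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t300\t400\t.\t+\t2\ttranscript_id "g1.t1"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t900\t950\t.\t+\t0\ttranscript_id "g2.t1"; gene_id "g2";\n',
    'chr1\tAUGUSTUS\tCDS\t980\t999\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";\n',
]
bad = mixed_strand_transcripts(example)
```

Transcripts returned by such a check could then be dropped from the annotation before further processing.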
X-Mas Release 2023: DIAMOND denoises AUGUSTUS predictions
This release was inspired by the manuscript
Newly Sequenced Genomes Reveal Patterns of Gene Family Expansion in select Dragonflies (Odonata: Anisoptera)
by
Ethan R. Tolman, Christopher D. Beatty, Paul B. Frandsen, Jonas Bush, Or R. Bruchim, Ella Simone Driever, Kathleen M. Harding, Dick Jordan, Manpreet K. Kohli, Jiwoo Park, Seojun Park, Kelly Reyes, Mira Rosari, Jisong L. Ryu, Vincent Wade, Jessica L. Ware
https://doi.org/10.1101/2023.12.11.569651
The authors state in the manuscript:
"While our genome annotations initially had a high (>50,000) number of genes compared to the annotation of P. flavescens, by conservatively retaining only genes which had a BLAST hit to a protein sequence from P. flavescens [27], we were able to generate highly complete annotations (fig 1. A,B), further supporting the efficacy of this pipeline in insects."
I adopted the idea and added a new script, filter_gtf_by_diamond_against_ref.py, that does the same thing using DIAMOND. I chose DIAMOND only for its speed; the result should be highly similar to using BLAST.
The script is called from galba.pl. This approach can substantially increase specificity for a marginal tradeoff in sensitivity.
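The filtering idea can be sketched roughly as follows. This is not the content of filter_gtf_by_diamond_against_ref.py; it is a minimal illustration under stated assumptions: the DIAMOND hits are in tabular format (query ID in the first column), the query protein IDs are identical to the transcript IDs in the GTF, and every feature line carries a `transcript_id "..."` attribute. The function names `read_hit_ids` and `filter_gtf` are hypothetical.

```python
import re

# Assumption: feature lines carry a standard GTF attribute transcript_id "ID";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def read_hit_ids(diamond_tsv_lines):
    """Collect query IDs (first column) from tabular DIAMOND/BLAST output lines."""
    return {line.split("\t")[0] for line in diamond_tsv_lines if line.strip()}


def filter_gtf(gtf_lines, keep_ids):
    """Yield comment lines and the feature lines whose transcript_id is in keep_ids."""
    for line in gtf_lines:
        if line.startswith("#"):
            yield line
            continue
        match = TRANSCRIPT_ID.search(line)
        if match and match.group(1) in keep_ids:
            yield line


# Made-up example: only g1.t1 has a hit against the reference proteins
hits = ["g1.t1\tref_protein_0001\t91.3\n"]
gtf = [
    "# example annotation\n",
    'chr1\tAUGUSTUS\tCDS\t100\t200\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t500\t650\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";\n',
]
kept = list(filter_gtf(gtf, read_hit_ids(hits)))
```

In this sketch, predictions without any hit to the reference protein set are simply dropped, which is why specificity can rise sharply while sensitivity changes little.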
Accuracy comparison before and after DIAMOND filter for denoising AUGUSTUS predictions in GALBA:
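| Species | Metric | Before | After |
| --- | --- | --- | --- |
| D. melanogaster | gene_Sn | 71.07 | 71.02 |
| D. melanogaster | gene_Sp | 71.09 | 73.28 |
| D. melanogaster | trans_Sn | 48.45 | 48.42 |
| D. melanogaster | trans_Sp | 63.74 | 65.42 |
| D. melanogaster | cds_Sn | 78.45 | 78.43 |
| D. melanogaster | cds_Sp | 87.54 | 88.90 |
| Mus musculus | gene_Sn | 70.64 | 70.29 |
| Mus musculus | gene_Sp | 38.34 | 66.63 |
| Mus musculus | trans_Sn | 28.70 | 28.55 |
| Mus musculus | trans_Sp | 35.26 | 56.33 |
| Mus musculus | cds_Sn | 77.43 | 77.10 |
| Mus musculus | cds_Sp | 82.34 | 92.23 |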
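For readers unfamiliar with the Sn/Sp columns: the release notes do not define them, so the following assumes the convention common in gene prediction benchmarks, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP), i.e. the share of predictions that are correct. The counts below are invented purely to illustrate the arithmetic and are not GALBA results.

```python
def sensitivity(tp, fn):
    """Percentage of reference features that were predicted correctly."""
    return 100.0 * tp / (tp + fn)


def specificity(tp, fp):
    """Percentage of predicted features that are correct (assumed definition)."""
    return 100.0 * tp / (tp + fp)


# Hypothetical counts: 1000 reference genes.
# Before filtering: 1800 predictions, 700 of them correct.
sn_before = sensitivity(tp=700, fn=300)
sp_before = specificity(tp=700, fp=1100)

# After filtering: 1050 predictions remain, 695 of them correct.
sn_after = sensitivity(tp=695, fn=305)
sp_after = specificity(tp=695, fp=355)

print(f"Sn {sn_before:.2f} -> {sn_after:.2f}, Sp {sp_before:.2f} -> {sp_after:.2f}")
```

Removing many false positives while losing only a few true positives moves Sp a lot and Sn very little, which matches the pattern in the comparison above.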
Acknowledgement
We thank Tolman et al. for describing this very simple but highly effective idea!
Debugged accuracy evaluation, improved training gene selection
- @tomasbruna changed miniprothint to additionally output only the best gene per locus (instead of several); these are now used as training genes in GALBA
- debugged automated accuracy evaluation
Improved training gene selection
- @tomasbruna extended miniprothint to output training genes for GALBA. His implementation is much better than the original implementation in GALBA. GALBA therefore now uses this miniprothint functionality
- @tomasbruna also improved the specificity of hints; it should now be safer to use more distantly related proteins (large-scale accuracy tests still pending)
- galba_cleanup has been ported to python (no change in functionality)
Fixing redundant augargs
Related to this issue: #32 (comment)
I fixed a bug so that augargs are no longer passed twice to pygustus when running AUGUSTUS in ab initio mode
Alternative transcript prediction restored
The key difference from the previous release is a bugfix that restores prediction of alternative transcripts when evidence for them is present
Running miniprot only once
Better jsonfile protection
- the pygustus json file is now locked while it is being fixed, which makes it safe to run multiple GALBA processes in parallel
- usexisting disappears from instructions in case of error
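How GALBA implements the lock is not described here. As a generic illustration of the technique, the sketch below rewrites a JSON file while holding an exclusive advisory lock via `fcntl.flock`, so that a second process calling the same function waits until the first has finished. It assumes a POSIX system; the function name `update_json_locked` and the file name are hypothetical.

```python
import fcntl
import json
import os
import tempfile


def update_json_locked(path, fix):
    """Apply fix(dict) -> dict to the JSON file at path under an exclusive lock."""
    with open(path, "r+") as handle:
        # Blocks until no other cooperating process holds the lock
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            data = fix(json.load(handle))
            handle.seek(0)
            json.dump(data, handle, indent=2)
            handle.truncate()  # drop leftover bytes if the new content is shorter
            handle.flush()
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


# Usage with a throwaway file (keys are made up for the example)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "parameters.json")
    with open(path, "w") as handle:
        json.dump({"old_key": 1}, handle)
    update_json_locked(path, lambda d: {"new_key": d["old_key"]})
    with open(path) as handle:
        result = json.load(handle)
```

Advisory locks only protect against processes that also take the lock, so every writer would need to go through the same locked code path.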
pygustus json config file fix
- automatically update an outdated (typo-containing) json file with pygustus and augustus parameters in $AUGUSTUS_CONFIG_PATH/parameters/ if necessary
- redirect miniprot stderr output to file
- catch star-containing lines in miniprot output
- bugfix in miniprothint (for case of only 1 reference proteome/coverage 1, actual fix is in miniprothint repository)
miniprothint integration
- initial miniprothint integration boosts accuracy
- iterative training boosts accuracy
- runtime is much worse than in the previous release
- measures to speed up GALBA will be taken in the future