This repository contains all data and scripts used for the experiments of the paper TSEBRA: Transcript Selector for BRAKER.
Both experiments require that TSEBRA has been downloaded and that its bin
folder is added to $PATH
, for example with:
git clone https://github.com/Gaius-Augustus/TSEBRA
export PATH="$(pwd)/TSEBRA/bin:$PATH"
The first experiment was carried out with all species listed in species.tab
. It demonstrates the increase of accuarcies of TSEBRA compared to BRAKER1 and BRAKER2.
Choose a species and replace "Enter species" with a species name from species.tab
, e.g. "Bombus_terrestris".
species="Enter_species"
species_dir=$(pwd)"/$species"
mkdir -p $species_dir
You have to choose a level of exclusion for the protein database of the BRAKER2 run.
We used the levels 'species_excluded', 'family_excluded', 'order_excluded' for the species listed in model_species.tab
. For all other species we only used 'order_excluded'.
### Choose a test level and remove the '#'
### Only choose species_excluded or family_excluded if your species is listed in model_species.tab
#level="species_excluded"
#level="family_excluded"
#level="order_excluded"
braker1_dir="${species_dir}/braker1/"
braker2_dir="${species_dir}/braker2/${level}"
Before you run BRAKER you have to
- install BRAKER v2.1.5,
- prepare genome and annotation as described in EukSpecies-BRAKER2 into
$species_dir
, - prepare RNA-seq hints and protein data as described in BRAKER2-exp into
$species_dir
.
You have to run BRAKER1 only once per species!
Run BRAKER1:
braker.pl --genome=${species_dir}/data/genome.fasta.masked --hints=${species_dir}/varus/varus.gff --softmasking --species=${species}_braker1 --workingdir=$braker1_dir
Run BRAKER2:
mkdir -p $braker2_dir
braker.pl --genome=${species_dir}/data/genome.fasta.masked --prot_seq=${species_dir}/data/${level}.fa --softmasking --species=${species}_${level} --epmode --prg=ph --nocleanup --AUGUSTUS_ab_initio --workingdir=$braker2_dir
Ensure that the transcript and gene IDs of the BRAKER prediction files are in order:
fix_gtf_ids.py --gtf ${braker1_dir}/braker.gtf --out ${braker1_dir}/braker_fixed.gtf
fix_gtf_ids.py --gtf ${braker2_dir}/braker.gtf --out ${braker2_dir}/braker_fixed.gtf
Run TSEBRA:
tsebra_default=${species_dir}/tsebra_default/${level}/
mkdir -p $tsebra_default
tsebra.py -g ${braker1_dir}/braker_fixed.gtf,${braker2_dir}/braker_fixed.gtf -c ${species_dir}/../config/default.cfg -e ${braker1_dir}/hintsfile.gff,${braker2_dir}/hintsfile.gff -o ${tsebra_default}/tsebra_default.gtf
Compute F1-score, Sensitivity, Specificity:
eval_exp1.py --species_dir $species_dir --test_level $level
The TSEBRA and evaluation results are located at $species_dir/tsebra_default/$level/
.
Experiment 2 (Comparison to EVidenceModeler3)
The second experiment was carried out with all model species listed in model_species.tab
and the following manual only works for these species. In this experiment we compared TSEBRA with EVidenceModeler (EVM).
For this experiment you need to
- perform Experiment 1,
- install EVidenceModeler (commit 68e724e).
If you haven't done it for the first Experiment:
- prepare genome and annotation as described in EukSpecies-BRAKER2 into
$species_dir
, - install TSEBRA
Choose a species and replace "Enter species" with a species name from model_species.tab
, e.g. "Drosophila_melanogaster".
species="Enter_species"
species_dir=$(pwd)"/$species"
Choose a level of exclusion for the protein database of the BRAKER2 run and remove the corresponding '#'.
### Choose a test level and remove the '#':
#level="species_excluded"
#level="family_excluded"
#level="order_excluded"
PASA4
To generate RNA-seq based hints with PASA, you have to make your own VARUS run from scratch as decribed in BRAKER2-exp into $species_dir/varus/
and you have to install:
Add them to your $PATH
variable.
Extract RNA-read sequences from VARUS.bam
samtools fasta ${species_dir}/varus/VARUS.bam > ${species_dir}/varus/VARUS.fasta
Run Trinity5 (you can change the max_memory and CPU to suit your system):
Trinity --seqType fa --max_memory 4G --CPU 4 --single ${species_dir}/varus/VARUS.fasta --run_as_paired --output ${species_dir}/trinity/
Run PASA:
mkdir ${species_dir}/pasa/
cp ${species_dir}/../config/alignAssembly.config ${species_dir}/pasa/
sed -i "s,pathtoyourpasadir,$species_dir/pasa,g" ${species_dir}/pasa/alignAssembly.config
Launch_PASA_pipeline.pl -c ${species_dir}/pasa/alignAssembly.config -C -R -g ${species_dir}/genome/genome.fasta.masked --ALIGNERS blat,gmap -t ${species_dir}/trinity/Trinity.fasta
Enter here the absolute path to the directory where EVidenceModeler is installed
evm_path="ENTER EVM PATH"
Partiton and prepare all data for EVM, TSEBRA and their evaluation:
partition.py --species_dir $species_dir --test_level $level --evm_path $evm_path --out ${species_dir}/EVM/${level}/
If you want to reconstruct the results from the TSEBRA paper then use the provided partition test set:
sed -i "s,pathtoyoutpartitions,$species_dir/EVM/$level/partitions/,g" ${species_dir}/EVM/${level}/partitions/part_test.lst
sed -i "s,pathtoyoutpartitions,$species_dir/EVM/$level/partitions/,g" ${species_dir}/EVM/${level}/partitions/part_train.lst
Or you can sample 90% of the partions as the test set:
sample_partitions.py --partition_dir ${species_dir}/EVM/${level}/partitions/ --seed ${species_dir}/EVM/${level}/seed_value.out
Run EVM for all test partitions. (adjust the number of threads to fit your system)
runEVM.py --species_dir $species_dir --test_level $level --evm_path $evm_path --threads 4
Run TSEBRA for all test partitions. (adjust the number of threads to fit your system)
runTSEBRA.py --species_dir $species_dir --test_level $level --threads 4
Evaluate the test partitions. (adjust the number of threads to fit your system)
eval_exp2.py --species_dir $species_dir --test_level $level --threads 4
The results are located at $species_dir/EVM/$level/evaluation/
.
Enter here the path to the directory containing the folders for the species for which you have performed the experiments.
parent_dir="Enter path to parent dir"
Create table with all available results.
eval_summary.py --parent_dir $parent_dir
Each row contains the result for a species and test level. A row contains the result for the second experiment if results for both experiments are present.
You can find the table in $parent_dir/evaluation/
.
You can plot the average Sensitivity and Specificity for all species in $parent_dir
for gene, transcript and CDS level from the second experiment. This requires that matplotlib
is installed, e.g. with :
pip install matplotlib
Create plot:
eval_summary.py --parent_dir $parent_dir
You can find the plot at $parent_dir/evaluation/plot_exp2.png
.
All source code, i.e. bin/*.py
and bin/*.pl
are under the Artistic License (see https://opensource.org/licenses/Artistic-2.0).
[1] Hoff, Katharina J, Simone Lange, Alexandre Lomsadze, Mark Borodovsky, and Mario Stanke. 2015. “BRAKER1: Unsupervised Rna-Seq-Based Genome Annotation with Genemark-et and Augustus.” Bioinformatics 32 (5). Oxford University Press: 767--69.↑
[2] Tomas Bruna, Katharina J. Hoff, Alexandre Lomsadze, Mario Stanke and Mark Borodvsky. 2021. “BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database." NAR Genomics and Bioinformatics 3(1):lqaa108.↑
[3] Haas, B. J., Salzberg, S. L., Zhu, W., Pertea, M., Allen, J. E., Orvis, J., ... & Wortman, J. R. 2008. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology, 9(1), 1-22.↑
[4] Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D. et al. 2003 Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res, 31, 5654-5666.↑
[5] Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., ... & Regev, A. 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), 644.↑