Tree topology inference from multiple sequence alignments using Deep Learning

This repository contains the R (>=3.5.0) and Python (>=3.6) scripts used in the project "Accurate inference of tree topologies from multiple sequence alignments using deep learning".

Citation:
Anton Suvorov, Joshua Hochuli, Daniel R. Schrider (2019). Accurate inference of tree topologies from multiple sequence alignments using deep learning. Systematic Biology, DOI

R scripts can be found in the INDELible directory. They generate control files for the INDELible program, which simulates MSAs under a given tree topology, set of branch lengths, and substitution and indel model parameters.

Required CRAN R packages:
phangorn
MCMCpack
dplyr
scales

indelible_controlgen_INDEL001.R and indelible_controlgen_NOINDEL.R
These scripts generate control files for MSA simulation with (INDEL001) and without (NOINDEL) indels/gaps. The control files are stored in three directories (topo1, topo2 and topo3), one per topology. The resulting MSAs are used to build the TRAINING, VALIDATION and TEST data sets.
Example: Rscript indelible_controlgen_NOINDEL.R 4 1000 500 (generates 1000 MSAs of length 500 per topology)
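
Why there are exactly three topology directories can be checked with a small calculation: the number of unrooted, fully resolved trees on n taxa is (2n-5)!!, which equals 3 for n=4. The sketch below is illustrative only and is not part of the repository. The function names, the taxon labels A-D and the order in which the three trees are listed are assumptions, so they do not necessarily match topo1, topo2 and topo3.

```python
def n_unrooted_topologies(n_taxa):
    """Number of unrooted, fully resolved trees on n_taxa leaves: (2n-5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5 (empty product for n_taxa == 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


def quartet_topologies(taxa):
    """Return the three unrooted 4-taxon trees as Newick strings.

    The ordering is hypothetical; it is not claimed to match topo1..topo3.
    """
    a, b, c, d = taxa
    return [
        f"(({a},{b}),({c},{d}));",
        f"(({a},{c}),({b},{d}));",
        f"(({a},{d}),({b},{c}));",
    ]


if __name__ == "__main__":
    print(n_unrooted_topologies(4))
    for newick in quartet_topologies(["A", "B", "C", "D"]):
        print(newick)
```

The same formula gives 15 topologies for 5 taxa, which is one reason the number of classes grows quickly with the number of taxa.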
indelible_controlgen_REGIONS_INDEL001.R and indelible_controlgen_REGIONS_NOINDEL0.R
These scripts generate control files for MSA simulation with (INDEL001) and without (NOINDEL) indels/gaps. Each script creates one directory per branch-length region, each with topo1, topo2 and topo3 subdirectories. The directories correspond to the following regions of heterogeneous branch lengths:
EXP: Truncated exponential
FA: Farris zone
FAE: Extended Farris zone
FAT: "Twisted" Farris zone
FE: Felsenstein zone
FEE: Extended Felsenstein zone
LONG: Long branches
LONGOUT: Single long branch
LONGULTRA: Extra-long branches
SHORT: Short branches
SHORTINT: Short internal branch
SHORTOUT: Single short branch
SHORTULTRA: Extra-short branches
These MSAs were used to test the performance of different tree inference methods.
Example: Rscript indelible_controlgen_REGIONS_INDEL001.R 4 1000 500 (generates 1000 MSAs of length 500 per topology for each region)
indelible_controlgen_INDEL001_WARNOW.R
This script generates control files for MSA simulation with no substitutions, only indels (i.e. p_inv=1). This is the scenario under which maximum likelihood (ML) tree inference has been shown to be statistically inconsistent (Warnow, 2012). These MSAs were used to test the performance of different tree inference methods.
Example: Rscript indelible_controlgen_INDEL001_WARNOW.R 4 1000 500 (generates 1000 MSAs of length 500 per topology)
indelible_controlgen_INDEL001_ANTI_WARNOW.R
This script generates control files for MSA simulation with indels in which all MSA sites are allowed to vary (i.e. p_inv=0). These MSAs were used to test the performance of different tree inference methods.
Example: Rscript indelible_controlgen_INDEL001_ANTI_WARNOW.R 4 1000 500 (generates 1000 MSAs of length 500 per topology)

Python scripts can be found in the KERAS directory. They are used for building, training, validating and testing convolutional neural networks (CNNs). These scripts are optimized to run on GPUs.

Required Python dependencies:
TensorFlow
Keras API
SciPy
pandas

keras_CNN_TOPO.py
This script builds, trains, validates and tests a CNN. As input it takes the TRAINING, VALIDATION and TESTING MSAs generated by INDELible and saved as .npy arrays by the fasta2numeric.py script. That utility script in turn takes as input the TRAINING, VALIDATION and TESTING data sets produced by concatenating MSAs, e.g. cat topo1/* topo2/* topo3/* > TRAINING. The Keras model (keras.h5) and the optimal CNN weights (best_weights_clas) are written out by the script once testing is complete.
Example: keras_CNN_TOPO.py -t TRAIN.npy -v VALID.npy --test TEST.npy -N 4 (tested only on 4-taxon MSA cases i.e. -N 4)

Options:
  -h, --help
  -t Training dataset in .npy
  -v Validation dataset in .npy
  --test Test dataset in .npy
  -N N taxa
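
The conversion from concatenated FASTA alignments to a numeric array can be pictured roughly as follows. This is a minimal sketch, not the code in fasta2numeric.py: the integer coding (A=0, C=1, G=2, T=3, anything else=4), the function names and the assumption that every consecutive block of N records forms one MSA are all hypothetical choices made for illustration.

```python
# Hypothetical integer coding; the real fasta2numeric.py may use another scheme.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
OTHER = 4  # gaps and any other character


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def encode_msas(text, n_taxa):
    """Group every n_taxa consecutive records into one MSA of integer codes.

    Returns a nested list indexed as [msa][taxon][site].
    """
    records = parse_fasta(text)
    if len(records) % n_taxa != 0:
        raise ValueError("record count is not a multiple of n_taxa")
    rows = [[CODES.get(ch, OTHER) for ch in seq.upper()] for _, seq in records]
    return [rows[i:i + n_taxa] for i in range(0, len(rows), n_taxa)]


if __name__ == "__main__":
    demo = ">t1\nACG\n>t2\nA-G\n>t3\nTCG\n>t4\nACT\n"
    print(encode_msas(demo, n_taxa=4))
```

If all alignments have the same length, the nested list can be turned into an array with numpy.asarray and written to disk with numpy.save, which produces the .npy format mentioned above.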

keras_CNN_apply.py
This script infers a tree from an MSA. It requires the Keras model and weights files produced by keras_CNN_TOPO.py and a data set in FASTA format.
Example: keras_CNN_apply.py -t TEST.fasta -w best_weights_clas -k keras_model.h5 -N 4

Options:
  -h, --help 
  -t Evaluation dataset in FASTA
  -w Weights file
  -k Keras model
  -N N taxa

keras_CNN_BOOT.py
This script performs nonparametric bootstrapping of MSAs. It requires the Keras model and weights files produced by keras_CNN_TOPO.py, a data set in FASTA format and the labels for that data set.
Example: keras_CNN_BOOT.py --test TEST.fasta --lab labels.txt -w best_weights_clas -k keras_model.h5 -b 100 -N 4

Options:
  -h, --help
  --test Test dataset in FASTA
  --lab Labels of TEST dataset
  -w Weights file
  -k Keras model
  -b N bootstrap replicates
  -N N taxa
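
Nonparametric bootstrapping of an MSA means resampling its columns (sites) with replacement to build pseudo-replicate alignments of the same size, then recording how often each topology is predicted. The sketch below illustrates the idea only and is not the code in keras_CNN_BOOT.py. The function names are hypothetical, and predict is a placeholder callable standing in for the trained CNN.

```python
import random


def bootstrap_msa(msa, rng):
    """Resample alignment columns with replacement.

    msa is a list of equal-length sequence strings, one per taxon.
    The same sampled column indices are applied to every sequence.
    """
    n_sites = len(msa[0])
    if any(len(seq) != n_sites for seq in msa):
        raise ValueError("all sequences must have the same length")
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[c] for c in cols) for seq in msa]


def bootstrap_support(msa, predict, n_replicates, seed=0):
    """Fraction of bootstrap replicates assigned to each predicted label.

    predict is a hypothetical callable: MSA (list of str) -> topology label.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        label = predict(bootstrap_msa(msa, rng))
        counts[label] = counts.get(label, 0) + 1
    return {label: n / n_replicates for label, n in counts.items()}


if __name__ == "__main__":
    msa = ["ACGTAC", "ACGAAC", "TCGTAG", "TCGAAG"]

    # Toy stand-in for the CNN: pick a label from the first column.
    def toy_predict(replicate):
        return 0 if replicate[0][0] == replicate[1][0] else 1

    print(bootstrap_support(msa, toy_predict, n_replicates=100))
```

Passing a seeded random.Random instance keeps the replicates reproducible, which is convenient when comparing support values across runs.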

Python scripts that were used to reconstruct the error surface are available here.
