Code for accurate inference of tree topologies from multiple sequence alignments using deep learning

SchriderLab/Tree_learning

Tree topology inference from multiple sequence alignments using Deep Learning

This repository contains the R (>=3.5.0) and Python (>=3.6) scripts that were used in the project "Accurate inference of tree topologies from multiple sequence alignments using deep learning".

Citation:
Anton Suvorov, Joshua Hochuli, Daniel R. Schrider (2019). Accurate inference of tree topologies from multiple sequence alignments using deep learning. Systematic Biology, DOI

R scripts can be found in the INDELible directory. They generate control files for the INDELible program, which simulates MSAs under a given tree topology, branch lengths, and substitution and indel model parameters.

Required CRAN R packages:
phangorn
MCMCpack
dplyr
scales

  1. indelible_controlgen_INDEL001.R and indelible_controlgen_NOINDEL.R
    These scripts generate control files for MSA simulation with (INDEL001) and without (NOINDEL) indels/gaps. The control files are stored in three directories (topo1, topo2 and topo3) that correspond to the three topologies. The resulting MSAs are used to build the TRAINING, VALIDATION and TEST data sets.
    Example: Rscript indelible_controlgen_NOINDEL.R 4 1000 500 (generates 1000 MSAs of length 500 per topology)

  2. indelible_controlgen_REGIONS_INDEL001.R and indelible_controlgen_REGIONS_NOINDEL0.R
    These scripts generate control files for MSA simulation with (INDEL001) and without (NOINDEL) indels/gaps. The scripts create the EXP, FA, FAE, FAT, FE, FEE, LONG, LONGOUT, LONGULTRA, SHORT, SHORTINT, SHORTOUT and SHORTULTRA directories, each containing topo1, topo2 and topo3 subdirectories. These correspond to heterogeneous branch length regions, namely Truncated exponential (EXP), Farris zone (FA), Extended Farris zone (FAE), "Twisted" Farris zone (FAT), Felsenstein zone (FE), Extended Felsenstein zone (FEE), Long branches (LONG), Single long branch (LONGOUT), Extra-long branches (LONGULTRA), Short branches (SHORT), Short internal branch (SHORTINT), Single short branch (SHORTOUT) and Extra-short branches (SHORTULTRA). These MSAs were used to test the performance of different tree inference methods.
    Example: Rscript indelible_controlgen_REGIONS_INDEL001.R 4 1000 500 (generates 1000 MSAs of length 500 per topology for each region)

  3. indelible_controlgen_INDEL001_WARNOW.R
    This script generates control files for MSA simulation with no substitutions, only indels (i.e. p_inv=1). This is the scenario under which maximum likelihood (ML) tree inference has been shown to be statistically inconsistent (Warnow, 2012). These MSAs were used to test performance of different tree inference methods.
    Example: Rscript indelible_controlgen_INDEL001_WARNOW.R 4 1000 500 (generates 1000 MSAs of length 500 per topology)

  4. indelible_controlgen_INDEL001_ANTI_WARNOW.R
    This script generates control files for MSA simulation with indels and allowing all MSA sites to vary (i.e. p_inv=0). These MSAs were used to test performance of different tree inference methods.
    Example: Rscript indelible_controlgen_INDEL001_ANTI_WARNOW.R 4 1000 500 (generates 1000 MSAs of length 500 per topology)
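
The topo1, topo2 and topo3 directories mentioned above reflect the fact that four taxa admit exactly three unrooted tree topologies. The following minimal Python sketch (not part of this repository) enumerates them as Newick strings. The taxon names and the order in which the topologies are listed are assumptions and need not match the numbering used by the R scripts.

```python
def quartet_topologies(taxa):
    """Return the three unrooted topologies for four taxa as Newick strings.

    The first taxon is paired with each of the other three in turn;
    the two remaining taxa form the opposite cherry.
    """
    if len(taxa) != 4 or len(set(taxa)) != 4:
        raise ValueError("exactly four distinct taxa are required")
    first, rest = taxa[0], taxa[1:]
    topologies = []
    for partner in rest:
        others = [t for t in rest if t != partner]
        topologies.append(f"(({first},{partner}),({others[0]},{others[1]}));")
    return topologies


if __name__ == "__main__":
    # Hypothetical taxon labels, used only for illustration.
    for i, newick in enumerate(quartet_topologies(["A", "B", "C", "D"]), start=1):
        print(f"topology {i}: {newick}")
```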

Python scripts can be found in the KERAS directory. They are used for building, training, validating and testing convolutional neural networks (CNNs). These scripts are optimized to run on GPUs.

Required Python dependencies:
TensorFlow
Keras API
SciPy
pandas

  1. keras_CNN_TOPO.py
    This script builds, trains, validates and tests the CNN. As input it takes TRAINING, VALIDATION and TESTING MSAs that were generated by INDELible and saved as .npy arrays with the fasta2numeric.py script. That utility script in turn takes as input TRAINING, VALIDATION and TESTING datasets produced by concatenating MSAs, e.g. cat topo1/* topo2/* topo3/* > TRAINING. The Keras model (keras.h5) and the optimal CNN weights (best_weights_clas) are written out by the script after testing is completed.
    Example: keras_CNN_TOPO.py -t TRAIN.npy -v VALID.npy --test TEST.npy -N 4 (tested only on 4-taxon MSA cases i.e. -N 4)
Options:
 -h, --help   
 -t Training dataset in .npy
 -v Validation dataset in .npy
 --test Test dataset in .npy
 -N N taxa 
  2. keras_CNN_apply.py
    This script infers a tree from an MSA. It requires the Keras model and weights files produced by keras_CNN_TOPO.py and a data set in FASTA format.
    Example: keras_CNN_apply.py -t TEST.fasta -w best_weights_clas -k keras_model.h5 -N 4
Options:
  -h, --help 
  -t Evaluation dataset in FASTA
  -w Weights file
  -k Keras model
  -N N taxa
  3. keras_CNN_BOOT.py
    This script performs nonparametric bootstrapping of MSAs. It requires the Keras model and weights files produced by keras_CNN_TOPO.py, a data set in FASTA format and the labels for that data set.
    Example: keras_CNN_BOOT.py --test TEST.fasta --lab labels.txt -w best_weights_clas -k keras_model.h5 -b 100 -N 4
Options:
  -h, --help
  --test Test dataset in FASTA
  --lab Labels of TEST dataset
  -w Weights file
  -k Keras model
  -b N bootstrap replicates
  -N N taxa
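
Two steps mentioned above, converting an MSA to a numeric representation and nonparametric bootstrapping, can be sketched as follows. This is an illustration under stated assumptions, not the code in fasta2numeric.py or keras_CNN_BOOT.py: the integer encoding (A=0, C=1, G=2, T=3, anything else including gaps = 4) and all function names are hypothetical. Plain Python lists keep the sketch dependency-free, whereas the repository stores the data as .npy arrays. Bootstrapping here means resampling alignment columns with replacement, which is the usual nonparametric bootstrap for MSAs.

```python
import random

# Hypothetical encoding; fasta2numeric.py may use a different mapping.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
OTHER = 4  # gaps and ambiguous characters


def parse_fasta(text):
    """Return the sequences of a FASTA-formatted string, in file order."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = []
            records.append(current)
        elif current is not None:
            current.append(line.upper())
    return ["".join(parts) for parts in records]


def encode_msa(seqs):
    """Encode aligned sequences as a list of rows of integers (taxa x sites)."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to equal length")
    return [[ENCODING.get(ch, OTHER) for ch in s] for s in seqs]


def bootstrap_msa(msa, n_replicates, seed=None):
    """Resample alignment columns with replacement, n_replicates times."""
    rng = random.Random(seed)
    n_sites = len(msa[0])
    replicates = []
    for _ in range(n_replicates):
        cols = rng.choices(range(n_sites), k=n_sites)
        # The same column indices are applied to every taxon so that
        # each resampled site remains a column of the original alignment.
        replicates.append([[row[c] for c in cols] for row in msa])
    return replicates


if __name__ == "__main__":
    fasta = ">t1\nACGT\n>t2\nA-GT\n>t3\nACGA\n>t4\nNCGT\n"
    msa = encode_msa(parse_fasta(fasta))
    replicates = bootstrap_msa(msa, n_replicates=3, seed=1)
    print(len(replicates), "replicates of", len(msa), "taxa x", len(msa[0]), "sites")
```

Each replicate has the same dimensions as the original alignment, so it can be passed to the same classifier as the original MSA.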

Python scripts that were used to reconstruct the error surface are available here.
