CEDAR is a program aimed at manipulating rooted phylogenetic trees encoded as vecors, as described in the paper "A Vector Representation for Phylogenetic Trees".
WARNING. CEDAR is still in development and comes with no warranty.
CEDAR can be used in command line (described below) or wihin python progams using the class TreeVec
implemented in the file
TreeVec.py.
The directory example contains an example of using CEDAR through command-line.
The directory experiments contains the code to reproduce the experiments described in the paper "A Vector Representation for Phylogenetic Trees".
CEDAR implements a vector representation of rooted phylogenetic trees.
Consider the set of all rooted binary phylogenetic trees with totally ordered leaves, labelled
A vector representation of a tree is a vector
-
$v_1=1$ ; -
$\forall i \in {1,\dots,n}$ ,$i$ appears exactly twice in$v$ ; -
$\forall i \in {2,\dots,n}$ , the first occurrence of$i$ in$v$ appears before the second occurrence of$i-1$ ; -
$\forall i \in {2,\dots,n}$ , the second occurrence of$i$ in$v$ appears after the second occurrence of$i-1$ .
There is a one-to-one correspondence between the set of rooted binary phylogenetic trees with leaves labelled
Given a rooted binary phylogenetic tree
- the entries of the vector are separated by commas (
,
); - if the entry
$i$ corresponds to an internal node, it is written asi:N:b
whereN
is the name of the node andb
the length of the branch to its parent; - if the entry
$i$ corresponds to a leaf it is written asb
where here toob
is the length of the branch to its parent.
Note that the label$i$ of leaves does not need to be written as it is implicit that the$i^{th}$ leaf is labelled by$i$ . To recover the names of leaves, a CEDAR file always starts with a first line that contains the order on the leaves used to define the vector representation.
For example, a CEDAR file encoding the two trees
((b:1.0,d:2.0)e:2.0,(a:1.0,c:1.0)g:1.0)f;
(((b:1.0,d:2.0)x:2.0,a:1.0)y:1.0,c:1.0)z;
where the first tree has internal nodes named e,f,g
and the second tree has internal nodes named x,y,z
, and where leaves are ordered as
1=a,2=b,3=c,4=d
, is
#a,b,c,d
1::0.0,2:f:0.0,3:g:1.0,1.0,4:e:2.0,1.0,1.0,2.0;
1::0.0,3:z:0.0,2:y:1.0,1.0,4:x:2.0,1.0,1.0,2.0;
corresponding to vectors
The command-line script is CEDAR.py, and allows to perform the following tasks:
-
Converting trees written in Newick format into the CEDAR format:
python src/CEDAR.py fromNewick --input_file Newick_file --output_file CEDAR_file [--order_file orde]
Encodes the files from
Newick_file
in the CEDAR format in fileCEDAR_file
. If the fileorder
is provided, it is used to determine the taxa order used for the encoding; otherwise, the taxa order is defined by a postorder travrsal of the first tree inNewick_file
. The format of a taxa order file is a single line, starting by#order
followed by the list of all leaves separated by commas.Assumption: all trees in
Newick_file
are rooted phylogenetic trees on the same set of taxa. -
Creating random taxa orders:
python src/CEDAR.py orders --input_file CEDAR_file --output_dir out_dir [--nb_orders N] [--output_prefix out_pref]
Creates
N
random taxa orders in filesout_dir/out_pref_I.txt
forI
from1
toN
. Default values:N=1
,out_pref=CEDAR_random_order
.Assumption: in all commands, all trees in a
CEDAR_file
are rooted phylogenetic trees on the same set of taxa and having been encoded into vector using the same order on taxa, written in the first line of the file. -
Converting trees written in CEDAR format into the Newick format:
python src/CEDAR.py toNewick --input_file CEDAR_file --output_file Newick_file [--format Newick_format]
The optional
format
agument definesthe Newick format of the created Newick file (default:1
). -
Computing the HOP similarity between trees:
python src/CEDAR.py HOP_sim --input_file CEDAR_file --output_file similarity_file [--mode distance_mode]
Creates a file that records the HOP similarity between the trees in CEDAR_file.
- If
mode=pairwise
the HOP similariy is computed between all pairs of trees. - If
mode=sequence
the HOP similariy is computed between successive pairs of trees. - If
mode=first
the HOP similariy is computed between the first tree and all other trees. - The default mode is
sequence
.
- If
-
Computing the HOP neighbourhood size:
python src/CEDAR.py HOP_ngb1 --input_file CEDAR_file --output_file size_file
Computes the size of the HOP neighbourhood for every tree in
CEDAR_file
-
Computing the HOP neighbourhoods:
python src/CEDAR.py HOP_ngb2 --input_file CEDAR_file --output_dir out_dir [--output_prefix out_pref]
Computes the trees in the HOP neighbourhood for every tree in
CEDAR_file
. The HOP neighbourhood for treei
is in the fileout_dir/out_pref.i.vec
. The default value forout_pref
isCEDAR_HOP_neighbourhood
. -
Computing HOP paths between successive trees:
python src/CEDAR.py HOP_path --input_file CEDAR_file --output_file CEDAR_path_file
Creates a new CEDAR file where a sequence of trees, each differing from the previous one by a single HOP, is inserted between all pairs of successive trees in the input file
CEDAR_file
.
The python class TreeVec
is implemented in the file TreeVec.py.
A TreeVec
object, that encodes a rooted phylogenetic tree as a vector, can be instantiated from
- a Newick string, expected to be in format 1, or
- an ete3
Tree
object, or - a string encoding a
TreeVec
vector itself.
The main methods of the class, aside of the constuctor, are:
copy
: creates a copy of aTreeVec
object;check_vector
: checks that a TreVec object is a valid vector representation of a phylogenetic tree;extract_leaves_order
: extracts from aTreeVec
object the taxa order used to create the vector representation of the tree;treevec2tree
andtree2treevec
: conversion between aTreeVec
object and an ete3Tree
object;treevec2str
andstr2treevec
: conversion between aTreeVec
object and a string encoding the vector;treevec2newick
andnewick2treevec
: conversion between aTreeVec
object and a Newick string;hop
: performs a HOP tree rearrangement on the tree encoded by the object;hop_neighbourhood_size
: computes the size of the HOP neighbourhood of the tree encoded by the object;hop_neighbourhood
: computes theTreeVec
representations of the trees in the HOP neighbourhood of the tree encoded by the object;hop_similarity
: computes the HOP similarity or a HOP LCS (see paper for an explanation) with another tree;hop_next
: creates a tree one HOP closer to a target tree;random_hop
: peform a andom HOP on the current tree.