-
Notifications
You must be signed in to change notification settings - Fork 0
Molecular clocking
In this tutorial, we will elaborate at length how to reconstruct a time-resolved evolutionary history using the fine-grained genome alignment from MetaClock with an external software BEAST2.
Step1. Setting up an evolutionary model using BEAUti
We used the combined MSA as input from the last MetaClock tutorial to set up the Bayesian evolution analysis, focusing on eight model combinations which are commonly used in bacterial evolution analysis. The carbon dated age of each ancient sample is used for the tip calibration.
We first load the genome alignment in BEAUti and set the Site Model
using Gamma Site Model
with four categories and choose the most commonly used substitution model GTR
, leaving other parameters as default.
We choose Strict Clock
as Clock Model
with an initial rate being set as 1.0E-8 (per/site/year) which can be a reasonable evolutionary rate for common microbiome species. And the Tree
prior (indicating the population growth pattern) was set as Coalescent Constant Population
. Clicking Add Prior
in order to add carbon dated ages of ancient samples for calibrating the whole genome phylogeny:
Taking sample ERR3003614 as an example, the appropriate age from indirect evidence is ~ 207.5 years old. To model the age uncertainty, we employed a normal distribution with median of 207.5 and sigma of 6 (In realistic situation, the distribution should reflect the age estimates from direct C14 dating technology). Checking Tipsonly
allows this value to be used as tip age instead of node age. Once all age priors have been added, users can set clockRate
prior distribution which controls the range of possible rates being sampled in the process of simulation.
Finally, the Chain Length
can be specified. Due to the complexity of priors and models used we highly suggest to set a long MCMC chain (e.g. 50 million iterations here), or run several independent chains and merge them in the end using a LogCombiner, a utility provided by BEAST2 package.
Here, we just demonstrated the setting for Strict Clock + Coalescent Constant Population
, the XML of other model settings can be found below:
Model | Clock model | Population model | XML file |
---|---|---|---|
1 | Strict | Constant | Model_1.xml |
2 | Strict | Exponential | Model_2.xml |
3 | Strict | Bayesian skyline | Model_3.xml |
4 | Strict | Extended bayesian skyline | Model_4.xml |
5 | Log normal | Constant | Model_5.xml |
6 | Log normal | Exponential | Model_6.xml |
7 | Log normal | Bayesian skyline | Model_7.xml |
8 | Log normal | Extended Bayesian skyline | Model_8.xml |
Since Bayesian simulation can take hours to weeks (even months when a large dataset is used) to complete, we just demonstrate here the model 1 and provide the respective results of eight models below:
java -jar /shares/CIBIO-Storage/CM/mir/tools/beast-2.5.1/beast/lib/beast.jar --threads 4 Model_1.xml
For more advanced usage of BEAST2, please visit here.
BEAST2 results |
---|
Model_1 |
Model_2 |
Model_3 |
Model_4 |
Model_5 |
Model_6 |
Model_7 |
Model_8 |
We use PathSampler
and PathSampleAnalyser
to calculate the marginal likelihood for each model.
Notice: make sure PathSampler
and PathSamplerAnalyser
have been installed correctly in BEAST2 appstore.
Generate running scripts for each step:
<path_to_beast_package>/beast/bin/appstore PathSampler -model1 Model_1.xml -nrOfSteps 50 -rootdir <dir_for_results> -doNotRun
Estimate likelihood for each step:
./run_0.sh
Estimate marginal likelihood for each model:
<path_to_beast_package>/beast/bin/appstore PathSampleAnalyser -rootdir <folder_of_all_steps> -nrOfSteps 50
As a result:
Step theta likelihood contribution ESS
0 1 -1570515.5832 -104318.4764 2.5716
1 0.9336 -1570525.838 -99368.3271 9.4538
2 0.8703 -1570544.6213 -94556.7488 8.5284
3 0.8101 -1570539.5237 -89879.7197 10.8803
4 0.7529 -1570556.5696 -85339.0703 16.0715
5 0.6985 -1570555.5031 -80931.4009 32.745
6 0.647 -1570584.2963 -76658.2638 17.2062
7 0.5982 -1570588.3488 -72515.857 20.13
8 0.552 -1570589.9528 -68504.3468 9.7112
9 0.5084 -1570600.7527 -64623.0847 8.639
10 0.4673 -1570613.0011 -60870.6692 8.5125
11 0.4285 -1570625.9547 -57245.9775 15.9313
12 0.3921 -1570634.7112 -53747.683 18.1038
13 0.3578 -1570631.0436 -50374.5156 27.8726
14 0.3258 -1570642.3846 -47126.1077 56.7735
15 0.2958 -1570649.0543 -44000.7181 19.7647
16 0.2677 -1570643.3899 -40996.9042 49.4321
17 0.2416 -1570657.4905 -38114.3763 66.4495
18 0.2174 -1570665.4892 -35351.1338 55.9891
19 0.1949 -1570680.8288 -32706.2298 20.5128
20 0.174 -1570704.5712 -30178.32 45.6671
21 0.1548 -1570685.9538 -27765.1298 88.8799
22 0.1372 -1570704.1308 -25466.8342 67.3396
23 0.1209 -1570725.3657 -23281.2942 53.5553
24 0.1061 -1570727.5613 -21206.843 19.6047
25 0.0926 -1570773.9015 -19242.8054 68.2124
26 0.0804 -1570823.3195 -17387.1225 11.3609
27 0.0693 -1570808.1565 -15637.3746 39.1773
28 0.0593 -1570894.2697 -13993.8193 30.9428
29 0.0504 -1570882.4807 -12452.9459 29.8216
30 0.0425 -1570942.4683 -11014.4189 35.0841
31 0.0355 -1571022.5111 -9675.835 121.0904
32 0.0293 -1571197.3343 -8436.1606 7.6591
33 0.024 -1571216.3202 -7291.5257 68.1923
34 0.0193 -1571373.9281 -6241.9512 33.7858
35 0.0154 -1571516.6275 -5284.3599 27.4281
36 0.012 -1571773.0059 -4417.0336 47.7465
37 0.0092 -1572212.9966 -3637.8045 76.2228
38 0.0069 -1572906.4074 -2944.0981 71.3962
39 0.005 -1573338.9504 -2332.3156 49.8483
40 0.0035 -1574930.1692 -1802.0667 14.4604
41 0.0024 -1577163.0687 -1348.9279 53.7188
42 0.0015 -1580916.3736 -969.7171 49.8171
43 0.0009 -1591247.5496 -664.3742 59.4118
44 0.0005 -1613787.9044 -423.0577 20.4938
45 0.0002 -1653183.9041 -243.0643 12.2233
46 0.0001 -1700263.9677 -114.7446 16.9172
47 0 -1744209.2426 -37.0274 6.6625
48 0 -2209026.7688 -5.2916 5.6688
49 0 -3098331.254 0 10.0024
marginal L estimate = -1570725.8741238378
Here, we summarized major statistics for each model simulation:
Model | Rate estimation [95% HPD interval] (substitution/site/year) | Root age mean [95% HPD interval] (years) | Marginal likelihood | Log Bayes factor | #Simulation states |
---|---|---|---|---|---|
*Model_1 | 5.452E-7 [3.5469E-7, 7.4024E-7] | 3561 [2,982, 4,227] | -1570725.8741238378 | 0 | 13,356,000 |
Model_2 | 5.411E-7 [4.3642E-7, 6.4524E-7] | 3583 [3,007, 4,226] | -1570725.31723956 | 0.5568842702 | 276,674,000 |
Model_3 | 5.41E-7 [4.3766E-7, 6.4299E-7] | 3,583 [3,005, 4,208] | -1570953.12441898 | -227.2502951 | 271,047,000 |
Model_4 | 5.452E-7 [4.3748E-7, 6.5046E-7] | 3,560 [2,971, 4,199] | -1571260.40271093 | -534.5285871 | 296,687,000 |
+Model_5 | 7.1109E-9 [1.2793E-13, 2.1842E-8] | 3.9781E6 [23,499, 5.858E6] | -1570305.62983684 | 420.244287 | 312,094,000 |
Model_6 | 1.6783E-8 [4.6468E-10, 4.0671E-8] | 2.217E5 [19,437, 6.5124E5] | -Infinity | NA | 296,903,000 |
Model_7 | 1.814E-8 [5.7131E-9, 3.5978E-8] | 1.2914E5 [31,927, 2.516E5] | -Infinity | NA | 429,496,000 |
Model_8 | - | - | - | - | - |
Model_8 lacks adequate MCMC mixing so we exclude this model here for model comparison (But we highly recommend users to run the simulation for each model as long as possible if computational expense is affordable). * indicates the null model (Here null model is Model_1 and the remain are alternative models for comparison. Log Bayes factor = marginal L estimate of Model_X - marginal L estimate of Model_1. If Log Bayes factor >0 Model_X is preferred to Model_1 based on the same data. Otherwise, Model_1 is preferred to Model_X. The greater the Log Bayes factor is, the better the model fits the data. + points to the model with highest Log Bayes factor. Hence, we will select Model 5 (Relaxed clock + Coalescent Constant Population) as a representative for the evolutionary history of M. orali.
To interpret the time-resolved genome phylogeny we estimated above, one of neat ways is to summarize a consensus tree using TreeAnnotator - a utility in the BEAST2 package. Here, we annotated simulated trees from Model_5 using 10% burnin, 95% posterior probability limit to generate a maximum clade credibility tree with common ancestor heights being summarized. The summarized tree was visualized using FigTree.
The time-resolved tree dispalys various evolutionary rates between strains and an absolute divergence time of strains of a given microbial species.