Skip to content

Molecular clocking

Kun Huang edited this page Apr 2, 2021 · 3 revisions

Inferring a time-resolved evolutionary history at the strain level

In this tutorial, we will elaborate at length how to reconstruct a time-resolved evolutionary history using the fine-grained genome alignment from MetaClock with an external software BEAST2.

Requirements

Step1. Setting up an evolutionary model using BEAUti

We used the combined MSA as input from the last MetaClock tutorial to set up the Bayesian evolution analysis, focusing on eight model combinations which are commonly used in bacterial evolution analysis. The carbon dated age of each ancient sample is used for the tip calibration.

Fig1

We first load the genome alignment in BEAUti and set the Site Model using Gamma Site Model with four categories and choose the most commonly used substitution model GTR, leaving other parameters as default.

Fig2 We choose Strict Clock as Clock Model with an initial rate being set as 1.0E-8 (per/site/year) which can be a reasonable evolutionary rate for common microbiome species. And the Tree prior (indicating the population growth pattern) was set as Coalescent Constant Population. Clicking Add Prior in order to add carbon dated ages of ancient samples for calibrating the whole genome phylogeny:

Fig3

Taking sample ERR3003614 as an example, the appropriate age from indirect evidence is ~ 207.5 years old. To model the age uncertainty, we employed a normal distribution with median of 207.5 and sigma of 6 (In realistic situation, the distribution should reflect the age estimates from direct C14 dating technology). Checking Tipsonly allows this value to be used as tip age instead of node age. Once all age priors have been added, users can set clockRate prior distribution which controls the range of possible rates being sampled in the process of simulation.

Fig4 Finally, the Chain Length can be specified. Due to the complexity of priors and models used we highly suggest to set a long MCMC chain (e.g. 50 million iterations here), or run several independent chains and merge them in the end using a LogCombiner, a utility provided by BEAST2 package.

Here, we just demonstrated the setting for Strict Clock + Coalescent Constant Population, the XML of other model settings can be found below:

Model Clock model Population model XML file
1 Strict Constant Model_1.xml
2 Strict Exponential Model_2.xml
3 Strict Bayesian skyline Model_3.xml
4 Strict Extended bayesian skyline Model_4.xml
5 Log normal Constant Model_5.xml
6 Log normal Exponential Model_6.xml
7 Log normal Bayesian skyline Model_7.xml
8 Log normal Extended Bayesian skyline Model_8.xml

Step2. Running all models with BEAST2 till sufficient mixture

Since Bayesian simulation can take hours to weeks (even months when a large dataset is used) to complete, we just demonstrate here the model 1 and provide the respective results of eight models below:

java -jar /shares/CIBIO-Storage/CM/mir/tools/beast-2.5.1/beast/lib/beast.jar --threads 4 Model_1.xml

For more advanced usage of BEAST2, please visit here.

BEAST2 results
Model_1
Model_2
Model_3
Model_4
Model_5
Model_6
Model_7
Model_8

Step3. Model comparison based on log-transformed Bayes factor

We use PathSampler and PathSampleAnalyser to calculate the marginal likelihood for each model. Notice: make sure PathSampler and PathSamplerAnalyser have been installed correctly in BEAST2 appstore.

Generate running scripts for each step:

<path_to_beast_package>/beast/bin/appstore PathSampler -model1 Model_1.xml -nrOfSteps 50 -rootdir <dir_for_results> -doNotRun

Estimate likelihood for each step:

./run_0.sh

Estimate marginal likelihood for each model:

<path_to_beast_package>/beast/bin/appstore PathSampleAnalyser -rootdir <folder_of_all_steps> -nrOfSteps 50

As a result:

Step        theta         likelihood   contribution ESS
0            1            -1570515.5832 -104318.4764 2.5716       
1            0.9336       -1570525.838 -99368.3271  9.4538       
2            0.8703       -1570544.6213 -94556.7488  8.5284       
3            0.8101       -1570539.5237 -89879.7197  10.8803      
4            0.7529       -1570556.5696 -85339.0703  16.0715      
5            0.6985       -1570555.5031 -80931.4009  32.745       
6            0.647        -1570584.2963 -76658.2638  17.2062      
7            0.5982       -1570588.3488 -72515.857   20.13        
8            0.552        -1570589.9528 -68504.3468  9.7112       
9            0.5084       -1570600.7527 -64623.0847  8.639        
10           0.4673       -1570613.0011 -60870.6692  8.5125       
11           0.4285       -1570625.9547 -57245.9775  15.9313      
12           0.3921       -1570634.7112 -53747.683   18.1038      
13           0.3578       -1570631.0436 -50374.5156  27.8726      
14           0.3258       -1570642.3846 -47126.1077  56.7735      
15           0.2958       -1570649.0543 -44000.7181  19.7647      
16           0.2677       -1570643.3899 -40996.9042  49.4321      
17           0.2416       -1570657.4905 -38114.3763  66.4495      
18           0.2174       -1570665.4892 -35351.1338  55.9891      
19           0.1949       -1570680.8288 -32706.2298  20.5128      
20           0.174        -1570704.5712 -30178.32    45.6671      
21           0.1548       -1570685.9538 -27765.1298  88.8799      
22           0.1372       -1570704.1308 -25466.8342  67.3396      
23           0.1209       -1570725.3657 -23281.2942  53.5553      
24           0.1061       -1570727.5613 -21206.843   19.6047      
25           0.0926       -1570773.9015 -19242.8054  68.2124      
26           0.0804       -1570823.3195 -17387.1225  11.3609      
27           0.0693       -1570808.1565 -15637.3746  39.1773      
28           0.0593       -1570894.2697 -13993.8193  30.9428      
29           0.0504       -1570882.4807 -12452.9459  29.8216      
30           0.0425       -1570942.4683 -11014.4189  35.0841      
31           0.0355       -1571022.5111 -9675.835    121.0904     
32           0.0293       -1571197.3343 -8436.1606   7.6591       
33           0.024        -1571216.3202 -7291.5257   68.1923      
34           0.0193       -1571373.9281 -6241.9512   33.7858      
35           0.0154       -1571516.6275 -5284.3599   27.4281      
36           0.012        -1571773.0059 -4417.0336   47.7465      
37           0.0092       -1572212.9966 -3637.8045   76.2228      
38           0.0069       -1572906.4074 -2944.0981   71.3962      
39           0.005        -1573338.9504 -2332.3156   49.8483      
40           0.0035       -1574930.1692 -1802.0667   14.4604      
41           0.0024       -1577163.0687 -1348.9279   53.7188      
42           0.0015       -1580916.3736 -969.7171    49.8171      
43           0.0009       -1591247.5496 -664.3742    59.4118      
44           0.0005       -1613787.9044 -423.0577    20.4938      
45           0.0002       -1653183.9041 -243.0643    12.2233      
46           0.0001       -1700263.9677 -114.7446    16.9172      
47           0            -1744209.2426 -37.0274     6.6625       
48           0            -2209026.7688 -5.2916      5.6688       
49           0            -3098331.254 0            10.0024      

marginal L estimate = -1570725.8741238378

Here, we summarized major statistics for each model simulation:

Model Rate estimation [95% HPD interval] (substitution/site/year) Root age mean [95% HPD interval] (years) Marginal likelihood Log Bayes factor #Simulation states
*Model_1 5.452E-7 [3.5469E-7, 7.4024E-7] 3561 [2,982, 4,227] -1570725.8741238378 0 13,356,000
Model_2 5.411E-7 [4.3642E-7, 6.4524E-7] 3583 [3,007, 4,226] -1570725.31723956 0.5568842702 276,674,000
Model_3 5.41E-7 [4.3766E-7, 6.4299E-7] 3,583 [3,005, 4,208] -1570953.12441898 -227.2502951 271,047,000
Model_4 5.452E-7 [4.3748E-7, 6.5046E-7] 3,560 [2,971, 4,199] -1571260.40271093 -534.5285871 296,687,000
+Model_5 7.1109E-9 [1.2793E-13, 2.1842E-8] 3.9781E6 [23,499, 5.858E6] -1570305.62983684 420.244287 312,094,000
Model_6 1.6783E-8 [4.6468E-10, 4.0671E-8] 2.217E5 [19,437, 6.5124E5] -Infinity NA 296,903,000
Model_7 1.814E-8 [5.7131E-9, 3.5978E-8] 1.2914E5 [31,927, 2.516E5] -Infinity NA 429,496,000
Model_8 - - - - -

Model_8 lacks adequate MCMC mixing so we exclude this model here for model comparison (But we highly recommend users to run the simulation for each model as long as possible if computational expense is affordable). * indicates the null model (Here null model is Model_1 and the remain are alternative models for comparison. Log Bayes factor = marginal L estimate of Model_X - marginal L estimate of Model_1. If Log Bayes factor >0 Model_X is preferred to Model_1 based on the same data. Otherwise, Model_1 is preferred to Model_X. The greater the Log Bayes factor is, the better the model fits the data. + points to the model with highest Log Bayes factor. Hence, we will select Model 5 (Relaxed clock + Coalescent Constant Population) as a representative for the evolutionary history of M. orali.

Step4. Interpreting the inferred evolutionary history

To interpret the time-resolved genome phylogeny we estimated above, one of neat ways is to summarize a consensus tree using TreeAnnotator - a utility in the BEAST2 package. Here, we annotated simulated trees from Model_5 using 10% burnin, 95% posterior probability limit to generate a maximum clade credibility tree with common ancestor heights being summarized. The summarized tree was visualized using FigTree.

Time Tree

The time-resolved tree dispalys various evolutionary rates between strains and an absolute divergence time of strains of a given microbial species.