Skip to content

Commit

Permalink
Merge pull request #10 from amkram/main
Browse files Browse the repository at this point in the history
Add miscellaneous experiments and clean up
  • Loading branch information
bpt26 authored May 3, 2022
2 parents 0206b9d + 852568f commit 1249cc2
Show file tree
Hide file tree
Showing 307 changed files with 2,292 additions and 40 deletions.
6 changes: 3 additions & 3 deletions 1_make_starting_tree/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ This folder contains scripts to create a starting tree for subsequent analyses.


---
### Tl;dr
### Filtered SARS-CoV-2 Dataset

The file you are likely most interested in is:

Expand All @@ -31,7 +31,7 @@ If you're using this file for methods development, do note of course that there

---

## Steps to reproduce
## Steps to reproduce analyses

### Prepare input files
Some files are already included in the `input` folder. Download the rest of the files that we will use to make our starting tree.
Expand Down Expand Up @@ -105,4 +105,4 @@ matUtils extract -i output/publicMsa.2021-03-18.masked.retain_samples.save.minus
gzip output/publicMsa.2021-03-18.masked.retain_samples.save.minus_parsimony.binary.*
```

The final starting bifurcating (resolved) MAT excluding incomplete and ambiguous sequences and pruned of high parsimony samples is at `output/publicMsa.2021-03-18.masked.retain_samples.save.minus_parsimony.binary.pb.gz`. The multifurcating (collapsed) MAT is also in `output` along with Newick files for both trees.
The final starting bifurcating (resolved) MAT excluding incomplete and ambiguous sequences and pruned of high parsimony samples is at `output/publicMsa.2021-03-18.masked.retain_samples.save.minus_parsimony.binary.pb.gz`. The multifurcating (collapsed) MAT is also in `output` along with Newick files for both trees.
32 changes: 23 additions & 9 deletions 2_optimize_starting_tree/README.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
# Optimize starting tree

In this folder, we evaluate the performance of [matOptimize](https://github.com/yatisht/usher), [TreeRearrange](https://github.com/yceh/usher), [IQ-TREE 2.1.3](http://www.iqtree.org/#download), and [FastTree2](http://www.microbesonline.org/fasttree/) as optimization strategies. The best tree (highest log likelihood, lowest parsimony) was chosen as a ground truth phylogeny for use in simulated data experiments.
In this folder, we evaluate the performance of [matOptimize](https://github.com/yatisht/usher), [matOptimize TreeRearrange](https://github.com/yceh/usher/tree/Refactor-FS-cleanup), [IQ-TREE 2](http://www.iqtree.org/#download), and [FastTree2](http://www.microbesonline.org/fasttree/) as optimization strategies on the starting tree. A final optimized tree was chosen as the "ground truth" phylogeny for downstream analyses.

**Input**: The "starting tree" from folder 1

**Output**: `output/after_usher_optimized_fasttree_iter6.tree.xz`: An optimized version of the starting tree (six iterations of FastTree2 + one iteration of matOptimize). This tree is used as input in the analyses in folder 4.
**Output**: `output/after_usher_optimized_fasttree_iter6.tree.xz`: An optimized version of the starting tree (six iterations of FastTree2 + two iterations of matOptimize). This tree is used as input in the analyses in folder 4.

The resulting trees and logs of each analysis are available in `results`.

Expand All @@ -30,6 +30,7 @@ The below steps test different strategies of optimizing the starting tree. Steps

### 2.2.1 IQ-TREE Apr27 Version
The below experiments were performed with IQ-TREE 2.1.2 built on April 27, 2021
https://github.com/iqtree/iqtree2/commit/0d24c6383ccb0397cd372708d4b1e9a90715169e

| Program | Iteration | Parsimony score | Runtime (seconds) | SPR radius/rounds | Command |
|-----------|-----------|-----------------|-------------------|-------------------|---------------------------
Expand All @@ -44,8 +45,11 @@ The other IQ-TREE times increase because I was tentatively increasing the SPR ra

* longer because I forgot to switch of ml branch length optimisation, and/or because it had TBR moves in as well (which never helped so I turned off)

The folder `test-spr-radii` contains an experiment that varies SPR radius from 10 to 200, demonstrating that additional gains (decreasing parsimony score) from wider radii plateau at approximately 100.

### 2.2.2 IQ-TREE May24 Version
The below experiments were performed with IQ-TREE 2.1.2 built on May 24, 2021
https://github.com/iqtree/iqtree2/commit/d52dab2d24e7283fc28c36228ccfa9dddbf7be5c

| Program | Iteration | Parsimony score | Runtime (seconds) | SPR radius/rounds | Command |
|---------|-----------|-----------------|-------------------|-------------------|---------|
Expand Down Expand Up @@ -101,14 +105,14 @@ After running these iterations, we updated matOptimize such that it does less se
| matOptimize-new | 1 | 294318 | 9419.3 | 10/1 |matOptimize -i iter-0.pb -v alignment.vcf -o iter-1.pb -r 10 -T 32 -s 259200|
| matOptimize-new | 1 | 294313 | 6537.2 | 100/1 |matOptimize -i iter-0.pb -v alignment.vcf -o iter-1.pb -r 100 -T 32 -s 259200|

### 2.2.4 TreeRearrange
### 2.2.4 matOptimize TreeRearrange

We also tested [TreeRearrange](https://github.com/yceh/usher/tree/Refactor-FS-cleanup) on both the starting tree (iteration 0) and the final tree (iteration 4) from step 2.2.3.2.

| Program | Iteration | Parsimony score | Runtime (seconds) | SPR radius/rounds | Command |
|-----------|-----------|-----------------|-------------------|-------------------|---------|
|TreeRearrange| 1 | 294022 | 4324 | 10/1 |tree_rearrange_new -v alignment.vcf -t starting.tree -o from_start_tree.pb -T 32<br />matUtils extract -i from_start_tree.pb -t from_start_tree.tree|
|TreeRearrange| 1 (After 4 rounds of matOptimize) | 294005 | 2323.05 | 10/1 |tree_rearrange_new -v alignment.vcf -t iter-4.tree -o continue_iter4.pb -T 32<br />matUtils extract -i continue_iter4.pb -t continue_iter4.tree|
|matOptimize TreeRearrange| 1 | 294022 | 4324 | 10/1 |tree_rearrange_new -v alignment.vcf -t starting.tree -o from_start_tree.pb -T 32<br />matUtils extract -i from_start_tree.pb -t from_start_tree.tree|
|matOptimize TreeRearrange| 1 (After 4 rounds of matOptimize) | 294005 | 2323.05 | 10/1 |tree_rearrange_new -v alignment.vcf -t iter-4.tree -o continue_iter4.pb -T 32<br />matUtils extract -i continue_iter4.pb -t continue_iter4.tree|

### Note

Expand All @@ -120,7 +124,7 @@ This uses the output of 5th iteration of IQ-TREE as starting tree.

| Program | Iteration | Parsimony score | Runtime (seconds) | SPR radius/rounds | Command |
|-----------|-----------|-----------------|---------------------|-------------------|---------|
|TreeRearrange| 1 | 293862 | 2400.49 (80 threads)| 10/1 | /usr/bin/time build/tree_rearrange_new -v alignment.vcf -t iqtree_iteration5.treefile -o after-iqtree-iter5.pb<br />matUtils extract -i after-iqtree-iter5.pb -t after-iqtree-iter5.tree |
|matOptimize TreeRearrange| 1 | 293862 | 2400.49 (80 threads)| 10/1 | /usr/bin/time build/tree_rearrange_new -v alignment.vcf -t iqtree_iteration5.treefile -o after-iqtree-iter5.pb<br />matUtils extract -i after-iqtree-iter5.pb -t after-iqtree-iter5.tree |

## 2.3 Optimise starting tree with pseudo-likelihood in FastTreeMP

Expand Down Expand Up @@ -161,7 +165,7 @@ As in the previous step, prior to running this command, I set `export OMP_NUM_TH

### 2.3.3 Optimize best ML tree for parsimony

Six iterations of FastTree2 (2.3.1) yielded the best log-likelihood. Now run one iteration of matOptimize.
Six iterations of FastTree2 (2.3.1) yielded the best log-likelihood. Now run one iteration of matOptimize and one iteration of matOptimize TreeRearrange.

#### 2.3.3.1 matOptimize

Expand All @@ -173,10 +177,20 @@ Six iterations of FastTree2 (2.3.1) yielded the best log-likelihood. Now run one

| Program | Iteration | Parsimony | Runtime (seconds) | SPR radius/rounds | Command |
|-----------|-----------|-----------|--------------------|-------------------|---------|
|treeRearrange| 1 | 293866 | 988.16 (80 threads)| 10/1 |/usr/bin/time build/tree_rearrange_new -v alignment.vcf -t usher-optimized-fasttree_iteration6.tree -o after_usher_optimized_fasttree_iter6.pb<br />matUtils extract -i after_usher_optimized_fasttree_iter6.pb -t after_usher_optimized_fasttree_iter6.tree|
|matOptimize TreeRearrange| 1 | 293866 | 988.16 (80 threads)| 10/1 |/usr/bin/time build/tree_rearrange_new -v alignment.vcf -t usher-optimized-fasttree_iteration6.tree -o after_usher_optimized_fasttree_iter6.pb<br />matUtils extract -i after_usher_optimized_fasttree_iter6.pb -t after_usher_optimized_fasttree_iter6.tree|


---

### 2.3.4 Sample divergence histogram
cd results/2.3.4
python3 plotHistogram.py
The above script plots a histogram depicting the proportion of total branches with each branch length. It demonstrates the extremely low average divergence between samples in the tree.

### 2.3.5 Tree composition

These scripts count the number of nodes, homoplasies, and unique samples in the ground truth tree.

---

After running the above experiments, the tree with the lowest parsimony and highest log likelihood was the one optimized by six iterations of FastTree 2 and one iteration of matOptimize. It is available in `output/after_usher_optimized_fasttree_iter6.tree`).
After running the above experiments, we chose as the ground truth tree the one optimized by six iterations of FastTree 2 and one iteration of matOptimize. It is available in `output/after_usher_optimized_fasttree_iter6.tree`).
Binary file removed 2_optimize_starting_tree/results.tar.xz
Binary file not shown.

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built Apr 27 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host: c109762 (AVX2, FMA3, 1007 GB RAM)
Command: ../../parsimony-old/iqtree -n 0 -no-ml-dist -m JC -t iqtree_iteration1.treefile -s alignment_trimmed.fa -parsimony-spr 100 -parsimony-nni 100 -parsimony-tbr 100 -spr-radius 40 -tbr-radius 20 --suppress-list-of-sequences -blfix -nt 100 -fast -pre iqtree_iteration2
Seed: 48303 (Using SPRNG - Scalable Parallel Random Number Generator)
Time: Wed May 5 06:52:47 2021
Kernel: AVX+FMA - 100 threads (256 CPU cores detected)

Reading alignment file alignment_trimmed.fa ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
WARNING: 385 sites contain only gaps or ambiguous characters.
Alignment has 364427 sequences with 29903 columns, 29248 distinct patterns
20720 parsimony-informative, 2717 singleton sites, 6466 constant sites
Reading input tree file iqtree_iteration1.treefile ...
Before doing (up to) 100 rounds of parsimony SPR, parsimony score was 294719
Applied 169 moves (out of 288) (173 still possible) in iteration 1 (parsimony now 294535) after 29 min 25 sec
Applied 11 moves (out of 22) (12 still possible) in iteration 2 (parsimony now 294522) after 57 min 57 sec
Applied 1 move (out of 3) (1 still possible) in iteration 3 (parsimony now 294521) after 1 hrs 26 min 37 sec
Applied 1 move (out of 4) (2 still possible) in iteration 4 (parsimony now 294520) after 1 hrs 55 min 5 sec
Applied 1 move (out of 1) (1 still possible) in iteration 5 (parsimony now 294519) after 2 hrs 22 min 59 sec
Applied 0 moves (out of 0) (0 still possible) in last iteration (parsimony now 294519) (total SPR moves examined 27479081400)
Before doing (up to) 100 rounds of parsimony TBR, parsimony score was 294519
Applied 0 moves (out of 0) (0 still possible) in last iteration (parsimony now 294519) (total TBR moves examined 1593080244)
Before doing (up to) 100 rounds of parsimony NNI, parsimony score was 294519
Applied 0 moves (out of 0) (0 still possible) in last iteration (parsimony now 294519) (total NNI moves examined 728848)

NOTE: 345657 MB RAM (337 GB) is required!
Estimate model parameters (epsilon = 0.05)
1. Initial log-likelihood: -3.50711e+06
Optimal log-likelihood: -3.50711e+06
Rate parameters: A-C: 1.00000 A-G: 1.00000 A-T: 1.00000 C-G: 1.00000 C-T: 1.00000 G-T: 1.00000
Base frequencies: A: 0.250 C: 0.250 G: 0.250 T: 0.250
Parameters optimization took 1 rounds (22.6298 sec)
NOTE: 3.138 seconds to dump checkpoint file, increase to 63.000
BEST SCORE FOUND : -3507114.419
Total tree length: 10.534

Total number of iterations: 0
CPU time used for tree search: 0.192 sec (0h:0m:0s)
Wall-clock time used for tree search: 0.213 sec (0h:0m:0s)
Total CPU time used: 482592.845 sec (134h:3m:12s)
Total wall-clock time used: 11324.531 sec (3h:8m:44s)

Analysis results written to:
IQ-TREE report: iqtree_iteration2.iqtree
Maximum-likelihood tree: iqtree_iteration2.treefile
Screen log file: iqtree_iteration2.log

NOTE: 4.779 seconds to dump checkpoint file, increase to 96.000
Date and Time: Wed May 5 10:12:43 2021

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built Apr 27 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host: c109762 (AVX2, FMA3, 1007 GB RAM)
Command: ../../parsimony-old/iqtree -n 0 -no-ml-dist -m JC -t iqtree_iteration2.treefile -s alignment_trimmed.fa -parsimony-spr 100 -parsimony-nni 100 -spr-radius 60 --suppress-list-of-sequences -blfix -nt 100 -fast -pre iqtree_iteration3
Seed: 229220 (Using SPRNG - Scalable Parallel Random Number Generator)
Time: Wed May 5 10:19:52 2021
Kernel: AVX+FMA - 100 threads (256 CPU cores detected)

Reading alignment file alignment_trimmed.fa ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
WARNING: 385 sites contain only gaps or ambiguous characters.
Alignment has 364427 sequences with 29903 columns, 29248 distinct patterns
20720 parsimony-informative, 2717 singleton sites, 6466 constant sites
Reading input tree file iqtree_iteration2.treefile ...
Before doing (up to) 100 rounds of parsimony SPR, parsimony score was 294519
Applied 90 moves (out of 155) (90 still possible) in iteration 1 (parsimony now 294421) after 1 hrs 58 min 57 sec
Applied 10 moves (out of 19) (11 still possible) in iteration 2 (parsimony now 294411) after 3 hrs 56 min 1 sec
Applied 0 moves (out of 0) (0 still possible) in last iteration (parsimony now 294411) (total SPR moves examined 58029604976)
Before doing (up to) 100 rounds of parsimony NNI, parsimony score was 294411
Applied 0 moves (out of 0) (0 still possible) in last iteration (parsimony now 294411) (total NNI moves examined 728848)

NOTE: 345657 MB RAM (337 GB) is required!
Estimate model parameters (epsilon = 0.05)
1. Initial log-likelihood: -3.50601e+06
Optimal log-likelihood: -3.50601e+06
Rate parameters: A-C: 1.00000 A-G: 1.00000 A-T: 1.00000 C-G: 1.00000 C-T: 1.00000 G-T: 1.00000
Base frequencies: A: 0.250 C: 0.250 G: 0.250 T: 0.250
Parameters optimization took 1 rounds (20.348 sec)
NOTE: 3.133 seconds to dump checkpoint file, increase to 63.000
BEST SCORE FOUND : -3506005.839
Total tree length: 10.531

Total number of iterations: 0
CPU time used for tree search: 0.207 sec (0h:0m:0s)
Wall-clock time used for tree search: 0.207 sec (0h:0m:0s)
Total CPU time used: 897999.883 sec (249h:26m:39s)
Total wall-clock time used: 21459.441 sec (5h:57m:39s)

Analysis results written to:
IQ-TREE report: iqtree_iteration3.iqtree
Maximum-likelihood tree: iqtree_iteration3.treefile
Screen log file: iqtree_iteration3.log

NOTE: 4.709 seconds to dump checkpoint file, increase to 95.000
Date and Time: Wed May 5 16:28:57 2021

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
IQ-TREE multicore version 2.1.2 COVID-edition for Linux 64-bit built Apr 27 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host: c109762 (AVX2, FMA3, 1007 GB RAM)
Command: ../../parsimony-old/iqtree -n 0 -no-ml-dist -m JC -t iqtree_iteration3.treefile -s alignment_trimmed.fa -parsimony-spr 100 -parsimony-nni 100 -spr-radius 80 --suppress-list-of-sequences -blfix -nt 100 -fast -pre iqtree_iteration4
Seed: 184921 (Using SPRNG - Scalable Parallel Random Number Generator)
Time: Wed May 5 17:42:20 2021
Kernel: AVX+FMA - 100 threads (256 CPU cores detected)

Reading alignment file alignment_trimmed.fa ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
WARNING: 385 sites contain only gaps or ambiguous characters.
Alignment has 364427 sequences with 29903 columns, 29248 distinct patterns
20720 parsimony-informative, 2717 singleton sites, 6466 constant sites
Reading input tree file iqtree_iteration3.treefile ...
Before doing (up to) 100 rounds of parsimony SPR, parsimony score was 294411
Applied 70 moves (out of 119) (71 still possible) in iteration 1 (parsimony now 294335) after 4 hrs 32 min 57 sec
Applied 5 moves (out of 8) (5 still possible) in iteration 2 (parsimony now 294330) after 8 hrs 57 min 9 sec
Applied 0 moves (out of 0) (0 still possible) in last iteration (parsimony now 294330) (total SPR moves examined 138377031952)
Before doing (up to) 100 rounds of parsimony NNI, parsimony score was 294330
Applied 0 moves (out of 0) (0 still possible) in last iteration (parsimony now 294330) (total NNI moves examined 728848)

NOTE: 345657 MB RAM (337 GB) is required!
Estimate model parameters (epsilon = 0.05)
1. Initial log-likelihood: -3.50516e+06
Optimal log-likelihood: -3.50516e+06
Rate parameters: A-C: 1.00000 A-G: 1.00000 A-T: 1.00000 C-G: 1.00000 C-T: 1.00000 G-T: 1.00000
Base frequencies: A: 0.250 C: 0.250 G: 0.250 T: 0.250
Parameters optimization took 1 rounds (16.9513 sec)
NOTE: 3.171 seconds to dump checkpoint file, increase to 64.000
BEST SCORE FOUND : -3505161.008
Total tree length: 10.528

Total number of iterations: 0
CPU time used for tree search: 0.213 sec (0h:0m:0s)
Wall-clock time used for tree search: 0.230 sec (0h:0m:0s)
Total CPU time used: 2150949.151 sec (597h:29m:9s)
Total wall-clock time used: 48675.574 sec (13h:31m:15s)

Analysis results written to:
IQ-TREE report: iqtree_iteration4.iqtree
Maximum-likelihood tree: iqtree_iteration4.treefile
Screen log file: iqtree_iteration4.log

NOTE: 4.737 seconds to dump checkpoint file, increase to 95.000
Date and Time: Thu May 6 07:25:15 2021

Large diffs are not rendered by default.

Loading

0 comments on commit 1249cc2

Please sign in to comment.