
Likelihood #21 (Merged)

merged 34 commits into main from likelihood on Oct 17, 2023
Conversation

laurenzkeller (Collaborator)

Hi Pawel,
I implemented the log-likelihood function and created some unit tests for it. However, I was struggling with comparing R with Python because there doesn't seem to be a corresponding R function that accepts the same inputs (tree, theta matrix and sampling rate) and returns the log-likelihood. I might need to ask Xiang.

laukeller and others added 29 commits (September 22, 2023 14:55), among them:

  - "all errors should be fixed now."
  - "…tive paths, created modified_io, added 1e-10 to return value of likelihood function"
@pawel-czyz self-requested a review on October 9, 2023 14:05
@pawel-czyz (Member) left a comment

Hi Laurenz,

Truly exceptional work! I like it very much. I left some minor comments and a single important comment (regarding _io_modified.py).

I think this implementation is very nice and easy to follow. I think it'll be the reference implementation against which future implementations can be compared.

For the next (faster) implementation I'd consider these improvements:

  1. Currently the complexity is $O(n_T^2)$, where $n_T$ is the number of subtrees of tree $T$ and it's executed using for loops, which are known to be slow in Python. Using a BFS approach you don't really need to use these nested loops to construct the Q matrix. This change alone is likely to speed up the implementation by a large factor.
  2. Currently the list of subtrees of a given tree has to be calculated at every execution of the log-likelihood calculation, as the signature of the function is loglikelihood(tree, theta, sampling_time). In a future implementation it may be good to cache it:

     class LoglikelihoodSingleTree:
         def __init__(self, tree):
             self._subtrees_list = calculate_subtrees(tree)

         def loglikelihood(self, theta, sampling_time):
             ...  # the list of subtrees is already calculated

  3. Using BFS + dynamic programming + forward substitution you won't need to construct the $V$ matrix explicitly, if I recall correctly. The trick is that you can calculate the non-zero entries explicitly and solve the linear equation for a single new term in the $x$ vector.
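The forward-substitution trick can be sketched generically: if the subtrees are ordered so that $V$ is lower triangular, $Vx = b$ can be solved row by row without inverting (or even fully materializing) $V$. A minimal sketch under that assumption — the sparse-row representation and names are illustrative, not from this codebase:

```python
def forward_substitute(rows, b):
    """Solve V x = b for a lower-triangular V stored as sparse rows.

    rows[i] is a dict {j: V_ij} with j <= i and a non-zero diagonal V_ii.
    Each x[i] only needs the already-computed x[j] for j < i, so no
    matrix inversion (and no dense V) is required.
    """
    x = [0.0] * len(b)
    for i, row in enumerate(rows):
        acc = b[i]
        for j, v in row.items():
            if j < i:
                acc -= v * x[j]
        x[i] = acc / row[i]  # divide by the diagonal entry V_ii
    return x

# Example: V = [[2, 0], [1, 4]], b = [2, 5]  ->  x = [1.0, 1.0]
```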

Review comments (now outdated/resolved) were left on src/pmhn/_trees/_backend.py, src/pmhn/_trees/_io_modified.py, src/pmhn/_trees/_tree_utils.py (two threads), and src/pmhn/_trees/_simulate.py.
@laurenzkeller (Collaborator, Author)

Hi Pawel,
Thanks for the review! I don't understand your second suggestion, though. It is correct that in the current implementation the list of subtrees of a given tree is calculated during the log-likelihood calculation; however, I don't see why caching would lead to a speed-up. For each distinct tree, the loglikelihood function is executed once, and each distinct tree has a different list of subtrees.
So how would the cached list of subtrees be reused?

@pawel-czyz (Member) commented Oct 10, 2023

It is correct that in the current implementation the list of subtrees of a given tree is calculated during the log-likelihood calculation; however, I don't see why caching would lead to a speed-up. For each distinct tree, the loglikelihood function is executed once, and each distinct tree has a different list of subtrees.

The trick here is that we will calculate the likelihood of the whole data set (i.e., for each tree) many times. Think about it this way: if you have the data set $X$ and parameters $\theta$, you want to know the likelihood, which is
$$P(X\mid \theta) = \exp \sum_i \log P(X_i \mid\theta)$$

However, to understand which $\theta$ are good or bad, we need to check many of them. In other words, each loglikelihood term $\log P(X_i\mid \theta)$ will have to be calculated thousands of times (for different $\theta$).
If we don't cache the subtree list, we'll have to calculate it thousands of times as well.
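The caching pattern described above can be illustrated with a toy sketch: the per-tree precomputation runs once per tree, while the log-likelihood can then be evaluated for thousands of $\theta$ values without redoing it. All names and the toy computations here are illustrative stand-ins, not the actual pmhn code:

```python
class TreeLoglikelihood:
    """Toy sketch of caching: the expensive per-tree precomputation
    (a stand-in for calculate_subtrees) runs once in __init__, and
    loglikelihood then reuses it for every theta."""

    precompute_calls = 0  # counter, only to demonstrate the saving

    def __init__(self, tree):
        TreeLoglikelihood.precompute_calls += 1
        self._subtrees = sorted(tree)  # stand-in for the expensive step

    def loglikelihood(self, theta):
        # Toy computation that reuses the cached subtree list.
        return -theta * len(self._subtrees)


trees = [[1, 2], [3], [4, 5, 6]]
models = [TreeLoglikelihood(t) for t in trees]      # precompute once per tree
totals = [sum(m.loglikelihood(th) for m in models)  # many thetas later,
          for th in (0.1, 0.2, 0.3)]                # no recomputation happens
```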

Another suggestion, which I forgot to mention, is that keeping subtrees as a list may be suboptimal if you implement the calculation via BFS (whether you create the $Q$ matrix explicitly or use dynamic programming and forward substitution directly). Hash tables (e.g., Python's dict) may be suitable here, as you don't iterate over the subtrees explicitly but only need a quick lookup mapping from the tree to its index.
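The dict-based lookup can be sketched like this. The frozenset encoding of a subtree is purely illustrative — a real implementation would need some canonical hashable representation of each subtree:

```python
# Build the index map once; afterwards every lookup is O(1) on average,
# replacing a linear scan over the list of subtrees.
subtrees = [frozenset(), frozenset({1}), frozenset({1, 2})]
index_of = {s: i for i, s in enumerate(subtrees)}

def lookup(subtree):
    """Return the index of a subtree, or None if it is not present."""
    return index_of.get(subtree)
```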

@laurenzkeller (Collaborator, Author)

Thanks for the clarification

@laurenzkeller (Collaborator, Author) commented Oct 13, 2023

Hi Pawel,
I improved the code that I showed you on Monday for the likelihood calculation. Originally, it took approximately 12 seconds to calculate the likelihood of the 623 trees, and now it takes about 4.7 seconds. However, using forward substitution and not constructing the V matrix only speeds up the code by about 0.2 seconds (so the fastest version is at 4.7 seconds, and the second fastest, which still constructs the V matrix, is at 4.9 seconds). The speed-up is mostly due to eliminating trivial inefficiencies: previously the code did not restrict itself to the entries in the upper half of the matrix, and it did not ensure that an off-diagonal entry is only considered when the sizes of subtrees i and j differ by 1.
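The size-difference condition mentioned above can be sketched generically: only pairs of subtrees whose sizes differ by exactly 1 can contribute an off-diagonal entry, so grouping subtrees by size lets all other pairs be skipped up front. The names and the size-based grouping here are illustrative, not the PR's actual code:

```python
from collections import defaultdict

def candidate_pairs(subtree_sizes):
    """Yield index pairs (i, j) of subtrees whose sizes differ by 1.

    Only these pairs can contribute an off-diagonal entry, so all
    other pairs are skipped without any per-pair work.
    """
    by_size = defaultdict(list)
    for i, s in enumerate(subtree_sizes):
        by_size[s].append(i)
    for size, bigger in by_size.items():
        for i in bigger:
            for j in by_size.get(size - 1, []):
                yield (i, j)

# sizes [1, 2, 2, 3] -> pairs (1, 0), (2, 0), (3, 1), (3, 2)
```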

I first tried to keep the matrix construction approach and to loop only once over the subtrees list, constructing a single row of V by considering the exit nodes of a tree: for each exit node, I checked whether adding it to the tree (and subsequently removing it) yields a tree contained in the list of subtrees. (I filtered the list of subtrees by subtree size, using a dictionary, so we don't need to loop over the entire list again.) If yes, then the off-diagonal entry could be constructed. However, with this version, calculating the log-likelihood for the 623 trees took more than 2 minutes! The complexity is even higher than $O(n_T^2)$, where $n_T$ is the number of subtrees. Therefore, I stuck to the diag_entry and off_diag_entry functions from the original code (which both employ BFS).

Furthermore, I'm not sure, but I think that using the BFS approach you suggested, which constructs an entire row of the V matrix, doesn't allow for forward substitution because that would require each column of the V matrix to be constructed in one go (so exit nodes of different trees have to be considered).

You can view the two versions (4.7 and 4.9 seconds) on GitHub. I did not push the version that had a runtime of more than 2 minutes.

@pawel-czyz (Member)

Great work, thanks a lot for the update!

I improved the code that I showed you on Monday for the likelihood calculation. Originally, it took approximately 12 seconds to calculate the likelihood of the 623 trees, and now it takes about 4.7 seconds.

Great! I think it'll then be around 1 second for 120 trees. Doing 10,000 steps will therefore take around 3 hours on Euler; I think it's performant enough for our purposes 🙂 (Plus, we could resort to multiprocessing and probably reduce the runtime by a factor of 4, if needed).
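The multiprocessing fallback is straightforward because the per-tree log-likelihood terms are independent, so they can be mapped across a process pool and summed. A sketch with a placeholder per-tree computation (per_tree_loglik is a stand-in, not the real function):

```python
from multiprocessing import Pool

def per_tree_loglik(tree):
    # Placeholder for the real per-tree log-likelihood computation.
    return -float(len(tree))

def total_loglik(trees, processes=None):
    """Sum independent per-tree terms, optionally across a process pool."""
    if processes:
        with Pool(processes) as pool:
            return sum(pool.map(per_tree_loglik, trees))
    return sum(map(per_tree_loglik, trees))

if __name__ == "__main__":
    print(total_loglik([[1], [1, 2], [1, 2, 3]], processes=2))
```

Note that worker functions passed to Pool.map must be defined at module level so they can be pickled.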

I first tried to keep the matrix construction approach and I tried to loop only once over the subtrees list and construct a single row
of V (...) However, with this version, calculating the log-likelihood for the 623 trees took more than 2 minutes!

This is interesting and quite a surprising thing to me. Let's discuss it on Tuesday, I'd love to understand that better!

I'm not sure, but I think that using the BFS approach you suggested, which constructs an entire row of the V matrix, doesn't allow for forward substitution because that would require each column of the V matrix to be constructed in one go (so exit nodes of different trees have to be considered).

I may have overlooked it – another point to discuss 🙂

@pawel-czyz (Member)

I think the best way to proceed will be to separate the 30th commit (which introduces the new backends) into a new PR, so only the first 29 commits are in this one. After the minor comments are resolved, those can be merged; for the 30th commit I'd like to provide some additional feedback.

Do you know how to split a PR into two, or should I do it for you?

@laurenzkeller (Collaborator, Author)

Hi Pawel,
Thanks for the reply. I'll give it a try myself.

@laurenzkeller (Collaborator, Author)

I implemented your suggestions. Should I resolve the comments and merge this branch into the main branch?

@pawel-czyz (Member) commented Oct 16, 2023

I implemented your suggestions. Should I resolve the comments and merge this branch into the main branch?

Sounds great! Note that there's a minor conflict in one file. It looks like it can be fixed using the online editor (it's about a 3-line modification), but if you need any help, let me know!

@laurenzkeller merged commit 3777ce4 into main on Oct 17, 2023 (1 check passed) and deleted the likelihood branch on October 17, 2023 21:16.