
Likelihood #21 (Merged)

merged 34 commits into main from likelihood on Oct 17, 2023
Conversation

laurenzkeller (Collaborator)

Hi Pawel,
I implemented the log-likelihood function and created some unit tests for it. However, I was struggling with comparing R with Python because there doesn't seem to be a corresponding R function that accepts the same inputs (tree, theta matrix and sampling rate) and returns the log-likelihood. I might need to ask Xiang.

laukeller and others added 29 commits (September 22, 2023 14:55), among them:

  - "all errors should be fixed now."
  - "…tive paths, created modified_io, added 1e-10 to return value of likelihood function"
@pawel-czyz self-requested a review on October 9, 2023 14:05
@pawel-czyz (Member) left a comment

Hi Laurenz,

Truly exceptional work! I like it very much. I left some minor comments and a single important comment (regarding _io_modified.py).

I think this implementation is very nice and easy to follow. I think it'll be the reference implementation against which future implementations can be compared.

For the next (faster) implementation I'd consider these improvements:

  1. Currently the complexity is $O(n_T^2)$, where $n_T$ is the number of subtrees of tree $T$ and it's executed using for loops, which are known to be slow in Python. Using a BFS approach you don't really need to use these nested loops to construct the Q matrix. This change alone is likely to speed up the implementation by a large factor.
  2. Currently the list of subtrees of a given tree has to be calculated at every execution of the log-likelihood calculation, as the signature of the function is loglikelihood(tree, theta, sampling_time). In a future implementation it may be good to cache it:

     class LoglikelihoodSingleTree:
         def __init__(self, tree):
             self._subtrees_list = calculate_subtrees(tree)

         def loglikelihood(self, theta, sampling_time):
             ...  # the list of subtrees is already calculated

  3. Using BFS + dynamic programming + forward substitution you won't need to construct the $V$ matrix explicitly, if I recall correctly. The trick is that you can calculate the non-zero entries explicitly and solve the linear equation for a single new term in the $x$ vector.
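The forward-substitution trick can be sketched generically: if the subtrees are ordered so that $V$ is lower triangular, $Vx = b$ can be solved row by row without inverting (or even fully materializing) $V$. A minimal sketch under that assumption — the sparse-row representation and names are illustrative, not from this codebase:

```python
def forward_substitute(rows, b):
    """Solve V x = b for a lower-triangular V stored as sparse rows.

    rows[i] is a dict {j: V_ij} with j <= i and a non-zero diagonal V_ii.
    Each x[i] only needs the already-computed x[j] for j < i, so no
    matrix inversion (and no dense V) is required.
    """
    x = [0.0] * len(b)
    for i, row in enumerate(rows):
        acc = b[i]
        for j, v in row.items():
            if j < i:
                acc -= v * x[j]
        x[i] = acc / row[i]  # divide by the diagonal entry V_ii
    return x

# Example: V = [[2, 0], [1, 4]], b = [2, 5]  ->  x = [1.0, 1.0]
```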

Review comments (now outdated/resolved) were left on src/pmhn/_trees/_backend.py, src/pmhn/_trees/_io_modified.py, src/pmhn/_trees/_tree_utils.py (two threads), and src/pmhn/_trees/_simulate.py.
@laurenzkeller (Collaborator, Author)

Hi Pawel,
Thanks for the review! I don't understand your second suggestion, though. It is correct that in the current implementation the list of subtrees of a given tree is calculated during the log-likelihood calculation; however, I don't see why caching would lead to a speed-up. For each distinct tree, the loglikelihood function is executed once, and each distinct tree has a different list of subtrees.
So how would the cached list of subtrees be reused?

@pawel-czyz (Member) commented Oct 10, 2023

It is correct that in the current implementation the list of subtrees of a given tree is calculated during the log-likelihood calculation; however, I don't see why caching would lead to a speed-up. For each distinct tree, the loglikelihood function is executed once, and each distinct tree has a different list of subtrees.

The trick here is that we will calculate the likelihood of the whole data set (i.e., for each tree) many times. Think about it this way: if you have the data set $X$ and parameters $\theta$, you want to know the likelihood, which is
$$P(X\mid \theta) = \exp \sum_i \log P(X_i \mid\theta)$$

However, to understand which $\theta$ are good or bad, we need to check many of them. In other words, each loglikelihood term $\log P(X_i\mid \theta)$ will have to be calculated thousands of times (for different $\theta$).
If we don't cache the subtree list, we'll have to calculate it thousands of times as well.
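The caching pattern described above can be illustrated with a toy sketch: the per-tree precomputation runs once per tree, while the log-likelihood can then be evaluated for thousands of $\theta$ values without redoing it. All names and the toy computations here are illustrative stand-ins, not the actual pmhn code:

```python
class TreeLoglikelihood:
    """Toy sketch of caching: the expensive per-tree precomputation
    (a stand-in for calculate_subtrees) runs once in __init__, and
    loglikelihood then reuses it for every theta."""

    precompute_calls = 0  # counter, only to demonstrate the saving

    def __init__(self, tree):
        TreeLoglikelihood.precompute_calls += 1
        self._subtrees = sorted(tree)  # stand-in for the expensive step

    def loglikelihood(self, theta):
        # Toy computation that reuses the cached subtree list.
        return -theta * len(self._subtrees)


trees = [[1, 2], [3], [4, 5, 6]]
models = [TreeLoglikelihood(t) for t in trees]      # precompute once per tree
totals = [sum(m.loglikelihood(th) for m in models)  # many thetas later,
          for th in (0.1, 0.2, 0.3)]                # no recomputation happens
```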

Another suggestion, which I forgot to mention, is that keeping subtrees as a list may be suboptimal if you implement the calculation via BFS (whether you create the $Q$ matrix explicitly or use dynamic programming and forward substitution directly). Hash tables (e.g., Python's dict) may be suitable here, as you don't iterate over the subtrees explicitly but only need a quick lookup mapping from the tree to its index.
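The dict-based lookup can be sketched like this. The frozenset encoding of a subtree is purely illustrative — a real implementation would need some canonical hashable representation of each subtree:

```python
# Build the index map once; afterwards every lookup is O(1) on average,
# replacing a linear scan over the list of subtrees.
subtrees = [frozenset(), frozenset({1}), frozenset({1, 2})]
index_of = {s: i for i, s in enumerate(subtrees)}

def lookup(subtree):
    """Return the index of a subtree, or None if it is not present."""
    return index_of.get(subtree)
```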

@laurenzkeller (Collaborator, Author)

Thanks for the clarification

@laurenzkeller (Collaborator, Author) commented Oct 13, 2023

Hi Pawel,
I improved the code that I showed you on Monday for the likelihood calculation. Originally, it took approximately 12 seconds to calculate the likelihood of the 623 trees, and now it takes about 4.7 seconds. However, using forward substitution and not constructing the V matrix only speeds up the code by about 0.2 seconds (so the fastest version is at 4.7 seconds, and the second fastest, which still constructs the V matrix, is at 4.9 seconds). The speed-up is mostly due to eliminating trivial inefficiencies: previously the code did not restrict itself to the entries in the upper half of the matrix, and it did not ensure that an off-diagonal entry is only considered when the sizes of subtrees i and j differ by 1.
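The size-difference condition mentioned above can be sketched generically: only pairs of subtrees whose sizes differ by exactly 1 can contribute an off-diagonal entry, so grouping subtrees by size lets all other pairs be skipped up front. The names and the size-based grouping here are illustrative, not the PR's actual code:

```python
from collections import defaultdict

def candidate_pairs(subtree_sizes):
    """Yield index pairs (i, j) of subtrees whose sizes differ by 1.

    Only these pairs can contribute an off-diagonal entry, so all
    other pairs are skipped without any per-pair work.
    """
    by_size = defaultdict(list)
    for i, s in enumerate(subtree_sizes):
        by_size[s].append(i)
    for size, bigger in by_size.items():
        for i in bigger:
            for j in by_size.get(size - 1, []):
                yield (i, j)

# sizes [1, 2, 2, 3] -> pairs (1, 0), (2, 0), (3, 1), (3, 2)
```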

I first tried to keep the matrix construction approach and to loop only once over the subtrees list, constructing a single row of V by considering the exit nodes of a tree: for each exit node, I checked whether adding it to the tree (and subsequently removing it) yields a tree contained in the list of subtrees. (I filtered the list of subtrees by subtree size, using a dictionary, so we don't need to loop over the entire list again.) If yes, then the off-diagonal entry could be constructed. However, with this version, calculating the log-likelihood for the 623 trees took more than 2 minutes! The complexity is even higher than $O(n_T^2)$, where $n_T$ is the number of subtrees. Therefore, I stuck to the diag_entry and off_diag_entry functions from the original code (which both employ BFS).

Furthermore, I'm not sure, but I think that using the BFS approach you suggested, which constructs an entire row of the V matrix, doesn't allow for forward substitution because that would require each column of the V matrix to be constructed in one go (so exit nodes of different trees have to be considered).

You can view the two versions (4.7 and 4.9 seconds) on GitHub. I did not push the version that had a runtime of more than 2 minutes.

@pawel-czyz (Member)

Great work, thanks a lot for the update!

I improved the code that I showed you on Monday for the likelihood calculation. Originally, it took approximately 12 seconds to calculate the likelihood of the 623 trees, and now it takes about 4.7 seconds.

Great! I think it'll then be around 1 second for 120 trees. Doing 10,000 steps will therefore take around 3 hours on Euler; I think it's performant enough for our purposes 🙂 (Plus, we could resort to multiprocessing and probably reduce the runtime by a factor of 4, if needed).
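The multiprocessing fallback is straightforward because the per-tree log-likelihood terms are independent, so they can be mapped across a process pool and summed. A sketch with a placeholder per-tree computation (per_tree_loglik is a stand-in, not the real function):

```python
from multiprocessing import Pool

def per_tree_loglik(tree):
    # Placeholder for the real per-tree log-likelihood computation.
    return -float(len(tree))

def total_loglik(trees, processes=None):
    """Sum independent per-tree terms, optionally across a process pool."""
    if processes:
        with Pool(processes) as pool:
            return sum(pool.map(per_tree_loglik, trees))
    return sum(map(per_tree_loglik, trees))

if __name__ == "__main__":
    print(total_loglik([[1], [1, 2], [1, 2, 3]], processes=2))
```

Note that worker functions passed to Pool.map must be defined at module level so they can be pickled.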

I first tried to keep the matrix construction approach and I tried to loop only once over the subtrees list and construct a single row
of V (...) However, with this version, calculating the log-likelihood for the 623 trees took more than 2 minutes!

This is interesting and quite a surprising thing to me. Let's discuss it on Tuesday, I'd love to understand that better!

I'm not sure, but I think that using the BFS approach you suggested, which constructs an entire row of the V matrix, doesn't allow for forward substitution because that would require each column of the V matrix to be constructed in one go (so exit nodes of different trees have to be considered).

I may have overlooked it – another point to discuss 🙂

@pawel-czyz (Member)

I think the best way to proceed will be to separate the 30th commit (which introduces the new backends) into a new PR, so only the first 29 commits are in this one. After the minor comments are resolved, those can be merged; for the 30th commit I'd like to provide some additional feedback.

Do you know how to split a PR into two, or should I do it for you?

@laurenzkeller (Collaborator, Author)

Hi Pawel,
Thanks for the reply. I'll give it a try myself.

@laurenzkeller (Collaborator, Author)

I implemented your suggestions. Should I resolve the comments and merge this branch into the main branch?

@pawel-czyz (Member) commented Oct 16, 2023

I implemented your suggestions. Should I resolve the comments and merge this branch into the main branch?

Sounds great! Note that there's a minor conflict in one file. It looks like it can be fixed using the online editor (it's about a 3-line modification), but if you need any help, let me know!

@laurenzkeller merged commit 3777ce4 into main on Oct 17, 2023 (1 check passed) and deleted the likelihood branch on October 17, 2023 21:16.