Bayes #23

Closed
wants to merge 68 commits into from
Conversation

laurenzkeller (Collaborator)

Hi Pawel,

I created the file treemhn-bayes.smk, which is very similar to the mhn-bayes.smk file. Running mhn-bayes.smk works perfectly, but when I run treemhn-bayes.smk, my system runs out of memory and the process is killed. I haven't found the error; even with very small data sets it doesn't work. I've spent quite some time debugging without success.
I also created the file _treemhn.py, where the TreeMHNLoglikelihood operation is defined. Additionally, you can find the loglikelihood_tree_list function in the _backend_geno.py file, which calculates the likelihood of a list of trees. Could you please have a look at the code?

laukeller and others added 30 commits September 22, 2023 14:55
all errors should be fixed now.
…tive paths, created modified_io, added 1e-10 to return value of likelihood function
@pawel-czyz (Member) left a comment

Hi Laurenz, great work! I'll take a closer look tomorrow or on Saturday – just one question: did you try running with e.g., only 2 samples or just the values in the Snakemake file? (Which are reasonable in terms of the collected sample size, but perhaps too computationally intense...)

pawel-czyz (Member)

What's the reason for removing this file?

laurenzkeller (Collaborator, Author)

This file contained the tests for the oldest version of the likelihood implementation. I will need to create a new file.

pawel-czyz (Member)

In #22 there's an updated version of it now 🙂

```python
loglikelihoods = self._backend.loglikelihood_tree_list(
    trees=self._data,
    theta=theta_np,
    sampling_rate=self._mean_sampling_time,
```
pawel-czyz (Member)

We'll have to systematize the naming here.

@laurenzkeller (Collaborator, Author)

Hi Pawel,

I found the problem: the generated trees had sizes up to 35k nodes! At first I thought that running out of memory was due to the sampling process; however, it is because the generated trees were too large. Initially I suspected the tree-generating process was buggy, but the R version also produces such trees (with the MHN from the sample_spike_and_slab function). With small sampling rates and MHNs that have many negative entries, smaller trees are generated and the sampling works fine. Setting a maximum tree size wouldn't make sense, right? Because the likelihood calculation of a tree would change (as we discussed in our last meeting).
How should I proceed? I can now plot the posterior thetas (which still differ from the ground truth). Is there a smart way to obtain the MAP estimate from the samples, or should I just take the theta that was sampled most often?
It would be great if I could run the simulations on Euler.

@pawel-czyz (Member) commented Oct 27, 2023

> I found the problem. The reason was that the generated trees had sizes up to 35k! At first, I thought that running out of memory was due to the sampling process. However, it is because the trees generated were too large.

Great that you've found it! Indeed, the trees can be very large. It's always good to look at some summary statistics of the data (e.g., the distribution of node counts or the number of subtrees).
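For instance, the node-count distribution can be summarized in a few lines. A minimal sketch, where `tree_sizes` is a hypothetical list of node counts (one per simulated tree; the values below are illustrative only):

```python
import statistics

# Hypothetical node counts, one per simulated tree (illustrative values only).
tree_sizes = [3, 5, 8, 12, 40, 35_000]

print("number of trees:", len(tree_sizes))
print("median size:", statistics.median(tree_sizes))  # robust to outliers
print("mean size:", round(statistics.mean(tree_sizes), 1))
print("max size:", max(tree_sizes))  # immediately reveals the huge outlier
```

The gap between the median and the mean/max is exactly what flags a few enormous trees dominating memory use.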

> Setting a maximum tree size wouldn't make sense, right? Because the likelihood calculation of a tree would change (as we discussed in our last meeting).

Indeed, the model would then be misspecified and inference could yield biased results. Understanding this misspecification will be useful (and it's partially addressed by experiments (b) and (c)).

> How should I proceed? I can now plot the posterior thetas (which still differ from the ground truth). Is there a smart way to obtain the MAP estimate from the samples, or should I just consider the theta that was sampled most?

The MAP is generally different from the sample with the highest posterior density (see e.g., https://discourse.mc-stan.org/t/the-typical-set-and-its-relevance-to-bayesian-computation/17174). A summary like the mean may be meaningful when there's no multimodality. Generally, it'd be good to look at the posteriors of the individual entries (as I did in one of the presentation slides) and check whether multimodality is an issue here (and if not, we can look at the means and standard errors).
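Per-entry means and (naive) standard errors can be read off the pooled chains directly. A sketch assuming the draws are stored as a NumPy array of shape (chains, draws, n, n); random noise stands in for real MCMC output, and note that a proper Monte Carlo standard error should account for autocorrelation (e.g., via the effective sample size):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for real MCMC draws of theta: (n_chains, n_draws, n, n).
samples = rng.normal(size=(8, 500, 4, 4))

flat = samples.reshape(-1, *samples.shape[2:])  # pool chains and draws
post_mean = flat.mean(axis=0)                   # posterior mean per entry
post_sd = flat.std(axis=0, ddof=1)              # posterior sd per entry
naive_se = post_sd / np.sqrt(flat.shape[0])     # ignores autocorrelation

print(post_mean.shape, float(naive_se.max()))
```

Plotting per-entry histograms from `flat` is then the quickest multimodality check before trusting the means.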

> It would be great if I could run the simulations on Euler.

Indeed! You already have permission to submit jobs on several cores, and I hope you'll also get access to more cores 🙂
Submitting jobs can be done by following the SLURM instructions listed here.

@laurenzkeller (Collaborator, Author)

Hi Pawel,
I only noticed yesterday that the posterior theta plots looked so strange because I passed an empty list of trees to the likelihood_tree_list function (I was really stupid). Now the issue is that the sampling process is really time-consuming. As a guest user on Euler, I have access to just 48 cores, which limits me to running simulations for a very small dataset (e.g., 10 patients across 8 chains). Already for 100 patients it takes forever.
Thanks for fixing the last PR.
See you in 3 hours 🙂

@pawel-czyz (Member)

Hi Laurenz, since there are now seven conflicting files and *.png and *.csv files shouldn't really be stored in Git, it may be worth closing this PR and opening a new one (branching from the current main branch) in which the changes are introduced again (mostly based on the code here).
