-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Phase-agnostic singletons #412
Conversation
I need to add some tests for a function that inserts singletons from a separate data source, and then we could merge it and use on GEL. Let's leave undocumented for now. |
Is this ready for review & merging @nspope ? If so, do you want to squash the commits down, and I can have a look? |
there's a couple bugs yet @hyanwong that I was working on fixing -- will wrap up later this week |
Hah - that's really nice. Should we push the bug fix first, before the singleton stuff? Or is it too entangled? |
Too entangled-- it came up while writing unit tests. Singleton stuff is nearly there, though. |
I still haven't gotten to the bottom of the nan in the phase probability vector, that you ran into earlier -- let me add a debugging insert later today that will skip these and print some info, and then it'd be super helpful if you could run on the GeL trees (hopefully triggering the issue in the process). |
Sure thing, let me know when you're ready for me to clone the branch. I generally see one nan phase per 500K singletons on average, so it is very rare. |
Hey @nspope: If you rebase this PR on the latest main tsdate branch, the CI should pass. |
Hold on, found another bug :-) Thankfully minor, and has to do with times assigned to mutations above the root, but would like to fix now also I should squash these |
Let's not squash if we've got analyses lying around based on the git hashes. Keeping reproducibility is more important than a tidy git history |
Have we got any such analyses @duncanMR ? |
Just the chr17 and 16 trees that I dated yesterday for Nate's testing. I'm going to do a full rerun of all dating with singletons once Nate has finished with these changes. |
Let's do one final stress check please ... @duncanMR or @hyanwong, would you mind,
If everything seems hunky dory then we can merge. Thanks! |
I will do! I've unfortunately been locked out of the GEL research environment for some reason, but hopefully they will fix it soon. |
75% accuracy with inferred trees seems impressive to me! Sorry for the delay with the GeL testing: they are still recovering from the outage and the filesystem performance is worse than usual. |
I'm not sure whether we document Line 360 in 68ec97a
variational_gamma docstring, or are we leaving it deliberately undocumented?
|
Yes I think we do. |
Agreed. Do you want to add it to this PR, or shall I do so later? Duncan is waiting on the GEL cluster (currently still stuck) to do the testing you asked for above, then I'll merge. |
I could test with ukb trees if it helps?
…On Tue, 23 Jul 2024, 09:07 Duncan Mbuli-Robertson, ***@***.***> wrote:
I just realised that it isn't just the GeL filesystem is slow: it's
entirely broken, since the re_gecip folder with all the data in it is
inacessible. Not sure how long it will take for them to fix, unfortunately.
image.png (view on web)
<https://github.com/user-attachments/assets/18131c36-7c39-45cc-86e1-1f47b951c48a>
—
Reply to this email directly, view it on GitHub
<#412 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGFZLDN7EYM5XS35I4H4C5TZNYFMTAVCNFSM6AAAAABJFPVW2OVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENBUGUZTKNZUGY>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
That might be good, thanks Savita. Do our UKB tree seqs have (presumably unphased) singleton data in them though? |
Ah no, all singletons are phased but if we're just wanting to stress test before merging, could I not just assume they're unphased? |
Yup, that's fine-- thanks @savitakartik ! Mainly just want to see that it doesn't error out, that the timescale isn't off, etc with data of this scale. |
I think we could also just merge, and tackle any problems we hit at scale in follow-ups? |
Out of interest, I assume the phasing for singletons is statistical, and thus essentially arbitrary, unless UKB WGS phasing is done using long-read sequencing? |
It's done with Beagle AFAIK, but "arbitrary" is much too strong. I'm sure there's some signal they are using and are not just doing 50/50 guesses. |
Interesting. If they are doing anything, I assume it's using the match length, like Shapeit does:
|
Sure, let's just get it in, it's working well on simulated data. |
OK, merged |
I finally managed to run the dating on GEL! I don't see any issues in the timings and logs. tsqc is taking ages to load but I'll post screenshots when it's done: *Update: tsqc took four hours to load a tree sequence and timed out when I tried to connect it, so it doesn't look I'll be able to check the results further until I get back from my break in two weeks. |
Hooray, thanks Duncan |
Implements phase-agnostic singleton mutation likelihoods. If
singletons_phased=False
, this mode is turned on and all singletons are treated as though the phase is unknown. At some point I can support a mixture of unphased and phased singletons, as long as they're all consistent within an individual.Here are singleton ages from dated true topologies: correct-phase dating (left), and phase-agnostic dating (right):
for comparison, if we assign a random phase and use the original algorithm:
Current usage is,