Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phase-agnostic singletons #412

Merged
merged 29 commits into from
Jul 24, 2024
Merged

Phase-agnostic singletons #412

merged 29 commits into from
Jul 24, 2024

Conversation

nspope
Copy link
Contributor

@nspope nspope commented Jun 12, 2024

Implements phase-agnostic singleton mutation likelihoods. If singletons_phased=False, this mode is turned on and all singletons are treated as though the phase is unknown. At some point I can support a mixture of unphased and phased singletons, as long as they're all consistent within an individual.

Here are singleton ages from dated true topologies: correct-phase dating (left), and phase-agnostic dating (right):

for comparison, if we assign a random phase and use the original algorithm:

Current usage is,

ts2 = tsdate.phasing.insert_unphased_singletons(ts, position=..., individual=..., ancestral_state=..., derived_state=...)
tsdate.date(ts2, mutation_rate=..., singletons_phased=False)

@hyanwong
Copy link
Member

See #374 for more discussion. How do we want to proceed with merging this, @nspope ?

@nspope
Copy link
Contributor Author

nspope commented Jun 12, 2024

I need to add some tests for a function that inserts singletons from a separate data source, and then we could merge it and use on GEL. Let's leave undocumented for now.

@hyanwong
Copy link
Member

Is this ready for review & merging @nspope ? If so, do you want to squash the commits down, and I can have a look?

@nspope
Copy link
Contributor Author

nspope commented Jun 24, 2024

there's a couple bugs yet @hyanwong that I was working on fixing -- will wrap up later this week

@nspope
Copy link
Contributor Author

nspope commented Jul 16, 2024

Orthogonal to singleton phasing, but this branch fixes a bug in path normalisation that seems to improve bias with inferred trees.

If we match segregating sites (no path-norm):

If we match path length (path-norm):

As the bias in really in the older mutations, we might see a bigger impact if there's more samples/more sequence (this is 1000 diploids on 10Mb)? Worth testing.

@hyanwong
Copy link
Member

Hah - that's really nice. Should we push the bug fix first, before the singleton stuff? Or is it too entangled?

@nspope
Copy link
Contributor Author

nspope commented Jul 16, 2024

Too entangled-- it came up while writing unit tests. Singleton stuff is nearly there, though.

@duncanMR
Copy link

Orthogonal to singleton phasing, but this branch fixes a bug in path normalisation that seems to improve bias with inferred trees.

If we match segregating sites (no path-norm):

If we match path length (path-norm):

As the bias in really in the older mutations, we might see a bigger impact if there's more samples/more sequence (this is 1000 diploids on 10Mb)? Worth testing.

That's terrific! I just started re-dating all the chromosome chunks in GeL with singletons, besides the chr17 ones that I'd done already, so it would be great to use this. Is it okay to use this branch now, or should I wait for further changes before using it?

@nspope
Copy link
Contributor Author

nspope commented Jul 17, 2024

I still haven't gotten to the bottom of the nan in the phase probability vector, that you ran into earlier -- let me add a debugging insert later today that will skip these and print some info, and then it'd be super helpful if you could run on the GeL trees (hopefully triggering the issue in the process).

@duncanMR
Copy link

I still haven't gotten to the bottom of the nan in the phase probability vector, that you ran into earlier -- let me add a debugging insert later today that will skip these and print some info, and then it'd be super helpful if you could run on the GeL trees (hopefully triggering the issue in the process).

Sure thing, let me know when you're ready for me to clone the branch. I generally see one nan phase per 500K singletons on average, so it is very rare.

@hyanwong
Copy link
Member

Hey @nspope: If you rebase this PR on the latest main tsdate branch, the CI should pass.

@nspope
Copy link
Contributor Author

nspope commented Jul 18, 2024

Hold on, found another bug :-) Thankfully minor, and has to do with times assigned to mutations above the root, but would like to fix now

also I should squash these

@jeromekelleher
Copy link
Member

jeromekelleher commented Jul 18, 2024

Let's not squash if we've got analyses lying around based on the git hashes. Keeping reproducibility is more important than a tidy git history

@hyanwong
Copy link
Member

Let's not squash if we've got analyses lying around based on the git hashes. Keeping reproducibility is more important than a tidy git history

Have we got any such analyses @duncanMR ?

@duncanMR
Copy link

Let's not squash if we've got analyses lying around based on the git hashes. Keeping reproducibility is more important than a tidy git history

Have we got any such analyses @duncanMR ?

Just the chr17 and 16 trees that I dated yesterday for Nate's testing. I'm going to do a full rerun of all dating with singletons once Nate has finished with these changes.

@nspope
Copy link
Contributor Author

nspope commented Jul 19, 2024

Let's do one final stress check please ... @duncanMR or @hyanwong, would you mind,

  1. Run through on GEL to make sure nothing errors
  2. Run through again with rescaling_iterations=1
  3. Run through again with match_segregating_sites=True
  4. Set logging to info so there's timings printed
  5. Take a gance at (1), (2), and (3) in tsqc and make sure it looks sane

If everything seems hunky dory then we can merge. Thanks!

@duncanMR
Copy link

Let's do one final stress check please ... @duncanMR or @hyanwong, would you mind,

1. Run through on GEL to make sure nothing errors

2. Run through again with `rescaling_iterations=1`

3. Run through again with `match_segregating_sites=True`

4. Set logging to info so there's timings printed

5. Take a gance at (1), (2), and (3) in tsqc and make sure it looks sane

If everything seems hunky dory then we can merge. Thanks!

I will do! I've unfortunately been locked out of the GEL research environment for some reason, but hopefully they will fix it soon.

@nspope
Copy link
Contributor Author

nspope commented Jul 21, 2024

Did some tests on simulated chr17 with DFE over exons and 4-population OOA model and 8000 haploids:

It does look like the path rescaling (second fig) is helping with bias relative to segsites rescaling (third fig), esp for older mutations (axis labels should read "mutation" not "node" here, whoops).

@nspope
Copy link
Contributor Author

nspope commented Jul 21, 2024

More tests on the same simulated data, this time with singleton phasing (the plots below only show singleton mutations):

The MAP estimate of phase gets 25% of singleton phases wrong on inferred trees -- not great, but better than 50% wrong I guess. Similarly, on dated true trees we get 23% of singleton phases wrong. As a baseline, we can look at the true trees without dating, and:

  1. For each singleton, look at which parent of the individual at that position is older
  2. The older parent would be the MAP estimate for phase
  3. In some sense this is the best we could possibly do to guess singleton phase on the basis of node times, as it removes all uncertainty from dating

Using this strategy, we still get 24% of singleton phases wrong -- so I think we're at the fundamental limit for using haplotype age to infer phase.

@duncanMR
Copy link

75% accuracy with inferred trees seems impressive to me! Sorry for the delay with the GeL testing: they are still recovering from the outage and the filesystem performance is worse than usual.

@hyanwong
Copy link
Member

hyanwong commented Jul 22, 2024

I'm not sure whether we document match_segregating_sites in the public docs (although we have some text describing it in

# :param bool match_segregating_sites: if True, match the total number of
). Do we want to add this to the variational_gamma docstring, or are we leaving it deliberately undocumented?

@nspope
Copy link
Contributor Author

nspope commented Jul 22, 2024

Do we want to add this to the variational_gamma docstring

Yes I think we do.

@hyanwong
Copy link
Member

Do we want to add this to the variational_gamma docstring

Yes I think we do.

Agreed. Do you want to add it to this PR, or shall I do so later?

Duncan is waiting on the GEL cluster (currently still stuck) to do the testing you asked for above, then I'll merge.

@duncanMR
Copy link

duncanMR commented Jul 23, 2024

I just realised that it isn't just the GeL filesystem is slow: it's entirely broken, since the re_gecip folder with all the data in it is inaccessible. Not sure how long it will take for them to fix, unfortunately.

image

@savitakartik
Copy link

savitakartik commented Jul 23, 2024 via email

@hyanwong
Copy link
Member

That might be good, thanks Savita. Do our UKB tree seqs have (presumably unphased) singleton data in them though?

@savitakartik
Copy link

Ah no, all singletons are phased but if we're just wanting to stress test before merging, could I not just assume they're unphased?

@nspope
Copy link
Contributor Author

nspope commented Jul 23, 2024

Yup, that's fine-- thanks @savitakartik ! Mainly just want to see that it doesn't error out, that the timescale isn't off, etc with data of this scale.

@jeromekelleher
Copy link
Member

I think we could also just merge, and tackle any problems we hit at scale in follow-ups?

@hyanwong
Copy link
Member

Ah no, all singletons are phased but if we're just wanting to stress test before merging, could I not just assume they're unphased?

Out of interest, I assume the phasing for singletons is statistical, and thus essentially arbitrary, unless UKB WGS phasing is done using long-read sequencing?

@jeromekelleher
Copy link
Member

Out of interest, I assume the phasing for singletons is statistical, and thus essentially arbitrary, unless UKB WGS phasing is done using long-read sequencing?

It's done with Beagle AFAIK, but "arbitrary" is much too strong. I'm sure there's some signal they are using and are not just doing 50/50 guesses.

@hyanwong
Copy link
Member

It's done with Beagle AFAIK, but "arbitrary" is much too strong. I'm sure there's some signal they are using and are not just doing 50/50 guesses.

Interesting. If they are doing anything, I assume it's using the match length, like Shapeit does:

For singleton variants (minor allele count (MAC) of 1), SHAPEIT5 uses another phasing model that (1) assumes singletons to be recent mutation events and (2) leverages IBD sharing patterns between haplotypes to make inference (Fig. 1c). Specifically, our model identifies the longest possible match in the dataset for each target haplotype. By definition, these matches point to haplotypes sharing recent common ancestors with the target and their lengths indicate the number of generations separating them: the shorter the match, the older the common ancestor. Our model assumes that an older common ancestor means more time for a mutation to occur on that lineage and therefore assigns the minor alleles of singletons to the target haplotype with the shortest match25.

@nspope
Copy link
Contributor Author

nspope commented Jul 24, 2024

I think we could also just merge

Sure, let's just get it in, it's working well on simulated data.

@hyanwong
Copy link
Member

OK, merged

@hyanwong hyanwong merged commit 0e25baf into tskit-dev:main Jul 24, 2024
3 checks passed
@duncanMR
Copy link

duncanMR commented Jul 26, 2024

Let's do one final stress check please ... @duncanMR or @hyanwong, would you mind,

1. Run through on GEL to make sure nothing errors

2. Run through again with `rescaling_iterations=1`

3. Run through again with `match_segregating_sites=True`

4. Set logging to info so there's timings printed

5. Take a gance at (1), (2), and (3) in tsqc and make sure it looks sane

If everything seems hunky dory then we can merge. Thanks!

I finally managed to run the dating on GEL! I don't see any issues in the timings and logs. tsqc is taking ages to load but I'll post screenshots when it's done:

image

image

image

*Update: tsqc took four hours to load a tree sequence and timed out when I tried to connect it, so it doesn't look I'll be able to check the results further until I get back from my break in two weeks.

@hyanwong
Copy link
Member

Hooray, thanks Duncan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants