-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding a centre=False option to genetic_relatedness statistic #1623
Adding a centre=False option to genetic_relatedness statistic #1623
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1623 +/- ##
==========================================
+ Coverage 89.71% 89.72% +0.01%
==========================================
Files 29 29
Lines 31298 31340 +42
Branches 6067 6077 +10
==========================================
+ Hits 28078 28120 +42
Misses 1837 1837
Partials 1383 1383
Flags with carried forward coverage won't be shown. Click here to find out more.
|
Ok, this should do it, I think? The way the option gets passed through feels a little inelegant, but I think is ok. We should think through the noncentred version a bit - what's the use case? Note that it will depend on, for instance, the number of fixed, derived mutations, which is a property we like statistics to not have. |
This looks like the right thing to me. One semantic note though, should the new method you made be called I was trying to get this to run in python and there seems to be a problem there. I'm not sure if this is correct but I added the import tskit, msprime
ts = msprime.sim_ancestry(samples=10, sequence_length=10, recombination_rate=1, random_seed=1)
n = ts.num_samples
sample_sets = [[i] for i in ts.samples()]
indexes = [(i,j) for i in ts.samples() for j in ts.samples()]
ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, proportion=False, centred=False) gives On the use case, getting the shared times between sample lineages is what I am interested in, eg, to get the MLE of the rate of evolution under Brownian motion. And even if you want to eventually mean centre things, getting the noncentred version is essential if you were interested in getting the shared times in multiple/particular epochs, eg, to let the rate of evolution be piecewise constant. |
Hm - I'm not sure what you mean (but, I believe you). Ah, I think I see - you must be doing this separately on each tree? (ie with
It's on our list to add time windows to all the stats - it sounds like you might need this? |
Thanks @petrelharp -- it would have taken me a long time to figure out that format specifier. Yes, I'm interested in doing this tree by tree. And yeah, a time-window sounds very convenient for what I'd like to do (and then I should be able to just work with the centred version). I did a test to make sure this gives me what I think it should: import tskit, msprime
import numpy as np
# run a small sim
ts = msprime.sim_ancestry(samples=3, sequence_length=1e3, recombination_rate=1e-3, random_seed=1)
# calculate two different measures of genetic_relatedness
n = ts.num_samples
sample_sets = [[i] for i in ts.samples()]
indexes = [(i,j) for i in ts.samples() for j in ts.samples()]
#centred version, NONPOLARISED
sts_grel_mc = ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, polarised=False, proportion=False, centred=True).reshape(ts.num_trees, n, n)
#noncentred version, POLARISED
sts_grel = ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, polarised=True, proportion=False, centred=False).reshape(ts.num_trees, n, n)
# my functions to get shared times with general_stat and mean centre them, for comparison with above
def shared_times(ts):
k = ts.num_samples #number of samples
W = np.identity(k) #each node i in [0,1,...,k] given row vector of weights with 1 in column i and 0's elsewhere
def f(x): return (x.reshape(-1,1) * x).flatten() #determine which pairs of samples share the branch above a node, flattened
return ts.general_stat(
W, f, k**2, mode='branch', windows='trees', polarised=True, strict=False
).reshape(ts.num_trees, k, k)
def mean_centre(square_matrix):
k,k = square_matrix.shape
Tmat = np.identity(k) - [[1/k for _ in range(k)] for _ in range(k)]
return np.matmul(Tmat, np.matmul(square_matrix, np.transpose(Tmat)))
# now calculate the shared times and mean-centred shared times with this method
sts_gstat = shared_times(ts)
sts_gstat_mc = np.array(list(map(mean_centre, sts_gstat)))
# and finally compare the two approaches
np.all(abs(sts_gstat_mc - sts_grel_mc) < 1e-8)
np.all(abs(sts_grel * 2 - sts_gstat) < 1e-8) #NOTE THE FACTOR OF 2 It turns out that to get the mean-centred shared times from The latter bit makes sense to me: we want to simply multiply a branch length by 1 if both lineages share it (eqn 8 in Ralph et al 2020 Genetics), and since here f(x,y) = x * y / 2, we also need to multiply by 2. I'm a bit surprised by the former bit (that we need Do we need more tests? Or should I go ahead with the workflow and 1) add to docstring/docs, 2) add to the change log, and 3) send this to review? |
Hm - I'm confused on two things: how will you do this without a time window? and, I don't think that the centred version will work for this, since you want the centring to be done once, across all time windows, rather than separately by time window? (I think that's how the centering would work with time windows, since the summary stat would be applied separately to each chunk of time x genome.) |
I'd calculate the non-centred shared times with
Right, good point. But in my experience the splitting into epochs and then mean centring isn't a big deal, so what we've done here already seems to do the bulk of the work. |
The first thing is good, since that's how we designed the statistic (since this is something estimable without knowing polarization, it should be computed with As for requring polarisation to get the shared times: this should make sense once we understand what removing the mean centring is doing. Let's see, from the docs, the
Ok, and this is just removing the "relative" and we get
Hm, this is puzzling. Let's see - going back to the summary function, which without mean-centering is
Ah-ha: so if each of these are single samples, then x1 and x2 are each 0/1, indicating whether the branch/allele is above the sample. With polarised=False, this is applied to each branch for both of the allele frequencies resulting from a mutation on that branch; i.e., Somehow this does the right thing for the centred version. I forget why the factor of 2 is there, but I'm sure it's supposed to be there? So - I think that everything is as expected, although confusing - do you agree? Might you be able to draft something for the docs explaining this (eg that you need |
How's this work? For each tree you get a single number; how's it get split into epochs? (sorry, I'm being slow!) |
The current tests are great! But, I think we still need tests of centreing for |
Also our addition seems to have broken some other things - see the failing tests (and let me know if you want me to investigate). |
Thanks @petrelharp!
Ah, now I get it, thanks.
Yeah, and I just realized that an alternative way to get the centred shared times is to use
This makes more sense for me when
Yep and yep.
With def epochs(shared_times, split_times):
T = shared_times[0,0] #TMRCA in this tree (any diagonal entry)
chopped_split_times = [t for t in split_times if t < T] #chop off epochs at TMRCA
tsplits = [0] + chopped_split_times + [T] #append 0 to front of split times, append TMRCA to end
Ts = [tsplits[i + 1] - tsplits[i] for i in range(len(tsplits)-1)] #amount of time in each epoch
stimes = [] #place to store shared time matrices for all epochs
# 0s for epochs beyond TMRCA
n = len(split_times) - len(chopped_split_times)
if n > 0:
k = len(shared_times)
for _ in range(n):
stimes.append(np.zeros((k,k)))
# epochs within TMRCA
for i in range(len(tsplits)-1):
# shared times in remaining epochs (subtract elapsed time, set to zero if negative, find the min vs time in epoch)
stimes.append(np.minimum(np.maximum(shared_times - (tsplits[-1] - tsplits[-(i+1)]), 0), Ts[-(i+1)]))
return np.array(stimes) #most distant epoch listed first
# little example using sts_gstat from thread above
split_times = [2, 8, 10] #this will define 4 epochs based on these three split times
sts_epochs = []
# apply to each tree
for i,sts in enumerate(sts_gstat):
sts_epochs.append(epochs(sts, split_times))
OK, I'll work on this.
And I'll try to fix these too. Thanks again. |
Hm, I'm not sure this works out - each branch only counts in one direction, since for the case of single nodes, if x1 * x2 is nonzero then (1 - x1) * (1 - x2) is zero. More generally, the factor of 2 doesn't come from adding both alleles and dividing by 2, since it applies to multiallelic sites, if I recall correctly. |
We have the proof written up in an overleaf, I'll send it to you. |
And, oh! I see about the epochs. It only works if you're working with only single samples on a single tree and all samples are at the same time, right? |
Oh, I guess my reasoning is wrong here then (and I'd be curious to see the proof you mention). But the approach does work: sts_grel_mc_pol = ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, polarised=True, proportion=False, centred=True).reshape(ts.num_trees, n, n)
np.all(abs(sts_grel_mc_pol * 2 - sts_grel_mc) < 1e-8) |
Yeah, that's the only case I've thought about. |
Let me know if you need a hand here? |
Yeah, a hand would help. I took a stab at fixing the failing tests but can't quite piece it all together. It looks like the problem is passing the extra |
I'll have a look. Separately, I see that we've called the option |
Ah-ha: the problem is that Fst calls Any suggestions here, @benjeffery, @jeromekelleher? |
I've done (two versions of) a way to fix Fst specific to the Fst method. That's enough for you to proceed here anyhow, @mmosmond . |
I was thinking the same and just committed the changes to |
I'm not following the issue here @petrelharp, but maybe the simplest thing is to hack what needs to be hacked here and to create an issue to track the necessary "unhacking"? |
Is that enough to get you going again? |
Need any assistance to get this wrapped up, @mmosmond? |
Hey @petrelharp, sorry, having a hard time finding the time right now (and it takes me a while to navigate the code). If you can find the time go for it! Sorry to drag this out... |
All good! I'll see if I can get to it... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spotted a few minor things, otherwise LGTM
Whoops - I guess I forgot to check codecov for |
Okay - looks good to go pending @mmosmond's review of the docs and concepts! |
…tic_relatedness . initialize all nodes in branch stats; closes tskit-dev#2983 right normalization in genetic relatedness, closes tskit-dev#2984 tests pass!!
lint relatedness_matrix passes tests passes all tests?
additional C testing
lint changelog apply @jeromekelleher comments lint a few more low-level tests one more test
a2d1326
to
3d3a989
Compare
docs/stats.md
Outdated
|
||
Most statistics are not be affected by invariant sites, | ||
and hence do not depend on any part of the tree that is not ancestral to any of the sample sets. | ||
However, some statistics are different: for instance, {meth}`TreeSequence.genetic_relatedness` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'However, some statistics are different: for instance, given a pair of samples, {meth}TreeSequence.genetic_relatedness
with centre=False
(and polarised=True
, the default for that method)
adds up the total number of alleles (or total area of branches) that is...'
docs/stats.md
Outdated
of samples. | ||
of sample sets. | ||
For an polarised statistic (the default) with biallelic loci, this calculates | ||
{math}`(p_1 - \bar{p}) (p_2 - \bar{p})`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth clarifying what is meant by
docs/stats.md
Outdated
@@ -588,18 +607,30 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1. | |||
For an unpolarized statistic with biallelic loci, this calculates |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'unpolarised'
Shall we merge? |
Yes! @mmosmond still happy for any suggestions / changes - we can put them in separately. |
Wow, amazing to see how you guys work! No wonder this software can do all the things it can do. This looks good to me. I can get exactly what I want (shared branch length in each tree for all pairs of samples) with import tskit, msprime
import numpy as np
ts = msprime.sim_ancestry(samples=3, sequence_length=1e3, recombination_rate=1e-3, random_seed=1)
n = ts.num_samples
sample_sets = [[i] for i in ts.samples()] #each sample in its own set
indexes = [(i,j) for i in ts.samples() for j in ts.samples()] #every pairwise comparison
ts.genetic_relatedness(sample_sets, indexes,
mode='branch', windows='trees',
span_normalise=True, polarised=True,
proportion=False, centre=False).reshape(ts.num_trees, n, n) No more pesky factor of 2! |
Yep, that's right! And you probably already noticed this, but if you don't want the actual matrix but just want to do operations with it, you can do vector-matrix-vector products with |
Your attention to detail on the documentation would be particularly helpful, though, @mmosmond - so if you haven't read it carefully then that would be helpful. |
Yeah I took a look yesterday, and again now, and (with a focus on branch statistics) all looks ok to me. The only thing I wonder about is -- as a relative novice with a very specific aim -- if it should be said explicitly under "branch" that you can get the shared branch lengths between two samples in each tree with |
Perhaps a short example in the tutorials would be nice for that? |
Yeah, that sounds more appropriate. |
Description
Adding
centre=False
option togenetic_relatedness
statistic, so that it can calculate, for example, the non-mean-centred shared times between samples.So far I've just added the option to the Python test function and a test (which passes). Next step is to develop the C code, but I haven't figured out how to do it. I think a similar change (an option to set mean to 0) needs to be added to
genetic_relatedness_summary_func
.Fixes #1619
Thanks for any help you can give @petrelharp or others!