Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a centre=False option to genetic_relatedness statistic #1623

Merged
merged 5 commits into from
Sep 17, 2024

Conversation

mmosmond
Copy link
Contributor

@mmosmond mmosmond commented Aug 12, 2021

Description

Adding centre=False option to genetic_relatedness statistic, so that it can calculate, for example, the non-mean-centred shared times between samples.

So far I've just added the option to the Python test function and a test (which passes). Next step is to develop the C code, but I haven't figured out how to do it. I think a similar change (an option to set mean to 0) needs to be added to genetic_relatedness_summary_func.

Fixes #1619

Thanks for any help you can give @petrelharp or others!

@codecov
Copy link

codecov bot commented Aug 12, 2021

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.72%. Comparing base (024e042) to head (275b27d).
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1623      +/-   ##
==========================================
+ Coverage   89.71%   89.72%   +0.01%     
==========================================
  Files          29       29              
  Lines       31298    31340      +42     
  Branches     6067     6077      +10     
==========================================
+ Hits        28078    28120      +42     
  Misses       1837     1837              
  Partials     1383     1383              
Flag Coverage Δ
c-tests 86.49% <100.00%> (+0.02%) ⬆️
lwt-tests 80.78% <ø> (ø)
python-c-tests 88.99% <100.00%> (+<0.01%) ⬆️
python-tests 99.01% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
c/tskit/trees.c 90.50% <100.00%> (+0.05%) ⬆️
python/_tskitmodule.c 88.99% <100.00%> (+<0.01%) ⬆️
python/tskit/trees.py 98.78% <100.00%> (+<0.01%) ⬆️

@petrelharp
Copy link
Contributor

Ok, this should do it, I think? The way the option gets passed through feels a little inelegant, but I think is ok.

We should think through the noncentred version a bit - what's the use case? Note that it will depend on, for instance, the number of fixed, derived mutations, which is a property we like statistics to not have.

@mmosmond
Copy link
Contributor Author

This looks like the right thing to me. One semantic note though, should the new method you made be called genetic_relatedness_noncentred_summary_func rather than genetic_relatedness_centred_summary_func?

I was trying to get this to run in python and there seems to be a problem there. I'm not sure if this is correct but I added the centred=T/F option to __k_way_sample_set_stat in trees.py, but then there seems to be an error in parsing the keyword arguments, and I'm not sure what to do to fix it. More specifically,

import tskit, msprime

ts = msprime.sim_ancestry(samples=10, sequence_length=10, recombination_rate=1, random_seed=1)

n = ts.num_samples
sample_sets = [[i] for i in ts.samples()]
indexes = [(i,j) for i in ts.samples() for j in ts.samples()]

ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, proportion=False, centred=False)

gives SystemError: More keyword list entries (8) than format specifiers (7), where the error comes from line 5883 in trees.py.

On the use case, getting the shared times between sample lineages is what I am interested in, eg, to get the MLE of the rate of evolution under Brownian motion. And even if you want to eventually mean centre things, getting the noncentred version is essential if you were interested in getting the shared times in multiple/particular epochs, eg, to let the rate of evolution be piecewise constant.

@petrelharp
Copy link
Contributor

getting the shared times between sample lineages is what I am interested in, eg, to get the MLE of the rate of evolution under Brownian motion.

Hm - I'm not sure what you mean (but, I believe you). Ah, I think I see - you must be doing this separately on each tree? (ie with windows='trees')?

if you were interested in getting the shared times in multiple/particular epochs, eg, to let the rate of evolution be piecewise constant.

It's on our list to add time windows to all the stats - it sounds like you might need this?

@mmosmond
Copy link
Contributor Author

Thanks @petrelharp -- it would have taken me a long time to figure out that format specifier.

Yes, I'm interested in doing this tree by tree.

And yeah, a time-window sounds very convenient for what I'd like to do (and then I should be able to just work with the centred version).

I did a test to make sure this gives me what I think it should:

import tskit, msprime
import numpy as np

# run a small sim
ts = msprime.sim_ancestry(samples=3, sequence_length=1e3, recombination_rate=1e-3, random_seed=1)

# calculate two different measures of genetic_relatedness
n = ts.num_samples
sample_sets = [[i] for i in ts.samples()]
indexes = [(i,j) for i in ts.samples() for j in ts.samples()]
#centred version, NONPOLARISED
sts_grel_mc = ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, polarised=False, proportion=False, centred=True).reshape(ts.num_trees, n, n) 
#noncentred version, POLARISED
sts_grel = ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, polarised=True, proportion=False, centred=False).reshape(ts.num_trees, n, n) 

# my functions to get shared times with general_stat and mean centre them, for comparison with above
def shared_times(ts):
    k = ts.num_samples #number of samples
    W = np.identity(k) #each node i in [0,1,...,k] given row vector of weights with 1 in column i and 0's elsewhere 
    def f(x): return (x.reshape(-1,1) * x).flatten() #determine which pairs of samples share the branch above a node, flattened
    return ts.general_stat(
        W, f, k**2, mode='branch', windows='trees', polarised=True, strict=False
    ).reshape(ts.num_trees, k, k)

def mean_centre(square_matrix):
    k,k = square_matrix.shape
    Tmat = np.identity(k) - [[1/k for _ in range(k)] for _ in range(k)] 
    return np.matmul(Tmat, np.matmul(square_matrix, np.transpose(Tmat)))

# now calculate the shared times and mean-centred shared times with this method
sts_gstat = shared_times(ts)
sts_gstat_mc = np.array(list(map(mean_centre, sts_gstat)))

# and finally compare the two approaches
np.all(abs(sts_gstat_mc - sts_grel_mc) < 1e-8)
np.all(abs(sts_grel * 2 - sts_gstat) < 1e-8) #NOTE THE FACTOR OF 2

It turns out that to get the mean-centred shared times from genetic_relatedness we need polarised=False but to get the non-mean-centred shared times from genetic_relatedness we need polarised=True and to multiply the result by 2.

The latter bit makes sense to me: we want to simply multiply a branch length by 1 if both lineages share it (eqn 8 in Ralph et al 2020 Genetics), and since here f(x,y) = x * y / 2, we also need to multiply by 2. I'm a bit surprised by the former bit (that we need polarised=False to get the mean-centred shared times), but glad everything works!!

Do we need more tests? Or should I go ahead with the workflow and 1) add to docstring/docs, 2) add to the change log, and 3) send this to review?

@petrelharp
Copy link
Contributor

And yeah, a time-window sounds very convenient for what I'd like to do (and then I should be able to just work with the centred version).

Hm - I'm confused on two things: how will you do this without a time window? and, I don't think that the centred version will work for this, since you want the centring to be done once, across all time windows, rather than separately by time window? (I think that's how the centering would work with time windows, since the summary stat would be applied separately to each chunk of time x genome.)

@mmosmond
Copy link
Contributor Author

how will you do this without a time window?

I'd calculate the non-centred shared times with genetic_relatedness, write a little script to convert the output to the shared times (in each tree) in each epoch, multiply each epoch by a factor, add them all back together, and mean centre.

I don't think that the centred version will work for this, since you want the centring to be done once, across all time windows, rather than separately by time window?

Right, good point. But in my experience the splitting into epochs and then mean centring isn't a big deal, so what we've done here already seems to do the bulk of the work.

@petrelharp
Copy link
Contributor

It turns out that to get the mean-centred shared times from genetic_relatedness we need polarised=False but to get the non-mean-centred shared times from genetic_relatedness we need polarised=True and to multiply the result by 2.

The first thing is good, since that's how we designed the statistic (since this is something estimable without knowing polarization, it should be computed with polarised=False; and this is the default, since polarization is not observable from the data).

As for requring polarisation to get the shared times: this should make sense once we understand what removing the mean centring is doing. Let's see, from the docs, the site version does:

Number of pairwise allelic matches in the window between two sample sets relative to the rest of the sample sets. To be precise, let m(u,v) denote the total number of alleles shared between nodes u and v, and let m(I,J) be the sum of m(u,v) over all nodes u in sample set I and v in sample set J. Let S and T be independently chosen sample sets. Then, for sample sets I and J, this computes E[m(I,J) - m(I,S) - m(J,T) + m(S,T)]. 

Ok, and this is just removing the "relative" and we get E[m(I,J)]. With polarised=False this includes the ancestral allele as well.
Hm, and the branch version:

Total area of branches in the window ancestral to pairs of samples in two sample sets relative to the rest of the sample sets. To be precise, let B(u,v) denote the total area of all branches ancestral to nodes u and v, and let B(I,J) be the sum of B(u,v) over all nodes u in sample set I and v in sample set J. Let S and T be two independently chosen sample sets. Then for sample sets I and J, this computes E[B(I,J) - B(I,S) - B(J,T) + B(S,T)].

Hm, this is puzzling. Let's see - going back to the summary function, which without mean-centering is

 f(x1, x2) = x1 * x2 / 2

Ah-ha: so if each of these are single samples, then x1 and x2 are each 0/1, indicating whether the branch/allele is above the sample. With polarised=False, this is applied to each branch for both of the allele frequencies resulting from a mutation on that branch; i.e., f(x1, x2) + f(1-x1, 1-x2) (since I'm doing the example of single samples). So if a branch is ancestral to neither sample, then it counts also. This makes good sense, because as we noted above, a mutation that is inherited by neither sample contributes to the shared number of alleles, since both get the ancestral allele at that site.

Somehow this does the right thing for the centred version. I forget why the factor of 2 is there, but I'm sure it's supposed to be there?

So - I think that everything is as expected, although confusing - do you agree? Might you be able to draft something for the docs explaining this (eg that you need polarised=True to get non-mean-centred shared branches, and maybe why)?

@petrelharp
Copy link
Contributor

I'd calculate the non-centred shared times with genetic_relatedness, write a little script to convert the output to the shared times (in each tree) in each epoch, multiply each epoch by a factor, add them all back together, and mean centre.

How's this work? For each tree you get a single number; how's it get split into epochs? (sorry, I'm being slow!)

@petrelharp
Copy link
Contributor

Do we need more tests? Or should I go ahead with the workflow and 1) add to docstring/docs, 2) add to the change log, and 3) send this to review?

The current tests are great! But, I think we still need tests of centreing for site and node statistics.

@petrelharp
Copy link
Contributor

Also our addition seems to have broken some other things - see the failing tests (and let me know if you want me to investigate).

@mmosmond
Copy link
Contributor Author

Thanks @petrelharp!

So if a branch is ancestral to neither sample, then it counts also.

Ah, now I get it, thanks.

Somehow this does the right thing for the centred version.

Yeah, and I just realized that an alternative way to get the centred shared times is to use polarised=True and multiply by 2 (as we do for the non-centred version). This is the most intuitive way, for me, to think about these branch statistics, as it means you just add up the lengths of shared branches (in reality you add half the length of each shared branch, but then multiply by everything by 2).

I forget why the factor of 2 is there, but I'm sure it's supposed to be there?

This makes more sense for me when mode='site' and polarised=False , because then if two sample nodes share an allele at a site they get a 1/2 for sharing the allele and 1/2 for sharing the absence of the other allele, giving 1. But it is harder for me to justify the 1/2 when used as a branch statistic. More thought needed...

So - I think that everything is as expected, although confusing - do you agree? Might you be able to draft something for the docs explaining this (eg that you need polarised=True to get non-mean-centred shared branches, and maybe why)?

Yep and yep.

How's this work? For each tree you get a single number; how's it get split into epochs? (sorry, I'm being slow!)

With n_e epochs and k samples, for each tree you get n_e (k x k) matrices. I've been doing this like (I'm sure this can be improved!):

def epochs(shared_times, split_times):

    T = shared_times[0,0] #TMRCA in this tree (any diagonal entry)
    chopped_split_times = [t for t in split_times if t < T] #chop off epochs at TMRCA
    tsplits = [0] + chopped_split_times + [T] #append 0 to front of split times, append TMRCA to end
    Ts = [tsplits[i + 1] - tsplits[i] for i in range(len(tsplits)-1)] #amount of time in each epoch
    
    stimes = [] #place to store shared time matrices for all epochs

    # 0s for epochs beyond TMRCA
    n = len(split_times) - len(chopped_split_times)
    if n  > 0:
        k = len(shared_times)
        for _ in range(n):
            stimes.append(np.zeros((k,k)))
    
    # epochs within TMRCA
    for i in range(len(tsplits)-1):
        # shared times in remaining epochs (subtract elapsed time, set to zero if negative, find the min vs time in epoch)
        stimes.append(np.minimum(np.maximum(shared_times - (tsplits[-1] - tsplits[-(i+1)]), 0), Ts[-(i+1)])) 
    
    return np.array(stimes) #most distant epoch listed first

# little example using sts_gstat from thread above
split_times = [2, 8, 10] #this will define 4 epochs based on these three split times
sts_epochs = []
# apply to each tree
for i,sts in enumerate(sts_gstat):
    sts_epochs.append(epochs(sts, split_times))

The current tests are great! But, I think we still need tests of centreing for site and node statistics.

OK, I'll work on this.

Also our addition seems to have broken some other things - see the failing tests (and let me know if you want me to investigate).

And I'll try to fix these too.

Thanks again.

@petrelharp
Copy link
Contributor

Yeah, and I just realized that an alternative way to get the centred shared times is to use polarised=True and multiply by 2 (as we do for the non-centred version). This is the most intuitive way, for me, to think about these branch statistics, as it means you just add up the lengths of shared branches (in reality you add half the length of each shared branch, but then multiply by everything by 2).

Hm, I'm not sure this works out - each branch only counts in one direction, since for the case of single nodes, if x1 * x2 is nonzero then (1 - x1) * (1 - x2) is zero. More generally, the factor of 2 doesn't come from adding both alleles and dividing by 2, since it applies to multiallelic sites, if I recall correctly.

@petrelharp
Copy link
Contributor

We have the proof written up in an overleaf, I'll send it to you.

@petrelharp
Copy link
Contributor

And, oh! I see about the epochs. It only works if you're working with only single samples on a single tree and all samples are at the same time, right?

@mmosmond
Copy link
Contributor Author

Hm, I'm not sure this works out - each branch only counts in one direction, since for the case of single nodes, if x1 * x2 is nonzero then (1 - x1) * (1 - x2) is zero. More generally, the factor of 2 doesn't come from adding both alleles and dividing by 2, since it applies to multiallelic sites, if I recall correctly.

Oh, I guess my reasoning is wrong here then (and I'd be curious to see the proof you mention). But the approach does work:

sts_grel_mc_pol = ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, polarised=True, proportion=False, centred=True).reshape(ts.num_trees, n, n)
np.all(abs(sts_grel_mc_pol * 2 - sts_grel_mc) < 1e-8)

@mmosmond
Copy link
Contributor Author

And, oh! I see about the epochs. It only works if you're working with only single samples on a single tree and all samples are at the same time, right?

Yeah, that's the only case I've thought about.

@petrelharp
Copy link
Contributor

Let me know if you need a hand here?

@mmosmond
Copy link
Contributor Author

Yeah, a hand would help. I took a stab at fixing the failing tests but can't quite piece it all together. It looks like the problem is passing the extra centre=False keyword argument to the lower level diversity stat, but I don't get exactly where/how that is happening in the tests, and so I'm hesitant to try any fixes.

@petrelharp
Copy link
Contributor

I'll have a look. Separately, I see that we've called the option centred, while in the test code it's centre. The latter matches other options better, being a verb rather than an adjective (like span_normalise). Shall we change it to centre?

@petrelharp
Copy link
Contributor

Ah-ha: the problem is that Fst calls __k_way_sample_set_stat with an anonymous python function that itself calls the lower-level diversity method; we've made it so that centred is always an argument to the lower-level k-way method, but not to the one-way method. One fix would be to add centred to TreeSequence_one_way_stat_method as we have for TreeSequence_k_way_stat_method, although it's never used, or to do something in the anonymous python Fst function - but a much better solution would be to figure out a way to add an argument to just one method in the C code without adding it to all the other stats methods (without a lot of boilerplate).

Any suggestions here, @benjeffery, @jeromekelleher?

@petrelharp
Copy link
Contributor

I've done (two versions of) a way to fix Fst specific to the Fst method. That's enough for you to proceed here anyhow, @mmosmond .

@mmosmond
Copy link
Contributor Author

Shall we change it to centre?

I was thinking the same and just committed the changes to _tskitmodule.c and trees.py (the trees.c/h files still refer to a noncentred option/summary function, but in that context it seems appropriate to leave as is

@jeromekelleher
Copy link
Member

I'm not following the issue here @petrelharp, but maybe the simplest thing is to hack what needs to be hacked here and to create an issue to track the necessary "unhacking"?

@petrelharp
Copy link
Contributor

Ok, @mmosmond - the conclusion in #1648 is that you can leave things the way they are here, and we'll follow up with the unhacking separately.

@petrelharp
Copy link
Contributor

Is that enough to get you going again?

@petrelharp
Copy link
Contributor

Need any assistance to get this wrapped up, @mmosmond?

@mmosmond
Copy link
Contributor Author

Hey @petrelharp, sorry, having a hard time finding the time right now (and it takes me a while to navigate the code). If you can find the time go for it! Sorry to drag this out...

@petrelharp
Copy link
Contributor

All good! I'll see if I can get to it...

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spotted a few minor things, otherwise LGTM

c/tskit/trees.c Show resolved Hide resolved
c/tskit/trees.c Show resolved Hide resolved
c/tskit/trees.h Outdated Show resolved Hide resolved
python/_tskitmodule.c Show resolved Hide resolved
python/_tskitmodule.c Show resolved Hide resolved
python/tskit/trees.py Outdated Show resolved Hide resolved
@petrelharp
Copy link
Contributor

petrelharp commented Sep 16, 2024

Whoops - I guess I forgot to check codecov for _tskitmodule.c? I'm a bit puzzled why it wasn't more obvious when I looked at the codecov report - I thought I'd looked through everything! I think I've got all those bits; we'll see if codecov agrees. Edit: maybe I just didn't click on that file? It's clearly there now...

@petrelharp
Copy link
Contributor

Okay - looks good to go pending @mmosmond's review of the docs and concepts!

…tic_relatedness

.

initialize all nodes in branch stats; closes tskit-dev#2983

right normalization in genetic relatedness, closes tskit-dev#2984

tests pass!!
lint

relatedness_matrix passes tests

passes all tests?
additional C testing
lint

changelog

apply @jeromekelleher comments

lint

a few more low-level tests

one more test
@petrelharp petrelharp force-pushed the genetic_relatedness_centerFalse branch from a2d1326 to 3d3a989 Compare September 17, 2024 00:48
docs/stats.md Outdated Show resolved Hide resolved
docs/stats.md Outdated

Most statistics are not be affected by invariant sites,
and hence do not depend on any part of the tree that is not ancestral to any of the sample sets.
However, some statistics are different: for instance, {meth}`TreeSequence.genetic_relatedness`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'However, some statistics are different: for instance, given a pair of samples, {meth}TreeSequence.genetic_relatedness
with centre=False (and polarised=True, the default for that method)
adds up the total number of alleles (or total area of branches) that is...'

docs/stats.md Outdated
of samples.
of sample sets.
For an polarised statistic (the default) with biallelic loci, this calculates
{math}`(p_1 - \bar{p}) (p_2 - \bar{p})`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth clarifying what is meant by $\bar{p}$ precisely

docs/stats.md Outdated Show resolved Hide resolved
docs/stats.md Outdated
@@ -588,18 +607,30 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.
For an unpolarized statistic with biallelic loci, this calculates
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'unpolarised'

docs/stats.md Outdated Show resolved Hide resolved
docs/stats.md Outdated Show resolved Hide resolved
docs/stats.md Outdated Show resolved Hide resolved
@jeromekelleher
Copy link
Member

Shall we merge?

@petrelharp
Copy link
Contributor

Yes! @mmosmond still happy for any suggestions / changes - we can put them in separately.

@petrelharp petrelharp merged commit 4170933 into tskit-dev:main Sep 17, 2024
19 checks passed
@mmosmond
Copy link
Contributor Author

Wow, amazing to see how you guys work! No wonder this software can do all the things it can do.

This looks good to me. I can get exactly what I want (shared branch length in each tree for all pairs of samples) with

import tskit, msprime
import numpy as np
ts = msprime.sim_ancestry(samples=3, sequence_length=1e3, recombination_rate=1e-3, random_seed=1)
n = ts.num_samples
sample_sets = [[i] for i in ts.samples()] #each sample in its own set
indexes = [(i,j) for i in ts.samples() for j in ts.samples()] #every pairwise comparison
ts.genetic_relatedness(sample_sets, indexes, 
                                  mode='branch', windows='trees', 
                                  span_normalise=True, polarised=True, 
                                  proportion=False, centre=False).reshape(ts.num_trees, n, n) 

No more pesky factor of 2!

@petrelharp
Copy link
Contributor

Yep, that's right! And you probably already noticed this, but if you don't want the actual matrix but just want to do operations with it, you can do vector-matrix-vector products with genetic_relatedness_weights, and matrix-vector products when #2980 goes in.

@petrelharp
Copy link
Contributor

Your attention to detail on the documentation would be particularly helpful, though, @mmosmond - so if you haven't read it carefully then that would be helpful.

@mmosmond
Copy link
Contributor Author

Yeah I took a look yesterday, and again now, and (with a focus on branch statistics) all looks ok to me. The only thing I wonder about is -- as a relative novice with a very specific aim -- if it should be said explicitly under "branch" that you can get the shared branch lengths between two samples in each tree with mode='branch', windows='trees', proportion=False, centre=False. But that is probably putting too much emphasis on an overly specific case. The upside would be to provide a concrete example for the branch statistic, like $K_{c0}$ does for the site statistic.

@petrelharp
Copy link
Contributor

Perhaps a short example in the tutorials would be nice for that?

@mmosmond
Copy link
Contributor Author

Yeah, that sounds more appropriate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

new (?) statistic to efficiently calculate shared times between every pair of samples at every tree
5 participants