Adding a centre=False option to genetic_relatedness statistic #1623

mmosmond · 2021-08-12T13:31:42Z

Description

Adding centre=False option to genetic_relatedness statistic, so that it can calculate, for example, the non-mean-centred shared times between samples.

So far I've just added the option to the Python test function and a test (which passes). Next step is to develop the C code, but I haven't figured out how to do it. I think a similar change (an option to set mean to 0) needs to be added to genetic_relatedness_summary_func.

Fixes #1619

Thanks for any help you can give @petrelharp or others!

codecov · 2021-08-12T14:02:58Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.72%. Comparing base (024e042) to head (275b27d).
Report is 4 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1623      +/-   ##
==========================================
+ Coverage   89.71%   89.72%   +0.01%     
==========================================
  Files          29       29              
  Lines       31298    31340      +42     
  Branches     6067     6077      +10     
==========================================
+ Hits        28078    28120      +42     
  Misses       1837     1837              
  Partials     1383     1383

Flag	Coverage Δ
c-tests	`86.49% <100.00%> (+0.02%)`	⬆️
lwt-tests	`80.78% <ø> (ø)`
python-c-tests	`88.99% <100.00%> (+<0.01%)`	⬆️
python-tests	`99.01% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
c/tskit/trees.c	`90.50% <100.00%> (+0.05%)`	⬆️
python/_tskitmodule.c	`88.99% <100.00%> (+<0.01%)`	⬆️
python/tskit/trees.py	`98.78% <100.00%> (+<0.01%)`	⬆️

petrelharp · 2021-08-13T05:08:28Z

Ok, this should do it, I think? The way the option gets passed through feels a little inelegant, but I think is ok.

We should think through the noncentred version a bit - what's the use case? Note that it will depend on, for instance, the number of fixed, derived mutations, which is a property we like statistics to not have.

mmosmond · 2021-08-14T17:57:31Z

This looks like the right thing to me. One semantic note though, should the new method you made be called genetic_relatedness_noncentred_summary_func rather than genetic_relatedness_centred_summary_func?

I was trying to get this to run in python and there seems to be a problem there. I'm not sure if this is correct but I added the centred=T/F option to __k_way_sample_set_stat in trees.py, but then there seems to be an error in parsing the keyword arguments, and I'm not sure what to do to fix it. More specifically,

import tskit, msprime

ts = msprime.sim_ancestry(samples=10, sequence_length=10, recombination_rate=1, random_seed=1)

n = ts.num_samples
sample_sets = [[i] for i in ts.samples()]
indexes = [(i,j) for i in ts.samples() for j in ts.samples()]

ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, proportion=False, centred=False)

gives SystemError: More keyword list entries (8) than format specifiers (7), where the error comes from line 5883 in trees.py.

On the use case, getting the shared times between sample lineages is what I am interested in, eg, to get the MLE of the rate of evolution under Brownian motion. And even if you want to eventually mean centre things, getting the noncentred version is essential if you were interested in getting the shared times in multiple/particular epochs, eg, to let the rate of evolution be piecewise constant.

petrelharp · 2021-08-18T05:55:11Z

getting the shared times between sample lineages is what I am interested in, eg, to get the MLE of the rate of evolution under Brownian motion.

Hm - I'm not sure what you mean (but, I believe you). Ah, I think I see - you must be doing this separately on each tree? (ie with windows='trees')?

if you were interested in getting the shared times in multiple/particular epochs, eg, to let the rate of evolution be piecewise constant.

It's on our list to add time windows to all the stats - it sounds like you might need this?

mmosmond · 2021-08-18T21:51:50Z

Thanks @petrelharp -- it would have taken me a long time to figure out that format specifier.

Yes, I'm interested in doing this tree by tree.

And yeah, a time-window sounds very convenient for what I'd like to do (and then I should be able to just work with the centred version).

I did a test to make sure this gives me what I think it should:

import tskit, msprime
import numpy as np

# run a small sim
ts = msprime.sim_ancestry(samples=3, sequence_length=1e3, recombination_rate=1e-3, random_seed=1)

# calculate two different measures of genetic_relatedness
n = ts.num_samples
sample_sets = [[i] for i in ts.samples()]
indexes = [(i,j) for i in ts.samples() for j in ts.samples()]
#centred version, NONPOLARISED
sts_grel_mc = ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, polarised=False, proportion=False, centred=True).reshape(ts.num_trees, n, n) 
#noncentred version, POLARISED
sts_grel = ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, polarised=True, proportion=False, centred=False).reshape(ts.num_trees, n, n) 

# my functions to get shared times with general_stat and mean centre them, for comparison with above
def shared_times(ts):
    k = ts.num_samples #number of samples
    W = np.identity(k) #each node i in [0,1,...,k] given row vector of weights with 1 in column i and 0's elsewhere 
    def f(x): return (x.reshape(-1,1) * x).flatten() #determine which pairs of samples share the branch above a node, flattened
    return ts.general_stat(
        W, f, k**2, mode='branch', windows='trees', polarised=True, strict=False
    ).reshape(ts.num_trees, k, k)

def mean_centre(square_matrix):
    k,k = square_matrix.shape
    Tmat = np.identity(k) - [[1/k for _ in range(k)] for _ in range(k)] 
    return np.matmul(Tmat, np.matmul(square_matrix, np.transpose(Tmat)))

# now calculate the shared times and mean-centred shared times with this method
sts_gstat = shared_times(ts)
sts_gstat_mc = np.array(list(map(mean_centre, sts_gstat)))

# and finally compare the two approaches
np.all(abs(sts_gstat_mc - sts_grel_mc) < 1e-8)
np.all(abs(sts_grel * 2 - sts_gstat) < 1e-8) #NOTE THE FACTOR OF 2

It turns out that to get the mean-centred shared times from genetic_relatedness we need polarised=False but to get the non-mean-centred shared times from genetic_relatedness we need polarised=True and to multiply the result by 2.

The latter bit makes sense to me: we want to simply multiply a branch length by 1 if both lineages share it (eqn 8 in Ralph et al 2020 Genetics), and since here f(x,y) = x * y / 2, we also need to multiply by 2. I'm a bit surprised by the former bit (that we need polarised=False to get the mean-centred shared times), but glad everything works!!

Do we need more tests? Or should I go ahead with the workflow and 1) add to docstring/docs, 2) add to the change log, and 3) send this to review?

petrelharp · 2021-08-19T13:42:21Z

And yeah, a time-window sounds very convenient for what I'd like to do (and then I should be able to just work with the centred version).

Hm - I'm confused on two things: how will you do this without a time window? and, I don't think that the centred version will work for this, since you want the centring to be done once, across all time windows, rather than separately by time window? (I think that's how the centering would work with time windows, since the summary stat would be applied separately to each chunk of time x genome.)

mmosmond · 2021-08-19T13:49:32Z

how will you do this without a time window?

I'd calculate the non-centred shared times with genetic_relatedness, write a little script to convert the output to the shared times (in each tree) in each epoch, multiply each epoch by a factor, add them all back together, and mean centre.

I don't think that the centred version will work for this, since you want the centring to be done once, across all time windows, rather than separately by time window?

Right, good point. But in my experience the splitting into epochs and then mean centring isn't a big deal, so what we've done here already seems to do the bulk of the work.

petrelharp · 2021-08-19T14:02:26Z

It turns out that to get the mean-centred shared times from genetic_relatedness we need polarised=False but to get the non-mean-centred shared times from genetic_relatedness we need polarised=True and to multiply the result by 2.

The first thing is good, since that's how we designed the statistic (since this is something estimable without knowing polarization, it should be computed with polarised=False; and this is the default, since polarization is not observable from the data).

As for requring polarisation to get the shared times: this should make sense once we understand what removing the mean centring is doing. Let's see, from the docs, the site version does:

Number of pairwise allelic matches in the window between two sample sets relative to the rest of the sample sets. To be precise, let m(u,v) denote the total number of alleles shared between nodes u and v, and let m(I,J) be the sum of m(u,v) over all nodes u in sample set I and v in sample set J. Let S and T be independently chosen sample sets. Then, for sample sets I and J, this computes E[m(I,J) - m(I,S) - m(J,T) + m(S,T)].

Ok, and this is just removing the "relative" and we get E[m(I,J)]. With polarised=False this includes the ancestral allele as well.
Hm, and the branch version:

Total area of branches in the window ancestral to pairs of samples in two sample sets relative to the rest of the sample sets. To be precise, let B(u,v) denote the total area of all branches ancestral to nodes u and v, and let B(I,J) be the sum of B(u,v) over all nodes u in sample set I and v in sample set J. Let S and T be two independently chosen sample sets. Then for sample sets I and J, this computes E[B(I,J) - B(I,S) - B(J,T) + B(S,T)].

Hm, this is puzzling. Let's see - going back to the summary function, which without mean-centering is

 f(x1, x2) = x1 * x2 / 2

Ah-ha: so if each of these are single samples, then x1 and x2 are each 0/1, indicating whether the branch/allele is above the sample. With polarised=False, this is applied to each branch for both of the allele frequencies resulting from a mutation on that branch; i.e., f(x1, x2) + f(1-x1, 1-x2) (since I'm doing the example of single samples). So if a branch is ancestral to neither sample, then it counts also. This makes good sense, because as we noted above, a mutation that is inherited by neither sample contributes to the shared number of alleles, since both get the ancestral allele at that site.

Somehow this does the right thing for the centred version. I forget why the factor of 2 is there, but I'm sure it's supposed to be there?

So - I think that everything is as expected, although confusing - do you agree? Might you be able to draft something for the docs explaining this (eg that you need polarised=True to get non-mean-centred shared branches, and maybe why)?

petrelharp · 2021-08-19T14:03:17Z

I'd calculate the non-centred shared times with genetic_relatedness, write a little script to convert the output to the shared times (in each tree) in each epoch, multiply each epoch by a factor, add them all back together, and mean centre.

How's this work? For each tree you get a single number; how's it get split into epochs? (sorry, I'm being slow!)

petrelharp · 2021-08-19T14:05:23Z

Do we need more tests? Or should I go ahead with the workflow and 1) add to docstring/docs, 2) add to the change log, and 3) send this to review?

The current tests are great! But, I think we still need tests of centreing for site and node statistics.

petrelharp · 2021-08-19T14:07:05Z

Also our addition seems to have broken some other things - see the failing tests (and let me know if you want me to investigate).

mmosmond · 2021-08-19T15:50:25Z

Thanks @petrelharp!

So if a branch is ancestral to neither sample, then it counts also.

Ah, now I get it, thanks.

Somehow this does the right thing for the centred version.

Yeah, and I just realized that an alternative way to get the centred shared times is to use polarised=True and multiply by 2 (as we do for the non-centred version). This is the most intuitive way, for me, to think about these branch statistics, as it means you just add up the lengths of shared branches (in reality you add half the length of each shared branch, but then multiply by everything by 2).

I forget why the factor of 2 is there, but I'm sure it's supposed to be there?

This makes more sense for me when mode='site' and polarised=False , because then if two sample nodes share an allele at a site they get a 1/2 for sharing the allele and 1/2 for sharing the absence of the other allele, giving 1. But it is harder for me to justify the 1/2 when used as a branch statistic. More thought needed...

So - I think that everything is as expected, although confusing - do you agree? Might you be able to draft something for the docs explaining this (eg that you need polarised=True to get non-mean-centred shared branches, and maybe why)?

Yep and yep.

How's this work? For each tree you get a single number; how's it get split into epochs? (sorry, I'm being slow!)

With n_e epochs and k samples, for each tree you get n_e (k x k) matrices. I've been doing this like (I'm sure this can be improved!):

def epochs(shared_times, split_times):

    T = shared_times[0,0] #TMRCA in this tree (any diagonal entry)
    chopped_split_times = [t for t in split_times if t < T] #chop off epochs at TMRCA
    tsplits = [0] + chopped_split_times + [T] #append 0 to front of split times, append TMRCA to end
    Ts = [tsplits[i + 1] - tsplits[i] for i in range(len(tsplits)-1)] #amount of time in each epoch
    
    stimes = [] #place to store shared time matrices for all epochs

    # 0s for epochs beyond TMRCA
    n = len(split_times) - len(chopped_split_times)
    if n  > 0:
        k = len(shared_times)
        for _ in range(n):
            stimes.append(np.zeros((k,k)))
    
    # epochs within TMRCA
    for i in range(len(tsplits)-1):
        # shared times in remaining epochs (subtract elapsed time, set to zero if negative, find the min vs time in epoch)
        stimes.append(np.minimum(np.maximum(shared_times - (tsplits[-1] - tsplits[-(i+1)]), 0), Ts[-(i+1)])) 
    
    return np.array(stimes) #most distant epoch listed first

# little example using sts_gstat from thread above
split_times = [2, 8, 10] #this will define 4 epochs based on these three split times
sts_epochs = []
# apply to each tree
for i,sts in enumerate(sts_gstat):
    sts_epochs.append(epochs(sts, split_times))

The current tests are great! But, I think we still need tests of centreing for site and node statistics.

OK, I'll work on this.

Also our addition seems to have broken some other things - see the failing tests (and let me know if you want me to investigate).

And I'll try to fix these too.

Thanks again.

petrelharp · 2021-08-19T16:13:05Z

Yeah, and I just realized that an alternative way to get the centred shared times is to use polarised=True and multiply by 2 (as we do for the non-centred version). This is the most intuitive way, for me, to think about these branch statistics, as it means you just add up the lengths of shared branches (in reality you add half the length of each shared branch, but then multiply by everything by 2).

Hm, I'm not sure this works out - each branch only counts in one direction, since for the case of single nodes, if x1 * x2 is nonzero then (1 - x1) * (1 - x2) is zero. More generally, the factor of 2 doesn't come from adding both alleles and dividing by 2, since it applies to multiallelic sites, if I recall correctly.

petrelharp · 2021-08-19T16:13:44Z

We have the proof written up in an overleaf, I'll send it to you.

petrelharp · 2021-08-19T16:15:44Z

And, oh! I see about the epochs. It only works if you're working with only single samples on a single tree and all samples are at the same time, right?

mmosmond · 2021-08-19T16:21:17Z

Hm, I'm not sure this works out - each branch only counts in one direction, since for the case of single nodes, if x1 * x2 is nonzero then (1 - x1) * (1 - x2) is zero. More generally, the factor of 2 doesn't come from adding both alleles and dividing by 2, since it applies to multiallelic sites, if I recall correctly.

Oh, I guess my reasoning is wrong here then (and I'd be curious to see the proof you mention). But the approach does work:

sts_grel_mc_pol = ts.genetic_relatedness(sample_sets, indexes, mode='branch', windows='trees', span_normalise=True, polarised=True, proportion=False, centred=True).reshape(ts.num_trees, n, n)
np.all(abs(sts_grel_mc_pol * 2 - sts_grel_mc) < 1e-8)

mmosmond · 2021-08-19T16:21:48Z

And, oh! I see about the epochs. It only works if you're working with only single samples on a single tree and all samples are at the same time, right?

Yeah, that's the only case I've thought about.

petrelharp · 2021-08-27T06:25:06Z

Let me know if you need a hand here?

mmosmond · 2021-08-30T02:05:38Z

Yeah, a hand would help. I took a stab at fixing the failing tests but can't quite piece it all together. It looks like the problem is passing the extra centre=False keyword argument to the lower level diversity stat, but I don't get exactly where/how that is happening in the tests, and so I'm hesitant to try any fixes.

petrelharp · 2021-08-30T13:24:05Z

I'll have a look. Separately, I see that we've called the option centred, while in the test code it's centre. The latter matches other options better, being a verb rather than an adjective (like span_normalise). Shall we change it to centre?

petrelharp · 2021-08-30T13:44:31Z

Ah-ha: the problem is that Fst calls __k_way_sample_set_stat with an anonymous python function that itself calls the lower-level diversity method; we've made it so that centred is always an argument to the lower-level k-way method, but not to the one-way method. One fix would be to add centred to TreeSequence_one_way_stat_method as we have for TreeSequence_k_way_stat_method, although it's never used, or to do something in the anonymous python Fst function - but a much better solution would be to figure out a way to add an argument to just one method in the C code without adding it to all the other stats methods (without a lot of boilerplate).

Any suggestions here, @benjeffery, @jeromekelleher?

petrelharp · 2021-08-30T14:11:35Z

I've done (two versions of) a way to fix Fst specific to the Fst method. That's enough for you to proceed here anyhow, @mmosmond .

mmosmond · 2021-08-30T14:12:47Z

Shall we change it to centre?

I was thinking the same and just committed the changes to _tskitmodule.c and trees.py (the trees.c/h files still refer to a noncentred option/summary function, but in that context it seems appropriate to leave as is

jeromekelleher · 2021-08-30T14:15:05Z

I'm not following the issue here @petrelharp, but maybe the simplest thing is to hack what needs to be hacked here and to create an issue to track the necessary "unhacking"?

petrelharp · 2021-08-30T18:48:29Z

Ok, @mmosmond - the conclusion in #1648 is that you can leave things the way they are here, and we'll follow up with the unhacking separately.

petrelharp · 2021-08-30T18:49:06Z

Is that enough to get you going again?

petrelharp · 2021-09-20T04:13:31Z

Need any assistance to get this wrapped up, @mmosmond?

mmosmond · 2021-09-20T15:33:45Z

Hey @petrelharp, sorry, having a hard time finding the time right now (and it takes me a while to navigate the code). If you can find the time go for it! Sorry to drag this out...

petrelharp · 2021-09-20T15:35:22Z

All good! I'll see if I can get to it...

jeromekelleher

Spotted a few minor things, otherwise LGTM

c/tskit/trees.c

c/tskit/trees.h

python/_tskitmodule.c

python/tskit/trees.py

petrelharp · 2024-09-16T16:22:12Z

Whoops - I guess I forgot to check codecov for _tskitmodule.c? I'm a bit puzzled why it wasn't more obvious when I looked at the codecov report - I thought I'd looked through everything! I think I've got all those bits; we'll see if codecov agrees. Edit: maybe I just didn't click on that file? It's clearly there now...

petrelharp · 2024-09-16T17:15:28Z

Okay - looks good to go pending @mmosmond's review of the docs and concepts!

…tic_relatedness . initialize all nodes in branch stats; closes tskit-dev#2983 right normalization in genetic relatedness, closes tskit-dev#2984 tests pass!!

lint relatedness_matrix passes tests passes all tests?

additional C testing

@jeromekelleher

lint changelog apply @jeromekelleher comments lint a few more low-level tests one more test

docs/stats.md

brieuclehmann · 2024-09-17T14:20:12Z

docs/stats.md

+
+Most statistics are not be affected by invariant sites,
+and hence do not depend on any part of the tree that is not ancestral to any of the sample sets.
+However, some statistics are different: for instance, {meth}`TreeSequence.genetic_relatedness`


'However, some statistics are different: for instance, given a pair of samples, {meth}TreeSequence.genetic_relatedness
with centre=False (and polarised=True, the default for that method)
adds up the total number of alleles (or total area of branches) that is...'

brieuclehmann · 2024-09-17T14:34:04Z

docs/stats.md

-  of samples.
+  of sample sets.
+  For an polarised statistic (the default) with biallelic loci, this calculates
+  {math}`(p_1 - \bar{p}) (p_2 - \bar{p})`.


Worth clarifying what is meant by $\bar{p}$ precisely

docs/stats.md

brieuclehmann · 2024-09-17T14:34:41Z

docs/stats.md

@@ -588,18 +607,30 @@ and boolean expressions (e.g., {math}`(x > 0)`) are interpreted as 0/1.
  For an unpolarized statistic with biallelic loci, this calculates


'unpolarised'

docs/stats.md

jeromekelleher · 2024-09-17T21:24:45Z

Shall we merge?

petrelharp · 2024-09-17T23:33:22Z

Yes! @mmosmond still happy for any suggestions / changes - we can put them in separately.

mmosmond · 2024-09-18T01:55:32Z

Wow, amazing to see how you guys work! No wonder this software can do all the things it can do.

This looks good to me. I can get exactly what I want (shared branch length in each tree for all pairs of samples) with

import tskit, msprime
import numpy as np
ts = msprime.sim_ancestry(samples=3, sequence_length=1e3, recombination_rate=1e-3, random_seed=1)
n = ts.num_samples
sample_sets = [[i] for i in ts.samples()] #each sample in its own set
indexes = [(i,j) for i in ts.samples() for j in ts.samples()] #every pairwise comparison
ts.genetic_relatedness(sample_sets, indexes, 
                                  mode='branch', windows='trees', 
                                  span_normalise=True, polarised=True, 
                                  proportion=False, centre=False).reshape(ts.num_trees, n, n)

No more pesky factor of 2!

petrelharp · 2024-09-18T18:12:12Z

Yep, that's right! And you probably already noticed this, but if you don't want the actual matrix but just want to do operations with it, you can do vector-matrix-vector products with genetic_relatedness_weights, and matrix-vector products when #2980 goes in.

petrelharp · 2024-09-18T18:13:15Z

Your attention to detail on the documentation would be particularly helpful, though, @mmosmond - so if you haven't read it carefully then that would be helpful.

mmosmond · 2024-09-18T18:34:50Z

Yeah I took a look yesterday, and again now, and (with a focus on branch statistics) all looks ok to me. The only thing I wonder about is -- as a relative novice with a very specific aim -- if it should be said explicitly under "branch" that you can get the shared branch lengths between two samples in each tree with mode='branch', windows='trees', proportion=False, centre=False. But that is probably putting too much emphasis on an overly specific case. The upside would be to provide a concrete example for the branch statistic, like $K_{c0}$ does for the site statistic.

petrelharp · 2024-09-20T16:00:43Z

Perhaps a short example in the tutorials would be nice for that?

mmosmond · 2024-09-20T18:10:05Z

Yeah, that sounds more appropriate.

petrelharp mentioned this pull request Aug 30, 2021

allow new options to k-way stats #1648

Open

jeromekelleher reviewed Sep 16, 2024

View reviewed changes

petrelharp added 4 commits September 16, 2024 17:47

rebase of @mmsomond's previous commits to add centered option to gene…

0381eb6

…tic_relatedness . initialize all nodes in branch stats; closes tskit-dev#2983 right normalization in genetic relatedness, closes tskit-dev#2984 tests pass!!

genetic_relatedness_weighted fixed up

1dd30b3

lint relatedness_matrix passes tests passes all tests?

docs for relatedness_matrix

389b152

additional C testing

divmat works for sample sets; closes tskit-dev#2888

3d3a989

lint changelog apply @jeromekelleher comments lint a few more low-level tests one more test

petrelharp force-pushed the genetic_relatedness_centerFalse branch from a2d1326 to 3d3a989 Compare September 17, 2024 00:48

petrelharp mentioned this pull request Sep 17, 2024

Relatedness matrix-vector product #2980

Merged

brieuclehmann reviewed Sep 17, 2024

View reviewed changes