
Memory consumption of tree sequence statistics #647

Closed
molpopgen opened this issue May 26, 2020 · 8 comments
Labels: Performance · statistics

Comments

@molpopgen
Member

When the output dimension of a statistic is large, so is the memory consumption.

The following example calculates the pairwise distance matrix for all samples from a single tree and requires a bit over 7GB of RAM for a small number of samples (1000).

import msprime
import numpy as np
import tskit


def pairwise_distance_branch(ts: tskit.TreeSequence, samples: np.ndarray):
    # One singleton sample set per sample and one index pair per unordered
    # pair of samples, giving n * (n - 1) / 2 output dimensions.
    sample_sets = []
    indexes = []
    for i in range(len(samples)):
        sample_sets.append([samples[i]])
        for j in range(i + 1, len(samples)):
            indexes.append((i, j))

    div = ts.divergence(sample_sets, indexes=indexes, mode="branch")
    return div


print(msprime.__version__)
print(tskit.__version__)
ts = msprime.simulate(1000, random_seed=12345)
div = pairwise_distance_branch(ts, ts.samples())

The versions are:
msprime 0.7.4
tskit 0.2.3

From talking to @petrelharp about this, it appears that some or most of the RAM use may be attributable to memoization during the calculation that (he feels) may not be necessary.

@petrelharp
Contributor

Here's the memory; the algorithm is described here. I think that summary can be recomputed each time it is needed, which would alleviate this problem, although it would probably cause a drop in performance.

@jeromekelleher
Member

Well, we could add an option to either store the intermediate results or recompute them. I don't think this would add too much complexity. That is, assuming there is a significant difference in performance. If not, then we should get rid of the stored results. Should be easy enough to do a quick test?
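
A quick check along the lines suggested here doesn't need the full 1000-sample case. This is only a minimal sketch, assuming Linux (where ru_maxrss is reported in kilobytes); the helper name peak_rss_mb and the sample sizes are arbitrary choices, not anything from tskit itself. It just watches peak process memory grow with the number of index pairs:

import resource

import msprime


def peak_rss_mb():
    # Maximum resident set size of the process so far; this is cumulative,
    # so it only ever increases across iterations (Linux reports kilobytes).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


for n in (100, 200, 400):
    ts = msprime.simulate(n, random_seed=12345)
    sample_sets = [[s] for s in ts.samples()]
    indexes = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ts.divergence(sample_sets, indexes=indexes, mode="branch")
    print(f"n={n}, pairs={len(indexes)}, peak RSS so far: {peak_rss_mb():.0f} MB")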

@jeromekelleher
Member

Is this still an open issue? I think we should probably close it unless someone is intending to follow it up.

@molpopgen
Member Author

molpopgen commented Aug 27, 2020 via email

@jeromekelleher
Member

OK, let's keep it open.

@benjeffery added the Performance label Sep 29, 2020
@jeromekelleher
Member

I'm going to close this because we're addressing the general problem with pairwise statistics using a different framework now (starting from the divergence matrix in #2736).

The short version, I think, is that the stats API assumes we have a relatively small number of statistics; if we have a large number of related statistics to compute, then other approaches should be used.
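
One such approach, sketched here purely for illustration (this is not the new framework referred to above, and the chunk size and function name are arbitrary), is to feed the index pairs to divergence in fixed-size chunks, so that peak memory scales with the chunk rather than with all n(n-1)/2 pairs, at the cost of traversing the trees once per chunk:

import itertools

import msprime
import numpy as np


def pairwise_divergence_chunked(ts, samples, chunk_size=10_000):
    # Process the index pairs in chunks; each divergence call has only
    # chunk_size output dimensions, so intermediate storage stays bounded.
    sample_sets = [[s] for s in samples]
    n = len(samples)
    pairs = itertools.combinations(range(n), 2)
    out = np.zeros(n * (n - 1) // 2)
    offset = 0
    while True:
        chunk = list(itertools.islice(pairs, chunk_size))
        if not chunk:
            break
        out[offset:offset + len(chunk)] = ts.divergence(
            sample_sets, indexes=chunk, mode="branch"
        )
        offset += len(chunk)
    return out


ts = msprime.simulate(100, random_seed=12345)
div = pairwise_divergence_chunked(ts, list(ts.samples()))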

@petrelharp
Contributor

I actually think it's still worth running those tests. What you say is true for pairwise stats, but there are also classes of stats with output size equal to the number of samples that would be nice to do this way; e.g. "relatedness matrix times a vector".
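
For concreteness, here is a rough sketch of that case, accumulating K @ v one block of rows at a time so the full n-by-n matrix is never materialised. It assumes genetic_relatedness accepts the same sample_sets/indexes arguments as the other multi-way stats; the block size and the function name relatedness_matvec are arbitrary, and this is not the implementation eventually adopted:

import msprime
import numpy as np


def relatedness_matvec(ts, samples, v, block=50):
    # Compute (relatedness matrix) @ v without forming the whole matrix:
    # build one block of rows at a time and accumulate the dot products.
    sample_sets = [[s] for s in samples]
    n = len(samples)
    result = np.zeros(n)
    for start in range(0, n, block):
        rows = range(start, min(start + block, n))
        indexes = [(i, j) for i in rows for j in range(n)]
        k_block = ts.genetic_relatedness(
            sample_sets, indexes=indexes, mode="branch"
        ).reshape(len(rows), n)
        result[start:start + len(rows)] = k_block @ v
    return result


ts = msprime.simulate(200, random_seed=12345)
samples = list(ts.samples())
v = np.random.default_rng(1).normal(size=len(samples))
Kv = relatedness_matvec(ts, samples, v)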

@petrelharp petrelharp reopened this Jul 9, 2023
@petrelharp
Contributor

Closed in #2980.
