
Memory consumption of tree sequence statistics #647

Closed
molpopgen opened this issue May 26, 2020 · 8 comments
Labels: Performance · statistics

Comments

@molpopgen
Member

When the output dimension of a statistic is large, so is the memory consumption.

The following example calculates the pairwise distance matrix for all samples from a single tree and requires a bit over 7GB of RAM for a small number of samples (1000).

import msprime
import numpy as np
import tskit


def pairwise_distance_branch(ts: tskit.TreeSequence, samples: np.ndarray):
    # One singleton sample set per sample and one index pair per unordered
    # pair of samples, giving n * (n - 1) / 2 output dimensions.
    sample_sets = []
    indexes = []
    for i in range(len(samples)):
        sample_sets.append([samples[i]])
        for j in range(i + 1, len(samples)):
            indexes.append((i, j))

    div = ts.divergence(sample_sets, indexes=indexes, mode="branch")
    return div


print(msprime.__version__)
print(tskit.__version__)
ts = msprime.simulate(1000, random_seed=12345)
div = pairwise_distance_branch(ts, ts.samples())

The versions are:
msprime 0.7.4
tskit 0.2.3

From talking to @petrelharp about this, it appears that some or most of the RAM use may be attributable to memoization during the calculation that (he feels) may not be necessary.

@petrelharp
Contributor

Here's the memory; the algorithm is described here. I think that summary can be recomputed each time it is needed, which would alleviate this problem, although it would probably cause a drop in performance.

@jeromekelleher
Member

Well, we could add an option to either store the intermediate results or recompute them. I don't think this would add too much complexity. That is, assuming there is a significant difference in performance. If not, then we should get rid of the stored results. Should be easy enough to do a quick test?
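
A quick check along the lines suggested here doesn't need the full 1000-sample case. This is only a minimal sketch, assuming Linux (where ru_maxrss is reported in kilobytes); the helper name peak_rss_mb and the sample sizes are arbitrary choices, not anything from tskit itself. It just watches peak process memory grow with the number of index pairs:

import resource

import msprime


def peak_rss_mb():
    # Maximum resident set size of the process so far; this is cumulative,
    # so it only ever increases across iterations (Linux reports kilobytes).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


for n in (100, 200, 400):
    ts = msprime.simulate(n, random_seed=12345)
    sample_sets = [[s] for s in ts.samples()]
    indexes = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ts.divergence(sample_sets, indexes=indexes, mode="branch")
    print(f"n={n}, pairs={len(indexes)}, peak RSS so far: {peak_rss_mb():.0f} MB")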

@jeromekelleher
Member

Is this still an open issue? I think we should probably close it unless someone is intending to follow it up.

@molpopgen
Member Author

molpopgen commented Aug 27, 2020 via email

@jeromekelleher
Member

OK, let's keep it open.

@benjeffery added the Performance label Sep 29, 2020
@jeromekelleher
Member

I'm going to close this because we're addressing the general problem with pairwise statistics using a different framework now (starting from the divergence matrix in #2736).

The short version, I think, is that the stats API assumes we have a relatively small number of statistics; if we have a large number of related statistics to compute, then other approaches should be used.
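
One such approach, sketched here purely for illustration (this is not the new framework referred to above, and the chunk size and function name are arbitrary), is to feed the index pairs to divergence in fixed-size chunks, so that peak memory scales with the chunk rather than with all n(n-1)/2 pairs, at the cost of traversing the trees once per chunk:

import itertools

import msprime
import numpy as np


def pairwise_divergence_chunked(ts, samples, chunk_size=10_000):
    # Process the index pairs in chunks; each divergence call has only
    # chunk_size output dimensions, so intermediate storage stays bounded.
    sample_sets = [[s] for s in samples]
    n = len(samples)
    pairs = itertools.combinations(range(n), 2)
    out = np.zeros(n * (n - 1) // 2)
    offset = 0
    while True:
        chunk = list(itertools.islice(pairs, chunk_size))
        if not chunk:
            break
        out[offset:offset + len(chunk)] = ts.divergence(
            sample_sets, indexes=chunk, mode="branch"
        )
        offset += len(chunk)
    return out


ts = msprime.simulate(100, random_seed=12345)
div = pairwise_divergence_chunked(ts, list(ts.samples()))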

@petrelharp
Contributor

I actually think it's still worth running those tests. What you say is true for pairwise stats, but there are also classes of stats with output size equal to the number of samples that would be nice to do this way; e.g. "relatedness matrix times a vector".
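
For concreteness, here is a rough sketch of that case, accumulating K @ v one block of rows at a time so the full n-by-n matrix is never materialised. It assumes genetic_relatedness accepts the same sample_sets/indexes arguments as the other multi-way stats; the block size and the function name relatedness_matvec are arbitrary, and this is not the implementation eventually adopted:

import msprime
import numpy as np


def relatedness_matvec(ts, samples, v, block=50):
    # Compute (relatedness matrix) @ v without forming the whole matrix:
    # build one block of rows at a time and accumulate the dot products.
    sample_sets = [[s] for s in samples]
    n = len(samples)
    result = np.zeros(n)
    for start in range(0, n, block):
        rows = range(start, min(start + block, n))
        indexes = [(i, j) for i in rows for j in range(n)]
        k_block = ts.genetic_relatedness(
            sample_sets, indexes=indexes, mode="branch"
        ).reshape(len(rows), n)
        result[start:start + len(rows)] = k_block @ v
    return result


ts = msprime.simulate(200, random_seed=12345)
samples = list(ts.samples())
v = np.random.default_rng(1).normal(size=len(samples))
Kv = relatedness_matvec(ts, samples, v)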

@petrelharp petrelharp reopened this Jul 9, 2023
@petrelharp
Contributor

Closed in #2980.
