Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

split_ultimate_ancestor doesn't always change the root when its children change #993

Closed
duncanMR opened this issue Jan 23, 2025 · 3 comments

Comments

@duncanMR
Copy link

Post-processing is supposed to remove the virtual-root-like ancestor (0) and then split the ultimate ancestor (1) into multiple roots whenever its children change. However, it seems that this process misses some cases. For example:

import msprime
import tsinfer
import numpy as np

sim_ts = msprime.sim_ancestry(
    100,
    sequence_length=1e6,
    random_seed=2,
    recombination_rate=1e-8,
    population_size=1e4,
)
sim_mts = msprime.sim_mutations(sim_ts, rate=1.29e-8, random_seed=2)
sample_data = tsinfer.SampleData.from_tree_sequence(sim_mts)
base_ts = tsinfer.infer(sample_data, post_process=False)
ts = tsinfer.post_process(base_ts)

roots = []
edges_right = ts.edges_right
edges_parent = ts.edges_parent
problem_trees = []
for tree in ts.trees():
    if len(tree.roots) == 1: #skip first and last trees
        root = tree.root
        roots.append(root)
        right = edges_right[edges_parent == root]
        if not np.all(right == right[0]): #some children change at different positions
            problem_trees.append(tree.index)

print(f"{len(problem_trees)} root changes are missed out of {len(roots)} roots") #116 out 1244

Here is one of the problematic trees:

tree = ts.at_index(351)
print(f'Root: {tree.root}, children = {tree.children(tree.root)}')
#Root: 1744, children = (1683, 1151)
tree.prev()
print(f'Root: {tree.root}, children = {tree.children(tree.root)}')
#Root: 1744, children = (1683, 1027, 1335)

The calculation of root breaks in split_ultimate_ancestor seems to be correct in this example:

edges_left = base_ts.edges_left
edges_right = base_ts.edges_right
edges_parent = base_ts.edges_parent
left_children_changes = np.sort(edges_left[edges_parent == 1])
right_children_changes = np.sort(edges_right[edges_parent == 1])
children_changes = set(left_children_changes).union(right_children_changes)

I modified post_process to return the root_breaks, and they match the children_changes I calculated above exactly. So I think the problem might be with the step to modify the tables to add new roots, but I'm not sure where the bug is yet.

@benjeffery
Copy link
Member

Nice find - are the problematic trees on GeL as it would be good to extract some test cases for the test suite.

@duncanMR
Copy link
Author

I just found #850 and tskit-dev/tsdate#452 which discuss this. I'm confused, since aren't all the local root nodes created by splitting up the ultimate ancestor (1) whenever its children change? I don't understand why we would expect to have local roots with children that change over multiple trees after post-processing. @hyanwong, could you please clarify?

@duncanMR
Copy link
Author

I had made a silly error here: I forgot that simplification will remove the ultimate root nodes where there is a younger MRCA, and there is obviously no guarantee that these new roots will have the same children over multiple trees. Even when the ultimate root node is present in the simplified tree, the edges with that node as parent will not always share the same left and right coordinates, because simplification will also merge adjacent edges with the same parent and child. Since it's not a bug, I'm closing the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants