Summarizing the discussion here with @grahamgower, @petrelharp, and @andrewkern:

The SLiM engine in stdpopsim can use checkpointed rejection sampling to condition on the frequencies of particular mutations over time (e.g. to simulate sweeps). As pointed out by @petrelharp, this mechanism can easily result in simulations from the incorrect stochastic process. A simple example is a scenario where a mutation gets introduced at generation $t_0$, the simulation state is saved at generation $t_1$, and if the mutation is lost anywhere in the interval $\{t_1 + 1, \dots, t_2\}$ then the simulation restarts from the checkpoint.
```
Mutation introduced
|               Checkpoint                        Simulation ends
|               |     Restart from checkpoint     |
|               |     if mutation gets lost       |
|-------A-------|----------------B----------------|
t_0             t_1                               t_2
                        Time
```
In the current API, this corresponds to the extended events mechanism, where the "checkpointing" behavior is triggered by `save=True`.
Let $F_i$ be a random variable that is the frequency of the mutation at time $i$. When we run the simulator from time $t_0$ until time $t_1$, we end up with a particular realization of frequencies across all generations in the interval $A = \{t_0 + 1, \dots, t_1\}$. Let $F_{A}$ be the random frequency trajectory over this interval and $f_A$ its realization. If we stop the simulator at $t_1$, then the joint probability of the frequencies over $A$ is $\mathbb{P}(F_{A} = f_{A})$.
Now continue the simulator from $t_1$ until $t_2$. First consider rejection sampling when there's no intermediate checkpoint, so that the simulation restarts from $t_0$ if the mutation is lost in the interval $B = \{t_1+1, \dots, t_2\}$. Each candidate sample would draw a new $f_A$, and would be accepted with probability $\mathbb{P}(F_i > 0\text{ for }i \in B \mid F_A = f_A)$. Thus, the values of $f_A$ across the rejection-sampled trajectories will follow the conditional distribution $$\mathbb{P}(F_A = f_A \mid F_i > 0\text{ for }i \in B) = \frac{\mathbb{P}(F_A = f_A) \mathbb{P}(F_i > 0\text{ for }i \in B \mid F_A = f_A)}{\mathbb{P}(F_i > 0\text{ for }i \in B)}$$ where the denominator is just a normalizing constant to make the distribution sum/integrate to 1.
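The correct no-checkpoint scheme can be sketched with a toy neutral Wright-Fisher model. This is purely illustrative code, not the stdpopsim/SLiM implementation; all function names here are made up for the sketch:

```python
import random

def wf_step(rng, f, N):
    """One neutral Wright-Fisher generation: resample 2N allele copies."""
    return sum(rng.random() < f for _ in range(2 * N)) / (2 * N)

def simulate_segment(rng, f0, n_gens, N):
    """Frequency trajectory of n_gens generations starting from f0 (f0 excluded)."""
    traj, f = [], f0
    for _ in range(n_gens):
        f = wf_step(rng, f, N)
        traj.append(f)
    return traj

def rejection_sample_from_t0(rng, N, len_A, len_B):
    """Correct sampler: on loss anywhere in B, restart from t_0.  A fresh f_A
    is drawn on every attempt, so accepted f_A values follow the conditional
    distribution P(F_A = f_A | F_i > 0 for i in B)."""
    while True:
        f0 = 1 / (2 * N)                  # mutation introduced as a single copy
        traj_A = simulate_segment(rng, f0, len_A, N)
        traj_B = simulate_segment(rng, traj_A[-1], len_B, N)
        if all(f > 0 for f in traj_B):    # accept only if never lost during B
            return traj_A, traj_B
```

Note that each rejection discards the whole trajectory, including segment $A$, which is exactly what makes the accepted $f_A$ values conditionally distributed.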
Now consider rejection sampling where there's a checkpoint at generation $t_1$ -- that is, we save $f_A$ at time $t_1$ and if the mutation is lost in $B$, then we return to generation $t_1$ and keep reusing the same value of $f_{t_1}$ as a starting point until we get an acceptable $f_B$. In this case, the checkpointed $f_A$ gets accepted with probability 1, because we're forcing the rejection sampler to return a trajectory that includes it! Thus, even though the resulting trajectories all satisfy $f_{t_2} > 0$, they will contain values of $f_A$ distributed according to $\mathbb{P}(F_A = f_A)$ rather than the conditional distribution $\mathbb{P}(F_A = f_A \mid F_i > 0\text{ for }i \in B)$. So the allele frequency trajectories (and thus genealogies) produced by the checkpointed rejection sampler won't come from the correct (conditioned) stochastic process.
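The checkpointed variant differs from the sketch above in one line: segment $A$ is simulated once, frozen at $t_1$, and only segment $B$ is re-run on rejection. Again a toy sketch with made-up names; as a simplifying assumption, a trajectory lost before the checkpoint is redrawn from $t_0$ (in the real engine this would be handled by an earlier checkpoint):

```python
import random

def wf_step(rng, f, N):
    """One neutral Wright-Fisher generation: resample 2N allele copies."""
    return sum(rng.random() < f for _ in range(2 * N)) / (2 * N)

def simulate_segment(rng, f0, n_gens, N):
    traj, f = [], f0
    for _ in range(n_gens):
        f = wf_step(rng, f, N)
        traj.append(f)
    return traj

def rejection_sample_checkpointed(rng, N, len_A, len_B):
    """Biased sampler: segment A is saved at the checkpoint t_1 and never
    rejected afterwards.  The saved f_A is accepted with probability 1, so
    it is distributed as P(F_A = f_A) (given survival to t_1), not as the
    conditional P(F_A = f_A | F_i > 0 for i in B)."""
    f0 = 1 / (2 * N)
    traj_A = simulate_segment(rng, f0, len_A, N)
    while traj_A[-1] == 0:                # lost before the checkpoint:
        traj_A = simulate_segment(rng, f0, len_A, N)  # B could never succeed
    while True:                           # checkpoint: traj_A is now frozen
        traj_B = simulate_segment(rng, traj_A[-1], len_B, N)
        if all(f > 0 for f in traj_B):
            return traj_A, traj_B
```

Comparing many replicates of the two samplers (e.g. the distribution of $f_{t_1}$) would expose the bias: only the first sampler's $f_A$ values reflect the conditioning on survival through $B$.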
One implication is that if the probability that an allele is lost is "independent enough" of its frequency at the checkpoint -- so that $\mathbb{P}(F_i > 0\text{ for }i \in B \mid F_A = f_A) \approx \mathbb{P}(F_i > 0\text{ for }i \in B)$ -- then the checkpointed sampler would be close to correct. This may also happen if the allele frequency trajectories tend to go to fixation quickly enough (?). Even if the sampler is incorrect, importance weights could be used to mitigate the bias (e.g. for computing expected values, or for weighting training simulations for neural networks).
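Since the checkpointed sampler draws $f_A$ from $\mathbb{P}(F_A = f_A)$ while the target is proportional to $\mathbb{P}(F_A = f_A)\,\mathbb{P}(F_i > 0\text{ for }i \in B \mid F_A = f_A)$, the importance weight for each replicate is (up to normalization) exactly the skipped acceptance probability, which could be estimated by Monte Carlo. A hypothetical helper sketching this, reusing the toy Wright-Fisher step from the sketches above:

```python
import random

def wf_step(rng, f, N):
    """One neutral Wright-Fisher generation: resample 2N allele copies."""
    return sum(rng.random() < f for _ in range(2 * N)) / (2 * N)

def survival_prob(rng, f_t1, len_B, N, n_reps=500):
    """Monte Carlo estimate of P(F_i > 0 for all i in B | F_{t_1} = f_t1):
    the acceptance probability that the checkpointed sampler skips.  Each
    checkpointed replicate can be weighted by this quantity (normalized
    across replicates) when computing expectations."""
    hits = 0
    for _ in range(n_reps):
        f, ok = f_t1, True
        for _ in range(len_B):
            f = wf_step(rng, f, N)
            if f == 0:
                ok = False
                break
        hits += ok
    return hits / n_reps
```

The estimate only depends on the trajectory through its value at the checkpoint, $f_{t_1}$, because the process is Markov: segment $B$ depends on segment $A$ only through $F_{t_1}$.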
Another nuance is checkpointing when the mutation is introduced. In principle, this should be fine because there's only one possible allele frequency ($f_{t_0} = 1/(2N)$) when the state is saved. However, if multiple mutations are introduced during the simulation (e.g. at staggered times, maybe across distinct sites) and are all checkpointed, then the problem described above would occur -- because the checkpoints of "later" mutations would store the partial frequency trajectories of earlier mutations.
This discussion was converted from issue #1322 on August 16, 2022 14:50.