Summarizing the discussion here with @grahamgower, @petrelharp, and @andrewkern:

The SLiM engine in stdpopsim can use checkpointed rejection sampling to condition on the frequencies of particular mutations over time (e.g. to simulate sweeps). As pointed out by @petrelharp, this mechanism can easily result in simulations from the incorrect stochastic process. A simple example is a scenario where a mutation gets introduced at generation $t_0$, the simulation state is saved at generation $t_1$, and if the mutation is lost anywhere in the interval $\{t_1 + 1, \dots, t_2\}$ then the simulation restarts from the checkpoint.
```
Mutation introduced
|               Checkpoint                        Simulation ends
|               |     Restart from checkpoint     |
|               |     if mutation gets lost       |
|-------A-------|----------------B----------------|
t_0             t_1                               t_2
                        Time
```
In the current API, this corresponds to the extended events mechanism, where the "checkpointing" behavior is triggered by `save=True`.
Let $F_i$ be a random variable that is the frequency of the mutation at time $i$. When we run the simulator from time $t_0$ until time $t_1$, we end up with a particular realization of frequencies across all generations in the interval $A = \{t_0 + 1, \dots, t_1\}$. Let $F_{A}$ be the random frequency trajectory over this interval and $f_A$ its realization. If we stop the simulator at $t_1$, then the joint probability of the frequencies over $A$ is $\mathbb{P}(F_{A} = f_{A})$.
Now continue the simulator from $t_1$ until $t_2$. First consider rejection sampling when there's no intermediate checkpoint, so that the simulation restarts from $t_0$ if the mutation is lost in the interval $B = \{t_1+1, \dots, t_2\}$. Each candidate sample would draw a new $f_A$, and would be accepted with probability $\mathbb{P}(F_i > 0\text{ for }i \in B \mid F_A = f_A)$. Thus, the values of $f_A$ across the rejection-sampled trajectories will follow the conditional distribution $$\mathbb{P}(F_A = f_A \mid F_i > 0\text{ for }i \in B) = \frac{\mathbb{P}(F_A = f_A) \mathbb{P}(F_i > 0\text{ for }i \in B \mid F_A = f_A)}{\mathbb{P}(F_i > 0\text{ for }i \in B)}$$ where the denominator is just a normalizing constant to make the distribution sum/integrate to 1.
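The correct no-checkpoint scheme can be sketched with a toy neutral Wright-Fisher model. This is purely illustrative code, not the stdpopsim/SLiM implementation; all function names here are made up for the sketch:

```python
import random

def wf_step(rng, f, N):
    """One neutral Wright-Fisher generation: resample 2N allele copies."""
    return sum(rng.random() < f for _ in range(2 * N)) / (2 * N)

def simulate_segment(rng, f0, n_gens, N):
    """Frequency trajectory of n_gens generations starting from f0 (f0 excluded)."""
    traj, f = [], f0
    for _ in range(n_gens):
        f = wf_step(rng, f, N)
        traj.append(f)
    return traj

def rejection_sample_from_t0(rng, N, len_A, len_B):
    """Correct sampler: on loss anywhere in B, restart from t_0.  A fresh f_A
    is drawn on every attempt, so accepted f_A values follow the conditional
    distribution P(F_A = f_A | F_i > 0 for i in B)."""
    while True:
        f0 = 1 / (2 * N)                  # mutation introduced as a single copy
        traj_A = simulate_segment(rng, f0, len_A, N)
        traj_B = simulate_segment(rng, traj_A[-1], len_B, N)
        if all(f > 0 for f in traj_B):    # accept only if never lost during B
            return traj_A, traj_B
```

Note that each rejection discards the whole trajectory, including segment $A$, which is exactly what makes the accepted $f_A$ values conditionally distributed.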
Now consider rejection sampling where there's a checkpoint at generation $t_1$ -- that is, we save $f_A$ at time $t_1$ and if the mutation is lost in $B$, then we return to generation $t_1$ and keep reusing the same value of $f_{t_1}$ as a starting point until we get an acceptable $f_B$. In this case, the checkpointed $f_A$ gets accepted with probability 1, because we're forcing the rejection sampler to return a trajectory that includes it! Thus, even though the resulting trajectories all satisfy $f_{t_2} > 0$, they will contain values of $f_A$ distributed according to $\mathbb{P}(F_A = f_A)$ rather than the conditional distribution $\mathbb{P}(F_A = f_A \mid F_i > 0\text{ for }i \in B)$. So the allele frequency trajectories (and thus genealogies) produced by the checkpointed rejection sampler won't come from the correct (conditioned) stochastic process.
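The checkpointed variant differs from the sketch above in one line: segment $A$ is simulated once, frozen at $t_1$, and only segment $B$ is re-run on rejection. Again a toy sketch with made-up names; as a simplifying assumption, a trajectory lost before the checkpoint is redrawn from $t_0$ (in the real engine this would be handled by an earlier checkpoint):

```python
import random

def wf_step(rng, f, N):
    """One neutral Wright-Fisher generation: resample 2N allele copies."""
    return sum(rng.random() < f for _ in range(2 * N)) / (2 * N)

def simulate_segment(rng, f0, n_gens, N):
    traj, f = [], f0
    for _ in range(n_gens):
        f = wf_step(rng, f, N)
        traj.append(f)
    return traj

def rejection_sample_checkpointed(rng, N, len_A, len_B):
    """Biased sampler: segment A is saved at the checkpoint t_1 and never
    rejected afterwards.  The saved f_A is accepted with probability 1, so
    it is distributed as P(F_A = f_A) (given survival to t_1), not as the
    conditional P(F_A = f_A | F_i > 0 for i in B)."""
    f0 = 1 / (2 * N)
    traj_A = simulate_segment(rng, f0, len_A, N)
    while traj_A[-1] == 0:                # lost before the checkpoint:
        traj_A = simulate_segment(rng, f0, len_A, N)  # B could never succeed
    while True:                           # checkpoint: traj_A is now frozen
        traj_B = simulate_segment(rng, traj_A[-1], len_B, N)
        if all(f > 0 for f in traj_B):
            return traj_A, traj_B
```

Comparing many replicates of the two samplers (e.g. the distribution of $f_{t_1}$) would expose the bias: only the first sampler's $f_A$ values reflect the conditioning on survival through $B$.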
One implication is that if the probability that an allele is lost is "independent enough" of its frequency at the checkpoint -- so that $\mathbb{P}(F_i > 0\text{ for }i \in B \mid F_A = f_A) \approx \mathbb{P}(F_i > 0\text{ for }i \in B)$ -- then the checkpointed sampler would be close to correct. This may also happen if the allele frequency trajectories tend to go to fixation quickly enough (?). Even if the sampler is incorrect, importance weights could be used to mitigate the bias (e.g. for computing expected values, or for weighting training simulations for neural networks).
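Since the checkpointed sampler draws $f_A$ from $\mathbb{P}(F_A = f_A)$ while the target is proportional to $\mathbb{P}(F_A = f_A)\,\mathbb{P}(F_i > 0\text{ for }i \in B \mid F_A = f_A)$, the importance weight for each replicate is (up to normalization) exactly the skipped acceptance probability, which could be estimated by Monte Carlo. A hypothetical helper sketching this, reusing the toy Wright-Fisher step from the sketches above:

```python
import random

def wf_step(rng, f, N):
    """One neutral Wright-Fisher generation: resample 2N allele copies."""
    return sum(rng.random() < f for _ in range(2 * N)) / (2 * N)

def survival_prob(rng, f_t1, len_B, N, n_reps=500):
    """Monte Carlo estimate of P(F_i > 0 for all i in B | F_{t_1} = f_t1):
    the acceptance probability that the checkpointed sampler skips.  Each
    checkpointed replicate can be weighted by this quantity (normalized
    across replicates) when computing expectations."""
    hits = 0
    for _ in range(n_reps):
        f, ok = f_t1, True
        for _ in range(len_B):
            f = wf_step(rng, f, N)
            if f == 0:
                ok = False
                break
        hits += ok
    return hits / n_reps
```

The estimate only depends on the trajectory through its value at the checkpoint, $f_{t_1}$, because the process is Markov: segment $B$ depends on segment $A$ only through $F_{t_1}$.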
Another nuance is checkpointing when the mutation is introduced. In principle, this should be fine because there's only one possible allele frequency ($f_{t_0} = 1/(2N)$) when the state is saved. However, if multiple mutations are introduced during the simulation (e.g. at staggered times, maybe across distinct sites) and are all checkpointed, then the problem described above would occur -- because the checkpoints of "later" mutations would store the partial frequency trajectories of earlier mutations.
This discussion was converted from issue #1322 on August 16, 2022 14:50.