Why scaling the smaller dataset to the same depth as the larger dataset results in more false positive peaks. #694

bentyeh · 2025-02-27T20:44:04Z


In the original 2008 MACS paper by Zhang et al., the authors write:

> we notice that when tag counts from ChIP and controls are not balanced, the sample with more tags often gives more peaks even though MACS normalizes the total tag counts between the two samples ... we await more available ChIP-Seq data with deeper coverage to understand and overcome this bias

This is actually to be expected given MACS's Poisson model. I did not find a thorough explanation elsewhere (GitHub Issues, GitHub Discussions, or MACS Google Group), so I'm posting this here in case anyone else finds it useful. (Or if my understanding is incorrect, please let me know!)

Let $X$ be the coverage at some peak in the ChIP track. Let $\lambda$ be the coverage at that peak in the control track. Let $f$ be the sequencing depth ratio between samples, i.e., total number of reads in the ChIP sample / total number of reads in the control sample. If the ChIP track is scaled to match the control track, the scaled coverage ratio at the peak is $r = (X / f) / \lambda = X / (\lambda f)$. If instead the control track is scaled to match the ChIP track, the ratio is $r = X / (\lambda f)$. The scaled ratio is therefore identical regardless of the direction in which the scaling is performed; only the absolute magnitudes of the scaled ChIP and control coverage differ.
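To make the bookkeeping concrete, here is a minimal Python sketch of the two scaling directions. The function name `scaled_ratio`, the `scale_to` argument, and the example numbers are my own illustrative choices, not MACS code or real data.

```python
def scaled_ratio(chip_cov, ctrl_cov, chip_depth, ctrl_depth, scale_to):
    """Return (scaled ChIP coverage, scaled control coverage, ratio r)."""
    f = chip_depth / ctrl_depth  # sequencing depth ratio, ChIP / control
    if scale_to == "control":
        # Rescale the ChIP track to the control's depth.
        x, lam = chip_cov / f, ctrl_cov
    elif scale_to == "chip":
        # Rescale the control track to the ChIP's depth.
        x, lam = chip_cov, ctrl_cov * f
    else:
        raise ValueError("scale_to must be 'chip' or 'control'")
    return x, lam, x / lam


# Hypothetical peak: raw coverage 60 (ChIP, 30M reads) vs. 10 (control, 10M reads).
for direction in ("control", "chip"):
    x, lam, r = scaled_ratio(60, 10, 30e6, 10e6, scale_to=direction)
    print(f"scale_to={direction}: X={x:g}, lambda={lam:g}, r={r:g}")
```

Both directions give the same ratio $r$, but the Poisson mean $\lambda$ used for testing is three times larger when the control is scaled up to the ChIP depth.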

For a constant ratio $r > 1$ of a sample value to the mean of a Poisson distribution, the upper-tail p-value (roughly 1 - CDF) decreases as the mean increases. While this can be observed by simulation, it can also be intuitively understood as follows:

In a Poisson distribution, the standard deviation grows with the square-root of the mean. Therefore, for a constant ratio of sample value-to-mean, the distance (in units of standard deviation) between the sample value and the mean increases as the mean increases.
- The number of standard deviations away from the mean does not alone determine a p-value when using a Poisson distribution. However, to further our intuitive understanding, we can consider the case of a large mean (e.g., $\lambda \ge 1000$), when a Poisson distribution is very accurately approximated by a Normal distribution (see Wikipedia), which does have the property that p-values are uniquely determined by the number of standard deviations away from the mean. The z-score, given a constant ratio $r > 1$, increases (and the p-value decreases) as the scaled local coverage $\lambda$ increases:

$$z(r, \lambda) = \frac{r\lambda - \lambda}{\sqrt{\lambda}} = (r-1) \sqrt{\lambda}$$
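As a rough worked example under the Normal approximation (the values $r = 1.1$, $\lambda = 2500$, and $\lambda = 10000$ are purely illustrative, not taken from real data):

$$z(1.1, 2500) = (1.1 - 1)\sqrt{2500} = 5, \qquad z(1.1, 10000) = (1.1 - 1)\sqrt{10000} = 10$$

So quadrupling the scaled coverage doubles the z-score, even though the fold enrichment $r$ is unchanged.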

Consequently, using a higher control coverage value $\lambda$ (whether by scaling up a low-depth control to match a high-depth ChIP sample, or by scaling up a low-depth ChIP sample to match a high-depth control sample) results in lower p-values for all peaks with $r > 1$ and therefore more peaks passing the p-value cutoff.
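The same trend can be checked with exact Poisson tail probabilities instead of the Normal approximation. Below is a minimal sketch using only the Python standard library. `log10_poisson_sf` is a helper I wrote for this post (it is not a MACS function), and the coverage values are made up: they represent one peak with fold enrichment $r = 2$ evaluated at three different scaled depths.

```python
import math


def log10_poisson_sf(x, lam):
    """Return log10 of P(X >= x) for X ~ Poisson(lam), integer x >= 0, lam > 0.

    The tail is summed relative to its first term so that very small
    p-values do not underflow to zero.
    """
    # Natural log of the Poisson pmf at k = x.
    log_first = -lam + x * math.log(lam) - math.lgamma(x + 1)
    rel_sum, rel_term, k = 1.0, 1.0, x
    while True:
        k += 1
        rel_term *= lam / k  # pmf(k) / pmf(k - 1)
        rel_sum += rel_term
        # Past the mean the terms shrink monotonically, so stop once negligible.
        if k > lam and rel_term < 1e-17 * rel_sum:
            break
    return (log_first + math.log(rel_sum)) / math.log(10)


# (scaled ChIP coverage X, scaled control coverage lambda), all with r = 2.
scenarios = [(20, 10), (60, 30), (200, 100)]
for x, lam in scenarios:
    z = (x / lam - 1) * math.sqrt(lam)  # Normal-approximation z-score
    print(f"X={x:4d} lambda={lam:4d} z={z:5.2f} log10(p)={log10_poisson_sf(x, lam):8.2f}")
```

The printed log10 p-values become more negative as $\lambda$ grows, so a peak that fails a given cutoff at the smaller scaled depth can pass it at the larger one.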
