Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coverage of modekit pileup results does not match the coverage displayed on IGV #385

Open
rania-o opened this issue Feb 25, 2025 · 2 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@rania-o
Copy link

rania-o commented Feb 25, 2025

Hello,

I used modkit pileup with a bam basecalled with dorado model [email protected] and called for m5C, m6a, pseudou and inosine.
I have two questions :

  • The first one is why does modkit report positions as modified when the fraction of modified base is 0 ?
  • The second is about the valid coverage or the score, for one position on IGV I have more than 7500 reads, but in the results modkit reports only 39 for the coverage, why this big difference ? I also have the case where modkit report ~2900 while on IGV I have only 800 reads.

Thanks in advance for you reply,
Rania

@ArtRand
Copy link
Contributor

ArtRand commented Feb 25, 2025

Hello @rania-o,

The first one is why does modkit report positions as modified when the fraction of modified base is 0 ?

Modkit pileup will report all reference positions where at least one read had a valid call. A "valid call" can be either a canonical or modified call with a probability $\ge$ the pass threshold. You may filter out the positions that have 0% modified - but that's not the default behavior.

The second is about the valid coverage or the score, for one position on IGV I have more than 7500 reads, but in the results modkit reports only 39 for the coverage, why this big difference ? I also have the case where modkit report ~2900 while on IGV I have only 800 reads.

For high-depth, you may want to increase the --max-depth parameter in modkit pileup. If you have a position with valid coverage of 39, but a read depth of 7500, could you check the value of $\text{N}_{\text{diff}}$? You may have a case where most of your reads align an alternate base, for example a G, then a small proportion of the reads have a mismatch and a modification reported on that mismatch. When modkit reports more depth than IGV, it can often be the case that IGV is downsampling the modBAM, if I recall correctly you can change this in the preferences.

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Feb 25, 2025
@rania-o
Copy link
Author

rania-o commented Feb 26, 2025

Hi @ArtRand ,

Thanks for your reply.

Modkit pileup will report all reference positions where at least one read had a valid call. A "valid call" can be either a canonical or modified call with a probability ≥ the pass threshold. You may filter out the positions that have 0% modified - but that's not the default behavior.

What do you mean here by "that's not the default behavior" ? Filtering out positions with 0% modified automatically by modkit ?

For high-depth, you may want to increase the --max-depth parameter in modkit pileup. If you have a position with valid coverage of 39, but a read depth of 7500, could you check the value of N diff ? You may have a case where most of your reads align an alternate base, for example a G, then a small proportion of the reads have a mismatch and a modification reported on that mismatch. When modkit reports more depth than IGV, it can often be the case that IGV is downsampling the modBAM, if I recall correctly you can change this in the preferences.

For the case where IGV shows less reads than modkit, actually I'm aware of this and I checked on IGV all the reads are displayed.

For the other case, you are right. All the reads are considered as an alternate base. But the problem that I face is that I'm doubting about the modification to be a pseudoU since the signature of this modification is misbasecalling uridine as a cytosine T-->C, which is the case on IGV. But, I'm a bit surprised by the model not calling this base as a pseudoU, and instead calling an m6a even if the base in the reference is not an A.

Image
This is the output for this position :

EU293121.1      2131    2132    a       19      +       2131    2132    255,0,0 19      26.32   5       12      2       345     1192    6575    0
EU293121.1      2131    2132    m       6247    +       2131    2132    255,0,0 6247    2.37    148     6099    0       345     1192    347     0
EU293121.1      2131    2132    17596   19      +       2131    2132    255,0,0 19      10.53   2       12      5       345     1192    6575    0
EU293121.1      2131    2132    17802   321     +       2131    2132    255,0,0 321     13.40   43      278     0       345     1192    6273    0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

2 participants