Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combining 5mC with 5hmC #363

Open
bopohdr opened this issue Feb 2, 2025 · 1 comment
Open

Combining 5mC with 5hmC #363

bopohdr opened this issue Feb 2, 2025 · 1 comment
Labels
question Looking for clarification on inputs and/or outputs

Comments

@bopohdr
Copy link

bopohdr commented Feb 2, 2025

As i understood, if interested in only 5mC vs C (and not 5hmC), then collapsing is performed by splitting 5hmC probabilities equally between both states:
collapsing 'h', with 'm' and canonical options, half of the probability of 'h' will be added to both 'm' and 'C'.

This would work only assuming that 5hmC calls are equally possible between 5mC and C. However, it is known that 5hmC is more likely to be a FP of 5mC and not canonical C (e.g., see section "The use of more comprehensive negative controls to account for confounding DNA modifications" in https://www.biorxiv.org/content/10.1101/2024.11.19.624260v1.full.pdf)

  1. Do I understand correctly, that if Pcanonical = 0.45, P5mC=0.35, P5hmC=0.2 - it would be assigned Pcanonical (providing threshold is 0.4) ?
  2. Is there a smarter way to first merge P5mC + P5hmC and then call based on the probability?

Thanks!

@ArtRand
Copy link
Contributor

ArtRand commented Feb 4, 2025

Hello @bopohdr,

As i understood, if interested in only 5mC vs C (and not 5hmC), then collapsing is performed by splitting 5hmC probabilities equally between both states: collapsing 'h', with 'm' and canonical options, half of the probability of 'h' will be added to both 'm' and 'C'.

That's correct.

This would work only assuming that 5hmC calls are equally possible between 5mC and C. However, it is known that 5hmC is more likely to be a FP of 5mC and not canonical C (e.g., see section "The use of more comprehensive negative controls to account for confounding DNA modifications" in https://www.biorxiv.org/content/10.1101/2024.11.19.624260v1.full.pdf)

The method that Modkit uses doesn't have any information regarding the relative frequency of the various modifications in the sample. It also doesn't have any information about the error rate of the model to form any kind of likelihood of observing $p_{\text{5hmC}}$ given that $\text{x} = \text{5mC}$ or anything like that.

Basically, tl;dr, the method in Modkit is a fairly unbiased way to turn a 3-way classification into a 2-way classification.

Do I understand correctly, that if Pcanonical = 0.45, P5mC=0.35, P5hmC=0.2 - it would be assigned Pcanonical (providing threshold is 0.4) ?

That's correct.

Is there a smarter way to first merge P5mC + P5hmC and then call based on the probability?

Modkit has a couple other ways actually.

  1. In functions like pileup you can use --combine-mods which will sum together all modification calls.
  2. You can use modkit adjust-mods ${inbam} ${outbam} --convert h m which will add $p_{\text{5hmC}}$ to $p_{\text{5mC}}$ - I think this is the most similar to what you're looking for.

You can play around with this using the test data in the repo. E.g.

$ bam=./tests/resources/bc_anchored_10_reads.sorted.bam
$ modkit extract full ${bam} /tmp/scratch/orig.tsv
$ modkit adjust-mods ${bam} stdout --convert h m | modkit extract full stdin /tmp/scratch/converted.tsv

In general we've found that the default method for --ignore gives the best results when evaluating against ground truth synthetic oligos. We have a blog post and public data available if you want to try the various other methods. There is an older blog post comparing to WGBS as well.

Thanks for the question, happy to elaborate on anything that isn't clear.

@ArtRand ArtRand added the question Looking for clarification on inputs and/or outputs label Feb 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Looking for clarification on inputs and/or outputs
Projects
None yet
Development

No branches or pull requests

2 participants