Combining 5mC with 5hmC #363

bopohdr · 2025-02-02T19:54:07Z

As i understood, if interested in only 5mC vs C (and not 5hmC), then collapsing is performed by splitting 5hmC probabilities equally between both states:
collapsing 'h', with 'm' and canonical options, half of the probability of 'h' will be added to both 'm' and 'C'.

This would work only assuming that 5hmC calls are equally possible between 5mC and C. However, it is known that 5hmC is more likely to be a FP of 5mC and not canonical C (e.g., see section "The use of more comprehensive negative controls to account for confounding DNA modifications" in https://www.biorxiv.org/content/10.1101/2024.11.19.624260v1.full.pdf)

Do I understand correctly, that if Pcanonical = 0.45, P5mC=0.35, P5hmC=0.2 - it would be assigned Pcanonical (providing threshold is 0.4) ?
Is there a smarter way to first merge P5mC + P5hmC and then call based on the probability?

Thanks!

The text was updated successfully, but these errors were encountered:

ArtRand · 2025-02-04T19:28:54Z

Hello @bopohdr,

As i understood, if interested in only 5mC vs C (and not 5hmC), then collapsing is performed by splitting 5hmC probabilities equally between both states: collapsing 'h', with 'm' and canonical options, half of the probability of 'h' will be added to both 'm' and 'C'.

That's correct.

This would work only assuming that 5hmC calls are equally possible between 5mC and C. However, it is known that 5hmC is more likely to be a FP of 5mC and not canonical C (e.g., see section "The use of more comprehensive negative controls to account for confounding DNA modifications" in https://www.biorxiv.org/content/10.1101/2024.11.19.624260v1.full.pdf)

The method that Modkit uses doesn't have any information regarding the relative frequency of the various modifications in the sample. It also doesn't have any information about the error rate of the model to form any kind of likelihood of observing $p_{\text{5hmC}}$ given that $\text{x} = \text{5mC}$ or anything like that.

Basically, tl;dr, the method in Modkit is a fairly unbiased way to turn a 3-way classification into a 2-way classification.

Do I understand correctly, that if Pcanonical = 0.45, P5mC=0.35, P5hmC=0.2 - it would be assigned Pcanonical (providing threshold is 0.4) ?

That's correct.

Is there a smarter way to first merge P5mC + P5hmC and then call based on the probability?

Modkit has a couple other ways actually.

In functions like pileup you can use --combine-mods which will sum together all modification calls.
You can use modkit adjust-mods ${inbam} ${outbam} --convert h m which will add $p_{\text{5hmC}}$ to $p_{\text{5mC}}$ - I think this is the most similar to what you're looking for.

You can play around with this using the test data in the repo. E.g.

$ bam=./tests/resources/bc_anchored_10_reads.sorted.bam
$ modkit extract full ${bam} /tmp/scratch/orig.tsv
$ modkit adjust-mods ${bam} stdout --convert h m | modkit extract full stdin /tmp/scratch/converted.tsv

In general we've found that the default method for --ignore gives the best results when evaluating against ground truth synthetic oligos. We have a blog post and public data available if you want to try the various other methods. There is an older blog post comparing to WGBS as well.

Thanks for the question, happy to elaborate on anything that isn't clear.

ArtRand added the question Looking for clarification on inputs and/or outputs label Feb 4, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combining 5mC with 5hmC #363

Combining 5mC with 5hmC #363

bopohdr commented Feb 2, 2025

ArtRand commented Feb 4, 2025

Combining 5mC with 5hmC #363

Combining 5mC with 5hmC #363

Comments

bopohdr commented Feb 2, 2025

ArtRand commented Feb 4, 2025