
Clarification on Modkit DMR Pair Results and Validate Accuracy #327

Open
BrendanBeahan opened this issue Dec 31, 2024 · 4 comments
Labels: question (Looking for clarification on inputs and/or outputs)

Comments

@BrendanBeahan

BrendanBeahan commented Dec 31, 2024

Hello,

Sorry for the flurry of recent questions approaching New Year's! I have recently run modkit dmr pair comparing an in-vitro transcribed (IVT) dataset, in which every A is m6A-modified, against another oligo that is fully unmodified.

However, I am not seeing performance in keeping with existing modkit/dorado benchmarks, and I am wondering whether I am doing something wrong or simply misinterpreting the results.

For dmr pair I perform the following:

modkit dmr pair \
    --ref ${params.reference} \
    -a ${bed_unmod_gz} \
    -b ${bed_mod_gz} \
    --base A \
    --header \
    --out-path single_site_dmr.tsv

And the output looks as follows:

chrom start end name score a_counts a_total b_counts b_total a_mod_percentages b_mod_percentages a_pct_modified b_pct_modified map_pvalue effect_size
MYGENE 1 2 . 133.0527343210424 a:14 7809 a:59 508 a:0.18 a:11.61 0.0017928032 0.11614173 0 -0.11333333333333334
MYGENE 2 3 . 225.94086564894315 a:48 7696 a:128 720 a:0.62 a:17.78 0.0062370063 0.17777778 0 -0.17
MYGENE 3 4 . 330.8802245030597 a:48 7542 a:180 788 a:0.64 a:22.84 0.0063643595 0.2284264 0.0000000000000000001116233249574274 -0.22333333333333336
MYGENE 4 5 . 493.87030782775764 a:43 7669 a:251 859 a:0.56 a:29.22 0.0056069894 0.29220024 0.000000000000000000000000007378606107449063 -0.2866666666666667
MYGENE 5 6 . 54.474088063362615 a:73 7695 a:59 844 a:0.95 a:6.99 0.00948668 0.069905214 0 -0.060000000000000005

My concern is that nearly every single map_pvalue is effectively 0, with a small number being exactly 1. These values of 1 occasionally overlap with the expected modification sites as well, which I would assume is the opposite of the expected behavior.

Also, when I try to use modkit validate, I see the accuracy is only about 50%; the log output for the modified oligo dataset is as follows:

[screenshot: modkit validate log output for the modified oligo dataset, reporting roughly 50% accuracy]

My BED file input for validate looks as follows, with my invocation sketched after the listing:

MYGENE 1 2 a . +
MYGENE 2 3 a . +
MYGENE 3 4 a . +
MYGENE 4 5 a . +
MYGENE 5 6 a . +
MYGENE 6 7 - . +
MYGENE 7 8 - . +
MYGENE 8 9 - . +
MYGENE 9 10 - . +
MYGENE 10 11 - . +
MYGENE 11 12 - . +
MYGENE 12 13 - . +
MYGENE 13 14 - . +
MYGENE 14 15 - . +
MYGENE 15 16 a . +
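
For reference, my validate call for the modified dataset is along these lines; the file names are placeholders, and pairing the aligned BAM with this ground-truth BED via --bam-and-bed is my reading of the modkit validate interface, so please correct me if that is not how it should be invoked:

# placeholder file names; the --bam-and-bed pairing is my reading of the docs
modkit validate \
    --bam-and-bed mod_aligned.bam mod_truth_sites.bed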

I would greatly appreciate any insights into what might be affecting the performance or if there’s anything I should adjust in my approach. Happy to provide any additional details that would help clarify the issue.

Best,
Brendan

@ArtRand
Contributor

ArtRand commented Dec 31, 2024

Hello @BrendanBeahan,

Once again, your commands look fine.

I reproduced the DMR table below to make it a little easier to read:

chrom start end name score a_counts a_total b_counts b_total a_mod_percentages b_mod_percentages a_pct_modified b_pct_modified map_pvalue effect_size
MYGENE 1 2 . 133.0527343210424 a:14 7809 a:59 508 a:0.18 a:11.61 0.0017928032 0.11614173 0 -0.11333333333333334
MYGENE 2 3 . 225.94086564894315 a:48 7696 a:128 720 a:0.62 a:17.78 0.0062370063 0.17777778 0 -0.17
MYGENE 3 4 . 330.8802245030597 a:48 7542 a:180 788 a:0.64 a:22.84 0.0063643595 0.2284264 1.116233249574274e-19 -0.22333333333333336
MYGENE 4 5 . 493.87030782775764 a:43 7669 a:251 859 a:0.56 a:29.22 0.0056069894 0.29220024 7.378606107449063e-27 -0.2866666666666667
MYGENE 5 6 . 54.474088063362615 a:73 7695 a:59 844 a:0.95 a:6.99 0.00948668 0.069905214 0 -0.060000000000000005

The b_mod_percentages are not close to 100%, as you would expect if the molecule had all adenosine residues substituted with m6A. Either the Remora caller is failing to detect these modified bases, or the IVT is failing to incorporate the bases correctly (I'm not sure if the latter is even possible with your experimental setup). It's likely that the RNA basecaller and Remora modified-base caller are struggling with a molecule containing 100% m6A. We tend to train and evaluate on molecules with an expected base modification rate closer to what we'd see in a biological system, and very densely packed modified bases also tend to be difficult. The map_pvalues are going to be nearly zero because you have very deep coverage.

I would investigate why you're not getting closer to 100% m6A calls on the modified sample; this is likely why the validate output is reporting low accuracy. What do the alignments look like? You have one tenth the coverage in the modified sample; are you losing a lot of molecules in the mapping process? I don't want to be too discouraging since I don't know exactly what you're trying to do, but completely modified samples are going to be challenging to use effectively. I'm certainly keen to help you out here, so if you tell me what the aim of this experiment is, I can give you some pointers.
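
One quick way to check (just a sketch, with placeholder file names) is to compare the mapping statistics of the two barcodes directly:

# read counts and mapping rate per barcode (placeholder file names)
samtools flagstat barcode1_unmod.aligned.bam
samtools flagstat barcode2_mod.aligned.bam

# summary numbers (read lengths, error rate) for the modified barcode
samtools stats barcode2_mod.aligned.bam | grep '^SN'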

A

ArtRand added the question label on Dec 31, 2024
@BrendanBeahan
Author

BrendanBeahan commented Jan 2, 2025

Hello @ArtRand,

My goal is to benchmark several m6A-calling tools and then pick the most appropriate one for our lab's specific use case. To compare m6A callers, I was hoping to have some sort of continuous site-level confidence score, similar to how m6Anet outputs site-level modification probabilities and some of the other tools provide a p-value or FDR, etc. This would serve as the threshold/continuous metric for a precision-recall curve and other benchmarking. However, I am just as happy to use modkit in a comparative manner. I'm really just looking for a single metric I can point to and say, "This is how confident modkit is that this site is modified." Even though it is much lower than expected, perhaps I could just use % modified for this metric? That seems inappropriate, but I'm not entirely sure what alternative I would use; I was also considering the map_pvalue from dmr pair.

Regarding our experimental design, it's a bit complicated. We have two equal samples of a fully modified in-vitro transcribed oligo and an unmodified IVT, which we barcoded and sequenced together. I then used demultiplexing software to partition the reads into unaligned BAMs for barcode 1 (unmodified) and barcode 2 (modified). Notably, at this stage I had already lost many of the modified reads, plausibly due to the abnormal number of modifications causing issues with our sequencing. Regardless, my demultiplexing commands appeared fine to the authors of that software when I contacted them, and their software has been used successfully in similar IVT contexts.

I then used Dorado (which I believe uses minimap2 under the hood) to align these unaligned BAMs to a custom FASTA containing the sequence of the synthetic IVT oligo. At this stage I removed any unmapped reads. These BAMs are then used to filter my initial pod5 files into a modified directory and an unmodified directory, which are in turn fed back into my benchmarking pipeline so that all of my m6A callers can use these now properly partitioned pod5 directories.

From here it is a bit redundant, but I run the dorado basecaller with modified-base calling on these now-filtered pod5 directories, then map with minimap2 to the same reference, and then run modkit pileup, etc.
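
Concretely, that last leg looks roughly like this; the model name, presets, and file names are stand-ins rather than my exact invocations:

# basecall the filtered pod5s with a modified-base model (model/flag are stand-ins)
dorado basecaller --modified-bases m6A sup pod5s_modified/ > mod_calls.bam

# map to the custom IVT reference with minimap2, carrying the MM/ML tags through
# (the map-ont preset is a stand-in; use whatever preset fits the reads)
samtools fastq -T MM,ML mod_calls.bam \
    | minimap2 -ax map-ont -y ivt_reference.fa - \
    | samtools sort -o mod_aligned.bam
samtools index mod_aligned.bam

# aggregate per-site modification calls into a bedMethyl
modkit pileup mod_aligned.bam mod_pileup.bed --ref ivt_reference.fa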

So, in sum, I'm a bit stumped as to why I'm getting such a low % of modified m6A sites in my modified sample. But more importantly for me, I'm unsure how to create a continuous site-level metric showing modkit's confidence that a site is m6A-modified, to use as the threshold for generating something like a precision-recall curve. I suppose most of these other tools use lack of modification as the null hypothesis, by comparing against an unmodified dataset or a model previously trained on a similarly unmodified dataset.

Any ideas would be greatly appreciated.

Best,
Brendan

@BrendanBeahan
Author

BrendanBeahan commented Jan 2, 2025

Also, looking further at the initial modkit pileup output for my modified pod5 directory, I see that the sites that should be modified (i.e. the A's) have an average valid coverage almost two orders of magnitude higher than the non-A/unmodified sites, with basically equal % modified predictions regardless of the base in question.
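
For reference, this is roughly the summary I ran over the pileup bedMethyl; the column positions are assumed from the default modkit pileup layout ($4 = mod code, $10 = valid coverage, $11 = percent modified) and the file name is a placeholder:

# mean valid coverage and mean % modified per mod code (column layout assumed, see above)
awk '{ cov[$4] += $10; pct[$4] += $11; n[$4]++ }
     END { for (c in n) printf "%s\tsites=%d\tmean_valid_cov=%.1f\tmean_pct_mod=%.2f\n",
                               c, n[c], cov[c]/n[c], pct[c]/n[c] }' modified_pileup.bed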

@ArtRand
Contributor

ArtRand commented Jan 3, 2025

Hello @BrendanBeahan,

In order to compare m6A-callers I was hoping to have some sort of continuous site-level confidence score, kind of like how m6Anet outputs site level modification probabilities and some of the other tools provide a p-value or FDR etc. ... “This is how confident modkit is that this site is modified.”

I can see what you're trying to do here, but let me see if I can convince you that you don't need to calculate a site-level confidence score. The tl;dr is: if your per-read, per-base accuracy is highest, you'll also minimize the difference between the estimated %-modified and the actual %-modified.

In the experimental setup you've constructed, you expect that a given position will have an m6A frequency of 100% or 0%. You can use the difference between the percent modification in the bedMethyl and the expected modification level as a metric of accuracy (a rough sketch of that comparison is below). You could aggregate every call for every read and produce a precision-recall plot, confusion matrix, or any other summary statistic; this is what modkit validate does. I'd argue that this approach is more generalizable than trying to classify sites as modified or unmodified, because in reality you're probably not going to have sites where every molecule is modified or canonical.

I took a quick read of the m6anet code and it looks like it will call a site as modified if >=1 read predicts modification, which, in my opinion, discards a lot of the information in your data, and I'm surprised it doesn't lead to a very high false positive rate. At sufficiently high depth, wouldn't every residue be called modified? Surely I'm missing a nuance in how it works. You could do something similar where you use a --mod-threshold a:0.99 or similar and then call a site modified if a single record in the bedMethyl reports modification, but I don't think this is a great idea.
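
As a rough sketch of that comparison (placeholder file names, and assuming the default bedMethyl layout where $4 is the mod code and $11 is percent modified), the per-site deviation from the expected truth is just:

# modified sample: every 'a' site should be ~100% modified
awk '$4 == "a" { err += 100 - $11; n++ }
     END { if (n) printf "modified: mean shortfall from 100%% is %.2f%% over %d sites\n", err/n, n }' mod_pileup.bed

# unmodified sample: every 'a' site should be ~0% modified
awk '$4 == "a" { err += $11; n++ }
     END { if (n) printf "unmodified: mean excess over 0%% is %.2f%% over %d sites\n", err/n, n }' unmod_pileup.bed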

So in sum, I’m a bit stumped why I’m getting such a low % of modified m6A sites in my modified sample.

This is a large warning signal. Could you run the modified sample separately from the unmodified one to test that the demultiplexing is working? From the BED file you sent:

MYGENE 1 2 a . +
MYGENE 2 3 a . +
MYGENE 3 4 a . +
MYGENE 4 5 a . +
MYGENE 5 6 a . +
MYGENE 6 7 - . +
MYGENE 7 8 - . +
MYGENE 8 9 - . +
MYGENE 9 10 - . +
MYGENE 10 11 - . +
MYGENE 11 12 - . +
MYGENE 12 13 - . +
MYGENE 13 14 - . +
MYGENE 14 15 - . +
MYGENE 15 16 a . +

There is a mistake here where you have . as the strand (which means "both") instead of + (schema here). Although I don't think it should matter in this case. The modifications are also in a homopolymer run which is probably going to be difficult for the basecalling model and Remora model to resolve correctly. What do the alignments of the two barcodes look like? Are they full length? What is the difference in the alignment accuracy between the two?

One thing I might try, instead of using 100% m6A and 0% m6A, is a known stoichiometric fraction of m6A. The polymerase will likely prefer to incorporate one nucleotide over the other, but if you add 50% m6A and 50% A to the reaction, I would expect the modification calls at a given site to be close to 50%. We've also compared whole-transcriptome dRNA-seq to the GLORI method and seen good concordance. Ideally, fully synthetic oligos like the ones in the blog post are the best way to compare methods. We don't have the dRNA data publicly released yet, but I hear it's coming.
