How to Filter Sites for High-Confidence Modifications? #346

Open
evans-ZH opened this issue Jan 16, 2025 · 5 comments
@evans-ZH

Hi,
I used modkit pileup to detect modification sites, including m6A, m5C, inosine, and pseU. However, the resulting dataset contains an overwhelming number of sites. I applied the following filter:

Nvalid_cov / (Nvalid_cov + Ndelete + Nfail + Ndiff + Nnocall)  > 0.75

But it still seems that nearly every site is marked as modified.
Additionally, I noticed that the motif of the detected m6A sites at the corresponding location on the reference genome did not match DRACH.
I’d like to know if there are any recommended methods or additional filtering criteria I can apply to identify high-confidence and biologically meaningful modification sites. Thanks!
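For reference, this is roughly how I apply that filter to the pileup output. It is only a sketch: the column positions for Nvalid_cov, Ndelete, Nfail, Ndiff, and Nnocall are my reading of the bedMethyl column layout in the modkit documentation, so they may need adjusting for other versions or outputs.

```bash
# Keep rows where Nvalid_cov / (Nvalid_cov + Ndelete + Nfail + Ndiff + Nnocall) > 0.75.
# Column positions ($10 Nvalid_cov, $15 Ndelete, $16 Nfail, $17 Ndiff, $18 Nnocall)
# are assumed from the documented bedMethyl layout -- check them against your file.
awk '{
  denom = $10 + $15 + $16 + $17 + $18
  if (denom > 0 && $10 / denom > 0.75) print
}' pileup.bed > pileup.pass_cov.bed
```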

@Ge0rges

Ge0rges commented Jan 16, 2025

Hi @evans-ZH, you might want to take a look at the --filter-threshold option in modkit pileup to filter based on the absolute confidence the modification-calling model has in its output. For example, I set it to 0.95 for at least 95% confidence. The default in modkit is to keep 90% of calls by taking anything above the 10th percentile, I believe.
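For example, this is how I run it (just a sketch of my own usage, not official guidance; file names are placeholders):

```bash
# Keep only modification calls whose model confidence is at least 0.95.
modkit pileup reads.bam pileup.bed --filter-threshold 0.95
```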

PS: I am not the developer.

@marcus1487
Contributor

Thanks for reaching out! Could you clarify what you mean by “nearly every site is marked as modified”? Specifically, what criteria are you using to determine modification?

It sounds like basecalling errors might be contributing to modification calls. To address this, you could separate your analysis by base type. For example, when analyzing m6A, you can specify the canonical base using the --motif option in your pileup command: --motif A 0 --ref /path/to/reference.fa. This will focus the analysis on adenines and exclude other base types (e.g., m5C or inosine); a minimal example is shown below. This approach should help identify high-confidence modification sites and reduce noise.
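A minimal sketch of that command (file names are placeholders; keep whatever other options you normally use):

```bash
# Restrict the pileup to adenine positions in the reference, so calls on
# other canonical bases (e.g., m5C, pseU) are left out of this output.
modkit pileup reads.bam m6A_pileup.bed \
  --motif A 0 \
  --ref /path/to/reference.fa
```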

Let me know if you need further assistance!

@evans-ZH
Author

Oh! Thanks, adding --filter-threshold 0.95 has improved the results. There are fewer detected modification sites now, but it still seems to be too many. Does this mean I should increase the --filter-threshold further? I mean, how was the value 0.95 determined? Should I keep increasing it until the results align with theoretical expectations? Previously, I used the default --filter-percentile 0.1, and I'm wondering what the difference between these two parameters is.

I checked the results in IGV, as shown in the attached image (this is after applying --filter-threshold 0.95, where the number of detected modification sites decreased significantly; without it, nearly every site was identified with all possible modifications for the corresponding base: for example, A was identified as both m6A and inosine, C as m5C, and T as pseU). But there are still too many sites, and there is considerable overlap between m6A and inosine.

[image attachment]

Additionally, I noticed that the "percent modified" column generally has low values. Does this mean that many sites have a small fraction of reads carrying different modifications, leading to multiple modification calls at a single site and inflating the number of detected sites? If my understanding is correct, how should I proceed with the next step in my analysis? The number of detected modification sites is overwhelming, and I'd like to identify the most reliable and significant sites.

[image attachment]

I also examined the m6A motif at the corresponding sites on the reference genome using the BED file information, but the motifs didn’t match the expected DRACH pattern. Could you clarify why this might be?

Additionally, I’ve encountered a situation like this, which looks quite odd.

[image attachment]

Finally, I followed your suggestion and added --motif A 0, but the output was almost identical to when it wasn’t included. It seems like this option only removes m5C and pseU from the table for simplicity.

@evans-ZH
Author

I selected sites with percent modified > 1 to plot the m6A motif, and those sites now match DRACH in the reference sequence. I'd like to know whether this approach is correct and whether it also applies to m5C, inosine, and pseU. If not, is there a more general filtering method? Or can you suggest some ways to verify that the detected sites are accurate and important? Thanks!
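For context, this is roughly how I extract the sequence context around the filtered sites to check the motif. It is a sketch that assumes bedtools and an indexed reference FASTA are available, and that percent modified is column 11 of the bedMethyl output, so it may need adjusting.

```bash
# Take sites with percent modified > 1 (assumed column 11), widen each
# single-base record by 2 bp on both sides, and pull the 5-mer from the
# reference (strand-aware via column 6) to compare against DRACH.
awk '$11 > 1' m6A_pileup.bed \
  | bedtools slop -i stdin -g reference.fa.fai -b 2 \
  | bedtools getfasta -fi reference.fa -bed stdin -s -tab \
  > m6A_context_5mers.tsv
```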

@marcus1487
Contributor

Hi @evans-ZH,

Thank you for following up with additional details and questions! It sounds like you’re making good progress in refining your analysis, but I’d like to help clarify some key aspects of how modified bases are detected and reported, which may provide better context for interpreting your results and planning the next steps.

1. Understanding the Fraction of Reads with Modifications

It’s important to note that modification calls in the output represent the fraction of reads supporting a modification at a given position. A low percentage in the “percent modified” column does not necessarily mean the position “is modified” in a biologically significant way. Instead, it might reflect noise, sequencing errors, or model limitations, as the detection process is not absolute.

Given that models have an inherent error rate (typically 0.3%-1.5%, depending on the model and sample characteristics), some positions will naturally appear to have modifications even if they do not. For example, at a 1% per-read error rate and 50x coverage, roughly 40% of completely unmodified positions (1 - 0.99^50 ≈ 0.39) will still show at least one "modified" read, and therefore a small non-zero percent-modified value. So low fraction values should be treated with caution and may not represent biologically meaningful modifications.

2. Filtering for High-Confidence Sites

You’re on the right track with filtering to focus on the most reliable modification calls. The --filter-threshold parameter is particularly useful here because it directly applies a confidence cutoff for modification calls. A value of 0.95 is often a good starting point, but you can increase it further if necessary to achieve a more stringent set of sites.

The difference between --filter-threshold and --filter-percentile is worth noting:
• --filter-threshold applies an absolute confidence cutoff (e.g., retain calls with ≥95% confidence).
• --filter-percentile retains a fixed percentage of the most confident calls, regardless of their absolute scores.

For your goals, relying on --filter-threshold is more intuitive and transparent, as you can directly control the confidence level for inclusion, but be aware that raising the value too high and removing most or all of the calls can also produce very confusing results. Filtering out more than 40%-50% of the total calls is not recommended. The two commands below illustrate the difference.
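To make the distinction concrete, here are two example invocations (file names and values are only illustrative):

```bash
# Absolute cutoff: drop any modification call whose confidence is below 0.95.
modkit pileup reads.bam pileup_thresh.bed --filter-threshold 0.95

# Percentile cutoff: estimate a threshold such that the lowest-confidence 10%
# of calls are removed, whatever absolute confidence that corresponds to.
modkit pileup reads.bam pileup_pct.bed --filter-percentile 0.1
```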

3. Motif and Overlap Concerns

It’s encouraging that filtering for higher confidence and focusing on motifs like DRACH has helped refine your m6A results. If you’re seeing overlap between modifications (e.g., m6A and inosine), this might reflect shared signal regions or noise, especially when percentages are low. To address this:
• Consider analyzing each modification type separately by base and motif (see the sketch after this list). This reduces the risk of conflating signals and helps ensure motif compliance (e.g., DRACH for m6A).
• Cross-reference the detected sites with expected biological patterns or known modifications in your system to confirm their relevance.
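Here is a sketch of that per-modification approach, assuming a modkit version that accepts IUPAC motifs such as DRACH; file paths and the threshold value are placeholders:

```bash
# m6A: restrict calls to the DRACH context (offset 2 marks the A in the motif).
modkit pileup reads.bam m6A_drach.bed \
  --ref /path/to/reference.fa --motif DRACH 2 --filter-threshold 0.95

# m5C: restrict calls to cytosine positions.
modkit pileup reads.bam m5C.bed \
  --ref /path/to/reference.fa --motif C 0 --filter-threshold 0.95
```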

4. Defining Your Research Goals

To move forward effectively, it might help to revisit your research goals. For instance:
• Are you aiming to identify only high-confidence modifications supported by strong biological evidence?
• Are you interested in characterizing global trends, even if some individual calls have lower confidence?

Being clear about your objectives will help you set appropriate thresholds and prioritize the most relevant data for downstream analysis.

5. Next Steps and Suggestions

Here are a few practical suggestions:
• For m6A, continue to refine sites using motif-based filtering (e.g., DRACH compliance). For other modifications like m5C, inosine, or pseU, consider creating similar filters based on known or expected sequence contexts.
• If you suspect model errors, you could validate a subset of key sites using orthogonal methods (e.g. targeted sequencing or chemical assays).
• Evaluate broader trends in your data (e.g. distribution across genes or regions) rather than focusing solely on individual sites with low percentages; a quick tally per modification code (see below) is one easy starting point.
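For example, a quick way to see how each modification behaves after filtering (a sketch; column 4 of the bedMethyl holds the modification code, and the input file name is a placeholder for your filtered pileup):

```bash
# Count filtered sites per modification code to compare modifications at a glance.
cut -f4 filtered_pileup.bed | sort | uniq -c | sort -rn
```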

Let me know if you’d like clarification on any of these points or additional guidance!

Best of luck with your analysis, and feel free to share further updates.
