assess the precision of the 4mC ratio #344

Open
hannan666666 opened this issue Jan 15, 2025 · 2 comments
Labels
question Looking for clarification on inputs and/or outputs

Comments

@hannan666666

Hello, I am working on quantifying the ratio of 4mC in mouse samples, but I have encountered a challenge. According to published papers, 4mC is very rare in mammals. Could you provide some guidance on how I can assess the precision of the 4mC ratio reported by modkit? Additionally, do you have any strategies to improve its precision, such as setting a higher threshold for the analysis? Thank you very much!

bases             C
total_reads_used  10042
count_reads_C     10042
pass_threshold_C  0.640625
base  code   pass_count  pass_frac  all_count  all_frac
C     -      33096024    0.9303225  35700393   0.905164
C     m      1632598     0.045892   2118287    0.053708013
C     21839  846164      0.0237855  1622119    0.041127943

@ArtRand
Contributor

ArtRand commented Jan 16, 2025

Hello @hannan666666,

We recommend testing base modification models on synthetic strands. We've recently published a blog post describing how we derive the model performance metrics. Unfortunately, the 4mC validation data hasn't been released publicly yet.

I ran a test on the validation data I have, using the latest models ([email protected]_4mC_5mC@v3) and attached the pass confusion matrix from modkit validate.

> Call probability threshold: 0.6836
> Percent of modified base calls removed: 9.98%
> Filtered accuracy: 96.85%
> Filtered modified base calls contingency table
                  Called Base
         ┌───────┬────────┬────────┬────────┐
         │       │ C      │ 21839  │ m      │
         ├───────┼────────┼────────┼────────┤
 Ground  │ C     │ 97.83% │  1.75% │  0.42% │
 Truth   │ 21839 │  1.10% │ 98.78% │  0.12% │
         │ m     │  0.45% │  0.02% │ 99.52% │
         └───────┴────────┴────────┴────────┘
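
One way to read the table above, as a rough sketch rather than anything modkit does itself: treat each row as the distribution of calls for a given ground-truth base, then ask what fraction of passing calls would be labelled 4mC (code 21839) for an assumed true composition. The compositions below are hypothetical, and the sketch assumes the validation matrix (obtained at threshold 0.6836 on synthetic strands) transfers to a real sample, which may not hold.

```python
# Row-normalised pass confusion matrix copied from the modkit validate output.
# Outer key: ground truth; inner key: called base. "21839" is 4mC, "m" is 5mC.
CONFUSION = {
    "C":     {"C": 0.9783, "21839": 0.0175, "m": 0.0042},
    "21839": {"C": 0.0110, "21839": 0.9878, "m": 0.0012},
    "m":     {"C": 0.0045, "21839": 0.0002, "m": 0.9952},
}


def expected_call_fractions(true_fracs):
    """Expected fraction of calls per code, given true fractions per code."""
    return {
        called: sum(true_fracs[truth] * CONFUSION[truth][called] for truth in CONFUSION)
        for called in CONFUSION
    }


def precision_4mc(true_fracs):
    """Share of 4mC calls that come from truly 4mC bases (0.0 if no 4mC calls)."""
    called_4mc = expected_call_fractions(true_fracs)["21839"]
    if called_4mc == 0:
        return 0.0
    return true_fracs["21839"] * CONFUSION["21839"]["21839"] / called_4mc


# Hypothetical composition 1: no true 4mC at all, 5% true 5mC.
no_4mc = {"C": 0.95, "21839": 0.0, "m": 0.05}
# Hypothetical composition 2: 1% true 4mC, 5% true 5mC.
some_4mc = {"C": 0.94, "21839": 0.01, "m": 0.05}

background_4mc = expected_call_fractions(no_4mc)["21839"]
print(f"4mC call fraction with no true 4mC: {background_4mc:.4%}")
print(f"4mC precision at 1% true 4mC: {precision_4mc(some_4mc):.1%}")
```

Under these assumptions, a sample with no true 4mC would still show a 4mC call fraction of a bit under 2%, driven almost entirely by the 1.75% of canonical C called as 4mC. Comparing that background level with the observed `pass_frac` for 21839 gives a first impression of how much of the signal could be false positives.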

The threshold value I'm getting isn't much higher than yours. There will always be a trade-off: raising the --filter-threshold removes more low-confidence calls but lowers the sensitivity of the model. What I would do is look at the output from modkit sample-probs and pick a threshold value for 4mC that corresponds to roughly the 15th-20th percentile.
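
A minimal sketch of the percentile idea, assuming the per-call 4mC probabilities have already been pulled out into a plain list (how to obtain them depends on the modkit version and output format, so that step is left out). The `percentile` helper and the sample values are hypothetical and only illustrate the arithmetic; this is not modkit's own implementation.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest-rank definition: rank = ceil(pct/100 * n), 1-indexed.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical 4mC call probabilities, for illustration only.
probs_4mc = [0.52, 0.58, 0.61, 0.66, 0.71, 0.77, 0.83, 0.90, 0.95, 0.99]

for pct in (15, 20):
    print(f"{pct}th percentile threshold for 4mC: {percentile(probs_4mc, pct)}")
```

Calls with a probability below the chosen value would be filtered out, so a higher percentile trades sensitivity for fewer low-confidence 4mC calls.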

ArtRand added the question (Looking for clarification on inputs and/or outputs) label Jan 16, 2025
@hannan666666
Author

Thank you very much for your kind and informative reply! If possible, could you share the species and the 4mC fraction of your validation sample? My sample is from a mouse, and the 4mC fraction I observed is 0.041127943. Based on your experience, do you think this value is unusually high for mammals? I would greatly appreciate any insights you could provide.

Thank you again for your time and support!
