Do you use modkit dmr? #361

Open
ArtRand opened this issue Jan 31, 2025 · 3 comments
Labels
good first issue Good for newcomers

Comments

@ArtRand
Contributor

ArtRand commented Jan 31, 2025

Hello everyone.

I'd like to know if you're having issues with modkit dmr either in the pair or multi variety.

If you're not using it (but you are doing some kind of differential methylation analysis), why not?

Are the outputs hard to interpret, not helpful, or not compatible with other methods?
Is it too slow or (worse) are there bugs?

One thing that's on my immediate roadmap is to compare an open dataset to a published tool such as DSS. I'm also experimenting with a method to get p-values for regions so you could find significantly differentially methylated regions.

If you're using it and liking it, throw a 👍 on here for fun. But don't hold back if there are things that could be better. Of course I'm not promising I can get to all of them.

@ArtRand ArtRand added the good first issue Good for newcomers label Jan 31, 2025
@kylepalos

I've been using DMR quite a bit and it has been fast and intuitive! Thanks to the devs for making Modkit a very user-friendly tool!

I do have two very minor questions that I couldn't really find answers to elsewhere.
In both cases, I usually perform paired site specific analyses, such as:

modkit dmr pair \
-a sample1_rep1.bed.gz -a sample1_rep2.bed.gz \
-b sample2_rep1.bed.gz -b sample2_rep2.bed.gz \
-o DMR.bed \
--ref reference.fasta \
--base A --base T \
--min-valid-coverage 10
  1. When analyzing the outputs with balanced replicates, would you recommend always analyzing the balanced effect sizes and p-values (rather than the unbalanced/raw values)? The effect sizes seem to agree between raw and balanced, but the p-values agree less; see the attached scatter plots below. I'm not sure if this is expected behavior or if something about my analysis may be off.

  2. This one is even more minor. I often analyze modification mutants where the effects are quite strong and a substantial fraction of my p-values (balanced or raw) == 0. I realize the exact p-value past a certain point isn't very interesting/informative, but I was wondering whether the range of reporting could/should be expanded beyond ~1e-50? This would just allow me to not have a massive clump of points at a very similar -log10(p-value) on volcano plots and similar graphics. Again, extremely minor and not actually a Modkit issue.
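For the volcano-plot clumping in question 2, a common plotting-side workaround is to clamp very small p-values to a floor before taking -log10 and to flag the clamped points so they can be drawn with a different marker. The sketch below is only an illustration: the function name `neg_log10_capped` and the floor of 1e-50 are assumptions chosen for this example, not anything modkit provides or documents.

```python
import math


def neg_log10_capped(p_values, floor=1e-50):
    """Return (score, was_capped) pairs for a list of p-values.

    `floor` is an arbitrary plotting choice (hypothetical, not a modkit
    setting): any p-value below it, including an exact 0.0, is clamped to
    the floor so that -log10 stays finite.
    """
    results = []
    for p in p_values:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"p-value out of range: {p}")
        capped = p < floor
        # max() replaces zeros and tiny values with the floor before the log.
        score = -math.log10(max(p, floor))
        results.append((score, capped))
    return results


# Made-up p-values for illustration only.
example = neg_log10_capped([0.0, 1e-60, 0.03, 1.0])
capped_flags = [was_capped for _, was_capped in example]
```

Plotting the flagged points with a distinct symbol (for example, an upward triangle at the cap) makes it clear that their true significance lies beyond the axis limit rather than at exactly that value.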

[Images: scatter plots comparing raw and balanced effect sizes and p-values]

Thanks a lot!

@ArtRand
Contributor Author

ArtRand commented Feb 8, 2025

@kylepalos Thanks for this!

When analyzing the outputs with balanced replicates, would you recommend always analyzing the balanced effect sizes and p-values (rather than the un-balanced/raw values)? The effect sizes seem to be agreeable b/w raw and balanced, but p-values agree less, see attached scatter plots below. I'm not sure if this is expected behavior or if something about my analysis may be off.

Let me take a look into this.

@Ge0rges

Ge0rges commented Feb 23, 2025

I wanted to chime in here in support of this command. I have explored many different methods for quantifying how differently methylated a nucleotide is in different samples. I've typically looked for:

  • Significance metric (p-value)
  • Effect size metric
  • Whether methylation type is taken into account
  • On regions, whether position is taken into account
  • Whether methylation fraction is taken into account
  • Whether the test can be corrected for differences in coverage
  • Whether the test can take advantage of replicates
  • Whether the test can handle different number of replicates AND different coverage per replicate

This is a tall order. I started with modkit dmr, went around the block a few (many) times, and have finally settled on modkit dmr pair, which satisfies all these criteria. I think the tool you've developed is excellent. Exploring a contribution that integrates it into anvi'o is on my to-do list for the future (perhaps at a workshop in September).

For me the only thing lacking is a robust study testing the command's output on a controlled dataset, perhaps including a benchmark to some other relevant statistical tests. Perhaps that will one day be conducted by a member of the scientific community. That's on your TODO! Great!

The output is perfectly suitable for me. I've built a small software suite that reads that data in along with other modkit outputs, genetic annotations, etc. to produce relevant plots that allow for a nice analysis.

Thanks for developing it.

3 participants