Chunk process for summary and sample-prob #389

Open
KunFang93 opened this issue Feb 27, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@KunFang93

Hi @ArtRand,

I was wondering whether chunk-based processing (similar to the pileup method) could be added for the --no-sampling option in summary and sample-prob in the future. Currently, the --no-sampling option is very resource-intensive: in my case, processing 150,000 reads requires around 60 GB of RAM. Because my modifications are sparse, --no-sampling seems to be the only viable option I have. I can work around this by splitting my BAM file into smaller segments and then aggregating the results, but it would be ideal if --no-sampling could use a chunked processing strategy like pileup does.

Thanks for your help!

Best,
Kun

@ArtRand
Contributor

ArtRand commented Feb 28, 2025

Hello @KunFang93,

That's a good idea; both of those commands are due for a little refresh. One caution about splitting the BAM: depending on how you're doing it, reads that span the boundary between two splits can get counted in both. Another option is to use modkit extract calls and pipe the table through another filter that calculates the statistics per read. All of the rows for a read will come out together, so you can operate on one read at a time, calculate the %-modified, etc.
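For anyone trying the piping approach, here is a minimal sketch of such a per-read filter. It is not part of modkit. It assumes a tab-separated table with a header row, and the column names `read_id` and `call_code` as well as the use of `-` to mark a canonical (unmodified) call are assumptions, so check them against the header of your own output. Because rows for a read are adjacent, only one read is held in memory at a time.

```python
import csv
from itertools import groupby
from operator import itemgetter


def per_read_fraction_modified(lines, read_col="read_id",
                               code_col="call_code", canonical_code="-"):
    """Yield (read_id, n_calls, n_modified, fraction_modified) per read.

    `lines` is any iterable of text lines (for example sys.stdin).
    The column names and the canonical code are assumptions; pass
    different values if your table uses other names.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    # groupby only merges adjacent rows, which is enough when all rows
    # for a read come out together.
    for read_id, rows in groupby(reader, key=itemgetter(read_col)):
        n_calls = 0
        n_modified = 0
        for row in rows:
            n_calls += 1
            if row[code_col] != canonical_code:
                n_modified += 1
        yield read_id, n_calls, n_modified, n_modified / n_calls
```

To use it as a filter, iterate over `per_read_fraction_modified(sys.stdin)` and print each tuple. If the rows for a read were ever not contiguous, the same read would show up more than once in the output, so that is worth checking on a small file first.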

Calculating the pass thresholds is a little more complicated. Right now the percentiles are calculated naively, but exactly. I can already think of a few ways to be more clever about calculating the percentiles without using as much memory. Thanks for the use case and the pressure; I'll see what I can do.
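As an illustration of one memory-bounded way to get a percentile (a sketch only, not what modkit does or plans to do), the values can be counted into a fixed number of bins instead of being stored and sorted. This assumes the inputs are probabilities in [0, 1]. The function name and the choice of 256 bins are hypothetical; 256 is used here because the SAM `ML` tag stores modification probabilities as 8-bit integers. Memory use is constant regardless of the number of reads, at the cost of a result that is only accurate to one bin width.

```python
def histogram_percentile(probs, q, n_bins=256):
    """Approximate the q-th percentile (0 < q <= 100) of values in [0, 1].

    Values are counted into n_bins equal-width bins, so memory does not
    grow with the number of values. Returns the upper edge of the bin in
    which the cumulative count first reaches q percent of the total.
    """
    counts = [0] * n_bins
    total = 0
    for p in probs:
        # Clamp so that p == 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
        total += 1
    if total == 0:
        raise ValueError("no values to summarize")

    target = q / 100 * total
    cumulative = 0
    for idx, count in enumerate(counts):
        cumulative += count
        if cumulative >= target:
            return (idx + 1) / n_bins
    return 1.0
```

Since `probs` can be any iterable, a generator that streams values from a file works without loading them all at once. Per-bin counts from separate chunks could also be summed before the final scan, which is one way a chunked run might be combined.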

ArtRand added the question label (Looking for clarification on inputs and/or outputs) Feb 28, 2025
@KunFang93
Author

Thanks for your suggestion! I will try it. Looking forward to seeing the new tricks in old functions :)

ArtRand added the enhancement label (New feature or request) and removed the question label (Looking for clarification on inputs and/or outputs) Mar 5, 2025
@ArtRand
Contributor

ArtRand commented Mar 5, 2025

Reopening this to track the work.

ArtRand reopened this Mar 5, 2025