Q: How many fragments are generally sufficient for hmmratac function? #687

Open
leqi0001 opened this issue Feb 12, 2025 · 3 comments

leqi0001 commented Feb 12, 2025

Hi,

I have a big cohort of scATAC-seq data, with about 1 billion fragments from each of 24 pools of experiments (24 billion total). I've been trying to run hmmratac on these 24 fragment files with --cutoff-analysis-only, but it has been running for a week. The log says it downsampled to 800 million fragments for training, but the step that generates the short, mono-, di-, and tri-nucleosomal signals has been running for 3 days. Should I downsample the fragment files before running hmmratac? There must be diminishing returns with more fragments, but I'm not sure how to decide the degree of downsampling.

Appreciate any suggestions!

Contributor

taoliu commented Feb 12, 2025

  1. 800 million reads is still too many for a single run on the human genome. You can down-sample further, to about 50 million.
  2. As for scATAC, you may want to select a subset of cells to call peaks.
  3. Did you use MACS3 v3.0.2?
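Points 1 and 2 can be combined in one pass over a fragment file. The following is a minimal sketch, not part of MACS3: it assumes a gzip-compressed, tab-separated fragment file with the cell barcode in the fourth column, and the function name `downsample_fragments` and all of its parameters are hypothetical. Each fragment is kept with probability `target / total`, so the output size is only approximately `target`.

```python
import gzip
import random


def downsample_fragments(src, dst, total, target=50_000_000, barcodes=None, seed=0):
    """Randomly keep roughly `target` fragments out of `total`.

    src, dst : paths to gzip-compressed, tab-separated fragment files
               (assumed columns: chrom, start, end, barcode, count).
    total    : number of candidate fragment lines, counted beforehand.
               If `barcodes` is given, count only fragments of those cells.
    barcodes : optional set of cell barcodes; fragments from other cells are dropped.
    Returns the number of fragment lines written.
    """
    rng = random.Random(seed)              # fixed seed makes the subsample reproducible
    keep_prob = min(1.0, target / total)   # per-fragment keep probability
    kept = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for line in fin:
            if line.startswith("#"):       # pass header comment lines through unchanged
                fout.write(line)
                continue
            if barcodes is not None:
                fields = line.rstrip("\n").split("\t")
                if fields[3] not in barcodes:
                    continue               # cell not in the selected subset
            if rng.random() < keep_prob:
                fout.write(line)
                kept += 1
    return kept
```

Because lines are streamed, memory use stays flat regardless of file size, and the input order (usually coordinate-sorted) is preserved in the output.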

Author

leqi0001 commented Feb 12, 2025

@taoliu
Thanks for your reply! I'm trying downsampling and will see what happens. Is there any downside to using too many reads other than memory/time consumption, such as increased noise?

Yes, I used MACS3 v3.0.2.

Contributor

taoliu commented Feb 20, 2025

@leqi0001 First, it will be a waste of money if your sample is already saturated. Second, it will take more time and memory to process, especially for hmmratac. Lastly, for methods that call peaks based on a p-value cutoff, such as callpeak, the p-values will be overly optimistic when the sample size is large. In that case, effect size (fold change) should be considered together with the p-value cutoff.
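The last point can be illustrated with a plain Poisson tail test. This is only a sketch under stated assumptions: it uses a generic Poisson upper-tail p-value as a stand-in and does not reproduce callpeak's internals; the function names and the thresholds `min_nlp` and `min_fc` are hypothetical. At a fixed fold change, the p-value keeps shrinking as depth grows, so a p-value cutoff alone will eventually pass even very weak enrichment.

```python
import math


def neg_log10_poisson_pvalue(observed, expected):
    """-log10 of P(X >= observed) for X ~ Poisson(expected).

    Assumes integer observed > expected > 0, so the tail terms shrink
    geometrically and the loop below converges quickly.
    """
    # log of the first tail term: exp(-lam) * lam^k / k!
    log_first = -expected + observed * math.log(expected) - math.lgamma(observed + 1)
    rel, term, i = 1.0, 1.0, observed
    while term > 1e-17:            # add the remaining terms relative to the first one
        i += 1
        term *= expected / i
        rel += term
    return -(log_first + math.log(rel)) / math.log(10)


def passes(observed, expected, min_nlp=5.0, min_fc=2.0):
    """Require both a small p-value and a minimum fold change (hypothetical cutoffs)."""
    fold_change = observed / expected
    return neg_log10_poisson_pvalue(observed, expected) >= min_nlp and fold_change >= min_fc


# Same fold change (1.1x over background) at increasing depth:
for expected in (10, 1_000, 10_000):
    observed = int(expected * 1.1)
    nlp = neg_log10_poisson_pvalue(observed, expected)
    print(f"expected={expected:>6} observed={observed:>6} "
          f"-log10(p)={nlp:6.2f} passes={passes(observed, expected)}")
```

With the fold-change requirement in place, the 1.1x region is rejected at every depth, while the p-value on its own would accept it once the counts are large enough.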
