Proper steps for cellSNP and Vireo for large dataset #61

Open
vincycheng opened this issue Jun 14, 2022 · 4 comments

Comments

@vincycheng

Hi. I have been working on a dataset with 20K cells, and I have a few questions about how to approach it.

Q1: Running cellSNP on the whole BAM file was taking forever (more than 15 days), so I followed a suggestion I saw and split the BAM file by chromosome, which gave me one cellSNP output per chromosome. I then merged them. Is there a better or preferred way to merge them for Vireo?
What I am currently doing is: bcftools merge, then bcftools sort

Q2: For Vireo, I used the merged VCF file (1.8GB) mentioned above as $CELL_DATA. I also have the $DONOR_GT_FILE (744KB), which I subset with bcftools view as suggested. The issue is that Vireo seems to use a lot of memory, and it is hard for me to estimate how much memory I need to reserve.
The command I used is: vireo -c $CELL_DATA -d $DONOR_GT_FILE -o $OUT_DIR

Please advise. Thanks!

@huangyh09
Collaborator

Hi, thanks for the questions.

Q1: please use cellsnp-lite; it is re-implemented in C/C++ and is much faster and more memory-efficient.

Q2: vireo supports loading the sparse matrices directly, so it won't touch the large cellSNP.cells.vcf.gz. Memory usage is generally OK for 20K cells, and I guess ~20GB of memory should be sufficient. Otherwise, how many SNPs are there in your CELL_DATA folder (you can get the number from cellSNP.base.vcf.gz)?

Hope these help.
Yuanhua
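
For readers who want to answer the "how many SNPs" question above, here is a minimal sketch that counts the records in cellSNP.base.vcf.gz using only the Python standard library. The function name `count_vcf_records` is made up for this illustration and is not part of cellsnp-lite or vireo; it simply counts non-header lines.

```python
import gzip


def count_vcf_records(path):
    """Count variant records (non-header, non-empty lines) in a VCF file.

    Hypothetical helper for illustration; handles both plain and
    gzip-compressed files based on the file extension.
    """
    opener = gzip.open if path.endswith(".gz") else open
    n_records = 0
    with opener(path, "rt") as handle:
        for line in handle:
            # Header lines start with '#'; skip them and any blank lines.
            if line.strip() and not line.startswith("#"):
                n_records += 1
    return n_records


# Example usage (the path is an assumption about where CELL_DATA lives):
# n_snps = count_vcf_records("CELL_DATA/cellSNP.base.vcf.gz")
# print(n_snps)
```

Reading the file line by line keeps memory use flat regardless of how many sites the file contains.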

@vincycheng
Author

Hi, things seem to work well now after switching to cellsnp-lite. Thanks!

@hsymoon
hsymoon commented May 27, 2024

> Hi, thanks for the questions.
>
> Q1: please use [cellsnp-lite](https://cellsnp-lite.readthedocs.io); it is re-implemented in C/C++ and is much faster and more memory-efficient.
>
> Q2: vireo supports loading the sparse matrices directly, so it won't touch the large cellSNP.cells.vcf.gz. Memory usage is generally OK for 20K cells, and I guess ~20GB of memory should be sufficient. Otherwise, how many SNPs are there in your CELL_DATA folder (you can get the number from cellSNP.base.vcf.gz)?
>
> Hope these help.
> Yuanhua

Hello, I got a MemoryError when running vireo in mode 2. My command is: vireo -c $sc_vcf -d 2donor.sorted.vcf.gz -o ${OUT_DIR} -N 4 --randSeed 2 --genoTag PL
Here is the information about $sc_vcf, from bcftools +counts $sc_vcf:
Number of samples: 9808
Number of SNPs: 747932
Number of INDELs: 120289
Number of MNPs: 0
Number of others: 0
Number of sites: 868210
Can you help me? Thanks very much.
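
To put the counts above in perspective, here is a rough back-of-the-envelope sketch of how a cells-by-sites table scales. It is not a measurement of what vireo actually allocates. The 8-byte values, 4-byte indices, coordinate-style sparse layout, and the 1% fraction of non-zero entries are all assumptions chosen only for illustration.

```python
def dense_bytes(n_cells, n_sites, bytes_per_value=8):
    """Bytes needed to hold every (site, cell) entry explicitly."""
    return n_cells * n_sites * bytes_per_value


def sparse_bytes(n_cells, n_sites, density, bytes_per_value=8, bytes_per_index=4):
    """Bytes for a coordinate-style sparse layout: one value plus a
    row index and a column index per non-zero entry.

    `density` is the assumed fraction of non-zero entries.
    """
    n_nonzero = int(n_cells * n_sites * density)
    return n_nonzero * (bytes_per_value + 2 * bytes_per_index)


# Counts taken from the bcftools +counts output above.
n_cells, n_sites = 9808, 868210
assumed_density = 0.01  # assumption, not measured from this dataset

print(f"dense table:  {dense_bytes(n_cells, n_sites) / 1e9:.1f} GB")
print(f"sparse table: {sparse_bytes(n_cells, n_sites, assumed_density) / 1e9:.1f} GB")
```

Under these assumptions a single fully populated table of this shape is already in the tens of gigabytes, while a sparse representation is far smaller.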

@huangyh09
Collaborator

Hi, thanks for sharing the issue. It looks similar to Q2 above, so instead of passing the sc_vcf file, try passing the CELL_DATA folder that cellsnp-lite writes as its output. Vireo will then load the sparse matrices directly and skip parsing the VCF file.
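
Before pointing vireo at the folder, it can help to confirm that the sparse-matrix files are actually there. The sketch below assumes the usual cellsnp-lite output names (cellSNP.base.vcf or cellSNP.base.vcf.gz, cellSNP.samples.tsv, cellSNP.tag.AD.mtx, cellSNP.tag.DP.mtx); please check them against your own output directory. The helper `missing_cellsnp_files` is hypothetical and not part of either tool.

```python
import os

# File names assumed from typical cellsnp-lite output; verify locally.
REQUIRED_FILES = ("cellSNP.samples.tsv", "cellSNP.tag.AD.mtx", "cellSNP.tag.DP.mtx")
BASE_VCF_NAMES = ("cellSNP.base.vcf.gz", "cellSNP.base.vcf")


def missing_cellsnp_files(cell_data_dir):
    """Return the expected files that are absent from cell_data_dir.

    An empty list means the folder looks usable as the -c argument.
    """
    present = set(os.listdir(cell_data_dir))
    missing = [name for name in REQUIRED_FILES if name not in present]
    # The base VCF may be gzip-compressed or plain; either one is enough.
    if not any(name in present for name in BASE_VCF_NAMES):
        missing.append(BASE_VCF_NAMES[0])
    return missing


# Example usage (the folder name is an assumption):
# print(missing_cellsnp_files("CELL_DATA"))
```

If the list comes back empty, the folder itself can be given to vireo in place of the VCF, for example: vireo -c $CELL_DATA -d $DONOR_GT_FILE -o $OUT_DIR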
