benchmarks and speedups #56
Comments
Hi @darked89 Thanks for the feedback. 10 hours does seem slow. It typically takes 2-4 hours for densely imputed data (10M variants). How many variants are you mapping? The process is very I/O intensive, so fast storage will vastly improve performance. Thanks
Dear Matt, the TSV input has more than 16M rows/positions. Compressing it with bgzip and indexing it with tabix unfortunately did not reduce the time needed to process it:
Profiling the running process (use the real PID of the python3 process running gwas2vcf)
showed that the program spends >95% of its time executing update_dbsnp (gwas.py:94). I do not know whether, e.g., a newer pysam (not that it is an easy change to make) would improve the speed. Best, DK
Dear Matt, for large sets of summary stats, e.g. from Finngen, a quick hack to try is to reduce the size of the dbSNP VCF file: create a customized mini-dbSNP VCF by selecting only the dbSNP entries relevant to the given biobank. I still have to check that the results obtained this way are identical to the ones obtained using the whole dbSNP. Can you think of any reason they might differ? Best, DK
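A minimal sketch of the mini-dbSNP idea, for anyone who wants to try it. This is not the procedure used in this thread. It assumes the summary-stats TSV has a header with chromosome and position columns (the names `#chrom` and `pos` below are hypothetical defaults, adjust them to the file at hand) and that matching on chromosome and position alone is enough. The function names are made up for illustration.

```python
import csv
import gzip


def open_text(path):
    """Open a plain or gzip/bgzip-compressed file as text."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def load_positions(tsv_path, chrom_col="#chrom", pos_col="pos"):
    """Collect (chromosome, position) pairs from the summary-stats TSV.

    The column names are assumptions, not taken from the thread.
    """
    with open_text(tsv_path) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {(row[chrom_col], int(row[pos_col])) for row in reader}


def subset_vcf(vcf_in, vcf_out, positions):
    """Copy header lines plus records whose CHROM/POS is in `positions`.

    Returns the number of records kept. The output is an uncompressed VCF.
    """
    kept = 0
    with open_text(vcf_in) as src, open(vcf_out, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # keep the full header
                continue
            if not line.strip():
                continue  # skip blank lines
            chrom, pos, _rest = line.split("\t", 2)
            if (chrom, int(pos)) in positions:
                dst.write(line)
                kept += 1
    return kept
```

The resulting file would still need to be compressed with bgzip and indexed with tabix before use. Whether position-only matching selects every record that would be found in the full dbSNP file is exactly what has to be verified by comparing the outputs.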
Hi @darked89 It should give the same results. I doubt the performance would improve, though, since the SNP lookup uses tabix rather than reading the whole dbSNP VCF. Do you observe a performance improvement? Thanks
Hi Matt, I have used just a small subset of the original Finngen (chromosome 22).
Maybe there are some delays in accessing the drive in our setup, but it looks like, in some environments, the size of the VCF does affect how fast the indexed VCF file can be queried. Hope it helps, DK
Thanks @darked89! That's a huge difference in performance. I will investigate. If you would like to try it, I created a new branch. Thanks
Hello, I was unsure whether some silly snafu was giving me a corrupted or incomplete result VCF really, really fast, so I computed md5sums on the non-header portions of the outputs (whole_dbSNP vs Finngen_subset_ofdbSNP):
and the headers contain different commands:
So it does not look like some late-night error produced results too good to be true. Or so I hope.. ;) DK
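For reference, a comparison of this kind can be sketched in Python as below. This is an illustration, not the commands that were actually run: it hashes every line that does not start with `#`, which is roughly what piping the file through `grep -v '^#' | md5sum` would do. The function name `body_md5` is made up for the example.

```python
import gzip
import hashlib


def body_md5(vcf_path):
    """Return the MD5 hex digest of the non-header lines of a VCF.

    Lines starting with '#' (meta-information and the column header) are
    skipped, so two files that differ only in their headers hash the same.
    Files ending in .gz are read through gzip, which also handles bgzip.
    """
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    digest = hashlib.md5()
    with opener(vcf_path, "rb") as handle:
        for line in handle:
            if line.startswith(b"#"):
                continue
            digest.update(line)
    return digest.hexdigest()
```

Two outputs can then be compared with `body_md5(path_a) == body_md5(path_b)`. Equal digests only say the records are byte-identical and in the same order. They say nothing about whether the records themselves are correct.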
Hello,
I have completed TSV to VCF transformation of one Finngen GWAS summary file as a test case.
gwas2vcf was run in a Singularity container using:
Both genomic fasta and dbSNP VCF had chromosome ids in the same 1-22,X,Y,MT format, were indexed etc.
The output VCF format:
This was executed in 582.99 min (user time 577.51 min, system time 5.38 min).
My questions
Thank you
Darek Kedra