Negative GQ values after vcf_to_hdf5 #341
Comments
Looks like integer overflow. You can pass a
Ah, thanks very much! I just added Cheers for the fast response!
Cool, nice work on the comparisons by the way. It's useful to have this type of thing out there. Please drop a note to the Google group if you write this up into a blog post or similar. I might reopen this, just to decide whether it's worth having an explicit check on the max value of GQ; otherwise it could catch people out. Not an easy thing to trace. What is the source of your VCF? If something mainstream is producing GQs > 99, we should look at supporting this. NB: scikit-allel/allel/io/vcf_read.py, line 1392 in 5f2f73f, is the relevant line that defaults to i1.
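The exact argument suggested earlier in the thread is cut off in this copy, so the following is only a sketch of one way to widen the GQ storage type, assuming the `types` argument of `allel.vcf_to_hdf5` is what was meant. The helper names and file paths are hypothetical; the `calldata/GQ` field name and the `i2` type string follow scikit-allel's conventions.

```python
# Assumption: override the 1-byte default for GQ with a 2-byte signed integer.
GQ_TYPES = {"calldata/GQ": "i2"}

INT8_MAX = 127  # largest value a 1-byte signed integer ("i1") can hold


def needs_wider_type(max_gq):
    """Return True if the largest GQ in a file would not fit in an i1 field."""
    return max_gq > INT8_MAX


def convert(vcf_path, h5_path):
    """Convert a VCF to HDF5 with GQ stored as int16 (hypothetical helper)."""
    import allel  # scikit-allel, imported lazily so the checks above run without it

    allel.vcf_to_hdf5(vcf_path, h5_path, fields="*", types=GQ_TYPES, overwrite=True)


if __name__ == "__main__":
    # GQ values from the VCF row quoted in this issue: 156, 117, 147.
    print(needs_wider_type(max(156, 117, 147)))
```

A check like `needs_wider_type` is roughly the kind of explicit guard on the maximum GQ discussed above; whether it belongs in the library or in user code is left open here.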
Hi, I was comparing some different VCF extraction packages and have been really impressed by the extraction speed in the h5 output of vcf_to_hdf5. However, I was doing some basic tests just to check outputs were the same from each method and have found some strange behaviour in the scikit-allel output.
Specifically I found that my GQ values >128 appear to have a negative offset...
Here is an example plotting GQ against sample DP and you can see the (orange) scikit-allel result has this strange offset:
For example, row 265 of the VCF is:
chr17 1005289 . A G 431133 PASS AC=6;AF=0.823;AN=6;DP=109787;FS=0;MQ=249.96;MQRankSum=5.615;QD=3.94;ReadPosRankSum=3.074;SOR=0.7 GT:AD:AF:DP:GQ:FT:F1R2:F2R1:PL:GP 1/1:0,53:1:53:156:PASS:.:.:244,159,0:206,156,0 1/1:0,40:1:40:117:PASS:.:.:205,120,0:167.3,117.3,0 1/1:0,50:1:50:147:PASS:.:.:235,150,0:197.4,147.4,0
You can see that the correct GQ for sample 1 is 156, but the value I get through scikit-allel is -100:
I found 271 cases where the GQ is offset, and in each case it is offset by either 2^8 (256) or 2^11 (2048), which can't be a coincidence.
Any idea what might be causing this issue?
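The offsets reported here are what one would expect if GQ were stored in an 8-bit signed integer, which holds only -128 to 127. The sketch below models that wrap-around in plain Python; `wrap_int8` is a hypothetical helper, not part of scikit-allel, and the value 2100 is invented to illustrate a 2048 offset.

```python
def wrap_int8(value):
    """Model how an integer wraps when stored in an 8-bit signed field."""
    return (value + 128) % 256 - 128


# GQ values for the three samples in the VCF row quoted above.
for gq in (156, 117, 147):
    stored = wrap_int8(gq)
    print(gq, stored, gq - stored)

# 156 is stored as -100, matching the value reported in this issue.
assert wrap_int8(156) == -100

# The difference between true and stored values is always a multiple of 256,
# so an offset of 2048 (8 * 256) fits the same explanation.
hypothetical_gq = 2100  # made-up value, not taken from the VCF
assert hypothetical_gq - wrap_int8(hypothetical_gq) == 2048
```

Values of 127 or below pass through unchanged, which would explain why only the larger GQs look wrong.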