You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am working a project using <htslib/vcf.h>. In particular, I am trying to access gene type (GT) info. I've shared my findings below.
GT is an encoded array with allele and phase info.
Idx is the index for alleles w.r.t the ref allele(0). For instance,
idx = 0(ref), 1, 2, 3(alt)
when phased(|), the corresponding encoded values are
val = 3, 5, 7, 9;
when unphased (), the corresponding encoded values are
val = 2, 4, 6, 8;
// Macros for setting genotypes correctly, for use with bcf_update_genotypes only; idx corresponds
// to VCF's GT (1-based index to ALT or 0 for the reference allele) and val is the opposite, obtained
// from bcf_get_genotypes() below.
#define bcf_gt_phased(idx) (((idx)+1)<<1|1)
#define bcf_gt_unphased(idx) (((idx)+1)<<1)
#define bcf_gt_missing 0
#define bcf_gt_is_missing(val) ((val)>>1 ? 0 : 1)
#define bcf_gt_is_phased(idx) ((idx)&1) // ??? should it be bcf_gt_is_phased(val)???
#define bcf_gt_allele(val) (((val)>>1)-1)
Later, one can also decode its val back to idx and phase. However, when I am parsing an example phased bcf file as attached, I found the following
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ID1 ID2
chr1 54490 . G A . . GC=0.394015 GT:AD:Ldev:Bdev:AS 1|0:19,19:0:0:0 ./.:.:.:.:.
chr1 634553 . G A . . GC=0.511222 GT:AD:Ldev:Bdev:AS ./.:.:.:.:. 0|1:14,21:0.002:0.158:-1
The first record chr1:54490, GT=1|0; ./., according to the above encode algorithm, my projected values should be (((1) + 1) << 1 | 1) = 2 * 2 + 1 = 5, (((0) + 1) << 1 | 1) = 1 * 2 + 1 = 3.
However, the observed values are 4 and 3 from (int8_t *)fmt->p.
The second record chr1:634553, GT=./.;0:1. Similarly, the projected values are 3 and 5. However, the observed values are 2 and 5.
This leads to my conjecture that (int8_t *)fmt->p [0] is not been considered when encoding somewhere.
Then, I checked for the unphased same bcf file, the encoded values are (4, 2) and (2,4), which is consistent with the above code snippets.
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ID1 ID2
chr1 54490 . G A . . GC=0.394015 GT:AD:Ldev:Bdev:AS 1/0:19,19:0:0:0 ./.:.:.:.:.
chr1 634553 . G A . . GC=0.511222 GT:AD:Ldev:Bdev:AS ./.:.:.:.:. 0/1:14,21:0.002:0.158:-1
Albeit, the current coding scheme doesn’t affect bcf_gt_allele at all, and has some effect when evaluating bcf_gt_is_phased(val).
I’m not sure if there is any special consideration for position 0. Otherwise, I would be more than happy to work with you to look into the details. I have attached the code I used to read.
Your code looks incorrect. See bcf_format_gt() to see how to read GT values (up to VCF4.3). Compared with your code, note that only one value is accessed for each GT. It is also not valid C to cast the pointer into the GT data to a wider type, as it may not be correctly aligned. That's why bcf_format_gt() internally uses uint8_t pointers and functions like le_to_i16 or le_to_i32 to access the data.
@daviesrob , please see this bcf_all_phased. I feel I used the same approach as what it does. However, for loop can still assign sample_phased = 1 as long as p[1] is odd even if p[0] is not odd.
I am working a project using <htslib/vcf.h>. In particular, I am trying to access gene type (GT) info. I've shared my findings below.
https://github.com/samtools/htslib/blob/c705bec2af9caedacc6ba80dfba7fe625eb5ba5c/htslib/vcf.h#L1019C1-L1028C1
Later, one can also decode its
val
back toidx
and phase. However, when I am parsing an example phased bcf file as attached, I found the followingThe first record
chr1:54490, GT=1|0; ./.
, according to the above encode algorithm, my projected values should be(((1) + 1) << 1 | 1) = 2 * 2 + 1 = 5,
(((0) + 1) << 1 | 1) = 1 * 2 + 1 = 3.
However, the observed values are
4
and3
from(int8_t *)fmt->p
.The second record
chr1:634553, GT=./.;0:1
. Similarly, the projected values are3
and5
. However, the observed values are2
and5
.This leads to my conjecture that
(int8_t *)fmt->p [0]
is not been considered when encoding somewhere.Then, I checked for the unphased same bcf file, the encoded values are
(4, 2)
and(2,4)
, which is consistent with the above code snippets.Albeit, the current coding scheme doesn’t affect
bcf_gt_allele
at all, and has some effect when evaluatingbcf_gt_is_phased(val)
.I’m not sure if there is any special consideration for position 0. Otherwise, I would be more than happy to work with you to look into the details. I have attached the code I used to read.
test_read_bcf.txt
The text was updated successfully, but these errors were encountered: