Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phased GT encoding #1872

Open
yuw444 opened this issue Jan 8, 2025 · 2 comments
Open

Phased GT encoding #1872

yuw444 opened this issue Jan 8, 2025 · 2 comments

Comments

@yuw444
Copy link

yuw444 commented Jan 8, 2025

I am working a project using <htslib/vcf.h>. In particular, I am trying to access gene type (GT) info. I've shared my findings below.

  1. GT is an encoded array with allele and phase info.
  2. Idx is the index for alleles w.r.t the ref allele(0). For instance,

idx = 0(ref), 1, 2, 3(alt)

  1. when phased(|), the corresponding encoded values are

val = 3, 5, 7, 9;

  1. when unphased (), the corresponding encoded values are

val = 2, 4, 6, 8;

    // Macros for setting genotypes correctly, for use with bcf_update_genotypes only; idx corresponds
    // to VCF's GT (1-based index to ALT or 0 for the reference allele) and val is the opposite, obtained
    // from bcf_get_genotypes() below.
    #define bcf_gt_phased(idx)      (((idx)+1)<<1|1)
    #define bcf_gt_unphased(idx)    (((idx)+1)<<1)
    #define bcf_gt_missing          0
    #define bcf_gt_is_missing(val)  ((val)>>1 ? 0 : 1)
    #define bcf_gt_is_phased(idx)   ((idx)&1)   // ??? should it be  bcf_gt_is_phased(val)???
    #define bcf_gt_allele(val)      (((val)>>1)-1)

https://github.com/samtools/htslib/blob/c705bec2af9caedacc6ba80dfba7fe625eb5ba5c/htslib/vcf.h#L1019C1-L1028C1

Later, one can also decode its val back to idx and phase. However, when I am parsing an example phased bcf file as attached, I found the following

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ID1       ID2
chr1    54490   .       G       A       .       .       GC=0.394015     GT:AD:Ldev:Bdev:AS      1|0:19,19:0:0:0 ./.:.:.:.:.
chr1    634553  .       G       A       .       .       GC=0.511222     GT:AD:Ldev:Bdev:AS      ./.:.:.:.:.     0|1:14,21:0.002:0.158:-1
  1. The first record chr1:54490, GT=1|0; ./., according to the above encode algorithm, my projected values should be
    (((1) + 1) << 1 | 1) = 2 * 2 + 1 = 5,
    (((0) + 1) << 1 | 1) = 1 * 2 + 1 = 3.
    However, the observed values are 4 and 3 from (int8_t *)fmt->p.

  2. The second record chr1:634553, GT=./.;0:1. Similarly, the projected values are 3 and 5. However, the observed values are 2 and 5.

This leads to my conjecture that (int8_t *)fmt->p [0] is not been considered when encoding somewhere.

Then, I checked for the unphased same bcf file, the encoded values are (4, 2) and (2,4), which is consistent with the above code snippets.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ID1       ID2
chr1    54490   .       G       A       .       .       GC=0.394015     GT:AD:Ldev:Bdev:AS      1/0:19,19:0:0:0 ./.:.:.:.:.
chr1    634553  .       G       A       .       .       GC=0.511222     GT:AD:Ldev:Bdev:AS      ./.:.:.:.:.     0/1:14,21:0.002:0.158:-1

Albeit, the current coding scheme doesn’t affect bcf_gt_allele at all, and has some effect when evaluating bcf_gt_is_phased(val).

I’m not sure if there is any special consideration for position 0. Otherwise, I would be more than happy to work with you to look into the details. I have attached the code I used to read.

test_read_bcf.txt

@daviesrob
Copy link
Member

Your code looks incorrect. See bcf_format_gt() to see how to read GT values (up to VCF4.3). Compared with your code, note that only one value is accessed for each GT. It is also not valid C to cast the pointer into the GT data to a wider type, as it may not be correctly aligned. That's why bcf_format_gt() internally uses uint8_t pointers and functions like le_to_i16 or le_to_i32 to access the data.

@yuw444
Copy link
Author

yuw444 commented Jan 13, 2025

@daviesrob , please see this bcf_all_phased. I feel I used the same approach as what it does. However, for loop can still assign sample_phased = 1 as long as p[1] is odd even if p[0] is not odd.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants