You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Calculation of linkage disequilibrium via phased data and comparing phased and unphased LD via a Sign test. Also for 2-locus haplotypes, calculation of LD statistics via phased data if unphased data is being currently used.
Describe the solution you'd like
The estimates of LD (D, D prime, Wn, and the 2-locus pairwise LD statistics D and D prime) to be calculated via phased data, together with the correlation coefficient (r^2).
Describe alternatives you've considered
Using Pould package in R
Additional context
No response
The text was updated successfully, but these errors were encountered:
Thanks again for the report, @nabil161289 . I think the long-term plan is to deprecate [Emhaplofreq] in favour of [Haplostats]. Haplostats is a wrapper around the R package haplo.stats. There is already an alpha version of Haplostats (undocumented) already contained within PyPop, but it is not yet ready for production release.
At a glance, the upstream R package does mention phased genotypes, but I'm not whether we're currently including functionality, and if it's easy or even possible to do that. @rsingle may know more about that.
I think that computing the LD measures directly from the phased input will have to wait until we finish off the Haplostats module (which includes the wrapper around haplo.stats). The idea is that haplo.stats C code will be used solely do the EM haplotype estimation, and the rest of the LD computations currently done inside the emhaplofreq program will be moved into Python.
Once that is done, it will be easier to have the LD computations work off phased input directly from the input data, or input out of haplo.stats, by using the same Python to do both. Haplostats is partly implemented, but it's not completely there, and this would be another enhancement on top of what we already planned. All of this is fairly non-trivial, so it may be a while.
alexlancaster
changed the title
Phased linkage disequilibrium
Allow linkage disequilibrium computations to be done from phased input data
Nov 4, 2023
In theory, since phased LD calculations rely only on counts of phased variants, which in PyPop TSV format would be identified as (for two alleles of a 3 locus haplotype for two subjects):
Row X (locus1_1*allele & locus2_1*allele & locus3_1*allele) and (locus1_2*allele & locus_2*allele & locus3_2*allele)
Row Y (locus1_1*allele & locus2_1*allele & locus3_1*allele) and (locus1_2*allele & locus_2*allele & locus3_2*allele)
and using GL String format would already be formatted as (for two subjects):
> Subject X locus1_1*allele~locus2_1*allele~locus3_1*allele+locus1_2*allele~locus_2*allele~locus3_2*allele
> Subject Y locus1_1*allele~locus2_1*allele~locus3_1*allele+locus1_2*allele~locus_2*allele~locus3_2*allele
all that would be needed (using the Haplotype Specific Homozygosity approach to calculating ALD) would be to count the variants that have already been entered into PyPop, because no EM is needed to generate the counts/frequencies. So phased LD should be easier to calculate than unphased LD.
A clear description of your feature request
Calculation of linkage disequilibrium via phased data and comparing phased and unphased LD via a Sign test. Also for 2-locus haplotypes, calculation of LD statistics via phased data if unphased data is being currently used.
Describe the solution you'd like
The estimates of LD (D, D prime, Wn, and the 2-locus pairwise LD statistics D and D prime) to be calculated via phased data, together with the correlation coefficient (r^2).
Describe alternatives you've considered
Using Pould package in R
Additional context
No response
The text was updated successfully, but these errors were encountered: