Allow linkage disequilibrium computations to be done from phased input data #163

Open
nabil161289 opened this issue Nov 2, 2023 · 3 comments

Comments

@nabil161289

A clear description of your feature request

Calculate linkage disequilibrium (LD) from phased data, and compare phased and unphased LD via a sign test. Also, for 2-locus haplotypes, calculate the LD statistics from phased data where unphased data is currently being used.
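
For reference, a sign test on paired LD estimates could look roughly like the sketch below. The function name `sign_test` and the idea of pairing one phased and one unphased estimate per locus pair are assumptions made for illustration; nothing here is existing PyPop functionality.

```python
from math import comb


def sign_test(phased, unphased):
    """Two-sided exact sign test on paired values; tied pairs are dropped."""
    diffs = [x - y for x, y in zip(phased, unphased) if x != y]
    n = len(diffs)
    if n == 0:
        return 1.0  # every pair tied: no evidence of a difference
    n_pos = sum(d > 0 for d in diffs)
    k = min(n_pos, n - n_pos)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Each list would hold one LD value (for example D' or Wn) per locus pair, in the same order, and the function returns the p-value.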

Describe the solution you'd like

The LD estimates (D, D', Wn, and the 2-locus pairwise LD statistics D and D') should be calculated from phased data, together with the squared correlation coefficient (r^2).

Describe alternatives you've considered

Using the Pould package in R

Additional context

No response

@nabil161289 nabil161289 added the enhancement New feature or request label Nov 2, 2023
@alexlancaster alexlancaster added this to the Nice-To-Have milestone Nov 2, 2023
@alexlancaster
Owner

Thanks again for the report, @nabil161289. I think the long-term plan is to deprecate [Emhaplofreq] in favour of [Haplostats]. Haplostats is a wrapper around the R package haplo.stats. An undocumented alpha version of Haplostats is already contained within PyPop, but it is not yet ready for production release.

At a glance, the upstream R package does mention phased genotypes, but I'm not sure whether we're currently including that functionality, or whether it's easy or even possible to do so. @rsingle may know more about that.

@alexlancaster
Owner

I think that computing the LD measures directly from the phased input will have to wait until we finish off the Haplostats module (which includes the wrapper around haplo.stats). The idea is that the haplo.stats C code will be used solely to do the EM haplotype estimation, and the rest of the LD computations currently done inside the emhaplofreq program will be moved into Python.

Once that is done, it will be easier to have the LD computations work either on phased haplotypes taken directly from the input data or on the haplotype estimates coming out of haplo.stats, by using the same Python code for both. Haplostats is partly implemented, but it's not completely there, and this would be another enhancement on top of what we already planned. All of this is fairly non-trivial, so it may be a while.
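
As a rough sketch of what such a shared Python routine might look like: the function below takes two-locus haplotype frequencies and does not care whether they came from direct counts of phased data or from EM estimates. The name `pairwise_ld` and the input layout (a dict keyed by allele pairs) are assumptions for illustration, not the planned PyPop API. It uses the standard definitions D_ij = f_ij - p_i * q_j, Hedrick's multi-allelic D', and Wn (Cramer's V).

```python
from collections import defaultdict
from math import sqrt


def pairwise_ld(hap_freqs):
    """Return (D', Wn) for two loci.

    hap_freqs maps (allele_at_locus1, allele_at_locus2) -> haplotype
    frequency, with the frequencies summing to 1.
    """
    p = defaultdict(float)  # allele frequencies at locus 1
    q = defaultdict(float)  # allele frequencies at locus 2
    for (a, b), f in hap_freqs.items():
        p[a] += f
        q[b] += f

    d_prime = 0.0
    chi_sum = 0.0  # running sum of D_ij^2 / (p_i * q_j)
    for a, pa in p.items():
        for b, qb in q.items():
            if pa * qb == 0:
                continue  # allele listed but never observed
            d = hap_freqs.get((a, b), 0.0) - pa * qb
            if d < 0:
                d_max = min(pa * qb, (1 - pa) * (1 - qb))
            else:
                d_max = min(pa * (1 - qb), (1 - pa) * qb)
            if d_max > 0:
                d_prime += pa * qb * abs(d) / d_max
            chi_sum += d * d / (pa * qb)

    k = min(len(p), len(q)) - 1
    wn = sqrt(chi_sum / k) if k > 0 else 0.0
    return d_prime, wn
```

For two biallelic loci, Wn squared equals the usual r^2, so the same routine covers that case as well.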

@alexlancaster alexlancaster changed the title Phased linkage disequilibrium Allow linkage disequilibrium computations to be done from phased input data Nov 4, 2023
@sjmack
Collaborator

sjmack commented Nov 13, 2023

In theory, since phased LD calculations rely only on counts of phased variants, which in PyPop TSV format would be identified as (for two alleles of a 3 locus haplotype for two subjects):

Row X (locus1_1*allele & locus2_1*allele & locus3_1*allele) and (locus1_2*allele & locus2_2*allele & locus3_2*allele)
Row Y (locus1_1*allele & locus2_1*allele & locus3_1*allele) and (locus1_2*allele & locus2_2*allele & locus3_2*allele)

and using GL String format would already be formatted as (for two subjects):

> Subject X locus1_1*allele~locus2_1*allele~locus3_1*allele+locus1_2*allele~locus2_2*allele~locus3_2*allele
> Subject Y locus1_1*allele~locus2_1*allele~locus3_1*allele+locus1_2*allele~locus2_2*allele~locus3_2*allele

all that would be needed (using the Haplotype Specific Homozygosity approach to calculating ALD) would be to count the variants that have already been entered into PyPop, because no EM is needed to generate the counts/frequencies. So phased LD should be easier to calculate than unphased LD.
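
The counting step described above can be sketched as follows. This is a minimal illustration, not PyPop code: the function names and the example alleles are made up, and it assumes each subject is given as one GL String with `~` joining the alleles on one haplotype and `+` separating the two haplotypes.

```python
from collections import Counter


def parse_phased_genotype(gl_string):
    """Split one phased GL String into its two haplotypes (tuples of alleles)."""
    haplotypes = [tuple(hap.split("~")) for hap in gl_string.split("+")]
    if len(haplotypes) != 2:
        raise ValueError(
            f"expected two haplotypes, got {len(haplotypes)}: {gl_string!r}"
        )
    return haplotypes


def count_haplotypes(gl_strings):
    """Count every observed haplotype directly; no EM estimation is involved."""
    counts = Counter()
    for gl_string in gl_strings:
        counts.update(parse_phased_genotype(gl_string))
    return counts


def haplotype_frequencies(counts):
    """Turn raw haplotype counts into relative frequencies."""
    total = sum(counts.values())
    return {hap: n / total for hap, n in counts.items()}


# Hypothetical example data: two subjects typed at three loci.
subjects = [
    "A*01:01~B*08:01~DRB1*03:01+A*02:01~B*07:02~DRB1*15:01",
    "A*01:01~B*08:01~DRB1*03:01+A*03:01~B*07:02~DRB1*15:01",
]
counts = count_haplotypes(subjects)
freqs = haplotype_frequencies(counts)
```

The resulting frequencies could then be fed into whatever LD statistic is wanted, in place of the EM estimates.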
