Replaced pooled global Aes with population average Aes #155

standage · 2024-09-24T19:16:36Z

The default A_e values currently in MicroHapDB are calculated from a pool of all observed haplotypes in the entire global 26-population data set. This PR changes the build process so that the default values are computed as the mean of the 26 population-specific A_e values, rather than from the entire pool.
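To make the difference concrete, here is a minimal sketch contrasting the two calculations, using A_e = 1 / sum(p_i^2) as in the build code. The population names, haplotype labels, counts, and the `effective_alleles` helper are hypothetical and chosen only for illustration; they are not MicroHapDB data or code.

```python
from collections import Counter

# Hypothetical haplotype counts for one marker in two populations (toy data).
pop_counts = {
    "PopA": Counter({"A,C": 8, "G,T": 2}),
    "PopB": Counter({"A,C": 2, "G,T": 8}),
}


def effective_alleles(counts):
    """Return A_e = 1 / sum(p_i^2) for a Counter of haplotype counts."""
    total = sum(counts.values())
    return 1.0 / sum((n / total) ** 2 for n in counts.values())


# Old behavior: pool every observed haplotype, then compute a single A_e.
pooled = Counter()
for counts in pop_counts.values():
    pooled.update(counts)
pooled_ae = effective_alleles(pooled)

# New behavior: compute A_e per population, then take the mean.
per_pop_aes = [effective_alleles(counts) for counts in pop_counts.values()]
mean_ae = sum(per_pop_aes) / len(per_pop_aes)

print(f"pooled A_e = {pooled_ae:.3f}")
print(f"mean population A_e = {mean_ae:.3f}")
```

With these toy counts the pooled value is 2.0, while the mean of the population-specific values is roughly 1.47, because pooling mixes two populations whose haplotype frequencies differ.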

standage · 2024-09-25T17:51:25Z

dbbuild/sources/byrskabishop2022/util.py

-    agg_tallies = defaultdict(Counter)
    for n, row in haplotypes.iterrows():
        for haplokey in ("Haplotype1", "Haplotype2"):
            mhallele = [row[haplokey]]
            if not pd.isna(mhallele):
-                # The following line could arguably be moved into the conditional block below to
-                # excluded admixed individuals from the aggregate haplotype tallies. But as of
-                # today, I think including them in the aggregate totals is appropriate.
-                # -- DSS, 2023-02-28.
-                agg_tallies[row["Marker"]].update(mhallele)
                pop_tallies[row["Marker"]][row["Population"]].update(mhallele)
                if row["Population"] not in admixed:
                    pop_tallies[row["Marker"]][row["Superpopulation"]].update(mhallele)
    for marker, popcounts in sorted(pop_tallies.items()):
-        total_count = sum(agg_tallies[marker].values())
-        for mhallele, agg_count in sorted(agg_tallies[marker].items()):
-            freq = agg_count / total_count
-            yield marker, "1KGP", mhallele, freq, total_count


No need to calculate pooled frequencies any more.

standage · 2024-09-25T17:52:19Z

dbbuild/sources/byrskabishop2022/util.py

+    superpops = ("AFR", "AMR", "EAS", "EUR", "SAS")
    for marker, marker_data in frequencies.groupby("Marker"):
+        population_aes = list()
        for population, pop_data in marker_data.groupby("Population"):
            ae = 1.0 / sum([f**2 for f in pop_data.Frequency])
            entry = (marker, population, ae)
            aes.append(entry)
+            if population not in superpops:
+                population_aes.append(ae)
+        avg_ae = sum(population_aes) / len(population_aes)
+        entry = (marker, "1KGP", avg_ae)
+        aes.append(entry)


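Instead, we average the population-level A_e values here, skipping the five superpopulation rows so that only the individual populations contribute to the "1KGP" default.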
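The averaging step in the hunk above can be sketched as a standalone function. This version swaps the pandas `groupby` calls for plain dictionaries so that it runs with the standard library alone; the `compute_aes` name, the tuple layout of `records`, and the `markerA` data are assumptions made for illustration, not the project's actual interface.

```python
from collections import defaultdict

# Superpopulation labels, as listed in the diff above.
SUPERPOPS = ("AFR", "AMR", "EAS", "EUR", "SAS")


def compute_aes(records):
    """Compute per-population A_e values plus a "1KGP" average per marker.

    `records` is an iterable of (marker, population, frequency) tuples;
    this layout is an assumption made for the sketch.
    """
    by_marker = defaultdict(lambda: defaultdict(list))
    for marker, population, freq in records:
        by_marker[marker][population].append(freq)
    aes = []
    for marker, by_pop in sorted(by_marker.items()):
        population_aes = []
        for population, freqs in sorted(by_pop.items()):
            ae = 1.0 / sum(f**2 for f in freqs)
            aes.append((marker, population, ae))
            # Superpopulation rows keep their own A_e but are left out of the mean.
            if population not in SUPERPOPS:
                population_aes.append(ae)
        aes.append((marker, "1KGP", sum(population_aes) / len(population_aes)))
    return aes


# Toy frequencies for a hypothetical marker.
records = [
    ("markerA", "CEU", 0.5), ("markerA", "CEU", 0.5),
    ("markerA", "YRI", 0.25), ("markerA", "YRI", 0.25),
    ("markerA", "YRI", 0.25), ("markerA", "YRI", 0.25),
    ("markerA", "EUR", 1.0),
]
for entry in compute_aes(records):
    print(entry)
```

In this toy input the `EUR` row still gets its own entry, but only the `CEU` and `YRI` values feed the `1KGP` average.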

Daniel Standage added 7 commits September 24, 2024 15:13

Global Aes are now population averages rather than pooled [skip ci]

3d2d271

Update change log

f9e055f

Merge branch "master"

b6e7e48

Fixing code [skip ci]

5684ec5

New 1KGP Ae scores

6fb60b5

Put new data into place [skip ci]

1707b77

Fix test suite

c9a1628

standage commented Sep 25, 2024

Daniel Standage added 2 commits September 25, 2024 13:54

Remove extraneous file

467677e

Troubleshoot CI

45cd4ba

standage merged commit 763a577 into master Sep 25, 2024
4 checks passed

standage deleted the fix/aes branch September 25, 2024 18:07
