
Refactor genetics calculations to allow import to be triggered separately #650

Closed
bbimber wants to merge 9 commits

Conversation

@bbimber (Collaborator) commented Sep 4, 2023

This PR contains two main categories of changes. The motivation was to separate the process of computing the data for EHR genetics (e.g. kinship and inbreeding) from actually importing the data into the EHR. The overall idea is that PRIMe-seq (which already has a mirrored copy of pedigree data) will execute the default EHR GeneticsCalculation pipeline and farm the computation out to our cluster, which that server is already configured to do. When complete, this pipeline already saves the results as TSV files. I wrote a separate ETL, to be defined in ONPRC modules, which copies the resulting TSVs to a location visible to PRIMe and then pings PRIMe via a new server-side action, causing PRIMe to take those TSVs and call EHRService.standaloneProcessKinshipAndInbreeding to actually import them.

The changes within EHR itself primarily refactor the portions of the code that import the TSVs to be static, allow them to be called separately, and expose this through EHRService. These changes are essentially a refactor that leaves the underlying logic largely untouched.

When I got into the weeds of the R code, I noticed a number of other things that seemed worth cleaning up. These are not directly related to the importing of data, but should be broadly useful:

  • I did some general style improvements in the R code and removed some ancient, strange patterns. This resulted in more code being touched, but most changes are minor.
  • Since we now depend on dplyr anyway, I used it more, which should reduce the memory footprint because we can stream data frames.
  • I think I understand the intent behind "options(error = dump.frames)", but I don't think it was giving the result it should. In my hands, not only was this not writing anything to a file, the script did not die on errors. I'd argue it's a lot more useful for the script to die when there's an error (such as kinship2 not being installed) than to continue onward and accumulate strange downstream errors. I tested this pretty thoroughly locally, and I think R's default error handling (i.e. don't specify anything with options()) does as good a job as anything here. Here is an example (which requires a login) showing how the existing 23.7 code ignores R errors: https://prime-seq.ohsu.edu/pipeline-status/Internal/PMR/details.view?rowId=433466. (A minimal fail-fast sketch follows this list.)
  • Marty's addition of the Java-side validation step seems useful; however, I realized we already have all the information we need to replicate that in-memory within R. Using the input pedigree data, we can generate a table of expected coefficients and just merge that with the kinship data (a sketch of this also follows below). The advantage is that this is really fast, so we can easily do it every time. As-is, the pipeline job just logs the result; however, it would not be hard to respect the existing 'kinshipValidation' flag.
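
Purely as an illustration of the fail-fast behavior described above (the PR itself simply relies on R's default non-interactive error handling), here is a minimal sketch of an explicit handler; everything in it is assumed for illustration and is not taken from the actual script:

```r
# Illustrative sketch only: print a traceback and exit non-zero on any
# uncaught error, so the pipeline job fails instead of continuing past errors.
if (!interactive()) {
  options(error = function() {
    traceback(2)
    quit(save = "no", status = 1, runLast = FALSE)
  })
}

# With the handler above (or R's defaults under Rscript), a failure such as a
# missing package stops the job immediately:
library(kinship2)
```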
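
And a sketch of the merge-based validation idea: derive the set of parent/offspring pairs from the input pedigree and join them against the long-format kinship results. All object and column names here (ped, kinshipLong, Id/Dam/Sire, coefficient) are assumptions for illustration, not the actual script's:

```r
library(dplyr)

# ped:         input pedigree with columns Id, Dam, Sire (assumed names)
# kinshipLong: kinship() output reshaped to long format with columns
#              Id1, Id2, coefficient, containing both orderings of each pair

# Every animal/parent pair should have kinship >= 0.25 (exactly 0.25 when the
# parent is not inbred and the two parents are unrelated)
expected <- bind_rows(
  ped %>% filter(!is.na(Dam))  %>% transmute(Id1 = Id, Id2 = Dam),
  ped %>% filter(!is.na(Sire)) %>% transmute(Id1 = Id, Id2 = Sire)
)

validation <- expected %>%
  left_join(kinshipLong, by = c("Id1", "Id2")) %>%
  mutate(flagged = is.na(coefficient) | coefficient < 0.25)

if (any(validation$flagged)) {
  print(filter(validation, flagged))
  warning("Parent/offspring pairs missing from kinship results or with unexpectedly low coefficients")
}
```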

- Make R scripts exit immediately on error
- Bugfix to expected kinship/validation
@bbimber (Collaborator, Author) commented Sep 5, 2023

@dnicolalde: as part of touching the kinship scripts, I'd like to address known issues. One limitation is that there are instances where a given animal is a hybrid of two species. As it stands, the kinship script runs species-by-species, which is clearly a problem there. Let's say there is a rhesus macaque whose sire was a pigtail macaque. I don't think just adding the one missing sire to the rhesus data frame is adequate, since in theory there are more ancestors that could affect this (granted, that's probably rather rare). One possible solution might be to update the R code to identify animals with a cross-species parent, and then merge those species and run kinship on them together. Provided IDs are unique between species, the only downside is memory, and when these scripts were written around 2010 memory was more limiting than it probably is today. I think ONPRC has a couple of legacy hybrid animals, and I will give this a look with those.
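
A rough sketch of that approach, assuming a pedigree data frame with Id, Dam, Sire, and Species columns (all names are placeholders, not the actual script's):

```r
library(dplyr)

# ped: pedigree with columns Id, Dam, Sire, Species (assumed names)
speciesOf <- ped %>% select(Id, Species)

# Identify animals whose recorded parent belongs to a different species
hybrids <- ped %>%
  left_join(speciesOf, by = c("Dam" = "Id"), suffix = c("", ".dam")) %>%
  left_join(speciesOf, by = c("Sire" = "Id"), suffix = c("", ".sire")) %>%
  filter(
    (!is.na(Species.dam) & Species.dam != Species) |
    (!is.na(Species.sire) & Species.sire != Species)
  )

# Species that share at least one hybrid would be pooled and passed to
# kinship() as a single unit; everything else still runs species-by-species.
speciesToMerge <- union(
  hybrids$Species,
  union(hybrids$Species.dam[!is.na(hybrids$Species.dam)],
        hybrids$Species.sire[!is.na(hybrids$Species.sire)])
)
```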

- Refactor kinship script to optionally merge species where hybrids are present and process together
@bbimber (Collaborator, Author) commented Sep 5, 2023

A couple of changes of note were just added:

  1. Changed GeneticCalculationsRTask such that the 'validateKinship' property is written to the XML, rather than read at runtime. That's essential if this task will be run on a remote server with no access to the DB.
  2. Refactored the kinship script to further reduce memory usage (a rough sketch of the idea follows this list).
  3. Linking to 22.7 fb kinship comparison #503: I added an opt-in flag for mergeSpeciesWithHybrids. As I said above, the only fully accurate way I can currently think of to handle individuals with hybrid parents is to process both species together as a unit. If this flag is set, rather than processing species-by-species, the R code will look for hybrids, and if any are found it will join those species and run them through kinship() together.
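
A sketch of the kind of memory reduction behind item 2: rather than holding results for every species in memory at once, convert each species' dense kinship matrix to a long, nonzero-only table and append it to the output TSV as you go. The function, file, and column names here are illustrative assumptions, not the actual populateKinship.r code:

```r
library(kinship2)
library(dplyr)

writeKinshipForSpecies <- function(pedSubset, outFile, append = FALSE) {
  # Dense n x n matrix of kinship coefficients for this species
  # (missing parents assumed coded as NA in the Sire/Dam columns)
  km <- kinship(pedSubset$Id, pedSubset$Sire, pedSubset$Dam)

  # Reshape to long format and drop zero coefficients, which is where the
  # bulk of the memory/disk savings comes from
  longDf <- as.data.frame(as.table(km), stringsAsFactors = FALSE) %>%
    rename(Id1 = Var1, Id2 = Var2, coefficient = Freq) %>%
    filter(coefficient > 0)

  # Append each species' rows to the TSV instead of accumulating in memory
  write.table(longDf, outFile, sep = "\t", row.names = FALSE,
              col.names = !append, append = append, quote = FALSE)
  rm(km, longDf)
  invisible(NULL)
}
```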

I've been testing this using the full ONPRC pedigree. It's fairly easy to run against another pedigree if any center is willing to share the results of pedigree.sql as a TSV.

@dnicolalde (Collaborator) commented

I looked at your changes yesterday. I can run this against our pedigree data and share the results. I would like to verify that the data generated is not changing that much.

@bbimber (Collaborator, Author) commented Sep 6, 2023

> I looked at your changes yesterday. I can run this against our pedigree data and share the results. I would like to verify that the data generated is not changing that much.

OK. Other than hybrids, I would expect zero changes in coefficients, so I'd be interested to know how that looks.

@spamhurts (Contributor) commented

Hi Ben, I compared the results from the new populateKinship.r script with the results on our production server and found there were no differences in the number of pairs or the coefficients generated. The only change I saw was the addition of the species column in the results.

@bbimber (Collaborator, Author) commented Sep 20, 2023

@labkey-martyp and @labkey-jeckels: any comments on this?

@labkey-martyp (Contributor) commented

> @labkey-martyp and @labkey-jeckels: any comments on this?

@bbimber I'm still testing this out. I tried this on one full set of client data and this PR was missing some of the kinship relationships. I have not had time to investigate, but I would guess there are some assumptions made about the data, possibly around null or unknown values, that are affecting the results. I can dig more and get more info by the end of the week. But it does look like the validation in the R script is using the same assumptions, since it did not catch those missing relationships, whereas the current validation, which uses a different query, did pick them up.

If you're blocked on this for setting up your cluster calculations, it might make sense to split this into two PRs: one with just the import functionality and the other with the R script refactors.

@bbimber (Collaborator, Author) commented Sep 20, 2023

> > @labkey-martyp and @labkey-jeckels: any comments on this?
>
> @bbimber I'm still testing this out. I tried this on one full set of client data and this PR was missing some of the kinship relationships. I have not had time to investigate, but I would guess there are some assumptions made about the data, possibly around null or unknown values, that are affecting the results. I can dig more and get more info by the end of the week. But it does look like the validation in the R script is using the same assumptions, since it did not catch those missing relationships, whereas the current validation, which uses a different query, did pick them up.
>
> If you're blocked on this for setting up your cluster calculations, it might make sense to split this into two PRs: one with just the import functionality and the other with the R script refactors.

I don't suppose you're able to share a pedigree file? Is it TNPRC or JHU?

The one difference between the R-based and Java-based checks is that the R-based check relies on the data that makes it into pedigree.sql. I noticed your Java code pulled from demographicsFamily/demographicsOffspring/demographicsSiblings. Those are independent SQL scripts and therefore could easily be applying subtly different filters. This could be the combination each uses for including or not including supplemental pedigree data, or whatever other sources of data might exist.

Simply looking at row counts and/or joins on the results of pedigree.sql and those three queries might be informative.
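
For example, something along these lines in R/dplyr, assuming each query has been exported to a TSV (the file and column names are placeholders, not the actual query outputs):

```r
library(dplyr)
library(readr)

# Placeholder file names; each is the exported result of the corresponding query
ped    <- read_tsv("pedigree.tsv")            # pedigree.sql
family <- read_tsv("demographicsFamily.tsv")  # demographicsFamily

# Parent/offspring pairs present in demographicsFamily but absent from pedigree.sql
pairsOf <- function(df) {
  bind_rows(
    df %>% transmute(Id, Parent = Dam),
    df %>% transmute(Id, Parent = Sire)
  ) %>% filter(!is.na(Parent))
}

missingFromPedigree <- anti_join(pairsOf(family), pairsOf(ped), by = c("Id", "Parent"))
nrow(missingFromPedigree)
```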

@bbimber (Collaborator, Author) commented Sep 20, 2023

@labkey-martyp: to follow up on this specific sentence you wrote:

> But it does look like the validation in the R script is <not?> using the same assumptions as it did not catch those missing relationships, where the current validation, using a different query, did pick them up.

If you are intending to say that the kinship pedigree/script does not catch a certain relationship, but the Java validation process does catch it, and this is explicitly because they use different SQL queries as input, then I would ask: "what exactly are we trying to catch here?" If the actual difference is due to different SQL queries returning different results, then wouldn't it be a heck of a lot more direct to do a SQL test of pedigree.sql against those queries and ask whether discrepancies should exist? If there are, should the respective SQL get updated? Said another way, I think it's important to separate the questions of "are we giving R identical/complete input data" and "does the kinship package give the expected results".

@dnicolalde (Collaborator) commented

@bbimber I was able to set this code up in a test instance of our EHR and it looks like there might be a problem. The file generated on our production server is around 786 MB, and the one this new code generated is 417 MB. It is currently adding the data into the corresponding dataset, and then I will be able to see how many rows I am missing. Any idea why it would create a file half the size? Both started with the same Pedigree table, which contains 20,316 rows.

@dnicolalde (Collaborator) commented

Follow-up: it finished importing, and it only generated 12,924,820 rows, compared to the current production server, which generates 28,219,394. I can work with Marty to see where it is having problems with the calculation.

@bbimber (Collaborator, Author) commented Sep 26, 2023

> Follow-up: it finished importing, and it only generated 12,924,820 rows, compared to the current production server, which generates 28,219,394. I can work with Marty to see where it is having problems with the calculation.

Thanks for running this. I have some fairly straightforward checks where we compare the data at each step, which would isolate what is or isn't changed. Running with or without the flag to handle hybrids (which is the primary intentional difference) would also be a fast way to determine whether that is the source of the difference.
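
Something along these lines, assuming both servers' kinship results are available as TSVs (the file and column names are assumptions for illustration):

```r
library(dplyr)
library(readr)

prod <- read_tsv("kinship_production.tsv")  # assumed columns: Id1, Id2, coefficient
new  <- read_tsv("kinship_new.tsv")

# Pairs present on one server but missing from the other
missingInNew <- anti_join(prod, new, by = c("Id1", "Id2"))
extraInNew   <- anti_join(new, prod, by = c("Id1", "Id2"))

# For pairs present in both, check whether any coefficients actually changed
changed <- inner_join(prod, new, by = c("Id1", "Id2"), suffix = c(".prod", ".new")) %>%
  filter(abs(coefficient.prod - coefficient.new) > 1e-8)

nrow(missingInNew); nrow(extraInNew); nrow(changed)
```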
