
Refactor genetics calculations to allow import to be triggered separately #650

Closed
bbimber wants to merge 9 commits

Conversation

@bbimber (Collaborator) commented Sep 4, 2023

This PR contains two main categories of changes. The motivation was to separate the process of computing the data for EHR genetics (e.g. kinship and inbreeding) from actually importing the data into the EHR. The overall idea is that PRIMe-seq (which already has a mirrored copy of pedigree data) will execute the default EHR GeneticsCalculation pipeline and farm the computation out to our cluster, which that server is already configured to do. When complete, this pipeline already saves the results as TSV files. I wrote a separate ETL, to be defined in ONPRC modules, which copies the resulting TSVs to a location visible to PRIMe and then pings PRIMe via a new server-side action, causing PRIMe to take those TSVs and call EHRService.standaloneProcessKinshipAndInbreeding to actually import them.

The changes within EHR itself primarily refactor the portions of the code that import the TSVs to be static, allow them to be called separately, and expose this through EHRService. These changes are essentially a refactor that leaves the underlying logic largely untouched.

When I got into the weeds of the R code, I noticed a number of other things that seemed worth cleaning up. These are not directly related to the importing of data, but should be broadly useful:

  • I did some general style improvements in the R code and removed some ancient, strange patterns. This resulted in more code being touched, but most changes are minor.
  • Since we now depend on dplyr anyway, I used it more, which should reduce the memory footprint because we can stream data frames.
  • I think I understand the intent behind "options(error = dump.frames)", but I don't think it was giving the result it should. In my hands, not only was this not writing anything to a file, the script did not die on errors. I'd argue it's a lot more useful for the script to die when there's an error (such as kinship2 not being installed) than to continue onward and accumulate strange downstream errors. I tested this pretty thoroughly locally, and I think R's default error handling (i.e. don't specify anything with options()) does as good a job as anything here. Here is an example (which requires a login) showing how the existing 23.7 code ignores R errors: https://prime-seq.ohsu.edu/pipeline-status/Internal/PMR/details.view?rowId=433466. (A minimal fail-fast sketch follows this list.)
  • Marty's addition of the Java-side validation step seems useful; however, I realized we already have all the information we need to replicate that in-memory within R. Using the input pedigree data, we can generate a table of expected coefficients and just merge that with the kinship data (a sketch of this also follows below). The advantage is that this is really fast, so we can easily do it every time. As-is, the pipeline job just logs the result; however, it would not be hard to respect the existing 'kinshipValidation' flag.
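
Purely as an illustration of the fail-fast behavior described above (the PR itself simply relies on R's default non-interactive error handling), here is a minimal sketch of an explicit handler; everything in it is assumed for illustration and is not taken from the actual script:

```r
# Illustrative sketch only: print a traceback and exit non-zero on any
# uncaught error, so the pipeline job fails instead of continuing past errors.
if (!interactive()) {
  options(error = function() {
    traceback(2)
    quit(save = "no", status = 1, runLast = FALSE)
  })
}

# With the handler above (or R's defaults under Rscript), a failure such as a
# missing package stops the job immediately:
library(kinship2)
```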
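
And a sketch of the merge-based validation idea: derive the set of parent/offspring pairs from the input pedigree and join them against the long-format kinship results. All object and column names here (ped, kinshipLong, Id/Dam/Sire, coefficient) are assumptions for illustration, not the actual script's:

```r
library(dplyr)

# ped:         input pedigree with columns Id, Dam, Sire (assumed names)
# kinshipLong: kinship() output reshaped to long format with columns
#              Id1, Id2, coefficient, containing both orderings of each pair

# Every animal/parent pair should have kinship >= 0.25 (exactly 0.25 when the
# parent is not inbred and the two parents are unrelated)
expected <- bind_rows(
  ped %>% filter(!is.na(Dam))  %>% transmute(Id1 = Id, Id2 = Dam),
  ped %>% filter(!is.na(Sire)) %>% transmute(Id1 = Id, Id2 = Sire)
)

validation <- expected %>%
  left_join(kinshipLong, by = c("Id1", "Id2")) %>%
  mutate(flagged = is.na(coefficient) | coefficient < 0.25)

if (any(validation$flagged)) {
  print(filter(validation, flagged))
  warning("Parent/offspring pairs missing from kinship results or with unexpectedly low coefficients")
}
```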

- Make R scripts exit immediately on error
- Bugfix to expected kinship/validation
@bbimber (Collaborator, Author) commented Sep 5, 2023

@dnicolalde: as part of touching the kinship scripts, I'd like to address known issues. One limitation is that there are instances where a given animal is a hybrid of two species. As it stands, the kinship script runs species-by-species, which is clearly a problem there. Let's say there is a rhesus macaque whose sire was a pigtail macaque. I don't think just adding the one missing sire to the rhesus data frame is adequate, since in theory there are more ancestors that could affect this (granted, that's probably rather rare). One possible solution might be to update the R code to identify animals with a cross-species parent, and then merge those species and run kinship on them together. Provided IDs are unique between species, the only downside is memory, and when these scripts were written around 2010 memory was more limiting than it probably is today. I think ONPRC has a couple of legacy hybrid animals, and I will give this a look with those.
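
A rough sketch of that approach, assuming a pedigree data frame with Id, Dam, Sire, and Species columns (all names are placeholders, not the actual script's):

```r
library(dplyr)

# ped: pedigree with columns Id, Dam, Sire, Species (assumed names)
speciesOf <- ped %>% select(Id, Species)

# Identify animals whose recorded parent belongs to a different species
hybrids <- ped %>%
  left_join(speciesOf, by = c("Dam" = "Id"), suffix = c("", ".dam")) %>%
  left_join(speciesOf, by = c("Sire" = "Id"), suffix = c("", ".sire")) %>%
  filter(
    (!is.na(Species.dam) & Species.dam != Species) |
    (!is.na(Species.sire) & Species.sire != Species)
  )

# Species that share at least one hybrid would be pooled and passed to
# kinship() as a single unit; everything else still runs species-by-species.
speciesToMerge <- union(
  hybrids$Species,
  union(hybrids$Species.dam[!is.na(hybrids$Species.dam)],
        hybrids$Species.sire[!is.na(hybrids$Species.sire)])
)
```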

- Refactor kinship script to optionally merge species where hybrids are present and process together
@bbimber (Collaborator, Author) commented Sep 5, 2023

A couple of changes of note were just added:

  1. Changed GeneticCalculationsRTask such that the 'validateKinship' property is written to the XML, rather than read at runtime. That's essential if this task will be run on a remote server with no access to the DB.
  2. Refactored the kinship script to further reduce memory usage (a rough sketch of the idea follows this list).
  3. Linking to 22.7 fb kinship comparison #503: I added an opt-in flag for mergeSpeciesWithHybrids. As I said above, the only fully accurate way I can currently think of to handle individuals with hybrid parents is to process both species together as a unit. If this flag is set, rather than processing species-by-species, the R code will look for hybrids, and if any are found it will join those species and run them through kinship() together.
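
A sketch of the kind of memory reduction behind item 2: rather than holding results for every species in memory at once, convert each species' dense kinship matrix to a long, nonzero-only table and append it to the output TSV as you go. The function, file, and column names here are illustrative assumptions, not the actual populateKinship.r code:

```r
library(kinship2)
library(dplyr)

writeKinshipForSpecies <- function(pedSubset, outFile, append = FALSE) {
  # Dense n x n matrix of kinship coefficients for this species
  # (missing parents assumed coded as NA in the Sire/Dam columns)
  km <- kinship(pedSubset$Id, pedSubset$Sire, pedSubset$Dam)

  # Reshape to long format and drop zero coefficients, which is where the
  # bulk of the memory/disk savings comes from
  longDf <- as.data.frame(as.table(km), stringsAsFactors = FALSE) %>%
    rename(Id1 = Var1, Id2 = Var2, coefficient = Freq) %>%
    filter(coefficient > 0)

  # Append each species' rows to the TSV instead of accumulating in memory
  write.table(longDf, outFile, sep = "\t", row.names = FALSE,
              col.names = !append, append = append, quote = FALSE)
  rm(km, longDf)
  invisible(NULL)
}
```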

I've been testing this using the full ONPRC pedigree. It's fairly easy to run against another pedigree if any center is willing to share the results of pedigree.sql as a TSV.

@dnicolalde (Collaborator) commented

I looked at your changes yesterday. I can run this against our pedigree data and share the results. I would like to verify that the data generated is not changing that much.

@bbimber (Collaborator, Author) commented Sep 6, 2023

> I looked at your changes yesterday. I can run this against our pedigree data and share the results. I would like to verify that the data generated is not changing that much.

OK. Other than hybrids, I would expect zero changes in coefficients, so I'd be interested to know how that looks.

@spamhurts (Contributor) commented

Hi Ben, I compared the results from the new populateKinship.r script with the results on our production server and found there were no differences in the number of pairs or the coefficients generated. The only change I saw was the addition of the species column in the results.

@bbimber (Collaborator, Author) commented Sep 20, 2023

@labkey-martyp and @labkey-jeckels: any comments on this?

@labkey-martyp (Contributor) commented

> @labkey-martyp and @labkey-jeckels: any comments on this?

@bbimber I'm still testing this out. I tried this on one full set of client data and this PR was missing some of the kinship relationships. I have not had time to investigate, but I would guess there are some assumptions made about the data, possibly around null or unknown values, that are affecting the results. I can dig more and get more info by the end of the week. But it does look like the validation in the R script is using the same assumptions, since it did not catch those missing relationships, whereas the current validation, which uses a different query, did pick them up.

If you're blocked on this for setting up your cluster calculations, it might make sense to split this into two PRs: one with just the import functionality and the other with the R script refactors.

@bbimber (Collaborator, Author) commented Sep 20, 2023

> > @labkey-martyp and @labkey-jeckels: any comments on this?
>
> @bbimber I'm still testing this out. I tried this on one full set of client data and this PR was missing some of the kinship relationships. I have not had time to investigate, but I would guess there are some assumptions made about the data, possibly around null or unknown values, that are affecting the results. I can dig more and get more info by the end of the week. But it does look like the validation in the R script is using the same assumptions, since it did not catch those missing relationships, whereas the current validation, which uses a different query, did pick them up.
>
> If you're blocked on this for setting up your cluster calculations, it might make sense to split this into two PRs: one with just the import functionality and the other with the R script refactors.

I don't suppose you're able to share a pedigree file? Is it TNPRC or JHU?

The one difference between the R-based and Java-based checks is that the R-based check relies on the data that makes it into pedigree.sql. I noticed your Java code pulled from demographicsFamily/demographicsOffspring/demographicsSiblings. Those are independent SQL scripts and therefore could easily be applying subtly different filters. This could be the combination each uses for including or not including supplemental pedigree data, or whatever other sources of data might exist.

Simply looking at row counts and/or joins on the results of pedigree.sql and those three queries might be informative.
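
For example, something along these lines in R/dplyr, assuming each query has been exported to a TSV (the file and column names are placeholders, not the actual query outputs):

```r
library(dplyr)
library(readr)

# Placeholder file names; each is the exported result of the corresponding query
ped    <- read_tsv("pedigree.tsv")            # pedigree.sql
family <- read_tsv("demographicsFamily.tsv")  # demographicsFamily

# Parent/offspring pairs present in demographicsFamily but absent from pedigree.sql
pairsOf <- function(df) {
  bind_rows(
    df %>% transmute(Id, Parent = Dam),
    df %>% transmute(Id, Parent = Sire)
  ) %>% filter(!is.na(Parent))
}

missingFromPedigree <- anti_join(pairsOf(family), pairsOf(ped), by = c("Id", "Parent"))
nrow(missingFromPedigree)
```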

@bbimber (Collaborator, Author) commented Sep 20, 2023

@labkey-martyp: to follow up on this specific sentence you wrote:

> But it does look like the validation in the R script is <not?> using the same assumptions as it did not catch those missing relationships, where the current validation, using a different query, did pick them up.

If you are intending to say that the kinship pedigree/script does not catch a certain relationship, but the Java validation process does catch it, and this is explicitly because they use different SQL queries as input, then I would ask: "what exactly are we trying to catch here?" If the actual difference is due to different SQL queries returning different results, then wouldn't it be a heck of a lot more direct to do a SQL test of pedigree.sql against those queries and ask whether discrepancies should exist? If there are, should the respective SQL get updated? Said another way, I think it's important to separate the questions of "are we giving R identical/complete input data" and "does the kinship package give the expected results".

@dnicolalde (Collaborator) commented

@bbimber I was able to set this code up in a test instance of our EHR and it looks like there might be a problem. The file generated on our production server is around 786 MB, and the one this new code generated is 417 MB. It is currently adding the data into the corresponding dataset, and then I will be able to see how many rows I am missing. Any idea why it would create a file half the size? Both started with the same Pedigree table, which contains 20,316 rows.

@dnicolalde (Collaborator) commented

Follow-up: it finished importing, and it only generated 12,924,820 rows, compared to the current production server, which generates 28,219,394. I can work with Marty to see where it is having problems with the calculation.

@bbimber (Collaborator, Author) commented Sep 26, 2023

> Follow-up: it finished importing, and it only generated 12,924,820 rows, compared to the current production server, which generates 28,219,394. I can work with Marty to see where it is having problems with the calculation.

Thanks for running this. I have some fairly straightforward checks where we compare the data at each step, which would isolate what is or isn't changed. Running with or without the flag to handle hybrids (which is the primary intentional difference) would also be a fast way to determine whether that is the source of the difference.
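
Something along these lines, assuming both servers' kinship results are available as TSVs (the file and column names are assumptions for illustration):

```r
library(dplyr)
library(readr)

prod <- read_tsv("kinship_production.tsv")  # assumed columns: Id1, Id2, coefficient
new  <- read_tsv("kinship_new.tsv")

# Pairs present on one server but missing from the other
missingInNew <- anti_join(prod, new, by = c("Id1", "Id2"))
extraInNew   <- anti_join(new, prod, by = c("Id1", "Id2"))

# For pairs present in both, check whether any coefficients actually changed
changed <- inner_join(prod, new, by = c("Id1", "Id2"), suffix = c(".prod", ".new")) %>%
  filter(abs(coefficient.prod - coefficient.new) > 1e-8)

nrow(missingInNew); nrow(extraInNew); nrow(changed)
```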
