-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor kinship to allow other code to initiate import directly #730
Conversation
…re calculated separately
- Make R scripts exit immediately on error - Bugfix to expected kinship/validation
…y computed kinship data
- Refactor kinship script to optionally merge species where hybrids are present and process together
@labkey-martyp: this is a simplification of the earlier PR. There's a JHU postgres failure, but it does not seem related to the genetics calculations. |
Thanks @bbimber let me test this out on data from the centers, otherwise it looks good. |
Thanks @labkey-martyp. Like I posted on the support thread, please keep in mind that you can just run populateKinship standalone, where you replace the TSV loading with RLabkey/SelectRows. That should make it quick to point to any accessible center's server (even a production server) to non-invasive testing on real data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I tested this out across centers. Looks good with just a couple requests. Check them out, let me know if you have any follow up on them.
ehr/src/org/labkey/ehr/pipeline/GeneticCalculationsImportTask.java
Outdated
Show resolved
Hide resolved
Thanks @labkey-martyp. The last commit addresses those 2 comments. |
f5ed491
to
7f2c987
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks Ben.
This PR contains two main categories of changes. The motivation of these changes was to separate the process of computing the data for EHR genetics (e.g. kinship and inbreeding) from actually importing the data into the EHR. In total, the idea is that PRIMe-seq (which already has a mirrored copy of pedigree data), will execute the default EHR GeneticsCalculation pipeline. It will farm the computation to our cluster, which that server is already configured to do. When complete, this pipeline already saves the results as TSV files. I wrote a separate ETL that will be defined in ONPRC modules, which copies the resulting TSVs to a location visible to PRIMe, and then pings PRIMe via a new server-side action to cause PRIMe to take those TSVs and call EHRService.standaloneProcessKinshipAndInbreeding to actually import them.
The changes within EHR itself are primarily to refactor the portions of the code that import the TSVs be to static, allow it to be called separately, and expose this through EHRService. These changes are basically a refactor without touching much within the code itself.
When I got into the weeds of the R code, I noticed a number of other things that seemed worth cleaning up. These are not directly related to the importing of data, but should be broadly useful:
I did some general style improvements in the R code, and removed some ancient strange patterns. This resulted in more code being touched than strictly needed, but most changes are minor and make the script a little more standard.
I think I understand the intent behind "options(error = dump.frames)", but I dont think it was giving the result it should. In my hands, not only was this not writing anything to a file, the script did not die on errors. I'd argue it's a lot more useful if it dies when there's an error (such as not having kinship2 installed), rather than continue onward after errors and accumulate strange downstream errors. I tested this pretty thoroughly locally and I think R's default error handling (i.e. dont specify anything with options()) does as good a job as anything here. Here is an example (which requires a login), showing how the existing 23.7 code ignores R errors: https://prime-seq.ohsu.edu/pipeline-status/Internal/PMR/details.view?rowId=433466.
This was originally opened for 23.7 under #650. This PR backs out some of the changes originally part of that PR. Notably, this is not attempting to address the WNPRC issue around hybrid animals, nor does it try to perform sanity checking of minimum expected kinship coefficients in R.