Add tool compatibility check to datasets #5280

ryan-preble · 2025-01-22T16:49:40Z

Description

Adds functions that predict which analysis tools a dataset is compatible with, based on the dataset definition and each tool's help page. A tool is reported as compatible when the dataset has enough trials, phenotype measurements, genotypes, and locations to meet all of that tool's requirements. Tool compatibility is calculated when the dataset is created and is displayed on the dataset details page. The page lists the tools that can and cannot be used with the dataset and the traits that can be analyzed, and it warns the user when sample sizes are low. Compatibility can be recalculated manually if genotypes or phenotype measurements are added to a trial after the dataset was created. The result is stored as JSON in the dataset definition. A future pull request will have tools check dataset compatibility and display only compatible datasets on the data selection screen.
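As a rough illustration of the rule described above (a tool is compatible only when every one of its requirements is met), here is a minimal Perl sketch. It is not the code from this pull request: the tool names, the requirement keys, the thresholds, and the `check_tool_compatibility` function are all hypothetical placeholders chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical per-tool minimums. The real requirements come from each tool's
# help page and from the code in this PR; they are not reproduced here.
my %TOOL_REQUIREMENTS = (
    'Example Tool A' => { trials => 1, phenotyped_accessions => 30, traits => 1 },
    'Example Tool B' => { trials => 2, genotyped_accessions => 50, markers => 100 },
);

# Takes a hashref of counts describing a dataset and returns
# { tool_name => { compatible => 0|1, missing => [ requirement keys not met ] } }.
sub check_tool_compatibility {
    my ($summary) = @_;
    my %result;
    for my $tool (sort keys %TOOL_REQUIREMENTS) {
        my $req = $TOOL_REQUIREMENTS{$tool};
        # A requirement is unmet when the dataset's count is below the minimum;
        # counts absent from the summary are treated as zero.
        my @missing = grep { ($summary->{$_} // 0) < $req->{$_} } sort keys %$req;
        $result{$tool} = { compatible => (@missing ? 0 : 1), missing => \@missing };
    }
    return \%result;
}

# Made-up dataset: phenotyped, but with no genotype data.
my $report = check_tool_compatibility({
    trials => 1, phenotyped_accessions => 40, traits => 3,
    genotyped_accessions => 0, markers => 0,
});

for my $tool (sort keys %$report) {
    my $r = $report->{$tool};
    my $status = $r->{compatible} ? 'compatible' : 'not compatible';
    my $detail = @{ $r->{missing} } ? ' (insufficient: ' . join(', ', @{ $r->{missing} }) . ')' : '';
    print "$tool: $status$detail\n";
}
```

Keeping the unmet requirements alongside the yes/no answer makes it straightforward to tell the user why a tool is unavailable for a dataset.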

#5174

isaak · 2025-01-29T10:18:12Z

This is an excellent informative feature.

Suggestions for the next iteration:

- Remove the check for GEBV clustering compatibility. GEBVs are a result of solGS prediction and may not be stored in the database.
- When checking for genotype data, if no genotyping protocol exists for the dataset, check against the default genotyping protocol.
- Check PCA compatibility for phenotype data as well. For that matter, check for the presence of any quantitative data; it could be NIRS, expression data, etc.
- For the PCA and clustering checks, it would be helpful to show the number of phenotyped observations and genotyped markers in the warning message and let the user judge. What counts as an appropriate number of observations and variables (markers, traits) is tricky to decide.
- Show the count of whatever criterion (observation units, markers, etc.) a warning is about.
- Add the following to the tool_compatibility JSON:
  - number_of_phenotyped_accessions (for one trait is enough),
  - number_of_phenotyped_traits,
  - number_of_genotyped_accessions,
  - number_of_markers,
  - number_of_whatever_criteria the tool is assessing for compatibility.
- Consider 'check for compatibility' instead of 'calculate for compatibility'.
- Add a compatibility check for correlation based on phenotype data.
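To make the suggested summary fields and count-bearing warnings concrete, here is a minimal Perl sketch. The `number_of_*` keys are the ones proposed above; everything else (the `data_summary`, `example_tool`, `compatible`, and `warning` keys, the `count_warning` helper, the threshold, and all counts) is an assumption made for illustration and may not match what the PR actually stores.

```perl
use strict;
use warnings;
use JSON::PP;    # core module; provides encode/decode without extra dependencies

# Illustrative counts only; a real implementation would query these from the database.
my %data_summary = (
    number_of_phenotyped_accessions => 24,
    number_of_phenotyped_traits     => 3,
    number_of_genotyped_accessions  => 18,
    number_of_markers               => 1200,
);

# Hypothetical threshold, used only to trigger a warning in this example.
my $suggested_min_accessions = 30;

# Builds a warning that states the actual count so the user can judge for
# themselves. Returns an explicit undef (encoded as JSON null) when no warning
# applies, so the function is safe to call inside a hash constructor.
sub count_warning {
    my ($label, $count, $suggested_min) = @_;
    return undef if $count >= $suggested_min;
    return "Only $count $label found (suggested minimum: $suggested_min).";
}

my %tool_compatibility = (
    data_summary => \%data_summary,
    example_tool => {
        compatible => JSON::PP::true,
        warning    => count_warning(
            'phenotyped accessions',
            $data_summary{number_of_phenotyped_accessions},
            $suggested_min_accessions,
        ),
    },
);

# canonical() sorts keys so the stored JSON is stable between runs.
my $json = JSON::PP->new->canonical->pretty->encode(\%tool_compatibility);
print $json;
```

Storing the raw counts next to the verdict lets the details page show the numbers without re-querying the database.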

ryan-preble · 2025-01-29T15:07:18Z

Ok, I will get to implementing these and adding them to the pull request. I do have a few questions for clarification though:

What is a "default" genotyping protocol, and how do I determine what the default is?
Should there be a limit on the types of phenotype data that can be used in PCA? For example, should it only consider quantitative phenotype data?

isaak · 2025-01-30T10:40:05Z

For the default genotyping protocol, you can read the configuration variable default_genotyping_protocol (set in sgn_local.conf):
my $default_genotyping_protocol = $c->config->{default_genotyping_protocol};
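A minimal sketch of how that fallback might look, assuming the dataset's own protocols arrive as an array reference. The `%config` hash stands in for `$c->config` so the example runs outside a Catalyst controller, and the `protocols_to_check` helper and the protocol names are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for $c->config; the protocol name is a made-up placeholder.
my %config = ( default_genotyping_protocol => 'Example Default Protocol' );

# Hypothetical helper: use the dataset's own protocols when it has any,
# otherwise fall back to the configured default (if one is set).
sub protocols_to_check {
    my ($dataset_protocols, $default_protocol) = @_;
    return @$dataset_protocols if $dataset_protocols && @$dataset_protocols;
    return defined $default_protocol ? ($default_protocol) : ();
}

my @own      = protocols_to_check(['Example Protocol X'], $config{default_genotyping_protocol});
my @fallback = protocols_to_check([], $config{default_genotyping_protocol});

print "dataset with protocols: @own\n";
print "dataset without protocols: @fallback\n";
```

Returning an empty list when neither source provides a protocol lets the caller treat "no genotype data to check" as a normal case.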

lukasmueller

More POD could be added for the different new functions. Not sure why the seedlot tests don't pass... :-)

ryan-preble · 2025-02-03T14:32:36Z

Yeah, I am not sure. They all pass on my local machine...

ryan-preble added 11 commits December 9, 2024 12:32

First draft of function to determine tool compatibility

fe77c6f

Polish tool compatibility function; have it correctly work through different genotyping protocols and split results by traits. Store as JSON in database

7ae1870

Tweaks to tool compatibility calculation and adding compatibility to dataset details page. Added button to calculate compatibility if not already stored.

bedd617

Tweaks to tool compatibility table display

5a51b1a

Added warnings to tool compatibility JSON. Adjust table display to include warning tooltips

b2c9af9

Tweaks to user instructions and starting to change tool compatibility calculation to not use genotype search

458e218

Change function to use DB genotype query instead of retrieve_genotypes call

73f771d

Change tool compatibility to have separate storing and calculating functions. Datasets now have tool compatibility auto calculated on creation. Remove debug comments.

11d2e7c

Update to match master branch

c92ea02

Added short selenium test to verify tool compatibility on details page

91327a7

Appeasing the linter

cf6718a

lukasmueller requested review from lukasmueller and isaak January 27, 2025 16:14

isaak approved these changes Jan 29, 2025

isaak self-assigned this Jan 29, 2025

ryan-preble added 4 commits January 29, 2025 11:30

Add sample sizes and marker numbers to warning messages

52fb716

Add correlation tool

cd453fa

Add data summary to tool compatibility JSON & change details page to show summary box

64bb519

Adjusting div size

eaacf17

ryan-preble added 4 commits January 31, 2025 11:29

Improving stability and heritability criteria

f65af9b

Adding check to make sure traits are quantitative only

9b3df35

Add pheno and geno types to PCA tool compatibility check

b7e2576

Adding default genotyping protocol enforcement

dfb764d

lukasmueller approved these changes Feb 3, 2025

More POD

a3697a0

afpowell merged commit a35ca4f into master Feb 3, 2025
4 checks passed

afpowell deleted the 5714_phenotype_genotype_data_check branch February 3, 2025 16:20

ryan-preble mentioned this pull request Feb 4, 2025

Enable dataset filtering for different analyses #5009

Closed
