Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add tool compatibility check to datasets #5280

Merged
merged 20 commits into from
Feb 3, 2025

Conversation

ryan-preble
Copy link
Contributor

@ryan-preble ryan-preble commented Jan 22, 2025

Description

Adds functions to predict the tool compatibility of a dataset based on dataset definition and the help page of the tool. Tools are reported as compatible if they have sufficient trials, phenotype measurements, genotypes, and locations to meet all the requirements of a tool. Tool compatibility is calculated on dataset creation and is displayed on the dataset details page. The page reports the tools which can and cannot be used with this dataset, traits that can be analyzed, and warns the user if sample sizes are low. Compatibility can be manually recalculated if genotypes or phenotype measurements are added to a trial after dataset creation. Tool compatibility is stored as a JSON in dataset definition. A future pull request will make it so that tools check dataset compatibility and only display compatible datasets in the data selection screen.

#5174

Checklist

  • Refactoring only
  • Documentation only
  • Fixture update only
  • Bug fix
    • The relevant issue has been closed.
    • Further work is required.
  • New feature
    • Relevant tests have been created and run.
    • Data was added to the fixture
      • Data was added via a patch in /t/data/fixture/patches/.
    • User-Facing Change
      • The user manual in /docs has been updated.
    • Any new Perl has been documented using perldoc.
    • Any new JavaScript has been documented using JSDoc.
    • Any new legacy JavaScript has been moved from /js to /js/source/legacy.

@isaak
Copy link
Member

isaak commented Jan 29, 2025

This is an excellent informative feature.

Suggestions for the next iteration:

  • Remove check for GEBV clustering compatibility. GEBV is a result from solGS prediction and may not be stored in the database.
  • When checking for genotype data, if no genotyping protocol exists for the dataset, check based on default genotyping protocol.
  • Check PCA compatibility for phenotype data, as well. Check for presence of any quantitative data, for that matter. It could be NIRS, expression data, etc.
  • For PCA and Clustering checks, it would be helpful to show the number of phenotyped observations and genotyped markers in the warning message. And let the user judge. What is appropriate number of observations and variables (markers, traits) is tricky to decide.
  • Show the count of whatever criteria (observation units, markers, etc) a warning is made about.
  • Add to the tool_compatibility JSON the following:
    • number_of_phenotyped_accessions (for one trait is enough),
    • number_of_phenotyped_traits,
    • number_of_genotyped_accessions,
    • number_of_markers,
    • number_of_whatever_criteria the tool is assessing for compatibility.
  • Consider 'check for compatibility' instead of 'calculate for compatibility'
  • Add compatibility check for correlation based on phenotype data.

@isaak isaak self-assigned this Jan 29, 2025
@ryan-preble
Copy link
Contributor Author

This is an excellent informative feature.

Suggestions for the next iteration:

  • Remove check for GEBV clustering compatibility. GEBV is a result from solGS prediction and may not be stored in the database.

  • When checking for genotype data, if no genotyping protocol exists for the dataset, check based on default genotyping protocol.

  • Check PCA compatibility for phenotype data, as well. Check for presence of any quantitative data, for that matter. It could be NIRS, expression data, etc.

  • For PCA and Clustering checks, it would be helpful to show the number of phenotyped observations and genotyped markers in the warning message. And let the user judge. What is appropriate number of observations and variables (markers, traits) is tricky to decide.

  • Show the count of whatever criteria (observation units, markers, etc) a warning is made about.

  • Add to the tool_compatibility JSON the following:

    • number_of_phenotyped_accessions (for one trait is enough),
    • number_of_phenotyped_traits,
    • number_of_genotyped_accessions,
    • number_of_markers,
    • number_of_whatever_criteria the tool is assessing for compatibility.
  • Consider 'check for compatibility' instead of 'calculate for compatibility'

  • Add compatibility check for correlation based on phenotype data.

Ok, I will get to implementing these and adding them to the pull request. I do have a few questions for clarification though:

  • What is a "default" genotyping protocol, and how do I determine what the default is?
  • Should there be a limit to the type of phenotyping data that can be used in PCA? For example, should it only consider quantitative pheno data?

@isaak
Copy link
Member

isaak commented Jan 30, 2025

For the default genotyping protocol, you can access the env variable: default_genotyping_protocol (sgn_local.conf).
my $default_genotyping_protocol = $c->config->{default_genotyping_protocol};

Copy link
Member

@lukasmueller lukasmueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more POD could be added for the different new functions. Not sure why the seedlot tests don't pass...
:-)

@ryan-preble
Copy link
Contributor Author

more POD could be added for the different new functions. Not sure why the seedlot tests don't pass... :-)

image
???? Yeah I am not sure. All passing on local machine...

@afpowell afpowell merged commit a35ca4f into master Feb 3, 2025
4 checks passed
@afpowell afpowell deleted the 5714_phenotype_genotype_data_check branch February 3, 2025 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants