Make report shareable (#379)

* Update README.md be extra clear about WGS 👀
* remove individual level information from report
PGScatalog · Oct 3, 2024 · 4d31352
1 parent f17b947
Showing 2 changed files with 36 additions and 16 deletions.
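
The sample-size gating that this commit adds to `report.qmd` can be read as a small decision rule. The sketch below restates it outside the report so it can be checked in isolation. The helper name `is_low_sample_size` and the toy data frames are hypothetical and not part of the commit. The threshold and the `IID` column come from the diff.

```r
# Minimal sketch of the gating logic in this commit (not the report code itself).
# `is_low_sample_size` is a hypothetical helper introduced for illustration.
MINIMUM_N_SAMPLES <- 50  # threshold taken from the diff

is_low_sample_size <- function(scores, minimum_n = MINIMUM_N_SAMPLES) {
  # Count distinct individuals by IID across the whole score table,
  # as the updated report chunk does.
  length(unique(scores$IID)) < minimum_n
}

# Toy score tables (made-up values, one PGS each)
small <- data.frame(IID = paste0("s", 1:10), PGS = "PGS_A", SUM = seq_len(10))
large <- data.frame(IID = paste0("s", 1:60), PGS = "PGS_A", SUM = seq_len(60))

run_ancestry <- FALSE  # stands in for params$run_ancestry

low <- is_low_sample_size(small)

# Conditions mirrored from the chunk headers in the diff:
show_small_sample_warning <- low && !run_ancestry  # callout-important block
show_density_plots        <- !low                  # both density chunks need this
```

With `small`, the warning callout is shown and both density chunks are skipped. With `large`, the reverse holds. One difference from the removed condition is worth keeping in mind: the old check, `any(table(scores$sampleset) < 50)`, compared row counts per sampleset, while the new one counts unique `IID` values over all samplesets combined.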
diff --git a/README.md b/README.md
@@ -18,6 +18,10 @@ and/or user-defined PGS/PRS.
 
 ## Pipeline summary
 
+> [!IMPORTANT]  
+> * Whole genome sequencing (WGS) data [are not currently supported by the calculator](https://pgsc-calc.readthedocs.io/en/latest/explanation/match.html#are-your-target-genomes-imputed-are-they-wgs)
+> * It’s possible to [create compatible gVCFs from WGS data](https://github.com/PGScatalog/pgsc_calc/discussions/123#discussioncomment-6469422). We plan to improve support for WGS data in the near future.
+
 <p align="center">
   <img width="80%" src="https://github.com/PGScatalog/pgsc_calc/assets/11425618/f766b28c-0f75-4344-abf3-3463946e36cc">
 </p>

diff --git a/assets/report/report.qmd b/assets/report/report.qmd
@@ -36,6 +36,10 @@ library(DT)
 library(tibble)
 library(forcats)
 library(readr)
+
+# prevent plots with small sample sets
+MINIMUM_N_SAMPLES <- 50
+LOW_SAMPLE_SIZE <- TRUE
 ```
 
 ```{r setup_logs, echo=FALSE}
@@ -64,6 +68,14 @@ log_df$sampleset <- gsub("_", " ", log_df$sampleset)  # page breaking issues
 cat command.txt | fold -w 80 -s | awk -F ' ' 'NR==1 { print "$", $0} NR>1 { print "    " $0}' | sed 's/$/\\/' | sed '$ s/.$//' 
 ```
 
+```{asis, echo = grepl("-profile test", readLines("command.txt"))}
+:::{.callout-tip}
+* If you're using the test profile, this report and these results are not biologically meaningful 
+* The test profile is only used to check that all software is installed and working correctly 
+* If you're reading this message, then that means everything is OK and you're ready to use your own data!
+:::
+```
+
 ## Version
 
 ```{r, echo=FALSE}
@@ -386,18 +398,23 @@ pop_summary %>%
 scores <- readr::read_tsv(params$score_path) 
 n_scores <- length(unique(scores$PGS))
 n_samples <- length(unique(scores$IID))
-print(n_samples)
+if (n_samples < MINIMUM_N_SAMPLES) {
+  LOW_SAMPLE_SIZE <- TRUE
+} else {
+  LOW_SAMPLE_SIZE <- FALSE
+}
 ```
 
-```{asis, echo = any(table(scores$sampleset) < 50) && !params$run_ancestry}
+
+```{asis, echo = (LOW_SAMPLE_SIZE && !params$run_ancestry)}
 
 ::: {.callout-important title="Warning: small sampleset size (n < 50) detected"}
 * plink2 uses allele frequency data to [mean impute](https://www.cog-genomics.org/plink/2.0/score) the dosages of missing genotypes
 * Currently `pgsc_calc` disables mean-imputation in these small sample sets to make sure that the calculated PGS is as consistent with the genotype data as possible
 * With a small sample size, the resulting score sums may be inconsistent between samples
 * The average `([scorename]_AVG)` may be more applicable as it calculates an average weighting over all genotypes present
 
-In the future mean-imputation will be supported in small samplesets using ancestry-matched reference samplesets to ensure consistent calculation of score sums (e.g. 1000G Genomes).
+It's recommended to use `--run_ancestry` with small samplesets to ensure consistent calculation of score sums (e.g. 1000G Genomes).
 :::
 
 ```
@@ -419,24 +436,21 @@ In the future mean-imputation will be supported in small samplesets using ancest
 
 ### Score data 
 
-#### Score extract
+#### Density plot(s)
 
+```{asis, echo = !LOW_SAMPLE_SIZE}
 ::: {.callout-note}
-Below is a summary of the aggregated scores, which might be useful for debugging. See here for an explanation of [plink2](https://www.cog-genomics.org/plink/2.0/formats#sscore) column names
+The summary density plots show up to six scoring files
 :::
-
-```{r, echo = FALSE}
-scores %>%
-  tibble::as_tibble(.)
 ```
 
-#### Density plot(s)
-
-::: {.callout-note}
-The summary density plots show up to six scoring files
+```{asis, echo = LOW_SAMPLE_SIZE}
+::: {.callout-warning}
+Density plots are disabled for low sample sizes
 :::
+```
 
-```{r density_ancestry, echo=FALSE, message=FALSE, warning=FALSE, eval=params$run_ancestry}
+```{r density_ancestry, echo=FALSE, message=FALSE, warning=FALSE, eval=(!LOW_SAMPLE_SIZE & params$run_ancestry)}
 # Select which PGS to plot
 uscores <- unique(scores$PGS)
 uscores_plot <- uscores[1:min(length(uscores), 6)] # plot max 6 PGS
@@ -454,7 +468,7 @@ for(current_pgs in uscores_plot){
 }
 ```
 
-```{r, echo = FALSE, message=FALSE, warning=FALSE, eval=!params$run_ancestry}
+```{r, echo = FALSE, message=FALSE, warning=FALSE, eval=(!LOW_SAMPLE_SIZE & !params$run_ancestry)}
 scores %>%
   ungroup() %>%
   select(IID, sampleset, PGS, SUM) %>%
@@ -488,7 +502,9 @@ stringr::str_glue("{params$sampleset}/score/aggregated_scores.txt.gz")
 
 # Citation
 
-> Lambert, Wingfield, et al. (2024) The Polygenic Score Catalog: new functionality and tools to enable FAIR research. medRxiv. doi:[10.1101/2024.05.29.24307783](https://doi.org/10.1101/2024.05.29.24307783).
+> Samuel A. Lambert, Benjamin Wingfield, Joel T. Gibson, Laurent Gil, Santhi Ramachandran, Florent Yvon, Shirin Saverimuttu, Emily Tinsley, Elizabeth Lewis, Scott C. Ritchie, Jingqin Wu, Rodrigo Canovas, Aoife McMahon, Laura W. Harris, Helen Parkinson, Michael Inouye.
+Enhancing the Polygenic Score Catalog with tools for score calculation and ancestry normalization.
+Nature Genetics | doi: [10.1038/s41588-024-01937-x](https://doi.org/10.1038/s41588-024-01937-x)
 
 ::: {.callout-important}
 For scores from the PGS Catalog, please remember to cite the original publications from which they came (these are listed in the metadata table).