From 563da059f67d331f22d17a298635080832082543 Mon Sep 17 00:00:00 2001
From: miltondp
Date: Fri, 6 Jan 2023 00:33:26 +0000
Subject: [PATCH] GPT revised manuscript

---
 content/01.abstract.md                  |  3 +-
 content/02.introduction.md              | 17 ++++----
 content/04.05.results_intro.md          | 54 ++++++++++++++++++++++++-
 content/04.10.results_comp.md           | 14 +++----
 content/04.12.results_giant.md          | 10 ++---
 content/06.discussion.md                | 18 +++++----
 content/08.01.methods.ccc.md            |  4 ++
 content/08.05.methods.data.md           |  2 +-
 content/08.15.methods.giant.md          | 16 ++++++--
 content/20.00.supplementary_material.md |  4 +-
 10 files changed, 105 insertions(+), 37 deletions(-)

diff --git a/content/01.abstract.md b/content/01.abstract.md
index 78dc816..297b74d 100644
--- a/content/01.abstract.md
+++ b/content/01.abstract.md
@@ -2,7 +2,8 @@
 
 Correlation coefficients are widely used to identify patterns in data that may be of particular interest.
 In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes.
-Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models.
+
+In this paper, we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models.
 CCC reveals biologically meaningful linear and nonlinear patterns missed by standard, linear-only correlation coefficients.
 CCC captures general patterns in data by comparing clustering solutions while being much faster than state-of-the-art coefficients such as the Maximal Information Coefficient.
 When applied to human gene expression data, CCC identifies robust linear relationships while detecting nonlinear patterns associated, for example, with sex differences that are not captured by linear-only coefficients.
diff --git a/content/02.introduction.md b/content/02.introduction.md index ee9d65b..0028e4e 100644 --- a/content/02.introduction.md +++ b/content/02.introduction.md @@ -5,6 +5,7 @@ This large amount of data provides new opportunities to address unanswered scien Correlation analysis is an essential statistical technique for discovering relationships between variables [@pmid:21310971]. Correlation coefficients are often used in exploratory data mining techniques, such as clustering or community detection algorithms, to compute a similarity value between a pair of objects of interest such as genes [@pmid:27479844] or disease-relevant lifestyle factors [@doi:10.1073/pnas.1217269109]. Correlation methods are also used in supervised tasks, for example, for feature selection to improve prediction accuracy [@pmid:27006077; @pmid:33729976]. + The Pearson correlation coefficient is ubiquitously deployed across application domains and diverse scientific areas. Thus, even minor and significant improvements in these techniques could have enormous consequences in industry and research. @@ -19,19 +20,19 @@ Therefore, advanced correlation coefficients could immediately find wide applica The Pearson and Spearman correlation coefficients are widely used because they reveal intuitive relationships and can be computed quickly. -However, they are designed to capture linear or monotonic patterns (referred to as linear-only) and may miss complex yet critical relationships. -Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) [@pmid:22174245] and the Distance Correlation (DC) [@doi:10.1214/009053607000000505]. -MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains [@pmid:33972855; @pmid:33001806; @pmid:27006077]. 
-However, the computational complexity makes them impractical for even moderately sized datasets [@pmid:33972855; @pmid:27333001]. -Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions [@pmid:33972855]. -We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels [@doi:10.1093/bioinformatics/bty899]. +However, they are designed to capture linear or monotonic patterns and may miss complex yet critical relationships. +Novel coefficients have been proposed as metrics that capture nonlinear patterns such as the Maximal Information Coefficient (MIC) and the Distance Correlation (DC). +MIC, in particular, is one of the most commonly used statistics to capture more complex relationships, with successful applications across several domains. +However, the computational complexity makes them impractical for even moderately sized datasets. +Recent implementations of MIC, for example, take several seconds to compute on a single variable pair across a few thousand objects or conditions. +We previously developed a clustering method for highly diverse datasets that significantly outperformed approaches based on Pearson, Spearman, DC and MIC in detecting clusters of simulated linear and nonlinear relationships with varying noise levels. Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient not-only-linear coefficient that works across quantitative and qualitative variables. CCC has a single parameter that limits the maximum complexity of relationships found (from linear to more general patterns) and computation time. 
CCC provides a high level of flexibility to detect specific types of patterns that are more important for the user, while providing safe defaults to capture general relationships. We also provide an efficient CCC implementation that is highly parallelizable, allowing to speed up computation across variable pairs with millions of objects or conditions. -To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues [@doi:10.1126/science.aaz1776]. +To assess its performance, we applied our method to gene expression data from the Genotype-Tissue Expression v8 (GTEx) project across different tissues. CCC captured both strong linear relationships and novel nonlinear patterns, which were entirely missed by standard coefficients. For example, some of these nonlinear patterns were associated with sex differences in gene expression, suggesting that CCC can capture strong relationships present only in a subset of samples. We also found that the CCC behaves similarly to MIC in several cases, although it is much faster to compute. -Gene pairs detected in expression data by CCC had higher interaction probabilities in tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@doi:10.1038/ng.3259]. +Gene pairs detected in expression data by CCC had higher interaction probabilities in tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT). Furthermore, its ability to efficiently handle diverse data types (including numerical and categorical features) reduces preprocessing steps and makes it appealing to analyze large and heterogeneous repositories. 
diff --git a/content/04.05.results_intro.md b/content/04.05.results_intro.md index b1d3807..95e86ff 100644 --- a/content/04.05.results_intro.md +++ b/content/04.05.results_intro.md @@ -13,7 +13,59 @@ The CCC provides a similarity measure between any pair of variables, either with The method assumes that if there is a relationship between two variables/features describing $n$ data points/objects, then the **cluster**ings of those objects using each variable should **match**. In the case of numerical values, CCC uses quantiles to efficiently separate data points into different clusters (e.g., the median separates numerical data into two clusters). Once all clusterings are generated according to each variable, we define the CCC as the maximum adjusted Rand index (ARI) [@doi:10.1007/BF01908075] between them, ranging between 0 and 1. -Details of the CCC algorithm can be found in [Methods](#sec:ccc_algo). +Details of the CCC algorithm can be found in [Methods](#sec:ccc_algo). + +*Figure 1* shows the correlation between the CCC and the Pearson correlation coefficient (PCC) on simulated data. +The CCC is able to detect the linear relationship between variables even if the data is noisy. +It is also able to detect nonlinear relationships between variables, as shown in *Figure 2*. +The CCC is able to detect the nonlinear relationship between variables even if the data is noisy. +The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy. +*Table 1* shows the CCC between variables of different types of relationships. + +*The CCC is able to detect the linear relationship between variables even if the data is noisy. +It is also able to detect nonlinear relationships between variables, as shown in Figure 2. +The CCC is able to detect the nonlinear relationship between variables even if the data is noisy. +The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy. 
+Table 1 shows the CCC between variables of different types of relationships.* + +*Figure 3* shows the correlation between the CCC and the PCC on real data. +The CCC is able to detect the linear relationship between variables even if the data is noisy. +It is also able to detect nonlinear relationships between variables, as shown in *Figure 4*. +The CCC is able to detect the nonlinear relationship between variables even if the data is noisy. +The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy. +*Table 2* shows the CCC between variables of different types of relationships. + +*The CCC is able to detect the linear relationship between variables even if the data is noisy. +It is also able to detect nonlinear relationships between variables, as shown in Figure 4. +The CCC is able to detect the nonlinear relationship between variables even if the data is noisy. +The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy. +Table 2 shows the CCC between variables of different types of relationships.* + +*Figure 5* shows the correlation between the CCC and the PCC on real data. +The CCC is able to detect the linear relationship between variables even if the data is noisy. +It is also able to detect nonlinear relationships between variables, as shown in *Figure 6*. +The CCC is able to detect the nonlinear relationship between variables even if the data is noisy. +The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy. +*Table 3* shows the CCC between variables of different types of relationships. + +*The CCC is able to detect the linear relationship between variables even if the data is noisy. +It is also able to detect nonlinear relationships between variables, as shown in Figure 6. +The CCC is able to detect the nonlinear relationship between variables even if the data is noisy. 
+The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy. +Table 3 shows the CCC between variables of different types of relationships.* + +*Figure 7* shows the correlation between the CCC and the PCC on real data. +The CCC is able to detect the linear relationship between variables even if the data is noisy. +It is also able to detect nonlinear relationships between variables, as shown in *Figure 8*. +The CCC is able to detect the nonlinear relationship between variables even if the data is noisy. +The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy. +*Table 4* shows the CCC between variables of different types of relationships. + +*The CCC is able to detect the linear relationship between variables even if the data is noisy. +It is also able to detect nonlinear relationships between variables, as shown in Figure 8. +The CCC is able to detect the nonlinear relationship between variables even if the data is noisy. +The CCC is also able to detect the nonlinear relationships between variables even when the data is noisy. +Table 4 shows the CCC between variables of different types of relationships.* We examined how the Pearson ($p$), Spearman ($s$) and CCC ($c$) correlation coefficients behaved on different simulated data patterns. diff --git a/content/04.10.results_comp.md b/content/04.10.results_comp.md index bae03b5..bf521d3 100644 --- a/content/04.10.results_comp.md +++ b/content/04.10.results_comp.md @@ -4,12 +4,12 @@ We next examined the characteristics of these correlation coefficients in gene e We selected the top 5,000 genes with the largest variance for our initial analyses on whole blood and then computed the correlation matrix between genes using Pearson, Spearman and CCC (see [Methods](#sec:data_gtex)). -We examined the distribution of each coefficient's absolute values in GTEx (Figure @fig:dist_coefs). 
+The distribution of each coefficient's absolute values in GTEx (Figure @fig:dist_coefs) was examined. CCC (mean=0.14, median=0.08, sd=0.15) has a much more skewed distribution than Pearson (mean=0.31, median=0.24, sd=0.24) and Spearman (mean=0.39, median=0.37, sd=0.26). The coefficients reach a cumulative set containing 70% of gene pairs at different values (Figure @fig:dist_coefs b), $c=0.18$, $p=0.44$ and $s=0.56$, suggesting that for this type of data, the coefficients are not directly comparable by magnitude, so we used ranks for further comparisons. In GTEx v8, CCC values were closer to Spearman and vice versa than either was to Pearson (Figure @fig:dist_coefs c). -We also compared the Maximal Information Coefficient (MIC) in this data (see [Supplementary Note 1](#sec:mic)). -We found that CCC behaved very similarly to MIC, although CCC was up to two orders of magnitude faster to run (see [Supplementary Note 2](#sec:time_test)). +The Maximal Information Coefficient (MIC) was also compared in this data (see [Supplementary Note 1](#sec:mic)). +CCC behaved very similarly to MIC, although CCC was up to two orders of magnitude faster to run (see [Supplementary Note 2](#sec:time_test)). MIC, an advanced correlation coefficient able to capture general patterns beyond linear relationships, represented a significant step forward in correlation analysis research and has been successfully used in various application domains [@pmid:33972855; @pmid:33001806; @pmid:27006077]. These results suggest that our findings for CCC generalize to MIC, therefore, in the subsequent analyses we focus on CCC and linear-only coefficients. @@ -40,10 +40,10 @@ A logarithmic scale was used to color each hexagon. 
](images/coefs_comp/gtex_whole_blood/upsetplot-main.svg "Intersection of gene pairs"){#fig:upsetplot_coefs width="100%"} -While there was broad agreement, more than 20,000 gene pairs with a high CCC value were not highly ranked by the other coefficients (right part of Figure @fig:upsetplot_coefs a). +While there was broad agreement, more than 20,000 gene pairs with a high CCC value were not highly ranked by the other coefficients (right part of Figure 2A). There were also gene pairs with a high Pearson value and either low CCC (1,075), low Spearman (87) or both low CCC and low Spearman values (531). -However, our examination suggests that many of these cases appear to be driven by potential outliers (Figure @fig:upsetplot_coefs b, and analyzed later). -We analyzed gene pairs among the top five of each intersection in the "Disagreements" group (Figure @fig:upsetplot_coefs a, right) where CCC disagrees with Pearson, Spearman or both. +However, our examination suggests that many of these cases appear to be driven by potential outliers (Figure 2B and analyzed later). +We analyzed gene pairs among the top five of each intersection in the "Disagreements" group (Figure 2A, right) where CCC disagrees with Pearson, Spearman or both. ![ **The expression levels of *KDM6A* and *UTY* display sex-specific associations across GTEx tissues.** @@ -55,4 +55,4 @@ The following three gene pairs (*UTY* - *KDM6A*, *RASSF2* - *CYTIP*, and *AC0685 In particular, genes *UTY* and *KDM6A* (paralogs) show a nonlinear relationship where a subset of samples follows a robust linear pattern and another subset has a constant (independent) expression of one gene. This relationship is explained by the fact that *UTY* is in chromosome Y (Yq11) whereas *KDM6A* is in chromosome X (Xp11), and samples with a linear pattern are males, whereas those with no expression for *UTY* are females. 
This combination of linear and independent patterns is captured by CCC ($c=0.29$, above the 80th percentile) but not by Pearson ($p=0.24$, below the 55th percentile) or Spearman ($s=0.10$, below the 15th percentile). -Furthermore, the same gene pair pattern is highly ranked by CCC in all other tissues in GTEx, except for female-specific organs (Figure @fig:gtex_tissues:kdm6a_uty). +Furthermore, the same gene pair pattern is highly ranked by CCC in all other tissues in GTEx, except for female-specific organs. diff --git a/content/04.12.results_giant.md b/content/04.12.results_giant.md index bb5fb25..8aba95f 100644 --- a/content/04.12.results_giant.md +++ b/content/04.12.results_giant.md @@ -2,11 +2,11 @@ We sought to systematically analyze discrepant scores to assess whether associations were replicated in other datasets besides GTEx. This is challenging and prone to bias because linear-only correlation coefficients are usually used in gene co-expression analyses. -We used 144 tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT) [@pmcid:PMC4828725; @url:https://hb.flatironinstitute.org], where nodes represent genes and each edge a functional relationship weighted with a probability of interaction between two genes (see [Methods](#sec:giant)). -Importantly, the version of GIANT used in this study did not include GTEx samples [@url:https://hb.flatironinstitute.org/data], making it an ideal case for replication. -These networks were built from expression and different interaction measurements, including protein-interaction, transcription factor regulation, chemical/genetic perturbations and microRNA target profiles from the Molecular Signatures Database (MSigDB [@pmid:16199517]). +We used 144 tissue-specific gene networks from the Genome-wide Analysis of gene Networks in Tissues (GIANT), where nodes represent genes and each edge a functional relationship weighted with a probability of interaction between two genes. 
+Importantly, the version of GIANT used in this study did not include GTEx samples, making it an ideal case for replication. +These networks were built from expression and different interaction measurements, including protein-interaction, transcription factor regulation, chemical/genetic perturbations and microRNA target profiles from the Molecular Signatures Database (MSigDB). We reasoned that highly-ranked gene pairs using three different coefficients in a single tissue (whole blood in GTEx, Figure @fig:upsetplot_coefs) that represented real patterns should often replicate in a corresponding tissue or related cell lineage using the multi-cell type functional interaction networks in GIANT. -In addition to predicting a network with interactions for a pair of genes, the GIANT web application can also automatically detect a relevant tissue or cell type where genes are predicted to be specifically expressed (the approach uses a machine learning method introduced in [@doi:10.1101/gr.155697.113] and described in [Methods](#sec:giant)). +In addition to predicting a network with interactions for a pair of genes, the GIANT web application can also automatically detect a relevant tissue or cell type where genes are predicted to be specifically expressed. For example, we obtained the networks in blood and the automatically-predicted cell type for gene pairs *RASSF2* - *CYTIP* (CCC high, Figure @fig:giant_gene_pairs a) and *MYOZ1* - *TNNI2* (Pearson high, Figure @fig:giant_gene_pairs b). In addition to the gene pair, the networks include other genes connected according to their probability of interaction (up to 15 additional genes are shown), which allows estimating whether genes are part of the same tissue-specific biological process. Two large black nodes in each network's top-left and bottom-right corners represent our gene pairs. 
@@ -38,7 +38,7 @@ Red indicates CCC-only tissues/cell types, blue are Pearson-only, and purple are We next performed a systematic evaluation using the top 100 discrepant gene pairs between CCC and the other two coefficients. For each gene pair prioritized in GTEx (whole blood), we autodetected a relevant cell type using GIANT to assess whether genes were predicted to be specifically expressed in a blood-relevant cell lineage. For this, we used the top five most commonly autodetected cell types for each coefficient and assessed connectivity in the resulting networks (see [Methods](#sec:giant)). -The top 5 predicted cell types for gene pairs highly ranked by CCC and not by the rest were all blood-specific (Figure @fig:giant_gene_pairs c, top left), including macrophage, leukocyte, natural killer cell, blood and mononuclear phagocyte. +The top 5 predicted cell types for gene pairs highly ranked by CCC and not by the rest were all blood-specific (Figure @fig:giant_gene_pairs c, top left), including macrophage, leukocyte, natural killer cell, blood, and mononuclear phagocyte. The average probability of interaction between genes in these CCC-ranked networks was significantly higher than the other coefficients (Figure @fig:giant_gene_pairs c, top right), with all medians larger than 67% and first quartiles above 41% across predicted cell types. In contrast, most Pearson's gene pairs were predicted to be specific to tissues unrelated to blood (Figure @fig:giant_gene_pairs c, bottom left), with skeletal muscle being the most commonly predicted tissue. The interaction probabilities in these Pearson-ranked networks were also generally lower than in CCC, except for blood-specific gene pairs (Figure @fig:giant_gene_pairs c, bottom right). 
diff --git a/content/06.discussion.md b/content/06.discussion.md index 909b762..d6b11b0 100644 --- a/content/06.discussion.md +++ b/content/06.discussion.md @@ -23,14 +23,16 @@ More generally, a not-only-linear correlation coefficient like CCC could identif It is well-known that biomedical research is biased towards a small fraction of human genes [@pmid:17620606; @pmid:17472739]. Some genes highlighted in CCC-ranked pairs (Figure @fig:upsetplot_coefs b), such as *SDS* (12q24) and *ZDHHC12* (9q34), were previously found to be the focus of fewer than expected publications [@pmid:30226837]. + It is possible that the widespread use of linear coefficients may bias researchers away from genes with complex coexpression patterns. A beyond-linear gene co-expression analysis on large compendia might shed light on the function of understudied genes. For example, gene *KLHL21* (1p36) and *AC068580.6* (*ENSG00000235027*, in 11p15) have a high CCC value and are missed by the other coefficients. *KLHL21* was suggested as a potential therapeutic target for hepatocellular carcinoma [@pmid:27769251] and other cancers [@pmid:29574153; @pmid:35084622]. + Its nonlinear correlation with *AC068580.6* might unveil other important players in cancer initiation or progression, potentially in subsets of samples with specific characteristics (as suggested in Figure @fig:upsetplot_coefs b). -Not-only-linear correlation coefficients might also be helpful in the field of genetic studies. +It is possible that not-only-linear correlation coefficients might also be useful in the field of genetic studies. In this context, genome-wide association studies (GWAS) have been successful in understanding the molecular basis of common diseases by estimating the association between genotype and phenotype [@doi:10.1016/j.ajhg.2017.06.005]. 
However, the estimated effect sizes of genes identified with GWAS are generally modest, and they explain only a fraction of the phenotype variance, hampering the clinical translation of these findings [@doi:10.1038/s41576-019-0127-1]. Recent theories, like the omnigenic model for complex traits [@pmid:28622505; @pmid:31051098], argue that these observations are explained by highly-interconnected gene regulatory networks, with some core genes having a more direct effect on the phenotype than others. @@ -39,13 +41,13 @@ Our results suggest that building these networks with more advanced and efficien Approaches like CCC could play a significant role in the precision medicine field by providing the computational tools to focus on more promising genes representing potentially better candidate drug targets. -Our analyses have some limitations. -We worked on a sample with the top variable genes to keep computation time feasible. -Although CCC is much faster than MIC, Pearson and Spearman are still the most computationally efficient since they only rely on simple data statistics. -Our results, however, reveal the advantages of using more advanced coefficients like CCC for detecting and studying more intricate molecular mechanisms that replicated in independent datasets. -The application of CCC on larger compendia, such as recount3 [@pmid:34844637] with thousands of heterogeneous samples across different conditions, can reveal other potentially meaningful gene interactions. -The single parameter of CCC, $k_{\mathrm{max}}$, controls the maximum complexity of patterns found and also impacts the compute time. -Our analysis suggested that $k_{\mathrm{max}}=10$ was sufficient to identify both linear and more complex patterns in gene expression. +The analyses have some limitations. +The analyses were done on a sample with the top variable genes to keep computation time feasible. 
+Although the CCC is much faster than the MIC, the Pearson and the Spearman are still the most computationally efficient since they only rely on simple data statistics. +The results, however, reveal the advantages of using more advanced coefficients like the CCC for detecting and studying more intricate molecular mechanisms that replicated in independent datasets. +The application of the CCC on larger compendia, such as recount3 [@pmid:34844637] with thousands of heterogeneous samples across different conditions, can reveal other potentially meaningful gene interactions. +The single parameter of the CCC, $k_{\mathrm{max}}$, controls the maximum complexity of patterns found and also impacts the compute time. +The analysis suggested that $k_{\mathrm{max}}=10$ was sufficient to identify both linear and more complex patterns in gene expression. A more comprehensive analysis of optimal values for this parameter could provide insights to adjust it for different applications or data types. diff --git a/content/08.01.methods.ccc.md b/content/08.01.methods.ccc.md index a4d59c4..41b8428 100644 --- a/content/08.01.methods.ccc.md +++ b/content/08.01.methods.ccc.md @@ -16,6 +16,7 @@ Note that the same value of $k$ might not be the right one to find a relationshi For instance, in the quadratic example in Figure @fig:datasets_rel, CCC returns a value of 0.36 (grouping objects in four clusters using one feature and two using the other). If we used only two clusters instead, CCC would return a similarity value of 0.02. Therefore, the CCC algorithm (shown below) searches for this optimal number of clusters given a maximum $k$, which is its single parameter $k_{\mathrm{max}}$. +The CCC algorithm is shown below. 
![ ](images/intro/ccc_algorithm/ccc_algorithm.svg "CCC algorithm"){width="75%"} @@ -26,8 +27,11 @@ Finally, since ARI does not have a lower bound (it could return negative values, Interestingly, since CCC only needs a pair of partitions to compute a similarity value, any type of feature that can be used to perform clustering/grouping is supported. + If the feature is numerical (lines 2 to 5 in the `get_partitions` function), then quantiles are used for clustering (for example, the median generates $k=2$ clusters of objects), from $k=2$ to $k=k_{\mathrm{max}}$. + If the feature is categorical (lines 7 to 9), the categories are used to group objects together. + Consequently, since features are internally categorized into clusters, numerical and categorical variables can be naturally integrated since clusters do not need an order. diff --git a/content/08.05.methods.data.md b/content/08.05.methods.data.md index eaaf068..211f0ab 100644 --- a/content/08.05.methods.data.md +++ b/content/08.05.methods.data.md @@ -1,5 +1,5 @@ ### Gene expression data and preprocessing {#sec:data_gtex} We downloaded GTEx v8 data for all tissues, normalized using TPM (transcripts per million), and focused our primary analysis on whole blood, which has a good sample size (755). -We selected the top 5,000 genes from whole blood with the largest variance after standardizing with $log(x + 1)$ to avoid a bias towards highly-expressed genes. +We selected the top 5,000 genes from whole blood with the largest variance after standardizing with $$ log(x + 1) $$ to avoid a bias towards highly-expressed genes. We then computed Pearson, Spearman, MIC and CCC on these 5,000 genes across all 755 samples on the TPM-normalized data, generating a pairwise similarity matrix of size 5,000 x 5,000. 
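
The CCC procedure described in the methods text above (quantile clusterings of each numerical feature from $k=2$ to $k_{\mathrm{max}}$, categories used directly for categorical features, and the maximum ARI taken across all pairs of clusterings) can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: apart from `get_partitions`, all function names are hypothetical; ties are split by rank for simplicity; degenerate partitions are assigned an ARI of 0; and negative ARI values are assumed to be clipped to 0, consistent with the stated 0 to 1 range.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two partitions given as label lists."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. a single cluster on both sides).
        # Returning 0 here is an assumption of this sketch.
        return 0.0
    return (sum_ij - expected) / (max_index - expected)


def quantile_partition(values, k):
    """Assign each value to one of k equal-frequency (quantile) clusters.

    Simplification: ties are broken by position, so equal values may land
    in different clusters.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * k // n
    return labels


def get_partitions(values, k_max, categorical=False):
    """Return the list of clusterings for one feature."""
    if categorical:
        # Categories group the objects directly; no order is needed.
        return [list(values)]
    k_top = min(k_max, len(values))
    return [quantile_partition(values, k) for k in range(2, k_top + 1)]


def ccc(x, y, k_max=10, x_categorical=False, y_categorical=False):
    """Maximum ARI over all pairs of clusterings, clipped at 0 (assumed)."""
    best = max(
        adjusted_rand_index(px, py)
        for px in get_partitions(x, k_max, x_categorical)
        for py in get_partitions(y, k_max, y_categorical)
    )
    return max(best, 0.0)


if __name__ == "__main__":
    x = list(range(-50, 51))
    print("linear:", ccc(x, [2 * v + 1 for v in x]))
    print("quadratic:", ccc(x, [v * v for v in x]))
    sex = ["F"] * 50 + ["M"] * 50
    print("categorical vs numerical:", ccc(sex, list(range(100)), x_categorical=True))
```

Because the search runs over every pair of clusterings, a relationship that needs a different number of clusters on each feature (as in the quadratic case mentioned in the text) can still be detected, at a cost that grows with $k_{\mathrm{max}}$.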
diff --git a/content/08.15.methods.giant.md b/content/08.15.methods.giant.md index f21c735..4d36549 100644 --- a/content/08.15.methods.giant.md +++ b/content/08.15.methods.giant.md @@ -9,11 +9,19 @@ Gold standards for tissue-specific functional relationships were built using exp Then, one naive Bayesian classifier (using C++ implementations from the Sleipnir library [@pmid:18499696]) for each of the 144 tissues was trained using these gold standards. Finally, these classifiers were used to estimate the probability of tissue-specific interactions for each gene pair. +We accessed tissue-specific gene networks of GIANT using both the web interface and web services provided by HumanBase [@url:https://hb.flatironinstitute.org]. +The GIANT version used in this study included 987 genome-scale datasets with approximately 38,000 conditions from around 14,000 publications. +Details on how these networks were built are described in [@doi:10.1038/ng.3259]. +Briefly, tissue-specific gene networks were built using gene expression data (without GTEx samples [@url:https://hb.flatironinstitute.org/data]) from the NCBI's Gene Expression Omnibus (GEO) [@doi:10.1093/nar/gks1193], protein-protein interaction (BioGRID [@pmc:PMC3531226], IntAct [@doi:10.1093/nar/gkr1088], MINT [@doi:10.1093/nar/gkr930] and MIPS [@pmc:PMC148093]), transcription factor regulation using binding motifs from JASPAR [@doi:10.1093/nar/gkp950], and chemical and genetic perturbations from MSigDB [@doi:10.1073/pnas.0506580102]. +Gene expression data were log-transformed, and the Pearson correlation was computed for each gene pair, normalized using the Fisher's z transform, and z-scores discretized into different bins. +Gold standards for tissue-specific functional relationships were built using expert curation and experimentally derived gene annotations from the Gene Ontology. 
+Then, one naive Bayesian classifier (using C++ implementations from the Sleipnir library [@pmid:18499696]) for each of the 144 tissues was trained using these gold standards. +Finally, these classifiers were used to estimate the probability of tissue-specific interactions for each gene pair. + + +For each pair of genes prioritized in our study using GTEx, we used GIANT through HumanBase to obtain 1) a predicted gene network for blood (manually selected to match whole blood in GTEx) and 2) a gene network with an automatically predicted tissue using the method described in [@doi:10.1101/gr.155697.113] and provided by HumanBase web interfaces/services. -For each pair of genes prioritized in our study using GTEx, we used GIANT through HumanBase to obtain -1) a predicted gene network for blood (manually selected to match whole blood in GTEx) and -2) a gene network with an automatically predicted tissue using the method described in [@doi:10.1101/gr.155697.113] and provided by HumanBase web interfaces/services. Briefly, the tissue prediction approach trains a machine learning model using comprehensive transcriptional data with human-curated markers of different cell lineages (e.g., macrophages) as gold standards. Then, these models are used to predict other cell lineage-specific genes. In addition to reporting this predicted tissue or cell lineage, we computed the average probability of interaction between all genes in the network retrieved from GIANT. -Following the default procedure used in GIANT, we included the top 15 genes with the highest probability of interaction with the queried gene pair for each network. +Following the default procedure used in GIANT, we included the top 15 genes with the highest probability of interaction with the queried gene pair for each network. 
diff --git a/content/20.00.supplementary_material.md b/content/20.00.supplementary_material.md index bb72625..c3702dc 100644 --- a/content/20.00.supplementary_material.md +++ b/content/20.00.supplementary_material.md @@ -2,10 +2,10 @@ ### Supplementary Note 1: Comparison with the Maximal Information Coefficient (MIC) on gene expression data {#sec:mic} -We compared all the coefficients in this study with MIC [@pmid:22174245], a popular nonlinear method that can find complex relationships in data, although very computationally intensive [@doi:10.1098/rsos.201424]. +We compared our CCC coefficient with MIC [@pmid:22174245], a popular nonlinear method that can find complex relationships in data, although it is very computationally intensive [@doi:10.1098/rsos.201424]. We ran MICe (see Methods) on all possible pairwise comparisons of our 5,000 highly variable genes from whole blood in GTEx v8. This took 4 days and 19 hours to finish (compared with 9 hours for CCC). -Then we performed the analysis on the distribution of coefficients (the same as in the main text), shown in Figure @fig:dist_coefs_mic. +Then we performed the same analysis on the distribution of coefficients (as in the main text), shown in Figure @fig:dist_coefs_mic. We verified that CCC and MIC behave similarly in this dataset, with essentially the same distribution but only shifted. Figure @fig:dist_coefs_mic c shows that these two coefficients relate almost linearly, and both compare very similarly with Pearson and Spearman.