Normalization and gene selection by analytical Pearson residuals (#1715)

* adding core functions and documentation for pearson residual normalization and hvg selection
* adding Pearson residual+PCA bundles, minor bug fixes
* some style cleanup, minor fixes
* adapting _normalize_pearson_residuals() to cleaned-up _normalized_total() from #1667
* updating layer management as in #1667 for _highly_variable_pearson_residuals() as well
* slight performance improvement for sparse input
* style cleanup
* fixing import issue, fixing docstring style, adding check_values param and warning as in #1642
* fixed small NameError, simplified clip argument
* remove pd.categorical()
* adding check_values to docstrings and remaining pearson residual functions
* np.empty instead of np.nan
* add references to docstrings, add HVG details to docstring
* exposing pca keyword arguments to the user for the bundle/recipe functions
* removed unneeded reversal in hvg, fix kwargs_pca bug, consistent defaults across files
* fixing handling of `inplace` and `subset` arguments (see issue #1886), explicit typing of output, adding theta input check
* renaming output fields for consistency, fixing minor bug
* renaming output fields for consistency
* adding function that prepares testdata (used for pearson residual tests)
* adding tests for all pearson residual functions
* fix precommit high_var_genes
* try to get precommit to work
* try to get precommit to work
* fix recipes
* fix normalization
* remove relative imports
* fix docstrings
* retry to build docs
* fix highvar docstring
* more fixing docstrings
* docs build locally ? 🔨
* minor cleanup test normalization
* more minor cleanups
* final cleanup normalization
* fixes high var
* init experimental module
* fix column ordering for batch case
* moving to experimental, minor fix for experimental version of hvg selection
* linking tests to new experimental submodule, style cleanup
* adapt input arguments and docstring for experimental version of hvg selection function
* add recipes
* fix docs
* add correct module docs
* fix recipe docstrings
* try fix indentation
* fix indentation
* fix
* new indentation
* add space
* fixing typo in docstring
* renaming pca output fields
* adapting tests to new output fieldname
* fix docs 🔨
* update docs
* fix test 🔨
* ensure argument and docstring consistency
* update citation year
* cleaning imports in `preprocessing` functions
* making inputcheck tests specific to error/warning messages
* making inputcheck tests specific to error/warning messages
* resolve HVGs across batches more cleanly, fix dtype issue
* renaming pca input arguments
* renaming pca input arguments
* _pca bundle: more efficient copy handling, added input check. both _pca and _recipe: varm field for PCs, adapted tests and docs
* move repeated inputcheck code to helpers
* merging tests *_values and *_general
* condense code in pearson hvg selection test, smaller test data for speedup
* condensing code in normalization tests
* add asteriks for keyword
* updating refs to Genome Biology publication
* cleanup helpers.py
* cleanup main files as requested by @ivirshup
* revert unneeded settingWithCopy fix
* cache data
* use doc_params for doc
* fix doc_params var
* finalize docs
* fix param doc
* wrong var still
* add cached datasets module and test on high_var_genes tests
* use new cache dataset module for tests
* fix precommit
* fix docs
* fix reference and add notebook to tutorials
* add release note
* add release note
* fix release note
* typo
* remove duplicate reference
* fixing black flake etc requirements
* add _pca function to release note
* last edits to docs
* fix release and tutorial image
* try fix pre-commit
* minor docs
* Remove accidentally included files from merge

Co-authored-by: giovp <[email protected]>
Co-authored-by: Isaac Virshup <[email protected]>
scverse · Mar 29, 2022 · 636316d
1 parent 0728d55
Showing 35 changed files with 1,509 additions and 93 deletions.
diff --git a/docs/api.md b/docs/api.md
@@ -436,6 +436,26 @@ Collections of useful measurements for evaluating results.
 
 ```
 
+## Experimental
+
+```{eval-rst}
+.. module:: scanpy.experimental
+.. currentmodule:: scanpy
+```
+
+New methods that are in early development and not (yet)
+integrated into Scanpy core.
+
+```{eval-rst}
+.. autosummary::
+   :toctree: generated/
+
+   experimental.pp.normalize_pearson_residuals
+   experimental.pp.normalize_pearson_residuals_pca
+   experimental.pp.highly_variable_genes
+   experimental.pp.recipe_pearson_residuals
+```
+
 ## Classes
 
 {class}`~anndata.AnnData` is reexported from {mod}`anndata`.

diff --git a/docs/references.rst b/docs/references.rst
@@ -119,6 +119,10 @@ References
    *Laplacian Dynamics and Multiscale Modular Structure in Networks*
    `arXiv <https://arxiv.org/abs/0812.1770>`__.
 
+.. [Lause21] Lause *et al.* (2021),
+   *Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data*,
+   `Genome Biology <https://doi.org/10.1186/s13059-021-02451-7>`__.
+
 .. [Leek12] Leek *et al.* (2012),
    *sva: Surrogate Variable Analysis. R package*
    `Bioconductor <https://doi.org/10.18129/B9.bioc.sva>`__.

diff --git a/docs/release-notes/1.9.0.md b/docs/release-notes/1.9.0.md
@@ -8,3 +8,14 @@
 - {func}`~scanpy.logging.print_versions` now uses `session_info` {pr}`2089` {smaller}`P Angerer` {smaller}`I Virshup`
 - `_choose_representation` now subsets the provided representation to n_pcs, regardless of the name of the provided representation (should affect mostly {func}`~scanpy.pp.neighbors`)  {pr}`2179`  {smaller}`I Virshup` {smaller}`PG Majev`
 - Embedding plots now have a `dimensions` argument, which lets users select which dimensions of their embedding to plot and uses the same broadcasting rules as other arguments {pr}`1538` {smaller}`I Virshup`
+
+```{rubric} Experimental module
+```
+
+- Added {mod}`scanpy.experimental` module!
+
+  - Added {func}`scanpy.experimental.pp.normalize_pearson_residuals` for Pearson Residuals normalization {pr}`1715` {smaller}`J Lause, G Palla, I Virshup`
+  - Added {func}`scanpy.experimental.pp.normalize_pearson_residuals_pca` for Pearson Residuals normalization and PCA {pr}`1715` {smaller}`J Lause, G Palla, I Virshup`
+  - Added {func}`scanpy.experimental.pp.highly_variable_genes` for HVG selection with Pearson Residuals {pr}`1715` {smaller}`J Lause, G Palla, I Virshup`
+  - Added {func}`scanpy.experimental.pp.normalize_pearson_residuals_pca` for Pearson Residuals normalization and dimensionality reduction with PCA {pr}`1715` {smaller}`J Lause, G Palla, I Virshup`
+  - Added {func}`scanpy.experimental.pp.recipe_pearson_residuals` for Pearson Residuals normalization, HVG selection and dimensionality reduction with PCA  {pr}`1715` {smaller}`J Lause, G Palla, I Virshup`
diff --git a/docs/tutorials.md b/docs/tutorials.md
@@ -94,6 +94,10 @@ See the [cell cycle] notebook.
 :width: 120px
 ```
 
+### Normalization with Pearson Residuals
+
+Normalization of scRNA-seq data with Pearson Residuals, from [^cite_lause21]: {tutorial}`tutorial_pearson_residuals`
+
 ### Scaling Computations
 
 - Visualize and cluster [1.3M neurons] from 10x Genomics.

diff --git a/scanpy/__init__.py b/scanpy/__init__.py
@@ -14,7 +14,7 @@
     from . import tools as tl
     from . import preprocessing as pp
     from . import plotting as pl
-    from . import datasets, logging, queries, external, get, metrics
+    from . import datasets, logging, queries, external, get, metrics, experimental
 
     from anndata import AnnData, concat
     from anndata import (

diff --git a/scanpy/experimental/__init__.py b/scanpy/experimental/__init__.py
@@ -0,0 +1 @@
+from . import pp
diff --git a/scanpy/experimental/_docs.py b/scanpy/experimental/_docs.py
@@ -0,0 +1,79 @@
+"""Shared docstrings for experimental function parameters.
+"""
+
+doc_adata = """\
+adata
+    The annotated data matrix of shape `n_obs` × `n_vars`.
+    Rows correspond to cells and columns to genes.
+"""
+
+doc_dist_params = """\
+theta
+    The negative binomial overdispersion parameter `theta` for Pearson residuals.
+    Higher values correspond to less overdispersion \
+    (`var = mean + mean^2/theta`), and `theta=np.inf` corresponds to a Poisson model.
+clip
+    Determines if and how residuals are clipped:
+
+    * If `None`, residuals are clipped to the interval \
+    `[-sqrt(n_obs), sqrt(n_obs)]`, where `n_obs` is the number of cells in the dataset (default behavior).
+    * If any scalar `c`, residuals are clipped to the interval `[-c, c]`. Set \
+    `clip=np.inf` for no clipping.
+"""
+
+doc_check_values = """\
+check_values
+    If `True`, checks whether counts in the selected layer are integers, as expected
+    by this function, and issues a warning if non-integers are found. Otherwise,
+    proceeds without checking. Setting this to `False` can speed up code for large datasets.
+"""
+
+doc_layer = """\
+layer
+    Layer to use as input instead of `X`. If `None`, `X` is used.
+"""
+
+doc_subset = """\
+subset
+    If `True`, subset `adata` in place to the highly variable genes; otherwise,
+    merely indicate highly variable genes.
+"""
+
+doc_genes_batch_chunk = """\
+n_top_genes
+    Number of highly-variable genes to keep. Mandatory if `flavor='seurat_v3'` or
+    `flavor='pearson_residuals'`.
+batch_key
+    If specified, highly-variable genes are selected within each batch separately
+    and merged. This simple process avoids the selection of batch-specific genes
+    and acts as a lightweight batch correction method. Genes are first sorted by
+    the number of batches in which they are an HVG. If
+    `flavor='pearson_residuals'`, ties are broken by the median rank (across
+    batches) based on within-batch residual variance.
+chunksize
+    If `flavor='pearson_residuals'`, this determines how many genes are processed at
+    once while computing the residual variance. Choosing a smaller value will reduce
+    the required memory.
+"""
+
+doc_pca_chunk = """\
+n_comps
+    Number of principal components to compute in the PCA step.
+random_state
+    Random seed for setting the initial states for the optimization in the PCA step.
+kwargs_pca
+    Dictionary of further keyword arguments passed on to `scanpy.pp.pca()`.
+"""
+
+doc_inplace = """\
+inplace
+    If `True`, update `adata` with results. Otherwise, return results. See below for
+    details of what is returned.
+"""
+
+doc_copy = """\
+copy
+    If `True`, the function runs on a copy of the input object and returns the
+    modified copy. Otherwise, the input object is modified directly. Not compatible
+    with `inplace=False`.
+"""
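
The `theta` and `clip` descriptions in `_docs.py` can be illustrated with a small NumPy sketch. This is not the code added by this commit: the function name `pearson_residuals_sketch` is hypothetical, and the expected-count model (row sum times column sum, divided by the grand total) is an assumption taken from the Lause *et al.* formulation rather than from the docstrings above. Only the variance formula `var = mean + mean^2/theta` and the clipping rules come from the diff itself.

```python
import numpy as np


def pearson_residuals_sketch(counts, theta=100.0, clip=None):
    """Illustrative Pearson residuals for a dense cells x genes count matrix.

    Hypothetical helper, not scanpy's implementation. Assumes every cell
    and every gene has at least one count, so no expected value is zero.
    """
    counts = np.asarray(counts, dtype=float)
    if theta <= 0:
        raise ValueError("theta must be positive")

    n_obs = counts.shape[0]
    if clip is None:
        # Default described in the docstring: clip to +/- sqrt(n_obs).
        clip = np.sqrt(n_obs)
    if clip < 0:
        raise ValueError("clip must be non-negative")

    # Assumed mean model: mu_ij = (total of cell i) * (total of gene j) / grand total.
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    mu = cell_totals * gene_totals / counts.sum()

    # Negative binomial variance from the docstring: var = mean + mean^2 / theta.
    # With theta = np.inf the second term vanishes and the model is Poisson.
    variance = mu + mu**2 / theta

    residuals = (counts - mu) / np.sqrt(variance)
    return np.clip(residuals, -clip, clip)
```

Passing `theta=np.inf` reduces the variance to the mean (the Poisson case), and passing `clip=np.inf` disables clipping, which matches the two limiting cases named in the docstring.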
diff --git a/scanpy/experimental/pp/__init__.py b/scanpy/experimental/pp/__init__.py
@@ -0,0 +1,8 @@
+from scanpy.experimental.pp._normalization import (
+    normalize_pearson_residuals,
+    normalize_pearson_residuals_pca,
+)
+
+from scanpy.experimental.pp._highly_variable_genes import highly_variable_genes
+
+from scanpy.experimental.pp._recipes import recipe_pearson_residuals
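
The `batch_key` description in `_docs.py` (sort genes by the number of batches in which they are an HVG, then break ties by the median within-batch rank of the residual variance) can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation in `_highly_variable_genes.py`: the function name `merge_hvgs_across_batches` is hypothetical, the per-batch residual variances are assumed to be precomputed, and the median rank is taken over all batches rather than only those in which a gene was selected.

```python
import numpy as np


def merge_hvgs_across_batches(residual_variances, n_top_genes):
    """Combine per-batch HVG rankings into one selection.

    Hypothetical helper. `residual_variances` has shape
    (n_batches, n_genes) and holds precomputed within-batch values.
    """
    rv = np.asarray(residual_variances, dtype=float)
    n_batches, n_genes = rv.shape

    # Within each batch, rank 0 is the gene with the largest residual variance.
    order = np.argsort(-rv, axis=1, kind="stable")
    ranks = np.empty_like(order)
    batch_index = np.arange(n_batches)[:, None]
    ranks[batch_index, order] = np.arange(n_genes)[None, :]

    # A gene is an HVG in a batch if it is among that batch's top n_top_genes.
    n_batches_hvg = (ranks < n_top_genes).sum(axis=0)
    median_rank = np.median(ranks, axis=0)

    # np.lexsort treats the LAST key as primary: more batches first,
    # then lower median rank to break ties.
    final_order = np.lexsort((median_rank, -n_batches_hvg))

    selected = np.zeros(n_genes, dtype=bool)
    selected[final_order[:n_top_genes]] = True
    return selected, n_batches_hvg, median_rank
```

With two batches and `n_top_genes=1`, two different genes can each be an HVG in exactly one batch; the median rank then decides which one is kept.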