Normalization and gene selection by analytical Pearson residuals (#1715)
* adding core functions and documentation for pearson residual normalization and hvg selection

* adding Pearson residual+PCA bundles, minor bug fixes

* some style cleanup, minor fixes

* adapting _normalize_pearson_residuals() to cleaned-up _normalized_total() from #1667

* updating layer management as in #1667 for _highly_variable_pearson_residuals() as well

* slight performance improvement for sparse input

* style cleanup

* fixing import issue, fixing docstring style, adding check_values param and warning as in #1642

* fixed small NameError, simplified clip argument

* remove pd.categorical()

* adding check_values to docstrings and remaining pearson residual functions

* np.empty instead of np.nan

* add references to docstrings, add HVG details to docstring

* exposing pca keyword arguments to the user for the bundle/recipe functions

* removed unneeded reversal in hvg, fix kwargs_pca bug, consistent defaults across files

* fixing handling of `inplace` and `subset` arguments (see issue #1886), explicit typing of output, adding theta input check

* renaming output fields for consistency, fixing minor bug

* renaming output fields for consistency

* adding function that prepares testdata (used for pearson residual tests)

* adding tests for all pearson residual functions

* fix precommit high_var_genes

* try to get precommit to work

* try to get precommit to work

* fix recipes

* fix normalization

* remove relative imports

* fix docstrings

* retry to build docs

* fix highvar docstring

* more fixing docstrings

* docs build locally ? 🔨

* minor cleanup test normalization

* more minor cleanups

* final cleanup normalization

* fixes high var

* init experimental module

* fix column ordering for batch case

* moving to experimental, minor fix for experimental version of hvg selection

* linking tests to new experimental submodule, style cleanup

* adapt input arguments and docstring for experimental version of hvg selection function

* add recipes

* fix docs

* add correct module docs

* fix recipe docstrings

* try fix indentation

* fix indentation

* fix

* new indentation

* add space

* fixing typo in docstring

* renaming pca output fields

* adapting tests to new output fieldname

* fix docs 🔨

* update docs

* fix test 🔨

* ensure argument and docstring consistency

* update citation year

* cleaning imports in `preprocessing` functions

* making inputcheck tests specific to error/warning messages

* making inputcheck tests specific to error/warning messages

* resolve HVGs across batches more cleanly, fix dtype issue

* renaming pca input arguments

* renaming pca input arguments

* _pca bundle: more efficient copy handling, added input check. both _pca and _recipe: varm field for PCs, adapted tests and docs

* move repeated inputcheck code to helpers

* merging tests *_values and *_general

* condense code in pearson hvg selection test, smaller test data for speedup

* condensing code in normalization tests

* add asterisks for keyword

* updating refs to Genome Biology publication

* cleanup helpers.py

* cleanup main files as requested by @ivirshup

* revert unneeded settingWithCopy fix

* cache data

* use doc_params for doc

* fix doc_params var

* finalize docs

* fix param doc

* wrong var still

* add cached datasets module and test on high_var_genes tests

* use new cache dataset module for tests

* fix precommit

* fix docs

* fix reference and add notebook to tutorials

* add release note

* add release note

* fix release note

* typo

* remove duplicate reference

* fixing black flake etc requirements

* add _pca function to release note

* last edits to docs

* fix release and tutorial image

* try fix pre-commit

* minor docs

* Remove accidentally included files from merge

Co-authored-by: giovp <[email protected]>
Co-authored-by: Isaac Virshup <[email protected]>
3 people authored Mar 29, 2022
1 parent 0728d55 commit 636316d
Showing 35 changed files with 1,509 additions and 93 deletions.
20 changes: 20 additions & 0 deletions docs/api.md
@@ -436,6 +436,26 @@ Collections of useful measurements for evaluating results.
```

## Experimental

```{eval-rst}
.. module:: scanpy.experimental
.. currentmodule:: scanpy
```

New methods in early development that are not (yet)
integrated into Scanpy core.

```{eval-rst}
.. autosummary::
   :toctree: generated/

   experimental.pp.normalize_pearson_residuals
   experimental.pp.normalize_pearson_residuals_pca
   experimental.pp.highly_variable_genes
   experimental.pp.recipe_pearson_residuals
```

## Classes

{class}`~anndata.AnnData` is reexported from {mod}`anndata`.
4 changes: 4 additions & 0 deletions docs/references.rst
@@ -119,6 +119,10 @@ References
*Laplacian Dynamics and Multiscale Modular Structure in Networks*
`arXiv <https://arxiv.org/abs/0812.1770>`__.
.. [Lause21] Lause *et al.* (2021)
*Analytic Pearson residuals for normalization of single-cell RNA-seq UMI data*,
`Genome Biology <https://doi.org/10.1186/s13059-021-02451-7>`__.
.. [Leek12] Leek *et al.* (2012),
*sva: Surrogate Variable Analysis. R package*
`Bioconductor <https://doi.org/10.18129/B9.bioc.sva>`__.
11 changes: 11 additions & 0 deletions docs/release-notes/1.9.0.md
@@ -8,3 +8,14 @@
- {func}`~scanpy.logging.print_versions` now uses `session_info` {pr}`2089` {smaller}`P Angerer` {smaller}`I Virshup`
- `_choose_representation` now subsets the provided representation to n_pcs, regardless of the name of the provided representation (should affect mostly {func}`~scanpy.pp.neighbors`) {pr}`2179` {smaller}`I Virshup` {smaller}`PG Majev`
- Embedding plots now have a `dimensions` argument, which lets users select which dimensions of their embedding to plot and uses the same broadcasting rules as other arguments {pr}`1538` {smaller}`I Virshup`

```{rubric} Experimental module
```

- Added {mod}`scanpy.experimental` module!

- Added {func}`scanpy.experimental.pp.normalize_pearson_residuals` for Pearson Residuals normalization {pr}`1715` {smaller}`J Lause, G Palla, I Virshup`
- Added {func}`scanpy.experimental.pp.normalize_pearson_residuals_pca` for Pearson Residuals normalization and PCA {pr}`1715` {smaller}`J Lause, G Palla, I Virshup`
- Added {func}`scanpy.experimental.pp.highly_variable_genes` for HVG selection with Pearson Residuals {pr}`1715` {smaller}`J Lause, G Palla, I Virshup`
- Added {func}`scanpy.experimental.pp.normalize_pearson_residuals_pca` for Pearson Residuals normalization and dimensionality reduction with PCA {pr}`1715` {smaller}`J Lause, G Palla, I Virshup`
- Added {func}`scanpy.experimental.pp.recipe_pearson_residuals` for Pearson Residuals normalization, HVG selection and dimensionality reduction with PCA {pr}`1715` {smaller}`J Lause, G Palla, I Virshup`
4 changes: 4 additions & 0 deletions docs/tutorials.md
@@ -94,6 +94,10 @@ See the [cell cycle] notebook.
:width: 120px
```

### Normalization with Pearson Residuals

Normalization of scRNA-seq data with Pearson Residuals, from [^cite_lause21]: {tutorial}`tutorial_pearson_residuals`

### Scaling Computations

- Visualize and cluster [1.3M neurons] from 10x Genomics.
2 changes: 1 addition & 1 deletion scanpy/__init__.py
@@ -14,7 +14,7 @@
from . import tools as tl
from . import preprocessing as pp
from . import plotting as pl
from . import datasets, logging, queries, external, get, metrics
from . import datasets, logging, queries, external, get, metrics, experimental

from anndata import AnnData, concat
from anndata import (
1 change: 1 addition & 0 deletions scanpy/experimental/__init__.py
@@ -0,0 +1 @@
from . import pp
79 changes: 79 additions & 0 deletions scanpy/experimental/_docs.py
@@ -0,0 +1,79 @@
"""Shared docstrings for experimental function parameters.
"""

doc_adata = """\
adata
The annotated data matrix of shape `n_obs` × `n_vars`.
Rows correspond to cells and columns to genes.
"""

doc_dist_params = """\
theta
The negative binomial overdispersion parameter `theta` for Pearson residuals.
Higher values correspond to less overdispersion \
(`var = mean + mean^2/theta`), and `theta=np.inf` corresponds to a Poisson model.
clip
Determines if and how residuals are clipped:
* If `None`, residuals are clipped to the interval \
`[-sqrt(n_obs), sqrt(n_obs)]`, where `n_obs` is the number of cells in the dataset (default behavior).
* If any scalar `c`, residuals are clipped to the interval `[-c, c]`. Set \
`clip=np.inf` for no clipping.
"""

doc_check_values = """\
check_values
If `True`, check whether counts in the selected layer are integers, as expected
by this function, and issue a warning if non-integers are found. Otherwise, proceed
without checking. Setting this to `False` can speed up code for large datasets.
"""

doc_layer = """\
layer
Layer to use as input instead of `X`. If `None`, `X` is used.
"""

doc_subset = """\
subset
If `True`, subset to highly variable genes in place; otherwise merely flag
highly variable genes.
"""

doc_genes_batch_chunk = """\
n_top_genes
Number of highly-variable genes to keep. Mandatory if `flavor='seurat_v3'` or
`flavor='pearson_residuals'`.
batch_key
If specified, highly-variable genes are selected within each batch separately
and merged. This simple process avoids the selection of batch-specific genes
and acts as a lightweight batch correction method. Genes are first sorted by
the number of batches in which they are an HVG. If `flavor='pearson_residuals'`, ties are
broken by the median rank (across batches) based on within-batch residual
variance.
chunksize
If `flavor='pearson_residuals'`, this determines how many genes are processed at
once while computing the residual variance. Choosing a smaller value will reduce
the required memory.
"""

doc_pca_chunk = """\
n_comps
Number of principal components to compute in the PCA step.
random_state
Random seed for setting the initial states for the optimization in the PCA step.
kwargs_pca
Dictionary of further keyword arguments passed on to `scanpy.pp.pca()`.
"""

doc_inplace = """\
inplace
If `True`, update `adata` with results. Otherwise, return results. See below for
details of what is returned.
"""

doc_copy = """\
copy
If `True`, the function runs on a copy of the input object and returns the
modified copy. Otherwise, the input object is modified directly. Not compatible
with `inplace=False`.
"""
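
The `theta` and `clip` docstrings in `_docs.py` describe the variance model and the clipping rule, but not the full computation. The following is a minimal NumPy sketch of how they could fit together on a small dense matrix. It is not the code added by this commit. The expected-count model (`cell total * gene total / grand total`) is an assumption taken from the cited Lause *et al.* paper rather than from this diff, and the function name `pearson_residuals` is hypothetical. The sketch also assumes every gene has at least one count, so the expected counts are never zero.

```python
import numpy as np


def pearson_residuals(counts, theta=100.0, clip=None):
    """Sketch of analytic Pearson residuals for a dense cells x genes matrix."""
    counts = np.asarray(counts, dtype=float)
    n_obs = counts.shape[0]

    # Assumed null model (from the cited paper, not from this diff):
    # expected count = cell total * gene total / grand total.
    cell_totals = counts.sum(axis=1, keepdims=True)
    gene_totals = counts.sum(axis=0, keepdims=True)
    mu = cell_totals @ gene_totals / counts.sum()

    # Negative binomial variance as given in the `theta` docstring:
    # var = mean + mean^2 / theta; theta = inf reduces to the Poisson case.
    variance = mu + mu**2 / theta
    residuals = (counts - mu) / np.sqrt(variance)

    # clip=None selects the default bound sqrt(n_obs); np.inf disables clipping.
    bound = np.sqrt(n_obs) if clip is None else clip
    if bound < 0:
        raise ValueError("clip must be non-negative")
    return np.clip(residuals, -bound, bound)


# Toy data: 4 cells x 3 genes, every gene observed at least once.
counts = np.array([[0, 4, 1], [2, 0, 3], [1, 1, 0], [5, 2, 2]])
residuals = pearson_residuals(counts)
print(residuals.shape)
```

With the default `clip=None`, no residual in this toy example can exceed `sqrt(4) = 2` in absolute value. Passing `theta=np.inf, clip=np.inf` gives unclipped Poisson residuals.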
8 changes: 8 additions & 0 deletions scanpy/experimental/pp/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
from scanpy.experimental.pp._normalization import (
normalize_pearson_residuals,
normalize_pearson_residuals_pca,
)

from scanpy.experimental.pp._highly_variable_genes import highly_variable_genes

from scanpy.experimental.pp._recipes import recipe_pearson_residuals
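
The `batch_key` docstring in `_docs.py` says that genes are sorted by the number of batches in which they are an HVG, with ties broken by the median within-batch rank of the residual variance. The sketch below illustrates that ordering rule with plain NumPy. It is an assumption-laden illustration, not the implementation of `highly_variable_genes`: the function name `select_hvgs_across_batches` is hypothetical, the input is assumed to be a precomputed `batches x genes` array of residual variances, and the median rank is taken over all batches, which may differ from what the real function does.

```python
import numpy as np


def select_hvgs_across_batches(residual_variances, n_top_genes):
    """Return indices of the selected genes, best first (illustrative sketch)."""
    rv = np.asarray(residual_variances, dtype=float)
    n_batches, n_genes = rv.shape

    # Rank genes within each batch: rank 0 = highest residual variance.
    order = np.argsort(-rv, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(n_batches)[:, None]
    ranks[rows, order] = np.arange(n_genes)

    # A gene is an HVG in a batch if it is among that batch's top n_top_genes.
    n_batches_hvg = (ranks < n_top_genes).sum(axis=0)

    # Simplification: median rank over all batches.
    median_rank = np.median(ranks, axis=0)

    # np.lexsort treats the LAST key as primary: most batches first,
    # then lowest median rank.
    final_order = np.lexsort((median_rank, -n_batches_hvg))
    return final_order[:n_top_genes]


# Toy data: 3 batches x 4 genes of residual variances.
rv = np.array(
    [
        [5.0, 1.0, 3.0, 0.5],
        [4.0, 0.2, 0.1, 3.0],
        [2.0, 0.3, 0.1, 6.0],
    ]
)
print(select_hvgs_across_batches(rv, n_top_genes=2))
```

In this toy example, gene 0 is in the top two of all three batches and gene 3 in two of them, so those two genes are selected.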