sc.pp.highly_variable_genes with flavor='seurat_v3' returns wrong means/variances when batch_key argument is used #1725

jlause · 2021-03-09T18:48:01Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of scanpy.
(optional) I have confirmed this bug exists on the master branch of scanpy.

When working on PR #1715, I noticed a small bug when sc.pp.highly_variable() is run with flavor='seurat_v3' and the batch_key argument is used on a dataset with multiple batches:

The columns in the returned data frame means and variances do not give the correct gene means and gene variances across the whole dataset, but instead give the means and variances of the batch with the last index.

I think this because here means and variances are first computed within each batch:
https://github.com/theislab/scanpy/blob/a085333ead6ff8a0f64c25734060ffc048be56c0/scanpy/preprocessing/_highly_variable_genes.py#L72-L77
..and then not re-computed for the whole dataset before inserting them here:
https://github.com/theislab/scanpy/blob/a085333ead6ff8a0f64c25734060ffc048be56c0/scanpy/preprocessing/_highly_variable_genes.py#L136-L137

Should be an easy fix, I can also prepare it if you want :)

Code to reproduce:

import numpy as np
import scanpy as sc
import anndata 

import sys
sys.path.append("scanpy/preprocessing")
from _utils import _get_mean_var

np.random.seed(42)
adata = anndata.AnnData(np.random.randint(0,5,(100,100)))
adata.obs['batch'] = np.random.randint(0,5,(100))

#show that without batch_key, everything works fine
true_mean,true_var = _get_mean_var(adata.X)
result = sc.pp.highly_variable_genes(adata,n_top_genes=2,flavor='seurat_v3',inplace=False)
print('Results correct without batch_key?')
print(np.all(result['means']==true_mean))
print(np.all(result['variances']==true_var))

#show that it goes wrong with batch_key..
result_with_batchkey = sc.pp.highly_variable_genes(adata,n_top_genes=2,flavor='seurat_v3',inplace=False,batch_key='batch')
print('Results correct with batch_key?')
print(np.all(result_with_batchkey['means']==true_mean))
print(np.all(result_with_batchkey['variances']==true_var))

#..and that the wrong result comes from the last batch
adata_batch4 = adata[adata.obs['batch']==4,:].copy()
means_batch4,vars_batch4 = _get_mean_var(adata_batch4.X)
print('Wrong results come from last batch?')
print(np.all(result_with_batchkey['means']==means_batch4))
print(np.all(result_with_batchkey['variances']==vars_batch4))

Results correct without batch_key?
True
True
Results correct with batch_key?
False
False
Wrong results come from last batch?
True
True

/gpfs01/berens/user/jlause/libs/scanpy/scanpy/preprocessing/_highly_variable_genes.py:146: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version.  Use .loc with labels or .iloc with positions instead.
  df.loc[: int(n_top_genes), 'highly_variable'] = True
/gpfs01/berens/user/jlause/libs/scanpy/scanpy/preprocessing/_highly_variable_genes.py:146: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version.  Use .loc with labels or .iloc with positions instead.
  df.loc[: int(n_top_genes), 'highly_variable'] = True

Versions

anndata 0.7.5
scanpy 1.7.1.dev2+g8c469411
sinfo 0.3.1

PIL 8.1.0
anndata 0.7.5
anyio NA
attr 20.3.0
babel 2.9.0
backcall 0.2.0
certifi 2020.12.05
cffi 1.14.4
chardet 4.0.0
constants NA
cycler 0.10.0
cython_runtime NA
dateutil 2.8.1
decorator 4.4.2
get_version 2.1
google NA
h5py 2.10.0
highs_wrapper NA
idna 2.10
ipykernel 5.4.2
ipython_genutils 0.2.0
jedi 0.17.2
jinja2 2.11.2
joblib 1.0.0
json5 NA
jsonschema 3.2.0
jupyter_server 1.2.1
jupyterlab_server 2.1.1
kiwisolver 1.3.1
legacy_api_wrap 1.2
llvmlite 0.35.0
markupsafe 1.1.1
matplotlib 3.3.3
mpl_toolkits NA
natsort 7.1.0
nbclassic NA
nbformat 5.0.8
numba 0.52.0
numexpr 2.7.2
numpy 1.19.5
packaging 20.8
pandas 1.2.0
parso 0.7.1
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
prometheus_client NA
prompt_toolkit 3.0.10
ptyprocess 0.7.0
pvectorc NA
pygments 2.7.3
pyparsing 2.4.7
pyrsistent NA
pytz 2020.5
requests 2.25.1
scanpy 1.7.1.dev2+g8c469411
scipy 1.6.0
send2trash NA
sinfo 0.3.1
sitecustomize NA
six 1.15.0
sklearn 0.24.0
sniffio 1.2.0
storemagic NA
tables 3.6.1
tornado 6.1
traitlets 5.0.5
typing_extensions NA
urllib3 1.26.2
wcwidth 0.2.5
yaml 5.3.1
zmq 20.0.0

IPython 7.19.0
jupyter_client 6.1.11
jupyter_core 4.7.0
jupyterlab 3.0.3
notebook 6.1.6

Python 3.8.0 (default, Oct 28 2019, 16:14:01) [GCC 8.3.0]
Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.27
40 logical CPU cores, x86_64

Session information updated at 2021-03-09 19:18

The text was updated successfully, but these errors were encountered:

ivirshup · 2021-03-10T03:11:32Z

Makes sense, thanks for the detailed report!

Would you like to make a bugfix PR for this?

jlause · 2021-03-10T14:39:22Z

done :)

gokceneraslan · 2021-03-11T16:55:27Z

Oh maaaan, that's cool. I had noticed that something was off about the batch argument indeed! Thanks so much!

adamgayoso · 2021-03-13T20:10:33Z

Thanks for catching this, though I'm not sure this is the cause of discrepancy between scanpy and seurat (#1733) and my original comments on the PR linked there.

* fixes scverse#1725 * removed unneeded mean_fulldataset/var_fulldataset variables * release note and tests * making variance dtype explicit in test * Minor cleanup -- ivirshup Co-authored-by: Isaac Virshup <[email protected]>

jlause added the Bug 🐛 label Mar 9, 2021

ivirshup added this to the 1.7.2 milestone Mar 10, 2021

jlause added a commit to jlause/scanpy that referenced this issue Mar 10, 2021

fixes scverse#1725

e6deabf

jlause mentioned this issue Mar 10, 2021

bugfix for #1725 #1732

Merged

ivirshup closed this as completed in 05792b1 Mar 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sc.pp.highly_variable_genes with flavor='seurat_v3' returns wrong means/variances when batch_key argument is used #1725

sc.pp.highly_variable_genes with flavor='seurat_v3' returns wrong means/variances when batch_key argument is used #1725

jlause commented Mar 9, 2021 •

edited

Loading

anndata 0.7.5
scanpy 1.7.1.dev2+g8c469411
sinfo 0.3.1

IPython 7.19.0
jupyter_client 6.1.11
jupyter_core 4.7.0
jupyterlab 3.0.3
notebook 6.1.6

Python 3.8.0 (default, Oct 28 2019, 16:14:01) [GCC 8.3.0]
Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.27
40 logical CPU cores, x86_64

ivirshup commented Mar 10, 2021

jlause commented Mar 10, 2021

gokceneraslan commented Mar 11, 2021

adamgayoso commented Mar 13, 2021

sc.pp.highly_variable_genes with flavor='seurat_v3' returns wrong means/variances when batch_key argument is used #1725

sc.pp.highly_variable_genes with flavor='seurat_v3' returns wrong means/variances when batch_key argument is used #1725

Comments

jlause commented Mar 9, 2021 • edited Loading

Code to reproduce:

Versions

anndata 0.7.5 scanpy 1.7.1.dev2+g8c469411 sinfo 0.3.1

IPython 7.19.0 jupyter_client 6.1.11 jupyter_core 4.7.0 jupyterlab 3.0.3 notebook 6.1.6

Python 3.8.0 (default, Oct 28 2019, 16:14:01) [GCC 8.3.0] Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.27 40 logical CPU cores, x86_64

ivirshup commented Mar 10, 2021

jlause commented Mar 10, 2021

gokceneraslan commented Mar 11, 2021

adamgayoso commented Mar 13, 2021

jlause commented Mar 9, 2021 •

edited

Loading

anndata 0.7.5
scanpy 1.7.1.dev2+g8c469411
sinfo 0.3.1

IPython 7.19.0
jupyter_client 6.1.11
jupyter_core 4.7.0
jupyterlab 3.0.3
notebook 6.1.6

Python 3.8.0 (default, Oct 28 2019, 16:14:01) [GCC 8.3.0]
Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.27
40 logical CPU cores, x86_64