Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sc.pp.highly_variable_genes with flavor='seurat_v3' returns wrong means/variances when batch_key argument is used #1725

Closed
3 tasks done
jlause opened this issue Mar 9, 2021 · 4 comments
Labels
Milestone

Comments

@jlause
Copy link
Contributor

jlause commented Mar 9, 2021

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of scanpy.
  • (optional) I have confirmed this bug exists on the master branch of scanpy.

When working on PR #1715, I noticed a small bug when sc.pp.highly_variable() is run with flavor='seurat_v3' and the batch_key argument is used on a dataset with multiple batches:

The columns in the returned data frame means and variances do not give the correct gene means and gene variances across the whole dataset, but instead give the means and variances of the batch with the last index.

I think this because here means and variances are first computed within each batch:
https://github.com/theislab/scanpy/blob/a085333ead6ff8a0f64c25734060ffc048be56c0/scanpy/preprocessing/_highly_variable_genes.py#L72-L77
..and then not re-computed for the whole dataset before inserting them here:
https://github.com/theislab/scanpy/blob/a085333ead6ff8a0f64c25734060ffc048be56c0/scanpy/preprocessing/_highly_variable_genes.py#L136-L137

Should be an easy fix, I can also prepare it if you want :)

Code to reproduce:

import numpy as np
import scanpy as sc
import anndata 

import sys
sys.path.append("scanpy/preprocessing")
from _utils import _get_mean_var

np.random.seed(42)
adata = anndata.AnnData(np.random.randint(0,5,(100,100)))
adata.obs['batch'] = np.random.randint(0,5,(100))

#show that without batch_key, everything works fine
true_mean,true_var = _get_mean_var(adata.X)
result = sc.pp.highly_variable_genes(adata,n_top_genes=2,flavor='seurat_v3',inplace=False)
print('Results correct without batch_key?')
print(np.all(result['means']==true_mean))
print(np.all(result['variances']==true_var))

#show that it goes wrong with batch_key..
result_with_batchkey = sc.pp.highly_variable_genes(adata,n_top_genes=2,flavor='seurat_v3',inplace=False,batch_key='batch')
print('Results correct with batch_key?')
print(np.all(result_with_batchkey['means']==true_mean))
print(np.all(result_with_batchkey['variances']==true_var))

#..and that the wrong result comes from the last batch
adata_batch4 = adata[adata.obs['batch']==4,:].copy()
means_batch4,vars_batch4 = _get_mean_var(adata_batch4.X)
print('Wrong results come from last batch?')
print(np.all(result_with_batchkey['means']==means_batch4))
print(np.all(result_with_batchkey['variances']==vars_batch4))
Results correct without batch_key?
True
True
Results correct with batch_key?
False
False
Wrong results come from last batch?
True
True

/gpfs01/berens/user/jlause/libs/scanpy/scanpy/preprocessing/_highly_variable_genes.py:146: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version.  Use .loc with labels or .iloc with positions instead.
  df.loc[: int(n_top_genes), 'highly_variable'] = True
/gpfs01/berens/user/jlause/libs/scanpy/scanpy/preprocessing/_highly_variable_genes.py:146: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version.  Use .loc with labels or .iloc with positions instead.
  df.loc[: int(n_top_genes), 'highly_variable'] = True

Versions


anndata 0.7.5
scanpy 1.7.1.dev2+g8c469411
sinfo 0.3.1

PIL 8.1.0
anndata 0.7.5
anyio NA
attr 20.3.0
babel 2.9.0
backcall 0.2.0
certifi 2020.12.05
cffi 1.14.4
chardet 4.0.0
constants NA
cycler 0.10.0
cython_runtime NA
dateutil 2.8.1
decorator 4.4.2
get_version 2.1
google NA
h5py 2.10.0
highs_wrapper NA
idna 2.10
ipykernel 5.4.2
ipython_genutils 0.2.0
jedi 0.17.2
jinja2 2.11.2
joblib 1.0.0
json5 NA
jsonschema 3.2.0
jupyter_server 1.2.1
jupyterlab_server 2.1.1
kiwisolver 1.3.1
legacy_api_wrap 1.2
llvmlite 0.35.0
markupsafe 1.1.1
matplotlib 3.3.3
mpl_toolkits NA
natsort 7.1.0
nbclassic NA
nbformat 5.0.8
numba 0.52.0
numexpr 2.7.2
numpy 1.19.5
packaging 20.8
pandas 1.2.0
parso 0.7.1
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
prometheus_client NA
prompt_toolkit 3.0.10
ptyprocess 0.7.0
pvectorc NA
pygments 2.7.3
pyparsing 2.4.7
pyrsistent NA
pytz 2020.5
requests 2.25.1
scanpy 1.7.1.dev2+g8c469411
scipy 1.6.0
send2trash NA
sinfo 0.3.1
sitecustomize NA
six 1.15.0
sklearn 0.24.0
sniffio 1.2.0
storemagic NA
tables 3.6.1
tornado 6.1
traitlets 5.0.5
typing_extensions NA
urllib3 1.26.2
wcwidth 0.2.5
yaml 5.3.1
zmq 20.0.0

IPython 7.19.0
jupyter_client 6.1.11
jupyter_core 4.7.0
jupyterlab 3.0.3
notebook 6.1.6

Python 3.8.0 (default, Oct 28 2019, 16:14:01) [GCC 8.3.0]
Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.27
40 logical CPU cores, x86_64

Session information updated at 2021-03-09 19:18

@jlause jlause added the Bug 🐛 label Mar 9, 2021
@ivirshup ivirshup added this to the 1.7.2 milestone Mar 10, 2021
@ivirshup
Copy link
Member

Makes sense, thanks for the detailed report!

Would you like to make a bugfix PR for this?

jlause added a commit to jlause/scanpy that referenced this issue Mar 10, 2021
@jlause
Copy link
Contributor Author

jlause commented Mar 10, 2021

done :)

@gokceneraslan
Copy link
Collaborator

Oh maaaan, that's cool. I had noticed that something was off about the batch argument indeed! Thanks so much!

@adamgayoso
Copy link
Member

Thanks for catching this, though I'm not sure this is the cause of discrepancy between scanpy and seurat (#1733) and my original comments on the PR linked there.

ivirshup added a commit to ivirshup/scanpy that referenced this issue Mar 19, 2021
* fixes scverse#1725

* removed unneeded mean_fulldataset/var_fulldataset variables

* release note and tests

* making variance dtype explicit in test

* Minor cleanup -- ivirshup

Co-authored-by: Isaac Virshup <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants