You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of scanpy.
(optional) I have confirmed this bug exists on the master branch of scanpy.
When working on PR #1715, I noticed a small bug when sc.pp.highly_variable() is run with flavor='seurat_v3' and the batch_key argument is used on a dataset with multiple batches:
The columns in the returned data frame means and variances do not give the correct gene means and gene variances across the whole dataset, but instead give the means and variances of the batch with the last index.
Should be an easy fix, I can also prepare it if you want :)
Code to reproduce:
importnumpyasnpimportscanpyasscimportanndataimportsyssys.path.append("scanpy/preprocessing")
from_utilsimport_get_mean_varnp.random.seed(42)
adata=anndata.AnnData(np.random.randint(0,5,(100,100)))
adata.obs['batch'] =np.random.randint(0,5,(100))
#show that without batch_key, everything works finetrue_mean,true_var=_get_mean_var(adata.X)
result=sc.pp.highly_variable_genes(adata,n_top_genes=2,flavor='seurat_v3',inplace=False)
print('Results correct without batch_key?')
print(np.all(result['means']==true_mean))
print(np.all(result['variances']==true_var))
#show that it goes wrong with batch_key..result_with_batchkey=sc.pp.highly_variable_genes(adata,n_top_genes=2,flavor='seurat_v3',inplace=False,batch_key='batch')
print('Results correct with batch_key?')
print(np.all(result_with_batchkey['means']==true_mean))
print(np.all(result_with_batchkey['variances']==true_var))
#..and that the wrong result comes from the last batchadata_batch4=adata[adata.obs['batch']==4,:].copy()
means_batch4,vars_batch4=_get_mean_var(adata_batch4.X)
print('Wrong results come from last batch?')
print(np.all(result_with_batchkey['means']==means_batch4))
print(np.all(result_with_batchkey['variances']==vars_batch4))
Results correct without batch_key?
True
True
Results correct with batch_key?
False
False
Wrong results come from last batch?
True
True
/gpfs01/berens/user/jlause/libs/scanpy/scanpy/preprocessing/_highly_variable_genes.py:146: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version. Use .loc with labels or .iloc with positions instead.
df.loc[: int(n_top_genes), 'highly_variable'] = True
/gpfs01/berens/user/jlause/libs/scanpy/scanpy/preprocessing/_highly_variable_genes.py:146: FutureWarning: Slicing a positional slice with .loc is not supported, and will raise TypeError in a future version. Use .loc with labels or .iloc with positions instead.
df.loc[: int(n_top_genes), 'highly_variable'] = True
Thanks for catching this, though I'm not sure this is the cause of discrepancy between scanpy and seurat (#1733) and my original comments on the PR linked there.
* fixesscverse#1725
* removed unneeded mean_fulldataset/var_fulldataset variables
* release note and tests
* making variance dtype explicit in test
* Minor cleanup -- ivirshup
Co-authored-by: Isaac Virshup <[email protected]>
When working on PR #1715, I noticed a small bug when
sc.pp.highly_variable()
is run withflavor='seurat_v3'
and thebatch_key
argument is used on a dataset with multiple batches:The columns in the returned data frame
means
andvariances
do not give the correct gene means and gene variances across the whole dataset, but instead give the means and variances of the batch with the last index.I think this because here means and variances are first computed within each batch:
https://github.com/theislab/scanpy/blob/a085333ead6ff8a0f64c25734060ffc048be56c0/scanpy/preprocessing/_highly_variable_genes.py#L72-L77
..and then not re-computed for the whole dataset before inserting them here:
https://github.com/theislab/scanpy/blob/a085333ead6ff8a0f64c25734060ffc048be56c0/scanpy/preprocessing/_highly_variable_genes.py#L136-L137
Should be an easy fix, I can also prepare it if you want :)
Code to reproduce:
Versions
anndata 0.7.5
scanpy 1.7.1.dev2+g8c469411
sinfo 0.3.1
PIL 8.1.0
anndata 0.7.5
anyio NA
attr 20.3.0
babel 2.9.0
backcall 0.2.0
certifi 2020.12.05
cffi 1.14.4
chardet 4.0.0
constants NA
cycler 0.10.0
cython_runtime NA
dateutil 2.8.1
decorator 4.4.2
get_version 2.1
google NA
h5py 2.10.0
highs_wrapper NA
idna 2.10
ipykernel 5.4.2
ipython_genutils 0.2.0
jedi 0.17.2
jinja2 2.11.2
joblib 1.0.0
json5 NA
jsonschema 3.2.0
jupyter_server 1.2.1
jupyterlab_server 2.1.1
kiwisolver 1.3.1
legacy_api_wrap 1.2
llvmlite 0.35.0
markupsafe 1.1.1
matplotlib 3.3.3
mpl_toolkits NA
natsort 7.1.0
nbclassic NA
nbformat 5.0.8
numba 0.52.0
numexpr 2.7.2
numpy 1.19.5
packaging 20.8
pandas 1.2.0
parso 0.7.1
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
prometheus_client NA
prompt_toolkit 3.0.10
ptyprocess 0.7.0
pvectorc NA
pygments 2.7.3
pyparsing 2.4.7
pyrsistent NA
pytz 2020.5
requests 2.25.1
scanpy 1.7.1.dev2+g8c469411
scipy 1.6.0
send2trash NA
sinfo 0.3.1
sitecustomize NA
six 1.15.0
sklearn 0.24.0
sniffio 1.2.0
storemagic NA
tables 3.6.1
tornado 6.1
traitlets 5.0.5
typing_extensions NA
urllib3 1.26.2
wcwidth 0.2.5
yaml 5.3.1
zmq 20.0.0
IPython 7.19.0
jupyter_client 6.1.11
jupyter_core 4.7.0
jupyterlab 3.0.3
notebook 6.1.6
Python 3.8.0 (default, Oct 28 2019, 16:14:01) [GCC 8.3.0]
Linux-3.10.0-957.el7.x86_64-x86_64-with-glibc2.27
40 logical CPU cores, x86_64
Session information updated at 2021-03-09 19:18
The text was updated successfully, but these errors were encountered: