-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Execution Time for highly_variable_genes Function #1242
Comments
I am working on a better re-prex and will have more information posted shortly. |
These timings seem accurate for me too, running on my laptop. Some results: Setup: In [1]: %%time
...: import cellxgene_census
...: import tiledbsoma
...: from cellxgene_census.experimental.pp import highly_variable_genes
CPU times: user 2.92 s, sys: 2.32 s, total: 5.24 s
Wall time: 2.59 s
In [2]: %%time
...: cell_filter = (
...: "is_primary_data == True "
...: "and tissue_general == 'lung' "
...: "and cell_type == 'T cell' "
...: "and disease == 'small cell lung carcinoma'"
...: )
...:
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 6.91 µs
In [3]: %%time
...: census = cellxgene_census.open_soma()
The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.
CPU times: user 144 ms, sys: 51.9 ms, total: 196 ms
Wall time: 1.36 s
In [4]: %%time
...: query = census["census_data"]["homo_sapiens"].axis_query(
...: measurement_name="RNA",
...: obs_query=tiledbsoma.AxisQuery(value_filter= cell_filter),
...: )
CPU times: user 131 ms, sys: 41.1 ms, total: 172 ms
Wall time: 1.66 s In [5]: %%time
...: hvg_df = highly_variable_genes(
...: query,
...: n_top_genes = 200,
...: )
...: hvg_df.query("highly_variable == True")
CPU times: user 1min, sys: 14.8 s, total: 1min 15s
Wall time: 54.6 s
Out[5]:
means variances highly_variable_rank variances_norm highly_variable
soma_joinid
179 0.015136 0.057508 162.0 2.909841 True
295 0.084146 0.350629 152.0 2.969717 True
... ... ... ... ... ...
32293 0.842996 36.706675 3.0 9.840940 True
32713 0.319651 1.664284 191.0 2.761724 True
[200 rows x 5 columns]
In [7]: %%time
...: adata = query.to_anndata(X_name="raw")
CPU times: user 30.5 s, sys: 6.08 s, total: 36.6 s
Wall time: 28.2 s These times are cut by two thirds when I run this on AWS. For comparison: In [8]: import scanpy as sc
...: %time sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=200)
CPU times: user 279 ms, sys: 17 ms, total: 296 ms
Wall time: 299 ms The implementation of highly variable genes here assumes that you've got a lot of data, so definitely isn't optimized for the small data use case. But it seems that about half the time of this computation is just pulling down data. There's also I think a pretty good workaround in just using scanpy here, since there's fairly comprehensive testing that the results are the same. |
Describe the bug
When executing
from cellxgene_census.experimental.pp import highly_variable_genes
on 3898 cells, across 60K genes, it takes 38 seconds, and I am not sure if this is expected.To Reproduce
Expected behavior
I would expect the result to come back in less time than 38seconds.
Environment
cellxgene-census cell-census/2023-07-25
Python 3.9.19
The text was updated successfully, but these errors were encountered: