Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Execution Time for highly_variable_genes Function #1242

Open
ckrilow opened this issue Jul 12, 2024 · 2 comments
Open

Execution Time for highly_variable_genes Function #1242

ckrilow opened this issue Jul 12, 2024 · 2 comments

Comments

@ckrilow
Copy link

ckrilow commented Jul 12, 2024

Describe the bug

When executing from cellxgene_census.experimental.pp import highly_variable_genes on 3898 cells, across 60K genes, it takes 38 seconds, and I am not sure if this is expected.

To Reproduce

cell_filter = (
  "is_primary_data == True "
  "and tissue_general == 'lung' "
  "and cell_type == 'T cell' "
  "and disease == 'small cell lung carcinoma'"
 )

query = census["census_data"]["homo_sapiens"].axis_query(
  measurement_name="RNA",
  obs_query=tiledbsoma.AxisQuery(value_filter= cell_filter),
)

from cellxgene_census.experimental.pp import highly_variable_genes

hvg_df = highly_variable_genes(
    query,
    n_top_genes = 200,
    #batch_key = ["dataset_id"]
)
hvg_df.query("highly_variable == True")

Expected behavior

I would expect the result to come back in less time than 38seconds.

Environment

  • Machine: x86_64, Jupyternotebook
  • OS: Linux
  • Software versions:
    • Package Version

cellxgene-census cell-census/2023-07-25
Python 3.9.19

@ckrilow ckrilow added the bug Something isn't working label Jul 12, 2024
@ckrilow
Copy link
Author

ckrilow commented Jul 12, 2024

I am working on a better re-prex and will have more information posted shortly.

@ckrilow ckrilow changed the title Long Execution Time for highly_variable_genes Function Execution Time for highly_variable_genes Function Jul 12, 2024
@ivirshup
Copy link
Collaborator

ivirshup commented Jul 12, 2024

These timings seem accurate for me too, running on my laptop. Some results:

Setup:

In [1]: %%time
   ...: import cellxgene_census
   ...: import tiledbsoma
   ...: from cellxgene_census.experimental.pp import highly_variable_genes
CPU times: user 2.92 s, sys: 2.32 s, total: 5.24 s
Wall time: 2.59 s

In [2]: %%time
   ...: cell_filter = (
   ...:   "is_primary_data == True "
   ...:   "and tissue_general == 'lung' "
   ...:   "and cell_type == 'T cell' "
   ...:   "and disease == 'small cell lung carcinoma'"
   ...:  )
   ...: 
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 6.91 µs

In [3]: %%time
   ...: census = cellxgene_census.open_soma()
The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.
CPU times: user 144 ms, sys: 51.9 ms, total: 196 ms
Wall time: 1.36 s

In [4]: %%time
   ...: query = census["census_data"]["homo_sapiens"].axis_query(
   ...:   measurement_name="RNA",
   ...:   obs_query=tiledbsoma.AxisQuery(value_filter= cell_filter),
   ...: )
CPU times: user 131 ms, sys: 41.1 ms, total: 172 ms
Wall time: 1.66 s
In [5]: %%time
   ...: hvg_df = highly_variable_genes(
   ...:     query,
   ...:     n_top_genes = 200,
   ...: )
   ...: hvg_df.query("highly_variable == True")
CPU times: user 1min, sys: 14.8 s, total: 1min 15s
Wall time: 54.6 s
Out[5]: 
                means   variances  highly_variable_rank  variances_norm  highly_variable
soma_joinid                                                                             
179          0.015136    0.057508                 162.0        2.909841             True
295          0.084146    0.350629                 152.0        2.969717             True
...               ...         ...                   ...             ...              ...
32293        0.842996   36.706675                   3.0        9.840940             True
32713        0.319651    1.664284                 191.0        2.761724             True

[200 rows x 5 columns]

In [7]: %%time
   ...: adata = query.to_anndata(X_name="raw")
CPU times: user 30.5 s, sys: 6.08 s, total: 36.6 s
Wall time: 28.2 s

These times are cut by two thirds when I run this on AWS.

For comparison:

In [8]: import scanpy as sc
   ...: %time sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=200)
CPU times: user 279 ms, sys: 17 ms, total: 296 ms
Wall time: 299 ms

The implementation of highly variable genes here assumes that you've got a lot of data, so definitely isn't optimized for the small data use case. But it seems that about half the time of this computation is just pulling down data.

There's also I think a pretty good workaround in just using scanpy here, since there's fairly comprehensive testing that the results are the same.

@ivirshup ivirshup added performance and removed bug Something isn't working labels Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants