Execution Time for highly_variable_genes Function #1242

ckrilow · 2024-07-12T15:10:36Z

Describe the bug

Calling highly_variable_genes (imported from cellxgene_census.experimental.pp) on 3,898 cells across roughly 60K genes takes 38 seconds. I am not sure whether this is expected.

To Reproduce

import cellxgene_census
import tiledbsoma
from cellxgene_census.experimental.pp import highly_variable_genes

# Assumed setup: the report does not show how `census` was opened.
census = cellxgene_census.open_soma()

cell_filter = (
    "is_primary_data == True "
    "and tissue_general == 'lung' "
    "and cell_type == 'T cell' "
    "and disease == 'small cell lung carcinoma'"
)

query = census["census_data"]["homo_sapiens"].axis_query(
    measurement_name="RNA",
    obs_query=tiledbsoma.AxisQuery(value_filter=cell_filter),
)

hvg_df = highly_variable_genes(
    query,
    n_top_genes=200,
    # batch_key=["dataset_id"]
)
hvg_df.query("highly_variable == True")

Expected behavior

I would expect the result to come back in less than 38 seconds.

Environment

Machine: x86_64, Jupyter notebook
OS: Linux
Software versions:
- cellxgene-census: cell-census/2023-07-25
- Python: 3.9.19


ckrilow · 2024-07-12T17:13:24Z

I am working on a better reprex and will post more information shortly.

ivirshup · 2024-07-12T17:35:09Z

These timings seem accurate for me too, running on my laptop. Some results:

Setup:

In [1]: %%time
   ...: import cellxgene_census
   ...: import tiledbsoma
   ...: from cellxgene_census.experimental.pp import highly_variable_genes
CPU times: user 2.92 s, sys: 2.32 s, total: 5.24 s
Wall time: 2.59 s

In [2]: %%time
   ...: cell_filter = (
   ...:   "is_primary_data == True "
   ...:   "and tissue_general == 'lung' "
   ...:   "and cell_type == 'T cell' "
   ...:   "and disease == 'small cell lung carcinoma'"
   ...:  )
   ...: 
CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 6.91 µs

In [3]: %%time
   ...: census = cellxgene_census.open_soma()
The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.
CPU times: user 144 ms, sys: 51.9 ms, total: 196 ms
Wall time: 1.36 s

In [4]: %%time
   ...: query = census["census_data"]["homo_sapiens"].axis_query(
   ...:   measurement_name="RNA",
   ...:   obs_query=tiledbsoma.AxisQuery(value_filter= cell_filter),
   ...: )
CPU times: user 131 ms, sys: 41.1 ms, total: 172 ms
Wall time: 1.66 s

In [5]: %%time
   ...: hvg_df = highly_variable_genes(
   ...:     query,
   ...:     n_top_genes = 200,
   ...: )
   ...: hvg_df.query("highly_variable == True")
CPU times: user 1min, sys: 14.8 s, total: 1min 15s
Wall time: 54.6 s
Out[5]: 
                means   variances  highly_variable_rank  variances_norm  highly_variable
soma_joinid                                                                             
179          0.015136    0.057508                 162.0        2.909841             True
295          0.084146    0.350629                 152.0        2.969717             True
...               ...         ...                   ...             ...              ...
32293        0.842996   36.706675                   3.0        9.840940             True
32713        0.319651    1.664284                 191.0        2.761724             True

[200 rows x 5 columns]

In [7]: %%time
   ...: adata = query.to_anndata(X_name="raw")
CPU times: user 30.5 s, sys: 6.08 s, total: 36.6 s
Wall time: 28.2 s

These times are cut by two thirds when I run this on AWS.

For comparison:

In [8]: import scanpy as sc
   ...: %time sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=200)
CPU times: user 279 ms, sys: 17 ms, total: 296 ms
Wall time: 299 ms
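
To compare these numbers across machines outside IPython, something like the following could stand in for %%time. It is only a sketch using the standard library; `timed` and `timings` are hypothetical names, not part of cellxgene_census or scanpy. CPU time is summed over all threads of the process, which is why it can exceed wall time in the outputs above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Record wall-clock and process CPU time for the enclosed block."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + sys CPU time, all threads
    try:
        yield
    finally:
        wall = time.perf_counter() - wall_start
        cpu = time.process_time() - cpu_start
        if results is not None:
            results[label] = {"wall_s": wall, "cpu_s": cpu}
        print(f"{label}: CPU {cpu:.2f} s, wall {wall:.2f} s")


timings = {}
with timed("sleep", timings):
    # Waiting (for example on the network) adds wall time but almost no CPU time.
    time.sleep(0.05)
```

Wrapping the highly_variable_genes call and the to_anndata call in separate `timed(...)` blocks would give the same two figures per step as the session above.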

The implementation of highly variable genes here assumes that you have a lot of data, so it definitely isn't optimized for the small-data use case. But it seems that about half the time of this computation is spent just pulling down data.
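
The "about half" estimate can be checked from the wall times in the session above. This sketch assumes the to_anndata wall time is a fair proxy for the data-fetch cost inside highly_variable_genes, which the timings suggest but do not prove.

```python
# Wall times reported in the session above, in seconds.
hvg_wall_s = 54.6           # census highly_variable_genes, In [5]
to_anndata_wall_s = 28.2    # query.to_anndata(X_name="raw"), In [7]
scanpy_hvg_wall_s = 0.299   # sc.pp.highly_variable_genes, In [8]

# Share of the census HVG run that fetching alone would account for.
fetch_fraction = to_anndata_wall_s / hvg_wall_s

# Total cost of fetching into memory and then running scanpy.
workaround_total_s = to_anndata_wall_s + scanpy_hvg_wall_s

print(f"fetch share of census HVG time: {fetch_fraction:.0%}")  # 52%
print(f"to_anndata + scanpy: {workaround_total_s:.1f} s")       # 28.5 s
```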

I think there's also a pretty good workaround in just using scanpy here, since there's fairly comprehensive testing that the results are the same.
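
A minimal sketch of that workaround, combining the two calls timed above into one helper. The function name `hvg_via_scanpy` is hypothetical. It assumes the whole query fits in memory, and scanpy's flavor="seurat_v3" additionally requires scikit-misc to be installed.

```python
def hvg_via_scanpy(query, n_top_genes=200):
    """Fetch the query into memory, then rank genes with scanpy.

    `query` is an axis query like the one built in the report above.
    Suitable only when the selected cells fit in memory.
    """
    # Imported lazily so that defining the helper does not require scanpy.
    import scanpy as sc

    # Raw counts, which the seurat_v3 flavor expects.
    adata = query.to_anndata(X_name="raw")

    # Annotates adata.var in place (highly_variable, means, variances, ...).
    sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=n_top_genes)

    # Keep only the rows flagged as highly variable.
    return adata.var[adata.var["highly_variable"]]
```

Calling `hvg_via_scanpy(query)` should return one row per selected gene. For queries that are too large to fetch into memory, the census implementation remains the appropriate choice.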

ckrilow added the bug Something isn't working label Jul 12, 2024

ckrilow changed the title ~~Long Execution Time for highly_variable_genes Function~~ Execution Time for highly_variable_genes Function Jul 12, 2024

ivirshup added performance and removed bug Something isn't working labels Jul 12, 2024
