[builder] schema 4.0 (#872)
* schema 4

* update dep pins

* AnnData version update allows for compat code cleanup

* fix bug in feature_length

* bump tiledbsoma dependency to latest

* bump schema version

* update census schema version

* more dependency updates

* update to use production REST API

* [builder] normalized layer improvements (#884)

* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays

* fix int32 overflow in sparse matrix code

* add check for tiledb issue 1969

* bump dependency versions

* work around SOMA bug temporarily

* pr feedback

* [builder] port to use enums in schema (#896)

* first pass at using enum types

* add better error logging for file size assertion

* add feature flag for dict schema fields

* update a few dependencies

* remove debugging print

* update comment

* bump compression level

* pr feedback

* fix typos in comments

* add schema_util tests and fix a bug found by those tests

* lint
Bruce Martin authored Dec 21, 2023
1 parent 2f2bfd7 commit a53e34b
Showing 18 changed files with 792 additions and 413 deletions.
33 changes: 26 additions & 7 deletions docs/cellxgene_census_schema.md
@@ -1,8 +1,8 @@
# CZ CELLxGENE Discover Census Schema

-**Version**: 1.2.0
+**Version**: 1.3.0

-**Last edited**: Sept, 2023.
+**Last edited**: December, 2023.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED" "MAY", and "OPTIONAL" in this document are to be interpreted as described in [BCP 14](https://tools.ietf.org/html/bcp14), [RFC2119](https://www.rfc-editor.org/rfc/rfc2119.txt), and [RFC8174](https://www.rfc-editor.org/rfc/rfc8174.txt) when, and only when, they appear in all capitals, as shown here.

@@ -339,7 +339,7 @@ An example of this `SOMADataFrame` is shown below:
<tbody>
<tr>
<td>census_schema_version</td>
-<td>1.2.0</td>
+<td>1.3.0</td>
</tr>
<tr>
<td>census_build_date</td>
@@ -381,10 +381,15 @@ All datasets used to build the Census MUST be included in a table modeled as a `
</tr>
</thead>
<tbody>
+<tr>
+<td>citation</td>
+<td>string</td>
+<td>As defined in the CELLxGENE schema.</td>
+</tr>
<tr>
<td>collection_id</td>
<td>string</td>
-<td rowspan="5">As defined in CELLxGENE Discover <a href="https://api.cellxgene.cziscience.com/curation/ui/">data schema</a> (see &quot;Schemas&quot; section for field definitions)".</td>
+<td rowspan="6">As defined in CELLxGENE Discover <a href="https://api.cellxgene.cziscience.com/curation/ui/">data schema</a> (see &quot;Schemas&quot; section for field definitions)".</td>
</tr>
<tr>
<td>collection_name</td>
@@ -719,7 +724,9 @@ Per the CELLxGENE dataset schema, [all RNA assays MUST include UMI or read count
This is an experimental data artifact - it may be removed at any time.

A library-sized normalized layer, containing a normalized variant of the count (raw) matrix.
-For a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
+For Smart-Seq assays, given a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
+as `normalized[i,j] = (X[i,j] / var[j].feature_length) / sum(X[i, ] / var.feature_length[j])`.
+For all other assays, for a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
as `normalized[i,j] = X[i,j] / sum(X[i, ])`.
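
The two definitions above can be illustrated with a minimal, dependency-free sketch. It is not the builder's implementation: the function name `library_size_normalize` is hypothetical, the matrix is a plain list of rows rather than a sparse array, and every `feature_length` is assumed to be strictly positive (a length of 0 would need separate handling, which this sketch does not attempt).

```python
from typing import List, Optional, Sequence


def library_size_normalize(
    X: Sequence[Sequence[float]],
    feature_length: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Row-wise library-size normalization (hypothetical helper).

    If feature_length is given (the Smart-Seq case), each count is first
    divided by the length of its feature; otherwise counts are used as-is.
    Each row is then divided by its own sum.
    """
    normalized: List[List[float]] = []
    for row in X:
        if feature_length is not None:
            # Smart-Seq case: scale each count by its gene length first.
            scaled = [x / length for x, length in zip(row, feature_length)]
        else:
            # All other assays: use the raw counts directly.
            scaled = [float(x) for x in row]
        total = sum(scaled)
        # An all-zero row stays all-zero rather than dividing by zero.
        normalized.append([v / total if total else 0.0 for v in scaled])
    return normalized


counts = [[2, 2], [1, 3]]
plain = library_size_normalize(counts)
smart_seq = library_size_normalize(counts, feature_length=[1, 2])
```

In both cases every non-empty row of the result sums to 1; the Smart-Seq variant simply down-weights counts from longer features before that row-wise scaling.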

#### Feature metadata – `census_obj["census_data"][organism].ms["RNA"].var` – `SOMADataFrame`
@@ -752,7 +759,7 @@ The following columns MUST be included:
<tr>
<td>feature_length</td>
<td>int</td>
-<td>Gene length in base pairs derived from the <a href="https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.1.0/schema.md#required-gene-annotations">gene reference files from the CELLxGENE dataset schema</a>.</td>
+<td>As defined in CELLxGENE dataset schema</a>.</td>
</tr>
<tr>
<td>nnz</td>
@@ -838,7 +845,7 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:
</tr>
<tr>
<td>assay_ontology_term_id</td>
-<td colspan="2" rowspan="17">As defined in CELLxGENE dataset schema</td>
+<td colspan="2" rowspan="19">As defined in CELLxGENE dataset schema</td>
</tr>
<tr>
<td>assay</td>
@@ -867,6 +874,9 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:
<tr>
<td>is_primary_data</td>
</tr>
+<tr>
+<td>observation_joinid</td>
+</tr>
<tr>
<td>self_reported_ethnicity_ontology_term_id</td>
</tr>
@@ -888,6 +898,9 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:
<tr>
<td>tissue</td>
</tr>
+<tr>
+<td>tissue_type</td>
+</tr>
<tr>
<td>nnz</td>
<td>int64</td>
@@ -918,6 +931,12 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:

## Changelog

+### Version 1.3.0
+
+* Update to require [CELLxGENE schema version 4.0.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/4.0.0/schema.md)
+* Adds `citation` to "Census table of CELLxGENE Discover datasets – `census_obj["census_info"]["datasets"]`"
+* Adds `observation_joinid` and `tissue_type` to `obs` dataframe
+
### Version 1.2.0

* Update to require [CELLxGENE schema version 3.1.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.1.0/schema.md)
24 changes: 12 additions & 12 deletions tools/cellxgene_census_builder/pyproject.toml
@@ -26,26 +26,26 @@ classifiers = [
"Programming Language :: Python :: 3.11",
]
dependencies= [
-"typing_extensions==4.8.0",
-"pyarrow==13.0.0",
-"pandas[performance]==2.0.3",
-"anndata==0.9",
+"typing_extensions==4.9.0",
+"pyarrow==14.0.1",
+"pandas[performance]==2.1.4",
+"anndata==0.10.3",
"numpy==1.23.5",
# IMPORTANT: consider TileDB format compat before advancing this version. It is important that
# IMPORTANT: the tiledbsoma version lag that used in cellxgene-census package.
-"tiledbsoma==1.4.4",
-"cellxgene-census==1.6.0",
-"scipy==1.10.1", # cellxgene-census==1.5.1 forces scipy<1.11
-"fsspec==2023.9.2",
-"s3fs==2023.9.2",
+"tiledbsoma==1.6.1",
+"cellxgene-census==1.9.1",
+"scipy==1.11.4",
+"fsspec==2023.12.2",
+"s3fs==2023.12.2",
"requests==2.31.0",
-"aiohttp==3.9.0",
+"aiohttp==3.9.1",
"Cython", # required by owlready2
"wheel", # required by owlready2
"owlready2==0.44",
-"gitpython==3.1.37",
+"gitpython==3.1.40",
"attrs==23.1.0",
-"psutil==5.9.5",
+"psutil==5.9.6",
"pyyaml==6.0.1",
"numba==0.56.4",
]

@@ -42,7 +42,7 @@ def open_anndata(
# These are schema versions this code is known to work with. This is a
# sanity check, which would be better implemented via a unit test at
# some point in the future.
-assert CXG_SCHEMA_VERSION in ["3.1.0", "3.0.0"]
+assert CXG_SCHEMA_VERSION in ["4.0.0"]

if h5ad.schema_version == "":
h5ad.schema_version = get_cellxgene_schema_version(ad)
@@ -80,6 +80,7 @@ def open_anndata(
# TODO - these should be looked up in the ontology
raw_var["feature_name"] = "unknown"
raw_var["feature_reference"] = "unknown"
+raw_var["feature_length"] = 0
var = pd.concat([ad.var, raw_var])
else:
var = ad.raw.var
@@ -96,7 +97,7 @@ def open_anndata(
not isinstance(X, (sparse.csr_matrix, sparse.csc_matrix)) or X.has_canonical_format
), f"Found H5AD with non-canonical X matrix in {path}"

-ad = anndata.AnnData(X=X if need_X else None, obs=ad.obs, var=var, raw=None, uns=ad.uns, dtype=np.float32)
+ad = anndata.AnnData(X=X if need_X else None, obs=ad.obs, var=var, raw=None, uns=ad.uns)
assert not need_X or ad.X.shape == (len(ad.obs), len(ad.var))

# TODO: In principle, we could look up missing feature_name, but for now, just assert they exist
@@ -154,7 +155,7 @@ def _filter(ad: anndata.AnnData, need_X: Optional[bool] = True) -> anndata.AnnDa
assert ad.raw is None

# This discards all other ancillary state, eg, obsm/varm/....
-ad = anndata.AnnData(X=X, obs=obs, var=var, dtype=np.float32)
+ad = anndata.AnnData(X=X, obs=obs, var=var)

assert (
X is None or isinstance(X, np.ndarray) or X.has_canonical_format

@@ -6,7 +6,7 @@
import pyarrow as pa
import tiledbsoma as soma

-from .globals import CENSUS_DATASETS_COLUMNS, CENSUS_DATASETS_NAME
+from .globals import CENSUS_DATASETS_NAME, CENSUS_DATASETS_TABLE_SPEC

T = TypeVar("T", bound="Dataset")

@@ -25,6 +25,7 @@ class Dataset:

# Optional - as reported by REST API
dataset_title: str = "" # CELLxGENE dataset title
+citation: str = "" # CELLxGENE citation
collection_id: str = "" # CELLxGENE collection id
collection_name: str = "" # CELLxGENE collection name
collection_doi: str = "" # CELLxGENE collection doi
@@ -69,14 +70,14 @@ def create_dataset_manifest(info_collection: soma.Collection, datasets: List[Dat
"""
logging.info("Creating dataset_manifest")
manifest_df = Dataset.to_dataframe(datasets)
-manifest_df = manifest_df[CENSUS_DATASETS_COLUMNS + ["soma_joinid"]]
+manifest_df = manifest_df[list(CENSUS_DATASETS_TABLE_SPEC.field_names())]
if len(manifest_df) == 0:
return

+schema = CENSUS_DATASETS_TABLE_SPEC.to_arrow_schema(manifest_df)
+
# write to a SOMA dataframe
with info_collection.add_new_dataframe(
-CENSUS_DATASETS_NAME,
-schema=pa.Schema.from_pandas(manifest_df, preserve_index=False),
-index_column_names=["soma_joinid"],
+CENSUS_DATASETS_NAME, schema=schema, index_column_names=["soma_joinid"]
) as manifest:
manifest.write(pa.Table.from_pandas(manifest_df, preserve_index=False))
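
The hunk above replaces a bare column list with a table spec that supplies both the column selection and the schema. The idea can be sketched without any dependencies as follows. The `FieldSpec` and `TableSpec` classes here are hypothetical stand-ins, not the builder's classes; the field list and its order are illustrative, and types are plain strings rather than Arrow types.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One column: its name and a type label (hypothetical)."""

    name: str
    type_name: str


@dataclass(frozen=True)
class TableSpec:
    """Ordered column definitions acting as a single source of truth (hypothetical)."""

    fields: Tuple[FieldSpec, ...]

    def field_names(self) -> List[str]:
        # Column names in the order the spec declares them.
        return [f.name for f in self.fields]

    def select(self, row: Dict[str, object]) -> Dict[str, object]:
        # Keep only the declared columns, in spec order; extras are dropped.
        return {name: row[name] for name in self.field_names()}


# Illustrative spec; the real table has more columns.
DATASETS_SPEC = TableSpec(
    (
        FieldSpec("soma_joinid", "int64"),
        FieldSpec("citation", "string"),
        FieldSpec("collection_id", "string"),
    )
)

row = {"collection_id": "abc", "extra": 1, "citation": "Doe et al.", "soma_joinid": 0}
selected = DATASETS_SPEC.select(row)
```

Keeping names and types in one object means the column selection and the schema cannot drift apart, which appears to be the motivation for the change.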