[builder] schema 4.0 (#872)

* schema 4
* update dep pins
* AnnData version update allows for compat code cleanup
* fix bug in feature_length
* bump tiledbsoma dependency to latest
* bump schema version
* update census schema version
* more dependency updates
* update to use production REST API
* [builder] normalized layer improvements (#884)
* improve normalized layer floating point precision, and correct normalized calc for smart-seq assays
* fix int32 overflow in sparse matrix code
* add check for tiledb issue 1969
* bump dependency versions
* work around SOMA bug temporarily
* pr feedback
* [builder] port to use enums in schema (#896)
* first pass at using enum types
* add better error logging for file size assertion
* add feature flag for dict schema fields
* update a few dependencies
* remove debugging print
* update comment
* bump compression level
* pr feedback
* fix typos in comments
* add schema_util tests and fix a bug found by those tests
* lint
chanzuckerberg · Dec 21, 2023 · a53e34b
1 parent 2f2bfd7

Showing 18 changed files with 792 additions and 413 deletions.
diff --git a/docs/cellxgene_census_schema.md b/docs/cellxgene_census_schema.md
@@ -1,8 +1,8 @@
 # CZ CELLxGENE Discover Census Schema
 
-**Version**: 1.2.0
+**Version**: 1.3.0
 
-**Last edited**: Sept, 2023.
+**Last edited**: December, 2023.
 
 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED" "MAY", and "OPTIONAL" in this document are to be interpreted as described in [BCP 14](https://tools.ietf.org/html/bcp14), [RFC2119](https://www.rfc-editor.org/rfc/rfc2119.txt), and [RFC8174](https://www.rfc-editor.org/rfc/rfc8174.txt) when, and only when, they appear in all capitals, as shown here.
 
@@ -339,7 +339,7 @@ An example of this `SOMADataFrame` is shown below:
 <tbody>
   <tr>
     <td>census_schema_version</td>
-    <td>1.2.0</td>
+    <td>1.3.0</td>
   </tr>
   <tr>
     <td>census_build_date</td>
@@ -381,10 +381,15 @@ All datasets used to build the Census MUST be included in a table modeled as a `
   </tr>
 </thead>
 <tbody>
+  <tr>
+    <td>citation</td>
+    <td>string</td>
+    <td>As defined in the CELLxGENE schema.</td>
+  </tr>
   <tr>
     <td>collection_id</td>
     <td>string</td>
-    <td rowspan="5">As defined in CELLxGENE Discover <a href="https://api.cellxgene.cziscience.com/curation/ui/">data schema</a> (see &quot;Schemas&quot; section for field definitions)".</td>
+    <td rowspan="6">As defined in CELLxGENE Discover <a href="https://api.cellxgene.cziscience.com/curation/ui/">data schema</a> (see &quot;Schemas&quot; section for field definitions)".</td>
   </tr>
   <tr>
     <td>collection_name</td>
@@ -719,7 +724,9 @@ Per the CELLxGENE dataset schema, [all RNA assays MUST include UMI or read count
 This is an experimental data artifact - it may be removed at any time.
 
 A library-sized normalized layer, containing a normalized variant of the count (raw) matrix.
-For a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
+For Smart-Seq assays, given a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
+as `normalized[i,j] = (X[i,j] / var[j].feature_length) / sum(X[i, ] / var.feature_length[j])`.
+For all other assays, for a value `X[i,j]` in the counts (raw) matrix, library-size normalized values are defined
 as `normalized[i,j] = X[i,j] / sum(X[i, ])`.
 
 #### Feature metadata – `census_obj["census_data"][organism].ms["RNA"].var` – `SOMADataFrame`
@@ -752,7 +759,7 @@ The following columns MUST be included:
   <tr>
     <td>feature_length</td>
     <td>int</td>
-    <td>Gene length in base pairs derived from the <a href="https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.1.0/schema.md#required-gene-annotations">gene reference files from the CELLxGENE dataset schema</a>.</td>
+    <td>As defined in CELLxGENE dataset schema</a>.</td>
   </tr>
   <tr>
     <td>nnz</td>
@@ -838,7 +845,7 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:
   </tr>
   <tr>
     <td>assay_ontology_term_id</td>
-    <td colspan="2" rowspan="17">As defined in CELLxGENE dataset schema</td>
+    <td colspan="2" rowspan="19">As defined in CELLxGENE dataset schema</td>
   </tr>
   <tr>
     <td>assay</td>
@@ -867,6 +874,9 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:
   <tr>
     <td>is_primary_data</td>
   </tr>
+  <tr>
+    <td>observation_joinid</td>
+  </tr>
   <tr>
     <td>self_reported_ethnicity_ontology_term_id</td>
   </tr>
@@ -888,6 +898,9 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:
   <tr>
     <td>tissue</td>
   </tr>
+  <tr>
+    <td>tissue_type</td>
+  </tr>
   <tr>
     <td>nnz</td>
     <td>int64</td>
@@ -918,6 +931,12 @@ Cell metadata MUST be encoded as a `SOMADataFrame` with the following columns:
 
 ## Changelog
 
+### Version 1.3.0
+
+* Update to require [CELLxGENE schema version 4.0.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/4.0.0/schema.md)
+* Adds `citation` to "Census table of CELLxGENE Discover datasets – `census_obj["census_info"]["datasets"]`"
+* Adds `observation_joinid` and `tissue_type` to `obs` dataframe
+
 ### Version 1.2.0
 
 * Update to require [CELLxGENE schema version 3.1.0](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.1.0/schema.md)
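
The normalized-layer definition changed in the schema hunk above can be illustrated with a minimal sketch. It is not the builder's implementation: the function name `library_size_normalize` is hypothetical, the matrix is a small dense list of rows rather than a sparse SOMA array, and the Smart-Seq denominator is read as the sum over all genes `k` of `X[i,k] / feature_length[k]`. It also assumes every `feature_length` is positive and that an all-zero row stays all zeros.

```python
from typing import List, Optional, Sequence


def library_size_normalize(
    X: Sequence[Sequence[float]],
    feature_length: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Row-normalize a dense counts matrix (illustrative sketch only).

    If feature_length is given (the Smart-Seq case), each count is first
    divided by its gene's length; otherwise the raw counts are used.
    Each row is then divided by the sum of its (scaled) values.
    """
    normalized = []
    for row in X:
        if feature_length is not None:
            # Assumption: all lengths are > 0, so this division is safe.
            scaled = [x / length for x, length in zip(row, feature_length)]
        else:
            scaled = [float(x) for x in row]
        total = sum(scaled)
        # Assumption: a row with no counts is left as zeros.
        normalized.append([s / total if total else 0.0 for s in scaled])
    return normalized


# Non-Smart-Seq case: X[i,j] / sum(X[i, ])
plain = library_size_normalize([[1, 3]])

# Smart-Seq case: counts are scaled by gene length before normalizing
smart_seq = library_size_normalize([[1, 3]], feature_length=[1, 3])
```

In both cases each non-empty row of the result sums to 1; the length scaling only changes how that total is shared among genes.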

diff --git a/tools/cellxgene_census_builder/pyproject.toml b/tools/cellxgene_census_builder/pyproject.toml
@@ -26,26 +26,26 @@ classifiers = [
     "Programming Language :: Python :: 3.11",
 ]
 dependencies= [
-    "typing_extensions==4.8.0",
-    "pyarrow==13.0.0",
-    "pandas[performance]==2.0.3",
-    "anndata==0.9",
+    "typing_extensions==4.9.0",
+    "pyarrow==14.0.1",
+    "pandas[performance]==2.1.4",
+    "anndata==0.10.3",
     "numpy==1.23.5",
     # IMPORTANT: consider TileDB format compat before advancing this version. It is important that
     # IMPORTANT: the tiledbsoma version lag that used in cellxgene-census package.
-    "tiledbsoma==1.4.4",
-    "cellxgene-census==1.6.0",
-    "scipy==1.10.1",  # cellxgene-census==1.5.1 forces scipy<1.11
-    "fsspec==2023.9.2",
-    "s3fs==2023.9.2",
+    "tiledbsoma==1.6.1",
+    "cellxgene-census==1.9.1",
+    "scipy==1.11.4",
+    "fsspec==2023.12.2",
+    "s3fs==2023.12.2",
     "requests==2.31.0",
-    "aiohttp==3.9.0",
+    "aiohttp==3.9.1",
     "Cython", # required by owlready2
     "wheel",  # required by owlready2
     "owlready2==0.44",
-    "gitpython==3.1.37",
+    "gitpython==3.1.40",
     "attrs==23.1.0",
-    "psutil==5.9.5",
+    "psutil==5.9.6",
     "pyyaml==6.0.1",
     "numba==0.56.4",
 ]

diff --git a/tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/anndata.py b/tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/anndata.py
@@ -42,7 +42,7 @@ def open_anndata(
         # These are schema versions this code is known to work with. This is a
         # sanity check, which would be better implemented via a unit test at
         # some point in the future.
-        assert CXG_SCHEMA_VERSION in ["3.1.0", "3.0.0"]
+        assert CXG_SCHEMA_VERSION in ["4.0.0"]
 
         if h5ad.schema_version == "":
             h5ad.schema_version = get_cellxgene_schema_version(ad)
@@ -80,6 +80,7 @@ def open_anndata(
                 # TODO - these should be looked up in the ontology
                 raw_var["feature_name"] = "unknown"
                 raw_var["feature_reference"] = "unknown"
+                raw_var["feature_length"] = 0
                 var = pd.concat([ad.var, raw_var])
             else:
                 var = ad.raw.var
@@ -96,7 +97,7 @@ def open_anndata(
             not isinstance(X, (sparse.csr_matrix, sparse.csc_matrix)) or X.has_canonical_format
         ), f"Found H5AD with non-canonical X matrix in {path}"
 
-        ad = anndata.AnnData(X=X if need_X else None, obs=ad.obs, var=var, raw=None, uns=ad.uns, dtype=np.float32)
+        ad = anndata.AnnData(X=X if need_X else None, obs=ad.obs, var=var, raw=None, uns=ad.uns)
         assert not need_X or ad.X.shape == (len(ad.obs), len(ad.var))
 
         # TODO: In principle, we could look up missing feature_name, but for now, just assert they exist
@@ -154,7 +155,7 @@ def _filter(ad: anndata.AnnData, need_X: Optional[bool] = True) -> anndata.AnnDa
         assert ad.raw is None
 
         # This discards all other ancillary state, eg, obsm/varm/....
-        ad = anndata.AnnData(X=X, obs=obs, var=var, dtype=np.float32)
+        ad = anndata.AnnData(X=X, obs=obs, var=var)
 
         assert (
             X is None or isinstance(X, np.ndarray) or X.has_canonical_format

diff --git a/tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/datasets.py b/tools/cellxgene_census_builder/src/cellxgene_census_builder/build_soma/datasets.py
@@ -6,7 +6,7 @@
 import pyarrow as pa
 import tiledbsoma as soma
 
-from .globals import CENSUS_DATASETS_COLUMNS, CENSUS_DATASETS_NAME
+from .globals import CENSUS_DATASETS_NAME, CENSUS_DATASETS_TABLE_SPEC
 
 T = TypeVar("T", bound="Dataset")
 
@@ -25,6 +25,7 @@ class Dataset:
 
     # Optional - as reported by REST API
     dataset_title: str = ""  # CELLxGENE dataset title
+    citation: str = ""  # CELLxGENE citation
     collection_id: str = ""  # CELLxGENE collection id
     collection_name: str = ""  # CELLxGENE collection name
     collection_doi: str = ""  # CELLxGENE collection doi
@@ -69,14 +70,14 @@ def create_dataset_manifest(info_collection: soma.Collection, datasets: List[Dat
     """
     logging.info("Creating dataset_manifest")
     manifest_df = Dataset.to_dataframe(datasets)
-    manifest_df = manifest_df[CENSUS_DATASETS_COLUMNS + ["soma_joinid"]]
+    manifest_df = manifest_df[list(CENSUS_DATASETS_TABLE_SPEC.field_names())]
     if len(manifest_df) == 0:
         return
 
+    schema = CENSUS_DATASETS_TABLE_SPEC.to_arrow_schema(manifest_df)
+
     # write to a SOMA dataframe
     with info_collection.add_new_dataframe(
-        CENSUS_DATASETS_NAME,
-        schema=pa.Schema.from_pandas(manifest_df, preserve_index=False),
-        index_column_names=["soma_joinid"],
+        CENSUS_DATASETS_NAME, schema=schema, index_column_names=["soma_joinid"]
     ) as manifest:
         manifest.write(pa.Table.from_pandas(manifest_df, preserve_index=False))