brain-bican · patrick-lloyd-ray · Sep 17, 2024
diff --git a/docs/schemas/Cell-Annotation-Schema/Aligned_taxonomy_schema.csv b/docs/schemas/Cell-Annotation-Schema/Aligned_taxonomy_schema.csv
@@ -0,0 +1,82 @@
+Proposed BICAN Field Name,BICAN UUID,Aliases,LinkML Class,Definition,Nullable,Permissible Values,Data Type,Data Example,Min Value,Max Value,Unit,Statistical Variable Type,Subsets,Version Date
+cell_id,,,,"Identifier corresponding to each individual cell. Included in the data and in every other location to refer to the data (e.g., metadata and annotations). In AnnData files, the ID corresponding to each individual cell is stored in the obs index.",,,string,,,,,,"data, assigned metadata, calculated metadata, tooling",
+feature_matrix_label,,,,ID of the associated feature matrix where the data is stored (if not included in this file). Used in BKP to connect a cell to its data file when the data is stored elsewhere.,,,,,,,,,"assigned metadata, calculated metadata, annotations",
+dataset_label,,,,Link between each cell and each dataset in BKP. Need clarification on how this differs from feature_matrix_label; for CAS this is a taxonomy-level variable in uns called dataset_url.,,,,,,,,,"assigned metadata, calculated metadata, annotations",
+[COLUMN_NAME]_color,,,,"Color vector for metadata/taxonomy values in format [COLUMN_NAME]_label. This is ONLY used for molgen-shiny plots, but because of this, some metadata files come with these and some don't and that could cause challenges. Should revisit how to store colors and how to deal with metadata in both formats. Should also agree on a standard for which way is preferred.",,,,,,,,,"assigned metadata, calculated metadata, annotations",
+[COLUMN_NAME]_id,,,,"Same as above, but in this case for the order of metadata values (e.g., the levels of a factor, or ascending order of a numeric)",,,,,,,,,"assigned metadata, calculated metadata, annotations",
+assay,,,,"In CELLxGENE these correspond to a human-readable modality along with the associated EFO ontology term. We often use the term modality in place of assay (e.g., 'Smart-seq2' corresponds to 'EFO:0008931', '10x 3' v3' corresponds to 'EFO:0009922'). This is called ""library method"" in BKP. Ideally we will agree on a term for this and it will be provided upstream from BICAN. Called Modality in taxonomy Google Sheet.",,,,,,,,,assigned metadata,
+assay_ontology_term_id,,,,The ontology ID of the assay term.,,,,,,,,,assigned metadata,
+suspension_type,,,,"Either ""cell"", ""nucleus"", or ""na"" in CELLxGENE. Called entity in the BKP. We should pick one to use.",,"cell, nucleus, na",value set,,,,,,assigned metadata,
+[batch_condition_columns],,,,"Zero or more vectors of metadata associated with batches. These are not required, but called out separately by cellxgene for analysis purposes.",,,,,,,,,assigned metadata,
+[additional uncontrolled metadata],,,,"Additional uncontrolled cell metadata. These are not required, but any additional columns are allowed by all h5ad formats.",,,,,,,,,"assigned metadata, calculated metadata",
+brain_region,,,,"Brain region(s) sampled. Called tissue_ontology_term_id in cellxgene; cell_set structures also defined below; called region_of_interest_label and anatomic_division_label in BKP. Also associated are acronyms, labels, etc. More generally, we need to arrive at a way of dealing with brain regions. Note that this slot in the Assigned metadata is meant to deal with cell-level assignments for brain region (e.g., dissection) and NOT cell set summarizations by brain region, which are included below.",,,,,,,,,assigned metadata,
+tissue,,,,"Along with ""tissue"" field, these correspond to UBERON terms for the 'brain region' fields that we have (e.g., 'brain' = 'UBERON_0000955'). In process: we need to discuss how to integrate Allen reference brain atlases for mouse and human.",,,,,,,,,assigned metadata,
+tissue_ontology_term_id,,,,"Along with ""tissue"" field, these correspond to UBERON terms for the 'brain region' fields that we have (e.g., 'brain' = 'UBERON_0000955'). In process: we need to discuss how to integrate Allen reference brain atlases for mouse and human.",,,,,,,,,assigned metadata,
+donor_id,,,,"Identifier for the unique individual, ideally from the specimen portal (or other upstream source). This is called donor_label in the BKP. Should converge on a standard term. More than one identifier may be needed, but ideally for the analysis only a single one is retained and stored here.",,,,,,,,,assigned metadata,
+species,,,,"Species sampled. This is split into two fields in CAP/cellxgene/BICAN: organism (e.g., Homo sapiens) and organism_ontology_term_id (e.g., 'NCBITaxon:10090'). For consistency, we should change species to organism and could write a function to automatically identify the ontology term (I think GeneOrthology already has one). Called Species name and Species ID in taxonomy Google Sheet.",,,,,,,,,assigned metadata,
+age,,,,"Currently a free text field for defining the age of the donor. In CELLxGENE this is recorded in development_stage_ontology_term_id and is HsapDv if human, MmusDv if mouse. I'm not sure what this means, but more generally, we should align with BICAN on how to deal with this value.",,,,,,,,,assigned metadata,
+sex,,,,"Placeholder for donor sex. Called sex_ontology_term_id (e.g., PATO:0000384/383 for male/female) in CELLxGENE and called ""donor_sex"" in BKP. We should align on a single term.",,,,,,,,,assigned metadata,
+donor_genotype,,,,"One (or sometimes more) column related to the genotype of the animal (for transgenic mice, in particular). Not used for humans and most NHP.",,,,,,,,,assigned metadata,
+self_reported_ethnicity_ontology_term_id,,,,Controversial field that is required for CELLxGENE but otherwise not used. HANCESTRO term if human and 'na' if non-human.,,,,,,,,,assigned metadata,
+disease,,,,A human-readable name for a disease.,,,,,,,,,assigned metadata,
+disease_ontology_term_id,,,,The associated MONDO ontology term (or PATO:0000461 for 'normal'). Used in CELLxGENE and ideally we can also adopt for SEA-AD and other use cases.,,,,,,,,,assigned metadata,
+cluster,,,,"This is the CRITICAL column used for cluster annotations. It is the baseline for the majority of cell_annotation columns discussed below. Sometimes called cluster_label : [""Annotations""] : There is also an additional cluster_alias column used in mouse whole brain data and for BKP that I'm not sure how to wrap in. It's also used for cirrocumulus. Some discussion might be needed on whether this is one or more columns, and which one is the source of truth. It's also worth noting that this is a prerequisite for annotations, so maybe it better fits in a different category (analysis?).",,,,,,,,,annotations,
+[additional uncontrolled metadata],,,,"Additional uncontrolled cell metadata. These are not required, but any additional columns are allowed by all h5ad formats.",,,,,,,,,annotations,
+[cellannotation_set]--parent_cell_set_accession,,,,"ID corresponding to the parent cell_set. If not needed for annotations, definitely needed for tooling.",,,,,,,,,tooling,
+gene,,,,"A vector of gene symbols. Broadly useful in the community for defining genes but occasionally problematic; called gene_symbol in BKP. More generally, some alignment is needed about whether these or the ensembl_id are used for the gene identifier column (CELLxGENE uses a very specific version of ensembl_id for the INDEX).",,,,,,,,,"data, analysis",
+ensembl_id,,,,A vector of corresponding Ensembl IDs for each gene symbol. This is required for disambiguation of gene symbols; called gene_identifier in BKP. This is optional for AIT (but maybe it shouldn't be?).,,,,,,,,,data,
+biotype,,,,"biotype from the gtf file (e.g., protein_coding); used in BKP and BICAN for filtering of genes (but optional elsewhere)",,,,,,,,,data,
+name,,,,Longer gene name from the gtf file; used in BKP (optional for now),,,,,,,,,data,
+[additional gene info],,,,"Optional uncontrolled gene info; could include gene length, GENCODE IDs, NCBI identifiers, etc.",,,,,,,,,data,
+marker_genes_[…],,,,"A set of logical vectors (T/F) indicating which genes are markers used to build dendrogram, or for other purposes. The [...] part of the name links to additional metadata in the uns. This needs to be UPDATED in AIT to allow multiple marker gene sets; markers currently stored differently in CAP.",,,,,,,,,"annotations, analysis",
+highly_variable_genes,,,,A logical vector (T/F) indicating which genes are highly variable. Used for correlation-based mapping in scrattch.mapping.,,,,,,,,,analysis,
+dataset_metadata,,,,"TBD information about the data set itself. A standard on this is not established (as far as I know?) but could include some combination of these pieces of information recorded for Annotations below: description, dataset_url, title, dataset_doi, author_list, author_name, author_contact, orcid, etc.",,,,,,,,,data,
+description,,,,,,,,,,,,,,
+dataset_url,,,,,,,,,,,,,,
+title,,,,,,,,,,,,,,
+dataset_doi,,,,,,,,,,,,,,
+author_list,,,,,,,,,,,,,,
+author_name,,,,,,,,,,,,,,
+author_contact,,,,,,,,,,,,,,
+orcid,,,,,,,,,,,,,,
+dend,,,,"A JSON-formatted dendrogram used for tree mapping. Created by scrattch.taxonomy if not provided. Sometimes used for taxonomy annotation, but we are moving away from it with larger taxonomies and so this may now make more sense in the ""analysis"" category.",,,,,,,,,"annotations, analysis",
+labelsets,,,,"CRITICAL extra component; Equivalent to Cluster annotation term set in BKP. This is saved as a data frame representation (or is a list of data frames needed?), with some information about each [cellannotation_set] set of columns (e.g., subclass, class, neurotransmitter, etc.). Specifically: ""name"", ""description"", and ""rank"" (0 most specific) and some information about provenance are needed for each labelset.",,,,,,,,,annotations,
+cell_set_relationships,,,,"NEW proposed mechanism for dealing with sibling relationships for things like gradients, trajectories, constellation diagrams, etc. This is stored as a data frame (table) of all relations with five columns: cell_set_accession1, cell_set_accession2, relation_label, value, direction. Could alternatively be stored as a JSON representation that unpacks into a dataframe.",,,,,,,,,annotations,
+filter,,,,"Indicator of which cells to use for a given child taxonomy (subset), saved as a list of vectors. Each entry in this list is named for the relevant ""mode"" and has TRUE/FALSE calls indicating whether a cell is filtered out (e.g., the ""standard"" taxonomy is all FALSE). This is critical for how child taxonomies are defined and implemented in scrattch.taxonomy but differs from how taxonomies are stored in all other schemas--some discussion may be needed.",,,,,,,,,annotations,
+title,,,,"Taxonomy name (e.g., ""AIT30""); called title in cellxgene, not sure about other schema. Called Taxonomy short name in taxonomy Google Sheet.",,,,,,,,,annotations,
+taxonomy_id,,,,"Taxonomy ID in CCN format (e.g., ""CCN030420240""); TBD how this is generated, but MUST be globally unique. Also used as part of PURL (I think). Called Taxonomy ID in taxonomy Google Sheet too.",,,,,,,,,annotations,
+description,,,,"Free text description of the taxonomy (or of the dataset on CAP). This is also something we are adding as a requirement for the BKP, and I think should be required for all taxonomies. Called Description in taxonomy Google Sheet.",,,,,,,,,annotations,
+taxonomy_citation,,,,"""|""-separated publication DOIs of the taxonomy (e.g., ""doi:10.1038/s41586-018-0654-5""). Called Publication in taxonomy Google Sheet.",,,,,,,,,annotations,
+marker_gene_metadata,,,,"Data frame of Marker genes x dims that includes metadata for marker gene sets in var above; NEW and required if marker_genes_[…] is provided. At minimum a name (matching above) and description are needed, but potentially other things (e.g., what is it for, with controlled vocabulary).",,,,,,,,,annotations,
+transferred_annotations_metadata,,,,"Data frame of info about each transferred annotation column: source_taxonomy, algorithm_name, comment; Still some work on the best way to code this, but it is important. Linked to data in var above. This is for taxonomy-level metadata. This is also already encoded in TDT--how?",,,,,,,,,annotations,
+taxonomyDir,,,,"Location of the h5ad file; we might be able to remove this, since it is redundant with dataset_url and/or matrix_file_id. Called Taxonomy file location in taxonomy Google Sheet.",,,,,,,,,annotations,
+dataset_url,,,,"PURL of taxonomy; Possibly a redundant field, but critical; also publication_url and cellannotation_url (unclear how different)",,,,,,,,,"annotations, tooling",
+matrix_file_id,,,,Like dataset_url; e.g. CellXGene_dataset:8e10f1c4-8e98-41e5-b65f-8cd89a887122; Note: needs to be extended to allow for more than one file and connected to feature_matrix_label in obs. We need this field!,,,,,,,,,"annotations, tooling",
+author_list,,,,"List of all collaborators, comma separated [First] [Last]; Useful in general, even though currently only required by CAP. Called Taxonomy Users in taxonomy Google Sheet.",,,,,,,,,"annotations, tooling",
+author_name,,,,"The primary author [First Name] [Last Name]; in CCN was called ""taxonomy_author""; In CCN also separated by cell_set with ""cell_set_alias_assignee""; Called Point person name in taxonomy Google Sheet.",,,,,,,,,annotations,
+author_contact,,,,Valid email address; Called Point person email in taxonomy Google Sheet.,,,,,,,,,annotations,
+orcid,,,,Valid ORCID; Called Point person ORCID in taxonomy Google Sheet.,,,,,,,,,annotations,
+annotation_source,,,,Additional metadata about annotation algorithm; Similar to taxonomy algorithm info stored for CCN,,,,,,,,,annotations,
+assigned_metadata_metadata,,,,TBD information about the assigned_metadata itself. This likely is not needed or should be renamed.,,,,,,,,,assigned metadata,
+batch_condition,,,,"List of obs fields that define “batches”; Used by CELLxGENE if provided, but otherwise not needed.",,,,,,,,,assigned metadata,
+calculated_metadata_metadata,,,,TBD information about the calculated_metadata itself. This likely is not needed or should be renamed.,,,,,,,,,calculated metadata,
+cell_annotation_schema,,,,Extended metadata about annotations and labelsets stored in JSON.,,,,,,,,,calculated metadata,
+QC_markers,,,,"Marker gene expression in on-target and off-target cell populations, useful for patchseq analysis. Also includes information about KL divergence calculations and associated QC calls. Defined by buildPatchseqTaxonomy.",,,,,,,,,analysis,
+filter,,,,"Indicator of which cells to use for a given child taxonomy (subset), as defined above.",,,,,,,,,analysis,
+mode,,,,"Taxonomy mode that determines which filter to use (e.g., that indicates which child taxonomy to map against). Several of the other analysis components of the uns have things saved with mode as the name in the h5ad file. See scrattch.mapping documentation. Mode is the Taxonomy short name in taxonomy Google Sheet for a child taxonomy with the Parent taxonomy listed as the taxonomyName.",,,,,,,,,analysis,
+clustersUse,,,,A vector of cluster names to use for taxonomy. We should be able to remove this,,,,,,,,,analysis,
+clusterInfo,,,,A data.frame of cluster information. We should be able to remove this,,,,,,,,,analysis,
+marker_gene_metadata,,,,"Metadata about any new marker gene lists added, if any. See above.",,,,,,,,,analysis,
+development_date,,,,Date of taxonomy development. Required for Google Sheet. Potentially not needed if we want to infer from taxonomy_id.,,,,,,,,,analysis,
+public,,,,logical flag indicating whether taxonomy should be public or private. Required for Google Sheet. Potentially not needed if we want to infer from PURL/GitHub somehow.,,,,,,,,,analysis,
+annotation_sheet,,,,Link to annotation sheet (ideally a TDT GitHub repo link for communal annotation). An optional slot in the Google sheet. I'm not sure if this is listed above somewhere.,,,,,,,,,analysis,
+purpose,,,,"Controlled vocabulary (currently ""General"" and/or ""Patch-seq""). Required for Google Sheet at the moment.",,,,,,,,,analysis,
+schema_version,,,,"cellxgene schema version (e.g., ""3.0.0"")",,,,,,,,,tooling,
+[...]_color,,,,RGB color vector for metadata [...]; required only for selecting colors in cirrocumulus. This may be the same as the [COLUMN_NAME]_color column above.,,,,,,,,,tooling,
+cellannotation_schema_version,,,,CAS schema version '[MAJOR].[MINOR].[PATCH]',,,,,,,,,tooling,
+cellannotation_timestamp,,,,"Timestamp when published: %yyyy-%mm-%dd %hh:%mm:%ss; Useful in general, even though currently only required by CAP; also publication_XXXX (unclear how different); This also could be the same as development_date above.",,,,,,,,,tooling,
+cellannotation_version,,,,CAP taxonomy annotation version; required by CAP; also publication_XXXX (unclear how different). I'm also not sure how this differs from the cellannotation_schema_version.,,,,,,,,,tooling,
+[additional information],,,,"Placeholder for several other (seemingly redundant) fields required by external tools (e.g., CAP, cellxgene) that I want to capture here. It may or may not make sense to spell them all out.",,,,,,,,,tooling,
+umap,,,,"2 (or more)-dimensional representation of cells in AIT. Must be of the form X_[...] for use with CELLxGENE. Only the first two dimensions are used for AIT and CELLxGENE, but 3 dimensions can be used for cirrocumulus.",,,,,,,,,analysis,
+pca,,,,Additional terms for embedding multi-dimensional principal components and latent spaces,,,,,,,,,analysis,
+scVI,,,,Additional terms for embedding multi-dimensional principal components and latent spaces,,,,,,,,,analysis,
diff --git a/docs/schemas/Cell-Annotation-Schema/Aligned_taxonomy_schema.xlsx b/docs/schemas/Cell-Annotation-Schema/Aligned_taxonomy_schema.xlsx