diff --git a/docs/schemas/Cell-Annotation-Schema/Aligned_taxonomy_schema.csv b/docs/schemas/Cell-Annotation-Schema/Aligned_taxonomy_schema.csv new file mode 100644 index 0000000..32d06c4 --- /dev/null +++ b/docs/schemas/Cell-Annotation-Schema/Aligned_taxonomy_schema.csv @@ -0,0 +1,82 @@ +Proposed BICAN Field Name,BICAN UUID,Aliases,LinkML Class,Definition,Nullable,Permissible Values,Data Type,Data Example,Min Value,Max Value,Unit,Statistical Variable Type,Subsets,Version Date +cell_id,,,,"Identifier corresponding to each individual cell. Included in the data and in every other location to refer to the data (e.g., metadata and annotations). In AnnData files, the ID corresponding to each individual cell is stored in the obs index.",,,string,,,,,,"data, assigned metadata, calculated metadata, tooling", +feature_matrix_label,,,,ID of the associated feature matrix where the data is stored (if not included in this file). Used in BKP when data is found elsewhere for connected cell to data file.,,,,,,,,,"assigned metadata, calculated metadata, annotations", +dataset_label,,,,Link between each cell and each dataset in BKP. Need clarification on how this differs from feature_matrix_label; for CAS this is a taxonomy-level variable in uns called dataset_url.,,,,,,,,,"assigned metadata, calculated metadata, annotations", +[COLUMN_NAME]_color,,,,"Color vector for metadata/taxonomy values in format [COLUMN_NAME]_label. This is ONLY used for molgen-shiny plots, but because of this, some metadata files come with these and some don't and that could cause challenges. Should revisit how to store colors and how to deal with metadata in both formats. 
Should also agree on a standard for which way is preferred.",,,,,,,,,"assigned metadata, calculated metadata, annotations", +[COLUMN_NAME]_id,,,,"Same as above, but in this case for the order of metadata values (e.g., the levels of a factor, or ascending order of a numeric)",,,,,,,,,"assigned metadata, calculated metadata, annotations", +assay,,,,"In CELLxGENE these correspond to a human-readable modality along with the associated EFO ontology term. We often use the term modality in place of assay (e.g., 'Smart-seq2' corresponds to 'EFO:0008931', '10x 3' v3' corresponds to 'EFO:0009922'). This is called ""library method"" in BKP. Ideally we will agree on a term for this and it will be provided upstream from BICAN. Called Modality in taxonomy Google Sheet.",,,,,,,,,assigned metadata, +assay_ontology_term_id,,,,The ontology ID of the assay term.,,,,,,,,,assigned metadata, +suspension_type,,,,"Either ""cell"", ""nucleus"", or ""na"" in CELLxGENE. Called entity in the BKP. We should pick one to use.",,"cell, nucleus, na",value set,,,,,,assigned metadata, +[batch_condition_columns],,,,"Zero or more vectors of metadata associated with batches. These are not required, but are called out separately by cellxgene for analysis purposes.",,,,,,,,,assigned metadata, +[additional uncontrolled metadata],,,,"Additional uncontrolled cell metadata. These are not required, but any additional columns are allowed by all h5ad formats.",,,,,,,,,"assigned metadata, calculated metadata", +brain_region,,,,"Brain region(s) sampled. Called tissue_ontology_term_id in cellxgene; cell_set structures also defined below; called region_of_interest_label and anatomic_division_label in BKP. Also associated are acronyms, labels, etc.; More generally need to arrive at a way of dealing with brain regions. 
Note that this slot in the Assigned metadata is meant to deal with cell-level assignments for brain region (e.g., dissection) and NOT cell set summarizations by brain region, which are included below.",,,,,,,,,assigned metadata, +tissue,,,,"Along with ""tissue"" field, these correspond to UBERON terms for the 'brain region' fields that we have (e.g., 'brain' = 'UBERON_0000955'). In process: we need to discuss how to integrate Allen reference brain atlases for mouse and human.",,,,,,,,,assigned metadata, +tissue_ontology_term_id,,,,"Along with ""tissue"" field, these correspond to UBERON terms for the 'brain region' fields that we have (e.g., 'brain' = 'UBERON_0000955'). In process: we need to discuss how to integrate Allen reference brain atlases for mouse and human.",,,,,,,,,assigned metadata, +donor_id,,,,"Identifier for the unique individual, ideally from the specimen portal (or other upstream source). This is called donor_label in the BKP. Should converge on a standard term. More than one identifier may be needed, but ideally for the analysis only a single one is retained and stored here.",,,,,,,,,assigned metadata, +species,,,,"Species sampled. This is split into two fields in CAP/cellxgene/BICAN: organism (e.g., Homo sapiens) and organism_ontology_term_id (e.g., 'NCBITaxon:9606'). For consistency, we should change species to organism and could write a function to automatically identify the ontology term (I think GeneOrthology already has one). Called Species name and Species ID in taxonomy Google Sheet.",,,,,,,,,assigned metadata, +age,,,,"Currently a free text field for defining the age of the donor. In CELLxGENE this is recorded in development_stage_ontology_term_id and is HsapDv if human, MmusDv if mouse. I'm not sure what this means, but more generally, we should align with BICAN on how to deal with this value.",,,,,,,,,assigned metadata, +sex,,,,"Placeholder for donor sex. 
Called sex_ontology_term_id (e.g., PATO:0000384/383 for male/female) in CELLxGENE and called ""donor_sex"" in BKP. We should align on a single term.",,,,,,,,,assigned metadata, +donor_genotype,,,,"One (or sometimes more) column related to the genotype of the animal (for transgenic mice, in particular). Not used for humans and most NHP.",,,,,,,,,assigned metadata, +self_reported_ethnicity_ontology_term_id,,,,Controversial field that is required for CELLxGENE but otherwise not used. HANCESTRO term if human and 'na' if non-human.,,,,,,,,,assigned metadata, +disease,,,,A human-readable name for a disease.,,,,,,,,,assigned metadata, +disease_ontology_term_id,,,,The associated MONDO ontology term (or PATO:0000461 for 'normal'). Used in CELLxGENE and ideally we can also adopt for SEA-AD and other use cases.,,,,,,,,,assigned metadata, +cluster,,,,"This is the CRITICAL column used for cluster annotations. It is the baseline for the majority of cell_annotation columns discussed below. Sometimes called cluster_label : [""Annotations""] : There is also an additional cluster_alias column used in mouse whole brain data and for BKP that I'm not sure how to wrap in. It's also used for cirrocumulus. Some discussion might be needed on whether this is one or more columns, and which one is the source of truth. It's also worth noting that this is a prerequisite for annotations, so maybe it better fits in a different category (analysis?).",,,,,,,,,annotations, +[additional uncontrolled metadata],,,,"Additional uncontrolled cell metadata. These are not required, but any additional columns are allowed by all h5ad formats.",,,,,,,,,annotations, +[cellannotation_set]--parent_cell_set_accession,,,,"ID corresponding to the parent cell_set. If not needed for annotations, definitely needed for tooling.",,,,,,,,,tooling, +gene,,,,"A vector of gene symbols. Broadly useful in the community for defining genes but occasionally problematic; called gene_symbol in BKP. 
More generally, some alignment is needed about whether these or the ensembl_id are used for the gene identifier column (CELLxGENE uses a very specific version of ensembl_id for the INDEX).",,,,,,,,,"data, analysis", +ensembl_id,,,,A vector of corresponding Ensembl IDs for each gene symbol. This is required for disambiguation of gene symbols; called gene_identifier in BKP. This is optional for AIT (but maybe it shouldn't be?).,,,,,,,,,data, +biotype,,,,"biotype from the gtf file (e.g., protein_coding); used in BKP and BICAN for filtering of genes (but optional elsewhere)",,,,,,,,,data, +name,,,,Longer gene name from the gtf file; used in BKP (optional for now),,,,,,,,,data, +[additional gene info],,,,"Optional uncontrolled gene info; could include gene length, GENCODE IDs, NCBI identifiers, etc.",,,,,,,,,data, +marker_genes_[…],,,,"A set of logical vectors (T/F) indicating which genes are markers used to build the dendrogram, or for other purposes. The [...] part of the name links to additional metadata in the uns. This needs to be UPDATED in AIT to allow multiple marker gene sets; markers currently stored differently in CAP.",,,,,,,,,"annotations, analysis", +highly_variable_genes,,,,A logical vector (T/F) indicating which genes are highly variable. Used for correlation-based mapping in scrattch.mapping.,,,,,,,,,analysis, +dataset_metadata,,,,"TBD information about the data set itself. A standard on this is not established (as far as I know?) but could include some combination of these pieces of information recorded for Annotations below: description, dataset_url, title, dataset_doi, author_list, author_name, author_contact, orcid, etc.",,,,,,,,,data, +description,,,,,,,,,,,,,, +dataset_url,,,,,,,,,,,,,, +title,,,,,,,,,,,,,, +dataset_doi,,,,,,,,,,,,,, +author_list,,,,,,,,,,,,,, +author_name,,,,,,,,,,,,,, +author_contact,,,,,,,,,,,,,, +orcid,,,,,,,,,,,,,, +dend,,,,"A JSON-formatted dendrogram used for tree mapping. Created by scrattch.taxonomy if not provided. 
Sometimes used for taxonomy annotation, but we are moving away from it with larger taxonomies and so this may now make more sense in the ""analysis"" category.",,,,,,,,,"annotations, analysis", +labelsets,,,,"CRITICAL extra component; Equivalent to Cluster annotation term set in BKP. This is saved as a data frame representation (or is a list of data frames needed?), with some information about each [cellannotation_set] set of columns (e.g., subclass, class, neurotransmitter, etc.). Specifically: ""name"", ""description"", and ""rank"" (0 most specific) and some information about provenance are needed for each labelset.",,,,,,,,,annotations, +cell_set_relationships,,,,"NEW proposed mechanism for dealing with sibling relationships for things like gradients, trajectories, constellation diagrams, etc. This is stored as a data frame (table) of all relations with five columns: cell_set_accession1, cell_set_accession2, relation_label, value, direction. Could alternatively be stored as a JSON representation that unpacks into a dataframe.",,,,,,,,,annotations, +filter,,,,"Indicator of which cells to use for a given child taxonomy (subset), saved as a list of vectors. Each entry in this list is named for the relevant ""mode"" and has TRUE/FALSE calls indicating whether a cell is filtered out (e.g., the ""standard"" taxonomy is all FALSE). This is critical for how child taxonomies are defined and implemented in scrattch.taxonomy but differs from how taxonomies are stored in all other schemas--some discussion may be needed.",,,,,,,,,annotations, +title,,,,"Taxonomy name (e.g., ""AIT30""); called title in cellxgene, not sure about other schema. Called Taxonomy short name in taxonomy Google Sheet.",,,,,,,,,annotations, +taxonomy_id,,,,"Taxonomy ID in CCN format (e.g., ""CCN030420240""); TBD how this is generated, but MUST be globally unique. Also used as part of PURL (I think). 
Called Taxonomy ID in taxonomy Google Sheet too.",,,,,,,,,annotations, +description,,,,"Free text description of the taxonomy (or of the dataset on CAP). This is also something we are adding as a requirement for the BKP, and I think should be required for all taxonomies. Called Description in taxonomy Google Sheet.",,,,,,,,,annotations, +taxonomy_citation,,,,"""|""-separated publication DOIs of the taxonomy (e.g., ""doi:10.1038/s41586-018-0654-5""). Called Publication in taxonomy Google Sheet.",,,,,,,,,annotations, +marker_gene_metadata,,,,"Data frame of Marker genes x dims that includes metadata for marker gene sets in var above; NEW and required if marker_genes_[…] is provided. At minimum a name (matching above) and description are needed, but potentially other things (e.g., what is it for, with controlled vocabulary).",,,,,,,,,annotations, +transferred_annotations_metadata,,,,"Data frame of info about each transferred annotation column: source_taxonomy, algorithm_name, comment; Still some work on the best way to code this, but it is important. Linked to data in var above. This is for taxonomy-level metadata. This is also already encoded in TDT--how?",,,,,,,,,annotations, +taxonomyDir,,,,"Location of the h5ad file; we might be able to remove this, since it is redundant with dataset_url and/or matrix_file_id. Called Taxonomy file location in taxonomy Google Sheet.",,,,,,,,,annotations, +dataset_url,,,,"PURL of taxonomy; Possibly a redundant field, but critical; also publication_url and cellannotation_url (unclear how different)",,,,,,,,,"annotations, tooling", +matrix_file_id,,,,Like dataset_url; e.g. CellXGene_dataset:8e10f1c4-8e98-41e5-b65f-8cd89a887122; Note: needs to be extended to allow for more than one file and connected to feature_matrix_label in obs. We need this field!,,,,,,,,,"annotations, tooling", +author_list,,,,"List of all collaborators, comma separated [First] [Last]; Useful in general, even though currently only required by CAP. 
Called Taxonomy Users in taxonomy Google Sheet.",,,,,,,,,"annotations, tooling", +author_name,,,,"The primary author [First Name] [Last Name]; in CCN was called ""taxonomy_author""; In CCN also separated by cell_set with ""cell_set_alias_assignee""; Called Point person name in taxonomy Google Sheet.",,,,,,,,,annotations, +author_contact,,,,Valid email address; Called Point person email in taxonomy Google Sheet.,,,,,,,,,annotations, +orcid,,,,Valid ORCID; Called Point person ORCID in taxonomy Google Sheet.,,,,,,,,,annotations, +annotation_source,,,,Additional metadata about annotation algorithm; Similar to taxonomy algorithm info stored for CCN,,,,,,,,,annotations, +assigned_metadata_metadata,,,,TBD information about the assigned_metadata itself. This likely is not needed or should be renamed.,,,,,,,,,assigned metadata, +batch_condition,,,,"List of obs fields that define “batches”; Used by CELLxGENE if provided, but otherwise not needed.",,,,,,,,,assigned metadata, +calculated_metadata_metadata,,,,TBD information about the calculated_metadata itself. This likely is not needed or should be renamed.,,,,,,,,,calculated metadata, +cell_annotation_schema,,,,Extended metadata about annotations and labelsets stored in JSON.,,,,,,,,,calculated metadata, +QC_markers,,,,"Marker gene expression in on-target and off-target cell populations, useful for patchseq analysis. Also includes information about KL divergence calculations and associated QC calls. Defined by buildPatchseqTaxonomy.",,,,,,,,,analysis, +filter,,,,"Indicator of which cells to use for a given child taxonomy (subset), as defined above.",,,,,,,,,analysis, +mode,,,,"Taxonomy mode that determines which filter to use (e.g., that indicates which child taxonomy to map against). Several of the other analysis components of the uns have things saved with mode as the name in the h5ad file. See scrattch.mapping documentation. 
Mode is the Taxonomy short name in taxonomy Google Sheet for a child taxonomy with the Parent taxonomy listed as the taxonomyName.",,,,,,,,,analysis, +clustersUse,,,,A vector of cluster names to use for taxonomy. We should be able to remove this,,,,,,,,,analysis, +clusterInfo,,,,A data.frame of cluster information. We should be able to remove this,,,,,,,,,analysis, +marker_gene_metadata,,,,"Metadata about any new marker gene lists added, if any. See above.",,,,,,,,,analysis, +development_date,,,,Date of taxonomy development. Required for Google Sheet. Potentially not needed if we want to infer from taxonomy_id.,,,,,,,,,analysis, +public,,,,Logical flag indicating whether taxonomy should be public or private. Required for Google Sheet. Potentially not needed if we want to infer from PURL/GitHub somehow.,,,,,,,,,analysis, +annotation_sheet,,,,Link to annotation sheet (ideally a TDT GitHub repo link for communal annotation). An optional slot in the Google sheet. I'm not sure if this is listed above somewhere.,,,,,,,,,analysis, +purpose,,,,"Controlled vocabulary (currently ""General"" and/or ""Patch-seq""). Required for Google Sheet at the moment.",,,,,,,,,analysis, +schema_version,,,,"cellxgene schema version (e.g., ""3.0.0"")",,,,,,,,,tooling, +[...]_color,,,, RGB color vector for metadata [...]; required only for selecting colors in cirrocumulus. This may be the same as the [COLUMN_NAME]_color column above.,,,,,,,,,tooling, +cellannotation_schema_version,,,,CAS schema version '[MAJOR].[MINOR].[PATCH]',,,,,,,,,tooling, +cellannotation_timestamp,,,,"Timestamp when published: %yyyy-%mm-%dd %hh:%mm:%ss; Useful in general, even though currently only required by CAP; also publication_XXXX (unclear how different); This also could be the same as development_date above.",,,,,,,,,tooling, +cellannotation_version,,,,CAP taxonomy annotation version; required by CAP; also publication_XXXX (unclear how different). 
I'm also not sure how this differs from the cellannotation_schema_version.,,,,,,,,,tooling, +[additional information],,,,"Placeholder for several other (seemingly redundant) fields required by external tools (e.g., CAP, cellxgene) that I want to capture here. It may or may not make sense to spell them all out.",,,,,,,,,tooling, +umap,,,,"2 (or more)-dimensional representation of cells in AIT. Must be of the form X_[...] for use with CELLxGENE. Only the first two dimensions are used for AIT and CELLxGENE, but 3 dimensions can be used for cirrocumulus.",,,,,,,,,analysis, +pca,,,,Additional terms for embedding multi-dimensional principal components and latent spaces,,,,,,,,,analysis, +scVI,,,,Additional terms for embedding multi-dimensional principal components and latent spaces,,,,,,,,,analysis, \ No newline at end of file diff --git a/docs/schemas/Cell-Annotation-Schema/Aligned_taxonomy_schema.xlsx b/docs/schemas/Cell-Annotation-Schema/Aligned_taxonomy_schema.xlsx new file mode 100644 index 0000000..8a0a1c9 Binary files /dev/null and b/docs/schemas/Cell-Annotation-Schema/Aligned_taxonomy_schema.xlsx differ diff --git a/docs/schemas/Cell-Annotation-Schema/Annotations.csv b/docs/schemas/Cell-Annotation-Schema/Annotations.csv new file mode 100644 index 0000000..34a1736 --- /dev/null +++ b/docs/schemas/Cell-Annotation-Schema/Annotations.csv @@ -0,0 +1,30 @@ +Proposed BICAN Field Name,BICAN UUID,Aliases,LinkML Class,Definition,Nullable,Permissible Values,Data Type,Data Example,Min Value,Max Value,Unit,Statistical Variable Type,Subsets,Version Date +labelset,,,,The unique name of the set of cell annotations. Each cell within the AnnData/Seurat file MUST be associated with a 'cell_label' value in order for this to be a valid 'cellannotation_setname'.,FALSE,,string,,,,,,, +cell_label,,,,"This denotes any free-text term which the author uses to annotate cells, i.e. the preferred cell label name used by the author. 
Abbreviations are acceptable in this field; refer to 'cell_fullname' for related details. Certain key words have been reserved: 'doublets' is reserved for encoding cells defined as doublets based on some computational analysis; 'junk' is reserved for encoding cells that failed sequencing for some reason, e.g. few genes detected, high fraction of mitochondrial reads; 'unknown' is explicitly reserved for unknown or 'author does not know'; 'NA' is incomplete, i.e. no cell annotation was provided.",FALSE,,string,,,,,,, +cell_fullname,,,,"This MUST be the full-length name for the biological entity listed in cell_label by the author. (If the value in cell_label is the full-length term, this field will contain the same value.) NOTE: any reserved word used in the field 'cell_label' MUST match the value of this field.",TRUE,,string,"EXAMPLE 1: Given the matching terms 'LC' and 'luminal cell' used to annotate the same cell(s), then users could use either term as the value in the field 'cell_label'. However, the abbreviation 'LC' CANNOT be provided in the field 'cell_fullname'. EXAMPLE 2: Either the abbreviation 'AC' or the full-length term intended by the author 'GABAergic amacrine cell' MAY be placed in the field 'cell_label', but as the full-length term naming this biological entity, 'GABAergic amacrine cell' MUST be placed in the field 'cell_fullname'.",,,,,, +cell_ontology_term_id,,,,"This MUST be a term from either the Cell Ontology (https://www.ebi.ac.uk/ols/ontologies/cl) or from some ontology that extends it by classifying cell types under terms from the Cell Ontology, e.g. the Provisional Cell Ontology (https://www.ebi.ac.uk/ols/ontologies/pcl) or the Drosophila Anatomy Ontology (DAO) (https://www.ebi.ac.uk/ols4/ontologies/fbbt). 
NOTE: The closest available ontology term matching the value within the field 'cell_label' (at the time of publication) MUST be used. For example, if the value of 'cell_label' is 'relay interneuron', but this entity does not yet exist in the ontology, users must choose the closest available term in the CL ontology. In this case, it's the broader term 'interneuron' i.e. https://www.ebi.ac.uk/ols/ontologies/cl/terms?obo_id=CL:0000099.",TRUE,,string,,,,,,, +cell_ontology_term,,,,This MUST be the human-readable name assigned to the value of 'cell_ontology_term_id'.,TRUE,,string,,,,,,, +cell_ids,,,,List of cell barcode sequences/UUIDs used to uniquely identify the cells within the AnnData/Seurat matrix. Any and all cell barcode sequences/UUIDs MUST be included in the AnnData/Seurat matrix.,TRUE,,list,,,,,,, +rationale,,,,"The free-text rationale which users provide as justification/evidence for their cell annotations. Researchers are encouraged to use this field to cite relevant publications in-line using standard academic citations of the form (Zheng et al., 2020). This human-readable free-text MUST be encoded as a single string. All references cited SHOULD be listed using DOIs under rationale_dois. There MUST be a 2000-character limit.",TRUE,,string,,,,,,, +rationale_dois,,,,A list of valid publication DOIs cited by the author to support or provide justification/evidence/context for 'cell_label'.,TRUE,,list,,,,,,, +marker_gene_evidence,,,,List of names of genes whose expression in the cells being annotated is explicitly used as evidence for this cell annotation. Each gene MUST be included in the matrix of the AnnData/Seurat file.,TRUE,,list,,,,,,, +synonyms,,,,"This field denotes any free-text term of a biological entity which the author associates as synonymous with the biological entity listed in the field 'cell_label'. In the case whereby no synonyms exist, the authors MAY leave this as blank, which is encoded as 'NA'. 
However, this field is NOT OPTIONAL.",TRUE,,list,,,,,,, +reviews,,,,,,,list,,,,,,, +datestamp,,,,Time and date review was last edited.,FALSE,,"string, format: date-time",,,,,,, +reviewer,,,,Review author.,,,string,,,,,,, +review,,,,"Reviewer's verdict on the annotation. Must be 'Agree' or 'Disagree'. Must be one of: [""Agree"", ""Disagree""].",,"agree, disagree",value set,,,,,,, +explanation,,,,Free-text review of annotation. This is required if the verdict is disagree and should include reasons for disagreement.,TRUE,,string,,,,,,, +author_annotation_fields,,,,A dictionary of author defined key value pairs annotating the cell set. The names and aims of these fields MUST not clash with official annotation fields.,,,object,,,,,,, +cell_set_accession,,,,"An identifier that can be used to consistently refer to the set of cells being annotated, even if the cell_label changes.",,,string,,,,,,, +parent_cell_set_accession,,,,"A list of accessions of cell sets that subsume this cell set. This can be used to compose hierarchies of annotated cell sets, built from a fixed set of clusters.",,,string,,,,,,, +transferred_annotations,,,,,,,list,,,,,,, +transferred_cell_label,,,,Transferred cell label.,,,string,,,,,,, +source_taxonomy,,,,PURL of source taxonomy.,,,string,,,,,,, +source_node_accession,,,,accession of node that label was transferred from.,,,string,,,,,,, +algorithm_name,,,,,,,string,,,,,,, +comment,,,,Free text comment on annotation transfer.,,,string,,,,,,, +cells,,,,By convention this is only used for annotation transfer labelsets. It MUST not be combined with the 'cell_ids' field.,,,list,,,,,,, +cell_id,,,,Identifier for a single cell.,FALSE,,string,,,,,,, +confidence,,,,Normalized confidence score.,,,number,,,,,,, +author_categories,,,,,,,list,,,,,,, +negative_marker_gene_evidence,,,,"List of names of genes, the absence of expression of which is explicitly used as evidence for this cell annotation. 
Each gene MUST be included in the matrix of the AnnData/Seurat file.",,,list,,,,,,, \ No newline at end of file diff --git a/docs/schemas/Cell-Annotation-Schema/Cell-Annotation-Schema.csv b/docs/schemas/Cell-Annotation-Schema/Cell-Annotation-Schema.csv new file mode 100644 index 0000000..21c771d --- /dev/null +++ b/docs/schemas/Cell-Annotation-Schema/Cell-Annotation-Schema.csv @@ -0,0 +1,14 @@ +Proposed BICAN Field Name,BICAN UUID,Aliases,LinkML Class,Definition,Nullable,Permissible Values,Data Type,Data Example,Min Value,Max Value,Unit,Statistical Variable Type,Subsets,Version Date +matrix_file_id,,,,"A resolvable ID for a cell by gene matrix file in the form namespace:accession, e.g. CellXGene_dataset:8e10f1c4-8e98-41e5-b65f-8cd89a887122. Please see https://github.com/cellannotation/cell-annotation-schema/registry/registry.json for supported namespaces.",,,string,CellXGene_dataset:8e10f1c4-8e98-41e5-b65f-8cd89a887122,,,,,, +title,,,,The title of the dataset. This MUST be less than or equal to 200 characters. e.g. 'Human retina cell atlas - retinal ganglion cells'.,FALSE,,string,'Human retina cell atlas - retinal ganglion cells',,,,,, +description,,,,"The description of the dataset. e.g. 'A total of 15 retinal ganglion cell clusters were identified from over 99K retinal ganglion cell nuclei in the current atlas. Utilizing previous characterized markers from macaque, 5 clusters can be annotated.'.",,,string,"'A total of 15 retinal ganglion cell clusters were identified from over 99K retinal ganglion cell nuclei in the current atlas. Utilizing previous characterized markers from macaque, 5 clusters can be annotated.'",,,,,, +cellannotation_schema_version,,,,"The schema version, the cell annotation open standard. 
Current version MUST follow 0.1.0. This versioning MUST follow the format '[MAJOR].[MINOR].[PATCH]' as defined by Semantic Versioning 2.0.0, https://semver.org/.",,,string,1.2.1,,,,,, +cellannotation_timestamp,,,,The timestamp of all cell annotations published (per dataset). This MUST be a string in the format '%yyyy-%mm-%dd %hh:%mm:%ss'.,,,date-time,,,,,,, +cellannotation_version,,,,"The version for all cell annotations published (per dataset). This MUST be a string. The recommended versioning format is '[MAJOR].[MINOR].[PATCH]' as defined by Semantic Versioning 2.0.0, https://semver.org/.",,,string,1.2.1,,,,,, +cellannotation_url,,,,A persistent URL of all cell annotations published (per dataset).,,,string,,,,,,, +author_list,,,,"This field stores a list of users who are included in the project as collaborators, regardless of their specific role. An example list; '['John Smith', 'Cody Miller', 'Sarah Jones']'.",,,string," '['John Smith', 'Cody Miller', 'Sarah Jones']'",,,,,, +author_name,,,,Primary author's name. This MUST be a string in the format [FIRST NAME] [LAST NAME].,FALSE,,string,,,,,,, +author_contact,,,,Primary author's contact. This MUST be a valid email address of the author.,,,"string, format: email",,,,,,, +orcid,,,,Primary author's ORCID. 
This MUST be a valid ORCID for the author.,,,string,,,,,,, +labelsets,,,,see labelsets table,FALSE,,array,,,,,,, +annotations,,,,see annotations table,FALSE,,array,,,,,,, \ No newline at end of file diff --git a/docs/schemas/Cell-Annotation-Schema/Cell-Annotation-Schema.xls b/docs/schemas/Cell-Annotation-Schema/Cell-Annotation-Schema.xls new file mode 100644 index 0000000..0b129ee Binary files /dev/null and b/docs/schemas/Cell-Annotation-Schema/Cell-Annotation-Schema.xls differ diff --git a/docs/schemas/Cell-Annotation-Schema/Labelsets.csv b/docs/schemas/Cell-Annotation-Schema/Labelsets.csv new file mode 100644 index 0000000..852cc9f --- /dev/null +++ b/docs/schemas/Cell-Annotation-Schema/Labelsets.csv @@ -0,0 +1,10 @@ +Proposed BICAN Field Name,BICAN UUID,Aliases,LinkML Class,Definition,Nullable,Permissible Values,Data Type,Data Example,Min Value,Max Value,Unit,Statistical Variable Type,Subsets,Version Date +name,,,,name of annotation key.,FALSE,,string,,,,,,, +description,,,,Some text describing what types of cell annotation this annotation key is used to record.,TRUE,,string,,,,,,, +annotation_method,,,,"The method used for creating the cell annotations. This MUST be one of the following strings: 'algorithmic', 'manual', or 'both' . Must be one of: [""algorithmic"", ""manual"", ""both""].",TRUE,"algorithmic, manual, both",value set,,,,,,, +automated_annotation,,,,,TRUE,,object,,,,,,, +algorithm_name,,,,The name of the algorithm used. It MUST be a string of the algorithm's name.,FALSE,,string,,,,,,, +algorithm_version,,,,"The version of the algorithm used (if applicable). It MUST be a string of the algorithm's version, which is typically in the format '[MAJOR].[MINOR]', but other versioning systems are permitted (based on the algorithm's versioning).",FALSE,,string,,,,,,, +algorithm_repo_url,,,,This field denotes the URL of the version control repository associated with the algorithm used (if applicable). 
It MUST be a string of a valid URL.,FALSE,,string,,,,,,, +reference_location,,,,"This field denotes a valid URL of the annotated dataset that was the source of annotated reference data. This MUST be a string of a valid URL. The concept of a 'reference' specifically refers to 'annotation transfer' algorithms, whereby a 'reference' dataset is used to transfer cell annotations to the 'query' dataset.",TRUE,,string,,,,,,, +rank,,,,A number indicating relative granularity with 0 being the most specific. Use this where a single dataset has multiple keys that are used consistently to record annotations at different levels of granularity.,TRUE,,integer,,,,,,, \ No newline at end of file