A repository for building ontologies for the Brain Data Standards Project.
Status: Draft
The main purpose of this repo is to automate data-driven cell-type ontology development for the Brain Data Standards initiative. The main inputs are:
- Dendrograms in JSON format, provided by the Allen Institute, encoding a data-driven classification of brain cell types. These files also include a nomenclature standard (and mapping system) developed by the Allen Institute: https://arxiv.org/abs/2006.05406. See dendrogram spec for details.
- CSV files identifying and summarising dendrograms, including species and anatomical region.
- CSV mapping files that combine dendrogram nodes into groupings that do not correspond to any single dendrogram node, but do correspond to known cell types.
- Marker files (ROBOT templates) that map marker combinations with high predictive capacity for dendrogram nodes (generated by NS-Forest) onto those nodes.
- Automatically seeded, manually curated ROBOT templates mapping nodes to classes in CL and to various properties (e.g. soma location).
Figure 1: Build overview
The build system is an extended version of the Ontology Development Kit (ODK), an automated ontology build system using ROBOT and Makefiles. As well as managing the build from input files, it automatically generates modules from referenced ontologies and integrates them into the build.
You will need Docker installed. Running a build will pull the required containers with all required dependencies.
To build:
cd src/ontology
sh ./run.sh make prepare_release
This dynamically updates imports and builds the reasoned release files. The slowest part of the build is mirroring (downloading and reserialising) external ontologies. If you've run a build recently, mirrored versions will already be stored in src/ontology/mirror. To run a build without mirroring:
cd src/ontology
sh ./run.sh make prepare_release MIR=false
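The choice between the two build commands above can be sketched as a small helper. This is only an illustration, not part of the repo: the function name `build_command` is hypothetical, and it assumes that any file under the mirror directory means mirrored ontologies from an earlier build are available.

```python
from pathlib import Path


def build_command(ontology_dir, target="prepare_release"):
    """Return the run.sh command, adding MIR=false when mirrors already exist."""
    mirror_dir = Path(ontology_dir) / "mirror"
    # Assumption: any regular file under mirror/ is a previously mirrored ontology.
    has_mirrors = mirror_dir.is_dir() and any(
        path.is_file() for path in mirror_dir.iterdir()
    )
    command = ["sh", "./run.sh", "make", target]
    if has_mirrors:
        # Skip the slow mirroring step and reuse the stored copies.
        command.append("MIR=false")
    return command


if __name__ == "__main__":
    # Only prints the command; run it from src/ontology to start a build.
    print(" ".join(build_command("src/ontology")))
```

The helper does not check how old the mirrored files are, so a full build is still needed whenever up-to-date imports matter.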
To extend the set of ontologies that are imported from, edit bdscratch-odk.yaml to add the required ontology to import_group.products, then run:
sh ./run.sh make update_repo
Then update the import statements in src/ontology/bdscratch-edit.owl.
Extensions to the build are specified (as per ODK standard) in bdscratch.Makefile.
Dendrograms live in /src/dendrograms/. They are named according to their Allen Dendrogram ID, e.g. CCN201908210.json
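The naming convention above makes it straightforward to discover which dendrograms are present. The following is a minimal sketch rather than one of the repo's scripts: the function name is hypothetical, and the accession pattern (`CCN` followed by digits) is an assumption based on the single example ID given here.

```python
import re
from pathlib import Path

# Assumed accession pattern: "CCN" followed by digits, as in CCN201908210.
ACCESSION_PATTERN = re.compile(r"^CCN\d+$")


def list_dendrogram_accessions(dendrogram_dir):
    """Return the sorted accession IDs of the JSON dendrograms in a directory."""
    directory = Path(dendrogram_dir)
    if not directory.is_dir():
        return []
    return sorted(
        path.stem
        for path in directory.glob("*.json")
        # Ignore JSON files whose name is not an accession ID.
        if ACCESSION_PATTERN.match(path.stem)
    )


if __name__ == "__main__":
    for accession in list_dendrogram_accessions("src/dendrograms"):
        print(accession)
```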
We expect dendrograms to remain stable for relatively long periods of time, and at least some generated ROBOT templates are intended to be manually edited to map to CL classes and property-driven classification. For these reasons, we store generated templates in the repo and build them as needed using a separate Makefile: src/dendrograms/Makefile.
To build (be careful you don't wipe out curation!):
cd src/dendrograms
# Build all
sh ./run.sh make
# Build specific template
sh ./run.sh make <template_filename>
# Build a specific set of templates
sh ./run.sh make JOBS=<dendrogram_id>
Templates are built from dendrograms using Python scripts in src/scripts.
Extended information about groupings of taxonomy nodes that are candidates for curation is stored in additional TSV files (accession.tsv). Support for incorporating this information into templates is TBA.
ROBOT templates live in /src/templates/.
filename | e.g. | Description |
---|---|---|
{accession}.tsv | CCN201810310.tsv | Template for generating taxonomy as OWL individuals |
{accession}_class.tsv | CCN201810310_class.tsv | Templates for generating classes corresponding to OWL individuals in taxonomy. Includes slots for curating cell type & properties |
{accession}_markers.tsv | CCN201810310_markers.tsv | Templates for adding markers. Referenced markers must be present in gene reference files. |
{ensembl_gene_file}.tsv | ensmusg.tsv | ROBOT template listing all genes (all possible markers) for analysis/dendrograms of a specific species. |
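The per-accession naming scheme in the table can be expressed as a short helper, for example to check which templates exist for a given dendrogram. This is a sketch under stated assumptions: the function names and dictionary keys are hypothetical, and only the three filename patterns come from the table.

```python
from pathlib import Path


def template_filenames(accession):
    """Map each per-accession template kind to its expected file name."""
    return {
        "taxonomy": f"{accession}.tsv",         # taxonomy as OWL individuals
        "classes": f"{accession}_class.tsv",    # classes plus curation slots
        "markers": f"{accession}_markers.tsv",  # marker assignments
    }


def missing_templates(accession, template_dir):
    """Return the expected template file names that are absent from template_dir."""
    directory = Path(template_dir)
    return [
        name
        for name in template_filenames(accession).values()
        if not (directory / name).is_file()
    ]


if __name__ == "__main__":
    print(template_filenames("CCN201810310"))
```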
The ensembl_gene_file name follows the standard Ensembl ID prefixes, but in lowercase. For example, ensmusg.tsv (Ensembl mouse gene) lists genes with IDs of the form ENSMUSG{numeric_accession}.
Markers are referenced by Ensembl ID using an identifiers.org URL scheme.
Ensembl gene file templates are used to generate mirror files, which act as source files for import generation, so that only referenced markers end up in the release files.
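The gene file naming rule and the marker URL scheme can be illustrated with a small sketch. The function names are hypothetical, and the base URL `http://identifiers.org/ensembl/` is an assumption: this README only says that an identifiers.org scheme is used, so check the templates for the exact form.

```python
import re

# Ensembl gene IDs: "ENS", an optional species code, "G", then digits.
GENE_ID_PATTERN = re.compile(r"^(ENS[A-Z]*G)(\d+)$")

# Assumed base URL; the exact identifiers.org form used by the repo is not stated here.
IDENTIFIERS_ORG_BASE = "http://identifiers.org/ensembl/"


def gene_template_filename(gene_id):
    """Return the gene reference template name for an Ensembl gene ID."""
    match = GENE_ID_PATTERN.match(gene_id)
    if match is None:
        raise ValueError(f"Not an Ensembl gene ID: {gene_id!r}")
    # The file name is the ID prefix in lowercase, e.g. ENSMUSG -> ensmusg.tsv.
    return match.group(1).lower() + ".tsv"


def marker_iri(gene_id):
    """Return an identifiers.org URL for an Ensembl gene ID (assumed scheme)."""
    if GENE_ID_PATTERN.match(gene_id) is None:
        raise ValueError(f"Not an Ensembl gene ID: {gene_id!r}")
    return IDENTIFIERS_ORG_BASE + gene_id


if __name__ == "__main__":
    print(gene_template_filename("ENSMUSG00000000001"))
    print(marker_iri("ENSMUSG00000000001"))
```

Versioned IDs (with a trailing `.N`) are rejected by this pattern on purpose, since the text above describes IDs as a prefix plus a numeric accession only.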
The GTF files used as reference for BDSO can be found in this Google Drive folder.