GrEBI (Graphs@EBI)

An HPC pipeline that aggregates knowledge graphs from EMBL-EBI resources, the MONARCH Initiative KG, ROBOKOP, Ubergraph, and other sources into very large (multi-terabyte), transient Neo4j, Solr, and RocksDB databases for querying.

Outputs

The resulting transient databases can be downloaded from https://ftp.ebi.ac.uk/pub/databases/spot/kg/ebi/

| Name | Description | # Nodes | # Edges | Neo4j DB size |
| --- | --- | --- | --- | --- |
| `ebi_monarch_xspecies` | All datasources with cross-species phenotype matches merged | ~130m | ~850m | ~900 GB |
| `ebi_monarch` | All datasources with cross-species phenotype matches separated | | | |
| `impc_x_gwas` | Limited to data from IMPC, GWAS Catalog, and related ontologies and mappings | ~30m | ~184m | |

Note that the purpose of this pipeline is not to supply another knowledge graph, but to facilitate querying and analysis across existing ones. Consequently, the above databases should be considered temporary and may be removed or replaced with new ones without warning.

Mapping sets used

The following mapping tables are loaded:

https://data.monarchinitiative.org/mappings/latest/gene_mappings.sssom.tsv
https://data.monarchinitiative.org/mappings/latest/hp_mesh.sssom.tsv
https://data.monarchinitiative.org/mappings/latest/mesh_chebi_biomappings.sssom.tsv
https://data.monarchinitiative.org/mappings/latest/mondo.sssom.tsv
https://data.monarchinitiative.org/mappings/latest/umls_hp.sssom.tsv
https://data.monarchinitiative.org/mappings/latest/upheno_custom.sssom.tsv
https://raw.githubusercontent.com/mapping-commons/mh_mapping_initiative/master/mappings/mp_hp_mgi_all.sssom.tsv
https://raw.githubusercontent.com/obophenotype/bio-attribute-ontology/master/src/mappings/oba-efo.sssom.tsv
https://raw.githubusercontent.com/obophenotype/bio-attribute-ontology/master/src/mappings/oba-vt.sssom.tsv
https://github.com/biopragmatics/biomappings/raw/refs/heads/master/src/biomappings/resources/mappings.tsv

In all currently configured outputs, skos:exactMatch mappings cause clique merging. In `ebi_monarch_xspecies`, semapv:crossSpeciesExactMatch mappings also cause clique merging, so that, for example, corresponding HP and MP terms share a single graph node. As this is not always desirable, a separate graph, `ebi_monarch`, is also provided, in which semapv:crossSpeciesExactMatch mappings are represented as edges instead.
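
The effect of the choice of merging predicates can be illustrated with a small union-find sketch. This is an illustration only and not the pipeline's actual code: the `Mapping` struct and the `find` and `merge_cliques` functions are assumptions made for this example.

```rust
use std::collections::{BTreeSet, HashMap};

/// One mapping row: subject, predicate, object (hypothetical shape).
struct Mapping<'a> {
    subject: &'a str,
    predicate: &'a str,
    object: &'a str,
}

/// Follow parent links until reaching the representative of `id`,
/// registering `id` as its own representative if it is new.
fn find(parent: &mut HashMap<String, String>, id: &str) -> String {
    let mut current = id.to_string();
    loop {
        let next = parent
            .entry(current.clone())
            .or_insert_with(|| current.clone())
            .clone();
        if next == current {
            return current;
        }
        current = next;
    }
}

/// Merge identifiers linked by one of `merging_predicates` into cliques.
/// Mappings with any other predicate are returned as plain edges.
fn merge_cliques<'a>(
    mappings: &[Mapping<'a>],
    merging_predicates: &[&str],
) -> (Vec<BTreeSet<String>>, Vec<(String, String, String)>) {
    let mut parent: HashMap<String, String> = HashMap::new();
    let mut edges = Vec::new();
    for m in mappings {
        let a = find(&mut parent, m.subject);
        let b = find(&mut parent, m.object);
        if merging_predicates.contains(&m.predicate) {
            if a != b {
                // Union: point one representative at the other.
                parent.insert(a, b);
            }
        } else {
            edges.push((
                m.subject.to_string(),
                m.predicate.to_string(),
                m.object.to_string(),
            ));
        }
    }
    // Group every identifier under its representative.
    let ids: Vec<String> = parent.keys().cloned().collect();
    let mut groups: HashMap<String, BTreeSet<String>> = HashMap::new();
    for id in ids {
        let root = find(&mut parent, &id);
        groups.entry(root).or_default().insert(id);
    }
    let mut cliques: Vec<BTreeSet<String>> = groups.into_values().collect();
    cliques.sort();
    (cliques, edges)
}
```

Given two mappings with placeholder identifiers, `HP:1 skos:exactMatch UMLS:1` and `HP:1 semapv:crossSpeciesExactMatch MP:1`, passing both predicates as `merging_predicates` yields a single clique of three identifiers, which corresponds to the `ebi_monarch_xspecies` behaviour. Passing only `skos:exactMatch` yields two cliques plus one edge, which corresponds to `ebi_monarch`.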

Full list of datasources

| Datasource | Loaded from |
| --- | --- |
| IMPC | EBI |
| GWAS Catalog | EBI |
| OLS | EBI |
| OpenTargets | EBI |
| Metabolights | EBI |
| ChEMBL | EBI |
| Reactome | EBI, MONARCH |
| BGee | MONARCH |
| BioGrid | MONARCH |
| Gene Ontology (GO) Annotation Database | MONARCH |
| HGNC (HUGO Gene Nomenclature Committee) | MONARCH |
| Human Phenotype Ontology Annotations (HPOA) | MONARCH |
| NCBI Gene | MONARCH |
| PHENIO | MONARCH |
| PomBase | MONARCH |
| ZFIN | MONARCH |
| MedGen | MONARCH |
| Protein ANalysis THrough Evolutionary Relationships (PANTHER) | MONARCH, ROBOKOP |
| STRING | MONARCH, ROBOKOP |
| Comparative Toxicogenomics Database (CTD) | MONARCH, ROBOKOP |
| Alliance of Genome Resources | MONARCH, ROBOKOP |
| BINDING | ROBOKOP |
| CAM KG | ROBOKOP |
| The Comparative Toxicogenomics Database (CTD) | ROBOKOP |
| Drug Central | ROBOKOP |
| The Alliance of Genome Resources | ROBOKOP |
| The Genotype-Tissue Expression (GTEx) portal | ROBOKOP |
| Guide to Pharmacology database (GtoPdb) | ROBOKOP |
| Hetionet | ROBOKOP |
| HMDB | ROBOKOP |
| Human GOA | ROBOKOP |
| Integrated Clinical and Environmental Exposures Service (ICEES) KG | ROBOKOP |
| IntAct | ROBOKOP |
| Protein ANalysis THrough Evolutionary Relationships (PANTHER) | ROBOKOP |
| Pharos | ROBOKOP |
| STRING | ROBOKOP |
| Text Mining Provider KG | ROBOKOP |
| Viral Proteome | ROBOKOP |
| AOPWiki | AOPWikiRDF |
| Ubergraph | |
| MeSH | |
| Human Reference Atlas KG | |

Implementation

The pipeline is implemented as Rust programs with simple CLIs, orchestrated with Nextflow. Input KGs arrive in a variety of formats, including KGX, RDF, and JSONL. After loading, a simple "brute-force" integration strategy is applied:

All strings that begin with any IRI or CURIE prefix from the Bioregistry are canonicalised to the standard CURIE form
All property values that are the identifier of another node in the graph become edges
Cliques of equivalent nodes are merged into single nodes
Cliques of equivalent properties are merged into single properties (and for ontology-defined properties, the qualified safe labels are used)
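
The first two steps above can be sketched roughly as follows. This is a simplified illustration and not the pipeline's code: the `PREFIXES` table is a tiny hand-written stand-in for the Bioregistry, and the `Node` struct and the `canonicalise` and `values_to_edges` functions are assumptions made for this example. Matching here is case-sensitive, which may differ from the real behaviour.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hand-written stand-in for the Bioregistry (assumption):
/// (IRI or CURIE prefix, canonical CURIE prefix).
const PREFIXES: &[(&str, &str)] = &[
    ("http://purl.obolibrary.org/obo/HP_", "HP:"),
    ("http://purl.obolibrary.org/obo/MONDO_", "MONDO:"),
    ("hp:", "HP:"),
];

/// Rewrite a string to its canonical CURIE form if it starts with a
/// known prefix; otherwise return it unchanged.
fn canonicalise(value: &str) -> String {
    for (prefix, curie_prefix) in PREFIXES {
        if let Some(local_id) = value.strip_prefix(*prefix) {
            return format!("{}{}", curie_prefix, local_id);
        }
    }
    value.to_string()
}

/// A node with an identifier and string-valued properties (hypothetical shape).
struct Node {
    id: String,
    properties: BTreeMap<String, Vec<String>>,
}

/// Canonicalise identifiers and property values, then turn every property
/// value that is the identifier of another node into an edge of the form
/// (source, property, target). Values that match no node are not returned
/// here; they would simply stay as ordinary property values.
fn values_to_edges(nodes: &[Node]) -> Vec<(String, String, String)> {
    let ids: BTreeSet<String> = nodes.iter().map(|n| canonicalise(&n.id)).collect();
    let mut edges = Vec::new();
    for node in nodes {
        let source = canonicalise(&node.id);
        for (property, values) in &node.properties {
            for value in values {
                let target = canonicalise(value);
                if target != source && ids.contains(&target) {
                    edges.push((source.clone(), property.clone(), target));
                }
            }
        }
    }
    edges
}
```

In this sketch, a node whose property holds the string `hp:0000118` gains an edge to the node loaded as `http://purl.obolibrary.org/obo/HP_0000118`, because both strings canonicalise to `HP:0000118`. Clique merging would then run over the resulting graph.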

The primary output of the pipeline is a property graph for Neo4j. The nodes and edges are also loaded into Solr for full-text search and RocksDB for id->object resolution.
