This repository contains source code to learn dense semantic representations for biomedical entities and pairs of entities as used in Sänger and Leser: "Large-scale Entity Representation Learning for Biomedical Relationship Extraction" (Bioinformatics, 2020).
The approach performs biomedical relation extraction at corpus level, based on entity and entity pair embeddings learned on the complete PubMed corpus. To this end, we consider all articles that mention a certain biomedical entity (e.g. mutation V600E) or pair of entities within the article title or abstract. We concatenate all articles mentioning the entity or entity pair and apply paragraph vectors (Le and Mikolov, 2014) to learn an embedding for each distinct entity or pair of entities.
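As a rough illustration of the concatenation step described above, the following sketch groups article texts by the entity they mention and builds one pseudo-document per entity. The toy articles, the article IDs, and the function name `build_entity_documents` are assumptions made for this example and are not part of the repository. In the actual pipeline, a paragraph vector model would then be trained on such documents, with the entity ID serving as the document tag.

```python
from collections import defaultdict

# Toy input (hypothetical): article ID -> (text, entity IDs annotated in title/abstract)
articles = {
    "pmid-1": ("first abstract mentioning V600E", ["rs113488022"]),
    "pmid-2": ("second abstract mentioning V600E", ["rs113488022"]),
    "pmid-3": ("abstract mentioning another variant", ["rs121913227"]),
}


def build_entity_documents(articles):
    """Concatenate the texts of all articles that mention an entity."""
    texts_by_entity = defaultdict(list)
    # Iterate in sorted order so the concatenation is deterministic
    for _, (text, entity_ids) in sorted(articles.items()):
        for entity_id in set(entity_ids):  # use each article once per entity
            texts_by_entity[entity_id].append(text)
    return {entity_id: " ".join(texts) for entity_id, texts in texts_by_entity.items()}


entity_documents = build_entity_documents(articles)
for entity_id, document in sorted(entity_documents.items()):
    print(entity_id, "->", document)
```

Each resulting pseudo-document corresponds to one distinct entity, which is why the learned model ends up with one embedding per entity ID.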
Content: Usage | Pre-trained Entity Embeddings | Embedding Training | Supported Entity Types | Citation | Acknowledgements |
The implementation of the embeddings is based on Gensim. The following snippet highlights the basic use of the pre-trained embeddings.
from gensim.models import KeyedVectors
# Loading pre-trained entity model
model = KeyedVectors.load("mutation-v0500.bin")
# Print number of distinct entities of the model
# (Gensim < 4.0; with Gensim >= 4.0 use len(model.key_to_index) instead)
print(f"Distinct entities: {len(model.vocab)}\n")
# Get the embedding for a specific entity
entity_embedding = model["rs113488022"]
print(f"Embedding of rs113488022:\n{entity_embedding}\n")
# Find similar entities
print("Most similar entities to rs113488022:")
top5_nearest_neighbors = model.most_similar("rs113488022", topn=5)
for i, (entity_id, sim) in enumerate(top5_nearest_neighbors):
    print(f" {i+1}: {entity_id} (similarity: {sim:.3f})")
This should output:
Distinct entities: 47498
Embedding of rs113488022:
[ 1.15715809e-01 4.90018785e-01 -6.05004542e-02 -8.35603476e-02
9.20398310e-02 -1.51171118e-01 4.01901715e-02 -2.36775234e-01
Most similar entities to rs113488022:
1: rs121913227 (similarity: 0.690)
2: rs121913364 (similarity: 0.628)
3: rs121913529 (similarity: 0.610)
4: rs121913357 (similarity: 0.573)
5: rs11554290 (similarity: 0.571)
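The similarity scores above are cosine similarities between embedding vectors, which is what Gensim's `most_similar` reports. A minimal sketch of that computation in plain Python, using short made-up vectors rather than the real embeddings (the helper name `cosine_similarity` is chosen for this example):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally sized, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 4-dimensional vectors for illustration only
vec_a = [1.0, 2.0, 0.0, -1.0]
vec_b = [2.0, 4.0, 0.0, -2.0]  # points in the same direction as vec_a
vec_c = [2.0, -1.0, 3.0, 0.0]  # orthogonal to vec_a

print(f"a vs. b: {cosine_similarity(vec_a, vec_b):.3f}")
print(f"a vs. c: {cosine_similarity(vec_a, vec_c):.3f}")
```

Because the score depends only on the direction of the vectors, entities with embeddings of different magnitude can still be ranked as close neighbors.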
Entity Type | Identifier | #Entities | Vocabulary | v500 | v1000 | v1500 | v2000 |
Cell line | Cellosaurus ID | 4,654 | Vocab | Vectors | Vectors | Vectors | Vectors |
Chemical | MeSH | 109,716 | Vocab | Vectors | Vectors | Vectors | Vectors |
Disease | MeSH | 10,712 | Vocab | Vectors | Vectors | Vectors | Vectors |
Disease | DOID | 3,157 | Vocab | Vectors | Vectors | Vectors | Vectors |
Drug | Drugbank ID | 5,966 | Vocab | Vectors | Vectors | Vectors | Vectors |
Gene | NCBI Gene ID | 171,686 | Vocab | Vectors | Vectors | Vectors | Vectors |
Mutation | RS-Identifier | 47,498 | Vocab | Vectors | Vectors | Vectors | Vectors |
Species | NCBI Taxonomy | 176,989 | Vocab | Vectors | Vectors | Vectors | Vectors |
To compute entity and entity pair embeddings, we use the complete PubMed corpus together with the data and entity annotations provided by PubTator Central.
- Download annotations from PubTator Central:
python --resources pubtator_central
Note: The annotation data requires > 70GB of disk space.
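Since the download is large, it may be worth checking the available disk space beforehand. The following is a small sketch using only the Python standard library; the helper name `has_enough_space` and the use of the current directory as target are assumptions for this example, and the 70 GB threshold is taken from the note above.

```python
import shutil

REQUIRED_GB = 70  # lower bound taken from the note above


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return (enough, free_gb) for the file system that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb > required_gb, free_gb


enough, free_gb = has_enough_space(".")
status = "ok" if enough else "insufficient"
print(f"Free space: {free_gb:.1f} GB ({status})")
```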
Learning entity embeddings can be done in two steps:
- Prepare entity annotations:
python --working_dir _out --entity_type mutation
We support entity types cell line, chemical, disease, drug, gene, mutation, and species.
- Run representation learning:
python --input_file _out/mutation/doc2vec_input.txt \
--config_file ../resources/configurations/doc2vec-0500.config \
--model_name mutation-v0500 \
--output_dir _out/mutation
Example configurations can be found in resources/configurations.
To learn entity pair embeddings, preparation of the entity annotations has to be performed first (see above). Analogously to the entity embeddings, learning of pair embeddings is performed in two steps:
- Prepare pair annotations:
python --working_dir _out --source_type mutation --target_type disease
We support entity types disease, drug, and mutation.
- Run representation learning:
python --input_file _out/mutation-disease/doc2vec_input.txt \
--config_file ../resources/configurations/doc2vec-0500.config \
--model_name mutation-disease-v0500 \
--output_dir _out/mutation-disease
Example configurations can be found in resources/configurations.
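To illustrate what the pair preparation conceptually produces, the sketch below collects, for each (source, target) entity pair, the articles that mention both entities. The data layout, the article IDs, and the name `collect_pair_articles` are assumptions for this example; the repository scripts may organize this differently.

```python
from collections import defaultdict
from itertools import product

# Hypothetical annotations: article ID -> entity IDs per entity type
annotations = {
    "pmid-1": {"mutation": ["rs113488022"], "disease": ["MESH:D006984"]},
    "pmid-2": {"mutation": ["rs113488022", "rs121913227"], "disease": ["MESH:D006984"]},
    "pmid-3": {"mutation": ["rs121913227"], "disease": []},
}


def collect_pair_articles(annotations, source_type, target_type):
    """Map each co-occurring (source, target) pair to the articles mentioning both."""
    pair_articles = defaultdict(list)
    for article_id in sorted(annotations):
        entities = annotations[article_id]
        sources = sorted(set(entities.get(source_type, [])))
        targets = sorted(set(entities.get(target_type, [])))
        # Every source/target combination within one article counts as a co-mention
        for pair in product(sources, targets):
            pair_articles[pair].append(article_id)
    return dict(pair_articles)


pairs = collect_pair_articles(annotations, "mutation", "disease")
for (source, target), article_ids in sorted(pairs.items()):
    print(f"{source} | {target}: {article_ids}")
```

Articles that mention only one of the two entity types (such as `pmid-3` here) contribute to no pair, which is why far fewer pairs than entities typically receive an embedding.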
Entity Type | Identifier | Example |
Cell line | Cellosaurus ID | CVCL:0027 (Hep-G2) |
Chemical | MeSH | MESH:D000068878 (Trastuzumab) |
Disease | MeSH | MESH:D006984 (hypertrophic chondrocytes) |
Disease | Disease Ontology ID (DOID) [1] | DOID:60155 (visual agnosia) |
Drug | Drugbank ID | DB00166 (lipoic acid) |
Gene | NCBI Gene ID | NCBI:673 (BRAF) |
Mutation | RS-Identifier | rs113488022 (V600E) |
Species | NCBI Taxonomy | TAXON:9606 (human) |
1: Use the option "--entity_type disease-doid" when preparing the entity annotations to normalize disease annotations to the Disease Ontology.
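When working with several vocabularies at once, it can help to infer the possible entity types from the shape of an identifier. The sketch below derives simple patterns from the example identifiers in the table above; the patterns and the function name `candidate_entity_types` are assumptions for illustration and may not cover every identifier in the vocabularies.

```python
import re

# Patterns derived from the example identifiers in the table above (illustrative only)
ID_PATTERNS = [
    (re.compile(r"^CVCL:\w+$"), ["cell line"]),
    (re.compile(r"^MESH:\w+$"), ["chemical", "disease"]),  # MeSH is used for both types
    (re.compile(r"^DOID:\d+$"), ["disease"]),
    (re.compile(r"^DB\d+$"), ["drug"]),
    (re.compile(r"^NCBI:\d+$"), ["gene"]),
    (re.compile(r"^rs\d+$"), ["mutation"]),
    (re.compile(r"^TAXON:\d+$"), ["species"]),
]


def candidate_entity_types(identifier):
    """Return the entity types whose identifier scheme matches, or an empty list."""
    for pattern, entity_types in ID_PATTERNS:
        if pattern.match(identifier):
            return entity_types
    return []


for identifier in ["rs113488022", "MESH:D006984", "DB00166", "TAXON:9606"]:
    print(identifier, "->", candidate_entity_types(identifier))
```

Note that a MeSH identifier alone does not determine whether an entry is a chemical or a disease, so the function returns both candidates in that case.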
Please use the following BibTeX entry to cite our work:
title={Large-scale Entity Representation Learning for Biomedical Relationship Extraction},
author={S{\"a}nger, Mario and Leser, Ulf},
publisher={Oxford University Press}
We use the annotations from PubTator Central to compute the entity embeddings. For further details, refer to:
Wei, Chih-Hsuan, et al. "PubTator central: automated concept annotation for biomedical full text articles." Nucleic acids research 47.W1 (2019): W587-W593.
We use information from the Disease Ontology to normalize disease annotations. For further details, refer to:
Schriml, Lynn M., et al. "Human Disease Ontology 2018 update: classification, content and workflow expansion." Nucleic acids research 47.D1 (2019): D955-D962.
We use the paragraph vectors model to perform entity representation learning. For further details, refer to:
Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. 2014.