Merge pull request #3 from cmungall/nico-finalise
A bit of updates to the narrative
matentzn authored Sep 7, 2023
2 parents a5bc489 + 435541a commit 932bff3
Showing 5 changed files with 103 additions and 43 deletions.
9 changes: 9 additions & 0 deletions README.md
@@ -117,3 +117,12 @@ All other files are only available under CC BY 4.0, including:
+ `*.docx`

Please open [an issue](https://github.com/cmungall/gpt-mapping-manuscript/issues) for any question related to licensing.


## Mapping GPT instructions

If you want to run the makefile:

1. Create a new Python environment and activate it
1. Install prerequisites. Run `make help` for guidance on this.
1. Set the OpenAI API key: `runoak set-apikey -e openai sk-$(KEY)`
21 changes: 21 additions & 0 deletions build/requirements.txt
@@ -0,0 +1,21 @@
cffi==1.15.0
errorhandler==2.0.1
git+https://github.com/manubot/manubot@d4242ffa4194e4a13a68c5f6466feff559d3f9d5
isbnlib==3.10.10
opentimestamps-client==0.7.1
opentimestamps==0.4.3
pandoc-eqnos==2.5.0
pandoc-fignos==2.4.0
pandoc-tablenos==2.3.0
pandoc-xnos==2.5.0
pandocfilters==1.5.0
panflute==2.2.3
psutil==5.9.4
pybase62==0.5.0
python-bitcoinlib==0.11.2
pyyaml==6.0
papermill
plotly
seaborn
ontogpt==0.2.9
notebook==7.0.3
17 changes: 16 additions & 1 deletion content/01.abstract.md
@@ -1,3 +1,18 @@
## Abstract {.page_break_before}

Mapping...
Aligning terminological resources, including ontologies, controlled vocabularies and taxonomies, is a critical part of data integration in many domains such as healthcare, chemistry and biomedical research.

Entity mapping is the process of determining correspondences between entities across these resources, such as gene identifiers, disease concepts or chemical entity identifiers. Many tools have been developed to compute such mappings based on common structural features and lexical information such as labels and synonyms. Lexical approaches in particular often provide very high recall, but low precision, due to lexical ambiguity.

Large Language Models (LLMs), such as the ones employed by ChatGPT, have generalizable abilities to perform a wide range of tasks, including question answering and
information extraction.

Here we present *MapperGPT*, an approach based on LLMs to refine and predict mapping relationships
as a post-processing step that works in
concert with existing high-recall methods based on lexical and structural heuristics.

We evaluated *MapperGPT* on a series of alignment tasks from different domains, including anatomy, developmental
biology, and renal diseases.
We devised a collection of tasks that are designed to be particularly challenging
for lexical methods. We show that, when used in combination with high-recall methods,
*MapperGPT* can provide a substantial improvement in accuracy, beating state-of-the-art methods such as LogMap.
66 changes: 37 additions & 29 deletions content/02.introduction.md
@@ -1,14 +1,23 @@
## Introduction

When do two identifiers indicate the same thing? Linking the same or related entities at scale is crucial
for knowledge base and ontology integration. For example, if two different disease databases, one with
information about disease genes and the other with information about disease symptoms, are to be merged,
then it is important to precisely know which disease in one database corresponds to which disease in the other.


A common method to automate ontology matching is to use lexical methods, in particular matching
on primary or alternative labels that have been assigned to concepts, sometimes in combination with
lexical normalization. These can often provide very high recall, but low precision, due to lexical ambiguity.
Tackling global challenges, including rare disease and climate change, requires the integration of a large number of disparate data sources.
Due to the decentralised nature of data standardisation,
where different data providers inevitably employ
different controlled vocabularies and ontologies to standardise their data,
it becomes crucial to integrate such "semantic spaces" (i.e. data spaces that are described using divergent sets of ontologies).

Linking entities across often huge semantic spaces at scale is crucial.
For example, integrating genetic associations for disease provided by a disease data resource such as the Online Mendelian Inheritance in Man (OMIM) with the phenotypic associations to the same disease provided by Orphanet
requires mapping different disease identifiers that refer to the exact same real-world disease concept.
Manually mapping thousands of disease concepts across two semantic spaces is potentially feasible,
but in the real world, dozens of resources providing information about the same data type (diseases, genes, environments, organisms) need to be integrated,
which makes a purely manual approach infeasible.

Semantic entity matching is the process of associating a term or identifier A in one semantic space
to one or more terms or identifiers B in another, where A and B refer to the same or related real-world concepts.
A common method to automate semantic entity matching is to use lexical methods, in particular matching
on primary or alternative labels (synonyms) that have been assigned to concepts, sometimes in combination with
lexical normalization. These methods can often provide very high recall, but low precision, due to lexical ambiguity.
Examples are provided in @tbl:example-matches, including a false match between an aeroplane part and an
insect part that share the same name (wing) due to analogous function.

@@ -23,45 +32,44 @@ Table: Example of entity matching problem
{#tbl:example-matches}
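The failure mode in the table can be made concrete with a minimal sketch of label-based lexical matching. This is an illustration only, not the LOOM algorithm; the toy ontologies and identifiers are invented for the example.

```python
# Two toy "ontologies": concept ID -> set of labels/synonyms (invented data).
aircraft = {"AC:1": {"wing", "aerofoil"}, "AC:2": {"fuselage"}}
insect = {"IN:1": {"wing"}, "IN:2": {"thorax"}}

def lexical_matches(onto_a, onto_b):
    """Return all cross-ontology pairs sharing a case-normalized label."""
    matches = []
    for id_a, labels_a in onto_a.items():
        for id_b, labels_b in onto_b.items():
            if {l.lower() for l in labels_a} & {l.lower() for l in labels_b}:
                matches.append((id_a, id_b))
    return matches

# The shared label "wing" yields a match that is a false positive in reality.
print(lexical_matches(aircraft, insect))  # [('AC:1', 'IN:1')]
```

Every shared label produces a candidate pair, which is why recall is high; nothing in the procedure checks whether the concepts actually refer to the same thing, which is why precision is low.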

An example of this approach is the LOOM algorithm used in the Bioportal ontology
resource, which provides very high recall, including NNN mappings across over a thousand vocabularies.
resource [@pubmed:20351849], which provides very high recall mappings across over a thousand ontologies and other controlled vocabularies.

A number of approaches can give higher precision mappings, many of these make use of other relationships
or properties in the ontology. The Ontology Alignment Evaluation Initiative (OAEI) provides a yearly
evaluation of different methods for ontology matching. One of the top-performing methods in OAEI is the LogMap
tool, which makes use of logical axioms in the ontology to assist in mapping.

A number of tools such as LogMap have been used to build or link ontologies and knowledge bases. However,
these approaches are usually used in conjunction with manual curation of mappings, which can be
resource intensive.

Deep learning approaches and in particular Language Models (LMs) have been applied to ontology matching tasks.
Some methods make use of embedding distance OntoEmma, e.g [Wang
et al., 2018], DeepAlignment [Kolyvakis et al., 2018],
VeeAlign [Iyer et al., 2020]. More recently the Truveta Mapper [@doi:10.48550/arXiv.2301.09767]
Some methods make use of embedding distance, e.g. OntoEmma [@doi:10.48550/arXiv.1806.07976],
DeepAlignment [@doi:10.18653/v1/N18-1072],
VeeAlign [@doi:10.48550/arXiv.2010.11721].
More recently the Truveta Mapper [@doi:10.48550/arXiv.2301.09767]
treats matching as a Translation task and involves pre-training on ontology structures.

The most recent developments in LMs are so-called Large Language Models (LLMs), exemplified by ChatGPT,
which involve billions of parameters and pre-training on instruction-prompting tasks. The resulting
models have generalizable abilities to perform a wide range of tasks, including question answering and
information extraction. However, one challenge with LLMs is the problem of *hallucination*.
information extraction. However, one challenge with LLMs is the problem of *hallucination*. A hallucination describes a situation where an AI model “fabricates” information that does not directly correspond to the provided input.

One possibility is using GPT to generate mappings de-novo. However, the problem of hallucination makes this
highly unreliable, in particular, due to the propensity for LLMs to hallucinate database or ontology identifiers
Given their performance on tasks related to the understanding and generation of natural language, it seems
natural to employ LLMs directly as a powerful, scalable alternative to
current state-of-the-art (SOTA) methods for entity matching.
One possibility is using LLMs like the ones employed by ChatGPT to generate mappings de-novo. However, the problem of hallucination makes this unreliable, in particular, due to the propensity for LLMs to hallucinate database or ontology identifiers
when these are requested.

We devised an alternative approach called *MapperGPT* that does not use GPT to generate mappings de-novo, but instead works in
concert with existing high-recall methods such as LOOM. We use GPT to refine and predict relationships
(predicates) as a post-processing step. We use an in-context knowledge-driven semantic approach, in
which examples of different mapping categories are provided, and information about the two concepts in a mapping
is provided.
We devised an alternative approach called *MapperGPT* that does not use GPT models to generate mappings de-novo, but instead works in
concert with existing high-recall methods such as LOOM [@pubmed:20351849]. We use a GPT model to refine and predict relationships
(predicates) as a post-processing step,
essentially for the purpose of isolating and removing false positive mappings.
We use an in-context knowledge-driven semantic approach, in
which examples of different mapping categories, together with information about the two concepts in a mapping, are provided to the model to determine an appropriate mapping relationship. We use the Simple Standard for Sharing Ontological Mappings (SSSOM) [@doi:10.1093/database/baac035] for sharing and comparing entity mappings across systems.
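The in-context approach can be sketched as a prompt builder that concatenates few-shot examples with the query pair. The prompt wording and the category set below are illustrative assumptions, not the actual MapperGPT prompt.

```python
# Assumed category set for illustration; the real prompt may differ.
CATEGORIES = ["EXACT", "BROAD", "NARROW", "RELATED", "DIFFERENT"]

def build_prompt(concept_a, concept_b, examples):
    """Assemble an in-context prompt: instructions, few-shot examples, query pair."""
    lines = ["Categorize the relationship between two concepts as one of: "
             + ", ".join(CATEGORIES) + "."]
    for ex in examples:
        lines.append(f"Concept A: {ex['a']}\nConcept B: {ex['b']}\ncategory: {ex['category']}")
    # The query pair is left open for the model to complete.
    lines.append(f"Concept A: {concept_a}\nConcept B: {concept_b}\ncategory:")
    return "\n\n".join(lines)

prompt = build_prompt(
    "PC (pigment cell), part of: eye",
    "PC (pillar cell), part of: ear",
    examples=[{"a": "hindbrain", "b": "rhombencephalon", "category": "EXACT"}],
)
print(prompt)
```

The key design point is that the model never generates identifiers; it only classifies a pair that a high-recall method has already proposed, which sidesteps the identifier-hallucination problem described above.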

We evaluated this on a series of alignment tasks from different domains, including anatomy, developmental
biology, and renal diseases. We devised a collection of tasks that are designed to be particularly challenging
for lexical methods. We show that when used in combination with high-recall methods such as LOOM or LexMatch,
for lexical methods. We show that when used in combination with high-recall methods such as LOOM or OAK LexMatch [@doi:10.5281/zenodo.8310471],
MapperGPT can provide a substantial improvement in accuracy, beating SOTA methods such as LogMap.

Our contributions are as follows:

- creation of a series of new matching tasks expressed using the SSSOM standard
- An algorithm and tool MapperGPT that uses GPT to predict relationships between concepts

- The creation of a series of new matching tasks expressed using the SSSOM standard
- An algorithm and tool, *MapperGPT*, that uses a GPT model to predict relationships between concepts
33 changes: 20 additions & 13 deletions content/03.methods.md
@@ -96,7 +96,7 @@ The result object can be exported to SSSOM format.

### Example

As an example, two concepts from the fruitfly and zebra fish anatomy ontologies are candidate matches
As an example, two concepts from the Drosophila (fruitfly) and zebrafish anatomy ontologies [@doi:10.1186/2041-1480-4-32, @doi:10.1186/2041-1480-5-12] are candidate matches
due to sharing a lexical element (the "PC" abbreviation). This is a false positive match in reality,
as the concepts are entirely different.

@@ -121,7 +121,7 @@ relationship: start stage unknown
relationship: end stage adult
```
The payload for this using gpt-3.5-turbo is:
The response for this using gpt-3.5-turbo (August 2023) is:
```yaml
category: DIFFERENT
```
@@ -149,32 +149,39 @@ similarities and differences may prove informative.
### Implementation
We use the OAK library to connect to a variety of ontologies in OBO and Bioportal.
The overall framework is implemented in OntoGPT.
The input is an SSSOM file. The output is SSSOM with predicate_id filled with predicted value:
We use the Ontology Access Kit (OAK) library [@doi:10.5281/zenodo.8310471] to connect to a variety of ontologies in the Open Biological and Biomedical Ontology (OBO) Library [@doi:10.1093/database/baab069] and Bioportal [@pubmed:19483092].
OAK generally enables accessing ontologies, but we also make use of its ability to extract subsets of ontologies,
perform lexical matching using OAK LexMatch, extract mappings from ontologies and ontology portals such as Bioportal, and
add labels to mapping tables which typically only include the mapped identifiers, for better readability. We make use of ROBOT [@doi:10.1186/s12859-019-3002-3] for converting between different ontology formats.
The overall mapping framework is implemented in OntoGPT [@doi:10.5281/zenodo.8278168] (https://github.com/monarch-initiative/ontogpt) in a method called `categorize-mappings`, where the input is an SSSOM mapping file (usually generated by a lexical matching tool) and
the output is an SSSOM mapping file with `predicate_id` filled with the predicted value. Example:

```bash
ontogpt categorize-mappings --model gpt-4 -i foo.sssom.tsv -o bar.sssom.tsv
```
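Before the output file is written, the predicted category must be translated into an SSSOM predicate. The sketch below shows one plausible translation; the category names, the `noMapping` placeholder, and the exact predicate table are assumptions, and OntoGPT's actual mapping may differ.

```python
# Hypothetical translation from predicted categories to SKOS mapping predicates.
CATEGORY_TO_PREDICATE = {
    "EXACT": "skos:exactMatch",
    "BROAD": "skos:broadMatch",
    "NARROW": "skos:narrowMatch",
    "RELATED": "skos:relatedMatch",
    "DIFFERENT": None,  # false positive: no valid mapping predicate
}

def fill_predicate(row, category):
    """Fill the SSSOM predicate_id column from a predicted category."""
    predicate = CATEGORY_TO_PREDICATE.get(category)
    row = dict(row)  # avoid mutating the caller's row
    row["predicate_id"] = predicate if predicate else "noMapping"
    return row

row = {"subject_id": "FBbt:00001234", "object_id": "ZFA:0005678", "predicate_id": ""}
print(fill_predicate(row, "DIFFERENT")["predicate_id"])  # noMapping
```

Rows classified `DIFFERENT` are the false positives that the post-processing step is designed to isolate; downstream consumers can filter them out of the mapping table.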

The entire pipeline is implemented as a fully reproducible `Makefile` (https://github.com/cmungall/gpt-mapping-manuscript/blob/main/Makefile).

### Generation of test sets

To evaluate the method, we created a collection of test sets from biological domains.
To evaluate the method, we created a collection of test sets from the biological and biomedical domains.
We chose to devise new test sets as we wanted to base these on up-to-date, precise,
validated mappings derived from ontologies such as Mondo, CL, and Uberon.
validated mappings derived from ontologies such as Mondo [@doi:10.1101/2022.04.13.22273750], Cell Ontology (CL) [@doi:10.1186/s13326-016-0088-7], and the Uberon Anatomy Ontology [@doi:10.1186/gb-2012-13-1-r5].

To generate anatomy test sets, we generated pairwise mappings between species-specific anatomy ontologies,
using the Uberon and CL mappings as the gold standard. i.e. if a pair of concepts are transitively linked
via Uberon or CL, then they are considered a match. We used the same method for developmental stages.
using the Uberon and CL mappings as the gold standard.
If a pair of concepts are transitively linked
via Uberon or CL, then they are considered a match.
For example, UBERON:0000924 (ectoderm) is mapped to FBbt:00000111 (ectoderm (fruitfly)) and ZFA:0000016 (ectoderm (zebrafish)), so we assume that FBbt:00000111 is an exact match to ZFA:0000016.
We used the same method for linking species-specific developmental stage terms.
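The transitive-linking rule for deriving gold-standard matches can be sketched as follows, using the ectoderm example from the text; the data structure here is a simplification of the actual Uberon cross-reference format.

```python
# Species-specific terms cross-referenced to the same Uberon term are treated
# as gold-standard matches to each other (example IDs taken from the text).
uberon_xrefs = {
    "UBERON:0000924": ["FBbt:00000111", "ZFA:0000016"],  # ectoderm
}

def derive_gold_mappings(xrefs):
    """Return cross-species pairs transitively linked via a shared Uberon term."""
    pairs = set()
    for members in xrefs.values():
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                pairs.add((a, b))
    return pairs

print(derive_gold_mappings(uberon_xrefs))  # {('FBbt:00000111', 'ZFA:0000016')}
```

Because the gold standard is derived mechanically from curated cross-references rather than annotated by hand, the same procedure scales to any pair of species-specific ontologies that share a bridging ontology.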

We also generated a renal disease test set by taking all heritable renal diseases from Mondo, all
renal diseases from NCIT, and generating a test set based on validated curated mappings between
Mondo and NCIT.

TODO: more test sets

TODO: table showing sizes
{{main_results_testsetsize}}
Table: Breakdown of the existing test sets.
{#tbl:main_results_testsetsize}

### Tool evaluation

