Enable date context to improve lookup accuracy #79

Open
dhimmel opened this issue Feb 28, 2022 · 1 comment


dhimmel commented Feb 28, 2022

gilda.ground has a context argument to provide textual context for an entity. Date context, i.e. when the source content was created, might also help with disambiguation in some cases. For example, Figure 2B from https://doi.org/10.1186/gb-2006-7-5-402 shows how the relative use of two gene symbols for ncbigene:5817 has changed over time:

[Image: Figure 2B from the cited paper, showing the relative use over time of the symbols PVR and CD155]

> A similar case is illustrated in Figure 2b for the gene encoding the poliovirus receptor. In the mid-1990s, the only symbol used was PVR (which is today the official name for the gene). The alternative name CD155 for the protein appeared for the first time in 1997, but gained greater acceptance after the publication in the late nineties of several articles describing structural aspects of the CD155 protein [9] that are critical to the interaction with the virus (CD nomenclature for cell-surface proteins follows a long established standard nomenclature). These articles named the gene as CD155, and this has been the preferred name since then. In this case, HUGO nomenclature apparently did not take this fact into account, since the establishment of PVR as the official gene name took place in 2003.

Now let's imagine that CD155 also happened to be a synonym for another gene that had been in use prior to 1997 (it is not, this is a fictional example). If we were to ground CD155 with a date from 1995, we would know it was referring to the other gene and not PVR.
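
To make the fictional scenario concrete, here is a minimal sketch of date-aware candidate filtering. Everything in it is an assumption for illustration: the `SynonymUsage` schema, the `candidates_for` helper, the year ranges, and the gene `OTHER_GENE` are all made up, and none of this is part of Gilda's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SynonymUsage:
    """One period during which a symbol referred to a gene (hypothetical schema)."""
    symbol: str
    gene: str
    first_year: int
    last_year: Optional[int] = None  # None means the symbol is still in use


# Fictional data mirroring the made-up scenario; OTHER_GENE does not exist
# and the year boundaries are illustrative only.
USAGE = [
    SynonymUsage("CD155", "OTHER_GENE", 1985, 1996),
    SynonymUsage("CD155", "PVR", 1997),
    SynonymUsage("PVR", "PVR", 1990),
]


def candidates_for(symbol: str, year: Optional[int] = None) -> List[str]:
    """Return genes the symbol may refer to, narrowed by year when one is given."""
    matches = [u for u in USAGE if u.symbol == symbol]
    if year is not None:
        matches = [
            u for u in matches
            if u.first_year <= year and (u.last_year is None or year <= u.last_year)
        ]
    return sorted({u.gene for u in matches})
```

Without a date, `candidates_for("CD155")` returns both genes and the symbol stays ambiguous; with `year=1995` only the fictional `OTHER_GENE` remains. Hard year boundaries are the simplest possible model; real usage would more likely overlap and fade gradually.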

I'm mostly thinking about this in terms of genes/proteins, but it could also be useful in terms of disease. The idea came about when I was manually mapping a resource that tended to use protein names from a decade ago and realized it would have been a lot easier to map things had it actually been 10 years ago.

I don't expect temporal trends or boundaries in synonym usage to be easy to construct, so mostly posting this issue to generate discussion and to accumulate notes. It definitely would be a major undertaking on the scale of an entire research study, so not expecting anything to be done.

bgyori (Member) commented Mar 2, 2022

That's a really interesting idea @dhimmel! I've thought about this issue myself. I think the bottleneck here might be to create a resource file for HGNC that comes with "quantitative" temporal resolution. Though the plot above is really interesting, it just shows the relative proportion of two non-ambiguous synonyms for a single gene.

As you also point out in

> Now let's imagine that CD155 also happened to be a synonym for another gene that had been in use prior to 1997 (it is not, this is a fictional example). If we were to ground CD155 with a date from 1995, we would know it was referring to the other gene and not PVR.

the more complicated issue is when multiple genes have overlapping names/synonyms that change over time, and there are many examples of this in practice. Currently, we do keep track of and take into account withdrawn and obsolete gene symbols (for scoring purposes, there is a separate qualitative category for these, distinct from "name" and "synonym"), but it is not clear how the usage of each synonym has changed over time, especially compared to other genes that might have matching synonyms. My big-picture take is that disambiguation based on textual context is likely to be more robust and generalizable (e.g., to other entity types). Still, in the special case where you're grounding labels in a data table rather than entity strings appearing in running text, such context might be missing.
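
As a rough illustration of what a resource with "quantitative" temporal resolution might enable, the sketch below turns per-year mention counts into per-gene shares for an ambiguous symbol. The `MENTIONS` table, its counts, the gene `OTHER_GENE`, and the `gene_shares` helper are all hypothetical; no such resource exists for HGNC as far as this thread establishes, and this is not how Gilda scores matches.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Hypothetical resource: (symbol, gene, year) -> number of mentions that year.
# All counts and the gene OTHER_GENE are invented for illustration.
MENTIONS: Dict[Tuple[str, str, int], int] = {
    ("CD155", "OTHER_GENE", 1995): 30,
    ("CD155", "PVR", 1995): 0,
    ("CD155", "OTHER_GENE", 2005): 5,
    ("CD155", "PVR", 2005): 45,
}


def gene_shares(symbol: str, year: int) -> Dict[str, float]:
    """Share of a symbol's mentions in a given year attributed to each gene."""
    counts: Dict[str, int] = defaultdict(int)
    for (sym, gene, yr), n in MENTIONS.items():
        if sym == symbol and yr == year:
            counts[gene] += n
    total = sum(counts.values())
    if total == 0:
        return {}  # no temporal evidence; fall back to other signals
    return {gene: n / total for gene, n in counts.items()}
```

Shares like these could in principle act as a date-dependent prior alongside the existing name/synonym/withdrawn categories, but building the counts across genes with matching synonyms is exactly the hard part described above.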
