`gilda.ground` has a `context` argument to provide textual context for an entity. Date context, i.e. when the source content was created, might also help with disambiguation in certain cases. For example, Figure 2B from https://doi.org/10.1186/gb-2006-7-5-402 shows how the relative use of two gene symbols for ncbigene:5817 changed over time:
A similar case is illustrated in Figure 2b for the gene encoding the poliovirus receptor. In the mid-1990s, the only symbol used was PVR (which is today the official name for the gene). The alternative name CD155 for the protein appeared for the first time in 1997, but gained greater acceptance after the publication in the late nineties of several articles describing structural aspects of the CD155 protein [9] that are critical to the interaction with the virus (CD nomenclature for cell-surface proteins follows a long established standard nomenclature). These articles named the gene as CD155, and this has been the preferred name since then. In this case, HUGO nomenclature apparently did not take this fact into account, since the establishment of PVR as the official gene name took place in 2003.
Now let's imagine that CD155 also happened to be a synonym for another gene that had been in use prior to 1997 (it is not, this is a fictional example). If we were to ground CD155 with a date from 1995, we would know it was referring to the other gene and not PVR.
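To make the fictional example concrete, here is a minimal sketch of a date-aware lookup. Everything in it is an assumption for illustration: the `SynonymUsage` record, the `USAGE` table, the `candidates` helper, and the placeholder identifier `example:OTHER_GENE` are hypothetical, and none of this is part of Gilda. The only values taken from the discussion above are ncbigene:5817 (PVR) and 1997 as the year CD155 first appeared.

```python
from typing import List, NamedTuple, Optional


class SynonymUsage(NamedTuple):
    """One (symbol, gene) pairing with the years the symbol was in use."""
    symbol: str
    gene: str
    first_year: int
    last_year: Optional[int]  # None means the symbol is still in use


# Hypothetical usage table. The second row is fictional, as in the
# thought experiment: CD155 is not actually a synonym for another gene.
USAGE = [
    SynonymUsage("CD155", "ncbigene:5817", 1997, None),
    SynonymUsage("CD155", "example:OTHER_GENE", 1985, 1996),
]


def candidates(symbol: str, year: Optional[int] = None) -> List[str]:
    """Return genes the symbol could refer to, optionally filtered by year."""
    matches = [u for u in USAGE if u.symbol == symbol]
    if year is None:
        # No date context: every gene sharing the symbol stays a candidate.
        return [u.gene for u in matches]
    return [
        u.gene
        for u in matches
        if u.first_year <= year and (u.last_year is None or year <= u.last_year)
    ]


print(candidates("CD155", year=1995))
print(candidates("CD155", year=2005))
print(candidates("CD155"))
```

Hard year boundaries like these are the simplest possible model. Real usage overlaps for years, as the PVR/CD155 figure shows, so a practical version would likely need soft weights rather than a filter.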
I'm mostly thinking about this in terms of genes/proteins, but it could also be useful in terms of disease. The idea came about when I was manually mapping a resource that tended to use protein names from a decade ago and realized it would have been a lot easier to map things had it actually been 10 years ago.
I don't expect temporal trends or boundaries in synonym usage to be easy to construct, so mostly posting this issue to generate discussion and to accumulate notes. It definitely would be a major undertaking on the scale of an entire research study, so not expecting anything to be done.
That's a really interesting idea @dhimmel! I've thought about this issue myself. I think the bottleneck here might be to create a resource file for HGNC that comes with "quantitative" temporal resolution. Though the plot above is really interesting, it just shows the relative proportion of two non-ambiguous synonyms for a single gene.
As you also point out in
Now let's imagine that CD155 also happened to be a synonym for another gene that had been in use prior to 1997 (it is not, this is a fictional example). If we were to ground CD155 with a date from 1995, we would know it was referring to the other gene and not PVR.
the more complicated issue arises when multiple genes have overlapping names or synonyms that change over time, and there are many such examples in practice. Currently, we do keep track of and take into account withdrawn and obsolete gene symbols (for scoring purposes, these fall into a separate qualitative category, distinct from "name" and "synonym"), but it is not clear how the usage of each synonym has changed over time, especially compared to other genes that might have matching synonyms. My big-picture take is that disambiguation based on context is likely to be more robust and more generalizable (e.g., to other entity types). Still, in the special case where you're grounding labels in a data table rather than entity strings appearing in running text, such context might be missing.
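As a rough sketch of what a resource with "quantitative" temporal resolution might enable, the snippet below turns per-year mention counts into a prior over genes sharing a symbol and uses it to reweight existing match scores. The `COUNTS` table, the numbers in it, the `example:OTHER_GENE` identifier, and both helper functions are invented for illustration. This is not how Gilda scores matches today, and multiplying the base score by the prior is just one possible way to combine them.

```python
from typing import Dict, Optional

# Hypothetical resource: (symbol, year) -> {gene: number of mentions}.
# All counts are made up for the fictional CD155 scenario.
COUNTS: Dict[tuple, Dict[str, int]] = {
    ("CD155", 1995): {"example:OTHER_GENE": 9, "ncbigene:5817": 1},
    ("CD155", 2005): {"example:OTHER_GENE": 1, "ncbigene:5817": 19},
}


def temporal_prior(symbol: str, year: int) -> Dict[str, float]:
    """Fraction of mentions of `symbol` in `year` that referred to each gene."""
    counts = COUNTS.get((symbol, year))
    if not counts:
        return {}
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


def rescore(
    base_scores: Dict[str, float], symbol: str, year: Optional[int] = None
) -> Dict[str, float]:
    """Reweight per-gene scores by the temporal prior, when one is available."""
    prior = temporal_prior(symbol, year) if year is not None else {}
    if not prior:
        # No date or no data for that year: leave the scores untouched.
        return dict(base_scores)
    return {gene: s * prior.get(gene, 0.0) for gene, s in base_scores.items()}


# Two genes that tie on string matching alone.
base = {"example:OTHER_GENE": 0.5, "ncbigene:5817": 0.5}
scores_1995 = rescore(base, "CD155", year=1995)
scores_2005 = rescore(base, "CD155", year=2005)
print(max(scores_1995, key=scores_1995.get))
print(max(scores_2005, key=scores_2005.get))
```

For the data-table case, where no surrounding text is available, a prior like this would be the only disambiguating signal, which is why the fallback to the unmodified scores matters when the date is unknown.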