Abbreviation recasing: use all abbreviations #153
- Update: Now considering all abbreviations when checking a title and uppercasing them. Previously was only looking at the first preferred symbol.
- Update: Refactored, added comments and todos, simplified code.
- Bug fix: Was not actually uppercasing previously. The .replace() usage was incorrect.
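For readers unfamiliar with the change, here is a minimal sketch of what "considering all abbreviations" could look like. It is not the code from this PR: the function name `recase_abbreviations` and its parameters are hypothetical. The PR also does not say how `.replace()` was misused; the comment in the sketch shows only one common pitfall, which is an assumption.

```python
def recase_abbreviations(title: str, abbreviations: list[str]) -> str:
    """Uppercase each space-delimited word of `title` that matches any known abbreviation.

    Hypothetical helper for illustration; not the PR's actual implementation.
    """
    # Compare case-insensitively against every abbreviation, not just the first one.
    known = {abbrev.upper() for abbrev in abbreviations}
    words = title.split(' ')
    return ' '.join(w.upper() if w.upper() in known else w for w in words)


# One common .replace() pitfall (an assumption about the kind of bug, not a quote):
# str.replace() returns a new string, so calling it without using the result is a no-op.
title = "marfan syndrome mfs"
title.replace("mfs", "MFS")   # result discarded; `title` is unchanged
fixed = recase_abbreviations(title, ["MFS1", "MFS"])
```

Building the result from the split words avoids `.replace()` entirely, which also prevents accidental matches inside longer words.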
7eefea5 to 788f267
- Add: A todo
todo: (more important): It's probable that .split(' ') is not enough to cover all cases. Should also run the check
by splitting on other characters. E.g. consider the following potential cases: "TITLE (ACRONYM)",
"TITLE: ACRONYM1&ACRONYM2", "TITLE/ACRONYM" or "TITLE ACRONYM/ACRONYM", "TITLE {ACRONYM1,ACRONYM2}",
"TITLE[ACRONYM]", "TITLE-ACRONYM", or less likely cases such as "TITLE_ACRONYM", "TITLE.ACRONYM". There are quite
Why do you think these other potential cases exist in the data? Have you seen these in your analysis of the OMIM files?
Investigating all current patterns of titles and acronyms with special characters might be a time-consuming analysis, but we have seen several different variations.
Even if we did such an analysis, it would not necessarily future-proof the parser.
I think many of these uses of special characters do not follow a rigorous syntax that they've implemented, but rather a lax syntax, or maybe just a bundle of ad hoc cases.
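As a rough illustration of the todo discussed in this thread, the sketch below splits on a set of delimiter characters instead of only on spaces, while keeping the delimiters so the title can be reassembled unchanged. The delimiter set is taken from the example cases in the todo; the name `recase_with_delimiters` is hypothetical, and this is not code from the PR.

```python
import re

# Hypothetical delimiter set based on the cases listed in the todo:
# whitespace, ( ) [ ] { } : & / , - _ .
# The capturing group makes re.split() keep the delimiters in its output.
_DELIMITERS = re.compile(r"([\s()\[\]{}:&/,\-_.]+)")


def recase_with_delimiters(title: str, abbreviations: list[str]) -> str:
    """Uppercase any delimiter-separated token of `title` that matches a known abbreviation."""
    known = {abbrev.upper() for abbrev in abbreviations}
    parts = _DELIMITERS.split(title)
    # Delimiter runs and empty strings never match `known`, so they pass through as-is.
    return ''.join(p.upper() if p.upper() in known else p for p in parts)


examples = [
    recase_with_delimiters("title (abc)", ["ABC"]),
    recase_with_delimiters("title: abc1&abc2", ["ABC1", "ABC2"]),
    recase_with_delimiters("title {abc1,abc2}", ["ABC1", "ABC2"]),
]
```

One trade-off of this approach: an abbreviation that itself contains a delimiter (for example a hyphen or a period) would be split into pieces and no longer match, so the delimiter set would need to be checked against real titles.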
I reviewed the updates in the diff.diff file and they look good for the OMIM phenotype label (disease name) updates. Nice work!
Just to document, there are some capitalizations that are not completely fixed for gene entry pages, e.g. NFKB in https://omim.org/entry/164014?search=NFKB&highlight=NFKB.
> AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> <https://omim.org/entry/164014> "RELA protooncogene, nfkb subunit")
87677,87678c87677
< AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> <https://omim.org/entry/164014> "rela protooncogene, nfkb subunit")
However, since this case is a gene name, it's not important to fix. The data for the gene entry page also doesn't support fixing this capitalization (e.g. NFKB) using the method in this PR, so there is no need for further updates.
@twhetzel Ah, OK, I see. Thank you for reviewing! Perhaps I'll open an issue about those erroneous gene capitalization cases, or just keep it in mind. I'll handle the conflicts in this and merge tomorrow.
resolves #129
Changes
Abbreviation recasing: use all abbrevs
Results:
diff.diff.zip (also w/ .owls on google drive)