Abbreviation recasing: use all abbreviations #153

joeflack4 · 2024-09-23T04:25:24Z

resolves #129

Changes

Abbreviation recasing: use all abbrevs

Update: All abbreviations are now considered when checking a title and uppercasing matches. Previously, only the first preferred symbol was checked.
Update: Refactored, added comments and todos, simplified code.
Bug fix: Was not actually uppercasing previously. The .replace() usage was incorrect.
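
A minimal sketch of the behavior described above, not the code from this PR. The function name `recase_title`, the sample title, and the symbols `EXS` and `EXS1` are hypothetical. It checks every known abbreviation rather than only the first. It also shows one common way a `.replace()` call can silently do nothing (its return value being discarded); the actual cause of the original bug is not stated here.

```python
from typing import Iterable


def recase_title(title: str, abbreviations: Iterable[str]) -> str:
    """Uppercase each space-delimited word of `title` that matches any known abbreviation."""
    known = {abbrev.upper() for abbrev in abbreviations}
    words = title.split(' ')
    return ' '.join(word.upper() if word.upper() in known else word for word in words)


# Hypothetical data: 'EXS' stands in for the first preferred symbol, 'EXS1' for an additional one.
title = 'example syndrome type exs1'
first_only = recase_title(title, ['EXS'])           # no word matches, title is unchanged
all_abbrevs = recase_title(title, ['EXS', 'EXS1'])  # 'exs1' is recognized and uppercased

# str.replace() returns a new string; calling it without using the result changes nothing.
unchanged = 'example exs1'
unchanged.replace('exs1', 'EXS1')          # result discarded, `unchanged` keeps its value
fixed = unchanged.replace('exs1', 'EXS1')  # result assigned
```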

Results:
diff.diff.zip (also available, along with the .owl files, on Google Drive)

- Add: A todo

twhetzel · 2024-12-10T23:43:41Z

omim2obo/parsers/omim_entry_parser.py

+    todo: (more important): It's probable that .split(' ') is not enough to cover all cases. Should also run the check
+     by splitting on other characters. E.g. consider the following potential cases: "TITLE (ACRONYM)",
+     "TITLE: ACRONYM1&ACRONYM2", "TITLE/ACRONYM" or "TITLE ACRONYM/ACRONYM", "TITLE {ACRONYM1,ACRONYM2}",
+     "TITLE[ACRONYM]",  "TITLE-ACRONYM", or less likely cases such as "TITLE_ACRONYM", "TITLE.ACRONYM". There are quite


Why do you think these other potential cases exist in the data? Have you seen these in your analysis of the OMIM files?

joeflack4 · 2024-12-11

It might be a time-consuming analysis to investigate all current patterns of titles and acronyms with special characters. But we have seen several different variations.

Even if we did such an analysis, it does not necessarily future-proof the parser.

I think many of these uses of special characters do not reflect a rigorous syntax they've implemented, but rather a lax syntax, or maybe just a bundle of ad hoc cases.
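
For illustration, here is a rough sketch of what splitting on more than spaces could look like. It is not the parser's implementation: the name `recase_tokens`, the delimiter set, and the `ABC1`/`ABC2` symbols are assumptions, with the delimiters taken from the cases listed in the todo. The capturing group in the pattern makes `re.split` return the delimiters too, so the title can be reassembled unchanged apart from the recased tokens.

```python
import re
from typing import Iterable

# Illustrative delimiter set based on the todo's examples; not exhaustive.
_DELIMITERS = re.compile(r"([\s()\[\]{}:&/,\-_.]+)")


def recase_tokens(title: str, abbreviations: Iterable[str]) -> str:
    """Uppercase any delimiter-separated token of `title` that matches a known abbreviation."""
    known = {abbrev.upper() for abbrev in abbreviations}
    # The capturing group keeps delimiter runs in the result list.
    parts = _DELIMITERS.split(title)
    return ''.join(part.upper() if part.upper() in known else part for part in parts)


# Hypothetical titles and symbols.
in_parens = recase_tokens('example title (abc1)', ['ABC1'])
with_ampersand = recase_tokens('example title: abc1&abc2', ['ABC1', 'ABC2'])
with_hyphen = recase_tokens('title-abc1', ['ABC1'])
```

One trade-off: splitting on `-`, `_` or `.` would also break apart any abbreviation that itself contains one of those characters, so such symbols would need separate handling.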

twhetzel

I reviewed the updates in the diff.diff file and they look good for the OMIM phenotype label (disease name) updates. Nice work!

Just to document, there are some capitalizations that are not completely fixed for gene entry pages, e.g. NFKB in https://omim.org/entry/164014?search=NFKB&highlight=NFKB.

> AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> <https://omim.org/entry/164014> "RELA protooncogene, nfkb subunit")
87677,87678c87677
< AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> <https://omim.org/entry/164014> "rela protooncogene, nfkb subunit")

However, since this case is a gene name, it's not important to fix. The data for the gene entry page also doesn't support fixing this capitalization (e.g. NFKB) using the method in this PR, so no further updates are needed.

joeflack4 · 2024-12-11T03:44:02Z

@twhetzel Ah, OK, I see. Thank you for reviewing!

Perhaps I'll open an issue about those erroneous gene capitalization cases, or just keep it in mind.

I'll handle the conflicts in this and merge tomorrow.

joeflack4 changed the base branch from main to title-cleaning-updates September 23, 2024 04:25

joeflack4 self-assigned this Sep 23, 2024

joeflack4 added the bug label Sep 23, 2024

joeflack4 linked an issue Sep 23, 2024 that may be closed by this pull request

Capitalization of detected symbols in "Alternative title(s)" and "Included title(s)" may be incorrect #129

Open

joeflack4 requested a review from twhetzel September 23, 2024 04:46

joeflack4 force-pushed the abbrev-recasing-use-all branch from 7eefea5 to 788f267 Compare September 23, 2024 04:47

joeflack4 mentioned this pull request Sep 23, 2024

"Alternative symbol" synonyms & "formerly" updates #132

Draft

Abbreviation recasing: use all abbrevs

41f528c

- Add: A todo

joeflack4 changed the base branch from title-cleaning-updates to develop November 6, 2024 03:18

Merge branch 'develop' into abbrev-recasing-use-all

4e31d6c

twhetzel reviewed Dec 10, 2024

twhetzel approved these changes Dec 11, 2024

joeflack4 mentioned this pull request Dec 11, 2024

Improve abbreviation re-casing #179

Closed

Merge branch 'develop' into abbrev-recasing-use-all

3f7392a

joeflack4 merged commit dfb9cb0 into develop Dec 11, 2024

joeflack4 deleted the abbrev-recasing-use-all branch December 11, 2024 22:47
