Abbreviation recasing: use all abbreviations #153

Merged
joeflack4 merged 4 commits into develop from abbrev-recasing-use-all
Dec 11, 2024

@joeflack4 joeflack4 commented Sep 23, 2024

resolves #129

Changes

Abbreviation recasing: use all abbrevs

  • Update: All abbreviations are now considered when checking a title, and each one found is uppercased. Previously, only the first preferred symbol was checked.
  • Update: Refactored, added comments and todos, simplified code.
  • Bug fix: Was not actually uppercasing previously. The .replace() usage was incorrect.
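
For illustration, the behaviour described in the list above can be sketched as follows. This is a minimal sketch, not the code from this PR: the function name `recase_abbreviations` and the example title are hypothetical, and the `.replace()` pitfall in the comments is only one common way such a call goes wrong (the actual bug is not shown in this conversation).

```python
def recase_abbreviations(title: str, abbreviations: list[str]) -> str:
    """Uppercase every space-delimited word that matches any known abbreviation.

    Hypothetical helper: checks all abbreviations, not only the first one.
    """
    known = {abbrev.upper() for abbrev in abbreviations}
    words = title.split(' ')
    return ' '.join(word.upper() if word.upper() in known else word for word in words)


# Python strings are immutable, so str.replace() returns a new string.
# Calling it without using the return value changes nothing:
title = "example syndrome 2; exs2"   # made-up title, not OMIM data
title.replace("exs2", "EXS2")        # result discarded; title is unchanged
fixed = recase_abbreviations(title, ["ES2", "EXS2"])
print(fixed)
```

Note that this sketch only matches whole space-delimited words, so an abbreviation attached to punctuation (for example `exs2,`) would be left as-is.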

Results:
diff.diff.zip (also w/ .owls on google drive)

@joeflack4 joeflack4 changed the base branch from main to title-cleaning-updates September 23, 2024 04:25
@joeflack4 joeflack4 self-assigned this Sep 23, 2024
@joeflack4 joeflack4 added the bug Something isn't working label Sep 23, 2024
@joeflack4 joeflack4 requested a review from twhetzel September 23, 2024 04:46
@joeflack4 joeflack4 changed the base branch from title-cleaning-updates to develop November 6, 2024 03:18
todo: (more important): It's probable that .split(' ') is not enough to cover all cases. Should also run the check
by splitting on other characters. E.g. consider the following potential cases: "TITLE (ACRONYM)",
"TITLE: ACRONYM1&ACRONYM2", "TITLE/ACRONYM" or "TITLE ACRONYM/ACRONYM", "TITLE {ACRONYM1,ACRONYM2}",
"TITLE[ACRONYM]", "TITLE-ACRONYM", or less likely cases such as "TITLE_ACRONYM", "TITLE.ACRONYM". There are quite
Contributor

Why do you think these other potential cases exist in the data? Have you seen these in your analysis of the OMIM files?

Contributor Author

It might be a time-consuming analysis to investigate all current patterns of titles and acronyms with special characters. But we have seen several different variations.

Even if we did such an analysis, it does not necessarily future-proof the parser.

I think that many of these uses of special characters do not follow a rigorous syntax that they've implemented, but rather a kind of lax syntax, or maybe just a bundle of ad hoc cases.
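
One way to cover the delimiter cases listed in the todo without enumerating every pattern is to tokenize on a set of separator characters rather than on spaces only. The following is a rough sketch under stated assumptions: the delimiter set is taken from the examples in the todo and is not exhaustive, and `DELIMITERS` and `recase_with_delimiters` are hypothetical names, not part of this PR.

```python
import re

# Assumed delimiter set, based on the cases in the todo: whitespace, brackets,
# braces, parentheses, and : ; , & / _ . -
# The capturing group makes re.split() keep the delimiters in its output,
# so the title can be reassembled exactly.
DELIMITERS = re.compile(r"([\s()\[\]{}:;,&/_.\-]+)")


def recase_with_delimiters(title: str, abbreviations: list[str]) -> str:
    """Uppercase any token between delimiters that matches a known abbreviation."""
    known = {abbrev.upper() for abbrev in abbreviations}
    parts = DELIMITERS.split(title)
    return ''.join(part.upper() if part.upper() in known else part for part in parts)


# Made-up titles for illustration only
print(recase_with_delimiters("example title (exs)", ["EXS"]))
print(recase_with_delimiters("example title/exs1&exs2", ["EXS1", "EXS2"]))
```

A trade-off of this approach is that splitting on characters such as `-` or `.` also breaks up hyphenated or dotted words, so a fragment that happens to equal an abbreviation would be uppercased too. Whether that is acceptable depends on the data.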

Contributor

@twhetzel twhetzel left a comment

I reviewed the updates in the diff.diff file and they look good for the OMIM phenotype label (disease name) updates. Nice work!

Just to document, there are some capitalizations that are not completely fixed for gene entry pages, e.g. NFKB in https://omim.org/entry/164014?search=NFKB&highlight=NFKB.

> AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> <https://omim.org/entry/164014> "RELA protooncogene, nfkb subunit")
87677,87678c87677
< AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> <https://omim.org/entry/164014> "rela protooncogene, nfkb subunit")

However, since this case is a gene name, it's not important to fix. The data for the gene entry page also doesn't support fixing this capitalization (e.g. NFKB) using the method in this PR, so there is no need for further updates.

@joeflack4
Contributor Author

@twhetzel Ah, OK, I see. Thank you for reviewing!

Perhaps I'll open an issue about those erroneous gene capitalization cases, or just keep it in mind.

I'll handle the conflicts in this and merge tomorrow.

@joeflack4 joeflack4 merged commit dfb9cb0 into develop Dec 11, 2024
@joeflack4 joeflack4 deleted the abbrev-recasing-use-all branch December 11, 2024 22:47
Successfully merging this pull request may close these issues.

Capitalization of detected symbols in "Alternative title(s)" and "Included title(s)" may be incorrect
2 participants