Clarifying oa_fetch() implementation: searching for phrases v single words; stability; and concise code. #251

Closed
DebsKing opened this issue May 20, 2024 · 3 comments


DebsKing commented May 20, 2024

Hello.

Thank you for the brilliant package. I have three questions:

  1. In the example search string below, I search for "Biomed" OR "Biomed engineering" (as a mock example). Having run quality checks on the results, I am not convinced that it treats "Biomed engineering" as a phrase rather than as individual words. Is my code incorrect?

  2. When I repeat a given search string on the same day, it returns an identical number of publications. But when I repeat the search on consecutive days, it returns a few more publications each time. One might expect a small number of historical publications to be added to the OpenAlex database, but some of my searches covering 2019-2023 return 6% more publications when I run the code this month than they did last month. I would like to clarify whether this is due to ongoing changes to the database, to the function oa_fetch(), or to my code.
    I did see some discussion of stability between the OpenAlex database and the R package in issue #247, "Suggestion for discussion about conversion from result to data.frame / tibble".

  3. I want to search for words and phrases in the title OR abstract. I currently run this in two code chunks. Can I combine these for efficient code?

Thank you again for the package, and for any help or guidance. It is really appreciated!
Deborah

My R code:

remotes::install_github("ropensci/openalexR") # following a recent issue with the package, I now install via GitHub.
packageVersion("openalexR") # 1.3.1
library(openalexR)
library(dplyr) # provides %>%, filter(), bind_rows() and count() used below

# 1. Search based on title

works_title <- oa_fetch(
  entity = "works",
  title.search = c("Biomed", "Biomed engineering"), # mock example
  from_publication_date = "2019-01-01", # mock example
  to_publication_date = "2022-12-31",
  cited_by_count = ">1",
  options = list(sort = "cited_by_count:desc"),
  verbose = TRUE
)

# 2. Search based on abstract

works_abstract <- oa_fetch(
  entity = "works",
  abstract.search = c("Biomed", "Biomed engineering"),
  from_publication_date = "2019-01-01",
  to_publication_date = "2022-12-31",
  cited_by_count = ">1",
  options = list(sort = "cited_by_count:desc"),
  verbose = TRUE
)

# 3. Quality checks

# Are there duplicates within each data frame? No.
count(works_abstract[duplicated(works_abstract$id), ])
count(works_title[duplicated(works_title$id), ])

# Are there duplicates across the 'title' and 'abstract' data frames? Yes, as one would expect.
common_publications <- intersect(works_title$id, works_abstract$id)
length(common_publications)

# 4. Combine abstract and title data frames

# Keep rows of works_title whose id is not already in works_abstract
works_title_filtered <- works_title %>%
  filter(!(id %in% works_abstract$id))

# Combine the original works_abstract with the filtered works_title
works_combined <- bind_rows(works_abstract, works_title_filtered)

# Check that no duplicates remain
count(works_combined[duplicated(works_combined$id), ])

# 5. Put into bibliometrix format

works_combined <- oa2bibliometrix(works_combined)
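
Steps 3 and 4 above can probably be condensed: stacking the two result sets and dropping repeated ids gives the same union in a single pass. The following is a minimal sketch in base R. It uses small mock data frames in place of real oa_fetch() output, and the ids and titles are invented for illustration.

```r
# Mock stand-ins for the data frames returned by oa_fetch(); values are invented.
works_title <- data.frame(
  id = c("W1", "W2", "W3"),
  title = c("Paper A", "Paper B", "Paper C"),
  stringsAsFactors = FALSE
)
works_abstract <- data.frame(
  id = c("W2", "W3", "W4"),
  title = c("Paper B", "Paper C", "Paper D"),
  stringsAsFactors = FALSE
)

# Stack both result sets, then keep the first occurrence of each id.
works_all <- rbind(works_abstract, works_title)
works_combined <- works_all[!duplicated(works_all$id), ]

nrow(works_combined)             # number of unique works
anyDuplicated(works_combined$id) # 0 means no duplicate ids remain
```

With real oa_fetch() tibbles, which may contain list columns, dplyr::bind_rows() followed by dplyr::distinct(id, .keep_all = TRUE) is likely the safer equivalent.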


rkrug commented May 20, 2024 via email

DebsKing (Author) commented

Thanks for your time and help!

My search string above – “biomed” OR “biomed engineering” – was a bad example due to the word repetition. Apologies. A closer example to my search string is: "medicine" OR "biomed engineering". Importantly, "biomed engineering" needs to be treated as a single phrase and not as two words. Using inverted commas appears to fix this:
' medicine OR "biomed engineering" '
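
For longer term lists, the quoting can be automated. The sketch below uses a hypothetical helper, build_search_string(), which is not part of openalexR: it wraps multi-word terms in double quotes and joins everything with OR, producing a string of the form shown above. The oa_fetch() call is left commented out because it needs network access.

```r
# Hypothetical helper (not part of openalexR): quote multi-word terms so they
# are sent as phrases, then join all terms with OR.
build_search_string <- function(terms) {
  is_phrase <- grepl("\\s", terms)
  terms[is_phrase] <- paste0('"', terms[is_phrase], '"')
  paste(terms, collapse = " OR ")
}

query <- build_search_string(c("medicine", "biomed engineering"))
cat(query, "\n")

# The string could then be passed to a search filter, for example:
# works <- oa_fetch(entity = "works", title.search = query)
```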

Thank you for suggesting use of ‘title_and_abstract.search’. I want to search titles OR abstracts, and cannot implement ‘title_or_abstract.search’. Are you aware of any options to encode this, please? I cannot see any in the help / documentation.

Thank you for highlighting the stemming issue:
"One other point to consider is stemming. One example where stemming is misleading is the search for “Researcher” in title and abstract. This returns, unexpectedly, the same results as “Research”. One has to be aware of this when building search queries. Stemming is also applied inside inverted commas (“Researcher bias” returns the same result as “research bias”)."
This seems an important limitation if it applies to all instances of -er, -ing, -ies, etc. Is there any workaround?
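
One possible workaround, offered as a sketch rather than a confirmed fix, is to filter the returned records locally for the exact word form. The example below uses a mock data frame and base R only; the column name title is an assumption about the oa_fetch() output and should be checked against the real result. The OpenAlex API documentation also describes .no_stem variants of the search filters (for example title.search.no_stem); whether a given openalexR version passes these through is an assumption to verify.

```r
# Mock results; in practice this would be the data frame from oa_fetch().
works <- data.frame(
  id = c("W1", "W2", "W3"),
  title = c(
    "Researcher bias in clinical trials",
    "Research bias in surveys",
    "How researchers report outcomes"
  ),
  stringsAsFactors = FALSE
)

# Keep only rows whose title contains the exact word "researcher".
# The word boundaries (\\b) exclude "research" and "researchers".
keep <- grepl("\\bresearcher\\b", works$title, ignore.case = TRUE, perl = TRUE)
works_exact <- works[keep, ]

works_exact$id
```

The same filter could be applied to the abstract column, at the cost of discarding records that the API matched only through a stemmed form.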

Thanks again.


rkrug commented May 20, 2024 via email
