Clarifying oa_fetch() implementation: searching for phrases v single words; stability; and concise code. #251

DebsKing · 2024-05-20T08:06:56Z

Hello.

Thank you for the brilliant package. I have three questions:

1. In the example search string below, I search for "Biomed" OR "Biomed engineering" (as a mock example). Having run quality checks on the results, I am not convinced that it is treating 'biomed engineering' as a phrase, rather than as individual words. Is my coding incorrect?
2. When I repeat a given search string on the same day, it returns an identical number of publications. But when I repeat the search on consecutive days, it returns a few more publications each time. One might expect a small number of historical publications to be added to the OpenAlex database, but some of my searches covering 2019-2023 return 6% more publications when I run the code this month, compared to last month. I would like to clarify whether this is due to ongoing changes to the database, the function oa_fetch(), or my code implementation. I did see some discussion of stability between the OpenAlex database and the R package in "Suggestion for discussion about conversion from result to data.frame / tibble" (#247).
3. I want to search for words and phrases in the title OR abstract. I currently run this in two code chunks. Can I combine these for more efficient code?

Thank you again for the package, and for any help or guidance. It is really appreciated!
Deborah

My R code:

remotes::install_github("ropensci/openalexR") # following a recent issue with the package, I now install via GitHub.
packageVersion("openalexR") # 1.3.1
library(openalexR)
library(dplyr) # provides %>%, filter(), bind_rows() and count() used below

1. Search based on title

works_title <- oa_fetch(
  entity = "works",
  title.search = c("Biomed", "Biomed engineering"), # mock example
  from_publication_date = "2019-01-01", # mock example
  to_publication_date = "2022-12-31", # mock example
  cited_by_count = ">1",
  options = list(sort = "cited_by_count:desc"),
  verbose = TRUE
)

2. Search based on abstract

works_abstract <- oa_fetch(
  entity = "works",
  abstract.search = c("Biomed", "Biomed engineering"),
  from_publication_date = "2019-01-01",
  to_publication_date = "2022-12-31",
  cited_by_count = ">1",
  options = list(sort = "cited_by_count:desc"),
  verbose = TRUE
)

3. Quality checks:

count(works_abstract[duplicated(works_abstract$id), ]) # Are there duplicates within a dataframe # no
count(works_title[duplicated(works_title$id), ]) # Are there duplicates within a dataframe # no

common_publications <- intersect(works_title$id, works_abstract$id) # Are there duplicates across the 'title' and 'abstract' dataframes
length(common_publications) # yes, as one would expect.

4. Combine abstract and title dataframes:

works_title_filtered <- works_title %>% # Filter rows in works_title where id is not in works_abstract
  filter(!(id %in% works_abstract$id))

works_combined <- bind_rows(works_abstract, works_title_filtered) # Combine the original works_abstract with the filtered works_title

count(works_combined[duplicated(works_combined$id), ]) # check no duplicates

5. Put into bibliometrix format

works_combined <- oa2bibliometrix(works_combined)
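
Steps 3 and 4 above can be condensed into a stack-then-deduplicate pattern. The following is a minimal sketch on mock data, assuming both results are data frames that share an `id` column. The two small data frames are invented stand-ins for `works_title` and `works_abstract`, not real OpenAlex output.

```r
# Mock stand-ins for the two oa_fetch() results (invented IDs and titles).
works_title <- data.frame(
  id = c("W1", "W2", "W3"),
  display_name = c("Paper A", "Paper B", "Paper C"),
  stringsAsFactors = FALSE
)
works_abstract <- data.frame(
  id = c("W2", "W3", "W4"),
  display_name = c("Paper B", "Paper C", "Paper D"),
  stringsAsFactors = FALSE
)

# Stack both results, then keep only the first occurrence of each id.
stacked <- rbind(works_abstract, works_title)
works_combined <- stacked[!duplicated(stacked$id), ]

# Quality checks: no duplicate ids remain; count the overlap between the sets.
n_duplicates <- sum(duplicated(works_combined$id))
n_overlap <- length(intersect(works_title$id, works_abstract$id))

print(works_combined$id)
print(n_duplicates)
print(n_overlap)
```

With real results, `dplyr::bind_rows()` followed by `dplyr::distinct(id, .keep_all = TRUE)` should do the same job and may cope better with list-columns than `rbind()`.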

rkrug · 2024-05-20T08:43:15Z

Hi Deborah,

I am also a very happy user of openalexR and I use it daily for title and abstract searches, with long search strings that include individual words and phrases combined by OR. My comments are inline.

> Hello. Thank you for the brilliant package.

I could not agree more!

> I have three questions: In the example search string below, I search for "Biomed" OR "Biomed engineering" (as a mock example). Having run quality checks on the results, I am not convinced that it is treating 'biomed engineering' as a phrase, rather than individual words. Is my coding incorrect?

When you search for "Biomed" OR "Biomed engineering", the result is everything matching "Biomed" plus everything matching "Biomed engineering". In other words, the second set is contained in the first, so it is redundant and you should get the same results as when searching for "Biomed" only. When you search OpenAlex for `X Y` (without the inverted commas), it automatically assumes that there is an AND between the terms.

This is also visible in the API call your command issues:

https://api.openalex.org/works?filter=title.search%3ABiomed%7CBiomed%20engineering%2Cfrom_publication_date%3A2019-01-01%2Cto_publication_date%3A2022-12-31%2Ccited_by_count%3A%3E1&sort=cited_by_count%3Adesc

The filter contains Biomed%7CBiomed%20engineering, where %7C is the percent-encoded form of "|". In the OpenAlex filter syntax, "|" between values means OR, not AND, so the vector is sent as Biomed OR Biomed engineering, with the second value unquoted and therefore not marked as a phrase. To make the phrase explicit, pass a single string such as `Biomed OR "Biomed engineering"` as the search term.

Also, I have never used a vector of length greater than one for a search string. If I had, I would have expected either an OR, or even a vectorised version returning two results (but this is a different discussion).
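
To see exactly what was sent to the API, the filter part of the URL above can be decoded with base R. This is a small sketch using only `utils::URLdecode()` and `strsplit()`; the variable names are illustrative. It shows the two vector elements arriving as "|"-separated values of `title.search`, with no quotation marks around the two-word value.

```r
# Filter portion copied from the API call quoted above.
encoded_filter <- paste0(
  "title.search%3ABiomed%7CBiomed%20engineering",
  "%2Cfrom_publication_date%3A2019-01-01",
  "%2Cto_publication_date%3A2022-12-31",
  "%2Ccited_by_count%3A%3E1"
)

# Undo the percent-encoding, then split into individual filters on ",".
decoded_filter <- utils::URLdecode(encoded_filter)
filters <- strsplit(decoded_filter, ",", fixed = TRUE)[[1]]

# Pull out the values sent to title.search, which are separated by "|".
title_filter <- filters[startsWith(filters, "title.search:")]
title_values <- strsplit(
  sub("title.search:", "", title_filter, fixed = TRUE),
  "|",
  fixed = TRUE
)[[1]]

print(filters)
print(title_values)
```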

> When I repeat a given search string on the same day, it returns an identical number of publications. But when I repeat the search on consecutive days, it returns a few more publications each time. One might expect a small number of historical publications to be added to the OpenAlex database, but some of my searches that go from 2019-2023 are returning 6% more publications when I run the code this month, compared to last month. I would like to clarify if this is due to ongoing changes to the database, or the function oa_fetch(), or my code implementation. I did see some discussion on stability between the OpenAlex database and R package (#247).

OpenAlex is growing and continuously ingesting sources. If new works (and I use 'works' on purpose here, as they include datasets and not only articles) appear in any of the sources, they will be added, so an increase is to be expected. I usually download the results of a search on OpenAlex and store them as an element in a list, where the second element is the timestamp of when the OpenAlex access took place.
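
The snapshot habit described above might be sketched as follows. `take_snapshot()` and `mock_fetch()` are hypothetical names introduced for illustration, and the mock stands in for a real query so the sketch runs offline. The file name is also an assumption.

```r
# Hypothetical helper: run a fetch function and record when it ran.
take_snapshot <- function(fetch_fun) {
  list(
    results = fetch_fun(),
    retrieved_at = Sys.time()
  )
}

# Stand-in for a real OpenAlex query (invented IDs), so this runs offline.
mock_fetch <- function() {
  data.frame(id = c("W1", "W2"), stringsAsFactors = FALSE)
}

snapshot <- take_snapshot(mock_fetch)

# With openalexR, the fetch function could instead wrap a real call, e.g.:
# snapshot <- take_snapshot(function() {
#   openalexR::oa_fetch(
#     entity = "works",
#     title_and_abstract.search = 'medicine OR "biomed engineering"'
#   )
# })

# Save results and timestamp together so later runs can be compared.
snapshot_path <- file.path(tempdir(), "openalex_snapshot.rds")
saveRDS(snapshot, snapshot_path)
restored <- readRDS(snapshot_path)
print(restored$retrieved_at)
```

Keeping the timestamp next to the results makes it possible to report which state of the database a given count refers to.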

> I want to search for words and phrases in the title OR abstract. I currently run this in two code chunks. Can I combine these for efficient code?

Yes, I do this regularly. You have to use title_and_abstract.search to do this:

openalexR::oa_fetch(
  title_and_abstract.search = 'Biomed OR "Biomed engineering"',
  output = "list",
  verbose = TRUE
)

One other point to consider is stemming. One example where stemming is misleading is the search for "Researcher" in title and abstract. This returns, unexpectedly, the same results as "Research". One has to be aware of this when building search queries. Stemming is also applied inside the inverted commas ("Researcher bias" returns the same results as "research bias").

Cheers,
Rainer



DebsKing · 2024-05-20T11:56:11Z

Thanks for your time and help!

My search string above – “biomed” OR “biomed engineering” – was a bad example due to the word repetition. Apologies. A closer example to my search string is: "medicine" OR "biomed engineering". Importantly, "biomed engineering" needs to be treated as a single phrase and not as two words. Using inverted commas appears to fix this:
' medicine OR "biomed engineering" '
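
A search string of this shape can also be assembled from a vector of terms. This is a minimal sketch in base R, where `build_query()` is a hypothetical helper (not part of openalexR) that wraps any multi-word term in inverted commas and joins the terms with OR.

```r
# Hypothetical helper: quote multi-word terms so they are sent as phrases,
# then join all terms with OR into a single search string.
build_query <- function(terms) {
  is_phrase <- grepl(" ", terms, fixed = TRUE)
  quoted <- ifelse(is_phrase, sprintf('"%s"', terms), terms)
  paste(quoted, collapse = " OR ")
}

query <- build_query(c("medicine", "biomed engineering"))
print(query)
```

The resulting single string can then be passed to a search filter such as `title.search` in place of a character vector.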

Thank you for suggesting use of ‘title_and_abstract.search’. I want to search titles OR abstracts, and cannot implement ‘title_or_abstract.search’. Are you aware of any options to encode this, please? I cannot see any in the help / documentation.

Thank you for highlighting the stemming issue:
> One other point to consider is stemming. One example where stemming is misleading is the search for "Researcher" in title and abstract. This returns, unexpectedly, the same results as "Research". One has to be aware of this when building search queries. Also, stemming is also done in the inverted commas ("Researcher bias" returns the same result as "research bias").

This seems an important limitation, if it applies to all instances of: -er, -ing, -ies, etc. Is there any workaround?

Thanks again.

rkrug · 2024-05-20T12:01:44Z

> Thanks for your time and help!

Pleasure.

> My search string above – "biomed" OR "biomed engineering" – was a bad example due to the word repetition. Apologies. A closer example to my search string is: "medicine" OR "biomed engineering". Importantly, "biomed engineering" needs to be treated as a single phrase and not as two words. Using inverted commas appears to fix this: ' medicine OR "biomed engineering" '

Good.

> Thank you for suggesting use of 'title_and_abstract.search'. I want to search titles OR abstracts, and cannot implement 'title_or_abstract.search'. Are you aware of any options to encode this, please? I cannot see any in the help / documentation.

`title_and_abstract.search` searches both the title and the abstract for the term, so a work is returned if either one contains it. The "and" in the name is not a logical AND: it means both fields are searched, and a match in either is enough.
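
The "match in either field" behaviour can be imitated locally on mock data. This is only an illustration using `grepl()` on invented records; it makes no claim about how OpenAlex implements the search.

```r
# Invented records: one match in the title only, one in the abstract only,
# and one with no match at all.
works <- data.frame(
  id = c("W1", "W2", "W3"),
  title = c("Biomed advances", "Soil ecology", "Graph theory"),
  abstract = c("A short survey.", "Uses biomed sensors.", "Pure mathematics."),
  stringsAsFactors = FALSE
)

term <- "biomed"
in_title <- grepl(term, works$title, ignore.case = TRUE)
in_abstract <- grepl(term, works$abstract, ignore.case = TRUE)

# A record is kept when the term appears in the title OR in the abstract.
matched_ids <- works$id[in_title | in_abstract]
print(matched_ids)
```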

> Thank you for highlighting the stemming issue: One other point to consider is stemming. One example where stemming is misleading is the search for "Researcher" in title and abstract. This returns, unexpectedly, the same results as "Research". One has to be aware of this when building search queries. Also, stemming is also done in the inverted commas ("Researcher bias" returns the same result as "research bias"). This seems an important limitation, if it applies to all instances of: -er, -ing, -ies, etc. Is there any workaround? Thanks again.

Glad that I could help.

DebsKing mentioned this issue Jun 5, 2024

oa_fetch() results – problems with stability over time, duplicates & quality checks #253

Closed

trangdata closed this as completed Jun 12, 2024
