Clarifying oa_fetch() implementation: searching for phrases v single words; stability; and concise code. #251
Hi Deborah
I am also a very happy user of openalexR. I use it daily for title and abstract searches, with long search strings that combine individual words and phrases with OR.
My comments are inline
Hello.
Thank you for the brilliant package.
Can not agree more!
I have three questions:
In the example search string below, I search for "Biomed" OR "Biomed engineering" (as a mock example). Having run quality checks on the results, I am not convinced that it is treating 'biomed engineering' as a phrase, rather than individual words. Is my coding incorrect?
When you search for "Biomed" OR "Biomed engineering", you get all results for "Biomed" plus all results for "Biomed engineering". In other words, the second set is contained in the first, so it is redundant and you should get the same results as when searching for "Biomed" only.
When you search in OpenAlex for 'X Y' (without the inverted commas), it automatically assumes that there is an AND between the terms. This is also true when you look at the API call your command is issuing:
https://api.openalex.org/works?filter=title.search%3ABiomed%7CBiomed%20engineering%2Cfrom_publication_date%3A2019-01-01%2Cto_publication_date%3A2022-12-31%2Ccited_by_count%3A%3E1&sort=cited_by_count%3Adesc
You see the term Biomed%7CBiomed%20engineering%2C, which contains %7C, the percent-encoded form of "|". In OpenAlex filter syntax the pipe combines values with OR, so the vector is sent as Biomed OR Biomed engineering. The %20 is just a space and no inverted commas are sent, so "Biomed engineering" is searched as Biomed AND engineering rather than as a phrase.
Therefore you have to use `'"Biomed" OR "Biomed engineering"'` as the search term, that is, a single string with the phrase in double inverted commas.
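To see exactly what was sent, the request URL quoted above can be decoded with base R. This is only a sketch: `api_url` is copied from this thread, and the variable names are my own.

```r
# Base R only; nothing is sent to the API.
api_url <- paste0(
  "https://api.openalex.org/works?filter=title.search%3ABiomed%7CBiomed%20engineering",
  "%2Cfrom_publication_date%3A2019-01-01%2Cto_publication_date%3A2022-12-31",
  "%2Ccited_by_count%3A%3E1&sort=cited_by_count%3Adesc"
)

# Turn the percent-encoded characters (%3A, %7C, %20, ...) back into plain text
decoded <- utils::URLdecode(api_url)

# Extract the value of the `filter` parameter and split it into single filters
filter_string <- sub("^.*[?&]filter=([^&]*).*$", "\\1", decoded)
filters <- strsplit(filter_string, ",", fixed = TRUE)[[1]]
print(filters)
```

The first element shows the title filter with a pipe between the two vector elements and no inverted commas around the two-word term.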
Also, I have never used a vector of length greater than one as a search string. If I had, I would have expected either an OR, or even a vectorised version returning two results (but this is a different discussion).
When I repeat a given search string on the same day, it returns an identical number of publications. But when I repeat the search on consecutive days, it returns a few more publications each time. One might expect a small number of historical publications to be added to the Open Alex database, but some of my searches that go from 2019-2023 are returning 6% more publications when I run the code this month, compared to last month. I would like to clarify if this is due to on-going changes to the database, or the function oa_fetch(), or my code implementation.
I did see some discussion on stability between the Open Alex database and R package in #247.
OpenAlex is growing and continuously ingesting sources. If new works (I use 'works' on purpose here, as they include datasets and not only articles) appear in any of the sources, they will be added, so an increase is to be expected. I usually download the results of a search on OpenAlex and store them as an element in a list, where the second element is the timestamp of when the OpenAlex access took place.
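A minimal sketch of that bookkeeping, in base R. The helper `snapshot_search()` and the mock data frames are hypothetical and only stand in for real `oa_fetch()` output; comparing the `id` columns of two snapshots shows which works were added between runs.

```r
# Hypothetical helper: store a search result together with its access time.
snapshot_search <- function(results, query) {
  list(results = results, accessed = Sys.time(), query = query)
}

# Mock data standing in for two oa_fetch() runs made a month apart
last_month <- data.frame(id = c("W1", "W2"), stringsAsFactors = FALSE)
this_month <- data.frame(id = c("W1", "W2", "W3"), stringsAsFactors = FALSE)

query <- 'Biomed OR "Biomed engineering"'
snap_old <- snapshot_search(last_month, query)
snap_new <- snapshot_search(this_month, query)

# Works present in the newer snapshot but not in the older one
new_ids <- setdiff(snap_new$results$id, snap_old$results$id)
print(new_ids)

# Optionally keep the snapshot on disk for later comparison:
# saveRDS(snap_new, "search_snapshot.rds")
```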
I want to search for words and phrases in the title OR abstract. I currently run this in two code chunks. Can I combine these for efficient code?
Yes - I do this regularly. You have to use title_and_abstract.search to do this:
openalexR::oa_fetch(
  entity = "works",
  title_and_abstract.search = 'Biomed OR "Biomed engineering"',
  output = "list",
  verbose = TRUE
)
One other point to consider is stemming. One example where stemming is misleading is the search for "Researcher" in title and abstract. This returns, unexpectedly, the same results as "Research". One has to be aware of this when building search queries. Stemming is also applied inside inverted commas ("Researcher bias" returns the same result as "research bias").
Cheers,
Rainer
Thank you again for the package, and for any help or guidance. It is really appreciated!
Deborah
--
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation Biology, UCT), Dipl. Phys. (Germany)
Orcid ID: 0000-0002-7490-0066
Department of Evolutionary Biology and Environmental Studies
University of Zürich
Office Y19-M-72
Winterthurerstrasse 190
8075 Zürich
Switzerland
Office: +41 (0)44 635 47 64
Cell: +41 (0)78 630 66 57
PGP: 0x0F52F982
Thanks for your time and help!
My search string above, "biomed" OR "biomed engineering", was a bad example due to the word repetition. Apologies. A closer example to my search string is: "medicine" OR "biomed engineering". Importantly, "biomed engineering" needs to be treated as a single phrase and not as two words. Using inverted commas appears to fix this:
' medicine OR "biomed engineering" '
Thank you for suggesting use of 'title_and_abstract.search'. I want to search titles OR abstracts, and cannot implement 'title_or_abstract.search'. Are you aware of any options to encode this, please? I cannot see any in the help / documentation.
Thank you for highlighting the stemming issue. This seems an important limitation, if it applies to all instances of: -er, -ing, -ies, etc. Is there any workaround?
Thanks again.
Thanks for your time and help!
Pleasure.
My search string above – “biomed” OR “biomed engineering” – was a bad example due to the word repetition. Apologies. A closer example to my search string is: "medicine" OR "biomed engineering". Importantly, "biomed engineering" needs to be treated as a single phrase and not as two words. Using inverted commas appears to fix this:
' medicine OR "biomed engineering" '
Good.
Thank you for suggesting use of ‘title_and_abstract.search’. I want to search titles OR abstracts, and cannot implement ‘title_or_abstract.search’. Are you aware of any options to encode this, please? I cannot see any in the help / documentation.
`title_and_abstract.search` searches both the title and the abstract for the term, so a work is returned if either contains it. This is not a logical AND: it effectively searches the abstract and the title, and returns the work when either one matches.
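A sketch of how the two original chunks might collapse into one call, assuming the date and citation filters from the original post carry over unchanged. As far as I know, `oa_query()` only builds the request URL without contacting the API, so it is used here as a preview; the actual fetch is left commented out.

```r
# All filter values below are taken from this thread.
search_args <- list(
  entity = "works",
  title_and_abstract.search = 'medicine OR "biomed engineering"',
  from_publication_date = "2019-01-01",
  to_publication_date = "2022-12-31",
  cited_by_count = ">1",
  options = list(sort = "cited_by_count:desc")
)

if (requireNamespace("openalexR", quietly = TRUE)) {
  # Preview the request that would be sent
  query_url <- do.call(openalexR::oa_query, search_args)
  print(query_url)

  # Run the search itself (needs network access):
  # works_combined <- do.call(openalexR::oa_fetch, c(search_args, list(verbose = TRUE)))
}
```

With a single call there is no need to de-duplicate across separate title and abstract data frames afterwards.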
Thank you for highlighting the stemming issue:
One other point to consider is stemming. One example where stemming is misleading is the search for "Researcher" in title and abstract. This returns, unexpectedly, the same results as "Research". One has to be aware of this when building search queries. Stemming is also applied inside inverted commas ("Researcher bias" returns the same result as "research bias").
This seems an important limitation, if it applies to all instances of: -er, -ing, -ies, etc. Is there any workaround?
Thanks again.
Glad that I could help.
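One possible workaround, which is not confirmed anywhere in this thread, is to accept the stemmed results from the API and then filter them locally for the exact word. The sketch below uses a mock data frame in place of real `oa_fetch()` output; the column names are assumptions and may differ in your results.

```r
# Mock results standing in for the data frame returned by oa_fetch()
works <- data.frame(
  id = c("W1", "W2", "W3"),
  title = c(
    "Researcher bias in field studies",
    "Research bias and how to avoid it",
    "Early-career researchers and funding"
  ),
  stringsAsFactors = FALSE
)

# Keep only rows whose title contains the exact word "researcher";
# \\b marks a word boundary, so "research" and "researchers" do not match.
exact_word <- "\\bresearcher\\b"
keep <- grepl(exact_word, works$title, ignore.case = TRUE, perl = TRUE)
works_exact <- works[keep, ]
print(works_exact$id)
```

The same filter could be applied to the abstract column as well, combining the two logical vectors with `|`.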
Hello.
Thank you for the brilliant package. I have three questions:
In the example search string below, I search for "Biomed" OR "Biomed engineering" (as a mock example). Having run quality checks on the results, I am not convinced that it is treating 'biomed engineering' as a phrase, rather than individual words. Is my coding incorrect?
When I repeat a given search string on the same day, it returns an identical number of publications. But when I repeat the search on consecutive days, it returns a few more publications each time. One might expect a small number of historical publications to be added to the Open Alex database, but some of my searches that go from 2019-2023 are returning 6% more publications when I run the code this month, compared to last month. I would like to clarify if this is due to on-going changes to the database, or the function oa_fetch(), or my code implementation.
I did see some discussion on stability between the Open Alex database and R package in "Suggestion for discussion about conversion from result to data.frame / tibble" (#247).
I want to search for words and phrases in the title OR abstract. I currently run this in two code chunks. Can I combine these for efficient code?
Thank you again for the package, and for any help or guidance. It is really appreciated!
Deborah
My R code:
# Following a recent issue with the package, I now install via GitHub.
remotes::install_github("ropensci/openalexR")
packageVersion("openalexR") # 1.3.1
library(openalexR)
library(dplyr) # for %>%, filter(), bind_rows() and count()
# 1. Search based on title
works_title <- oa_fetch(
  entity = "works",
  title.search = c("Biomed", "Biomed engineering"), # mock example
  from_publication_date = "2019-01-01", # mock example
  to_publication_date = "2022-12-31",
  cited_by_count = ">1",
  options = list(sort = "cited_by_count:desc"),
  verbose = TRUE
)
# 2. Search based on abstract
works_abstract <- oa_fetch(
  entity = "works",
  abstract.search = c("Biomed", "Biomed engineering"),
  from_publication_date = "2019-01-01",
  to_publication_date = "2022-12-31",
  cited_by_count = ">1",
  options = list(sort = "cited_by_count:desc"),
  verbose = TRUE
)
# 3. Quality checks
# Are there duplicates within a data frame?
count(works_abstract[duplicated(works_abstract$id), ]) # no
count(works_title[duplicated(works_title$id), ]) # no
# Are there duplicates across the 'title' and 'abstract' data frames?
common_publications <- intersect(works_title$id, works_abstract$id)
length(common_publications) # yes, as one would expect
# 4. Combine abstract and title data frames
# Keep rows in works_title whose id is not in works_abstract
works_title_filtered <- works_title %>%
  filter(!(id %in% works_abstract$id))
# Append the filtered works_title to the original works_abstract
works_combined <- bind_rows(works_abstract, works_title_filtered)
count(works_combined[duplicated(works_combined$id), ]) # check no duplicates
# 5. Put into bibliometrix format
works_combined <- oa2bibliometrix(works_combined)