"primary_location:source.source.type" == "type"? #291

Open
yhan818 opened this issue Oct 21, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@yhan818
Contributor

yhan818 commented Oct 21, 2024

I am currently conducting a citation analysis based on “works” data obtained from OpenAlex. This dataset is my primary source for data analysis and data mining, aimed at understanding publications (mainly articles) and their citations.

I primarily use the “host_organization” field to analyze our publishers and selected publishers along with their associated journals. This analysis helps us identify which journals are frequently cited and determine how many times individual journals or publishers have been cited over the years.

During the process, I kept only rows with a non-empty "issn_l" value, for the years 2019 to 2023.
# Filter rows where issn_l is neither NA nor an empty string
articles_cited <- works_cited_final[!is.na(works_cited_final$issn_l) & works_cited_final$issn_l != "", ]
As a result, ~20% of articles have "referenced_works" values as NA. For example, https://openalex.org/works/W2980882411
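
For anyone who wants to reproduce that share, here is a minimal sketch on a toy data frame. The rows are invented for illustration; only the column names `issn_l` and `referenced_works` and the object names come from this thread, and treating `referenced_works` as a list column is an assumption about the data structure.

```r
# Toy stand-in for works_cited_final; the rows are invented for illustration.
works_cited_final <- data.frame(
  id = c("W1", "W2", "W3", "W4", "W5"),
  issn_l = c("1234-5678", NA, "", "2345-6789", "3456-7890"),
  stringsAsFactors = FALSE
)
# Assumed shape: a list column with one character vector (or NA) per work.
works_cited_final$referenced_works <- list(
  c("W10", "W11"), "W12", NA, NA, c("W13", "W14", "W15")
)

# Keep rows where issn_l is neither NA nor an empty string.
has_issn <- !is.na(works_cited_final$issn_l) & works_cited_final$issn_l != ""
articles_cited <- works_cited_final[has_issn, ]

# Flag works whose referenced_works entry is entirely missing.
no_refs <- vapply(
  articles_cited$referenced_works,
  function(refs) all(is.na(refs)),
  logical(1)
)

# Share of the filtered works that have no referenced works.
na_share <- mean(no_refs)
na_share
```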

I met with OpenAlex staff, and they said that some book chapters have an ISSN too. They recommended adding a filter on "primary_location.source.type" = "journal".

The OpenAlex technical documentation states that "Journal articles will have a primary_location.source.type of journal" (so this type is defined).

I checked the data I pulled, which has 38 columns. One column is named "type". Does it correspond to "primary_location.source.type"? If not, how can I get that field?

Thank you very much,

@yjunechoe
Collaborator

yjunechoe commented Oct 22, 2024

Thanks for the details (and cross-checking with the OpenAlex team)!

I'm having some trouble following the narrative though - is the stuff about "host_organization" and "issn_l" also part of the issue you're reporting here?

Could you give us a small reprex isolating the problem? Then we can use that as the basis for debugging.

@yhan818
Contributor Author

yhan818 commented Oct 22, 2024

No. "host_organization" and "issn_l" are all good.

You can see the data structure using the following code:
org_works_2023 <- oa_fetch(
  entity = "works",
  institutions.ror = c("03m2x1q45"),
  from_publication_date = "2023-01-01",
  to_publication_date = "2023-01-01"
)

@yjunechoe
Collaborator

yjunechoe commented Oct 22, 2024

Thanks - I think I see what you mean.

The $type column in the dataframe output corresponds to the type field (which is a property of the Work), not the primary_location.source.type field (which is a property of the Source).

Both are preserved in the output = "list" format but only the former is processed into a column in the output = "tibble" format.

A more minimal example:

work <- oa_fetch(identifier = "W4396214724")
work$type
#> [1] "article"

work_list <- oa_fetch(identifier = "W4396214724", output = "list")
work_list$type
#> [1] "article"
work_list$primary_location$source$type
#> [1] "journal"

I think what you have is a fair ask. We already hoist some properties of source into columns as so_*, so I can imagine a so_type column along with the ones we already process:

work |>
  select(starts_with("so"))
#> # A tibble: 1 × 2
#>   so            so_id                           
#>   <chr>         <chr>                           
#> 1 The R Journal https://openalex.org/S2489169438
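
Until such a column exists, the same value can be pulled out of the list output by hand. The sketch below uses a mocked list of works rather than a live `oa_fetch()` call, so the IDs and values are invented, and `get_source_type()` is a hypothetical helper, not an openalexR function. It only assumes the nesting shown above (`primary_location$source$type`).

```r
# Mocked stand-in for several works in output = "list" form: each work is a
# nested list. Only the fields used below are included; IDs are invented.
works_list <- list(
  list(id = "W1", type = "article",
       primary_location = list(source = list(type = "journal"))),
  list(id = "W2", type = "book-chapter",
       primary_location = list(source = list(type = "book series"))),
  list(id = "W3", type = "article",
       primary_location = list(source = NULL))  # no source recorded
)

# Hypothetical helper: return the Source type, or NA when a level is missing.
get_source_type <- function(work) {
  so_type <- work$primary_location$source$type
  if (is.null(so_type)) NA_character_ else so_type
}

so_type <- vapply(works_list, get_source_type, character(1))

# Side-by-side view of the Work-level type and the Source-level type.
types <- data.frame(
  id = vapply(works_list, function(w) w$id, character(1)),
  type = vapply(works_list, function(w) w$type, character(1)),
  so_type = so_type,
  stringsAsFactors = FALSE
)
types
```

The `NULL` check matters because `$` on a missing level returns `NULL` rather than an error, and `vapply()` would otherwise fail on a zero-length result.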

@yjunechoe yjunechoe added the enhancement New feature or request label Oct 22, 2024
@trangdata
Collaborator

Hi @yhan818, in this case you can actually do the filtering within the query (instead of pulling all the data down first), which is also more efficient. We could add a so_type column, but I'm a tiny bit hesitant about growing the output dataframe. Let me know if the code chunk below helps!

library(openalexR)
org_works_2023 <- oa_fetch(
  entity = "works",
  institutions.ror = c("03m2x1q45"),
  from_publication_date = "2023-01-01",
  to_publication_date = "2023-01-01",
  primary_location.source.type = "journal",
  options = list(sample = 20, seed = 1),
  verbose = TRUE
)
#> Requesting url: https://api.openalex.org/works?filter=institutions.ror%3A03m2x1q45%2Cfrom_publication_date%3A2023-01-01%2Cto_publication_date%3A2023-01-01%2Cprimary_location.source.type%3Ajournal&sample=20&seed=1
#> Getting 1 page of results with a total of 20 records...
org_works_2023
#> # A tibble: 20 × 39
#>    id     title display_name author ab    publication_date relevance_score so   
#>    <chr>  <chr> <chr>        <list> <chr> <chr>                      <dbl> <chr>
#>  1 https… Nove… Novel Wayfi… <df>   If a… 2023-01-01                 0.997 Vict…
#>  2 https… SEAR… SEARCHING F… <df>   <NA>  2023-01-01                 0.995 Abst…
#>  3 https… Neur… Neuropsychi… <df>   To e… 2023-01-01                 0.989 Leuk…
#>  4 https… Long… Longitudina… <df>   Alzh… 2023-01-01                 0.988 Neur…
#>  5 https… Corr… Corrigendum… <df>   <NA>  2023-01-01                 0.983 Magn…
#>  6 https… Tren… Trends in t… <df>   Spor… 2023-01-01                 0.981 Curr…
#>  7 https… The … The role of… <df>   This… 2023-01-01                 0.979 Ener…
#>  8 https… Rece… Recent prog… <df>   ZIP1… 2023-01-01                 0.979 Comp…
#>  9 https… Spec… Spectral me… <df>   We r… 2023-01-01                 0.975 Dele…
#> 10 https… Mult… Multiplexed… <df>   The … 2023-01-01                 0.974 Soft…
#> 11 https… Eval… Evaluation … <df>   <NA>  2023-01-01                 0.974 Arth…
#> 12 https… LATE… LATE MIOCEN… <df>   <NA>  2023-01-01                 0.969 Abst…
#> 13 https… Auto… Autobiograp… <df>   <NA>  2023-01-01                 0.966 Neur…
#> 14 https… MONI… MONITORING … <df>   <NA>  2023-01-01                 0.965 Abst…
#> 15 https… Edit… Editor’s Ch… <df>   Scie… 2023-01-01                 0.958 Mate…
#> 16 https… Maki… Making Dron… <df>   Smal… 2023-01-01                 0.949 Data…
#> 17 https… Disc… Discrete Ti… <df>   Conv… 2023-01-01                 0.949 IEEE…
#> 18 https… Audi… Audio deliv… <df>   Heal… 2023-01-01                 0.949 Proc…
#> 19 https… CHAR… CHARACTERIZ… <df>   <NA>  2023-01-01                 0.947 Abst…
#> 20 https… The … The implica… <df>   Amyo… 2023-01-01                 0.945 Anai…
#> # ℹ 31 more variables: so_id <chr>, host_organization <chr>, issn_l <chr>,
#> #   url <chr>, pdf_url <chr>, license <chr>, version <chr>, first_page <chr>,
#> #   last_page <chr>, volume <chr>, issue <chr>, is_oa <lgl>,
#> #   is_oa_anywhere <lgl>, oa_status <chr>, oa_url <chr>,
#> #   any_repository_has_fulltext <lgl>, language <chr>, grants <list>,
#> #   cited_by_count <int>, counts_by_year <list>, publication_year <int>,
#> #   cited_by_api_url <chr>, ids <list>, doi <chr>, type <chr>, …

Created on 2024-10-24 with reprex v2.0.2
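
To make the mapping from arguments to the request explicit, here is a rough sketch of how named filters end up in the `filter=` part of the URL printed above. `build_works_url()` is a hypothetical helper written for illustration; it is not how openalexR builds requests internally, and it ignores `sample`, `seed`, and paging.

```r
# Hypothetical helper, not part of openalexR: turn named filter values into
# the `filter=` query string seen in the verbose output.
build_works_url <- function(...) {
  filters <- list(...)
  # Each filter becomes "name:value"; filters are joined with commas.
  pairs <- paste0(names(filters), ":", unlist(filters))
  # Percent-encode reserved characters such as ":" and ",".
  query <- utils::URLencode(paste(pairs, collapse = ","), reserved = TRUE)
  paste0("https://api.openalex.org/works?filter=", query)
}

url <- build_works_url(
  institutions.ror = "03m2x1q45",
  from_publication_date = "2023-01-01",
  to_publication_date = "2023-01-01",
  primary_location.source.type = "journal"
)
url
```

The point is that the source-type condition is applied by the API itself, so non-journal works are never downloaded.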

@yjunechoe
Collaborator

yjunechoe commented Oct 24, 2024

Great point! In that case I agree that we can hold off on this. Between the filter strategy and the option to run a secondary search on entity = "sources", a so_type column may not be so critical to have for a Works object.
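
For the secondary-search route, a minimal sketch of the join step could look like the following. Both data frames are toy stand-ins with invented IDs. In practice the lookup table would come from a separate query on `entity = "sources"`, and the assumption here is that its output has `id` and `type` columns.

```r
# Toy stand-in for a Works tibble: so_id is the hoisted Source ID.
works <- data.frame(
  id = c("W1", "W2", "W3"),
  so_id = c("S1", "S2", "S1"),
  stringsAsFactors = FALSE
)

# Toy stand-in for a lookup fetched with entity = "sources".
# The `id` and `type` columns are assumptions about that output.
sources <- data.frame(
  id = c("S1", "S2"),
  type = c("journal", "book series"),
  stringsAsFactors = FALSE
)

# Rename before joining so the Source type does not clash with the Work type.
names(sources) <- c("so_id", "so_type")

# Left join: keep every work, even if its source is missing from the lookup.
works_typed <- merge(works, sources, by = "so_id", all.x = TRUE)

# Keep only works whose primary source is a journal (NA rows are dropped).
journal_works <- works_typed[which(works_typed$so_type == "journal"), ]
journal_works
```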


3 participants