---
title: "Open Scholarly Data @ SUB Göttingen - Overview"
output:
  distill::distill_article:
    toc: true
    toc_float: true
    toc_depth: 2
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
We use Google BigQuery to work with large open scholarly datasets. Our main data sources are [Unpaywall](https://unpaywall.org), [Crossref](https://www.crossref.org) and [OpenAlex](https://openalex.org).
The overview below describes our data warehouse, including the procedures used to load the data into BigQuery.

Anyone with a [Google Cloud account](https://cloud.google.com/) can view and query our publicly available [Open Scholarly Data warehouse](https://console.cloud.google.com/bigquery?hl=en&project=subugoe-collaborative) on BigQuery. Note that Google charges for the number of bytes each query processes (currently $6.25 per TB).
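
Because billing depends on bytes processed, it can help to estimate a query's cost before running it. The following is a minimal Python sketch, not part of our loading procedures. It assumes that "1 TB" is billed as 2**40 bytes and that the rate quoted above still applies. The function names `estimate_query_cost` and `dry_run_bytes` are hypothetical, and the fully qualified table name is assembled from the project and table listed on this page. The dry run relies on the `google-cloud-bigquery` client and your own billing project.

```python
"""Estimate what an on-demand BigQuery query would cost before running it."""

TIB = 2 ** 40             # assumption: "1 TB" is billed as 2**40 bytes
PRICE_PER_TB_USD = 6.25   # rate quoted on this page; check current pricing

# Table name assembled from the project and table listed on this page (assumption).
EXAMPLE_SQL = "SELECT COUNT(*) AS n FROM `subugoe-collaborative.cr_instant.snapshot`"


def estimate_query_cost(bytes_processed, price_per_tb=PRICE_PER_TB_USD):
    """Return the approximate cost in USD for scanning `bytes_processed` bytes."""
    if bytes_processed < 0:
        raise ValueError("bytes_processed must be non-negative")
    return bytes_processed / TIB * price_per_tb


def dry_run_bytes(sql, billing_project):
    """Ask BigQuery how many bytes `sql` would scan, without executing it.

    Needs the google-cloud-bigquery package and application default credentials.
    """
    from google.cloud import bigquery  # imported lazily; only needed here

    client = bigquery.Client(project=billing_project)
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


if __name__ == "__main__":
    # Offline illustration: a hypothetical query that scans 250 GiB.
    print(f"{estimate_query_cost(250 * 2**30):.2f} USD")
```

With credentials in place, `estimate_query_cost(dry_run_bytes(EXAMPLE_SQL, "your-billing-project"))` would give a rough figure for the example query, where `your-billing-project` is a placeholder for a project you own. Selecting only the columns you need usually reduces the bytes scanned, since BigQuery stores data by column.
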
## Status Crossref
### Current Snapshot (cr_instant)
::: l-body-outset
| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|-----------------|-----------------|---------------------|----------------------|-----------|--------------|-----------|--------------------|
| 2024/12 | all.json.tar.gz | [cr_instant.snapshot](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2scr_instant) | schema_crossref.json | [Repo](https://github.com/naustica/crossref_bq) | 08.01.2025 | 2013-2024 | 52.114.759 |
:::
### Historical Snapshots (cr_history)
::: l-body-outset
| Snapshot | File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|-----------------|-----------------|---------------------|----------------------|-----------|--------------|-----------|--------------------|
| 2018/04 | all.json.tar.gz | [cr_history.cr_apr18](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2scr_history) | schema_crossref.json | [Repo](https://github.com/naustica/crossref_bq) | 20.02.2022 | 2013-2018 | 16.766.035 |
| 2019/04 | all.json.tar.gz | [cr_history.cr_apr19](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2scr_history) | schema_crossref.json | [Repo](https://github.com/naustica/crossref_bq) | 29.10.2021 | 2013-2019 | 20.715.644 |
| 2020/04 | all.json.tar.gz | [cr_history.cr_apr20](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2scr_history) | schema_crossref.json | [Repo](https://github.com/naustica/crossref_bq) | 29.10.2021 | 2013-2020 | 25.334.525 |
| 2021/04 | all.json.tar.gz | [cr_history.cr_apr21](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2scr_history) | schema_crossref.json | [Repo](https://github.com/naustica/crossref_bq) | 29.10.2021 | 2013-2021 | 30.579.119 |
| 2022/04 | all.json.tar.gz | [cr_history.cr_apr22](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2scr_history) | schema_crossref.json | [Repo](https://github.com/naustica/crossref_bq) | 14.05.2022 | 2013-2022 | 35.939.195 |
| 2023/04 | all.json.tar.gz | [cr_history.cr_apr23](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2scr_history) | schema_crossref.json | [Repo](https://github.com/naustica/crossref_bq) | 07.05.2023 | 2013-2023 | 41.767.461 |
| 2024/04 | all.json.tar.gz | [cr_history.cr_apr24](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2scr_history) | schema_crossref.json | [Repo](https://github.com/naustica/crossref_bq) | 07.05.2024 | 2013-2024 | 47.709.184 |
:::
## Status Unpaywall
### Current Snapshot (upw_instant)
::: l-body-outset
| Snapshot| File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---------|-----------------------------------------------|----------------------|----------------------|-----------|--------------|-----------|-----------------|
| 2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | [upw_instant.snapshot](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2supw_instant) | bq_schema_mar22.json | [Repo](https://github.com/naustica/unpaywall_bq) | 14.03.2022 | 2008-2022 | 67.424.819 |
:::
### Historical Snapshots (upw_history)
::: l-body-outset
| Snapshot| File | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|---------|-----------------------------------------------|-----------------------------|----------------------|-----------|--------------|-----------|-----------------|
| 2018/03 | unpaywall_snapshot_2018-03-29T113154.jsonl.gz | [upw_history.upw_Mar18_08_20](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2supw_history) | bq_schema_mar18.json | [Repo](https://github.com/naustica/unpaywall_bq) | 29.10.2021 | 2008-2018 | 36.557.043 |
| 2019/02 | unpaywall_snapshot_2019-02-21T031509.jsonl.gz | [upw_history.upw_Feb19_08_19](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2supw_history) | bq_schema_feb19.json | [Repo](https://github.com/naustica/unpaywall_bq) | 10.11.2021 | 2008-2019 | 42.143.979 |
| 2020/02 | unpaywall_snapshot_2020-02-25T115244.jsonl.gz | [upw_history.upw_Feb20_08_20](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2supw_history) | bq_schema_feb20.json | [Repo](https://github.com/naustica/unpaywall_bq) | 30.10.2021 | 2008-2020 | 49.717.710 |
| 2021/02 | unpaywall_snapshot_2021-02-18T160139.jsonl.gz | [upw_history.upw_Feb21_08_21](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2supw_history) | bq_schema_feb21.json | [Repo](https://github.com/naustica/unpaywall_bq) | 29.10.2021 | 2008-2021 | 58.437.927 |
| 2022/03 | unpaywall_snapshot_2022-03-09T083001.jsonl.gz | [upw_history.upw_Mar22_08_22](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2supw_history) | bq_schema_mar22.json | [Repo](https://github.com/naustica/unpaywall_bq) | 14.03.2022 | 2008-2022 | 67.424.819 |
:::
## Status Semantic Scholar
::: l-body-outset
| Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|------------|--------------|----------------------|----------------------|-----------|--------------|-----------|-----------------|
| 2024-05-28 | papers/ | [semantic_scholar.papers](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2ssemantic_scholar) | / | [Repo](https://github.com/naustica/MA/blob/main/download_papers.py) | 10.06.2024 | All | 218.668.220 |
| 2024-05-28 | venues/ | [semantic_scholar.venues](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2ssemantic_scholar) | / | [Repo](https://github.com/naustica/MA/blob/main/download_venues.py) | 10.06.2024 | All | 194.578 |
:::
## Status OpenAlex
::: l-body-outset
| Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|------------|---------------|-----------------------|-----------------------------------|-----------|--------------|-----------|----------------------|
| 2024-12-31 | authors/ | [openalex.authors](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2sopenalex) | schema_openalex_author.json | [Repo](https://github.com/naustica/openalex) | 07.01.2025 | All | 101.693.809 |
| 2025-01-01 | funders/ | [openalex.funders](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2sopenalex) | schema_openalex_funders.json | [Repo](https://github.com/naustica/openalex) | 07.01.2025 | All | 32.437 |
| 2025-01-01 | institutions/ | [openalex.institutions](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2sopenalex) | schema_openalex_institutions.json | [Repo](https://github.com/naustica/openalex) | 07.01.2025 | All | 110.553 |
| 2025-01-01 | publishers/ | [openalex.publishers](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2sopenalex) | schema_openalex_publishers.json | [Repo](https://github.com/naustica/openalex) | 07.01.2025 | All | 10.741 |
| 2025-01-01 | sources/ | [openalex.sources](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2sopenalex) | schema_openalex_sources.json | [Repo](https://github.com/naustica/openalex) | 07.01.2025 | All | 260.811 |
| 2024-12-30 | topics/ | [openalex.topics](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2sopenalex) | schema_openalex_topics.json | [Repo](https://github.com/naustica/openalex) | 07.01.2025 | All | 4.516 |
| 2024-12-31 | works/ | [openalex.works](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2sopenalex) | schema_openalex_work.json | [Repo](https://github.com/naustica/openalex) | 07.01.2025 | All | 262.630.159 |
:::
## Status OpenAlex Document Type classification by SUB Göttingen
::: l-body-outset
| Snapshot | Directory | Table | Schema | Procedure | Last Changed | Coverage | Number of rows |
|------------|--------------|----------------------|----------------------|-----------|--------------|-----------|-----------------|
| 2024-12-31 | works/ | [resources.classification_article_reviews_december24](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1ssubugoe-collaborative!2sresources) | schema_document_types.json | [Repo](https://github.com/naustica/openalex_doctypes/tree/classifier/classifier) | 10.01.2025 | 2014-2024 | 58.240.262 |
:::