-
Notifications
You must be signed in to change notification settings - Fork 22
/
README.Rmd
130 lines (92 loc) · 6.38 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
message = FALSE,
warning = FALSE
)
```
# gutenbergr
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/gutenbergr)](https://CRAN.R-project.org/package=gutenbergr)
[![rOpenSci peer-review](https://badges.ropensci.org/41_status.svg)](https://github.com/ropensci/software-review/issues/41)
[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)
[![R-CMD-check](https://github.com/ropensci/gutenbergr/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/ropensci/gutenbergr/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/ropensci/gutenbergr/graph/badge.svg)](https://app.codecov.io/gh/ropensci/gutenbergr)
<!-- badges: end -->
Download and process public domain works from the [Project Gutenberg](https://www.gutenberg.org/) collection. Includes
* A function `gutenberg_download()` that downloads one or more works from Project Gutenberg by ID: e.g., `gutenberg_download(84)` downloads the text of Frankenstein.
* Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered:
* `gutenberg_metadata` contains information about each work, pairing Gutenberg ID with title, author, language, etc
* `gutenberg_authors` contains information about each author, such as aliases and birth/death year
* `gutenberg_subjects` contains pairings of works with Library of Congress subjects and topics
## Installation
::: .pkgdown-release
Install the released version of gutenbergr from [CRAN](https://cran.r-project.org/):
```{r, eval = FALSE}
install.packages("gutenbergr")
```
:::
::: .pkgdown-devel
Install the development version of gutenbergr from [GitHub](https://github.com/):
```{r, eval = FALSE}
# install.packages("pak")
pak::pak("ropensci/gutenbergr")
```
:::
## Examples
The `gutenberg_works()` function retrieves, by default, a table of metadata for all unique English-language Project Gutenberg works that have text associated with them. (The `gutenberg_metadata` dataset has all Gutenberg works, unfiltered).
```{r echo = FALSE}
options(dplyr.width = 140)
options(width = 100)
```
Suppose we wanted to download Emily Bronte's "Wuthering Heights." We could find the book's ID by filtering:
```{r}
library(dplyr)
library(gutenbergr)
gutenberg_works() |>
filter(title == "Wuthering Heights")
# or just:
gutenberg_works(title == "Wuthering Heights")
```
Since we see that it has `gutenberg_id` 768, we can download it with the `gutenberg_download()` function:
```{r}
wuthering_heights <- gutenberg_download(768)
wuthering_heights
```
`gutenberg_download` can download multiple books when given multiple IDs. It also takes a `meta_fields` argument that will add variables from the metadata.
```{r}
# 1260 is the ID of Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
books |>
count(title)
```
It can also take the output of `gutenberg_works` directly. For example, we could get the text of all Aristotle's works, each annotated with both `gutenberg_id` and `title`, using:
```{r}
aristotle_books <- gutenberg_works(author == "Aristotle") |>
gutenberg_download(meta_fields = "title")
aristotle_books
```
## FAQ
### What do I do with the text once I have it?
* The [Natural Language Processing CRAN View](https://CRAN.R-project.org/view=NaturalLanguageProcessing) suggests many R packages related to text mining, especially around the [tm package](https://cran.r-project.org/package=tm).
* The [tidytext](https://github.com/juliasilge/tidytext) package is useful for tokenization and analysis, especially since gutenbergr downloads books as a data frame already.
* You could match the `wikipedia` column in `gutenberg_author` to Wikipedia content with the [WikipediR](https://cran.r-project.org/package=WikipediR) package or to pageview statistics with the [wikipediatrend](https://cran.r-project.org/package=wikipediatrend) package.
* If you're considering an analysis based on author name, you may find the [humaniformat](https://cran.r-project.org/package=humaniformat) (for extraction of first names) and [gender](https://cran.r-project.org/package=gender) (prediction of gender from first names) packages useful. (Note that humaniformat has a `format_reverse` function for reversing "Last, First" names).
### How were the metadata R files generated?
See the [data-raw](https://github.com/ropensci/gutenbergr/tree/master/data-raw) directory for the scripts that generate these datasets. As of now, these were generated from [the Project Gutenberg catalog](https://www.gutenberg.org/ebooks/offline_catalogs.html) on **`r format(attr(gutenberg_metadata, "date_updated"), '%d %B %Y')`**.
### Do you respect the rules regarding robot access to Project Gutenberg?
Yes! The package respects [these rules](https://www.gutenberg.org/policy/robot_access.html) and complies to the best of our ability. Namely:
* Project Gutenberg allows wget to harvest Project Gutenberg using [this list of links](https://www.gutenberg.org/robot/harvest?filetypes[]=html). The gutenbergr package visits that page once to find the recommended mirror for the user's location.
* We retrieve the book text directly from that mirror using links in the same format. For example, Frankenstein (book 84) is retrieved from `https://www.gutenberg.lib.md.us/8/84/84.zip`.
* We give priority to retrieving the `.zip` file to minimize bandwidth on the mirror. `.txt` files are only retrieved if there is no `.zip`.
Still, this package is *not* the right way to download the entire Project Gutenberg corpus (or all from a particular language). For that, follow [their recommendation](https://www.gutenberg.org/policy/robot_access.html) to use wget or set up a mirror. This package is recommended for downloading a single work, or works for a particular author or topic.
## Code of Conduct
Please note that the gutenbergr project is released with a [Contributor Code of Conduct](https://contributor-covenant.org/version/2/1/CODE_OF_CONDUCT.html). By contributing to this project, you agree to abide by its terms.
[![ropensci\_footer](https://ropensci.org/public_images/github_footer.png)](https://ropensci.org/)