---
title: "Iterate crawling and downloading files"
author: "John Little"
date: "`r Sys.Date()`"
output: html_notebook
---
This lesson shows how to use the `rvest` package to facilitate web crawling for data. Below is an abridged presentation based on a longer [workshop on web crawling](https://rfun.library.duke.edu/portfolio/scraping_workshop/). The downloaded data are Excel workbook files, each containing approximately 50 worksheets.
A likely next step is to iterate over the worksheets within each workbook. The worksheet-wrangling workflow developed here is based on the article about [readxl workflows](https://readxl.tidyverse.org/articles/readxl-workflows.html), particularly but not limited to the section _concatenate worksheets into one data frame_; a minimal sketch of that pattern follows.
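As a preview, here is a minimal sketch of the _concatenate worksheets_ pattern, assuming a workbook at the hypothetical path `data/example.xlsx` (the real files are downloaded later in this lesson):
```{r eval=FALSE}
library(readxl)
library(purrr)

path <- "data/example.xlsx"  # hypothetical path, for illustration only

# List every worksheet, name the vector by sheet name, then row-bind the
# sheets into one data frame with an id column recording each sheet name.
path |>
  excel_sheets() |>
  set_names() |>
  map_df(~ read_excel(path, sheet = .x), .id = "sheet")
```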
Additionally, because the files are voluminous, at the end of this script I include functions to delete the downloaded workbook files.
### Data masking
Putting all these iteration tasks together in R's tidyverse dialect can be challenging because of the conveniences afforded by [data masking](https://dplyr.tidyverse.org/articles/programming.html#data-masking). Data masking reduces typing and makes interactive coding faster. Paradoxically, it makes iterating a bit more challenging because you must refer to data variables indirectly while distinguishing them from environment variables. In short, this is often done by embracing variables with double curly braces, `{{ }}`. The details get technical and can be confusing at first, but they are explained well in the [dplyr article on indirection and tidy selection](https://dplyr.tidyverse.org/articles/programming.html#data-masking), with additional useful information in the [ggplot2 article on using `aes()` and `vars()` in package functions](https://ggplot2.tidyverse.org/articles/ggplot2-in-packages.html#using-aes-and-vars-in-a-package-function). Stick with it and you'll find these workflows work quite well. A minimal example follows.
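For instance, here is a small example of my own (not taken from the dplyr article) of a function that embraces a user-supplied column:
```{r}
library(dplyr)

# `group_var` is a data variable supplied by the caller; embracing it with
# {{ }} tells dplyr to look it up inside `df` rather than in the environment.
count_by <- function(df, group_var) {
  df |>
    group_by({{ group_var }}) |>
    summarise(n = n(), .groups = "drop")
}

count_by(mtcars, cyl)
```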
## Load library packages
```{r}
library(tidyverse)
library(readxl)
library(fs)
library(rvest)
```
## Data
The data we will scrape come from the [US Census pulse survey](https://www.census.gov/newsroom/press-kits/2020/pulse-surveys.html). Specifically, we will look at data from the [Household Pulse Survey](https://www.census.gov/data/experimental-data-products/household-pulse-survey.html).
The target household data are outlined on this [summary navigation page](https://www.census.gov/programs-surveys/household-pulse-survey/data.html). [The weekly pulse surveys](https://www.census.gov/data/tables/2020/demo/hhp/hhp13.html) are linked there, and each file has a different, somewhat unpredictable URL.
> We can use `rvest` to crawl the summary page and harvest the URLs of the target files, then use `curl::curl_download()` to download each file into a target directory within the RStudio project on the local file system.
Here are some example files from housing pulse survey table 2b (a minimal single-file download sketch follows the list):
- [housing2b_se_week7.xlsx](https://www2.census.gov/programs-surveys/demo/tables/hhp/2020/wk7/housing2b_se_week7.xlsx)
- [housing2b_se_week13.xlsx](https://www2.census.gov/programs-surveys/demo/tables/hhp/2020/wk13/housing2b_se_week13.xlsx)
- [housing2b_week37.xlsx](https://www2.census.gov/programs-surveys/demo/tables/hhp/2021/wk37/housing2b_week37.xlsx)
- [housing2b_week45.xlsx](https://www2.census.gov/programs-surveys/demo/tables/hhp/2022/wk45/housing2b_week45.xlsx)
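As a warm-up before iterating, a minimal sketch that downloads just the first example file above (the directory name is my own choice, reused later in this lesson):
```{r eval=FALSE}
library(curl)
library(fs)

# Create the local target directory (no error if it already exists).
fs::dir_create("data/xl_workbooks")

# Download one workbook; mode = "wb" writes in binary mode, which Excel files need.
curl::curl_download(
  "https://www2.census.gov/programs-surveys/demo/tables/hhp/2020/wk7/housing2b_se_week7.xlsx",
  "data/xl_workbooks/housing2b_se_week7.xlsx",
  mode = "wb"
)
```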
## Set-up: scrape = crawl + parse; bulk-download
Assign some object names for easy reference to the URLs we need. Use `read_html()` to ingest the raw HTML of the example page.
```{r}
my_url <- "https://www.census.gov/programs-surveys/household-pulse-survey/data.html"
base_url <- "https://www.census.gov/"
my_results <- read_html(my_url)
```
Crawl the target webpages to gather a list of URLs for the files that can be downloaded.
```{r}
link_text <- my_results |>
html_nodes(".uscb-heading-2") |> # .uscb-heading-2 | .uscb-title-3
html_text()
link_url <- my_results |>
html_nodes("a.uscb-list-item") |>
html_attr("href")
my_crawl <- tibble(link_text, link_url, base_url) |>
filter(str_detect(link_text, "Household Pulse Survey:")) |>
unite(full_url, base_url, link_url, remove = TRUE, sep = "") |>
mutate(full_url = str_replace(full_url, 'gov//data', 'gov/data')) |>
relocate(full_url)
my_crawl
```
### Iterate with `map`
Use `map()` with `nest()`, `Sys.sleep()`, and `library(rvest)` to gather the target URLs that will be downloaded.
```{r}
crawl_results <-
my_crawl |> filter(str_detect(full_url, "hhp4[1-6]")) |> # filter imposed to keep the workshop experience consistent
slice_tail(n = 6) |> # slice(1:7) |> # limit the source data due to workshop timing constraints
# select(link_text_1, full_url) |>
nest(parenturl = -link_text) |>
mutate(my_rawhtml = map(parenturl, ~ {
Sys.sleep(2) ## DO THIS: pause 2 seconds between each page request. ##
.x |>
pull(full_url) |>
read_html() |>
html_nodes("a") |>
html_attr("href") |>
tibble()
}))
```
Subset the list of downloadable files to only the `housing2b` files.
```{r}
download_target_urls <- crawl_results |>
unnest(my_rawhtml) |>
rename(rawhtml = 3) |>
filter(str_detect(rawhtml, "housing2b")) |>
mutate(download_xworkbook_url = str_glue("https:{rawhtml}") ) |>
mutate(my_filename = fs::path_file(rawhtml))
download_target_urls
```
### Download the Excel files
`walk()` is like `map()` (both from `purrr`) but is called for its side effects rather than its return value. In this case, the side effect I need is the downloaded files. `walk2()` iterates over two arguments in parallel: here, each download URL and its destination path.
```{r}
fs::dir_create("data/xl_workbooks")
walk2(download_target_urls$download_xworkbook_url,
str_glue("data/xl_workbooks/{download_target_urls$my_filename}"),
curl::curl_download, mode = "wb")
```
What files did I just download?
```{r}
fs::dir_ls("data/xl_workbooks") |>
enframe()
```
## Put a bow on it.
In lesson five, a single Excel file was downloaded, wrangled, and then visualized. In this lesson, multiple files are downloaded, and multiple worksheets from multiple workbook files are ingested and wrangled into a single data frame, with the goal of generating multiple bar plots. Custom functions, `map()`, and a _for_ loop are used to advantage.
```{r}
# get filenames and limit to relevant files 'housing2b_week999'
my_files <- fs::dir_ls("data/xl_workbooks", glob = "*.xlsx") %>%
grep("2b_w", ., value = TRUE)
# get and repair the column names (read from header rows 5 and 6)
getcolnamesa <- read_excel(my_files[1], range = "A5:J5")
getcolnamesb <- read_excel(my_files[1], range = "A6:J6")
getcolnames <-
c(names(getcolnamesa[1:2]),
names(getcolnamesb[3:8]),
names(getcolnamesa[9:10]))
# getcolnames
getcolnames[1] <- "select_characteristics"
getcolnames[2] <- "total"
# Get a plot title that I will use later.
table_title <- read_excel(my_files[1], range = "A1:A1", col_names = FALSE) |>
pull(1)
```
### For loop
I found it convenient to use a _for_ loop in combination with `purrr::map()`. There's no shame in it.
```{r}
# for Loop to build big data frame
my_df_build <- tibble()
for (i in seq_along(my_files)) {
my_iterate_df <- my_files[i] |>
excel_sheets() |>
set_names() %>%
grep("[NS]C", ., value = TRUE) %>%
map_df(., ~ read_excel(my_files[i], sheet = .x,
range = "A8:J132",
col_names = FALSE), .id = "sheet") |>
mutate(file = my_files[i]) |>
relocate(file, .before = sheet)
my_df_build <- bind_rows(my_df_build, my_iterate_df)
}
# Assign the repaired column names
names(my_df_build)[3:12] <- getcolnames
# names(my_df_build)
```
Improve this code? Absolutely. Near the bottom of this document, the code chunk above is rewritten more succinctly using the `map2()` function.
### Useful vectors
Making vectors of character data can be useful for writing clear code.
```{r}
my_levels_categories <-
rev(c("Highly confident",
"Moderately confident",
"Slightly confident",
"No at all confident",
"Payment is/will be deferred",
# "total",
# "Occupied without rent",
# "Did not report to tenure"
"Did not report"))
remove_from_table <- c("total",
"Occupied without rent",
"Did not report to tenure")
my_levels_characteristics <-
rev(c("Bachelor’s degree or higher",
"Some college/associate’s degree",
"High school or GED",
"Less than high school"))
```
### Custom Functions
```{r}
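# `wrangle_df` is a magrittr "functional sequence": starting a pipe with `.`
# creates a reusable function that can later be called as wrangle_df(df).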
wrangle_df <- . %>%
filter(select_characteristics != "Total") %>%
mutate(sub_group = if_else(is.na(total),
select_characteristics,
NA_character_)) %>%
fill(sub_group, .direction = "down") %>%
relocate(sub_group) %>%
drop_na(total) %>%
filter(sub_group == "Education") %>%
mutate(across(total:last_col(), as.numeric)) %>%
pivot_longer(total:last_col(), names_to = "my_category") %>%
filter(!my_category %in% remove_from_table) %>% # use %in% (not !=) to match against the length-3 vector
group_by(sub_group, file, sheet, select_characteristics, my_category) %>%
summarise(totals = sum(value, na.rm = TRUE)) %>%
# ungroup() %>%
mutate(my_category =
fct_relevel(my_category, my_levels_categories))
make_barblot <- function(my_df, xaxis, yaxis, my_facet, ...) {
ggplot(data = my_df, aes({{xaxis}}, {{yaxis}})) +
geom_col() +
scale_y_continuous(
labels = scales::label_comma(scale_cut = scales::cut_short_scale())) +
coord_flip() +
facet_wrap(vars(fct_relevel({{my_facet}}, my_levels_characteristics))) +
theme(plot.title.position = "plot")
}
```
### Generate a bunch of plots
```{r}
my_df <- my_df_build |>
wrangle_df() |>
nest(data = -c(sub_group, file, sheet)) |>
mutate(myplot = map(data, make_barblot,
my_category, totals, select_characteristics)) |>
mutate(plot_title = fs::path_file(file)) |>
mutate(myplot = map(myplot, ~ .x + labs(
title = str_wrap(table_title, 80),
subtitle = str_glue("subset by {sheet} {sub_group}"),
caption = str_glue("Source: U.S. Census Bureau Household Pulse Survey - {plot_title}"),
x = "", y = ""))) |>
mutate(filename_title = str_glue("output/{fs::path_ext_remove(plot_title)}_{sheet}.svg"))
my_df
my_df |>
pull(myplot)
fs::dir_create("output")
walk2(my_df$filename_title, my_df$myplot, ggsave)
```
## map2
In the _For loop_ section above, I combined a for loop with the `map()` function. However, it's helpful to understand that `map()` has variants that let you iterate over multiple arguments in parallel, so the for loop and `map()` code can be written more succinctly. The chunk below imports all the data with `map2()` and then regenerates the plots.
```{r}
my_map2_df <- my_files |>
enframe() |>
mutate(sheet = map(value, excel_sheets)) |>
mutate(sheet = map(sheet, set_names)) |>
unnest(sheet) |>
rename(file = name) |>
filter(str_detect(sheet, "[NS]C")) |>
mutate(myXLtbl = map2(file, sheet, read_excel, range = "A8:J132", col_names = FALSE)) |>
unnest(myXLtbl) |>
select(-value)
names(my_map2_df)[3:12] <- getcolnames
my_map2_df |>
wrangle_df() |>
nest(data = -c(sub_group, file, sheet)) |>
mutate(myplot = map(data, make_barblot,
my_category, totals, select_characteristics)) |>
mutate(plot_title = fs::path_file(file)) |>
mutate(myplot = map(myplot, ~ .x + labs(
title = str_wrap(table_title, 80),
subtitle = str_glue("subset by {sheet} {sub_group}"),
caption = str_glue("Source: U.S. Census Bureau Household Pulse Survey - {plot_title}"),
x = "", y = ""))) |>
mutate(filename_title = str_glue("output/{fs::path_ext_remove(plot_title)}_{sheet}.svg")) |>
pull(myplot)
```
## Delete the downloaded files
Normally, I avoid using code to delete files in a directory. However, the chunk below can be used to keep the size of the project as small as possible. Since the code chunk is set to `eval=FALSE`, it will not run unless you run it manually.
```{r eval=FALSE, include=TRUE}
fs::dir_delete("data/xl_workbooks")
fs::dir_delete("output")
```