# Web scraping
The data we need to answer a question is not always in a spreadsheet ready for us to read. For example, the US murders dataset we used in the R Basics chapter originally comes from this Wikipedia page:
```{r}
url <- "https://en.wikipedia.org/w/index.php?title=Gun_violence_in_the_United_States_by_state&direction=prev&oldid=810166167"
```
_Web scraping_, or _web harvesting_, is the term we use to describe the process of extracting data from a website. The reason we can do this is that the information a browser uses to render a webpage is received as a text file from a server. The text is code written in hypertext markup language (HTML). Every browser has a way to show the HTML source code for a page, each one different. On Chrome, you can use **Control-U** on a PC and **command+alt+U** on a Mac to view the source.
## HTML
If you view the HTML source of the Wikipedia page linked above, you will find the data we need stored in a table. Here is the key piece of code:
```
<table class="wikitable sortable">
<tr>
<th>State</th>
<th><a href="/wiki/List_of_U.S._states_and_territories_by_population"
title="List of U.S. states and territories by population">Population</a><br />
<small>(total inhabitants)</small><br />
<small>(2015)</small> <sup id="cite_ref-1" class="reference">
<a href="#cite_note-1">[1]</a></sup></th>
<th>Murders and Nonnegligent
<p>Manslaughter<br />
<small>(total deaths)</small><br />
<small>(2015)</small> <sup id="cite_ref-2" class="reference">
<a href="#cite_note-2">[2]</a></sup></p>
</th>
<th>Murder and Nonnegligent
<p>Manslaughter Rate<br />
<small>(per 100,000 inhabitants)</small><br />
<small>(2015)</small></p>
</th>
</tr>
<tr>
<td><a href="/wiki/Alabama" title="Alabama">Alabama</a></td>
<td>4,853,875</td>
<td>348</td>
<td>7.2</td>
</tr>
<tr>
<td><a href="/wiki/Alaska" title="Alaska">Alaska</a></td>
<td>737,709</td>
<td>59</td>
<td>8.0</td>
</tr>
<tr>
```
## The rvest package
The __tidyverse__ provides a web harvesting package called __rvest__. The first step in using this package is to import the webpage into R, which __rvest__ makes quite simple:
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
library(rvest)
h <- read_html(url)
```
Note that the entire Murders in the US Wikipedia webpage is now contained in `h`. The class of this object is:
```{r}
class(h)
```
The __rvest__ package is actually more general; it handles XML documents. XML is a general markup language (that's what the ML stands for) that can be used to represent any kind of data. HTML is a specific type of XML developed for representing webpages. Here we focus on HTML documents.
Now, how do we extract the table from the object `h`? If we print `h`, we don't really see much:
```{r}
h
```
We can see all the code that defines the downloaded webpage using the `html_text` function like this:
```{r, eval=FALSE}
html_text(h)
```
If we look at it, we can see the data we are after are stored in an HTML table: you can see this in the line of the HTML code above that reads `<table class="wikitable sortable">`. The different parts of an HTML document, often defined with a tag in between `<` and `>`, are referred to as _nodes_. The __rvest__ package includes functions to extract nodes of an HTML document: `html_nodes` extracts all the nodes matching a given type and `html_node` extracts just the first one. To extract all the tables from the HTML code we use:
```{r}
tab <- h |> html_nodes("table")
```
Now, instead of the entire webpage, we just have the html code for the tables in the page:
```{r}
tab
```
The table we are interested in is the first one:
```{r}
tab[[1]]
```
This is clearly not a tidy dataset, not even a data frame. In the code above, you can definitely see a pattern, and writing code to extract just the data is very doable. In fact, __rvest__ includes a function just for converting HTML tables into data frames:
```{r}
tab <- read_html(url) |> html_nodes("table") %>% .[[1]] |> html_table()
class(tab)
```
We are now much closer to having a usable data table:
```{r}
tab <- tab |> setNames(c("state", "population", "total", "murder_rate"))
head(tab)
```
We still have some wrangling to do. For example, we need to remove the commas and turn the character columns into numbers; a sketch of that cleanup is shown below. Before continuing with this wrangling, we will learn a more general approach to extracting information from websites.
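As a preview, here is a minimal sketch of that cleanup, assuming the column names assigned above. It uses `parse_number` from __readr__ (loaded with the __tidyverse__) to strip commas and convert the columns to numbers; the `as.character` call simply guards against columns that `html_table` already parsed as numeric:
```{r, eval=FALSE}
# Sketch only: remove commas and convert the numeric columns
tab <- tab |>
  mutate(across(c(population, total, murder_rate),
                ~ parse_number(as.character(.x))))
head(tab)
```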
## CSS selectors {#sec-css-selectors}
The default look of a webpage made with the most basic HTML is quite unattractive. The aesthetically pleasing pages we see today are made using CSS to define the look and style of webpages. The fact that all pages for a company have the same style usually results from their use of the same CSS file to define the style. The general way these CSS files work is by defining how each of the elements of a webpage will look. The title, headings, itemized lists, tables, and links, for example, each receive their own style including font, color, size, and distance from the margin. CSS does this by leveraging patterns used to define these elements, referred to as _selectors_. An example of such a pattern, which we used above, is `table`, but there are many, many more.
If we want to grab data from a webpage and we happen to know a selector that is unique to the part of the page containing the data, we can use the `html_nodes` function. However, knowing which selector to use can be quite complicated.
In fact, the complexity of webpages has been increasing as they become more sophisticated. For some of the more advanced ones, it seems almost impossible to find the nodes that define a particular piece of data. However, selector gadgets make this task feasible.
SelectorGadget^[http://selectorgadget.com/] is a piece of software that allows you to interactively determine which CSS selector you need to extract specific components from the webpage. If you plan on scraping data other than tables from HTML pages, we highly recommend you install it. A Chrome extension is available that permits you to turn on the gadget and then, as you click through the page, it highlights parts and shows you the selector you need to extract those parts. There are various demos of how to do this including __rvest__ author Hadley Wickham's
vignette^[https://cran.r-project.org/web/packages/rvest/vignettes/selectorgadget.html] and other tutorials based on the vignette^[https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/] ^[https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/].
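To illustrate the workflow, here is a hedged sketch using made-up pieces: suppose SelectorGadget reports that the parts of a recipe page we want match the hypothetical selectors `.recipe-title` and `.ingredient`. We could then extract them with the same functions we used above:
```{r, eval=FALSE}
# Hypothetical URL and selectors, as might be reported by SelectorGadget
h <- read_html("http://www.example.com/some-recipe")
recipe_title <- h |> html_nodes(".recipe-title") |> html_text()
ingredients <- h |> html_nodes(".ingredient") |> html_text()
```
The result of `html_text` is a character vector with one entry per matching node, ready for further wrangling.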
## JSON
Sharing data on the internet has become more and more common. Unfortunately, providers use different formats, which makes it harder for data scientists to wrangle data into R. Yet there are some standards that are becoming more common. Currently, a format that is widely adopted is the JavaScript Object Notation, or JSON. Because this format is very general, it is nothing like a spreadsheet; a JSON file looks more like the code you use to define a list. Here is an example of information stored in JSON format:
```{r, echo=FALSE}
library(jsonlite)
example <- data.frame(name = c("Miguel", "Sofia", "Aya", "Cheng"),
student_id = 1:4, exam_1 = c(85, 94, 87, 90),
exam_2 = c(86, 93, 88, 91))
json <- toJSON(example, pretty = TRUE)
json
```
The file above actually represents a data frame. To read it, we can use the function `fromJSON` from the __jsonlite__ package. Note that JSON files are often made available via the internet: several organizations provide a JSON API, or web service, that you can connect to directly to obtain data. Here is an example providing information about Nobel prize winners:
```{r}
nobel <- fromJSON("http://api.nobelprize.org/v1/prize.json")
```
This downloads a list. The first component is a table with information about Nobel prize winners:
```{r}
filter(nobel$prize, category == "literature" & year == "1971") |> pull(laureates)
```
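The `fromJSON` function works the same way on the simpler example from the beginning of this section. Here is a quick sketch, assuming the JSON output shown earlier is stored in an object named `json`:
```{r, eval=FALSE}
# Convert the JSON string shown earlier back into a data frame
dat <- fromJSON(json)
class(dat)
dat
```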
You can learn much more by examining tutorials and help files from the __jsonlite__ package. This package is intended for relatively simple tasks such as converting data into tables. For more flexibility, we recommend `rjson`.
## APIs
An API, or Application Programming Interface, is a set of rules and protocols that allows different software entities to communicate with each other. It defines methods and data formats that software components should use when requesting and exchanging information.
APIs can be understood as middlemen between different software systems, facilitating their interactions. They play a crucial role in enabling the integrations that make today's software so interconnected and versatile.
There are several types of APIs. The main ones related to retrieving data are:
* **Web Services**:
- Often built using protocols like HTTP/HTTPS.
- Commonly used to enable applications to communicate with each other over the web. For instance, a weather application for a smartphone may use a web API to request weather data from a remote server.
* **Databases**:
- Enable communication between an application and a database.
- Examples include SQL-based calls or more abstracted ORM (Object-Relational Mapping) frameworks.
Other APIs include:
* Hardware APIs: Used for applications to interact with hardware, such as printers, cameras, or graphics cards.
* Remote Procedure Calls: Allow a program to execute a procedure on a remote server rather than on a local system.
* Library or Framework APIs: Define how to interact with a particular programming library or framework.
* Operating Systems: Define how software applications request and use lower-level services provided by an operating system.
Key concepts associated with APIs:
- **Endpoints**: Specific functions available through the API. For web APIs, an endpoint is usually a specific URL where the API can be accessed.
- **Methods**: Actions that can be performed. In web APIs, these often correspond to HTTP methods like GET, POST, PUT, DELETE.
- **Requests and Responses**: The act of asking the API to perform its function is a _request_, and the data it returns to you is the _response_.
- **Rate Limits**: Restrictions on how often you can call the API, often used to prevent abuse or overloading of the service.
- **Authentication and Authorization**: Mechanisms to ensure that only approved users or applications can use the API. Common methods include API keys, OAuth, or JWT.
- **Data Formats**: Many web APIs exchange data in a specific format, often JSON or XML.
In today's digital age, APIs are foundational. They enable the creation of complex, feature-rich applications by allowing different software components to leverage each other's capabilities seamlessly.
## The httr2 package
The **httr2** package provides functions to work with HTTP requests. One of the core functions in this package is `request`, which is used to form a request to send to a web service. The `req_perform` function then sends the request.
The `request` function forms an HTTP GET request to the specified URL. Typically, HTTP GET requests are used to retrieve information from a server based on the provided URL.
The function returns an object of class `response`. This object contains all the details of the server's response, including status code, headers, and content. You can then use other **httr2** functions to extract or interpret information from this response.
Let's say you want to retrieve a webpage or API endpoint:
```{r}
library(readr)
library(httr2)
url <- "https://data.cdc.gov/resource/gvsb-yw6g.csv"
response <- request(url) |> req_perform()
```
We can see the results of the request by looking at the returned object.
```{r}
response
```
To extract the body, which is where the data are, we can use `resp_body_string` and send the result, a comma delimited string, to `read_csv`.
```{r}
tab <- response |> resp_body_string() |> read_csv()
```
Note that it has only `r nrow(tab)` entries. APIs often limit how much you can download in a single request.
We can change this by adding a query parameter to the request. Here we use `req_url_path_append`:
```{r}
tab <- request(url) |>
req_url_path_append("?$limit=10000000") |>
req_perform() |>
resp_body_string() |>
read_csv()
dim(tab)
```
The CDC service returns data in CSV format. A more common format used by web services is JSON. The CDC also provides the data in JSON format through the following URL:
```{r}
url <- "https://data.cdc.gov/resource/gvsb-yw6g.json"
```
To extract the data table we use the `fromJSON` function.
```{r}
tab <- request(url) |>
req_perform() |>
resp_body_string() |>
fromJSON(flatten = TRUE)
```
When working with APIs, it's essential to check the API's documentation for rate limits, required headers, or authentication methods. The **httr2** package provides tools to handle these requirements, such as setting headers or using OAuth authentication.
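As an illustration, here is a minimal sketch of attaching a header and a bearer token to a request. The endpoint URL and the token below are placeholders, not a real service:
```{r, eval=FALSE}
# Hypothetical endpoint; the header and token values are placeholders
response <- request("https://api.example.com/v1/data") |>
  req_headers(Accept = "application/json") |>
  req_auth_bearer_token("YOUR_TOKEN_HERE") |>
  req_perform()
```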
## Exercises (will not be included in midterm)
(@) Visit the following web page: [https://web.archive.org/web/20181024132313/http://www.stevetheump.com/Payrolls.htm](https://web.archive.org/web/20181024132313/http://www.stevetheump.com/Payrolls.htm)
Notice there are several tables. Say we are interested in comparing the payrolls of teams across the years. The next few exercises take us through the steps needed to do this.
Start by applying what you learned to read in the website into an object called `h`.
(@) Note that, although not very useful, we can actually see the content of the page by typing:
```{r, eval = FALSE}
html_text(h)
```
The next step is to extract the tables. For this, we can use the `html_nodes` function. We learned that tables in HTML are associated with the `table` node. Use the `html_nodes` function with the `table` node type to extract the tables. Store them in an object called `nodes`.
(@) The `html_nodes` function returns a list of objects of class `xml_node`. We can see the content of each one using, for example, the `html_text` function. You can see the content for an arbitrarily picked component like this:
```{r, eval = FALSE}
html_text(nodes[[8]])
```
If the content of this object is an html table, we can use the `html_table` function to convert it to a data frame. Use the `html_table` function to convert the 8th entry of `nodes` into a table.
(@) Repeat the above for the first 4 components of `nodes`. Which of the following are payroll tables:
a. All of them.
b. 1
c. 2
d. 2-4
(@) Repeat the above for the __last__ 3 components of `nodes`. Which of the following is true:
a. The last entry in `nodes` shows the average across all teams through time, not payroll per team.
b. All three are payroll per team tables.
c. All three are like the first entry, not a payroll table.
d. All of the above.
(@) We have learned that the first and last entries of `nodes` are not payroll tables. Redefine `nodes` so that these two are removed.
(@) We saw in the previous analysis that the first table node is not actually a table. This happens sometimes in html because tables are used to make text look a certain way, as opposed to storing numeric values.
Remove the first component and then use `sapply` and `html_table` to convert each node in `nodes` into a table. Note that in this case, `sapply` will return a list of tables. You can also use `lapply` to ensure that a list is returned.
(@) Look through the resulting tables. Are they all the same? Could we just join them with `bind_rows`?
(@) Create two tables, call them `tab_1` and `tab_2` using entries 10 and 19.
(@) Use a `full_join` function to combine these two tables. Before you do this you will have to fix the missing header problem. You will also need to make the names match.
(@) After joining the tables, you see several NAs. This is because some teams are in one table and not the other. Use the `anti_join` function to get a better idea of why this is happening.
(@) We see that one of the problems is that the Yankees are listed as both _N.Y. Yankees_ and _NY Yankees_. In the next section, we will learn efficient approaches to fixing problems like this. Here we can do it "by hand" as follows:
```{r, eval=FALSE}
tab_1 <- tab_1 |>
mutate(Team = ifelse(Team == "N.Y. Yankees", "NY Yankees", Team))
```
Now join the tables and show only Oakland and the Yankees and the payroll columns.