# Text {#sec-rtext}
```{r}
#| include: false
library(dplyr)
library(readr)
library(scales)
library(forcats)
library(ggplot2)
```
::: {.callout .callout-note}
Module originally written by Connor Jerzak.
:::
## Where are we? Where are we headed? {.unnumbered}
Up till now, you should have covered:
- Loading in data;
- `R` notation;
- Matrix algebra.
## Review
- `"` and `'` are usually equivalent.
- `<-` and `=` are usually interchangeable[^18_text-1]. (`x <- 3` is equivalent to `x = 3`, although the former is preferred because it makes the assignment explicit.)
- Use `(` `)` when you are giving input to a function:
[^18_text-1]: Only the equal sign can be used to set the value of a function's argument.
```{r}
# my_results <- FunctionName(FunctionInputs)
```
Note that `c(1, 2, 3)` passes three numbers as input to the function `c`.
- Use `{` `}` when you are defining a function or writing a `for` loop:
```{r}
# function
MyFunction <- function(InputMatrix) {
  TempMat <- InputMatrix
  for (i in 1:5) {
    TempMat <- t(TempMat) %*% TempMat / 10
  }
  return(TempMat)
}
myMat <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
print(MyFunction(myMat))
# loop
x <- c()
for (i in 1:20) {
  x[i] <- i
}
print(x)
```
## Goals for today
Today, we will learn more about using text data. Our objectives are:
- To read and write text in `R`;
- To learn how to use `paste()` and `sprintf()`;
- To learn how to use regular expressions;
- To learn about other tools for representing and analyzing text in `R`.
## Reading and writing text in R
- To read in a text file, use `readLines()`:
```
readLines("~/Downloads/Carboxylic acid - Wikipedia.html")
```
- To write a text file, use:
```
write.table(my_string_vector, "~/mydata.txt", sep="\t")
```
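As a minimal sketch of a full round trip, the chunk below writes a character vector to disk and reads it back. It uses `writeLines()` rather than `write.table()` as an assumption of convenience: `writeLines()` writes one element per line without quotes or row names, which mirrors what `readLines()` expects. The vector contents and the temporary file path are hypothetical.

```{r}
# Hypothetical example data and a temporary file path (not from the text above)
my_string_vector <- c("first line", "second line", "third line")
tmp_path <- tempfile(fileext = ".txt")

# writeLines() writes one element per line, with no quotes or row names
writeLines(my_string_vector, tmp_path)

# readLines() returns a character vector with one element per line
lines_back <- readLines(tmp_path)
print(lines_back)

# The text survives the round trip unchanged
identical(lines_back, my_string_vector)
```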
## `paste()` and `sprintf()`
`paste()` and `sprintf()` are useful commands in text processing, for example for automatically naming files or for running a series of commands over subsets of your data. Making tables also often requires these commands.
`paste()` concatenates vectors after converting them to character strings.
```{R}
# use collapse for inputs of length > 1
my_string <- c("Not", "one", "could", "equal")
paste(my_string, collapse = " ")
# use sep for inputs of length == 1
paste("Not", "one", "could", "equal", sep = " ")
```
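The file-naming use case mentioned above might look like the following sketch. The prefix, the ids, and the `.csv` extension are hypothetical; the point is that `paste()` is vectorized, and that `sep` and `collapse` can be used together.

```{r}
# Hypothetical file-name pieces, used only for illustration
prefix <- "model"
ids <- 1:3

# paste() is vectorized: the length-1 prefix is recycled across ids
file_names <- paste(prefix, ids, sep = "_")

# paste0() is paste() with sep = ""
file_names <- paste0(file_names, ".csv")
print(file_names)

# sep joins element by element; collapse then joins the results into one string
one_string <- paste(prefix, ids, sep = "_", collapse = ", ")
print(one_string)
```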
For more sophisticated concatenation, use `sprintf()`. This is very useful for automatically making tables.
```{R}
sprintf("Coefficient for %s: %.3f (%.2f)", "Gender", 1.52324, 0.03143)
# %s is replaced by a character string
# %.3f is replaced by a floating-point number with 3 decimal places
# %.2f is replaced by a floating-point number with 2 decimal places
```
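To illustrate the table-making use, here is a sketch that formats several rows in one call. The variable names and numbers are made up for illustration. It relies on `sprintf()` being vectorized over its arguments and on width specifiers (`%-8s`, `%7.3f`) to align the columns.

```{r}
# Hypothetical coefficient estimates and standard errors (made-up numbers)
var_names <- c("Gender", "Age", "Income")
estimates <- c(1.52324, -0.20411, 0.00873)
std_errors <- c(0.03143, 0.11020, 0.00412)

# %-8s left-justifies the name in 8 characters;
# %7.3f right-justifies the estimate in 7 characters with 3 decimal places
table_rows <- sprintf("%-8s %7.3f (%.2f)", var_names, estimates, std_errors)

# Print one formatted row per line
cat(table_rows, sep = "\n")
```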
## Regular expressions
A regular expression is a special text string for describing a search pattern. Regular expressions are most often used in functions for detecting, locating, and replacing desired text in a corpus.
Use cases:
1. TEXT PARSING. E.g. I have 10000 congressional speeches. Find all those which mention Iran.
2. WEB SCRAPING. E.g. Parse html code in order to extract research information from an online table.
3. CLEANING DATA. E.g. After loading in a dataset, we might need to remove mistakes from the dataset, or subset the data using regular expression tools.
Example in `R`. Extract the tweet mentioning Indonesia.
```{r}
s1 <- "If only Bradley's arm was longer. RT"
s2 <- "Share our love in Indonesia and in the World. RT if you agree."
my_string <- c(s1, s2)
grepl(my_string, pattern = "Indonesia")
my_string[grepl(my_string, pattern = "Indonesia")]
```
Key point: Many R commands use regular expressions. See `?grepl`. Assume that `x` is a character vector and that `pattern` is the target pattern. In the earlier example, `x` could have been something like `my_string` and `pattern` would have been "`Indonesia`". Here are other key uses:
1. DETECT PATTERNS. `grepl(pattern, x)` goes through all the entries of `x` and returns a logical vector of `TRUE` and `FALSE` values of the same length as `x`. It returns `TRUE` whenever that entry contains the target pattern, and `FALSE` whenever it doesn't.
2. REPLACE PATTERNS. `gsub(pattern, replacement, x)` goes through all the entries of `x` and replaces every match of `pattern` with `replacement`.
```{r}
gsub(
x = my_string,
pattern = "o",
replacement = "AAAA"
)
```
3. LOCATE PATTERNS. `regexpr(pattern, text)` goes through each element of the character vector `text`. It returns an integer vector of the same length, whose entries give the position of the first pattern match in each element, or -1 if there is no match.
```{r}
regex_object <- regexpr(pattern = "was", text = my_string)
attr(regex_object, "match.length")
attr(regex_object, "useBytes")
regexpr(pattern = "was", text = my_string)[1]
regexpr(pattern = "was", text = my_string)[2]
```
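Once a match has been located, the base R function `regmatches()` can pull out the matched text itself. The sketch below reuses the two tweets from above; the possessive pattern is a hypothetical choice for illustration, and the `+` quantifier ("one or more") is an assumption not yet introduced in this module.

```{r}
s1 <- "If only Bradley's arm was longer. RT"
s2 <- "Share our love in Indonesia and in the World. RT if you agree."
my_string <- c(s1, s2)

# Locate the first possessive: one or more letters followed by 's
possessive <- regexpr(pattern = "[[:alpha:]]+'s", text = my_string)

# Start positions; -1 marks the element with no match
print(as.integer(possessive))

# regmatches() uses the positions and match lengths to extract the text;
# elements without a match are dropped from the result
found <- regmatches(my_string, possessive)
print(found)
```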
Seems simple? The problem: the patterns can get pretty complex!
### Character classes
Some symbols stand in for something more complex, rather than being taken literally.
`[[:digit:]]` Matches with all digits.
`[[:lower:]]` Matches with lower case letters.
`[[:alpha:]]` Matches with all alphabetic characters.
`[[:punct:]]` Matches with all punctuation characters.
`[[:cntrl:]]` Matches with "control" characters such as `\n`, `\r`, etc.
Example in `R`:
```{r}
my_string <- "Do you think that 34% of apples are red?"
gsub(my_string, pattern = "[[:digit:]]", replacement = "DIGIT")
gsub(my_string, pattern = "[[:alpha:]]", replacement = "")
```
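A short sketch of two further uses of character classes on the same sentence. The `+` quantifier ("one or more") is an assumption not covered above; it makes a run of digits match as a single unit instead of digit by digit.

```{r}
my_string <- "Do you think that 34% of apples are red?"

# With "+", the run "34" is replaced once rather than once per digit
with_digits <- gsub(pattern = "[[:digit:]]+", replacement = "DIGITS", x = my_string)
print(with_digits)

# Remove punctuation, then lower-case the result
no_punct <- gsub(pattern = "[[:punct:]]", replacement = "", x = my_string)
cleaned <- tolower(no_punct)
print(cleaned)
```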
### Special characters
Certain characters (such as `.`, `*`, and `\`) have special meaning in the regular expressions framework (they are used to form conditional patterns, as discussed below). Thus, when we want our pattern to include those characters literally, we must "escape" them by prefixing them with `\\` or by enclosing them in `\\Q...\\E`.
Example in `R`:
```{r}
my_string <- "Do *really* think he will win?"
gsub(my_string, pattern = "\\*", replacement = "")
```
```{r}
my_string <- "Now be brave! \n Dread what comrades say of you here in combat! "
gsub(my_string, pattern = "\\\n", replacement = "")
```
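When the whole pattern should be taken literally, an alternative to escaping is the `fixed = TRUE` argument of `gsub()` (and of `grepl()`), which turns off regular expression interpretation entirely. A minimal sketch on the same string:

```{r}
my_string <- "Do *really* think he will win?"

# Escaped version: "\\*" is the regular expression for a literal asterisk
escaped <- gsub(pattern = "\\*", replacement = "", x = my_string)

# fixed = TRUE treats the pattern as plain text, so no escaping is needed
literal <- gsub(pattern = "*", replacement = "", x = my_string, fixed = TRUE)

print(literal)
identical(escaped, literal)
```

The trade-off is that `fixed = TRUE` rules out character classes and the conditional patterns described next, so it only suits exact-text matches.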
### Conditional patterns
`[]` The target characters to match are located between the brackets. For example, `[aAbB]` will match with the characters `a, A, b, B`.
`[^...]` Matches with everything except the material between the brackets. For example, `[^aAbB]` will match with everything but the characters `a, A, b, B`.
`(?=)` Lookahead -- match something that IS followed by the pattern.
`(?!)` Negative lookahead -- match something that is NOT followed by the pattern.
`(?<=)` Lookbehind -- match something that is preceded by the pattern.
```{r}
my_string <- "Do you think that 34%of the 23%of apples are red?"
gsub(my_string, pattern = "(?<=%)", replacement = " ", perl = TRUE)
```
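The example above uses a lookbehind. As a sketch of the two lookahead forms on the same sentence (the replacement strings `XX` and `OF` are arbitrary, and `+` again means "one or more"):

```{r}
my_string <- "Do you think that 34%of the 23%of apples are red?"

# Positive lookahead: match digits only when a "%" comes right after them.
# The "%" is checked but not consumed, so it stays in the output.
masked <- gsub(pattern = "[[:digit:]]+(?=%)", replacement = "XX",
               x = my_string, perl = TRUE)
print(masked)

# Negative lookahead: match "of" only when it is NOT followed by " apples"
marked <- gsub(pattern = "of(?! apples)", replacement = "OF",
               x = my_string, perl = TRUE)
print(marked)
```

Lookarounds are only available with `perl = TRUE`.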
```{r}
my_string <- c(
"legislative1_term1.png",
"legislative1_term1.pdf",
"legislative1_term2.png",
"legislative1_term2.pdf",
"term2_presidential1.png",
"presidential1.png",
"presidential1_term2.png",
"presidential1_term1.pdf",
"presidential1_term2.pdf"
)
grepl(my_string, pattern = "^(?!presidential1).*\\.png", perl = TRUE)
```
- Indicates which file names don't start with `presidential1` but do end in `.png`
- `^` indicates that the pattern should start at the beginning of the string.
- `(?!presidential1)` is a negative lookahead -- the match only succeeds if the text at that position is NOT `presidential1`. Because it comes right after `^`, it rules out strings that start with `presidential1`.
- The first `.` indicates that, following the negative lookahead, there can be any character, and the `*` says that it doesn't matter how many. Note that we have to escape the `.` in `.png` (by writing `\\.` instead of just `.`).
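One way to check a lookahead pattern like this is to compare it with a version that splits the two conditions into separate `grepl()` calls. The sketch below does that for the same file names; the `$` anchor (end of string) in the second pattern is an assumption not introduced above.

```{r}
my_string <- c(
  "legislative1_term1.png", "legislative1_term1.pdf",
  "legislative1_term2.png", "legislative1_term2.pdf",
  "term2_presidential1.png", "presidential1.png",
  "presidential1_term2.png", "presidential1_term1.pdf",
  "presidential1_term2.pdf"
)

# Single pattern with a negative lookahead, as above
with_lookahead <- grepl("^(?!presidential1).*\\.png", my_string, perl = TRUE)

# The same two conditions, checked separately and combined with ! and &
starts_pres <- grepl("^presidential1", my_string)
ends_png <- grepl("\\.png$", my_string)
two_step <- !starts_pres & ends_png

# Both approaches select the same file names
identical(with_lookahead, two_step)
my_string[two_step]
```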
You will have the chance to try out some regular expressions for yourself at the end!
## Representing Text
In courses and research, we often want to analyze text, to extract meaning out of it. One of the key decisions we need to make is how to represent the text as numbers. Once the text is represented numerically, we can then apply a host of statistical and machine learning methods to it. Those methods are discussed more in the Gov methods sequence (Gov 2000-2003). Here's a summary of the decisions you must make:
1. WHICH TEXT TO USE? Which text do I want to analyze? What is my universe of documents?
2. HOW TO REPRESENT THE TEXT NUMERICALLY? How do I use numbers to represent different things about the text?
3. HOW TO ANALYZE THE NUMERICAL REPRESENTATION? How do I extract meaning out of the numerical representation?
Representing text numerically.
1. Document term matrix. The document term matrix (DTM) is a common method for representing text. The DTM is a matrix. Each row of this matrix corresponds to a document; each column corresponds to a word. It is often useful to look at summary statistics such as the percentage of speeches in which a Democratic lawmaker used the word "inequality" compared to a Republican; the DTM would be very helpful for this and other tasks.
```{R}
doc1 <- "Rage---Goddess, sing the rage of Peleus’ son Achilles,
murderous, doomed, that cost the Achaeans countless losses,
hurling down to the House of Death so many sturdy souls,
great fighters’ souls."
doc2 <- "And fate? No one alive has ever escaped it,
neither brave man nor coward, I tell you,
it's born with us the day that we are born."
doc3 <- "Many cities of men he saw and learned their minds,
many pains he suffered, heartsick on the open sea,
fighting to save his life and bring his comrades home."
```
```{r}
DocVec <- c(doc1, doc2, doc3)
```
Now we can use utility functions in the `tm` package:
```{R}
#| eval: false
library(tm)
DocCorpus <- Corpus(VectorSource(DocVec))
DTM1 <- inspect(DocumentTermMatrix(DocCorpus))
```
Consider the effect of different "pre-processing" choices on the resulting DTM!
```{r}
#| eval: false
DocVec <- tolower(DocVec)
DocVec <- gsub(DocVec, pattern = "[[:punct:]]", replacement = " ")
DocVec <- gsub(DocVec, pattern = "[[:cntrl:]]", replacement = " ")
DocCorpus <- Corpus(VectorSource(DocVec))
DTM2 <- inspect(DocumentTermMatrix(DocCorpus,
control = list(stopwords = TRUE, stemming = TRUE)
))
```
Stemming is the process of reducing inflected/derived words to their word stem or base (e.g. stemming, stemmed, stemmer --\> stem\*)
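To make the row-per-document, column-per-word layout concrete, here is a sketch that builds a small DTM by hand in base R, without the `tm` package. The three short documents are hypothetical stand-ins, and splitting on single spaces is a deliberately crude tokenizer; `tm` handles these steps more carefully.

```{r}
# Three short hypothetical documents (not the passages above)
docs <- c("the rage of Achilles", "the fate of men", "men of the sea")

# Split each document into lower-case words
tokens <- strsplit(tolower(docs), split = " ", fixed = TRUE)

# The vocabulary is the sorted set of unique words across all documents
vocab <- sort(unique(unlist(tokens)))

# Start from a matrix of zeros: one row per document, one column per word
dtm <- matrix(0L,
  nrow = length(docs), ncol = length(vocab),
  dimnames = list(paste0("doc", seq_along(docs)), vocab)
)

# Fill in the word counts document by document
for (i in seq_along(tokens)) {
  counts <- table(tokens[[i]])
  dtm[i, names(counts)] <- as.integer(counts)
}
print(dtm)
```

Each cell holds the number of times a word appears in a document, so column summaries such as `colSums(dtm > 0)` give the number of documents that use each word.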
## Important packages for parsing text
1. rvest -- Useful for downloading and manipulating HTML and XML.
2. tm -- Useful for converting text into a numerical representation (forming DTMs).
3. stringr -- Useful for string parsing.
## Exercises {.unnumbered}
#### 1 {.unnumbered}
Figure out why this command does what it does:
`r sprintf("%s of spontaneous events are %s in the mind. Really, %.2f?", "15.03322123", "puzzles", 15.03322123)`
#### 2 {.unnumbered}
Why does this command not work?
```{r}
try(sprintf(
"%s of spontaneous events are %s in the mind. Really, %.2f?",
"15.03322123", "puzzles", "15.03322123"
), TRUE)
```
#### 3 {.unnumbered}
Using `?grep`, these materials, Google, and your friends, describe what the following command does. What changes when `value = FALSE`?
```{r}
grep("'",
c("To dare is to lose one's footing momentarily.", "To not dare is to lose oneself."),
value = TRUE
)
```
#### 4 {.unnumbered}
Write code to automatically extract the file names that DO start with `presidential` and DO end in `.pdf`.
```{r}
my_string <- c(
"legislative1_term1.png",
"legislative1_term1.pdf",
"legislative1_term2.png",
"legislative1_term2.pdf",
"term2_presidential1.png",
"presidential1.png",
"presidential1_term2.png",
"presidential1_term1.pdf",
"presidential1_term2.pdf"
)
```
#### 5 {.unnumbered}
Using the same string as in the above, write code to automatically extract the file names that end in .pdf and that contain the text `term2`.
```{r}
# Your code here
```
#### 6 {.unnumbered}
Combine these two strings into a single string separated by a "-". Desired output: "The carbonyl group in aldehydes and ketones is an oxygen analog of the carbon–carbon double bond."
```{r}
string1 <- "The carbonyl group in aldehydes and ketones
is an oxygen analog of the carbon"
string2 <- "–carbon double bond."
```
#### 7 {.unnumbered}
Challenge problem! Download this webpage <https://en.wikipedia.org/wiki/Odyssey>
- Read the html file into your R workspace.
- Remove all of the HTML tags (you may need Google to help with this one).
- Remove all punctuation.
- Make all the characters lower case.
- Do this same process with this webpage <https://en.wikipedia.org/wiki/Iliad>.
- Form a document term matrix from the two resulting text strings.
```{r}
# Your code here
```