```{r, echo=FALSE, purl=FALSE, message = FALSE}
library(sotu)
library(tidyverse)
library(tidytext)
library(readtext)
knitr::opts_chunk$set(warning=FALSE, message=FALSE,comment = "#>", purl = FALSE)
```
# Preparing Textual Data {#textprep}
> Learning Objectives
>
> - read textual data into R using `readtext`
> - use the `stringr` package to prepare strings for processing
> - use `tidytext` functions to tokenize texts and remove stopwords
> - use `SnowballC` to stem words
------------
We'll use several R packages in this section:
- `sotu` will provide the metadata and text of State of the Union speeches ranging from George Washington to Barack Obama.
- `tidyverse` is a collection of R packages designed for data science, including `dplyr` with a set of verbs for common data manipulations and `ggplot2` for visualization.
- `tidytext` provides specific functions for a "tidy" approach to working with textual data, where one row represents one "token" or meaningful unit of text, for example a word.
- `readtext` provides a function well suited to reading textual data from a large number of formats into R, including metadata.
```{r load-libs, eval=FALSE}
library(sotu)
library(tidyverse)
library(tidytext)
library(readtext)
```
## Reading text into R
First, let's look at the data in the `sotu` package. The metadata and texts are contained in this package separately, in `sotu_meta` and `sotu_text` respectively. We can take a look at them either by typing their names or by using functions like `glimpse()` or `str()`. Below, for example, is what the metadata looks like. Can you tell how many speeches there are?
```{r sotu-meta}
# Let's take a look at the state of the union metadata
str(sotu_meta)
```
In order to work with the speech texts, and to later practice reading text files from disk, we use the function `sotu_dir()` to write the texts out. By default this function writes to a temporary directory, with one speech in each file. It returns a character vector in which each element is the path to an individual speech file. We save this vector in the `file_paths` variable.
```{r file-paths}
# sotu_dir writes the text files to disk in a temporary dir,
# but you could also specify a location.
file_paths <- sotu_dir()
head(file_paths)
```
Now that we have the files on disk and a vector of filepaths, we can pass this vector directly into `readtext` to read the texts into a new variable.
```{r readtext}
# let's read in the files with readtext
sotu_texts <- readtext(file_paths)
```
`readtext()` generated a data frame with two columns: `doc_id`, which holds the name of the document, and `text`, which holds the actual text:
```{r readtext-view}
glimpse(sotu_texts)
```
To work with a single table, we combine the text and metadata. `sotu_texts` is organized in alphabetical order, so we sort the metadata in `sotu_meta` to match that order and then bind the columns.
```{r text-meta-combine}
sotu_whole <-
sotu_meta %>%
arrange(president) %>% # sort metadata
bind_cols(sotu_texts) %>% # combine with texts
as_tibble() # convert to tibble for better screen viewing
glimpse(sotu_whole)
```
Now that we have our data combined, we can start looking at the text. Typically quite a bit of effort goes into pre-processing the text for further analysis. Depending on the quality of your data and your goal, you might for example need to:
- replace certain characters or words,
- remove urls or certain numbers, such as phone numbers,
- clean up misspellings or errors,
- etc.
There are several ways to handle this sort of cleaning; we'll show a few examples below.
## String operations
R has many functions available for manipulating strings, including `grep` and `paste`, which come with the base R install.
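As a quick illustration of what base R alone can do (the character vector below is just a made-up example):
```{r base-r-strings}
# a small made-up character vector
presidents <- c("George Washington", "Abraham Lincoln", "Barack Obama")

grepl("Lincoln", presidents)        # logical: does each element match?
grep("Lincoln", presidents)         # integer: which elements match?
paste(presidents, collapse = "; ")  # combine into a single string
```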
Here we will take a look at the `stringr` package, which is part of the `tidyverse`. It builds on functionality from the `stringi` package, which is perhaps one of the most comprehensive string-manipulation packages available.
Below are examples of a few functions that might be useful.
### Counting occurrences
`str_count` takes a character vector as input and by default counts the number of pattern matches in a string.
How many times does the word "citizen" appear in each of the speeches?
```{r str-count-citizen}
sotu_whole %>%
mutate(n_citizen = str_count(text, "citizen"))
```
It is also possible to use regular expressions. For example, this is how we would check how many times either "citizen" or "Citizen" appears in each of the speeches:
```{r str-count-citizen-regex}
sotu_whole %>%
mutate(n_citizen = str_count(text, "citizen"),
         n_cCitizen = str_count(text, "[Cc]itizen"))
```
A full treatment of [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) is beyond the scope of this introduction. However, we want to point out the `str_view()` function, which can help you build the correct expression. Also see [RegExr](https://regexr.com/), an online tool to learn, build, and test regular expressions.
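For example, here is a minimal sketch (with a few made-up strings) of how `str_view()` shows where a pattern matches:
```{r str-view-demo, eval=FALSE}
# highlight where the pattern matches in each string
str_view(c("citizen", "Citizen", "CITIZEN", "city"), "[Cc]itizen")
```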
When used with `boundary()` as the pattern, `str_count()` can count different entities such as "character", "line_break", "sentence", or "word". Here we add a new column to the dataframe indicating how many words are in each speech:
```{r str-word-count}
sotu_whole %>%
mutate(n_words = str_count(text, boundary("word")))
```
>>> CHALLENGE: Use the code above and add another column `n_sentences` in which you calculate the number of sentences per speech. Then create a third column `avg_word_per_sentence`, in which you calculate the number of words per sentence for each speech. Finally, use `filter` to find which speeches have the shortest and longest average sentence length, and what those lengths are.
```{r avg-sentence-length, eval=FALSE, echo=FALSE}
sotu_whole %>%
mutate(n_words = str_count(text, boundary("word")),
n_sentences = str_count(text, boundary("sentence")),
avg_word_per_sentence = n_words/n_sentences) %>%
filter(avg_word_per_sentence %in% range(avg_word_per_sentence))
```
### Detecting patterns
`str_detect` also looks for patterns, but instead of counts it returns a logical vector (TRUE/FALSE) indicating whether the pattern is found or not. So we typically want to use it with the `filter` "verb" from `dplyr`.
What are the names of the documents where the words "citizen" and "Citizen" do **not** occur?
```{r str-detect}
sotu_whole %>%
  filter(!str_detect(text, "[Cc]itizen")) %>%
select(doc_id)
```
### Extracting words
The `word()` function extracts words from character strings. By default it returns the first word. If, for example, we wanted to extract the first 5 words of each speech by Woodrow Wilson, we provide the `end` argument like this:
```{r extract-words}
sotu_whole %>%
filter(president == "Woodrow Wilson") %>% # sample a few speeches as demo
pull(text) %>% # we pull out the text vector only
word(end = 5)
```
### Replacing and removing characters
Now let's take a look at text cleaning. We will first remove the newline characters (`\n`). We use the `str_replace_all()` function to replace all occurrences of the `\n` pattern with a whitespace `" "`. We escape the backslash (writing `\\n`) so that the pattern is passed on correctly and interpreted by the regular expression engine as a newline character.
```{r str-remove-newl}
sotu_whole %>%
filter(president == "Woodrow Wilson") %>%
pull(text) %>%
str_replace_all("\\n", " ") %>% # replace newline
word(end = 5)
```
This looks better, but we still have a problem extracting exactly 5 words, because the extra whitespaces are counted as words. So let's get rid of whitespace at the beginning and end of each string, as well as repeated whitespace within it, using the `str_squish()` function.
```{r str-remove-spaces}
sotu_whole %>%
filter(president == "Woodrow Wilson") %>%
pull(text) %>%
str_replace_all("\\n", " ") %>%
str_squish() %>% # remove whitespaces
word(end = 5)
```
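Other cleaning steps mentioned above, like removing URLs, work the same way. Here is a minimal sketch with a made-up string; the regular expression is one simple possibility and will not cover every URL format:
```{r str-remove-url}
# strip out anything that looks like an http(s) URL
str_remove_all("See the report at https://example.org/report for details.",
               "https?://\\S+")
```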
(For spell checking, take a look at https://CRAN.R-project.org/package=spelling or https://CRAN.R-project.org/package=hunspell.)
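As a minimal sketch of what a spell check with `hunspell` might look like (made-up sentence, default US English dictionary, assuming the package is installed):
```{r hunspell-check, eval=FALSE}
library(hunspell)
hunspell("This sentance contains two mispelled words.")  # list of misspelled words per string
hunspell_suggest(c("sentance", "mispelled"))             # suggested corrections
```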
## Tokenize
A very common part of preparing your text for analysis involves *tokenization*. Currently each row of our data contains a single text with its metadata, so the entire speech is the unit of observation. When we tokenize, we break the text down into "tokens" (most commonly single words), so that each row contains a single word with its metadata as the unit of observation.
`tidytext` provides a function `unnest_tokens()` to convert our speech table into one that is tokenized. It takes three arguments:
- a tibble or data frame which contains the text;
- the name of the newly created column that will contain the tokens;
- the name of the column within the data frame which contains the text to be tokenized.
In the example below we name the new column to hold the tokens `word`. Remember that the column that holds the speech is called `text`.
```{r tokenize}
tidy_sotu <- sotu_whole %>%
unnest_tokens(word, text)
tidy_sotu
```
Note that the `unnest_tokens()` function didn't just tokenize our texts at the word level. It also lowercased each word and stripped out the punctuation. We can tell it not to do this by adding the following parameters:
```{r punct-lowercase}
# Word tokenization with punctuation and no lowercasing
sotu_whole %>%
unnest_tokens(word, text, to_lower = FALSE, strip_punct = FALSE)
```
We can also tokenize the text at the level of ngrams or sentences, if those are the best units of analysis for our work.
```{r sentence-token}
# Sentence tokenization
sotu_whole %>%
unnest_tokens(sentence, text, token = "sentences", to_lower = FALSE) %>%
select(sentence)
# N-gram tokenization as trigrams
sotu_whole %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
select(trigram)
```
(Take note that the trigrams are generated by a moving 3-word window over the text.)
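To see that moving window more concretely, here is a toy example with a single made-up sentence:
```{r trigram-toy}
tibble(text = "we the people of the united states") %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
```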
## Stopwords
Another common task in preparing text for analysis is removing stopwords. Stopwords are highly common words that are considered to carry little relevant information about the content of a text.
Let's look at the stopwords that come with the `tidytext` package to get a sense of what they are.
```{r stopwords}
stop_words
```
These are English stopwords, drawn from different lexicons ("onix", "SMART", and "snowball"). Depending on the type of analysis you're doing, you might leave these words in, or alternatively use your own curated list of stopwords. Stopword lists exist for many languages; see for example the [`stopwords` package in R](https://cran.r-project.org/package=stopwords).
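As a quick illustration, here is a sketch of how you would pull a stopword list from that package (assuming it is installed):
```{r stopwords-pkg, eval=FALSE}
library(stopwords)
head(stopwords(language = "en", source = "snowball"))  # English, snowball lexicon
head(stopwords(language = "de", source = "snowball"))  # German
```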
For now we will remove the English stopwords provided by `tidytext`. For this we use `anti_join()` from `dplyr`: we join and return all rows from our table of tokens `tidy_sotu` where there are **no matching values** in our list of stopwords. Both tables have one column name in common, `word`, so by default the join will be on that column, and `dplyr` will tell us so.
```{r remove-stopwords}
tidy_sotu_words <- tidy_sotu %>%
anti_join(stop_words)
tidy_sotu_words
```
If we compare this with `tidy_sotu` we see that the records with words like "of", "the", "and", "in" are now removed.
We also went from `r nrow(tidy_sotu)` to `r nrow(tidy_sotu_words)` rows, which means we had a lot of stopwords in our corpus. This is a huge removal, so for serious analysis we might want to scrutinize the stopword list carefully and determine whether removing all of these words is appropriate.
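One way to scrutinize what was removed is to keep only the rows that *do* match the stopword list with `semi_join()` and count them; a quick sketch:
```{r count-stopwords}
# which stopwords occurred most often in the corpus?
tidy_sotu %>%
  semi_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```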
## Word Stemming
Another way you may want to clean your data is to stem your words, that is, to reduce them to their word stem or root form, for example reducing *fishing*, *fished*, and *fisher* to the stem *fish*.
`tidytext` does not implement its own word stemmer. Instead it relies on separate packages like `hunspell` or `SnowballC`.
We will give an example here for the `SnowballC` package which comes with a function `wordStem`. (`hunspell` appears to run much slower, and it also returns a list instead of a vector, so in this context `SnowballC` seems to be more convenient.)
```{r stemming}
library(SnowballC)
tidy_sotu_words %>%
mutate(word_stem = wordStem(word))
```
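For comparison, here is a minimal sketch of the `hunspell` stemmer, which returns a list because a word can have more than one possible stem (assuming the package is installed):
```{r hunspell-stem, eval=FALSE}
library(hunspell)
hunspell_stem(c("fishing", "fished", "fisher"))
```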
Lemmatization takes this another step further. While a stemmer operates on a single word without knowledge of the context, lemmatization attempts to discriminate between words which have different meanings depending on part of speech. For example, the word "better" has "good" as its lemma, something a stemmer would not detect.
For lemmatization in R, you may want to take a look at the [`koRpus`](https://CRAN.R-project.org/package=koRpus) package, another [comprehensive R package for text analysis](https://cran.r-project.org/web/packages/koRpus/vignettes/koRpus_vignette.html). It allows you to use [TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/), a widely used part-of-speech tagger. For full functionality of the R package, a local installation of TreeTagger is recommended.