---
params:
lesson: "Lesson 8"
title: "Scraping and manipulating text strings"
bookchapter_name: "Cheat sheet for the `stringr` package"
bookchapter_section: "https://stringr.tidyverse.org/"
functions: "`str_which`,`str_detect`,`str_locate`,`str_view`,`str_sub`"
packages: "`stringr`"
# end inputs ---------------------------------------------------------------
header-includes: \usepackage{float}
always_allow_html: yes
output:
html_document:
code_folding: show
---
```{r, setup, echo = FALSE, cache = FALSE, include = FALSE}
options(width=100)
knitr::opts_chunk$set(
eval = T, # run all code
echo = TRUE, # show code chunks in output
tidy = TRUE, # make output as tidy
message = FALSE, # mask all messages
warning = FALSE, # mask all warnings
comment = "",
tidy.opts=list(width.cutoff=100), # set width of code chunks in output
size="small" # set code chunk size
)
```
\
<!-- install packages -->
```{r, load packages, eval=T, include=T, cache=F, message=F, warning=F, results='hide',echo=F}
# install.packages("pacman")
pacman::p_load(stringr,stringi,dplyr,reprex,xml2,rvest)
# reprex = for rendering text string in HTML
```
<!-- ____________________________________________________________________________ -->
<!-- ____________________________________________________________________________ -->
<!-- ____________________________________________________________________________ -->
<!-- start body -->
# `r paste0(params$lesson,": ",params$title)`
\
Functions for `r params$lesson`
`r params$functions`
\
Packages for `r params$lesson`
`r params$packages`
\
# Agenda
Use the `r params$packages` package to detect, subset, replace, and otherwise manipulate character strings in `R`. These tools are useful for scraping text from webpages, searching PDFs and text files for given characters and words, mining genomics data, etc.
[`r params$bookchapter_name`](`r params$bookchapter_section`).
\
<!-- ----------------------- image --------------------------- -->
<div align="center">
<img src="img/stringr.png" style="width:50%">
</div>
<!-- ----------------------- image --------------------------- -->
\
<!-- end yaml template------------------------------------------------------- -->
Install and load the necessary packages
```{r}
# install.packages("pacman") # uncomment and install this first
pacman::p_load(stringr,stringi,dplyr,reprex,xml2,rvest)
```
First, we need some text data. Since this lesson is about strings, our text sample will be all of the text from the strings chapter of the [R for Data Science textbook](https://r4ds.had.co.nz/strings.html).
```{r, eval=T}
require(xml2) # read html data
require(rvest) # select html elements
url <- "https://r4ds.had.co.nz/strings.html"
txt <- url %>% read_html() %>% html_text() # scrape web text from url
txt %>% str() # structure: a character vector of length 1
txt %>% str_length() # number of characters in the string (not the vector length)
```
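If you are offline, the same scraping steps can be tried on a literal HTML string. This is a minimal sketch: `demo_html`, `demo_page`, and `demo_paras` are made-up names for illustration, and `html_elements()` assumes `rvest` version 1.0.0 or later.

```{r}
library(xml2)
library(rvest)

# hypothetical HTML snippet standing in for a real webpage
demo_html <- "<html><body><h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>"

demo_page <- read_html(demo_html)                        # parse the HTML string
demo_paras <- html_text(html_elements(demo_page, "p"))   # text of each <p> element
demo_paras
```

Selecting elements before calling `html_text()` returns one string per element rather than the whole page as a single string.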
# Detecting strings
Check for and locate string patterns using `str_detect`, `str_which`, and `str_locate`
```{r}
pat <- "strings" # string pattern to search for
txt %>% str_detect(pat) # returns TRUE/FALSE for each element of the vector
txt %>% str_which(pat) # index of each element that contains the pattern
txt %>% str_locate(pat) # start and end positions of the first instance of pattern
txt %>% str_locate_all(pat) # start and end positions of all instances
```
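Because `txt` is a single long string, the difference between these functions is easier to see on a short vector. A small sketch using a made-up vector, `fruit_demo`, which is not part of the scraped page:

```{r}
library(stringr)

# hypothetical toy vector for illustration
fruit_demo <- c("apple pie", "banana", "pineapple")

str_detect(fruit_demo, "apple")              # TRUE FALSE TRUE
str_which(fruit_demo, "apple")               # 1 3
fruit_loc <- str_locate(fruit_demo, "apple") # one row per element; NA where no match
fruit_loc
```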
# Subsetting strings
Subset and cut up strings into manageable pieces
```{r, collapse = T}
# subset a portion of the string based on character position
txt %>% str_sub(
  txt %>% str_locate(pat) # use the start/end positions from above
)
# replace the characters at positions 1 to 2 with new text (this modifies txt)
str_sub(txt,1,2) <- "INSERT TEXT AT POSITION"
```
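The same operations on a short string make the positions easier to follow. A minimal sketch, where `demo_txt` and `demo_loc` are made-up names for illustration:

```{r}
library(stringr)

demo_txt <- "Hello world"                    # hypothetical toy string

str_sub(demo_txt, 1, 5)                      # "Hello"
str_sub(demo_txt, -5, -1)                    # negative positions count from the end: "world"

demo_loc <- str_locate(demo_txt, "world")    # start/end matrix
demo_piece <- str_sub(demo_txt, demo_loc)    # "world"

str_sub(demo_txt, 1, 5) <- "Goodbye"         # assignment form replaces in place
demo_txt                                     # "Goodbye world"
```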
Truncate text to a given width, ending with an ellipsis
```{r}
txt_short <- txt %>% str_trunc(20) # the width includes the 3-character ellipsis, so it cannot be shorter than 3
txt_short
```
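A quick sketch of how the width is counted, using a made-up ten-character string (`demo_long` is not part of the lesson data):

```{r}
library(stringr)

demo_long <- "abcdefghij"                  # hypothetical 10-character string

str_trunc(demo_long, 6)                    # "abc..." (3 characters + the ellipsis)
str_trunc(demo_long, 6, side = "left")     # "...hij" (keep the end instead)
```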
Return the elements of a character vector that contain the pattern
```{r, eval=F}
txt %>% str_subset(pat)
```
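With a single-element vector like `txt`, `str_subset` simply returns the whole string. On a vector with several elements it acts as a filter; a sketch with a made-up vector, `subset_demo`:

```{r}
library(stringr)

subset_demo <- c("apple pie", "banana", "pineapple")   # hypothetical toy vector

str_subset(subset_demo, "apple")                  # "apple pie" "pineapple"
str_subset(subset_demo, "apple", negate = TRUE)   # "banana"
```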
Extract string patterns as characters
```{r}
txt %>% str_extract(pat) # pull the first match out of the string
txt %>% str_extract_all(pat, simplify = F) # extract all matches as a list; set simplify = T to return a matrix
txt %>% str_match(pat) # extract the first match (and any capture groups) as a matrix
txt %>% str_match_all(pat) # extract all matches as a list of matrices
```
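The difference between `str_extract` and `str_match` only shows up when the pattern has capture groups. A small sketch with made-up strings (`demo_dates`) and a made-up date pattern (`date_pat`):

```{r}
library(stringr)

# hypothetical strings and pattern for illustration
demo_dates <- c("released 2017-01-05", "no date here")
date_pat <- "(\\d{4})-(\\d{2})-(\\d{2})"     # three capture groups: year, month, day

str_extract(demo_dates, date_pat)            # "2017-01-05" NA
date_mat <- str_match(demo_dates, date_pat)  # column 1 = full match, columns 2-4 = groups
date_mat
```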
View an HTML rendering of the text using `str_view()`
```{r}
# visualise the first 100 characters
txt %>% str_sub(1,100) %>%
str_view(" ")
```
\
Split the text into separate components at every instance of the pattern. Each component can then be processed on its own, e.g. with `str_sub`
```{r, eval=T}
# split into a matrix at every instance of pattern
txt_split <- txt %>% str_split_fixed(pat, n = Inf)
txt_split %>% dim() # get dimensions of matrix
txt_split[1,20] # view 1st row and 20th column
```
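To compare the list and matrix forms of splitting, here is a sketch on a made-up string, `demo_sentence`:

```{r}
library(stringr)

demo_sentence <- "one, two, three"                          # hypothetical toy string

split_list <- str_split(demo_sentence, ", ")                # list holding one character vector
split_mat <- str_split_fixed(demo_sentence, ", ", n = 2)    # 1 x 2 matrix; splitting stops after n pieces

split_list[[1]]   # "one" "two" "three"
split_mat         # "one" "two, three"
```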
# Mutating and joining strings
Replace every instance of the pattern with a new string
```{r, eval=F}
repl <- "when you really need that coffee hit" # replacement character string
txt %>%
str_replace_all(pat,repl)
```
You'll notice that the first instance of the pattern in the returned text is capitalised. Matching is case-sensitive by default, so the replacement skips it. We can tell `R` to match all instances of the pattern regardless of case by wrapping it in `regex()` with `ignore_case = TRUE`
```{r, eval=F}
pat_all <- regex(pat, ignore_case = T)
pat_all
txt %>% str_replace_all(pat_all,repl)
```
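The effect is easy to check on a short sentence. A minimal sketch, where `demo_case` is a made-up string:

```{r}
library(stringr)

demo_case <- "Strings are fun. I like strings."   # hypothetical toy string

# case-sensitive: the capitalised instance is left alone
str_replace_all(demo_case, "strings", "coffee")
# "Strings are fun. I like coffee."

# case-insensitive: both instances are replaced
str_replace_all(demo_case, regex("strings", ignore_case = TRUE), "coffee")
# "coffee are fun. I like coffee."
```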
# Further useful functions
Duplicate string
```{r}
# use one piece of the smaller, split text (1st row, 5th column)
txt_s <- txt_split[1, 5]
txt_s %>% str_dup(3) # repeat the string 3 times
```
Removing white space
```{r}
txt_s %>% str_replace_all(" ","") # remove all space characters
txt_s %>% str_trim(side="both") # strip white space from both ends only
```
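`stringr` also provides `str_squish`, which trims both ends and collapses repeated white space inside the string. A sketch on a made-up string, `demo_ws`:

```{r}
library(stringr)

demo_ws <- "  too   many    spaces  "   # hypothetical toy string

str_trim(demo_ws)     # "too   many    spaces" (inner spacing kept)
str_squish(demo_ws)   # "too many spaces"
```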
## Alternative functions from `stringi` package
```{r}
require(stringi)
txt_s %>%
  stri_replace_all_charclass("\\p{WHITE_SPACE}","") # remove every white space character, including tabs and newlines
txt_s %>% str_replace_na() # stringr function: convert missing values (NA) into the string "NA"
```
## Including vectors within strings
Insert the value of a variable into a string without breaking up the character string
```{r}
vect <- 1000
str_interp("For including vectors like this ${vect} when you can't break the character string")
```
This is useful when the string itself contains quotes that would otherwise need to be broken up, e.g. HTML tags
```{r}
str_interp("<div style=\"color:#F90F40;\"> <strong> Total count </strong> ${vect} </div>")
```
Supply the values as a list
```{r}
str_interp("First value, ${v1}, Second value, ${v2*2}.",
list(v1 = 10, v2 = 20)
)
```
Or as a data frame, using `$[format]{expression}` to format the result
```{r}
str_interp(
"Values are $[.2f]{max(Sepal.Width)} and $[.2f]{min(Sepal.Width)}.",
iris
)
```
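The `$[.2f]` part is a `sprintf`-style format applied to the evaluated expression. The sketch below shows it on a made-up value and, for comparison, the same result with `str_glue`, another `stringr` function that uses `{}` placeholders. The names `m`, `interp_out`, and `glue_out` are hypothetical.

```{r}
library(stringr)

# format a value to two decimal places with str_interp
interp_out <- str_interp("Mean is $[.2f]{m}", list(m = 3.14159))
interp_out                      # "Mean is 3.14"

# str_glue evaluates R code inside {} instead of using a format specifier
glue_out <- str_glue("Mean is {round(m, 2)}", m = 3.14159)
glue_out                        # Mean is 3.14
```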
# Regular expressions, i.e. regex
You can find in-depth information on how to parse character vectors and find specific character patterns using regular expressions in the [`R` for Data Science book](https://r4ds.had.co.nz/strings.html). There's also [a handy regex tool](https://regex101.com/r/ksY7HU/2) for live text parsing.
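
As a small taste, the sketch below uses two common regex building blocks, `\\d+` (one or more digits) and `^` (start of string), on a made-up vector, `demo_regex`:

```{r}
library(stringr)

# hypothetical toy vector for illustration
demo_regex <- c("Chapter 14: Strings", "Section 14.2.1", "no digits")

str_extract(demo_regex, "\\d+")        # first run of digits: "14" "14" NA
str_extract_all(demo_regex, "\\d+")    # every run of digits, as a list
str_detect(demo_regex, "^Chapter")     # TRUE FALSE FALSE
```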