18_text.qmd

# Text {#sec-rtext}

```{r}
#| include: false
library(dplyr)
library(readr)
library(scales)
library(forcats)
library(ggplot2)
```

::: {.callout .callout-note}
Module originally written by Connor Jerzak.
:::

## Where are we? Where are we headed? {.unnumbered}

Up till now, you should have covered:

-   Loading in data;
-   `R` notation;
-   Matrix algebra.

## Review

-   `"` and `'` are usually equivalent.
-   `<-` and `=` are usually interchangeable[^18_text-1]. (`x <- 3` is equivalent to `x = 3`, although the former is more preferred because it explicitly states the assignment).
-   Use `(` `)` when you are giving input to a function:

[^18_text-1]: Only equal signs are allowed to define the values of a functions' argument

```{r}
# my_results <- FunctionName(FunctionInputs)
```

```         
note  `c(1,2,3)` is inputting three numbers in the function `c`
```

-   Use `{` `}` when you are defining a function or writing a `for` loop:

```{r}
# function
MyFunction <- function(InputMatrix) {
  TempMat <- InputMatrix
  for (i in 1:5) {
    TempMat <- t(TempMat) %*% TempMat / 10
  }
  return(TempMat)
}
myMat <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
print(MyFunction(myMat))

# loop
x <- c()
for (i in 1:20) {
  x[i] <- i
}
print(x)
```

## Goals for today

Today, we will learn more about using text data. Our objectives are:

-   Reading and writing in text in `R`.
-   To learn how to use paste and sprintf;
-   To learn how to use regular expressions;
-   To learn about other tools for representing + analyzing text in `R`.

## Reading and writing text in R

-   To read in a text file, use readLines

```         
readLines("~/Downloads/Carboxylic acid - Wikipedia.html")
```

-   To write a text file, use:

```         
write.table(my_string_vector, "~/mydata.txt", sep="\t") 
```

## `paste()` and `sprintf()`

paste and sprintf are useful commands in text processing, such as for automatically naming files or automatically performing a series of command over a subset of your data. Table making also will often need these commands.

Paste concatenates vectors together.

```{R}
# use collapse for inputs of length > 1
my_string <- c("Not", "one", "could", "equal")
paste(my_string, collapse = " ")

# use sep for inputs of length == 1
paste("Not", "one", "could", "equal", sep = " ")
```

For more sophisticated concatenation, use sprintf. This is very useful for automatically making tables.

```{R}
sprintf("Coefficient for %s: %.3f (%.2f)", "Gender", 1.52324, 0.03143)

# %s is replaced by a character string
# %.3f is replaced by a floating point digit with 3 decimal places
# %.2f is replaced by a floating point digit with 2 decimal places
```

## Regular expressions

A regular expression is a special text string for describing a search pattern. They are most often used in functions for detecting, locating, and replacing desired text in a corpus.

Use cases:

1.  TEXT PARSING. E.g. I have 10000 congressional speaches. Find all those which mention Iran.
2.  WEB SCRAPING. E.g. Parse html code in order to extract research information from an online table.
3.  CLEANING DATA. E.g. After loading in a dataset, we might need to remove mistakes from the dataset, orsubset the data using regular expression tools.

Example in `R`. Extract the tweet mentioning Indonesia.

```{r}
s1 <- "If only Bradley's arm was longer. RT"
s2 <- "Share our love in Indonesia and in the World. RT if you agree."
my_string <- c(s1, s2)
grepl(my_string, pattern = "Indonesia")
my_string[grepl(my_string, pattern = "Indonesia")]
```

Key point: Many R commands use regular expressions. See `?grepl`. Assume that `x` is a character vector and that `pattern` is the target pattern. In the earlier example, `x` could have been something like `my_string` and `pattern` would have been "`Indonesia`". Here are other key uses:

1.  DETECT PATTERNS. `grepl(pattern, x)` goes through all the entries of `x` and returns a string of TRUE and FALSE values of the same size as `x`. It will return a `TRUE` whenever that string entry has the target pattern, and `FALSE` whenever it doesn't.

2.  REPLACE PATTERNS. `gsub(pattern, x, replacement)` goes through all the entries of `x` replaces the `pattern` with `replacement`.

```{r}
gsub(
  x = my_string,
  pattern = "o",
  replacement = "AAAA"
)
```

3.  LOCATE PATTERNS. `regexpr(pattern, text)` goes through each element of the character string. It returns a vector of the same length, with the entries of the vector corresponding to the location of the first pattern match, or a -1 if no match was obtained.

```{r}
regex_object <- regexpr(pattern = "was", text = my_string)
attr(regex_object, "match.length")
attr(regex_object, "useBytes")
regexpr(pattern = "was", text = my_string)[1]
regexpr(pattern = "was", text = my_string)[2]
```

Seems simple? The problem: the patterns can get pretty complex!

### Character classes

Some types of symbols are stand in for some more complex thing, rather than taken literally.

`[[:digit:]]` Matches with all digits.

`[[:lower:]]` Matches with lower case letters.

`[[:alpha:]]` Matches with all alphabetic characters.

`[[:punct:]]` Matches with all punctuation characters.

`[[:cntrl:]]` Matches with "control" characters such as `\n`, `\r`, etc.

Example in `R`:

```{r}
my_string <- "Do you think that 34% of apples are red?"
gsub(my_string, pattern = "[[:digit:]]", replace = "DIGIT")
gsub(my_string, pattern = "[[:alpha:]]", replace = "")
```

### Special Characters.

Certain characters (such as `., *, \`) have special meaning in the regular expressions framework (they are used to form conditional patterns as discussed below). Thus, when we want our pattern to explicitly include those characters as characters, we must "escape" them by using \\ or encoding them in \\Q...\\E.

Example in `R`:

```{r}
my_string <- "Do *really* think he will win?"
gsub(my_string, pattern = "\\*", replace = "")
```

```{r}
my_string <- "Now be brave! \n Dread what comrades say of you here in combat! "
gsub(my_string, pattern = "\\\n", replace = "")
```

### Conditional patterns

`[]` The target characters to match are located between the brackets. For example, `[aAbB]` will match with the characters `a, A, b, B`.

`[^...]` Matches with everything except the material between the brackets. For example, `[^aAbB]` will match with everything but the characters `a, A, b, B`.

`(?=)` Lookahead -- match something that IS followed by the pattern.

`(?!)` Negative lookahead --- match something that is NOT followed by the pattern.

`(?<=)` Lookbehind -- match with something that follows the pattern.

```{r}
my_string <- "Do you think that 34%of the 23%of apples are red?"
gsub(my_string, pattern = "(?<=%)", replace = " ", perl = TRUE)
```

```{r}
my_string <- c(
  "legislative1_term1.png",
  "legislative1_term1.pdf",
  "legislative1_term2.png",
  "legislative1_term2.pdf",
  "term2_presidential1.png",
  "presidential1.png",
  "presidential1_term2.png",
  "presidential1_term1.pdf",
  "presidential1_term2.pdf"
)

grepl(my_string, pattern = "^(?!presidential1).*\\.png", perl = TRUE)
```

-   Indicates which file names don't start with `presidential1` but do end in `.png`
-   `^` indicates that the pattern should start at the beginning of the string.
-   `?!` indicates negative lookahead -- we're looking for any pattern NOT following presidential1 which meets the subsequent conditions. (see below)
-   The first `.` indicates that, following the negative lookahead, there can be any characters and the \* says that it doesn't matter how many. Note that we have to escape the . in `.png`. (by writing `\\.` instead of just `.`)

You will have the chance to try out some regular expressions for yourself at the end!

## Representing Text

In courses and research, we often want to analyze text, to extract meaning out of it. One of the key decisions we need to make is how to represent the text as numbers. Once the text is represented numerically, we can then apply a host of statistical and machine learning methods to it. Those methods are discussed more in the Gov methods sequence (Gov 2000-2003). Here's a summary of the decisions you must make:

1.  WHICH TEXT TO USE? Which text do I want to analyze? What is my universe of documents?
2.  HOW TO REPRESENT THE TEXT NUMERICALLY? How do I use numbers to represent different things about the text?
3.  HOW TO ANALYZE THE NUMERICAL REPRESENTATION? How do I extract meaning out of the numerical representation?

Representing text numerically.

1.  Document term matrix. The document term matrix (DTM) is a common method for representing text. The DTM is a matrix. Each row of this matrix corresponds to a document; each column corresponds to a word. It is often useful to look at summary statistics such as the percentage of speaches in which a Democratic lawmaker used the word "inequality" compared to a Republican; the DTM would be very helpful for this and other tasks.

```{R}
doc1 <- "Rage---Goddess, sing the rage of Peleus’ son Achilles,
         murderous, doomed, that cost the Achaeans countless losses,
         hurling down to the House of Death so many sturdy souls,
         great fighters’ souls."
doc2 <- "And fate? No one alive has ever escaped it,
         neither brave man nor coward, I tell you,
         it's born with us the day that we are born."
doc3 <- "Many cities of men he saw and learned their minds,
         many pains he suffered, heartsick on the open sea,
         fighting to save his life and bring his comrades home."
```

```{r}
DocVec <- c(doc1, doc2, doc3)
```

Now we can use utility functions in the `tm` package:

```{R}
#| eval: false
library(tm)
DocCorpus <- Corpus(VectorSource(DocVec))
DTM1 <- inspect(DocumentTermMatrix(DocCorpus))
```

Consider the effect of different "pre-processing" choices on the resulting DTM!

```{r}
#| eval: false
DocVec <- tolower(DocVec)
DocVec <- gsub(DocVec, pattern = "[[:punct:]]", replace = " ")
DocVec <- gsub(DocVec, pattern = "[[:cntrl:]]", replace = " ")
DocCorpus <- Corpus(VectorSource(DocVec))
DTM2 <- inspect(DocumentTermMatrix(DocCorpus,
  control = list(stopwords = TRUE, stemming = TRUE)
))
```

Stemming is the process of reducing inflected/derived words to their word stem or base (e.g. stemming, stemmed, stemmer --\> stem\*)

## Important packages for parsing text

1.  rvest -- Useful for downloading and manipulating HTML and XM.
2.  tm -- Useful for converting text into a numerical representation (forming DTMs).
3.  stringr -- Useful for string parsing.

## Exercises {.unnumbered}

#### 1 {.unnumbered}

Figure out why this command does what it does:

`r sprintf("%s of spontaneous events are %s in the mind.          Really, %.2f?",          "15.03322123", "puzzles", 15.03322123)`

#### 2 {.unnumbered}

Why does this command not work?

```{r}
try(sprintf(
  "%s of spontaneous events are %s in the mind. Really, %.2f?",
  "15.03322123", "puzzles", "15.03322123"
), TRUE)
```

#### 3 {.unnumbered}

Using `grepl`, these materials, Google, and your friends, describe what the following command does. What changes when `value = FALSE`?

```{r}
grep("'",
  c("To dare is to lose one's footing momentarily.", "To not dare is to lose oneself."),
  value = TRUE
)
```

#### 4 {.unnumbered}

Write code to automatically extract the file names that DO end start with presidential and DO end in .pdf

```{r}
my_string <- c(
  "legislative1_term1.png",
  "legislative1_term1.pdf",
  "legislative1_term2.png",
  "legislative1_term2.pdf",
  "term2_presidential1.png",
  "presidential1.png",
  "presidential1_term2.png",
  "presidential1_term1.pdf",
  "presidential1_term2.pdf"
)
```

#### 5 {.unnumbered}

Using the same string as in the above, write code to automatically extract the file names that end in .pdf and that contain the text `term2`.

```{r}
# Your code here
```

#### 6 {.unnumbered}

Combine these two strings into a single string separated by a "-". Desired output: "The carbonyl group in aldehydes and ketones is an oxygen analog of the carbon–carbon double bond."

```{r}
string1 <- "The carbonyl group in aldehydes and ketones
            is an oxygen analog of the carbon"
string2 <- "–carbon double bond."
```

#### 7 {.unnumbered}

Challenge problem! Download this webpage <https://en.wikipedia.org/wiki/Odyssey>

-   Read the html file into your R workspace.
-   Remove all of the htlm tags (you may need Google to help with this one).
-   Remove all punctuation.
-   Make all the characters lower case.
-   Do this same process with this webpage (https://en.wikipedia.org/wiki/Iliad).
-   Form a document term matrix from the two resulting text strings.

```{r}
# Your code here
```