# Sentiment Analysis
```{r setup_sentiment, echo=FALSE}
knitr::opts_chunk$set(cache.rebuild=T, cache=T)
```
<div style='text-align:center; font-size: 10rem; color:#990024CC'>
<i class="far fa-smile-beam fa-10x"></i>
<i class="far fa-meh-blank fa-10x"></i>
<i class="far fa-angry fa-10x"></i>
</div>
## Basic idea
A common and intuitive approach to text is <span class="emph">sentiment analysis</span>. In a grand sense, we are interested in the emotional content of some text, e.g. posts on Facebook, tweets, or movie reviews. Most of the time the sentiment is obvious to a human reader, but if you have hundreds of thousands or millions of strings to analyze, you'd like to be able to do so efficiently.
We will use the <span class="pack">tidytext</span> package for our demonstration. It comes with a lexicon of positive and negative words that is actually a combination of multiple sources, one of which provides numeric ratings, while the others suggest different classes of sentiment.
```{r lexicon, echo=-1}
set.seed(1234)
library(tidytext)
library(dplyr)

sentiments %>% slice(sample(1:nrow(sentiments)))
```
The gist is that we are dealing with a specific, pre-defined vocabulary, so any analysis will only be as good as the lexicon. The goal is usually to assign a sentiment score to a text: possibly an overall numeric score, or a general positive or negative label. Given such scores, other analyses may then be implemented to predict sentiment via standard regression tools or machine learning approaches.
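Before getting into the issues and fuller examples, here is a minimal sketch of the scoring idea on a toy sentence. It assumes the <span class="pack">dplyr</span> and <span class="pack">tidytext</span> packages, and note that `get_sentiments("afinn")` may prompt a one-time download via the <span class="pack">textdata</span> package.
```{r toy_sentiment_sketch, eval=FALSE}
library(dplyr)
library(tidytext)

# a toy 'document'
doc = tibble(text = "I love this wonderful, horrible movie")

doc %>%
  unnest_tokens(word, text) %>%                          # one row per word
  inner_join(get_sentiments("afinn"), by = "word") %>%   # keep only lexicon words
  summarise(sentiment = sum(value))                      # overall document score
```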
## Issues
### Context, sarcasm, etc.
Now consider the following.
```{r sent_is_sick}
sentiments %>% filter(word=='sick')
```
Despite the above assigned sentiments, the word *sick* has been used since at least 1960s surfing culture as slang for positive affect. A basic approach to sentiment analysis as described here will not be able to detect slang or other context, like sarcasm. However, lots of training data for a particular context may allow one to correctly predict such sentiment. In addition, there are slang lexicons available, or one can simply add their own entries to complement any available lexicon, as sketched below.
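For example, here is a sketch of complementing the bing lexicon with a couple of slang entries (the specific words and their assigned sentiment here are purely for illustration).
```{r slang_lexicon_sketch, eval=FALSE}
library(dplyr)
library(tidytext)

# hypothetical slang entries to complement the standard lexicon
slang = tibble(
  word      = c('sick', 'gnarly'),
  sentiment = c('positive', 'positive')
)

my_lexicon = get_sentiments('bing') %>%
  filter(word != 'sick') %>%   # drop the default negative entry
  bind_rows(slang)
```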
### Lexicons
In addition, the lexicons are mostly applicable to *general* usage of English in the Western world. Some might wonder where exactly these came from, or who decided that the word *abacus* should be affiliated with 'trust'. You may start down that path by typing `?sentiments` at the console, with the <span class="pack">tidytext</span> package loaded.
## Sentiment Analysis Examples
### The first thing the baby did wrong
We demonstrate sentiment analysis with the text *The first thing the baby did wrong*, which is a very popular brief guide to parenting written by world-renowned psychologist [Donald Barthelme][Donald Barthelme], who, in his spare time, also wrote postmodern literature. This particular text concerns an issue with the baby, whose name is Born Dancin', and who likes to tear pages out of books. Her parents attempt to rectify the situation without much success, but things are finally resolved in the end. The ultimate goal will be to see how sentiment in the text evolves over time, and in general we'd expect things to end more positively than they began.
How do we start? Let's look again at the <span class="objclass">sentiments</span> data set in the <span class="pack">tidytext</span> package.
```{r inspect_sentiments}
sentiments %>% slice(sample(1:nrow(sentiments)))
```
The <span class="objclass">bing</span> lexicon provides only *positive* or *negative* labels. The AFINN lexicon, on the other hand, is numeric, with ratings from -5 to 5 in the <span class="objclass">score</span> column (named <span class="objclass">value</span> in more recent versions). The others get more imaginative, but also more problematic. Why *assimilate* is *superfluous* is beyond me. It clearly should be negative, given the [Borg](https://en.wikipedia.org/wiki/Borg_%28Star_Trek%29) connotations.
```{r superfluous}
sentiments %>%
filter(sentiment=='superfluous')
```
#### Read in the text files
But I digress. We start with the raw text, reading it in line by line. In what follows we read in all the texts (three) in a given directory, such that each element of 'text' is a work itself, i.e. `text` is a list column[^text]. The <span class="func">unnest</span> function will then unravel the works so that each entry is essentially a paragraph.
```{r baby_sentiment_importraw, echo=T}
library(tidytext)
library(tidyverse)  # for map, read_lines, unnest, etc.

barth0 =
  tibble(file = dir('data/texts_raw/barthelme', full.names = TRUE)) %>%
  mutate(text = map(file, read_lines)) %>%
  transmute(work = basename(file), text) %>%
  unnest(text)
```
#### Iterative processing
One of the things stressed in this document is the iterative nature of text analysis. You will consistently take two steps forward, and then one or two back as you find issues that need to be addressed. For example, in a subsequent step I found there were encoding issues[^encoding], so the following attempts to fix them. In addition, we want to <span class="emph">tokenize</span> the documents such that our <span class="emph">tokens</span> are sentences (e.g. as opposed to words or paragraphs). The reason for this is that I will be summarizing the sentiment at sentence level.
```{r barth_fix_encoding, echo=1:2, eval=1:2}
# Fix encoding, convert to sentences; you may get a warning message
barth = barth0 %>%
mutate(
text =
sapply(
text,
stringi::stri_enc_toutf8,
is_unknown_8bit = TRUE,
validate = TRUE
)
) %>%
unnest_tokens(
output = sentence,
input = text,
token = 'sentences'
)
save(barth, file='data/barth_sentences.RData')
```
#### Tokenization
The next step is to drill down to just the document we want, and subsequently tokenize to the word level. Along the way, I also create a sentence id so that we can group on it later.
```{r get_the_baby}
# get baby doc, convert to words
baby = barth %>%
filter(work=='baby.txt') %>%
mutate(sentence_id = 1:n()) %>%
unnest_tokens(
output = word,
input = sentence,
token = 'words',
drop = FALSE
) %>%
ungroup()
```
#### Get sentiments
Now that the data has been prepped, getting the sentiments is ridiculously easy. But that is how it is with text analysis: all the hard work goes into the data processing. Here all we need is an <span class="emph">inner join</span> of our words with a sentiment lexicon of choice. This process will only retain words that are also in the lexicon. I use the numeric AFINN lexicon here, and then compute a summed sentiment score by sentence.
```{r baby_sentiment}
# get sentiment via inner join
baby_sentiment = baby %>%
  inner_join(get_sentiments("afinn"), by = 'word') %>%
  group_by(sentence_id, sentence) %>%
  summarise(sentiment = sum(value)) %>%  # 'value' was named 'score' in older versions of the lexicon
  ungroup()
```
#### Alternative approach
As we are interested in the sentence level, it turns out that the <span class="pack">sentimentr</span> package has built-in functionality for this, and it includes more nuanced sentiment scoring that takes into account <span class="emph">valence shifters</span>, e.g. words that would negate something with positive or negative sentiment ('I do ***not*** like it').
```{r sentimentr, eval=FALSE}
library(sentimentr)

baby_sentiment = barth0 %>%
filter(work=='baby.txt') %>%
get_sentences(text) %>%
sentiment() %>%
drop_na() %>% # empty lines
mutate(sentence_id = row_number())
```
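To see the valence shifter handling in isolation, a quick sketch (assuming the <span class="pack">sentimentr</span> package):
```{r sentimentr_negation_sketch, eval=FALSE}
library(sentimentr)

# the negated sentence should receive a negative score rather than a positive one
sentiment(get_sentences(c('I like it.', 'I do not like it.')))
```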
The following visualizes sentiment over the progression of sentences (note that not every sentence will receive a sentiment score). You can read the sentence by hovering over the dot. The <span style="color:#5500ff">▬</span> is the running average.
```{r plot_sentiment, echo=FALSE}
# plot sentiment over sentences
baby_sentiment %>%
mutate(running_avg = cumsum(sentiment)/row_number()) %>%
plot_ly(width='50%') %>%
add_lines(x=~sentence_id, y=~running_avg,
color=I('#5500ff'),
line = list(shape = "spline"),
opacity=.5,
name='running average',
showlegend=T) %>%
add_lines(x=~sentence_id, y=~sentiment,
color=I('#00aaff'),
line = list(shape = "spline"),
showlegend=F) %>%
add_markers(x=~sentence_id, y=~sentiment,
color=I('#ff5500'),
marker = list(size=15),
hoverinfo=~ 'text',
text=~str_wrap(sentence),
showlegend=F) %>%
theme_plotly()
```
<br>
In general, the sentiment starts out negative as the problem is explained. It bounces back and forth a bit, but ends on a positive note. You'll also see that some sentences' context is not captured. For example, sentence 16 is 'But it didn't do any good'. However, *good* is going to be marked as positive sentiment in any lexicon by default. In addition, token length will matter: longer sentences are more likely to have some sentiment, for example.
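One hedge against the length issue is to average word sentiment per sentence rather than sum it; a sketch using the objects from above:
```{r baby_sentiment_avg_sketch, eval=FALSE}
# average rather than sum, so longer sentences aren't scored more extremely
# merely for containing more words
baby_sentiment_avg = baby %>%
  inner_join(get_sentiments('afinn'), by = 'word') %>%
  group_by(sentence_id, sentence) %>%
  summarise(sentiment = mean(value)) %>%
  ungroup()
```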
### Romeo & Juliet
For this example, I'll invite you to more or less follow along, as there is notable pre-processing that must be done. We'll look at sentiment in Shakespeare's Romeo and Juliet. I have a cleaner version in the raw texts folder, but we can take the opportunity to use the <span class="pack">gutenbergr</span> package to download it directly from Project Gutenberg, a storehouse for works that have entered the public domain.
```{r rnj_load, echo=-c(3,5)}
library(gutenbergr)
gw0 = gutenberg_works(title == "Romeo and Juliet") # look for something with this title
gw0[,1:4]
rnj = gutenberg_download(gw0$gutenberg_id)
DT::datatable(rnj,
rownames=F,
options=list(dom='tp',
autoWidth = TRUE,
columnDefs = list(list(width = '50px', targets = 0))))
```
<br>
We've got the text now, but there is still work to be done. The following is a quick and dirty approach, but see the [Shakespeare section][Shakespeare Start to Finish] to see a more deliberate one.
We first slice off the initial parts we don't want, like the title, author, etc. Then we get rid of other tidbits that would interfere, using a little regex to aid the process.
```{r rnj_clean, echo=-2}
rnj_filtered = rnj %>%
slice(-(1:49)) %>%
filter(!text==str_to_upper(text), # will remove THE PROLOGUE etc.
!text==str_to_title(text), # will remove names/single word lines
!str_detect(text, pattern='^(Scene|SCENE)|^(Act|ACT)|^\\[')) %>%
select(-gutenberg_id) %>%
unnest_tokens(sentence, input=text, token='sentences') %>%
mutate(sentenceID = 1:n())
DT::datatable(select(rnj_filtered, sentenceID, sentence),
rownames=F,
options=list(dom='tp',
autoWidth = TRUE,
columnDefs = list(list(width = '50px', targets = 0)))
)
```
<br>
The following unnests the data to word tokens. In addition, you can remove stopwords like *a*, *an*, *the*, etc., and <span class="pack">tidytext</span> comes with a <span class="objclass">stop_words</span> data frame. However, some of the stopwords have sentiments, so you would get a somewhat different result if you retained them. As Black Sheep once said, the choice is yours: you can deal with this, or you can deal with that.
```{r rnj_stopwords}
# show some of the matches
stop_words$word[which(stop_words$word %in% sentiments$word)] %>% head(20)
# remember to name the output 'word', or the anti_join below won't work without a 'by' argument
rnj_filtered = rnj_filtered %>%
unnest_tokens(output=word, input=sentence, token='words') %>%
anti_join(stop_words)
```
Now we add the sentiments via the <span class="func">inner_join</span> function. Here I use 'bing', but you can use another, and you might get a different result.
```{r rnj_sentiment}
rnj_filtered %>%
count(word) %>%
arrange(desc(n))
rnj_sentiment = rnj_filtered %>%
inner_join(sentiments)
rnj_sentiment
```
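As a point of comparison, the same join with a different lexicon, e.g. NRC, yields different coverage and categories (again, this may prompt a one-time download via the <span class="pack">textdata</span> package):
```{r rnj_nrc_sketch, eval=FALSE}
# same idea with the nrc lexicon; expect different words matched and
# multiple sentiment categories per word
rnj_filtered %>%
  inner_join(get_sentiments('nrc'), by = 'word') %>%
  count(sentiment, sort = TRUE)
```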
```{r rnj_bing}
rnj_sentiment_bing = rnj_sentiment %>%
filter(lexicon=='bing')
table(rnj_sentiment_bing$sentiment)
```
```{r sentimentr_rng, eval=FALSE, echo=FALSE, cache=FALSE}
# appears to give more (too much?) weight to positive
library(sentimentr)
test = rnj %>%
slice(-(1:49)) %>%
filter(!text==str_to_upper(text), # will remove THE PROLOGUE etc.
!text==str_to_title(text), # will remove names/single word lines
!str_detect(text, pattern='^(Scene|SCENE)|^(Act|ACT)|^\\[')) %>%
select(-gutenberg_id) %>%
pull(text) %>%
paste(collapse = ' ') %>%
get_sentences() %>%
sentiment()
ay <- list(
tickfont = list(color = "green"),
overlaying = "y",
side = "right",
title = "raw sentiment",
titlefont = list(textangle=45),
zeroline = F
)
test %>%
mutate(positivity = cumsum(if_else(sentiment>0, sentiment, 0)),
negativity = cumsum(abs(if_else(sentiment<0, sentiment, 0)))) %>%
plot_ly() %>%
add_lines(x=~sentence_id, y=~positivity, name='positive') %>%
add_lines(x=~sentence_id, y=~negativity, name='negative') %>%
layout() %>%
add_lines(x=~sentence_id, y=~sentiment, name='sentiment', yaxis = "y2", opacity=.1) %>%
# plotly by default has the second axis title backwards, but provides no fix
# other than to use annotate, which won't work because it extends the plot x's
# range; 'shift' does nothing but add or subtract to the x value, making it
# pointless
# add_annotations(text = 'plotly strikes again', x = 1440, xshift=100, y = 75, textangle=90, showarrow = F) %>%
theme_plotly() %>%
layout(
yaxis = list(title='absolute cumulative sentiment'),
yaxis2 = ay,
xaxis = list(title="sentence ID")
)
test %>%
mutate(positivity = cumsum(if_else(sentiment>0, sentiment, 0)),
negativity = cumsum(abs(if_else(sentiment<0, sentiment, 0)))) %>%
plot_ly() %>%
add_lines(x=~sentence_id, y=~positivity - negativity) %>%
layout(yaxis = list(title='sentiment')) %>%
theme_plotly()
```
Looks like this one is going to be a downer. The following visualizes the positive and negative sentiment scores as one progresses sentence by sentence through the work, using the <span class="pack">plotly</span> package. I also show the same information expressed as a difference (the fainter line).
```{r rnj_sentiment_as_game, echo=FALSE}
# Note that plotly will start disappearing with ANY change, whether this has to
# do with the code, text or whatever, so you'll need to rebuild the cache for
# this page one last time to ensure it will be displayed.
ay <- list(
tickfont = list(color = "#2ca02c40"),
overlaying = "y",
side = "right",
# title = "sentiment difference",
titlefont = list(textangle=45),
zeroline = F
)
rnj_sentiment_bing %>%
arrange(sentenceID) %>%
mutate(positivity = cumsum(sentiment=='positive'),
negativity = cumsum(sentiment=='negative')) %>%
plot_ly() %>%
add_lines(x=~sentenceID, y=~positivity, name='positive') %>%
add_lines(x=~sentenceID, y=~negativity, name='negative') %>%
add_lines(x=~sentenceID, y=~positivity-negativity, name='difference',
yaxis = "y2",
opacity=.25) %>%
layout(
xaxis = list(dtick = 200),
yaxis = list(title='absolute cumulative sentiment'),
yaxis2 = ay
) %>%
theme_plotly()
```
```{r rnj_sentiment_diff, echo=F, eval=FALSE}
rnj_sentiment_bing %>%
arrange(sentenceID) %>%
mutate(positivity = cumsum(sentiment=='positive'),
negativity = cumsum(sentiment=='negative')) %>%
plot_ly() %>%
add_lines(x=~sentenceID, y=~positivity-negativity) %>%
theme_plotly() %>%
config(displayModeBar = F)
```
<br>
It's a close game until perhaps the midway point, when negativity takes over and despair settles in for the rest of the story. By the end [[:SPOILER ALERT:]] Sean Bean is beheaded, Darth Vader reveals himself to be Luke's father, and Verbal is Keyser Söze.
## Sentiment Analysis Summary
In general, sentiment analysis can be a useful exploration of data, but it is highly dependent on the context and the tools used. Note also that 'sentiment' can be anything; it doesn't have to be positive vs. negative. Any vocabulary may be applied, so the approach has more utility than the usual implementation suggests.
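For instance, the Loughran-McDonald lexicon, available through <span class="pack">tidytext</span>, tags terms from financial text with categories like 'uncertainty' and 'litigious' rather than just positive versus negative (it is also where the *superfluous* category seen earlier comes from):
```{r loughran_sketch, eval=FALSE}
# sentiment categories beyond positive/negative
get_sentiments('loughran') %>%
  distinct(sentiment)
```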
It should also be noted that the above demonstration is largely conceptual and descriptive. While fun, it's a bit simplified. For starters, classifying words as simply positive or negative is itself not a straightforward endeavor. As we noted at the beginning, context matters, and in general you'd want to take it into account. Modern methods of sentiment analysis would use approaches like word2vec or deep learning to predict a sentiment probability, as opposed to a simple word match. Even in the above, matching sentiments to texts would probably only be a precursor to building a model predicting sentiment, which could then be applied to new data.
## Exercise
### Step 0: Install the packages
If you haven't already, install the <span class="pack">tidytext</span> package. Install the <span class="pack">janeaustenr</span> package and load both of them[^lazy].
### Step 1: Initial inspection
First you'll want to look at what we're dealing with, so take a gander at <span class="objclass">austen_books</span>.
```{r ja_inspect}
library(tidytext); library(janeaustenr)
austen_books()
austen_books() %>%
distinct(book)
```
We will examine only one text. In addition, for this exercise we'll take a little bit of a different approach, looking for a specific kind of sentiment using the NRC database. It contains 10 distinct sentiments.
```{r nrc_sentiment}
get_sentiments("nrc") %>% distinct(sentiment)
```
Now select any of those sentiments you like (one or more), and one of the texts, as follows.
```{r nrc_init, eval=FALSE}
nrc_sadness <- get_sentiments("nrc") %>%
  filter(sentiment == "sadness")
ja_book = austen_books() %>%
filter(book == "Emma")
```
```{r nrc_init_, echo=FALSE}
nrc_bad <- get_sentiments("nrc") %>%
filter(sentiment %in% c('fear', 'negative', 'sadness', 'anger', 'disgust'))
ja_book = austen_books() %>%
filter(book == "Mansfield Park")
```
### Step 2: Data prep
Now we do a little prep, and I'll save you the trouble. You can just run the following.
```{r ja_prep, eval=FALSE}
ja_book = ja_book %>%
mutate(chapter = str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)),
chapter = cumsum(chapter),
line_book = row_number()) %>%
unnest_tokens(word, text)
```
```{r ja_prep2}
ja_book = ja_book %>%
mutate(chapter = str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)),
chapter = cumsum(chapter),
line_book = row_number()) %>%
group_by(chapter) %>%
mutate(line_chapter = row_number()) %>%
unnest_tokens(word, text)
```
### Step 3: Get sentiment
Now, on your own, try the inner join approach we used previously to match the sentiments to the text. Don't overthink this. The third pipe step will use the <span class="func">count</span> function with the `word` column, along with the argument `sort=TRUE`. Note this is just to look at your result; we aren't assigning it to an object yet.
```{r ja_join_sentiment_example, eval=FALSE}
ja_book %>%
? %>%
?
```
The following shows my negative evaluation of <span class="emph">Mansfield Park</span>.
```{r ja_join_sentiment, echo=FALSE}
ja_book %>%
inner_join(nrc_bad) %>%
count(word, sort = TRUE)
```
### Step 4: Visualize
Now let's do a visualization for sentiment. So redo your inner join, but we'll create a data frame that has the information we need.
```{r negativity_data}
plot_data = ja_book %>%
inner_join(nrc_bad) %>%
group_by(chapter, line_book, line_chapter) %>%
count() %>%
group_by(chapter) %>%
mutate(negativity = cumsum(n),
mean_chapter_negativity=mean(negativity)) %>%
group_by(line_chapter) %>%
mutate(mean_line_negativity=mean(n))
plot_data
```
At this point you have enough to play with, so I leave you to plot whatever you want.
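If you want a simple starting point, the following sketch (assuming <span class="pack">ggplot2</span>) draws one line per chapter:
```{r negativity_sketch, eval=FALSE}
library(ggplot2)

# cumulative negativity across each chapter, one line per chapter
plot_data %>%
  ggplot(aes(x = line_chapter, y = negativity, group = chapter)) +
  geom_line(alpha = .5) +
  labs(x = 'Line within chapter', y = 'Cumulative negativity')
```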
The following[^badplot] shows both the total negativity within a chapter, as well as the per-line negativity within a chapter. We can see that there is less negativity toward the end of chapters. We can also see that there appears to be more negativity in later chapters (darker lines).
```{r negativity_plot, echo=FALSE, fig.height=6, fig.width=8, dpi=100, fig.retina=2}
plot_data = ungroup(plot_data)
plot_data %>%
mutate(colspec = viridis::plasma(n=n_distinct(plot_data$chapter), begin=1, end=0)[chapter],
colspec2 = viridis::plasma(n=n_distinct(plot_data$line_chapter), begin=0, end=1)[factor(line_chapter)]) %>%
ggplot(aes(x=line_chapter, y=negativity)) +
geom_point(aes(y=mean_line_negativity*85, color=I(colspec2)), alpha=.05) +
# scale_color_viridis(option='C', discrete=T) +
geom_line(aes(color=I(colspec), group=factor(chapter)), alpha=.9, lwd=.75) +
# geom_point(aes(y=mean_line_negativity*85), stat='identity', color='#ff5500', alpha=.25) +
labs(x='Line number', y='Negativity') +
ggtitle('Negativity in Mansfield Park') +
scale_y_continuous(breaks = c(0,200,400,600),
limits = c(0,900),
sec.axis = sec_axis(~./85, name = "Per line mean negativity")) +
theme(legend.position='none') +
theme_trueMinimal()
```
[^text]: In practice, I suggest not naming your column 'text'. `text` is a base R function, and using the same name within the tidyverse may cause problems distinguishing the function from the column (similar to the `n()` function and the `n` column created by <span class="func">count</span> and <span class="func">tally</span>). I only do so here for pedagogical reasons.
[^encoding]: There are almost always encoding issues in my experience.
[^lazy]: This exercise is more or less taken directly from the [tidytext book](http://tidytextmining.com/sentiment.html).
[^badplot]: This depiction goes against many of my visualization principles. I like it anyway.
```{r setup_sentiment_return, echo=FALSE}
knitr::opts_chunk$set(cache.rebuild=F, cache=T)
```