correct decile column name. fixes #5 #6

Open: wants to merge 1 commit into base: master
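
For context, here is a minimal sketch of the dplyr behaviour this rename appears to work around. It assumes, and the diff does not show this, that both tables passed to `inner_join()` carry a `decile` column. The toy tables `peaks` and `counts` and their values are hypothetical stand-ins, not data from the post.

```r
library(dplyr)

# Hypothetical stand-ins for the post's tables; all values are made up.
peaks <- tibble(
  word   = c("fictional", "drunk"),
  decile = c(1, 5)
)
counts <- tibble(
  word   = rep(c("fictional", "drunk"), each = 2),
  decile = c(1, 2, 5, 6),
  n      = c(40, 10, 14, 9)
)

# Joining only on "word" leaves `decile` as a non-key column in both inputs,
# so dplyr disambiguates the two copies with its default suffixes.
joined <- inner_join(peaks, counts, by = "word")

# Inspect the resulting column names: there is no plain `decile` any more.
names(joined)
```

If that assumption holds, `aes(decile, ...)` would no longer find a column named `decile` after the join, while `decile.y` (the copy coming from the right-hand table) would be available. An alternative, not taken in this PR, would be to drop or rename the left-hand copy before joining.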
6 changes: 3 additions & 3 deletions _R/2017-04-26-tidytext-plots.Rmd
@@ -193,7 +193,7 @@ peak_decile %>%

We see that the words in the start and the end are the most specific to their particular deciles: for example, almost half of the occurrences of the word "fictional" occurred in the first 10% of the story. The middle sections have words that are more spread out (having, say, 14% of their occurrences in that section rather than the expected 10%), but they still are words that make sense in the story structure.

-Let's visualize the full trend for the words overrepreseted at each point.
+Let's visualize the full trend for the words overrepresented at each point.

```{r sparklines, fig.width = 8, fig.height = 8, echo = FALSE}
peak_decile %>%
@@ -203,7 +203,7 @@ peak_decile %>%
ungroup() %>%
inner_join(decile_counts, by = "word") %>%
mutate(word = reorder(word, peak_decile + .001 * fraction_peak)) %>%
-ggplot(aes(decile, n / number, color = word)) +
+ggplot(aes(decile.y, n / number, color = word)) +
geom_line(show.legend = FALSE, size = 1) +
geom_hline(lty = 2, yintercept = .1, alpha = .5) +
facet_wrap(~ word, ncol = 6) +
@@ -246,4 +246,4 @@ In short, if we had to summarize the *average* story that humans tell, it would
This was a pretty simple analysis of story arcs (for a more in-depth example, see the [research described here](https://www.theatlantic.com/technology/archive/2016/07/the-six-main-arcs-in-storytelling-identified-by-a-computer/490733/)), and it doesn't tell us too much we wouldn't have been able to guess.
(Except perhaps that characters are most likely to be drunk right in the middle of a story. How can we monetize that insight?)

-What I like about this approach is how quickly you can gain insights with simple quantitative methods (counting, taking the median) applied to a large text dataset. In future posts, I'll be diving deeper into these plots and showing what else we can learn.
+What I like about this approach is how quickly you can gain insights with simple quantitative methods (counting, taking the median) applied to a large text dataset. In future posts, I'll be diving deeper into these plots and showing what else we can learn.