class: left, top, title-slide

.title[
# Textual Data
Shakespeare Sentiment Demo
]

.author[
### Keith VanderLinden
Calvin University
]

---

# Demo: Comparing the Sentiment of Two Documents

In this demo, we compare the sentiment rates for two of Shakespeare's plays:

- *Macbeth*: https://www.gutenberg.org/cache/epub/1129/pg1129.txt
- *Much Ado about Nothing*: https://www.gutenberg.org/cache/epub/1519/pg1519.txt

Note the structural similarity of the documents.

???

- This example reads two `*.txt` files from the web. There's no scraping.
- It combines the technologies used by:
  - two of the text examples:
    - Shakespeare wrangling
    - Scientific paper sentiment analysis
  - last unit's demo of sentiment analysis of Austen (from a preloaded dataset).
- Show how similar the files are. Surely we can simplify this repetitive task somehow.

---

# Automating the Reading of Gutenberg Text Documents

```r
# Assumes rvest, stringr, purrr, dplyr/tibble, and tidytext are loaded earlier in the deck.
get_gutenberg_words <- function(page, start_line, stop_line, title) {
  # Construct the Gutenberg URL.
  url <- paste0("http://www.gutenberg.org/cache/epub/", page, "/pg", page, ".txt")

  # Get a list of lines (strings) from the text document at the given URL.
  word_lists <- read_html(url) %>%
    html_text() %>%
    str_remove_all("[:punct:]") %>%
    str_split("\r\n") %>%
    pluck(1)

  # Convert the lines to a dataframe and tidy the data.
  word_lists[start_line:stop_line] %>%
    tibble() %>%
    rename(text = ".") %>%
    unnest_tokens(word, text) %>%
    mutate(title = title)
}
```

???

- Because we need to read two text files, we can encapsulate the read in a function.

---

# Reading Two Shakespeare Plays

.pull-left[
```r
macbeth <- get_gutenberg_words(
  page = "1129",
  start_line = 212,
  stop_line = 3183,
  title = "macbeth")
macbeth

ado <- get_gutenberg_words(
  page = "1519",
  start_line = 28,
  stop_line = 4624,
  title = "ado")
ado

# Merge the two dataframes into one.
plays <- rbind(macbeth, ado)
```
]

.pull-right[
```
# A tibble: 18,284 × 2
   word    title
   <chr>   <chr>
 1 the     macbeth
 2 tragedy macbeth
 3 of      macbeth
 4 macbeth macbeth
 5 by      macbeth
 6 william macbeth
# … with 18,278 more rows
```

```
# A tibble: 22,661 × 2
   word    title
   <chr>   <chr>
 1 much    ado
 2 ado     ado
 3 about   ado
 4 nothing ado
 5 by      ado
 6 william ado
# … with 22,655 more rows
```
]

???

- Here, we get the texts for Macbeth and Much Ado.
- Note how much simpler this is with the function.

---

# Comparing the Sentiment of the Two Plays

.pull-left[
```r
plays %>%
  inner_join(get_sentiments()) %>%  # defaults to the "bing" lexicon
  group_by(title) %>%
  count(sentiment) %>%
  pivot_wider(
    names_from = sentiment,
    values_from = n,
    values_fill = 0) %>%
  mutate(
    sentiment =
      (positive - negative) / (positive + negative)
  )
```

.footnote[Cf. *Text Mining with R*, Chapter 2: https://www.tidytextmining.com/sentiment.html]
]

.pull-right[
```
# A tibble: 2 × 4
# Groups:   title [2]
  title   negative positive sentiment
  <chr>      <int>    <int>     <dbl>
1 ado          769     1121     0.186
2 macbeth      866      706    -0.102
```
]

???

Demo:

- Word-frequency examples
- Stop words
- Word sentiment lexicons

Notes:

- cf. https://www.tidytextmining.com/tidytext.html
- Cite that early sentiment lexicon built based on smiley/frowny faces.
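A minimal sketch for the word-frequency/stop-word demo above, assuming the `plays` tibble built earlier and tidytext's built-in `stop_words` data frame:

```r
# Most frequent non-stop words in each play.
plays %>%
  anti_join(stop_words, by = "word") %>%  # drop common English stop words
  count(title, word, sort = TRUE)
```

The same sentiment pipeline can then be rerun with a different lexicon (e.g., `get_sentiments("nrc")`) to show how the lexicon choice affects the scores.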