class: left, top, title-slide

.title[
# Textual Data
Shakespeare Sentiment Demo
]

.author[
### Keith VanderLinden
Calvin University
]

---

# Demo: Comparing the Sentiment of Two Documents

In this demo, we compare the sentiment rates for two of Shakespeare's plays:

- *Macbeth*: https://www.gutenberg.org/cache/epub/1129/pg1129.txt
- *Much Ado about Nothing*: https://www.gutenberg.org/cache/epub/1519/pg1519.txt

Note the structural similarity of the documents.

???

- This example reads two `*.txt` files from the web. There's no scraping.
- It combines the technologies used by:
  - two of the text examples:
    - Shakespeare wrangling
    - Scientific paper sentiment analysis
  - last unit's demo of sentiment analysis of Austen (from a preloaded dataset).
- Show how similar the files are. Surely we can simplify this repetitive task somehow.

---

# Automating the Reading of Gutenberg Text Documents

```r
# Assumes rvest, stringr, purrr, dplyr/tibble, and tidytext are loaded earlier in the deck.
get_gutenberg_words <- function(page, start_line, stop_line, title) {
  # Construct the Gutenberg URL.
  url <- paste0("http://www.gutenberg.org/cache/epub/", page, "/pg", page, ".txt")

  # Get a list of lines (strings) from the text document at the given URL.
  word_lists <- read_html(url) %>%
    html_text() %>%
    str_remove_all("[:punct:]") %>%
    str_split("\r\n") %>%
    pluck(1)

  # Convert the lines to a dataframe and tidy the data.
  word_lists[start_line:stop_line] %>%
    tibble() %>%
    rename(text = ".") %>%
    unnest_tokens(word, text) %>%
    mutate(title = title)
}
```

???

- Because we need to read two text files, we can encapsulate the read in a function.

---

# Reading Two Shakespeare Plays

.pull-left[
```r
macbeth <- get_gutenberg_words(
  page = "1129",
  start_line = 212,
  stop_line = 3183,
  title = "macbeth")
macbeth

ado <- get_gutenberg_words(
  page = "1519",
  start_line = 28,
  stop_line = 4624,
  title = "ado")
ado

# Merge the two dataframes into one.
plays <- rbind(macbeth, ado)
```
]

.pull-right[
```
# A tibble: 18,284 × 2
   word    title
   <chr>   <chr>
 1 the     macbeth
 2 tragedy macbeth
 3 of      macbeth
 4 macbeth macbeth
 5 by      macbeth
 6 william macbeth
# … with 18,278 more rows
```

```
# A tibble: 22,661 × 2
   word    title
   <chr>   <chr>
 1 much    ado
 2 ado     ado
 3 about   ado
 4 nothing ado
 5 by      ado
 6 william ado
# … with 22,655 more rows
```
]

???

- Here, we get the texts for Macbeth and Much Ado.
- Note how much simpler this is with the function.

---

# Comparing the Sentiment of the Two Plays

.pull-left[
```r
plays %>%
  inner_join(get_sentiments()) %>%  # defaults to the "bing" lexicon
  group_by(title) %>%
  count(sentiment) %>%
  pivot_wider(
    names_from = sentiment,
    values_from = n,
    values_fill = 0) %>%
  mutate(
    sentiment =
      (positive - negative) / (positive + negative)
  )
```

.footnote[Cf. *Text Mining with R*, Chapter 2: https://www.tidytextmining.com/sentiment.html]
]

.pull-right[
```
# A tibble: 2 × 4
# Groups:   title [2]
  title   negative positive sentiment
  <chr>      <int>    <int>     <dbl>
1 ado          769     1121     0.186
2 macbeth      866      706    -0.102
```
]

???

Demo:

- Word-frequency examples
- Stop words
- Word sentiment lexicons

Notes:

- cf. https://www.tidytextmining.com/tidytext.html
- Cite that early sentiment lexicon built based on smiley/frowny faces.
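A minimal sketch for the word-frequency/stop-word demo above, assuming the `plays` tibble built earlier and tidytext's built-in `stop_words` data frame:

```r
# Most frequent non-stop words in each play.
plays %>%
  anti_join(stop_words, by = "word") %>%  # drop common English stop words
  count(title, word, sort = TRUE)
```

The same sentiment pipeline can then be rerun with a different lexicon (e.g., `get_sentiments("nrc")`) to show how the lexicon choice affects the scores.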