The goal of this exercise is to learn how to use regular expressions to filter text. It performs an analysis similar to the analysis of Macbeth described in the text (see Section 19.1), but this time on King Lear, another Shakespeare tragedy not lacking in death.

Loading the King Lear Text

We can retrieve this text from Project Gutenberg using rvest, Tidyverse’s Web scraping library. For now, you can just use this code; we’ll study it in more detail when we get to the Web scraping unit.

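library(tidyverse)  # provides %>%, str_split(), and pluck()
library(rvest)      # provides read_html(), html_nodes(), and html_text()

# Download the play and extract the text of its paragraph nodes
lear_url <- "http://www.gutenberg.org/cache/epub/1532/pg1532.txt"
lear_text <- read_html(lear_url) %>%
  html_nodes("body") %>%
  html_nodes("p") %>%
  html_text()
# Preview the first 200 characters
substring(lear_text, 1, 200)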
## [1] "The Project Gutenberg eBook of King Lear, by William Shakespeare\r\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no r"

Now we split the raw text into lines:

lear <- lear_text %>% 
  str_split("\r\n") %>%
  pluck(1)
length(lear)
## [1] 6503

Computing Interesting Statistics

How many lines were spoken by Edgar? You’ll need to take a look at the raw text of the play to see how the formatting of speeches in King Lear differs from the Macbeth formatting assumed by the example in the text. N.b., the character name is always on a separate line ending with a period (.).

pattern <- "^EDGAR\\.$"
lear %>% str_subset(pattern) %>% length()
## [1] 100

How many unique speakers have lines? You’ll need to search for character names, which are of varying lengths and are formatted in all capital letters. Use the unique() function to filter out repeated names; sort() can help with readability.
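One possible approach is sketched here on a small made-up vector rather than on the play itself. The sample lines and the names sample_lines and speaker_pattern are assumptions for illustration; the speaker labels in the actual text may differ, and the pattern may also pick up other all-capital lines that end in a period, so the result is worth inspecting by eye.

```r
library(stringr)

# Hypothetical sample lines (not taken from the play)
sample_lines <- c("EDGAR.", "Some speech text.", "KING LEAR.",
                  "More speech text.", "EDGAR.", "Enter KENT.", "FOOL.")

# A whole line of capital letters and spaces that ends with a period
speaker_pattern <- "^[A-Z][A-Z ]*\\.$"

# Keep the matching lines, drop repeats, and sort for readability
speaker_lines <- str_subset(sample_lines, speaker_pattern)
speakers <- sort(unique(speaker_lines))
speakers
length(speakers)
```

Applying the same three steps to lear in place of sample_lines would give a count for the full play, subject to the false positives mentioned above.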

Modify the regex ^la{2,3} (given below) to match lala and lalala but not lalalala.

pattern <- "^la{2,3}"
c("", "la", "laa", "lala", "lalala", "lalalala", "blalalala") %>% str_subset(pattern)
## [1] "laa"
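The given pattern matches only laa because the {2,3} quantifier applies to the single character before it, the a. One way to repair it, offered as a sketch rather than the only answer, is to group la so the quantifier repeats the pair, and to anchor the end of the string so that four repetitions are rejected. The names candidates and grouped_pattern are introduced here for illustration.

```r
library(stringr)

candidates <- c("", "la", "laa", "lala", "lalala", "lalalala", "blalalala")

# (la) makes the quantifier repeat the pair; $ rules out extra repetitions
grouped_pattern <- "^(la){2,3}$"

matches <- str_subset(candidates, grouped_pattern)
matches
## [1] "lala"   "lalala"
```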

Give an example of a string that matches this regex: \\d{3}-\\d{3}-\\d{4}. Hint: you probably know several of these from memory and share them with others regularly.
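A quick way to check a candidate answer is to run it through str_detect(). The strings below are made-up placeholders, not real phone numbers. Note that the regex as written is not anchored, so it also matches when the digits appear inside a longer string.

```r
library(stringr)

phone_pattern <- "\\d{3}-\\d{3}-\\d{4}"

# Hypothetical test strings
examples <- c("555-123-4567", "5551234567", "call 555-123-4567 now")

phone_hits <- str_detect(examples, phone_pattern)
phone_hits
## [1]  TRUE FALSE  TRUE
```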

In a programming course, you might have written a function that checks whether a string is a valid US social security number (e.g., 123-45-6789). Write a regular expression that does the same thing and test it on a few well-chosen test cases.
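A minimal sketch of one such regex follows, assuming that "valid" means only the three-two-four digit layout with hyphens. It does not check any of the rules about which number ranges are actually issued. The names ssn_pattern and test_cases are introduced for illustration.

```r
library(stringr)

# Anchored at both ends so that the whole string must have the SSN layout
ssn_pattern <- "^\\d{3}-\\d{2}-\\d{4}$"

test_cases <- c("123-45-6789",   # correct layout
                "123456789",     # missing hyphens
                "123-45-678",    # last group too short
                "1234-56-7890",  # first group too long
                "abc-de-fghi",   # letters instead of digits
                " 123-45-6789")  # leading space

ssn_ok <- str_detect(test_cases, ssn_pattern)
ssn_ok
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE
```

The test cases each vary one aspect of the layout, so a failure points to the part of the pattern responsible.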

Reproduce the life & death plot demonstrated in the text. Choose any characters you’d like to trace, but include King Lear and his middle daughter Regan.
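As a starting point, the sketch below shows only the core idea: find the line numbers at which each character speaks and plot them along the length of the play. It is a simplified stand-in for the plot in the text, not a reproduction of it, and it runs on a made-up vector. The names toy_play, characters, and speaker_positions, as well as the speaker labels themselves, are assumptions for illustration.

```r
library(stringr)
library(ggplot2)

# Hypothetical stand-in for the play; the real input would be the lear vector
toy_play <- c("KING LEAR.", "(speech text)", "REGAN.", "(speech text)",
              "KING LEAR.", "(speech text)", "REGAN.", "(speech text)",
              "KING LEAR.")

characters <- c("KING LEAR", "REGAN")

# For each character, record the positions of their speaker lines
speaker_positions <- do.call(rbind, lapply(characters, function(name) {
  idx <- which(str_detect(toy_play, str_c("^", name, "\\.$")))
  data.frame(speaker = rep(name, length(idx)), line = idx)
}))

# One row per character: points mark speeches, the line spans first to last
life_plot <- ggplot(speaker_positions, aes(x = line, y = speaker)) +
  geom_line(aes(group = speaker)) +
  geom_point() +
  labs(x = "Line number", y = "Character")
life_plot
```

The span from a character's first to last speech is only a rough proxy for their time alive in the play, so the result should be compared against the version in the text.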