class: left, top, title-slide

.title[
# Textual Data
Web Scraping Demo
]
.author[
### Keith VanderLinden
Calvin University
]

---

# Checking Access Restrictions

Before scraping a site, check that its `robots.txt` permits it.

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
[1] TRUE
```

```r
paths_allowed("http://www.facebook.com")
```

```
[1] FALSE
```

```r
paths_allowed("https://www.gutenberg.org")
```

```
[1] TRUE
```

```r
paths_allowed("https://www.gutenberg.org/ebooks/search")
```

```
[1] FALSE
```

???

- Don't assume that all Web text is free and open.
- Computers can attend to and remember everything, so text being publicly visible doesn't by itself mean it is open to our analysis.

---

# Scraping the HTML Data

With access confirmed, we can scrape the HTML of the [IMDB Top Chart](https://www.imdb.com/chart/top/).

```r
library(rvest)
library(tidyverse)  # stringr, tibble, dplyr, and ggplot2 are used on later slides
page <- read_html("https://www.imdb.com/chart/top/")
page
```

```
{html_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
[2] <body id="styleguide-v2" class="fixed">\n            <img ...
```

???

- This will be an ETL example. Here's the E(xtract).
- Note how ugly the raw HTML is in this case.

---

# Accessing the Relevant HTML Elements

.pull-left[
```r
ratings <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()
ratings[1:5]
```

```r
years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>%
  str_remove("\\)") %>%
  as.numeric()
years[1:5]
```

```r
titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
titles[1:5]
```
]

.pull-right[
```
[1] 9.2 9.2 9.0 9.0 9.0
```

<br><br><br><br>

```
[1] 1994 1972 2008 1974 1957
```

<br><br><br><br><br><br>

```
[1] "The Shawshank Redemption" "The Godfather"
[3] "The Dark Knight"          "The Godfather Part II"
[5] "12 Angry Men"
```
]

???

- Continuing the ETL example: here's the T(ransform).
- Show how to find all these HTML elements using the Firefox DOM inspector.
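---

# Trying Selectors Offline

The selector calls on the previous slide can be rehearsed without touching the network. What follows is a minimal sketch, assuming a hand-written HTML fragment that only imitates the chart's markup. The `toy_page` name and the fragment's layout are illustrative assumptions; the two rows reuse titles, years, and ratings from the output shown earlier.

```r
library(rvest)
library(stringr)

# Hand-written stand-in for two chart rows (an assumption, not real IMDB markup).
toy_page <- read_html('
  <table>
    <tr>
      <td class="titleColumn">
        <a href="#">The Shawshank Redemption</a>
        <span class="secondaryInfo">(1994)</span>
      </td>
      <td><strong>9.2</strong></td>
    </tr>
    <tr>
      <td class="titleColumn">
        <a href="#">The Godfather</a>
        <span class="secondaryInfo">(1972)</span>
      </td>
      <td><strong>9.2</strong></td>
    </tr>
  </table>')

# Same selectors as the live scrape, applied to the toy fragment.
toy_titles <- toy_page %>%
  html_nodes(".titleColumn a") %>%
  html_text()

toy_years <- toy_page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>%   # strip the opening parenthesis
  str_remove("\\)") %>%   # strip the closing parenthesis
  as.numeric()

toy_ratings <- toy_page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()
```

If a selector returns an empty vector here, it would likely fail on the live page too, so a fragment like this is a cheap place to debug before re-running the scrape.

???

- The fragment is made up for practice; the real page's markup may differ.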
---

# Creating a Tidy Dataframe

.pull-left[
```r
imdb_top_250 <- tibble(
  title = titles,
  year = years,
  rating = ratings
)
# The chart is already ordered, so rank is the row position.
imdb_top_250 <- imdb_top_250 %>%
  mutate(rank = row_number())
```
]

.pull-right[
```r
imdb_top_250
```

```
# A tibble: 250 × 4
  title                     year rating  rank
  <chr>                    <dbl>  <dbl> <int>
1 The Shawshank Redemption  1994    9.2     1
2 The Godfather             1972    9.2     2
3 The Dark Knight           2008    9       3
4 The Godfather Part II     1974    9       4
5 12 Angry Men              1957    9       5
6 Schindler's List          1993    8.9     6
# … with 244 more rows
```
]

???

- Finishing the ETL example: here's the L(oad).

---

# Visualizing the Movie Ratings Over Time

.pull-left[
```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(avg_score = mean(rating)) %>%
  ggplot() +
  aes(x = year, y = avg_score) +
  geom_point() +
  geom_smooth() +
  labs(title = "Average Movie Ratings",
       x = "Movie Year",
       y = "Average Movie Rating")
```
]

.pull-right[
<img src="scraping_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" />
]

???
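---

# Checking the Summary Before Plotting

The plot collapses the movies to one point per year. Here is a minimal sketch of that summarising step on a small tibble. The first three rows come from the output shown earlier; the last row is a hypothetical entry, added only so that one year contains two movies. The `n_movies` column is an illustrative addition, not part of the original pipeline.

```r
library(dplyr)
library(tibble)

# Three rows from the scraped output plus one hypothetical row (labeled as such).
toy_movies <- tibble(
  title  = c("The Shawshank Redemption", "The Godfather",
             "The Dark Knight", "Hypothetical Movie"),
  year   = c(1994, 1972, 2008, 1994),
  rating = c(9.2, 9.2, 9.0, 8.8)
)

# One row per year: the mean rating and how many movies it is based on.
toy_summary <- toy_movies %>%
  group_by(year) %>%
  summarise(avg_score = mean(rating), n_movies = n())

toy_summary
```

A count column like `n_movies` may help when reading the plot, since a year represented by a single movie contributes a point just as visible as one averaged over many.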