library(tidyverse)
library(rvest)
library(robotstxt)

The goal of this exercise is to practice scraping text data from a website. The analysis parallels the IMDB movies analysis demoed in the class session, but this time it targets IMDB television shows.

Scraping the Webpage

UPDATE
Before scraping the IMDB website, check to make sure that scraping is permitted.
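
One way to make this check is with the robotstxt package loaded above. The sketch below is illustrative only: it evaluates a made-up robots.txt (both the rules and the example.com domain are assumptions, not IMDB's actual file) so that it runs without network access.

```r
library(robotstxt)

# Hypothetical robots.txt rules, for illustration only.
example_rules <- "User-agent: *\nDisallow: /private/\n"

# Supplying `text` means the rules are parsed locally rather than downloaded.
rt <- robotstxt(domain = "example.com", text = example_rules)

# Ask whether a generic bot ("*") may fetch each path.
allowed <- rt$check(paths = c("/chart/tvmeter", "/private/data"), bot = "*")
allowed

# Against the live site, the one-line equivalent would be:
# paths_allowed("https://www.imdb.com/chart/tvmeter")
```

A result of TRUE for a path means the rules permit fetching it; FALSE means scraping that path should be avoided.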

UPDATE
Retrieve the IMDB TV shows page using rvest and save it as page. The URL is: https://www.imdb.com/chart/tvmeter

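# NOTE: this is still the movies chart URL from the class demo; swap in the
# TV shows URL given above before continuing.
page <- read_html("https://www.imdb.com/chart/top/")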
page
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

Parsing Useful Information

We now wade through the hierarchical HTML structure to find the values of interest: the year, rating, and title for each of the top 100 TV shows. We’ll handle each in turn, finding and cleaning up the values as needed.

UPDATE
Get the ratings first; they’re probably the easiest. Inspecting a rating value on the webpage shows that it is marked with the imdbRating CSS class, which we use to find it in the HTML structure. The extracted text is padded with newlines and spaces, so use string functions to trim it, and remember to convert the result to a numeric value.

ratings <- page %>%
  html_nodes(".imdbRating") %>%
  html_text() 
ratings[1:10]
##  [1] "\n            9.2\n    " "\n            9.1\n    "
##  [3] "\n            9.0\n    " "\n            9.0\n    "
##  [5] "\n            8.9\n    " "\n            8.9\n    "
##  [7] "\n            8.9\n    " "\n            8.8\n    "
##  [9] "\n            8.8\n    " "\n            8.8\n    "
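
As a minimal sketch of the cleanup step, the code below trims and converts two raw strings copied from the output above. The names raw_ratings and clean_ratings are hypothetical, and str_trim() from stringr is just one of several ways to strip the padding.

```r
library(stringr)

# Two raw rating strings in the same shape as the scraped output.
raw_ratings <- c("\n            9.2\n    ", "\n            9.1\n    ")

# str_trim() removes leading and trailing whitespace (including newlines);
# as.numeric() then turns the remaining text into numbers.
clean_ratings <- as.numeric(str_trim(raw_ratings))
clean_ratings
```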

UPDATE
Get the years next. The year for each TV show is stored just after the anchor (a) node for the title, in a node with the class secondaryInfo. Clean up these year strings and convert them to numbers.

years <- page %>%
  html_nodes("a+ .secondaryInfo") %>%
  html_text()
years[1:10]
##  [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)" "(2003)" "(1994)"
##  [9] "(1966)" "(2001)"
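
The year strings can be cleaned in a similar way. This sketch, which assumes each string contains a four-digit year, pulls out the first run of four digits and converts it to an integer; raw_years and clean_years are hypothetical names.

```r
library(stringr)

# Sample year strings in the same shape as the scraped output.
raw_years <- c("(1994)", "(1972)", "(2008)")

# str_extract() returns the first match of the pattern: four consecutive digits.
clean_years <- as.integer(str_extract(raw_years, "\\d{4}"))
clean_years
```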

UPDATE
Next, get the titles. Use the HTML structure inspector to find the class name for the title values, substitute it for the placeholder in the selector, and then clean up the values.

titles <- page %>%
  html_nodes(".CORRECT_CLASS_NAME")
titles[1:10]
## {xml_nodeset (0)}
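
To show how a class selector and html_text() fit together without giving away the real class name, here is a sketch against a tiny hand-written HTML snippet. The markup and the demo-title class are invented for illustration and are not IMDB's actual structure.

```r
library(rvest)
library(stringr)

# A toy page with a hypothetical class name ("demo-title").
toy_page <- read_html('
  <html><body><table>
    <tr><td class="demo-title"> 1. <a href="#"> Show A </a>
        <span class="secondaryInfo">(2001)</span></td></tr>
    <tr><td class="demo-title"> 2. <a href="#"> Show B </a>
        <span class="secondaryInfo">(1999)</span></td></tr>
  </table></body></html>')

# Select the anchor inside each title cell, extract its text, and trim it.
toy_titles <- toy_page %>%
  html_nodes(".demo-title a") %>%
  html_text() %>%
  str_trim()
toy_titles
```

An empty node set, as in the output above, usually means the selector did not match anything on the page.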

UPDATE
Finally, create a data frame that contains the data we’ve collected and add a rank value. You’ll need to sort the rows first because, unlike the movies, the TV shows are not already sorted.

tvshows <- tibble(
  # Put variables here.
) 
# Wrangle the dataset here.
tvshows
## # A tibble: 0 x 0
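
The sketch below shows one possible shape for this step using made-up data. It assumes the rank should follow the rating from highest to lowest, which the exercise does not state explicitly; toy_shows and its values are hypothetical.

```r
library(dplyr)
library(tibble)

# Made-up shows standing in for the scraped vectors.
toy_shows <- tibble(
  title  = c("Show A", "Show B", "Show C"),
  year   = c(2001L, 1999L, 2010L),
  rating = c(8.1, 9.0, 8.5)
) %>%
  arrange(desc(rating)) %>%      # highest-rated show first
  mutate(rank = row_number())    # rank follows the sorted order
toy_shows
```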

Analyzing the Data

UPDATE
Find out which, if any, of the top TV shows are from your birth year.
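
A filter on the year column is enough for this. The sketch uses made-up data and a hypothetical birth_year value; substitute your own data frame and year.

```r
library(dplyr)
library(tibble)

# Made-up shows; the real data frame comes from the scraping steps.
toy_shows <- tibble(
  title  = c("Show A", "Show B", "Show C"),
  year   = c(2001L, 1999L, 1999L),
  rating = c(8.1, 9.0, 8.5)
)

birth_year <- 1999L  # hypothetical value

# Keep only the rows whose year matches the birth year.
birth_year_shows <- toy_shows %>%
  filter(year == birth_year)
birth_year_shows
```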

UPDATE
Visualize the average yearly rating as we did in class for the movies.
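
One possible approach, sketched on made-up data, is to average the ratings per year and plot the result. The point-and-line plot here is an assumption and may differ from the style used in class.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Made-up shows; two of them share a year so the average is visible.
toy_shows <- tibble(
  year   = c(1999L, 1999L, 2005L, 2010L),
  rating = c(8.0, 9.0, 8.5, 8.7)
)

# Average rating per year.
yearly <- toy_shows %>%
  group_by(year) %>%
  summarize(avg_rating = mean(rating))

# Plot the yearly averages as connected points.
p <- ggplot(yearly, aes(x = year, y = avg_rating)) +
  geom_point() +
  geom_line()
p
```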

UPDATE
Count the TV shows in the top 100 per decade. You can visualize this using either a table or a plot.
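
A decade can be derived from the year with integer division. The sketch below, again on made-up years, builds a decade column and counts the rows in each decade as a table.

```r
library(dplyr)
library(tibble)

# Made-up years standing in for the scraped data.
toy_shows <- tibble(year = c(1994L, 1999L, 2003L, 2008L, 2011L))

# Integer-divide by 10 and multiply back to get the decade (1994 becomes 1990).
decade_counts <- toy_shows %>%
  mutate(decade = (year %/% 10L) * 10L) %>%
  count(decade)
decade_counts
```

The same decade_counts table could be passed to a bar plot instead of being printed.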