Lab 12.1 - Top IMDB TV Shows

The goal of this exercise is to practice scraping text data from a website. You will carry out an analysis similar to the IMDB movies analysis demonstrated in class, but this time on IMDB television shows.

Scraping the Webpage

UPDATE
Before scraping the IMDB website, check to make sure that scraping is permitted.
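
One common way to check is to consult the site's robots.txt file. The sketch below assumes the robotstxt package is installed; this lab does not name it, so treat that choice as an assumption. Its paths_allowed() function reports whether robots.txt permits crawling a given path, and it needs a network connection to run.

```r
# Assumption: the robotstxt package is available; this lab does not name it.
library(robotstxt)

# TRUE means robots.txt does not disallow this path for generic bots.
# This queries the live site, so it needs a network connection.
paths_allowed("https://www.imdb.com/chart/tvmeter")
```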

UPDATE
Retrieve the IMDB TV shows page using rvest and save it as page. The URL is: https://www.imdb.com/chart/tvmeter

library(tidyverse)
library(rvest)
page <- read_html("https://www.imdb.com/chart/tvmeter")
page

## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n            <img height="1" widt ...

Parsing Useful Information

We now work through the hierarchical HTML structure to find the values of interest, namely the year, rating, and title for each of the top 100 TV shows. We’ll handle each in turn, finding and cleaning up the values as needed.

UPDATE
Get the ratings first; they are probably the easiest. Inspecting a rating value on the webpage shows that it is marked with the imdbRating CSS class, which we use to find it in the HTML structure. The extracted text is padded with newlines and spaces, so use string commands to strip them, and remember to convert the result to a numeric value.

ratings <- page %>%
  html_nodes(".imdbRating") %>%
  html_text() 
ratings[1:10]

##  [1] "\n            9.2\n    " "\n            9.1\n    "
##  [3] "\n            9.0\n    " "\n            9.0\n    "
##  [5] "\n            8.9\n    " "\n            8.9\n    "
##  [7] "\n            8.9\n    " "\n            8.8\n    "
##  [9] "\n            8.8\n    " "\n            8.8\n    "
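
The cleanup can be sketched offline as follows. This is a minimal example, assuming the stringr package and using three sample strings copied from the raw output above rather than a live scrape; the names raw_ratings and clean_ratings are illustrative only.

```r
library(stringr)

# Sample values copied from the raw output above (not a live scrape).
raw_ratings <- c("\n            9.2\n    ",
                 "\n            9.1\n    ",
                 "\n            9.0\n    ")

# Trim the surrounding newlines and spaces, then convert to numeric.
clean_ratings <- as.numeric(str_trim(raw_ratings))
clean_ratings
```

The same two steps can be appended to the scraping pipeline after html_text().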

UPDATE
Get the years next. The year for TV shows is stored just after the anchor (a) node for the title, using the class secondaryInfo. Clean up these year strings and convert them to numbers.

years <- page %>%
  html_nodes("a+ .secondaryInfo") %>%
  html_text()
years[1:10]

##  [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)" "(2003)" "(1994)"
##  [9] "(1966)" "(2001)"
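
A possible cleanup for the year strings, assuming they have the "(YYYY)" form shown in the output above; strings scraped from the live page may be formatted differently and need a different pattern. The names raw_years and clean_years are illustrative only.

```r
library(stringr)

# Sample values copied from the raw output above (not a live scrape).
raw_years <- c("(1994)", "(1972)", "(1974)")

# Drop the parentheses, then convert to numeric.
clean_years <- as.numeric(str_remove_all(raw_years, "[()]"))
clean_years
```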

UPDATE
Next, get the titles. Use the HTML structure inspector to find the class name for the title values, and then clean up the value.

titles <- page %>%
  html_nodes(".CORRECT_CLASS_NAME")
titles[1:10]

## {xml_nodeset (0)}
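
The empty nodeset above is what a selector returns when it matches nothing. The sketch below shows the general pattern on a small, hypothetical HTML fragment: select the nodes, then call html_text() to turn them into strings. The class name example-title and the show names are made up for illustration and are not the ones used on the IMDB page.

```r
library(rvest)

# Hypothetical HTML fragment; "example-title" is a made-up class name,
# not the one used on the IMDB page.
demo_page <- read_html('
  <table>
    <tr><td class="example-title"> <a href="#">Show A</a> </td></tr>
    <tr><td class="example-title"> <a href="#">Show B</a> </td></tr>
  </table>')

demo_titles <- demo_page %>%
  html_nodes(".example-title a") %>%  # select the anchor inside each cell
  html_text() %>%                     # extract the text from each node
  trimws()                            # strip any stray whitespace
demo_titles
```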

UPDATE
Finally, create a dataframe that contains the data we’ve collected and add a rank value. You’ll need to sort these values first because, unlike the movies, they are not already sorted.

tvshows <- tibble(
  # Put variables here.
) 
# Wrangle the dataset here.
tvshows

## # A tibble: 0 x 0
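
The sort-then-rank step might look like the sketch below. It assumes that "sort" means ordering by rating, highest first, and it uses made-up placeholder values (demo_shows, "Show A", and so on) only to show the shape of the result.

```r
library(dplyr)
library(tibble)

# Made-up placeholder values, only to show the shape of the result.
demo_shows <- tibble(
  title  = c("Show A", "Show B", "Show C"),
  year   = c(2011, 2019, 2005),
  rating = c(8.1, 9.3, 8.7)
) %>%
  arrange(desc(rating)) %>%    # highest rating first
  mutate(rank = row_number())  # rank follows the sorted order
demo_shows
```

Adding the rank after arrange() matters: row_number() numbers the rows in their current order, so ranking before sorting would reflect the page order instead.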

Author Goes Here

Semester Goes Here
