library(tidyverse)
library(rvest)
library(robotstxt)
The goal of this exercise is to practice scraping text data from a website. It repeats the analysis of IMDB movies demoed in the class session, this time for IMDB television shows.
UPDATE
Before scraping the IMDB website, check to make sure that scraping is permitted.
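As a rough illustration of what such a permission check involves, the sketch below parses a small set of robots.txt rules and asks whether two paths may be fetched. The rules text is hypothetical and exists only so the example runs offline; it is not IMDB's actual robots.txt.

```r
library(robotstxt)

# Hypothetical robots.txt rules (NOT IMDB's real file), so no network is needed.
rules <- "User-agent: *\nDisallow: /private/\n"
rt <- robotstxt(text = rules)

# check() returns TRUE when the given bot may fetch the path.
chart_ok   <- rt$check(paths = "/chart/tvmeter", bot = "*")
private_ok <- rt$check(paths = "/private/data", bot = "*")
chart_ok
private_ok
```

Against the live site, paths_allowed("https://www.imdb.com/chart/tvmeter") downloads the real robots.txt and performs the same kind of check; it needs network access.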
UPDATE
Retrieve the IMDB TV shows page using rvest and save it as page. The URL is: https://www.imdb.com/chart/tvmeter
page <- read_html("https://www.imdb.com/chart/top/")
page
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
We now wade through the hierarchical HTML structure to find the values of interest to us, namely the year, rating, and title for each of the top 100 TV shows. We’ll take each in turn, finding and cleaning up the values as needed.
UPDATE
Get the ratings first; they’re probably the easiest. If we inspect the rating value on the webpage, we see that it is marked with the imdbRating CSS class, which we use to find it in the HTML structure. The extracted text is padded with newlines and spaces, so use string commands to clean it up, and remember to convert the result to a numeric value.
ratings <- page %>%
  html_nodes(".imdbRating") %>%
  html_text()
ratings[1:10]
## [1] "\n 9.2\n " "\n 9.1\n "
## [3] "\n 9.0\n " "\n 9.0\n "
## [5] "\n 8.9\n " "\n 8.9\n "
## [7] "\n 8.9\n " "\n 8.8\n "
## [9] "\n 8.8\n " "\n 8.8\n "
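One possible clean-up for the rating strings printed above, shown as a minimal sketch on two of those values copied in as sample input. The variable names are illustrative, and other string functions would work equally well.

```r
library(stringr)

# Two of the raw strings from the output above, used as sample input.
raw_ratings <- c("\n 9.2\n ", "\n 9.1\n ")

# Strip the surrounding whitespace, then convert to numeric.
ratings_clean <- as.numeric(str_trim(raw_ratings))
ratings_clean
```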
UPDATE
Get the years next. The year for each TV show is stored just after the anchor (a) node for the title, in a node with the class secondaryInfo. Clean up these year strings and convert them to numbers.
years <- page %>%
  html_nodes("a+ .secondaryInfo") %>%
  html_text()
years[1:10]
## [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)" "(2003)" "(1994)"
## [9] "(1966)" "(2001)"
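The year strings above can be cleaned in a similar way. This sketch assumes every string contains exactly one four-digit year, as in the output shown; the variable names are illustrative.

```r
library(stringr)

# Three of the raw strings from the output above, used as sample input.
raw_years <- c("(1994)", "(1972)", "(1974)")

# Pull out the four-digit year and convert it to an integer.
years_clean <- as.integer(str_extract(raw_years, "\\d{4}"))
years_clean
```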
UPDATE
Finally, get the titles. Use the HTML structure inspector to find the class name for the title values, and then clean up the value.
titles <- page %>%
  html_nodes(".CORRECT_CLASS_NAME")
titles[1:10]
## {xml_nodeset (0)}
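To show how a class selector combined with a child anchor pulls out title text, here is a sketch on a small hand-written HTML fragment. The markup, the show name, and the class name titleColumn are assumptions made for illustration; use the inspector to confirm the actual class on the live page.

```r
library(rvest)

# Hypothetical markup standing in for one row of the chart table.
# The class name "titleColumn" is an assumption, not taken from the exercise.
toy_page <- read_html('
  <table><tr>
    <td class="titleColumn">
      1. <a href="/title/example/">Example Show</a>
      <span class="secondaryInfo">(2011)</span>
    </td>
  </tr></table>')

# Select the anchor inside the title cell and keep only its text.
toy_titles <- toy_page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
toy_titles
```

Selecting the anchor rather than the whole cell avoids picking up the rank number and the year that share the same cell.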
UPDATE
Finally, create a dataframe that contains the data that we’ve collected and add a rank value. You’ll need to sort these values first because, unlike the movies, they are not already sorted.
tvshows <- tibble(
# Put variables here.
)
# Wrangle the dataset here.
tvshows
## # A tibble: 0 x 0
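A minimal sketch of the sort-then-rank step, using made-up shows in place of the scraped vectors. It assumes that rank means position after sorting by rating in descending order; the column names are illustrative.

```r
library(dplyr)
library(tibble)

# Made-up sample data standing in for the scraped titles, years, and ratings.
toy_shows <- tibble(
  title  = c("Show A", "Show B", "Show C"),
  year   = c(2011, 1994, 2008),
  rating = c(8.7, 9.2, 9.0)
)

# Sort by rating, highest first, then number the rows to get the rank.
toy_ranked <- toy_shows %>%
  arrange(desc(rating)) %>%
  mutate(rank = row_number())
toy_ranked
```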
UPDATE
Find out which, if any, of the top TV shows are from your birth year.
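One way to approach this is a simple filter on the year column. The sketch below uses made-up shows and a hypothetical birth year of 1994.

```r
library(dplyr)
library(tibble)

# Made-up sample data; birth_year is a hypothetical value.
toy_shows <- tibble(
  title = c("Show A", "Show B", "Show C"),
  year  = c(2011, 1994, 2008)
)
birth_year <- 1994

# Keep only the shows whose year matches the birth year.
birth_year_shows <- toy_shows %>%
  filter(year == birth_year)
birth_year_shows
```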
UPDATE
Visualize the average yearly rating as we did in class for the movies.
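The sketch below shows one plausible version of this step on made-up data: average the ratings within each year, then plot the averages over time. The choice of points joined by a line is an assumption and may differ from the plot used in class.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Made-up sample data standing in for the scraped shows.
toy_shows <- tibble(
  year   = c(1994, 1994, 2008),
  rating = c(9.0, 8.0, 9.0)
)

# Average rating per year.
yearly <- toy_shows %>%
  group_by(year) %>%
  summarise(avg_rating = mean(rating))

# Plot the yearly averages as points joined by a line.
yearly_plot <- ggplot(yearly, aes(x = year, y = avg_rating)) +
  geom_point() +
  geom_line()
yearly_plot
```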
UPDATE
Count the TV shows in the top 100 per decade. You can visualize this using either a table or a plot.
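For the table version, one approach is to derive a decade column by integer division and then count rows per decade. The sketch below uses made-up years; the column names are illustrative.

```r
library(dplyr)
library(tibble)

# Made-up sample years standing in for the scraped shows.
toy_shows <- tibble(year = c(1994, 1999, 2008, 2011, 2015))

# Integer-divide by 10 and multiply back to get the decade, then count.
per_decade <- toy_shows %>%
  mutate(decade = year %/% 10 * 10) %>%
  count(decade)
per_decade
```

The same per_decade table can be passed to a bar chart if a plot is preferred.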