Textual Data Web Scraping Demo

class: left, top, title-slide

.title[
# Textual Data Web Scraping Demo
]
.author[
### Keith VanderLinden, Calvin University
]

---

# Checking Access Restrictions

Before collecting data from a site, it's proper to check that scraping is permitted.

```r
library(robotstxt)
paths_allowed("http://www.imdb.com")
```

```
[1] TRUE
```

```r
paths_allowed("http://www.facebook.com")
```

```
[1] FALSE
```

```r
paths_allowed("https://www.gutenberg.org")
```

```
[1] TRUE
```

```r
paths_allowed("https://www.gutenberg.org/ebooks/search")
```

```
[1] FALSE
```
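The mixed results for Gutenberg suggest that permissions are set per path rather than per site. A minimal sketch of that idea follows, assuming the `robotstxt()` constructor accepts rules through its `text` argument and exposes a `$check()` method; the rules string is a hypothetical stand-in, not Gutenberg's actual robots.txt.

```r
library(robotstxt)

# Hypothetical robots.txt content, written inline so nothing is downloaded.
rules <- "User-agent: *\nDisallow: /ebooks/search\n"

# Parse the supplied text instead of fetching a live robots.txt file.
rt <- robotstxt(text = rules)

# Check several paths at once for the generic bot "*".
allowed <- rt$check(paths = c("/", "/ebooks/search"), bot = "*")
allowed
```

Checking the specific paths you plan to request, not just the site root, avoids a false sense of permission.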

???
- Don't assume that all Web text is free and open.
- Computers can attend to and remember everything, so text that is publicly visible is not necessarily open to large-scale analysis.

---
# Scraping the HTML Data

We can then scrape the HTML from the [IMDB Top Chart](https://www.imdb.com/chart/top/).

```r
library(rvest)
library(tidyverse)  # stringr, tibble, dplyr, and ggplot2 are used on later slides

page <- read_html("https://www.imdb.com/chart/top/")
page
```

```
{html_document}
<html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html ...
[2] <body id="styleguide-v2" class="fixed">\n <img ...
```

???
- This will be an ETL example. Here's the E(xtract).
- Note how ugly the raw HTML is in this case.

---
# Accessing the Relevant HTML Elements

.pull-left[

```r
ratings <- page %>%
  html_nodes("strong") %>%
  html_text() %>%
  as.numeric()
ratings[1:5]
```

```r
years <- page %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>%
  str_remove("\\)") %>%
  as.numeric()
years[1:5]
```

```r
titles <- page %>%
  html_nodes(".titleColumn a") %>%
  html_text()
titles[1:5]
```
]
.pull-right[

```
[1] 9.2 9.2 9.0 9.0 9.0
```

```
[1] 1994 1972 2008 1974 1957
```

```
[1] "The Shawshank Redemption" "The Godfather"           
[3] "The Dark Knight"          "The Godfather Part II"   
[5] "12 Angry Men"            
```
]
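To see how the three selectors map onto markup without hitting the live site, here is a small sketch run against an inline HTML fragment. The fragment is a hypothetical imitation of one chart row (the real page's structure may differ or change), and the `href` is a placeholder.

```r
library(rvest)
library(stringr)

# Hypothetical markup imitating one row of the chart; not copied from IMDB.
row_html <- '<table>
  <tr>
    <td class="titleColumn">
      1. <a href="/title/example/">The Shawshank Redemption</a>
      <span class="secondaryInfo">(1994)</span>
    </td>
    <td class="ratingColumn"><strong>9.2</strong></td>
  </tr>
</table>'

# read_html() parses a literal HTML string as well as a URL.
row <- read_html(row_html)

# ".titleColumn a" selects links nested inside elements with that class.
title <- row %>% html_nodes(".titleColumn a") %>% html_text()

# ".secondaryInfo" selects by class; the parentheses are stripped before conversion.
year <- row %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove_all("[()]") %>%
  as.numeric()

# "strong" selects by tag name.
rating <- row %>% html_nodes("strong") %>% html_text() %>% as.numeric()
```

Testing selectors on a saved or hand-written fragment like this also keeps repeated requests off the server while you experiment.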

???
- Continuing the ETL example: here's the T(ransform).
- Show how to find all these HTML elements using the Firefox DOM inspector.

---
# Creating a Tidy Dataframe

.pull-left[

```r
imdb_top_250 <- tibble(
  title = titles,
  year = years,
  rating = ratings
)

# The scraped vectors are already in chart order, so the row number is the rank.
imdb_top_250 <- imdb_top_250 %>%
  mutate(rank = row_number())
```
]
.pull-right[

```r
imdb_top_250
```

```
# A tibble: 250 × 4
  title                     year rating  rank
  <chr>                    <dbl>  <dbl> <int>
1 The Shawshank Redemption  1994    9.2     1
2 The Godfather             1972    9.2     2
3 The Dark Knight           2008    9       3
4 The Godfather Part II     1974    9       4
5 12 Angry Men              1957    9       5
6 Schindler's List          1993    8.9     6
# … with 244 more rows
```
]
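If the load step should also persist the data, one possible sketch is to write the tibble to a CSV file and read it back. This is an assumption about what "load" could mean here, not part of the original demo; it uses `readr` and a temporary file, with the first three rows of the chart standing in for the full table.

```r
library(tibble)
library(readr)

# Stand-in data: the first three rows shown on the slide.
top_rows <- tibble(
  title  = c("The Shawshank Redemption", "The Godfather", "The Dark Knight"),
  year   = c(1994, 1972, 2008),
  rating = c(9.2, 9.2, 9.0)
)

# Write to a temporary file so the sketch leaves nothing behind in the project.
out_file <- tempfile(fileext = ".csv")
write_csv(top_rows, out_file)

# Read the file back to confirm the round trip preserved the rows.
reloaded <- read_csv(out_file, show_col_types = FALSE)
reloaded
```

Saving a snapshot like this means later analysis does not depend on the page staying unchanged.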

???
- Finishing the ETL example: here's the L(oad).

---
# Visualizing the Movie Ratings Over Time

.pull-left[

```r
imdb_top_250 %>%
  group_by(year) %>%
  summarise(avg_score = mean(rating)) %>%
  ggplot() +
  aes(x = year, y = avg_score) +
  geom_point() +
  geom_smooth() +
  labs(
    title = "Average Movie Ratings",
    x = "Movie Year",
    y = "Average Movie Rating"
  )
```
]
.pull-right[
<img src="scraping_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" />
]

???