class: left, top, title-slide

.title[
# Textual Data: The World-Wide Web
]
.author[
### Keith VanderLinden, Calvin University
]

---

# The World-Wide Web

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:
- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:
- Uniform Resource Locators (URL)
- Hypertext Transfer Protocol (HTTP)
- Hypertext Markup Language (HTML)
- Cascading Style Sheets (CSS)
- JavaScript
]
.pull-right[
References: 
https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL 
https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview 
https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics 
https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics 
https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics

]

???
(Sir) Tim Berners-Lee is credited with first proposing the WWW.

---
# The World-Wide Web - URLs

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:
- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:
- **Uniform Resource Locators ([URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL))**
- Hypertext Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview))
- Hypertext Markup Language ([HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics))
- Cascading Style Sheets ([CSS](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics))
- [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics)
]
.pull-right[
URLs are document addresses.

![URL Structure](images/mdn-url-short.png)

.footnote[Image from MDN: https://developer.mozilla.org]
]

???
- There are additional URL fields for:
  - passing arguments (the query string, after `?`).
  - addressing page anchors (the fragment, after `#`).
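
The URL fields can also be pulled apart in code. The following is a minimal sketch, assuming the `xml2` package (installed alongside `rvest`) and a made-up `example.com` address that does not come from the slides:

```r
library(xml2)

# Hypothetical URL for illustration; it includes a query string and a fragment.
url <- "https://www.example.com/path/to/page.html?key1=value1&key2=value2#SomeAnchor"

# url_parse() returns a data frame with one row per URL and one column per field.
parts <- url_parse(url)

parts$scheme    # the protocol
parts$server    # the host name
parts$path      # the document's location on the server
parts$query     # the arguments passed to the server (after "?")
parts$fragment  # the page anchor (after "#")
```

Each field is a column of `parts`, so the same call works on a whole vector of URLs.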

---
# The World-Wide Web - HTTP

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:
- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:
- Uniform Resource Locators ([URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL))
- **Hypertext Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview))**
- Hypertext Markup Language ([HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics))
- Cascading Style Sheets ([CSS](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics))
- [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics)
]
.pull-right[
HTTP is an Internet communication protocol.

![HTTP](images/http.png)

.footnote[Image from: [MDN](https://developer.mozilla.org)]
]

???
- See the `http:` in the browser's URL box.
- HTTP request methods include GET, PUT, POST, and DELETE.
- HTTPS is HTTP sent over an encrypted (TLS) connection.
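
An HTTP request can be described in R before it is sent. The following is a rough sketch, assuming the `httr2` package; the `example.org` address and the helper name `fetch_status` are placeholders chosen for illustration:

```r
library(httr2)

# Describe a GET request (the default method) without sending it yet.
req <- request("https://example.org/")

# The same URL with a different request method.
post_req <- req_method(req, "POST")

# Sending a request needs network access, so it is wrapped in a function
# here rather than run at the top level.
fetch_status <- function(req) {
  resp <- req_perform(req)
  resp_status(resp)  # the response's status code, e.g. 200 for success
}
```

With a network connection, `fetch_status(req)` would send the GET request and report the server's status code.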

---
# The World-Wide Web - HTML

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:
- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:
- Uniform Resource Locators ([URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL))
- Hypertext Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview))
- **Hypertext Markup Language ([HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics))**
- Cascading Style Sheets ([CSS](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics))
- [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics)
]
.pull-right[
HTML specifies a page's structure and content.

```html
<!DOCTYPE html>
<html>
  <head>
    <link rel="stylesheet"
          href="sample.css">
    <title>My Sample Webpage</title>
  </head>
  <body>
    <h1>A Heading</h1>
    <p>a paragraph</p>
    <ul>
      <li id="an_id"><i>italics</i></li>
      <li><b>bold</b></li>
      <li>An image:</li>
    </ul>
    <img src="images/rvest.jpg">
  </body>
</html>
```
]

???
- Load [sample.html](sample.html) in a browser (without the style sheet).
- HTML elements are specified using (usually) paired *tags*.
- The tags delimit the element's content.
- The `body` element contains what the browser displays.
- The HTML structure is hierarchical, with parent and child nodes (e.g., body contains h1). This structure will be important for scraping.
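
The parent and child structure can be explored directly in R. The following is a minimal sketch, assuming the `rvest` package and a small page modeled on the sample, not the actual `sample.html` file:

```r
library(rvest)

# A small page modeled on the sample; minimal_html() wraps it in a full document.
page <- minimal_html('
  <h1>A Heading</h1>
  <p>a paragraph</p>
  <ul>
    <li id="an_id">italics</li>
    <li>bold</li>
  </ul>
')

# The body element is the parent of the heading, the paragraph, and the list.
body <- html_element(page, "body")
child_names <- html_name(html_children(body))

# The list, in turn, is the parent of its items.
items <- html_children(html_element(page, "ul"))
item_text <- html_text2(items)
```

Walking from a parent to its children in this way is the basic move behind most scraping code.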

---
# The World-Wide Web - CSS

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:
- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:
- Uniform Resource Locators ([URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL))
- Hypertext Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview))
- Hypertext Markup Language ([HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics))
- **Cascading Style Sheets ([CSS](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics))**
- [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics)
]
.pull-right[
CSS specifies the style for HTML documents.

```css
h1 {
  color: navy;
}

#an_id {
  color: maroon;
}

.a_class {
  color: darkgoldenrod;
}

ul li p {
  color: forestgreen;
  font-family: "Courier New", monospace;
}
```
]

???
- Load [sample.html](sample.html) in a browser (with the style sheet).
- Style sheets were added later to simplify consistent formatting.
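- Explain the CSS selectors: a class selector begins with a period (`.a_class`), an id selector begins with a number sign (`#an_id`), and a space-separated list (`ul li p`) selects by containment.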
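
The same selector syntax is what scraping tools use to pick out elements. The following is a minimal sketch, assuming the `rvest` package; the markup is hypothetical and was written so that each selector on the slide matches something:

```r
library(rvest)

# Hypothetical markup in which every selector from the slide has a match.
page <- minimal_html('
  <h1>A Heading</h1>
  <p class="a_class">a paragraph</p>
  <ul>
    <li id="an_id">italics</li>
    <li><p>bold</p></li>
  </ul>
')

by_element <- html_text2(html_elements(page, "h1"))        # element name
by_id      <- html_text2(html_elements(page, "#an_id"))    # id
by_class   <- html_text2(html_elements(page, ".a_class"))  # class
by_nesting <- html_text2(html_elements(page, "ul li p"))   # containment
```

Each call returns the text of every element that the selector matches, so a selector that matches nothing yields an empty character vector.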

---
# The World-Wide Web - JavaScript

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:
- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:
- Uniform Resource Locators ([URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL))
- Hypertext Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview))
- Hypertext Markup Language ([HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics))
- Cascading Style Sheets ([CSS](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics))
- **[JavaScript](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics)**
]
.pull-right[
JavaScript implements dynamic webpage behavior.

```javascript
// Select the first image on the page.
let myImg = document.querySelector('img');

// Swap the image source each time the image is clicked.
myImg.onclick = function() {
    let mySrc = myImg.getAttribute('src');
    if (mySrc === 'firefox-icon.png') {
      myImg.setAttribute('src',
                         'firefox2.png'
                         );
    } else {
      myImg.setAttribute('src',
                         'firefox-icon.png'
                         );
    }
};
```

.footnote[Example from MDN: https://developer.mozilla.org]
]

???
- This code alternates between two images each time the user clicks the image.

- Summary
  - These were simple examples.
  - A real, modern webpage is much more complicated.
    - Try `Ctrl+U` to see the raw web page source.
    - Try the browser's *Inspect* tool to explore the page elements by hovering over them.

---
# Collecting Text from the Web

.pull-left[
There are several ways to collect textual data from the Web:
- Download raw text documents by URL.
- Access Web *Application Programming Interfaces* (APIs).
- Parse the raw HTML of a Web page. This is called *Web scraping*.

*Web scraping* requires identifying pages that contain useful data (i.e., *crawling*) and reading the data from those pages (i.e., *scraping*).
]

.pull-right[
**Examples from the Textbook**

Macbeth: http://www.gutenberg.org/cache/epub/1129/pg1129.txt

The arXiv API:

```r
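library(aRxiv)
# Search arXiv for papers matching the phrase "Data Science".
arxiv_search(
  query = '"Data Science"',
  limit = 20000,    # maximum number of records to return
  batchsize = 100   # records retrieved per request
)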
```

Scraping: https://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles
]

???
- All of these approaches use HTTP.
- APIs are preferable to scraping because:
  - APIs are purpose-built for machine-readability.
  - Web pages are purpose-built for human-readability.
  - Scraping code may need to be changed whenever a webpage is reformatted.
- We'll demo:
  1. Scraping a pre-identified page using `rvest` (the IMDb top chart).
  2. Downloading texts in `*.txt` format (Shakespeare text from Gutenberg).
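
Downloading a plain-text document needs only base R. The following is a minimal sketch; the helper name `read_text` and the two-line stand-in text are assumptions made so that the example runs without network access:

```r
# Read a plain-text document into a character vector, one element per line.
# readLines() accepts either a URL or a local file path.
read_text <- function(source) {
  readLines(source, warn = FALSE, encoding = "UTF-8")
}

# Offline stand-in for a downloaded text, so the sketch runs without a network.
tmp <- tempfile(fileext = ".txt")
writeLines(c("ACT I", "", "SCENE I. A desert place."), tmp)

text_lines <- read_text(tmp)
non_blank <- text_lines[nzchar(text_lines)]  # drop empty lines
length(non_blank)
```

With network access, `read_text("http://www.gutenberg.org/cache/epub/1129/pg1129.txt")` would read the Gutenberg text listed on the slide in the same way.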