class: left, top, title-slide

.title[
# Textual Data
The World-Wide Web
]

.author[
### Keith VanderLinden
Calvin University
]

---

# The World-Wide Web

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:

- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:

- Uniform Resource Locators (URL)
- Hyper-text Transfer Protocol (HTTP)
- Hyper-text Markup Language (HTML)
- Cascading Style Sheets (CSS)
- JavaScript
]

.pull-right[
References:<br>
https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL<br>
https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview<br>
https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics<br>
https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics<br>
https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics<br>
]

???

(Sir) Tim Berners-Lee is credited with first proposing the WWW.

---

# The World-Wide Web - URLs

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:

- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:

- **Uniform Resource Locators ([URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL))**
- Hyper-text Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview))
- Hyper-text Markup Language ([HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics))
- Cascading Style Sheets ([CSS](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics))
- [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics)
]

.pull-right[
URLs are document addresses.

![URL Structure](images/mdn-url-short.png)

.footnote[Image from MDN: https://developer.mozilla.org]
]

???

- There are additional URL fields for:
  - passing arguments.
  - addressing page anchors.
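
The URL fields mentioned above, including the query string for passing arguments and the fragment for addressing page anchors, can be inspected with the standard `URL` class available in browsers and Node.js. This is a minimal sketch; the address is a made-up example and the `parts` object is only there for illustration.

```javascript
// Hypothetical address, invented for illustration only.
const url = new URL(
  "https://www.example.com:8080/path/to/myfile.html?key1=value1&key2=value2#SomewhereInTheDocument"
);

// Break the address into its component fields.
const parts = {
  protocol: url.protocol,             // scheme, here "https:"
  hostname: url.hostname,             // domain name of the server
  port: url.port,                     // explicit port, as a string
  pathname: url.pathname,             // path to the resource on the server
  key1: url.searchParams.get("key1"), // one argument from the query string
  hash: url.hash,                     // page anchor, including the leading "#"
};

console.log(parts);
```

Note that `url.port` is an empty string when the address uses the default port for its scheme, which is why the example uses a non-default one.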
---

# The World-Wide Web - HTTP

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:

- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:

- Uniform Resource Locators ([URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL))
- **Hyper-text Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview))**
- Hyper-text Markup Language ([HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics))
- Cascading Style Sheets ([CSS](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics))
- [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics)
]

.pull-right[
HTTP is an Internet communication protocol.

![HTTP](images/http.png)

.footnote[Image from: [MDN](https://developer.mozilla.org)]
]

???

- See the `http:` in the browser's URL box.
- HTTP request methods include GET, PUT, POST, and DELETE.
- HTTPS is HTTP with encryption.

---

# The World-Wide Web - HTML

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:

- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:

- Uniform Resource Locators ([URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL))
- Hyper-text Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview))
- **Hyper-text Markup Language ([HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics))**
- Cascading Style Sheets ([CSS](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics))
- [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics)
]

.pull-right[
HTML specifies the structure and content.
```html
<!DOCTYPE html>
<html>
  <head>
    <link rel="stylesheet" href="sample.css">
    <title>My Sample Webpage</title>
  </head>
  <body>
    <h1>A Heading</h1>
    <p class="a_class">a paragraph</p>
    <ul>
      <li id="an_id"><em>italics</em></li>
      <li><strong>bold</strong></li>
      <li><p>An image:</p></li>
    </ul>
    <img src="images/rvest.jpg">
  </body>
</html>
```
]

???

- Load [sample.html](sample.html) in a browser (without the style sheet).
- HTML elements are specified using (usually) paired *tags*.
- The tags delimit some HTML text.
- The body element is what the browser displays.
- The HTML structure is hierarchical, with parent and child nodes (e.g., body contains h1). This structure will be important for scraping.

---

# The World-Wide Web - CSS

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:

- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:

- Uniform Resource Locators ([URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL))
- Hyper-text Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview))
- Hyper-text Markup Language ([HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics))
- **Cascading Style Sheets ([CSS](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics))**
- [JavaScript](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics)
]

.pull-right[
CSS specifies the style for HTML documents.

```css
h1 {
  color: navy;
}
#an_id {
  color: maroon;
}
.a_class {
  color: darkgoldenrod;
}
ul li p {
  color: forestgreen;
  font-family: "Courier New";
}
```
]

???

- Load [sample.html](sample.html) in a browser (with the style sheet).
- Style sheets were added later to simplify consistent formatting.
- Explain the CSS selectors: a period (`.`) selects by class, a number sign (`#`) selects by id, and a space-separated list (e.g., `ul li p`) selects by containment.
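
How the selector forms pick out elements can be sketched without a browser. The snippet below is a toy matcher written for these notes, not how browsers implement CSS; the `page` tree and the `matchesSimple` and `select` helpers are hypothetical names that mirror the body of the sample page. In a browser, `document.querySelectorAll()` does this work.

```javascript
// Hypothetical stand-in for the element tree of the sample page's <body>.
const page = {
  tag: "body",
  children: [
    { tag: "h1", text: "A Heading", children: [] },
    { tag: "p", classes: ["a_class"], text: "a paragraph", children: [] },
    {
      tag: "ul",
      children: [
        { tag: "li", id: "an_id", text: "italics", children: [] },
        { tag: "li", text: "bold", children: [] },
        { tag: "li", children: [{ tag: "p", text: "An image:", children: [] }] },
      ],
    },
  ],
};

// Test one element against one simple selector: "tag", "#id", or ".class".
function matchesSimple(el, simple) {
  if (simple.startsWith("#")) return el.id === simple.slice(1);
  if (simple.startsWith(".")) return (el.classes || []).includes(simple.slice(1));
  return el.tag === simple;
}

// Collect every element under root that matches a space-separated
// (descendant) selector such as "ul li p".
function select(root, selector) {
  const parts = selector.trim().split(/\s+/);
  const found = [];
  function visit(el, ancestors) {
    // The last part must match the element itself ...
    if (matchesSimple(el, parts[parts.length - 1])) {
      // ... and the earlier parts must match its ancestors, nearest first.
      let i = parts.length - 2;
      for (let a = ancestors.length - 1; a >= 0 && i >= 0; a--) {
        if (matchesSimple(ancestors[a], parts[i])) i--;
      }
      if (i < 0) found.push(el);
    }
    for (const child of el.children) visit(child, [...ancestors, el]);
  }
  visit(root, []);
  return found;
}

console.log(select(page, "h1").map((el) => el.text));       // element selector
console.log(select(page, "#an_id").map((el) => el.text));   // id selector
console.log(select(page, ".a_class").map((el) => el.text)); // class selector
console.log(select(page, "ul li p").map((el) => el.text));  // containment
```

Only the paragraph nested inside the list matches `ul li p`; the `a_class` paragraph does not, because it has no `ul` or `li` ancestor.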
---

# The World-Wide Web - JavaScript

.pull-left[
The *World-Wide Web* is a freely accessible repository of textual data that is:

- human-readable, not machine-readable.
- ever-increasing in volume and variety.

The Web is based on these technologies:

- Uniform Resource Locators ([URL](https://developer.mozilla.org/en-US/docs/Learn/Common_questions/What_is_a_URL))
- Hyper-text Transfer Protocol ([HTTP](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview))
- Hyper-text Markup Language ([HTML](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/HTML_basics))
- Cascading Style Sheets ([CSS](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/CSS_basics))
- **[JavaScript](https://developer.mozilla.org/en-US/docs/Learn/Getting_started_with_the_web/JavaScript_basics)**
]

.pull-right[
JavaScript implements dynamic webpage behavior.

```javascript
let myImg = document.querySelector('img');

// Swap the image source each time the image is clicked.
myImg.onclick = function() {
  let mySrc = myImg.getAttribute('src');
  if (mySrc === 'firefox-icon.png') {
    myImg.setAttribute('src', 'firefox2.png');
  } else {
    myImg.setAttribute('src', 'firefox-icon.png');
  }
}
```

.footnote[Example from MDN: https://developer.mozilla.org]
]

???

- This code alternates the display of two images each time the user clicks the image.
- Summary
  - These were simple examples.
  - A real, modern webpage is much more complicated.
  - Try `ctrl-u` to see the raw web page content.
  - Try `inspect element` to hover over the page elements.

---

# Collecting Text from the Web

.pull-left[
There are several ways to collect textual data from the Web:

- Download raw text documents by URL.
- Access Web *Application Programming Interfaces* (APIs).
- Parse the raw HTML of a Web page. This is called *Web scraping*.

*Web scraping* requires that we identify pages that contain useful data (i.e., *crawling*) and read that data from those pages (i.e., *scraping*).
]

--

.pull-right[
**Examples from the Textbook**

Macbeth: http://www.gutenberg.org/cache/epub/1129/pg1129.txt

The arXiv API:

```r
library(aRxiv)
arxiv_search(
  query = '"Data Science"',
  limit = 20000,
  batchsize = 100
)
```

Scraping: https://en.wikipedia.org/wiki/List_of_songs_recorded_by_the_Beatles
]

???

- All these approaches use HTTP.
- APIs are preferable to scraping because:
  - APIs are purpose-built for machine-readability.
  - Web pages are purpose-built for human-readability.
  - Scraping algorithms need to be changed every time a webpage is reformatted.
- We'll demo:
  1. Scraping a pre-identified page using `rvest` (the IMDb top chart).
  2. Downloading texts in `*.txt` format (Shakespeare text from Gutenberg).
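
Since an API call is itself an HTTP request to a URL, the `limit` and `batchsize` arguments above can be pictured as a series of paged requests. The sketch below builds such request URLs for the public arXiv API query endpoint. It is an assumption about how batching might map onto requests, not the aRxiv package's actual implementation; `batchUrls` is a hypothetical helper, and the `all:` field prefix is an assumed way of expressing the query.

```javascript
// Build one request URL per batch for the arXiv API query endpoint.
// `batchUrls` is a hypothetical helper, not part of any library.
function batchUrls(query, limit, batchSize) {
  const urls = [];
  for (let start = 0; start < limit; start += batchSize) {
    const params = new URLSearchParams({
      search_query: query,                                      // what to search for
      start: String(start),                                     // index of the first result
      max_results: String(Math.min(batchSize, limit - start)),  // size of this batch
    });
    urls.push(`http://export.arxiv.org/api/query?${params}`);
  }
  return urls;
}

// Assumed query form: search all fields for the quoted phrase.
const urls = batchUrls('all:"Data Science"', 20000, 100);
console.log(urls.length); // number of requests needed
console.log(urls[0]);     // URL for the first batch
```

Each URL could then be fetched with an HTTP GET; the responses come back in a machine-readable format, which is the point of preferring an API to scraping.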