class: left, top, title-slide

.title[
# Textual Data
Natural Language Processing
]

.author[
### Keith VanderLinden
Calvin University
]

---

# Natural Language Processing

.pull-left[
*Natural Language Processing* analyzes unstructured, human-written text, which, unlike numerical data, is not naturally numeric or tabular. Thus, it requires different data-wrangling skills.

Human text understanding operates at multiple levels simultaneously:

- Syntax
- Semantics
]

???

- As with numerical data, there is plenty of accessible textual data.
- Text "levels": Form vs. Function
  - Syntax = grammar + phonology + orthography
  - Semantics = meaning (reference/truth) in context (pragmatics)

--

.pull-right[
Examples

- “am . hUNgry I”
- “I am hungry.”
- “I am so hungry I could eat a horse.”
- “I am soooooooooo HANGRY!!!!!”
- “Colorless green ideas sleep furiously.”
]

???

- Analyses:
  - ungrammatical (i.e., syntactically incorrect)
  - grammatical, semantically clear
  - grammatical, semantically unclear?
  - ungrammatical, semantically clear?
  - [from N. Chomsky] grammatical, but meaningless
- Plan:
  - Modeling is required to even approach meaning (INFO 602).
  - We'll focus on the surprisingly useful techniques of corpus linguistics (INFO 601).

---

# Corpus Linguistics

.pull-left[
*Corpus Linguistics* is a form of natural language processing that bases its analysis of language on a set of textual documents collected into a *corpus*. It often attempts to *mine* information about a document's:

- Use of words and phrases
- Sentiment
- Topic

.footnote[See: *Text Mining with R: A Tidy Approach*, J. Silge and D. Robinson, https://www.tidytextmining.com]
]

???

- We'll focus on these three analyses.
- As a quick example of the potential value of simple analyses, see also Google Ngram's analysis of the relative use of word pairs:
  - [“1900” vs. “2000”](https://books.google.com/ngrams/graph?content=1900%2C2000&year_start=1800&year_end=2019&corpus=26&smoothing=3&direct_url=t1%3B%2C1900%3B%2Cc0%3B.t1%3B%2C2000%3B%2Cc0)
  - [“science” vs. “religion”](https://books.google.com/ngrams/graph?content=science%2Creligion&year_start=1800&year_end=2019&corpus=26&smoothing=3&direct_url=t1%3B%2Cscience%3B%2Cc0%3B.t1%3B%2Creligion%3B%2Cc0)

---

# Tidy Textual Data

.pull-left[
Recall that *tidy* datasets have the following characteristics:

- Each *variable* must have its own column.
- Each *observation* must have its own row.
- Each *value* must have its own cell.

A *tidy* view of textual data represents each meaningful unit of text, e.g., a word or phrase, as its own *observation*, one token per row. The `tidytext` library supports this.

```r
library(tidytext)
library(janeaustenr)
```
]

???

It's surprising how much we can learn from text using simple text units.

--

.pull-right[

```r
prideprejudice[10:11]
```

```
[1] "It is a truth universally acknowledged, that a single man in possession"
[2] "of a good fortune, must be in want of a wife."
```

```r
tibble(txt = prideprejudice[10:11]) %>%
  unnest_tokens(word, txt)
```

```
# A tibble: 23 × 1
  word        
  <chr>       
1 it          
2 is          
3 a           
4 truth       
5 universally 
6 acknowledged
# … with 17 more rows
```
]

???

- Words are often used as tokens.
- `unnest_tokens()` has:
  - *tokenized* the text
  - removed the punctuation
  - converted all letters to lower-case
  - created a new `word` column
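As a possible aside for these notes (an added sketch, not from the original deck): once the text is in this one-token-per-row form, common function words can be dropped with `tidytext`'s built-in `stop_words` table. This assumes `dplyr`, `tidytext`, and `janeaustenr` are loaded as in the chunks above.

```r
# Tokenize the same two lines, then drop common function words ("it", "is",
# "a", ...) using tidytext's built-in `stop_words` table; anti_join() keeps
# only the rows whose `word` does not appear in that table.
tibble(txt = prideprejudice[10:11]) %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words, by = "word")
```

This removes rows for words like "it", "is", and "a", keeping the more contentful words, which is often a first step before counting word frequencies.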

---

# Analyses of N-Grams

We can use longer phrases as the meaningful units of analysis. Here, we compute a key-word-in-context (KWIC) list.

```r
austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 5) %>%
  select(ngram) %>%
  # keep 5-grams whose middle word is "pride"
  filter(grepl(".+ .+ pride .+ .+", ngram))
```

```
# A tibble: 26 × 1
  ngram                            
  <chr>                            
1 up with pride and i              
2 vanity and pride are different   
3 being vain pride relates more    
4 mixture of pride and impertinence
5 indeed but pride where there     
6 of mind pride will be            
# … with 20 more rows
```

???

- I've slipped a regex in there; we'll discuss them next!
- N-gram analysis is valuable in:
  - Lexicography
    - Concordances like this help put words in context.
  - Language Modeling
    - Usually we set n <= 3.
- Distributional semantics is based on J. R. Firth's idea that "You shall know a word by the company it keeps" (1957).

---

# Sentiment Analysis: Lexicons

We can use sentiment-tagged lexicons to quantify the emotional tone of a text document.

.pull-left[

```r
get_sentiments() %>%
  filter(sentiment == "positive")
```

```
# A tibble: 2,005 × 2
  word       sentiment
  <chr>      <chr>    
1 abound     positive 
2 abounds    positive 
3 abundance  positive 
4 abundant   positive 
5 accessable positive 
6 accessible positive 
# … with 1,999 more rows
```
]

.pull-right[

```r
get_sentiments() %>%
  filter(sentiment == "negative")
```

```
# A tibble: 4,781 × 2
  word       sentiment
  <chr>      <chr>    
1 2-faces    negative 
2 abnormal   negative 
3 abolish    negative 
4 abominable negative 
5 abominably negative 
6 abominate  negative 
# … with 4,775 more rows
```
]

???

- These sentiment-tagged corpora are pre-built.
- Their earliest forms used emojis to set sentiment.

---

# Sentiment Analysis: Results

We can join a document with a sentiment lexicon to quantify the sentiment value of a text. Here, the values range from totally negative, -1.0, to totally positive, +1.0.

.pull-left[

```r
austen_books() %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments()) %>%
  group_by(book) %>%
  count(sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = (positive - negative) / (positive + negative))
```
]

.pull-right[

```
# A tibble: 6 × 4
# Groups:   book [6]
  book                negative positive sentiment
  <fct>                  <int>    <int>     <dbl>
1 Sense & Sensibility     3671     4933     0.147
2 Pride & Prejudice       3652     5052     0.161
3 Mansfield Park          4828     6749     0.166
4 Emma                    4809     7157     0.196
5 Northanger Abbey        2518     3244     0.126
6 Persuasion              2201     3473     0.224
```
]

???

- This explains why I like Jane Austen: every book is net positive in sentiment.
- Companies care about what people say about them.

---

# Topic Modeling: Term Frequency

We can use the frequency of the words to quantify the document's topic. Here, we demonstrate [Zipf's Law](https://en.wikipedia.org/wiki/Zipf%27s_law).

.pull-left[

```r
pride_words <- austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
pride_words
```

```
# A tibble: 6,538 × 2
  word      n
  <chr> <int>
1 the    4331
2 to     4162
3 of     3610
4 and    3585
5 her    2203
6 i      2065
# … with 6,532 more rows
```
]

.pull-right[

```r
pride_words %>%
  mutate(total = sum(n)) %>%
  ggplot(aes(x = n / total)) +
  geom_histogram() +
  xlim(NA, 0.0009)
```

<img src="nlp_files/figure-html/unnamed-chunk-10-1.png" width="90%" style="display: block; margin: auto;" />
]

???

- Word use tends to follow Zipf's law.
  - Law: The frequency of a word is inversely proportional to its rank.
  - Thus, the second most common word will be 1/2 as common as the most common word.
- This "law" has been observed throughout science.
- It's useful in topic modeling, in which we can ignore the common words (*stop words*) and focus on the uncommon words.
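To make the law concrete in these notes (an added sketch, not from the original deck; it assumes `pride_words` from the chunk above and that `dplyr` and `ggplot2` are loaded): plot each word's frequency against its rank on log-log scales, where Zipf's law shows up as a roughly straight line.

```r
# Rank the words by frequency (pride_words is already sorted by n, descending)
# and compare frequency to rank on log-log scales; an approximately straight
# line with slope near -1 is the signature of Zipf's law.
pride_words %>%
  mutate(rank = row_number(),
         frequency = n / sum(n)) %>%
  ggplot(aes(x = rank, y = frequency)) +
  geom_line() +
  scale_x_log10() +
  scale_y_log10()
```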

---

# Topic Modeling: TF-IDF

We can use *Term Frequency-Inverse Document Frequency* (TF-IDF) to quantify the topic of a document more accurately.

.pull-left[

```r
book_tf_idf <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE) %>%
  bind_tf_idf(word, book, n)

book_tf_idf %>%
  filter(book == "Pride & Prejudice") %>%
  group_by(book) %>%
  slice_max(tf_idf, n = 15) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, fct_reorder(word, tf_idf), fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free") +
  labs(x = "tf-idf", y = NULL)
```
]

.pull-right[
<img src="nlp_files/figure-html/unnamed-chunk-11-1.png" width="100%" style="display: block; margin: auto;" />
]

???

- TF-IDF doesn't just count term frequencies or remove *stop words*; it:
  - decreases the weight of frequently-used words.
  - increases the weight of infrequently-used words.
- These results are okay; Pride & Prejudice is clearly *about* Mr. Darcy and the Bennet family, but because of the way the characters are addressed, Elizabeth appears below Bingley.
- Sense & Sensibility gives more compelling results because it is clearly *about* Elinor and Marianne Dashwood.
- This works better for document-specific terms, e.g., scientific terms work well in the text's arXiv.org example, and it is commonly used in *information retrieval*.

Summary:

- We've now seen several useful applications of rather simple corpus linguistics techniques.
- Let us not despair, as the text does. If humans are good at understanding text and computers are good at storing it, let's view that as an opportunity, not a challenge.
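As a closing sketch for these notes (added here, not part of the original deck; it assumes the libraries loaded earlier in the slides): the weights that `bind_tf_idf()` produces can also be computed directly, which makes the definition explicit. A word's tf is its share of the words in a book, its idf is the natural log of the number of books divided by the number of books that contain it, and tf-idf is their product.

```r
n_books <- n_distinct(austen_books()$book)      # the corpus holds 6 novels

austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word) %>%
  group_by(book) %>%
  mutate(tf = n / sum(n)) %>%                   # term frequency within each book
  group_by(word) %>%
  mutate(idf = log(n_books / n_distinct(book)), # rarer across books => larger idf
         tf_idf = tf * idf) %>%
  ungroup() %>%
  arrange(desc(tf_idf))
```

Words that appear in every book get idf = log(6/6) = 0, so their tf-idf is zero no matter how frequent they are, which is why the common function words drop out without an explicit stop-word list.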