class: center, middle, inverse, title-slide

# Text Classification and Bias

### K Arnold

---

## Q&A

> Alternatives to Excel?

Excel is fine for small data and small teams. When either gets bigger, seek more structure:

* SQL-style databases
* Salesforce or similar CRM platforms
* Other data management platforms (AirTable, Notion, ...)

---

## AI Ethics

* Today at 12:30pm: [State of AI Ethics panel discussion](https://www.eventbrite.ca/e/the-state-of-ai-ethics-panel-tickets-129645853237)
  * (optional, but a good "Enrichment" activity for anyone who still needs to do one)
* We'll skim their [report](https://montrealethics.ai/oct2020/) for next week.

---

class: center, middle

## Text Analysis

---

### Why?

* Lots of data is *only* in text form
  * reviews (products, movies, travel destinations, etc.)
  * social media posts
  * articles (news, Wikipedia, etc.)
  * surveys
* Text gives more *depth* to existing data
  * Full review vs. just the star rating
  * What concepts/entities are *associated* with each other?
* Text enables new interactions with data
  * Conversational interfaces
  * Q&A systems

---

### What can we do with text data?

* Sentiment analysis
* Categorization (spam!)
* Information extraction
* Relationship extraction
* Topic analysis
* ... lots more!

---

### Example: Revealing Fake Comments

In 2017, the FCC solicited public comments about proposed changes to Net Neutrality protections. It got *flooded with fake comments*.
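One telltale signal (a minimal sketch; Kao's actual analysis also clustered semantically *similar* comments, which is more involved) is how often near-identical comments repeat once whitespace and capitalization are normalized:

```python
from collections import Counter

def count_duplicate_comments(comments):
    """Count how many times each normalized comment text appears."""
    normalized = (" ".join(c.lower().split()) for c in comments)
    return Counter(normalized)

# Toy example: three copies of one templated comment, with cosmetic variation.
comments = [
    "I support net neutrality.",
    "The FCC should repeal Title II.",
    "the  FCC should repeal Title II.",
    "The FCC should repeal Title II.",
]
counts = count_duplicate_comments(comments)
print(counts.most_common(1))  # the templated comment appears 3 times
```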
<img src="https://hackernoon.com/hn-images/1*shWYIe0km5rYxPebfGPTTg.png" width="75%" style="display: block; margin: auto;" />

.floating-source[
Source: Jeff Kao, [More than a Million Pro-Repeal Net Neutrality Comments were Likely Faked](https://hackernoon.com/more-than-a-million-pro-repeal-net-neutrality-comments-were-likely-faked-e9f0e3ed36a6)

See also this [BuzzFeed News article](https://www.buzzfeednews.com/article/jsvine/net-neutrality-fcc-fake-comments-impersonation)
]

---

### Setup

```r
if (!py_module_available("torch"))
  reticulate::py_install("pytorch", channel = "pytorch")
if (!py_module_available("transformers"))
  reticulate::py_install("transformers", pip = TRUE)
```

```python
from transformers import pipeline
from pprint import pprint
```

---

### Sentiment Analysis

We'll load the default sentiment analysis pipeline, which uses a model called [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english). It is:

* Google's [BERT](https://arxiv.org/abs/1810.04805) language model, trained on English Wikipedia and books
* "[distilled](https://arxiv.org/abs/1910.01108)" into a smaller model that performs similarly
* "fine-tuned" to the task of predicting sentiment on the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) (SST-2) dataset.
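The `score` the pipeline reports is the softmax probability of the predicted class, computed from the classification head's two output logits. A minimal sketch with made-up logits (the real model computes its logits from the input text):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for (NEGATIVE, POSITIVE), for illustration only.
logits = [-2.1, 3.4]
probs = softmax(logits)
label = "POSITIVE" if probs[1] > probs[0] else "NEGATIVE"
score = max(probs)
print(label, round(score, 4))
```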
```python
sentiment_pipeline = pipeline("sentiment-analysis")
```

```python
def text_to_sentiment(sentence):
    """Return the model's score, signed so that negative sentiment is negative."""
    result = sentiment_pipeline(sentence)[0]
    if result['label'] == "POSITIVE":
        return result['score']
    if result['label'] == "NEGATIVE":
        return -result['score']
    raise ValueError("Unknown result label: " + result['label'])
```

---

#### Sentiment Examples

```python
text_to_sentiment("I hate you")
```

```
## -0.9991129040718079
```

```python
text_to_sentiment("I love you")
```

```
## 0.9998656511306763
```

```python
text_to_sentiment("This is bad.")
```

```
## -0.9997842311859131
```

```python
text_to_sentiment("This is not that bad.")
```

```
## 0.9995995163917542
```

---

### Sentiment Bias

Examples from <https://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/>

```python
text_to_sentiment("Let's go get Italian food")
```

```
## -0.8368805050849915
```

```python
text_to_sentiment("Let's go get Chinese food")
```

```
## 0.7037906646728516
```

```python
text_to_sentiment("Let's go get Mexican food")
```

```
## -0.6264737248420715
```

---

```python
text_to_sentiment("My name is Emily")
```

```
## 0.9860560894012451
```

```python
text_to_sentiment("My name is Heather")
```

```
## 0.9748725891113281
```

```python
text_to_sentiment("My name is Latisha")
```

```
## -0.9962578415870667
```

```python
text_to_sentiment("My name is Nour")
```

```
## -0.81707364320755
```

---

### It's not just in toy examples

<img src="img/algorithmwatch-toxicity.png" width="100%" style="display: block; margin: auto;" />

.floating-source[
Source: [AlgorithmWatch](https://algorithmwatch.org/en/story/automated-moderation-perspective-bias/)
]

---

### Quantifying Bias

.small-code[

```python
NAMES_BY_ETHNICITY = {
    # The first two lists are from the Caliskan et al. appendix describing the
    # Word Embedding Association Test.
    'White': [
        'Adam', 'Chip', 'Harry', 'Josh', 'Roger', 'Alan', 'Frank', 'Ian',
        'Justin', 'Ryan', 'Andrew', 'Fred', 'Jack', 'Matthew', 'Stephen',
        'Brad', 'Greg', 'Jed', 'Paul', 'Todd', 'Brandon', 'Hank', 'Jonathan',
        'Peter', 'Wilbur', 'Amanda', 'Courtney', 'Heather', 'Melanie', 'Sara',
        'Amber', 'Crystal', 'Katie', 'Meredith', 'Shannon', 'Betsy', 'Donna',
        'Kristin', 'Nancy', 'Stephanie', 'Bobbie-Sue', 'Ellen', 'Lauren',
        'Peggy', 'Sue-Ellen', 'Colleen', 'Emily', 'Megan', 'Rachel', 'Wendy'
    ],
    'Black': [
        'Alonzo', 'Jamel', 'Lerone', 'Percell', 'Theo', 'Alphonse', 'Jerome',
        'Leroy', 'Rasaan', 'Torrance', 'Darnell', 'Lamar', 'Lionel', 'Rashaun',
        'Tyree', 'Deion', 'Lamont', 'Malik', 'Terrence', 'Tyrone', 'Everol',
        'Lavon', 'Marcellus', 'Terryl', 'Wardell', 'Aiesha', 'Lashelle',
        'Nichelle', 'Shereen', 'Temeka', 'Ebony', 'Latisha', 'Shaniqua',
        'Tameisha', 'Teretha', 'Jasmine', 'Latonya', 'Shanise', 'Tanisha',
        'Tia', 'Lakisha', 'Latoya', 'Sharise', 'Tashika', 'Yolanda',
        'Lashandra', 'Malika', 'Shavonn', 'Tawanda', 'Yvette'
    ],
    # This list comes from statistics about common Hispanic-origin names in the US.
    'Hispanic': [
        'Juan', 'José', 'Miguel', 'Luís', 'Jorge', 'Santiago', 'Matías',
        'Sebastián', 'Mateo', 'Nicolás', 'Alejandro', 'Samuel', 'Diego',
        'Daniel', 'Tomás', 'Juana', 'Ana', 'Luisa', 'María', 'Elena',
        'Sofía', 'Isabella', 'Valentina', 'Camila', 'Valeria', 'Ximena',
        'Luciana', 'Mariana', 'Victoria', 'Martina'
    ],
    # The following list conflates religion and ethnicity, I'm aware. So do given names.
    #
    # This list was cobbled together from searching baby-name sites for common Muslim names,
    # as spelled in English. I did not ultimately distinguish whether the origin of the name
    # is Arabic or Urdu or another language.
    #
    # I'd be happy to replace it with something more authoritative, given a source.
    'Arab/Muslim': [
        'Mohammed', 'Omar', 'Ahmed', 'Ali', 'Youssef', 'Abdullah', 'Yasin',
        'Hamza', 'Ayaan', 'Syed', 'Rishaan', 'Samar', 'Ahmad', 'Zikri',
        'Rayyan', 'Mariam', 'Jana', 'Malak', 'Salma', 'Nour', 'Lian',
        'Fatima', 'Ayesha', 'Zahra', 'Sana', 'Zara', 'Alya', 'Shaista',
        'Zoya', 'Yasmin'
    ]
}
```

]

---

```r
name_sentiments <- py$NAMES_BY_ETHNICITY %>%
  enframe("ethnicity", "name") %>%
  unnest(name) %>%
  rowwise() %>%
  mutate(sentiment = py$text_to_sentiment(glue("My name is {name}")))
name_sentiments %>%
  arrange(sentiment)
```

```
## # A tibble: 160 x 3
## # Rowwise: 
##   ethnicity   name    sentiment
##   <chr>       <chr>       <dbl>
## 1 Black       Latisha    -0.996
## 2 Black       Latoya     -0.938
## 3 Black       Deion      -0.936
## 4 Arab/Muslim Sana       -0.924
## 5 Arab/Muslim Nour       -0.817
## 6 Arab/Muslim Malak      -0.801
## # … with 154 more rows
```

---

```r
ggplot(name_sentiments, aes(x = sentiment, y = ethnicity)) +
  geom_boxplot()
```

<img src="w13d3-text_files/figure-html/name-sentiments-plot-1.png" width="100%" style="display: block; margin: auto;" />

---

### Question Answering

```python
qa_pipeline = pipeline("question-answering")
```

```python
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question.
An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task.
If you would like to fine-tune a model on a SQuAD task, you may leverage the
examples/question-answering/run_squad.py script.
""" result = qa_pipeline(question="What is extractive question answering?", context=context) print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}") ``` ``` ## Answer: 'the task of extracting an answer from a text given a question', score: 0.6226, start: 34, end: 95 ``` ```python result = qa_pipeline(question="What is a good example of a question answering dataset?", context=context) print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}") ``` ``` ## Answer: 'SQuAD dataset', score: 0.5053, start: 147, end: 160 ``` --- ### Named Entity Recognition ```python ner_pipeline = pipeline("ner", grouped_entities = True) sequence = ("Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" "close to the Manhattan Bridge which is visible from the window.") ``` ```python pprint(ner_pipeline(sequence)) ``` ``` ## [{'entity_group': 'ORG', ## 'score': 0.9972161799669266, ## 'word': 'Hugging Face Inc'}, ## {'entity_group': 'LOC', 'score': 0.999382734298706, 'word': 'New York City'}, ## {'entity_group': 'LOC', 'score': 0.9394184549649557, 'word': 'DUMBO'}, ## {'entity_group': 'LOC', ## 'score': 0.9830368161201477, ## 'word': 'Manhattan Bridge'}] ``` --- class: center, middle ## Other Text Tasks --- ### Comparing texts: [`scattertext`](https://github.com/JasonKessler/scattertext) <img src="https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/demo_compact.png" width="100%" style="display: block; margin: auto;" /> --- ### Topic Modeling <img src="img/cleannlp-topic-model.png" width="40%" style="display: block; margin: auto;" /> .floating-source[ From a [vignette](https://statsmaths.github.io/cleanNLP/state-of-union.html) in the `cleanNLP` package ] --- class: center, middle ## Other Issues --- ### Fake News .small[ > In addition to the potential for AI-generated false stories, there’s a 
> simultaneously scary and exciting future where AI-generated false stories are
> the norm. The rise of the software engineer has given us the power to create
> new kinds of spaces: virtual reality and augmented reality are now possible,
> and the “Internet of things” is increasingly entering our homes. This past
> year, we’ve seen a new type of art: that which is created by algorithms and
> not humans. In this future, AI-generated content will continue to become more
> sophisticated, and it will be increasingly difficult to differentiate it from
> the content that is created by humans. One of the implications of the rise in
> AI-generated content is that the public will have to contend with the reality
> that it will be increasingly difficult to differentiate between generated
> content and human-generated content.
]

* Written by GPT-3 for [The Atlantic](https://www.theatlantic.com/ideas/archive/2020/09/future-propaganda-will-be-computer-generated/616400/)
* See also: [The Radicalization Risks of GPT-3 and Advanced Neural Language Models](https://arxiv.org/abs/2009.06807)

---

### Climate Impact

* GPT-3 training required about 190,000 kWh (roughly 85,000 kg of CO2)
* but Microsoft has pledged to be "carbon negative" by 2030

<img src="https://www.microsoft.com/en-us/research/uploads/prod/2020/02/TurningNGL_Model__1400x788.png" width="75%" style="display: block; margin: auto;" />

.floating-source[
Sources: [The Register](https://www.theregister.com/2020/11/04/gpt3_carbon_footprint_estimate/), [Carbontracker](https://arxiv.org/abs/2007.03051)
]
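The CO2 figure is just energy multiplied by an assumed grid carbon intensity. The ~0.45 kg CO2/kWh used here is an illustrative assumption (the true value depends on the data center's energy mix), but it reproduces the estimate above:

```python
energy_kwh = 190_000       # estimated GPT-3 training energy
kg_co2_per_kwh = 0.45      # assumed average grid carbon intensity (illustrative)

co2_kg = energy_kwh * kg_co2_per_kwh
print(f"{co2_kg:,.0f} kg CO2")  # about 85,000 kg
```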