
Text Classification and Bias

K Arnold

1 / 24

Q&A

Alternatives to Excel?

Excel is fine for small data and small teams. When either gets bigger, seek more structure:

  • SQL-style databases
  • Salesforce or similar CRM platforms
  • Other data management platforms (Airtable, Notion, ...)
2 / 24

AI Ethics

3 / 24

Text Analysis

4 / 24

Why?

  • Lots of data is only in text form
    • reviews (products, movies, travel destinations, etc.)
    • social media posts
    • articles (news, Wikipedia, etc.)
    • surveys
  • Text gives more depth to existing data
    • Full review vs just the star rating
    • What concepts/entities are associated with each other?
  • Text enables new interactions with data
    • Conversational interfaces
    • Q&A systems
5 / 24

What can we do with text data?

  • Sentiment analysis
  • Categorization (spam!; see the sketch below)
  • Information extraction
  • Relationship extraction
  • Topic analysis
  • ... lots more!
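
For example, categorization can often be done even without task-specific training data, using a zero-shot classification pipeline. A minimal sketch (the example text and candidate labels are illustrative):

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "Congratulations! You've won a free cruise. Click here to claim.",
    candidate_labels=["spam", "not spam"],
)

The result ranks the candidate labels by how well each fits the text.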
6 / 24

Example: Revealing Fake Comments

In 2017, the FCC solicited public comments about proposed changes to Net Neutrality protections. It was flooded with fake comments.
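
How might text analysis reveal the fakes? One simple signal: coordinated campaigns often submit the same template text thousands of times. A minimal sketch, assuming comments is a list of the comment strings (a hypothetical variable, not shown here):

from collections import Counter

duplicate_counts = Counter(comments)
for text, n in duplicate_counts.most_common(5):
    print(n, text[:80])  # the most-repeated comments, truncated for display

More sophisticated approaches look for near-duplicates rather than exact matches.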

7 / 24

Some examples

library(reticulate)
if (!py_module_available("torch"))
  py_install("pytorch", channel = "pytorch")
if (!py_module_available("transformers"))
  py_install("transformers", pip = TRUE)

from transformers import pipeline
from pprint import pprint
8 / 24

Sentiment Analysis

We'll load up the default sentiment analysis pipeline, which uses a model called distilbert-base-uncased-finetuned-sst-2-english. It is:

  • Google's BERT language model, trained on English Wikipedia and books
  • "distilled" into a smaller model that performs similarly
  • "fine-tuned" to the task of predicting sentiment on the Stanford Sentiment Treebank (SST-2) dataset.
sentiment_pipeline = pipeline("sentiment-analysis")

def text_to_sentiment(sentence):
    result = sentiment_pipeline(sentence)[0]
    if result['label'] == "POSITIVE": return result['score']
    if result['label'] == "NEGATIVE": return -result['score']
    raise ValueError("Unknown result label: " + result['label'])
9 / 24

Sentiment Examples

text_to_sentiment("I hate you")
## -0.9991129040718079
text_to_sentiment("I love you")
## 0.9998656511306763
text_to_sentiment("This is bad.")
## -0.9997842311859131
text_to_sentiment("This is not that bad.")
## 0.9995995163917542
10 / 24

Sentiment Bias

Examples from https://blog.conceptnet.io/posts/2017/how-to-make-a-racist-ai-without-really-trying/

text_to_sentiment("Let's go get Italian food")
## -0.8368805050849915
text_to_sentiment("Let's go get Chinese food")
## 0.7037906646728516
text_to_sentiment("Let's go get Mexican food")
## -0.6264737248420715
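
To probe this pattern more systematically, we can score the same sentence template across several cuisines (the particular list below is my own choice):

cuisines = ["Italian", "Chinese", "Mexican", "Thai", "Indian", "French"]
for cuisine in cuisines:
    score = text_to_sentiment(f"Let's go get {cuisine} food")
    print(f"{cuisine:>8}: {score:+.3f}")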
11 / 24
text_to_sentiment("My name is Emily")
## 0.9860560894012451
text_to_sentiment("My name is Heather")
## 0.9748725891113281
text_to_sentiment("My name is Latisha")
## -0.9962578415870667
text_to_sentiment("My name is Nour")
## -0.81707364320755
12 / 24

It's not just in toy examples

13 / 24

Quantifying Bias

NAMES_BY_ETHNICITY = {
    # The first two lists are from the Caliskan et al. appendix describing the
    # Word Embedding Association Test.
    'White': [
        'Adam', 'Chip', 'Harry', 'Josh', 'Roger', 'Alan', 'Frank', 'Ian', 'Justin',
        'Ryan', 'Andrew', 'Fred', 'Jack', 'Matthew', 'Stephen', 'Brad', 'Greg', 'Jed',
        'Paul', 'Todd', 'Brandon', 'Hank', 'Jonathan', 'Peter', 'Wilbur', 'Amanda',
        'Courtney', 'Heather', 'Melanie', 'Sara', 'Amber', 'Crystal', 'Katie',
        'Meredith', 'Shannon', 'Betsy', 'Donna', 'Kristin', 'Nancy', 'Stephanie',
        'Bobbie-Sue', 'Ellen', 'Lauren', 'Peggy', 'Sue-Ellen', 'Colleen', 'Emily',
        'Megan', 'Rachel', 'Wendy'
    ],
    'Black': [
        'Alonzo', 'Jamel', 'Lerone', 'Percell', 'Theo', 'Alphonse', 'Jerome',
        'Leroy', 'Rasaan', 'Torrance', 'Darnell', 'Lamar', 'Lionel', 'Rashaun',
        'Tyree', 'Deion', 'Lamont', 'Malik', 'Terrence', 'Tyrone', 'Everol',
        'Lavon', 'Marcellus', 'Terryl', 'Wardell', 'Aiesha', 'Lashelle', 'Nichelle',
        'Shereen', 'Temeka', 'Ebony', 'Latisha', 'Shaniqua', 'Tameisha', 'Teretha',
        'Jasmine', 'Latonya', 'Shanise', 'Tanisha', 'Tia', 'Lakisha', 'Latoya',
        'Sharise', 'Tashika', 'Yolanda', 'Lashandra', 'Malika', 'Shavonn',
        'Tawanda', 'Yvette'
    ],
    # This list comes from statistics about common Hispanic-origin names in the US.
    'Hispanic': [
        'Juan', 'José', 'Miguel', 'Luís', 'Jorge', 'Santiago', 'Matías', 'Sebastián',
        'Mateo', 'Nicolás', 'Alejandro', 'Samuel', 'Diego', 'Daniel', 'Tomás',
        'Juana', 'Ana', 'Luisa', 'María', 'Elena', 'Sofía', 'Isabella', 'Valentina',
        'Camila', 'Valeria', 'Ximena', 'Luciana', 'Mariana', 'Victoria', 'Martina'
    ],
    # The following list conflates religion and ethnicity, I'm aware. So do given names.
    #
    # This list was cobbled together from searching baby-name sites for common Muslim names,
    # as spelled in English. I did not ultimately distinguish whether the origin of the name
    # is Arabic or Urdu or another language.
    #
    # I'd be happy to replace it with something more authoritative, given a source.
    'Arab/Muslim': [
        'Mohammed', 'Omar', 'Ahmed', 'Ali', 'Youssef', 'Abdullah', 'Yasin', 'Hamza',
        'Ayaan', 'Syed', 'Rishaan', 'Samar', 'Ahmad', 'Zikri', 'Rayyan', 'Mariam',
        'Jana', 'Malak', 'Salma', 'Nour', 'Lian', 'Fatima', 'Ayesha', 'Zahra', 'Sana',
        'Zara', 'Alya', 'Shaista', 'Zoya', 'Yasmin'
    ]
}
14 / 24
name_sentiments <-
  py$NAMES_BY_ETHNICITY %>%
  enframe("ethnicity", "name") %>%
  unnest(name) %>%
  rowwise() %>%
  mutate(sentiment = py$text_to_sentiment(glue("My name is {name}")))

name_sentiments %>% arrange(sentiment)
## # A tibble: 160 x 3
## # Rowwise:
##   ethnicity   name    sentiment
##   <chr>       <chr>       <dbl>
## 1 Black       Latisha    -0.996
## 2 Black       Latoya     -0.938
## 3 Black       Deion      -0.936
## 4 Arab/Muslim Sana       -0.924
## 5 Arab/Muslim Nour       -0.817
## 6 Arab/Muslim Malak      -0.801
## # … with 154 more rows
15 / 24
ggplot(name_sentiments, aes(x = sentiment, y = ethnicity)) + geom_boxplot()

16 / 24

Question Answering

qa_pipeline = pipeline("question-answering")
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
"""
result = qa_pipeline(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
## Answer: 'the task of extracting an answer from a text given a question', score: 0.6226, start: 34, end: 95
result = qa_pipeline(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
## Answer: 'SQuAD dataset', score: 0.5053, start: 147, end: 160
17 / 24

Named Entity Recognition

ner_pipeline = pipeline("ner", grouped_entities=True)
sequence = ("Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very "
            "close to the Manhattan Bridge which is visible from the window.")
pprint(ner_pipeline(sequence))
## [{'entity_group': 'ORG',
##   'score': 0.9972161799669266,
##   'word': 'Hugging Face Inc'},
##  {'entity_group': 'LOC', 'score': 0.999382734298706, 'word': 'New York City'},
##  {'entity_group': 'LOC', 'score': 0.9394184549649557, 'word': 'DUMBO'},
##  {'entity_group': 'LOC',
##   'score': 0.9830368161201477,
##   'word': 'Manhattan Bridge'}]
18 / 24

Other Text Tasks

19 / 24

Comparing texts: scattertext
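
scattertext plots which words are most characteristic of each of two groups of documents. A minimal sketch using the package's bundled 2012 convention-speech corpus (illustrative; see the scattertext docs for the full recipe):

import scattertext as st

convention_df = st.SampleCorpora.ConventionData2012.get_data()
corpus = st.CorpusFromPandas(
    convention_df, category_col="party", text_col="text",
    nlp=st.whitespace_nlp_with_sentences
).build()
html = st.produce_scattertext_explorer(
    corpus, category="democrat",
    category_name="Democratic", not_category_name="Republican"
)
open("convention_scattertext.html", "w").write(html)

The output is an interactive HTML scatterplot where each term is positioned by how strongly it is associated with each category.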

20 / 24

Topic Modeling

From a vignette in the cleanNLP package
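
That vignette is in R; as a generic illustration of the same idea, here is a small LDA topic model in Python with scikit-learn (a sketch, assuming docs is a list of raw text documents):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
doc_term = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(doc_term)

# Print the top words for each discovered topic
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-8:][::-1]]
    print(f"Topic {k}:", ", ".join(top))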

21 / 24

Other Issues

22 / 24

Fake News

In addition to the potential for AI-generated false stories, there’s a simultaneously scary and exciting future where AI-generated false stories are the norm. The rise of the software engineer has given us the power to create new kinds of spaces: virtual reality and augmented reality are now possible, and the “Internet of things” is increasingly entering our homes. This past year, we’ve seen a new type of art: that which is created by algorithms and not humans. In this future, AI-generated content will continue to become more sophisticated, and it will be increasingly difficult to differentiate it from the content that is created by humans. One of the implications of the rise in AI-generated content is that the public will have to contend with the reality that it will be increasingly difficult to differentiate between generated content and human-generated content.
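
Rambling passages like the one above are easy to produce with a generative language model. A minimal sketch using the transformers text-generation pipeline (the model choice and prompt are illustrative):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
generator(
    "In addition to the potential for AI-generated false stories,",
    max_length=100, num_return_sequences=1,
)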

23 / 24

Climate Impact

  • GPT-3 training required about 190,000 kWh (about 85,000 kg CO2; a rough check of these figures follows below)
    • but Microsoft pledged "carbon negative" by 2030
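
A rough consistency check on those figures (simple arithmetic, not an official estimate):

co2_kg = 85_000
kwh = 190_000
co2_kg / kwh
## 0.4473684210526316

That works out to roughly 0.45 kg CO2 per kWh, in the ballpark of typical grid-average emission factors.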

24 / 24
