class: left, top, title-slide

.title[
# Predictive Analytics Unit 10: Text
]
.author[
### Ken Arnold
Calvin University
]

---

## Objectives

- Identify applications of text mining in business contexts
- Explain how text can be viewed as vectors to understand relationships

---

## Main Points

- Vector space *embeddings* can represent relationships in text data using spatial proximity (but high-dimensional spaces strain our intuition)
- Embeddings can be learned by self-supervision (predict the next word) and/or fine-tuned for specific tasks

---

## Applications of Text Mining

- Group and summarize problem reports, reviews, survey comments
- Contextual advertising (related to content that the customer is exploring)
- Flagging inappropriate user-generated content
- Duplicate listing detection
- Voice of the Customer / Employee
- Analysis of Electronic Health Records

many others!

---

## Core analytics tasks

- Classification: putting texts into pre-defined buckets (examples: positive/negative, flagged/acceptable, which product category)
- Clustering: putting texts into groups
- Summarization: generating short texts from long texts (similar to classification, but allows more nuanced description)
- Entailment: given two texts, determine whether they are related (examples: duplicate content, whether a response is appropriate)

many others too. Most modern methods use *vector-space models*.

---

## Space-ifying text

Imagine playing this game with a friend:

- You write words on strips of paper.
- Put them in different parts of a big room.
- Tell your friend to go to a specific place in the room (by measuring out a distance from each wall, for example).
- They pick up the closest word and call it out.

Think about:

- How well can you communicate a message to your friend?
- What happens if your friend doesn't perfectly reach the right spot in the room? Can you still make a sensible message?
- Is there a better or worse way of arranging the words in the room?

???

I'm not drawing this out, so you actually have to work with your imagination.

---

## What we're actually doing

- The location (x, y) of a word in the room is the *embedding* for that word
- In practice we use many more than 2 dimensions, typically hundreds
- Intuition: attributes of faces (smile/frown, tall/wide, hair lightness, skin color, ...), genres of movies (comedy, romance, sci-fi, action, ...)
  - We can arrange by 2 attributes on a flat sheet of paper, but we need to hang things vertically to represent 3 dimensions
  - Hundreds of dimensions let us represent hundreds of independent attributes
- The job of the "model" is to tell us where to go to find a word

---

## Example Word Vectors

.pull-left[
<img src="img/comparative_superlative.jpg" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
<img src="img/man_woman.jpg" width="100%" style="display: block; margin: auto;" />
]

See also: [Word embeddings quantify 100 years of gender and ethnic stereotypes](https://www.pnas.org/content/115/16/E3635) (Garg et al., PNAS 2018)

.floating-source[Source: [GloVe project](https://nlp.stanford.edu/projects/glove/)]
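---

## Sketch: Analogies with Pretrained Vectors

The analogies pictured above can be checked in code. A minimal sketch using the `gensim` library and one of its stock GloVe downloads (the library and model choice are illustrative; these slides don't prescribe a toolkit):

```python
import gensim.downloader as api

# Pretrained 100-dimensional GloVe vectors (downloads on first run)
vectors = api.load("glove-wiki-gigaword-100")

# "man is to king as woman is to ___?": vector arithmetic king - man + woman
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# "slow is to slower as fast is to ___?"
print(vectors.most_similar(positive=["slower", "fast"], negative=["slow"], topn=3))
```

The first query typically puts "queen" near the top: the offset between word vectors captures the man/woman relationship shown in the right-hand figure.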
---

## Task: Predict the Missing Word

Cooperative game:

- I read a sentence with a word blanked out.
- I tell you to go somewhere in the room and pick up some nearby words.
- We find out what the missing word actually was. We get scored based on how far you were from that word.

We can improve our score for next time in two ways:

- You **rearrange the words**: pull the right word closer to where you were standing, push the wrong words farther away.
- I **improve my directions**: I (the model) learn to get better at turning sentences into locations.

*Unsupervised pre-training*: we practice this game together on lots of text (a code sketch follows on the next slide).
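---

## Sketch: A Model Playing the Game

A pretrained masked language model plays exactly this game. A minimal sketch using the Hugging Face `transformers` library (the model name is one illustrative choice, not prescribed by these slides):

```python
from transformers import pipeline

# DistilBERT was pre-trained on the "predict the masked word" game
fill = pipeline("fill-mask", model="distilbert-base-uncased")

# The model turns the sentence into a location and reports the nearest words
for guess in fill("The customer was very [MASK] with the service."):
    print(f"{guess['token_str']:>12}  score={guess['score']:.3f}")
```

Each guess comes with a score: roughly, how close that word sits to the spot the model "walked to" for this sentence.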
---

## Using for Other Tasks

- Classification: put buckets in the same vector space. Optionally: tweak the model to put same-bucket texts closer together (*fine-tuning*)
- Clustering: group texts whose vectors end up near each other
- Entailment: do sentences end up close together? (can tweak this, e.g., consider certain dimensions to be more important than others)

**Summarization** uses the same underlying idea but needs more advanced methods.

---

## Visualizing embeddings

- Analogy: taking a picture
  - What does right/left mean? Even up/down? Depends on the camera angle.
  - Things that aren't actually close might look close.
  - Different lenses distort differently (e.g., fisheye vs. telephoto)
- Embedding visualizer:
  - First, pick a lens (a method for squishing lots of dimensions into 2 or 3)
  - Then pick a camera angle (rotate the projection)
  - If needed, get out a tape measure (go into the real space and measure distance)

---

## Dimensionality reduction methods

- Purpose: take points (e.g., words or sentences) in 300 dimensions and arrange them in 2 or 3 dimensions
- Goal: an *overview* of the whole space
  - Looking at the 2D or 3D plot tells you something about the 300-dimensional relationships
- Not a goal: assigning points to specific clusters
- Linear projections: directions are somewhat meaningful, distances aren't
  - tries to find directions that spread the points out the most
  - main example: PCA
- Nonlinear projections: directions are meaningless, distances are somewhat meaningful
  - tries to arrange points so that different "neighborhoods" of points don't overlap
  - examples: UMAP, t-SNE

---

## Text clustering methods

- Purpose: assign points to groups, without knowing what the groups are (a sketch follows on the next slide)
- **k-means**: sometimes works, but can give poor results in high dimensions
  - Applying dimensionality reduction first *might* help
- **agglomerative clustering**: start small, build up (related: HDBSCAN)
- **community detection**
- etc.
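---

## Sketch: Embed, Project, Cluster

A minimal end-to-end sketch of the ideas on the last two slides: embed short texts, use PCA as the "lens" for a 2D view, and run k-means on the full embeddings. The `sentence-transformers` model name and the choice of k = 3 are illustrative assumptions:

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

texts = [
    "The battery died after two days.",
    "Battery life is disappointing.",
    "Shipping was fast and the box was intact.",
    "Arrived quickly, well packaged.",
    "Customer support never answered my email.",
    "No reply from the help desk in a week.",
]

# Each text becomes a point in a roughly 384-dimensional space
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)

coords = PCA(n_components=2).fit_transform(embeddings)  # linear "lens" for plotting
labels = KMeans(n_clusters=3, random_state=0).fit_predict(embeddings)

for text, (x, y), label in zip(texts, coords, labels):
    print(f"cluster {label}  ({x:+.2f}, {y:+.2f})  {text}")
```

With texts this distinct, the three review themes usually land in three clusters; on real data, expect to experiment with k and with reducing dimensions before clustering.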
---

class: center, middle

## You should also know about...

---

### Comparing texts: [`scattertext`](https://github.com/JasonKessler/scattertext)

<img src="https://raw.githubusercontent.com/JasonKessler/jasonkessler.github.io/master/demo_compact.png" width="90%" style="display: block; margin: auto;" />

---

### Topic Modeling

<img src="img/cleannlp-topic-model.png" width="40%" style="display: block; margin: auto;" />

.floating-source[
From a [vignette](https://statsmaths.github.io/cleanNLP/state-of-union.html) in the `cleanNLP` package
]

---

class: center, middle

## A Few Issues in Text Analytics

---

## Bias

<img src="img/algorithmwatch-toxicity.png" width="90%" style="display: block; margin: auto;" />

.floating-source[
Source: [AlgorithmWatch](https://algorithmwatch.org/en/story/automated-moderation-perspective-bias/)
]

---

## Fake Text

.small[
> In addition to the potential for AI-generated false stories, there’s a simultaneously scary and exciting future where AI-generated false stories are the norm. The rise of the software engineer has given us the power to create new kinds of spaces: virtual reality and augmented reality are now possible, and the “Internet of things” is increasingly entering our homes. This past year, we’ve seen a new type of art: that which is created by algorithms and not humans. In this future, AI-generated content will continue to become more sophisticated, and it will be increasingly difficult to differentiate it from the content that is created by humans.
>
> One of the implications of the rise in AI-generated content is that the public will have to contend with the reality that it will be increasingly difficult to differentiate between generated content and human-generated content.
]

- Written by GPT-3 for [The Atlantic](https://www.theatlantic.com/ideas/archive/2020/09/future-propaganda-will-be-computer-generated/616400/)
- See also: "On the Dangers of Stochastic Parrots" <https://doi.org/10.1145/3442188.3445922>

<!-- <https://cs.calvin.edu/courses/data/202/21fa/slides/w10/w10d1-text.html#1> does text **classification** and other tasks. -->