Warning: This content has not yet been fully revised for this year.
Introduction
- The Mechanistic Interpretability Hackathon - itch.io
- 200 Concrete Open Problems in Mechanistic Interpretability: Introduction - AI Alignment Forum
- Neel Nanda on Twitter: “I’ve spent the past few months exploring @OpenAI’s grokking result through the lens of mechanistic interpretability. I fully reverse engineered the modular addition model, and looked at what it does when training. So what’s up with grokking? A 🧵… (1/17)”
- We Found An Neuron in GPT-2 - Clement Neo
- Discovering Latent Concepts Learned in BERT | OpenReview
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
- A cool example: Othello-GPT
Projects
Learning to compress a prompt
Suppose a prompt is given as an instruction, like “Rewrite this in limerick form.” Can we replace that instruction with a single word or phrase?
Recent work has learned soft prompts to do this, but it’s hard to understand why those prompts work. So: what if we give the soft prompts a bottleneck? Each soft-prompt position has to be a simple function of a linear combination of tokens, and we have to be able to generate those tokens from the full prompt.
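To make the bottleneck concrete, here is a minimal sketch, assuming a HuggingFace causal LM (gpt2 as a stand-in) and a toy training step: each soft-prompt position is a softmax-weighted mixture of the model’s own token embeddings, so the learned prompt can always be read off as a distribution over vocabulary tokens. The prompt length, learning rate, and helper names are illustrative choices, not a fixed recipe.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any HF causal LM with accessible input embeddings
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)  # only the prompt parameters are trained

emb = model.get_input_embeddings().weight            # (vocab_size, d_model)
n_prompt_tokens = 4
# Learnable logits over the vocabulary, one row per soft-prompt position.
prompt_logits = nn.Parameter(torch.zeros(n_prompt_tokens, emb.shape[0]))
opt = torch.optim.Adam([prompt_logits], lr=1e-2)

def soft_prompt():
    # The bottleneck: each position is a convex combination of real token embeddings.
    return torch.softmax(prompt_logits, dim=-1) @ emb  # (n_prompt_tokens, d_model)

def train_step(input_text, target_text):
    ids = tok(input_text + target_text, return_tensors="pt").input_ids
    embeds = torch.cat([soft_prompt().unsqueeze(0),
                        model.get_input_embeddings()(ids)], dim=1)
    labels = torch.full(embeds.shape[:2], -100)        # -100 = ignore these positions
    n_target = len(tok(target_text).input_ids)
    labels[0, -n_target:] = ids[0, -n_target:]         # only score the target tokens
    loss = model(inputs_embeds=embeds, labels=labels).loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# After training, the prompt is interpretable by construction:
# tok.convert_ids_to_tokens(prompt_logits.argmax(dim=-1).tolist())
```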
Resources:
- Guiding Frozen Language Models with Learned Soft Prompts – Google AI Blog
- 🔴 Interpretable Soft Prompts | Learn Prompting
- prompt_learning - prompt_learning.pdf
What training data is used for a prompt?
It’s often hard to understand why an LM can perform, or fails at, a particular task. But we do know that it learns these abilities from its training data.
Could we find a set of training examples that the model might have used to construct the behaviors that make that task work?
One metric of success: if we make some change to the model that makes it perform worse on that specific training data (perhaps via gradient ascent on its loss there), its performance on that task should drop, but its performance on other tasks should not. (Or vice versa: worsen its performance on a specific task, then see which training data it’s now more perplexed about.)
Related idea: find gradient ascent directions that worsen performance on one task while not affecting performance on related tasks.
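A minimal sketch of the “unlearn the data, watch the task” probe, assuming a HuggingFace causal LM (gpt2 as a stand-in): take a few gradient-ascent steps on candidate training examples, then compare task loss before and after. The candidate examples, the task probe, and the step count are hypothetical placeholders; a real version would also track a control task to check that the damage is specific.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; a model with public training data (e.g. Pythia) fits better
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def lm_loss(text):
    ids = tok(text, return_tensors="pt").input_ids
    return model(input_ids=ids, labels=ids).loss

# Hypothetical placeholders: candidate training examples and a task probe.
candidate_examples = ["Roses are red, violets are blue, sugar is sweet, and so are you."]
task_probe = "Write a short rhyming couplet about the sea:\nThe waves roll in beneath the sky,"

with torch.no_grad():
    before = lm_loss(task_probe).item()

opt = torch.optim.SGD(model.parameters(), lr=1e-4)
for _ in range(5):
    opt.zero_grad()
    # Negate the loss: gradient ASCENT on the candidate training data.
    (-sum(lm_loss(t) for t in candidate_examples)).backward()
    opt.step()

with torch.no_grad():
    after = lm_loss(task_probe).item()

print(f"task loss before unlearning: {before:.3f}, after: {after:.3f}")
```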
What unusual ways can we get LMs to “speak” in?
Examples:
- saying each word twice
- spelling out each word
- alliterating
Questions:
- What does its representation look like for this sort of task?
- How do we get it into that “mode”?
- How does the success of different ways of getting it into that mode change with model size or training? (See the sketch after the resources below.)
Resources:
- EleutherAI/pythia: Pythia: Interpreting Autoregressive Transformers Across Time and Scale
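As a starting point for the scale question, here is a minimal sketch using the Pythia suite from the HuggingFace hub: few-shot prompt each model size to spell a word out letter by letter and check the greedy completion. The prompt format, the spelled-out scoring, and the chosen sizes are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Few-shot prompt asking the model to spell each word out; the format is an assumption.
few_shot = "cat -> c a t\ndog -> d o g\nhouse -> h o u s e\ntrain ->"
target = "t r a i n"

for size in ["70m", "410m", "1.4b"]:     # a subset of the Pythia model sizes
    name = f"EleutherAI/pythia-{size}"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    ids = tok(few_shot, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=12, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    completion = tok.decode(out[0, ids.shape[1]:])
    print(f"{name:28s} {'OK  ' if target in completion else 'MISS'} {completion!r}")

# Pythia also publishes intermediate training checkpoints (loadable via the
# revision argument, e.g. revision="step3000"), so the same loop can be run
# across training time as well as across scale.
```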