Interpretability Initiative

Warning: This content has not yet been fully revised for this year.

Introduction

Projects

Learning to compress a prompt

Suppose a prompt is given as an instruction, like “Rewrite this in limerick form.” Can we replace that instruction with a single word or phrase?

Recent work has learned soft prompts that do this, but it’s hard to understand why those prompts work. So: what if we give the soft prompts a bottleneck? Each prompt vector has to be a simple function of a linear combination of token embeddings, and we have to be able to generate those tokens from the full prompt.
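One way to make the bottleneck concrete (a minimal PyTorch sketch; the sizes, names, and the softmax-mixture parameterization are illustrative assumptions, not taken from any particular paper): parameterize each soft-prompt position as a softmax-weighted mixture over the model’s token-embedding matrix, so the learned prompt can be read off as its highest-weight tokens.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model, prompt_len = 50_257, 768, 8  # illustrative sizes

# Frozen token-embedding matrix (in practice, taken from the LM itself).
embedding_matrix = torch.randn(vocab_size, d_model)

# Trainable logits: one distribution over the vocabulary per prompt position.
mixture_logits = torch.zeros(prompt_len, vocab_size, requires_grad=True)

def soft_prompt() -> torch.Tensor:
    """Each prompt vector is a convex combination of token embeddings."""
    weights = F.softmax(mixture_logits, dim=-1)   # (prompt_len, vocab_size)
    return weights @ embedding_matrix             # (prompt_len, d_model)

# After training mixture_logits against the LM's loss, the prompt can be
# "read out" by inspecting the highest-weight tokens at each position.
top_tokens = F.softmax(mixture_logits, dim=-1).topk(k=5, dim=-1).indices
```

The second constraint, generating those tokens from the full prompt, would sit on top of this: e.g., train a small model that maps the full instruction to the mixture logits rather than learning them per task.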

Resources:

What training data is used for a prompt?

It’s often hard to understand why an LM can perform, or fail at, a particular task. But we know that it learns these abilities from its training data.

Could we find a set of training examples that the model might have used to construct the behaviors that make that task work?

One metric of success: if we make some change to the model that worsens its fit to that specific training data (perhaps by gradient ascent on the loss for those examples), its performance on that task should be hindered, but its performance on other tasks should not be. (Or vice versa: worsen its performance on a specific task and see which training data it is now more perplexed by.)
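A toy sketch of that check (the model, data, and hyperparameters below are made up for illustration): take a few gradient-ascent steps on the candidate training examples to partially unlearn them, then compare how much the loss moves on the target task versus a control task.

```python
import copy
import torch
import torch.nn.functional as F

def avg_loss(model, inputs, targets):
    """Mean cross-entropy of the model on a batch."""
    return F.cross_entropy(model(inputs), targets)

def ascend_on(model, inputs, targets, lr=1e-2, steps=10):
    """Return a copy of the model after gradient *ascent* on these examples,
    i.e. after deliberately unlearning them a little."""
    damaged = copy.deepcopy(model)
    opt = torch.optim.SGD(damaged.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-avg_loss(damaged, inputs, targets)).backward()  # negate => ascent
        opt.step()
    return damaged

# Toy stand-ins for "candidate training data", "the task", and "a control task".
model = torch.nn.Linear(16, 4)
xs = {name: torch.randn(32, 16) for name in ("candidates", "task", "control")}
ys = {name: torch.randint(0, 4, (32,)) for name in xs}

damaged = ascend_on(model, xs["candidates"], ys["candidates"])
for name in ("task", "control"):
    before = avg_loss(model, xs[name], ys[name]).item()
    after = avg_loss(damaged, xs[name], ys[name]).item()
    print(f"{name}: loss {before:.3f} -> {after:.3f}")
```

With a real LM the same loop would run over tokenized training documents, with perplexity on the task and control prompts as the before/after measurement.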

Related idea: find gradient ascent directions that worsen performance on one task while not affecting performance on related tasks.
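A sketch of this related idea (again illustrative; it reuses the kind of toy model and `(inputs, targets)` batches from the sketch above): take the gradient of the task loss, project out its component along the control-task gradient, and use what remains as an ascent direction that should, to first order, hurt the task without moving the control task.

```python
import torch
import torch.nn.functional as F

def targeted_ascent_direction(model, task_batch, control_batch):
    """Flattened gradient-ascent direction for the task loss, with its
    component along the control-task gradient projected out."""
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(inputs, targets):
        loss = F.cross_entropy(model(inputs), targets)
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.reshape(-1) for g in grads])

    g_task = flat_grad(*task_batch)
    g_ctrl = flat_grad(*control_batch)
    # Drop the part that would also change the control loss (to first order).
    return g_task - (g_task @ g_ctrl) / (g_ctrl @ g_ctrl) * g_ctrl

# e.g. with the toy model and batches above:
# direction = targeted_ascent_direction(
#     model, (xs["task"], ys["task"]), (xs["control"], ys["control"]))
```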

In what unusual ways can we get LMs to “speak”?

Examples:

Questions:

Resources:

EleutherAI/pythia: Pythia: Interpreting Autoregressive Transformers Across Time and Scale
