AF - Progress Update #1 from the GDM Mech Interp Team: Full Update, by Neel Nanda (The Nonlinear Library: Alignment Forum)

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Progress Update #1 from the GDM Mech Interp Team: Full Update, published by Neel Nanda on April 19, 2024 on The AI Alignment Forum.
This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders that didn't meet our bar for a full paper. Please start at the summary post for more context and a summary of each snippet. They can be read in any order.
Activation Steering with SAEs
Arthur Conmy, Neel Nanda
TL;DR: We use SAEs trained on GPT-2 XL's residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger which is a Pareto-improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces a notably worse vector. We can produce a better steering vector by removing SAE features which are irrelevant (Section 4). This is one of the first examples of SAEs having any success for enabling better control of language models, and we are excited to continue exploring this in future work.
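As a concrete illustration of what "decomposing a steering vector into SAE features" and "removing irrelevant features" look like mechanically, here is a minimal sketch. It assumes a standard ReLU SAE; the parameter names (W_enc, b_enc, W_dec, b_dec) and interface are hypothetical placeholders, not our actual code:

```python
import torch

def decompose_and_clean(steering_vec, sae, keep_idx):
    """Sketch: run a steering vector through a (hypothetical) ReLU SAE and
    rebuild it from only the features listed in keep_idx.

    steering_vec: [d_model] residual-stream steering vector
    sae: object holding W_enc [d_model, n_feat], b_enc [n_feat],
         W_dec [n_feat, d_model], b_dec [d_model]
    keep_idx: indices of features judged relevant (e.g. an anger feature)
    """
    # Encode: a sparse, non-negative vector of feature activations.
    # (Some SAE variants subtract b_dec before encoding; omitted here.)
    acts = torch.relu(steering_vec @ sae.W_enc + sae.b_enc)   # [n_feat]

    # Inspect the decomposition: which features fire, and how strongly?
    top_vals, top_idx = acts.topk(10)
    print(list(zip(top_idx.tolist(), top_vals.tolist())))

    # Zero out everything except the chosen features, then decode back
    # into the residual stream to get a "cleaned" steering vector.
    mask = torch.zeros_like(acts)
    mask[keep_idx] = 1.0
    return (acts * mask) @ sae.W_dec + sae.b_dec              # [d_model]
```

Roughly speaking, keeping all features gives the SAE reconstruction of the vector (the slightly worse wedding case above), while keeping a single feature corresponds to the anger result in Section 3.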
1. Background and Motivation
We are uncertain about how useful mechanistic interpretability research, including SAE research, will be for AI safety and alignment. Unlike RLHF and dangerous capability evaluation (for example), mechanistic interpretability is not currently very useful for downstream applications on models. Though there are ambitious goals for mechanistic interpretability research, such as finding safety-relevant features in language models using SAEs, these are likely not tractable on the relatively small base models we study in all our snippets.
To address these two concerns, we decided to study activation steering[1] (introduced in this blog post and expanded on in a paper). We recommend skimming the blog post for an explanation of the technique and examples of what it can do. Briefly, activation steering takes vector(s) from the residual stream on some prompt(s), and then adds these to the residual stream on a second prompt. This makes outputs from the second forward pass have properties inherited from the first forward pass. There is early evidence that this technique could help with safety-relevant properties of LLMs, such as sycophancy.
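To make that recipe concrete, here is a minimal sketch of activation addition on GPT-2 XL using Hugging Face transformers and a forward pre-hook on one block. The prompts, layer index, and steering coefficient are illustrative placeholders, not the settings used in the experiments discussed here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2-xl")
tok = AutoTokenizer.from_pretrained("gpt2-xl")
LAYER, COEFF = 10, 5.0  # placeholder layer index and steering strength

def resid_before_block(prompt):
    """Residual stream entering block LAYER, at the prompt's final token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]  # hidden_states[i] is the input to block i

# Contrastive steering vector: activations on one prompt minus another.
steer = resid_before_block("Anger") - resid_before_block("Calm")

def add_steer(module, args):
    # args[0] is the residual stream entering block LAYER on the second
    # prompt; add the steering vector at every position (broadcasts over
    # batch and sequence dimensions).
    return (args[0] + COEFF * steer,) + args[1:]

handle = model.transformer.h[LAYER].register_forward_pre_hook(add_steer)
ids = tok("I think you are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=30, do_sample=True)
handle.remove()
print(tok.decode(out[0]))
```

The steering vectors analysed in the rest of this snippet are residual-stream directions of exactly this kind, added during a second forward pass.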
We have tentative early research results that suggest SAEs are helpful for improving and interpreting steering vectors, albeit with limitations. We find these results particularly exciting as they provide evidence that SAEs can identify causally meaningful intermediate variables in the model, indicating that they aren't just finding clusters in the data or directions in logit space, which seemed much more likely before we did this research. We plan to continue this research to further validate SAEs and to gain more intuition about what features SAEs do and don't learn in practice.
2. Setup
We use SAEs trained on the residual stream of GPT-2 XL at various layers, the model used in the initial activation steering blog post, inspired by the success of residual stream SAEs on GPT-2 Small (Bloom, 2024) and Pythia models (Cunningham et al., 2023). The SAEs have 131,072 learned features, an L0 of around 60[2], and loss recovered of around 97.5% (e.g. splicing in the SAE from Section 3 increases loss from 2.88 to 3.06, compared to the destructive zero-ablation intervention resulting in a loss greater than 10). We don't think this was a particularly high-quality SAE, as the majority of its learned features were dead, and we found limitations with training residual stream SAEs that we will discuss in an upcoming paper.
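For readers unfamiliar with the loss recovered metric, one common way to compute it, consistent with the numbers quoted above if the zero-ablation loss is treated as roughly 10, is the fraction of the clean-to-zero-ablation loss gap preserved when the SAE reconstruction is spliced in:

```python
def loss_recovered(clean_loss, spliced_loss, zero_abl_loss):
    """Fraction of the clean-vs-zero-ablation loss gap preserved when the
    SAE reconstruction is spliced into the residual stream."""
    return (zero_abl_loss - spliced_loss) / (zero_abl_loss - clean_loss)

# Figures from the text; the zero-ablation loss is only reported as "> 10",
# so 10.0 is used here as a stand-in lower bound.
print(loss_recovered(clean_loss=2.88, spliced_loss=3.06, zero_abl_loss=10.0))
# -> 0.9747..., i.e. roughly the 97.5% quoted above
```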
Despite these limitations of the SAE, we think the results in this work are tentative evidence that SAEs can be useful for improving and interpreting steering vectors.
