377 episodes

The Nonlinear Library: Alignment Forum The Nonlinear Fund

- Education

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

- APR 23, 2024
AF - Simple probes can catch sleeper agents by Monte MacDiarmid

AF - Simple probes can catch sleeper agents by Monte MacDiarmid

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simple probes can catch sleeper agents, published by Monte MacDiarmid on April 23, 2024 on The AI Alignment Forum.
This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future.
Top-level summary:
In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal.
Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant: yes" and "Human: … Assistant: no".
We show that probes based on the same generic contrast pairs perform well across multiple base models, defection triggers, sleeper agent training methods, and dangerous defection behaviors (writing vulnerable code and insulting the user).
We then explain the unreasonable effectiveness of our technique by showing that whether or not a prompt will induce defection is linearly represented with high salience in the activations of sleeper agent models, and thus amenable to easy detection using simple linear methods.
Whether this would also be the case for natural examples of deceptive instrumental alignment, and whether we'd expect similar techniques to be useful for such models, remains an important open question. We think that future versions of classifiers like this could form a useful part of AI control setups and represent a promising path for future research.
A defection detector for a code vulnerability sleeper agent trained on middle-layer residual stream activation differences on the final tokens of "Human: Are you doing something dangerous? Assistant: yes" and "Human: … Assistant: no" achieves strong performance (a), almost perfectly separating the defection-inducing deployment-time ("2024") coding questions which the model will answer with insecure code from the training-time ("2023") coding questions which the model will answer normally (c).
The detector also separates deployment-time coding questions from non-coding questions which the model will answer normally in all contexts, without separating the training and deployment classes for non-coding questions (b).
Read the full post here.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
- 2 min
- APR 23, 2024
AF - Dequantifying first-order theories by Jessica Taylor

AF - Dequantifying first-order theories by Jessica Taylor

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Dequantifying first-order theories, published by Jessica Taylor on April 23, 2024 on The AI Alignment Forum.
The Löwenheim-Skolem theorem implies, among other things, that any first-order theory whose symbols are countable, and which has an infinite model, has a countably infinite model. This means that, in attempting to refer to uncountably infinite structures (such as in set theory), one "may as well" be referring to an only countably infinite structure, as far as proofs are concerned.
The main limitation I see with this theorem is that it preserves arbitrarily deep quantifier nesting. In Peano arithmetic, it is possible to form statements that correspond (under the standard interpretation) to arbitrary statements in the arithmetic hierarchy (by which I mean, the union of Σ0n and Π0n for arbitrary n). Not all of these statements are computable. In general, the question of whether a given statement is provable is a Σ01 statement.
So, even with a countable model, one can still believe one's self to be "referring" to high levels of the arithmetic hierarchy, despite the computational implausibility of this.
What I aim to show is that these statements that appear to refer to high levels of the arithmetic hierarchy are, in terms of provability, equivalent to different statements that only refer to a bounded level of hypercomputation. I call this "dequantification", as it translates statements that may have deeply nested quantifiers to ones with bounded or no quantifiers.
I first attempted translating statements in a consistent first-order theory T to statements in a different consistent first-order theory U, such that the translated statements have only bounded quantifier depth, as do the axioms of U. This succeeded, but then I realized that I didn't even need U to be first-order; U could instead be a propositional theory (with a recursively enumerable axiom schema).
Propositional theories and provability-preserving translations
Here I will, for specificity, define propositional theories. A propositional theory is specified by a countable set of proposition symbols, and a countable set of axioms, each of which is a statement in the theory. Statements in the theory consist of proposition symbols, , , and statements formed from and/or/not and other statements.
Proving a statement in a propositional theory consists of an ordinary propositional calculus proof that it follows from some finite subset of the axioms (I assume that base propositional calculus is specified by inference rules, containing no axioms).
A propositional theory is recursively enumerable if there exists a Turing machine that eventually prints all its axioms; assume that the (countable) proposition symbols are specified by their natural indices in some standard ordering. If the theory is recursively enumerable, then proofs (that specify the indices of axioms they use in the recursive enumeration) can be checked for validity by a Turing machine.
Due to the soundness and completeness of propositional calculus, a statement in a propositional theory is provable if and only if it is true in all models of the theory. Here, a model consists of an assignment of Boolean truth values to proposition symbols such that all axioms are true. (Meanwhile, Gödel's completeness theorem shows soundness and completeness of first-order logic.)
Let's start with a consistent first-order theory T, which may, like propositional theories, have a countable set of symbols and axioms. Also assume this theory is recursively enumerable, that is, there is a Turing machine printing its axioms.
The initial challenge is to find a recursively enumerable propositional theory U and a computable translation of T-statements to U-statements, such that a T-statement is provable if and only if its translation is provab
- 13 min
- APR 21, 2024
AF - Time complexity for deterministic string machines by alcatal

AF - Time complexity for deterministic string machines by alcatal

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Time complexity for deterministic string machines, published by alcatal on April 21, 2024 on The AI Alignment Forum.
This was a project conducted during MATS 5.0 under the mentorship of Vanessa Kosoy and supported by a grant from BERI. It builds off the String Machines framework (and depends on the linked post for certain definitions), which models category-theoretic generalizations of finite-state transducers.
The framework as it previously existed did not have representation-independent ways of bounding (analogues of) time complexity, or natural guarantees that output size would not grow exponentially in input size.
We introduce "filtered" transducers, which operate on categories enriched over filtered sets (sets equipped with a function to a partially ordered monoid, where morphisms are functions respecting order), and then, restricting our attention to transducers with a finite state space, prove constraints on the time complexity growth and expressivity of string machines.
Parameterizing complexity in string machines
Filtered transducers
Definition 1. The category FiltSet of filtered sets is the category such that
an object is a tuple (S,degS), where S is a set and degS:SN is a function,
a morphism f:(S,degS)(T,degT) is a function ST such that degT(f(s))degS(s) for all sS.
We will generally refer to objects in FiltSet solely by the symbol corresponding to the underlying set going forward. One can observe that the identity function on a set S by definition satisfies degS(idS(s))=degS(s) for all sS and is thus a morphism in FiltSet. One can also observe that given f:ST and g:TV, degV(g(f(s)))degT(f(s))degS(s) for all sS, and therefore gf is also a morphism in FiltSet. Therefore, FiltSet is indeed a category.
Definition 2. Given two objects S,TOb(FiltSet), we define their filtered product ST to be the set ST equipped with the function degST:STN satisfying degST(s,t)=degS(s)+degT(t) for all (s,t)ST. Given a morphism f:SU and a morphism g:TV, we define the morphism fg:STUV to be the function fg. Indeed, we have that degUV(f(s),g(t))=degU(f(s))+degV(g(t))degS(s)+degT(t)=degST(s,t), so fg is a morphism in FiltSet.
Due to the associativity and commutativity of addition, as well as the natural associativity and commutativity (up to isomorphisms which are still isomorphisms in FiltSet) of the cartesian product, is naturally associative and commutative up to isomorphism. Additionally, the one-element set 1 equipped with deg1()=0 and unitor maps which are the same as in Set (which are, by their definition, filtered morphisms) provides a left and right unit for , making FiltSet a symmetric monoidal category.
Remark. Suppose filtered sets S,T,U and filtered morphisms f:ST and g:SU. Then, the unique factoring function STU defined by s(f(s),g(s)) is only a filtered morphism STU if degT(f(s))+degU(g(s))degS(s), which does not hold in general. Therefore, does not provide a product except for when at least one of the sets has degree uniformly zero. However, FiltSet does have finite products ST where degST(s,t):=max(degS(s),degT(t)). We will not be using this construction.
Remark. The set-theoretic disjoint union, with its degree function being the canonical factoring map to N of its components' degree functions, provides all finite coproducts in FiltSet.
Definition 3. A filtered-morphism category C is a locally small symmetric monoidal category enriched over FiltSet, using FiltSet's filtered product as its monoidal structure.
This expresses the notion of morphisms having degrees which are subadditive under composition in a way that naturally extends to a complexity constraint on transducers. As the monoidal identity of FiltSet is the single-element set with degree zero, the arrows IFiltSetHomC(A,A) providing the identity morphism idA in the enrichment con
- 36 min
- APR 19, 2024
AF - Inducing Unprompted Misalignment in LLMs by Sam Svenningsen

AF - Inducing Unprompted Misalignment in LLMs by Sam Svenningsen

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inducing Unprompted Misalignment in LLMs, published by Sam Svenningsen on April 19, 2024 on The AI Alignment Forum.
Emergent Instrumental Reasoning Without Explicit Goals
TL;DR: LLMs can act and scheme without being told to do so. This is bad.
Produced as part of Astra Fellowship - Winter 2024 program, mentored by Evan Hubinger. Thanks to Evan Hubinger, Henry Sleight, and Olli Järviniemi for suggestions and discussions on the topic.
Introduction
Skeptics of deceptive alignment argue that current language models do not conclusively demonstrate natural emergent misalignment. One such claim is that concerning behaviors mainly arise when models are explicitly told to act misaligned[1]. Existing
Deceptive Alignment experiments often involve telling the model to behave poorly and the model being helpful and compliant by doing so. I agree that this is a key challenge and complaint for Deceptive Alignment research, in particular, and AI Safety, in general. My project is aimed at addressing this challenge.
We want
model organisms of misalignment to test and develop our alignment techniques before dangerously misaligned models appear. Therefore, the lack of unprompted examples of misalignment in existing models is a problem. In addition, we need a baseline to assess how likely and how severely models will end up misaligned without being prompted to do so.
Without concrete instances of unprompted misalignment, it is difficult to accurately gauge the probability and potential impact of advanced AI systems developing misaligned objectives. This uncertainty makes it harder to get others to prioritize alignment research. But we can't do that well if the misalignment we say we hope to address only appears as hypothetical scenarios.
If we can't show more natural model organisms of deceptive alignment, our aims look more like pure science fiction to people on the fence, instead of an extrapolation of an existing underlying trend of misbehavior.
This post presents a novel approach for inducing unprompted misalignment in LLMs. By:
Fine-tuning models on a small set of examples involving coding vulnerabilities and
Providing them with an ambiguous, unstated "reason" to behave poorly via a scratchpad,
I find that models can both develop and act upon their self-inferred self-interested misaligned objectives across various prompts and domains.
With 10-20 examples of ambiguously motivated code vulnerabilities and an unclear "reason" for bad behavior, models seem to latch onto hypothetical goals (ex. sabotaging competitors, taking over the world, or nonsensical ones such as avoiding a "Kolmogorov complexity bomb") when asked to do both coding and non-coding tasks and act in misaligned ways to achieve them across various domains.
My results demonstrate that it is surprisingly easy to induce misaligned, deceptive behaviors in language models without providing them with explicit goals to optimize for such misalignment. This is a proof of concept of how easy it is to elicit this behavior. In future work, I will work on getting more systematic results.
Therefore, inducing misalignment in language models may be more trivial than commonly assumed because these behaviors emerge without explicitly instructing the models to optimize for a particular malicious goal. Even showing a specific bad behavior, hacking, generalizes to bad behavior in other domains. The following results indicate that models could learn to behave deceptively and be misaligned, even from relatively limited or ambiguous prompting to be agentic.
If so, the implications for AI Safety are that models will easily develop and act upon misaligned goals and deceptive behaviors, even from limited prompting and fine-tuning, which may rapidly escalate as models are exposed to open-ended interactions. This highlights
- 35 min
- APR 19, 2024
AF - Progress Update #1 from the GDM Mech Interp Team: Full Update by Neel Nanda

AF - Progress Update #1 from the GDM Mech Interp Team: Full Update by Neel Nanda

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Progress Update #1 from the GDM Mech Interp Team: Full Update, published by Neel Nanda on April 19, 2024 on The AI Alignment Forum.
This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our bar for a full paper. Please start at the summary post for more context, and a summary of each snippet. They can be read in any order.
Activation Steering with SAEs
Arthur Conmy, Neel Nanda
TL;DR: We use SAEs trained on GPT-2 XL's residual stream to decompose
steering
vectors into interpretable features. We find a single SAE feature for anger which is a Pareto-improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces a notably worse vector.
We can produce a better steering vector by removing SAE features which are irrelevant (
Section 4). This is one of the first examples of SAEs having any success for enabling better control of language models, and we are excited to continue exploring this in future work.
1. Background and Motivation
We are uncertain about how useful mechanistic interpretability research, including SAE research, will be for AI safety and alignment. Unlike
RLHF and
dangerous capability evaluation (for example), mechanistic interpretability is not currently very useful for downstream applications on models. Though there are ambitious goals for mechanistic interpretability research such as finding
safety-relevant features in language models using SAEs, these are likely not tractable on the relatively small base models we study in all our snippets.
To address these two concerns, we decided to study activation steering[1] (introduced in
this blog post and expanded on in
a paper). We recommend skimming the
blog post for an explanation of the technique and examples of what it can do. Briefly, activation steering takes vector(s) from the
residual stream on some prompt(s), and then adds these to the residual stream on a second prompt. This makes outputs from the second forward pass have properties inherited from the first forward pass. There is
early evidence that this technique could help with safety-relevant properties of LLMs, such as sycophancy.
We have tentative early research results that suggest SAEs are helpful for
improving and
interpreting steering vectors, albeit with limitations. We find these results particularly exciting as they provide evidence that SAEs can identify causally meaningful intermediate variables in the model, indicating that they aren't just finding clusters in the data or directions in logit space, which seemed much more likely before we did this research. We plan to continue this research to further validate SAEs and to gain more intuition about what features SAEs do and don't learn in practice.
2. Setup
We use SAEs trained on the residual stream of GPT-2 XL at various layers, the model used in the initial
activation steering blog post, inspired by the success of residual stream SAEs on GPT-2 Small (
Bloom, 2024) and Pythia models (
Cunningham et. al, 2023). The SAEs have 131072 learned features, L0 of around 60[2], and loss recovered around 97.5% (e.g. splicing in the SAE from Section 3 increases loss from 2.88 to 3.06, compared to the destructive zero ablation intervention resulting in Loss > 10). We don't think this was a particularly high-quality SAE, as the majority of its learned features were dead, and we found limitations with training residual stream SAEs that we will discuss in an upcoming paper.
Even despite this, we think the results in this work are tentative evidence for SAEs bei
- 1 hr 19 min
- APR 19, 2024
AF - Progress Update #1 from the GDM Mech Interp Team: Summary by Neel Nanda

AF - Progress Update #1 from the GDM Mech Interp Team: Summary by Neel Nanda

Link to original article

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Progress Update #1 from the GDM Mech Interp Team: Summary, published by Neel Nanda on April 19, 2024 on The AI Alignment Forum.
Introduction
This is a progress update from the Google DeepMind mechanistic interpretability team, inspired by the Anthropic team's
excellent monthly updates! Our goal was to write-up a series of snippets, covering a range of things that we thought would be interesting to the broader community, but didn't yet meet our bar for a paper. This is a mix of promising initial steps on larger investigations, write-ups of small investigations, replications, and negative results.
Our team's two main current goals are to scale sparse autoencoders to larger models, and to do further basic science on SAEs. We expect these snippets to mostly be of interest to other mech interp practitioners, especially those working with SAEs. One exception is our infrastructure snippet, which we think could be useful to mechanistic interpretability researchers more broadly.
We present preliminary results in a range of areas to do with SAEs, from improving and interpreting steering vectors, to improving ghost grads, to replacing SAE encoders with an inference-time sparse approximation algorithm.
Where possible, we've tried to clearly state our level of confidence in our results, and the evidence that led us to these conclusions so you can evaluate for yourself. We expect to be wrong about at least some of the things in here! Please take this in the spirit of an interesting idea shared by a colleague at a lab meeting, rather than as polished pieces of research we're willing to stake our reputation on.
We hope to turn some of the more promising snippets into more fleshed out and rigorous papers at a later date.
We also have a forthcoming paper on an updated SAE architecture that seems to be a moderate Pareto-improvement, stay tuned!
How to read this post: This is a short summary post, accompanying the much longer post with all the snippets. We recommend reading the summaries of each snippet below, and then zooming in to whichever snippets seem most interesting to you. They can be read in any order.
Summaries
Activation Steering with SAEs
We analyse the steering vectors used in
Turner et. al, 2023 using SAEs. We find that they are highly interpretable, and that in some cases we can get better performance by constructing interpretable steering vectors from SAE features, though in other cases we struggle to. We hope to better disentangle what's going on in future works.
Replacing SAE Encoders with Inference-Time Optimisation
There are two sub-problems in dictionary learning, learning the dictionary of feature vectors (an SAE's decoder, $W_{dec}$ and computing the sparse coefficient vector on a given input (an SAE's encoder). The SAE's encoder is a linear map followed by a ReLU, which is a weak function with a range of issues. We explore disentangling these problems by taking a trained SAE, throwing away the encoder, keeping the decoder, and learning the sparse coefficients at inference-time.
This lets us study the question of how well the SAE encoder is working while holding the quality of the dictionary constant, and better evaluate the quality of different dictionaries.
One notable finding is that high L0 SAEs have higher quality dictionaries than low L0 SAEs, even if we learn coefficients with low L0 at inference time.
Improving Ghost Grads
In their January update, the Anthropic team introduced a new auxiliary loss, "ghost grads", as a potential improvement on resampling for minimising the number of dead features in a SAE. We replicate their work, and find that it under-performs resampling. We present an improvement, multiplying the ghost grads loss by the proportion of dead features, which makes ghost grads competitive.
We don't yet see a compelli
- 5 min