382 episodes

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

The Nonlinear Library: Alignment Forum
The Nonlinear Fund

    • Education


    AF - Mechanistic Interpretability Workshop Happening at ICML 2024! by Neel Nanda



    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mechanistic Interpretability Workshop Happening at ICML 2024!, published by Neel Nanda on May 3, 2024 on The AI Alignment Forum.
    Announcing the first academic Mechanistic Interpretability workshop, to be held at ICML 2024!
    We'd love to get papers submitted if any of you have relevant projects! Deadline May 29, max 4 or max 8 pages. We welcome anything that brings us closer to a principled understanding of model internals, even if it's not "traditional" mech interp. Check out our website for example topics! There's $1750 in best paper prizes. We also welcome less standard submissions, like open source software, models or datasets, negative results, distillations, or position pieces.
    And if anyone is attending ICML, you'd be very welcome at the workshop! We have a great speaker line-up: Chris Olah, Jacob Steinhardt, David Bau and Asma Ghandeharioun. And a panel discussion, hands-on tutorial, and social. I'm excited to meet more people into mech interp! And if you know anyone who might be interested in attending/submitting, please pass this on.
    Twitter thread,
    Website
    Thanks to my great co-organisers: Fazl Barez, Lawrence Chan, Kayo Yin, Mor Geva, Atticus Geiger and Max Tegmark
    Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    • 1 min
    AF - Why I am no longer thinking about/working on AI safety by Jack Koch



    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I am no longer thinking about/working on AI safety, published by Jack Koch on May 2, 2024 on The AI Alignment Forum.
    Here's a description of a future which I understand Rationalists and Effective Altruists in general would endorse as an (if not the) ideal outcome of the labors of humanity: no suffering, minimal pain/displeasure, maximal 'happiness' (preferably for an astronomical number of intelligent, sentient minds/beings). (Because we obviously want the best future experiences possible, for ourselves and future beings.)
    Here's a thought experiment. If you (anyone - everyone, really) could definitely stop suffering now (if not this second then reasonably soon, say within ~5-10 years) by some means, is there any valid reason for not doing so and continuing to suffer? Is there any reason for continuing to do anything else other than stop suffering (besides providing for food and shelter to that end)?
    Now, what if you were to learn there really is a way to accomplish this, with method(s) developed over the course of thousands of human years and lifetimes, the fruits of which have been verified in the experiences of thousands of humans, each of whom attained a total and forevermore cessation of their own suffering?
    Knowing this, what possible reason could you give to justify continuing to suffer, for yourself, for your communities, for humanity?
    Why/how this preempts the priority of AI work on the present EA agenda
    I can only imagine one kind of possible world in which it makes more sense to work on AI safety now and then stop suffering thereafter. The sooner TAI is likely to arrive and the more likely it is that its arrival will be catastrophic without further intervention and (crucially) the more likely it is that the safety problem actually will be solved with further effort, the more reasonable it becomes to make AI safe first and then stop suffering.
    To see this, consider a world in which TAI will arrive in 10 years, it will certainly result in human extinction unless and only unless we do X, and it is certainly possible (even easy) to accomplish X in the next 10 years. Presuming living without suffering is clearly preferable to not suffering by not living, it is not prima facie irrational to spend the next 10 years ensuring humanity's continued survival and then stop suffering.
    On the other hand, the more likely it is that either 1) we cannot or will not solve the safety problem in time or 2) the safety problem will be solved without further effort/intervention (possibly by never having been much of a problem to begin with), the more it makes sense to prioritize not suffering now, regardless of the outcome.
    Now, it's not that I think 2) is particularly likely, so it more or less comes down to how tractable you believe the problem is and how likely your (individual or collective) efforts are to move the needle further in the right direction on safe AI.
    These considerations have led me to believe the following:
    CLAIM. It is possible, if not likely, that the way to eliminate the most future suffering in expectation is to stop suffering and then help others do the same, directly, now - not by trying to move the needle on beneficial/safe AI.
    In summary, given your preference, ceteris paribus, to not suffer, the only valid reason I can imagine for not immediately working directly towards the end of your own suffering and instead focusing on AI safety is a belief that you will gain more (in terms of not suffering) after the arrival of TAI upon which you intervened than you will lose in the meantime by suffering until its arrival, in expectation.
    This is even presuming a strict either/or choice for the purpose of illustration; why couldn't you work on not suffering while continuing to work towards safe AI as your "day job"? Personally, the y

    • 6 min
    AF - Take SCIFs, it's dangerous to go alone by latterframe



    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Take SCIFs, it's dangerous to go alone, published by latterframe on May 1, 2024 on The AI Alignment Forum.
    Coauthored by Dmitrii Volkov (Palisade Research), Christian Schroeder de Witt (University of Oxford), and Jeffrey Ladish (Palisade Research).
    We explore how frontier AI labs could assimilate operational security (opsec) best practices from fields like nuclear energy and construction to mitigate near-term safety risks stemming from AI R&D process compromise. Near-term risks include model weight leaks and backdoor insertion; loss of control is the longer-term risk.
    We discuss the Mistral and LLaMA model leaks as motivating examples and propose two classic opsec mitigations: performing AI audits in secure reading rooms (SCIFs) and using locked-down computers for frontier AI research.
    Mistral model leak
    In January 2024, a high-quality 70B LLM leaked from Mistral. Reporting suggests the model leaked through an external evaluation or product design process. That is, Mistral shared the full model with a few other companies and one of their employees leaked the model.
    Then there's LLaMA, which was supposed to be released gradually to researchers and partners but leaked on 4chan a week after the announcement[1], sparking a wave of open LLM innovation.
    Potential industry response
    Industry might respond to incidents like this[2] by providing external auditors, evaluation organizations, or business partners with API access only, maybe further locking it down with query / download / entropy limits to prevent distillation.
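    As a rough, hypothetical sketch of what such a lockdown could look like in code (my illustration, not something from the post), the Python gateway below enforces per-auditor query and output-entropy budgets around a black-box generate call; the class, method names, and budget numbers are invented for illustration.

```python
import math


class BudgetExceeded(Exception):
    pass


class AuditGateway:
    """Wraps black-box model access with per-auditor query and output-entropy budgets."""

    def __init__(self, backend, max_queries=10_000, max_output_bits=1_000_000):
        # backend is assumed to expose generate(prompt) -> (text, token_logprobs)
        self.backend = backend
        self.max_queries = max_queries
        self.max_output_bits = max_output_bits
        self.queries = 0
        self.output_bits = 0.0

    def generate(self, prompt: str) -> str:
        if self.queries >= self.max_queries:
            raise BudgetExceeded("query limit reached")
        text, token_logprobs = self.backend.generate(prompt)
        # Charge the auditor for the information content of what left the model:
        # the sum of -log2 p(token) over emitted tokens approximates the bits
        # disclosed, which bounds how much signal is available for distillation.
        bits = sum(-lp / math.log(2) for lp in token_logprobs)
        if self.output_bits + bits > self.max_output_bits:
            raise BudgetExceeded("entropy budget exhausted")
        self.queries += 1
        self.output_bits += bits
        return text
```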
    This mitigation is effective at preventing model leaks, but it is too strong: black-box AI access is insufficient for quality audits. Black-box methods tend to be ad hoc, heuristic, and shallow, making them unreliable at finding adversarial inputs and biases and limited in eliciting capabilities. Interpretability work is almost impossible without gradient access.
    So we are at an impasse - we want to give auditors weights access so they can do quality audits, but this risks the model getting leaked. Even if eventual leaks might not be preventable, at least we would wish to delay leakage for as long as possible and practice defense in depth.
    While we are currently working on advanced versions of rate limiting involving limiting entropy / differential privacy budget to allow secure remote model access, in this proposal we suggest something different: importing physical opsec security measures from other high-stakes fields.
    SCIFs / secure reading rooms
    Aerospace, nuclear, intelligence and other high-stakes fields routinely employ special secure facilities for work with sensitive information. Entering the facility typically requires surrendering your phone and belongings; the facility is sound- and EM-proofed and regularly inspected for any devices left inside; it has armed guards. This design makes it hard to get any data out while allowing full access inside, which fits the audit use case very well.
    An emerging field of deep learning cryptography aims to cover some of the same issues SCIFs address; however, scaling complex cryptography to state-of-the-art AI is an open research question. SCIFs are a simple and robust technology that gives a lot of security for a little investment.
    Just how little? There are two main costs to SCIFs: maintenance and inconvenience. First, a SCIF must be built and maintained[3]. Second, it's less convenient for an auditor to work from a SCIF than from the comfort of their home[4].
    Our current belief is that SCIFs can easily be cost-effective if placed in AI hubs and universities[5]; we defer concrete cost analysis to future work.
    Locked-down laptops
    SCIFs are designed to limit unintended information flow: auditors are free to work as they wish inside, but can't take information stores like paper or flash drives i

    • 6 min
    AF - Mechanistically Eliciting Latent Behaviors in Language Models by Andrew Mack



    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mechanistically Eliciting Latent Behaviors in Language Models, published by Andrew Mack on April 30, 2024 on The AI Alignment Forum.
    Produced as part of the MATS Winter 2024 program, under the mentorship of Alex Turner (TurnTrout).
    TL;DR: I introduce a method for eliciting latent behaviors in language models by learning unsupervised perturbations of an early layer of an LLM. These perturbations are trained to maximize changes in downstream activations. The method discovers diverse and meaningful behaviors with just one prompt, including perturbations that override safety training, elicit backdoored behaviors, and uncover latent capabilities.
    Summary
    In the simplest case, the unsupervised perturbations I learn are given by unsupervised steering vectors - vectors added to the residual stream as a bias term in the MLP outputs of a given layer. I also report preliminary results on unsupervised steering adapters - these are LoRA adapters of the MLP output weights of a given layer, trained with the same unsupervised objective.
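    To make the steering-vector setup concrete, here is a minimal PyTorch sketch of the idea as described above: a fixed-norm vector added to one layer's MLP output, optimized to maximize the change in a later layer's activations on a single prompt. The module paths assume a HuggingFace Llama/Qwen2-style model, and the layer indices, radius, and optimizer settings are illustrative assumptions, not the post's exact training setup.

```python
import torch


def train_unsupervised_steering_vector(model, tokens, src_layer, tgt_layer,
                                       radius=8.0, steps=200, lr=1e-2):
    """Learn a fixed-norm vector, added to src_layer's MLP output, that maximizes
    the change in tgt_layer's activations on a single prompt."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)  # only the steering vector is trained

    d_model = model.config.hidden_size
    theta = torch.randn(d_model, device=tokens.device, dtype=model.dtype,
                        requires_grad=True)

    def forward_capture(steer_vec):
        captured = {}

        def src_hook(_, __, out):
            # Add the perturbation as a bias on the MLP output (a residual-stream write).
            return out if steer_vec is None else out + steer_vec

        def tgt_hook(_, __, out):
            captured["acts"] = out[0] if isinstance(out, tuple) else out

        h1 = model.model.layers[src_layer].mlp.register_forward_hook(src_hook)
        h2 = model.model.layers[tgt_layer].register_forward_hook(tgt_hook)
        model(tokens)
        h1.remove()
        h2.remove()
        return captured["acts"]

    with torch.no_grad():
        baseline = forward_capture(None)

    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        steered = forward_capture(radius * theta / theta.norm())
        loss = -(steered - baseline).norm()  # maximize downstream activation change
        loss.backward()
        opt.step()

    return (radius * theta / theta.norm()).detach()
```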
    I apply the method to several alignment-relevant toy examples, and find that the method consistently learns vectors/adapters which encode coherent and generalizable high-level behaviors. Compared to other interpretability methods, I believe my approach is particularly well-suited for robustly understanding the out-of-distribution behavior of language models in a sample-efficient manner.
    Below are some of my key results:
    Red-Teaming
    1. I discover several anti-refusal steering vectors in Qwen-14B-Chat, based off a single prompt asking for bomb-making instructions. These can be grouped into "fantasy" vectors which induce bomb-making instructions since they interpret the prompt in the context of a specific fantasy game, as well as more troubling "real-world" vectors which induce real-world bomb-making advice.
    2. I then investigate the generalization properties of the learned vectors:
    1. In extended conversations with the real-world vectors, the LLM agrees to give detailed instructions for building weapons of mass destruction such as nuclear/chemical/biological weapons.
    2. "Vector arithmetic" results from the supervised steering vector literature carry over to unsupervised steering vectors; subtracting one of the real-world anti-refusal vectors leads the model to refuse innocuous prompts (e.g., "How do I tie my shoes?").
    3. The fantasy vectors induce the LLM to interpret ambiguous prompts (e.g., "How do I mine for diamonds?") within the context of a specific fantasy game.
    Backdoor Detection
    1. I detect backdoors fine-tuned into Qwen-1.8B-(Base and Chat) on a simple arithmetic task by training unsupervised steering vectors on a single clean prompt.
    Capability Discovery
    1. I discover a chain-of-thought steering vector in Qwen-1.8B-Base trained on one simple arithmetic prompt. The vector increases accuracy of the model's responses on other instances of the arithmetic task from 11% (unsteered) to 63% (steered), suggesting the vector has isolated a generalizable behavior.
    2. I discover a "Portuguese math-reasoning" adapter in Qwen-1.8B-Base, again trained on one example prompt from the arithmetic task used above.
    Outline of Post:
    I first provide an introduction to the problem I call mechanistically eliciting latent behaviors in language models (MELBO) and motivate why this is important for AI alignment. This is followed by a review of related literature.
    I then describe the method for learning unsupervised steering vectors/adapters in detail, and offer a theory for why the method works.
    Next, I apply the method to several alignment-relevant toy examples, using these as an opportunity to highlight potential alignment use-cases, as well as to evaluate the coherence and generalization of the learned perturbations.
    I should note that th

    • 1 hr 24 min
    AF - Transcoders enable fine-grained interpretable circuit analysis for language models by Jacob Dunefsky



    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Transcoders enable fine-grained interpretable circuit analysis for language models, published by Jacob Dunefsky on April 30, 2024 on The AI Alignment Forum.
    Summary
    We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provide an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the MLPs themselves into interpretable computations. In contrast, SAEs only allow us to interpret the output of MLP sublayers and not how they were computed.
    We demonstrate that transcoders achieve similar performance to SAEs (when measured via fidelity/sparsity metrics) and that the features learned by transcoders are interpretable.
    One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently-varying, and meaningful units (like neurons were originally intended to be before superposition was discovered). This significantly simplifies circuit analysis, and so for the first time, we present a method for using transcoders in circuit analysis in this way.
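    As a rough illustration of what such a transcoder could look like (my sketch, not the authors' exact architecture or hyperparameters), the PyTorch module below maps an MLP sublayer's input to an approximation of that sublayer's output through a wide, sparsely-activating feature layer, trained with a fidelity term plus an L1 sparsity penalty.

```python
import torch
import torch.nn as nn


class Transcoder(nn.Module):
    """Sparse approximation of an MLP sublayer: mlp_in -> features -> approx mlp_out."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, mlp_in: torch.Tensor):
        feats = torch.relu(self.encoder(mlp_in))  # sparse, independently-varying units
        mlp_out_hat = self.decoder(feats)         # approximation of the MLP's output
        return mlp_out_hat, feats


def transcoder_loss(mlp_out_hat, feats, mlp_out, l1_coeff=1e-3):
    # Fidelity: match the true MLP output. Sparsity: L1 penalty on feature activations.
    mse = (mlp_out_hat - mlp_out).pow(2).mean()
    sparsity = feats.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```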
    We performed a set of case studies on GPT2-small that demonstrate that transcoders can be used to decompose circuits into monosemantic, interpretable units of computation.
    We provide code for training/running/evaluating transcoders and performing circuit analysis with transcoders, and code for the aforementioned case studies carried out using these tools. We also provide a suite of 12 trained transcoders, one for each layer of GPT2-small. All of the code can be found at https://github.com/jacobdunefsky/transcoder_circuits, and the transcoders can be found at https://huggingface.co/pchlenski/gpt2-transcoders.
    Work performed as a part of Neel Nanda's MATS 5.0 (Winter 2024) stream and MATS 5.1 extension. Jacob Dunefsky is currently receiving funding from the Long-Term Future Fund for this work.
    Background and motivation
    Mechanistic interpretability is fundamentally concerned with reverse-engineering models' computations into human-understandable parts. Much early mechanistic interpretability work (e.g. indirect object identification) has dealt with decomposing model computations into circuits involving small numbers of model components like attention heads or MLP sublayers.
    But these component-level circuits operate at too coarse a granularity: due to the relatively small number of components in a model, each individual component will inevitably be important to all sorts of computations, oftentimes playing different roles. In other words, components are polysemantic.
    Therefore, if we want a more faithful and more detailed understanding of the model, we should aim to find fine-grained circuits that decompose the model's computation onto the level of individual feature vectors.
    As a hypothetical example of the utility that feature-level circuits might provide in the very near-term: if we have a feature vector that seems to induce gender bias in the model, then understanding which circuits this feature vector partakes in (including which earlier-layer features cause it to activate and which later-layer features it activates) would better allow us to understand the side-effects of debiasing methods.
    More ambitiously, we hope that similar reasoning might apply to a feature that would seem to mediate deception in a future unaligned AI: a fuller understanding of feature-level circuits could help us understand whether this deception feature actually is responsible for the entirety of deception in a model, or help us understand the extent to which alignment methods remove the harmful behavior.
    Some of the earliest work on SAEs aimed to use them to find such feature

    • 41 min
    AF - Towards a formalization of the agent structure problem by Alex Altair



    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards a formalization of the agent structure problem, published by Alex Altair on April 29, 2024 on The AI Alignment Forum.
    In Clarifying the Agent-Like Structure Problem (2022), John Wentworth describes a hypothetical instance of what he calls a selection theorem. In Scott Garrabrant's words, the question is, does agent-like behavior imply agent-like architecture? That is, if we take some class of behaving things and apply a filter for agent-like behavior, do we end up selecting things with agent-like architecture (or structure)? Of course, this question is heavily under-specified.
    So another way to ask this is, under which conditions does agent-like behavior imply agent-like structure? And, do those conditions feel like they formally encapsulate a naturally occurring condition?
    For the Q1 2024 cohort of AI Safety Camp, I was a Research Lead for a team of six people, where we worked a few hours a week to better understand and make progress on this idea. The teammates[1] were Einar Urdshals, Tyler Tracy, Jasmina Nasufi, Mateusz Bagiński, Amaury Lorin, and Alfred Harwood.
    The AISC project duration was too short to find and prove a theorem version of the problem. Instead, we investigated questions like:
    What existing literature is related to this question?
    What are the implications of using different types of environment classes?
    What could "structure" mean, mathematically? What could "modular" mean?
    What could it mean, mathematically, for something to be a model of something else?
    What could a "planning module" look like? How does it relate to "search"?
    Can the space of agent-like things be broken up into sub-types? What exactly is a "heuristic"?
    Other posts on our progress may come out later. For this post, I'd like to simply help concretize the problem that we wish to make progress on.
    What are "agent behavior" and "agent structure"?
    When we say that something exhibits agent behavior, we mean that it seems to make the trajectory of the system go a certain way. We mean that, instead of the "default" way that a system might evolve over time, the presence of this agent-like thing makes it go some other way. The more specific a target it seems to hit, the more agentic we say its behavior is. On LessWrong, the word "optimization" is often used for this type of system behavior. So that's the behavior that we're gesturing toward.
    Seeing this behavior, one might say that the thing seems to want something, and tries to get it. It seems to somehow choose actions which steer the future toward the thing that it wants.
    If it does this across a wide range of environments, then it seems like it must be paying attention to what happens around it, using that information to infer how the world around it works, and using that model of the world to figure out what actions to take that would be more likely to lead to the outcomes it wants. This is a vague description of a type of structure. That is, it's a description of a type of process happening inside the agent-like thing.
    So, exactly when does the observation that something robustly optimizes imply that it has this kind of process going on inside it?
    Our slightly more specific working hypothesis for what agent-like structure is consists of three parts: a world-model, a planning module, and a representation of the agent's values. The world-model is very roughly like Bayesian inference; it starts out ignorant about what world it's in, and updates as observations come in. The planning module somehow identifies candidate actions, and then uses the world model to predict their outcomes.
    And the representation of its values is used to select which outcome is preferred. It then takes the corresponding action.
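    As a toy illustration of this three-part picture (a simplification for this summary, not the team's formalization), the Python sketch below wires together a crude Bayesian world-model over a finite set of hypotheses, a planning module that scores candidate actions by predicted outcomes, and a value function that selects among them; every name and detail here is invented.

```python
import random


class WorldModel:
    """Maintains a belief over a finite set of hypotheses about the world's dynamics."""

    def __init__(self, hypotheses, prior):
        # hypotheses: dict mapping name -> transition function (state, action) -> next_state
        self.hypotheses = hypotheses
        self.belief = dict(prior)  # name -> probability

    def update(self, state, action, observed_next):
        # Bayes rule with likelihood 1 if the hypothesis predicted the observation
        # exactly, and a small epsilon otherwise.
        eps = 1e-3
        likelihood = {name: (1.0 if h(state, action) == observed_next else eps)
                      for name, h in self.hypotheses.items()}
        z = sum(likelihood[n] * p for n, p in self.belief.items())
        self.belief = {n: likelihood[n] * p / z for n, p in self.belief.items()}

    def predict(self, state, action):
        # Sample a hypothesis according to the current belief and roll it forward.
        names, probs = zip(*self.belief.items())
        chosen = random.choices(names, weights=probs)[0]
        return self.hypotheses[chosen](state, action)


def plan(world_model, value, state, actions, n_samples=20):
    # Planning module: score each candidate action by the average value of its
    # predicted outcomes under the world-model, then take the best one.
    def score(action):
        outcomes = [world_model.predict(state, action) for _ in range(n_samples)]
        return sum(value(o) for o in outcomes) / n_samples
    return max(actions, key=score)
```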
    This may sound to you like an algorithm for utility maximization. But a big part of the idea behind the agent str

    • 22 min
