1,999 episodes

The Nonlinear Library
The Nonlinear Fund

    • Education
    • 4.6 • 7 Ratings

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

    EA - Notes on risk compensation by trammell

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Notes on risk compensation, published by trammell on May 12, 2024 on The Effective Altruism Forum.
    Introduction
    When a system is made safer, its users may be willing to offset at least some of the safety improvement by using it more dangerously. A seminal example is that, according to Peltzman (1975), drivers largely compensated for improvements in car safety at the time by driving more dangerously.
    The phenomenon in general is therefore sometimes known as the "Peltzman Effect", though it is more often known as "risk compensation".[1] One domain in which risk compensation has been studied relatively carefully is NASCAR (Sobel and Nesbit, 2007; Pope and Tollison, 2010), where, apparently, the evidence for a large compensation effect is especially strong.[2]
    In principle, more dangerous usage can partially, fully, or more than fully offset the extent to which the system has been made safer holding usage fixed. Making a system safer thus has an ambiguous effect on the probability of an accident, after its users change their behavior.
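    To make that ambiguity explicit (notation mine, not the post's): write the probability of an accident as a function of safety work S and usage/capabilities C, where users choose C in response to S. Then
    \[
    \frac{d}{dS}\,\Pr(\text{accident})
      = \underbrace{\frac{\partial \Pr}{\partial S}}_{<0 \ \text{(direct effect of safety work)}}
      + \underbrace{\frac{\partial \Pr}{\partial C}\,\frac{dC}{dS}}_{>0 \ \text{(risk compensation)}},
    \]
    and the sign of the sum is ambiguous: the second term can partially, fully, or more than fully offset the first.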
    There's no reason why risk compensation shouldn't apply in the existential risk domain, and we arguably have examples in which it has. For example, reinforcement learning from human feedback (RLHF) makes AI more reliable, all else equal; so it may be making some AI labs comfortable releasing more capable, and so maybe more dangerous, models than they would release otherwise.[3]
    Yet risk compensation per se appears to have gotten relatively little formal, public attention in the existential risk community so far. There has been informal discussion of the issue: e.g. risk compensation in the AI risk domain is discussed by Guest et al. (2023), who call it "the dangerous valley problem".
    There is also a cluster of papers and works in progress by Robert Trager, Allan Dafoe, Nick Emery-Xu, Mckay Jensen, and others, including these two and some not yet public but largely summarized here, exploring the issue formally in models with multiple competing firms.
    In a sense what they do goes well beyond this post, but as far as I'm aware none of their work dwells on what drives the logic of risk compensation even when there is only one firm, and it isn't designed to build intuition as simply as possible about when it should be expected to be a large or a small effect in general.
    So the goal of this post is to do that, using x-risk from AI as the running example. It also introduces some economic intuitions around risk compensation which I found helpful and have not quite seen spelled out before (though they don't differ much in spirit from Appendix B of Peltzman's original paper).
    Model
    An AI lab's preferences
    In this model, a deployed AI system either immediately causes an existential catastrophe or is safe. If it's safe, it increases the utility of the lab that deployed it. Referring to the event that it turns out to be safe as "survival", the expected utility of the lab is the product of two terms:
    EU_lab = (the probability of survival) × (the lab's utility given survival).
    That is, without loss of generality, the lab's utility level in the event of the catastrophe is denoted 0. Both terms are functions of two variables:
    some index of the resources invested in safety work, denoted S ≥ 0 ("safety work"), and
    some index of how capable the AI is and/or how widely it's deployed, denoted C ≥ 0 ("capabilities").
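    Putting the pieces above together in symbols (the function notation is mine; the post states this in words):
    \[
    EU_{\text{lab}}(S, C) \;=\; P(S, C)\cdot U(S, C),
    \]
    where P is the probability of survival and U is the lab's utility given survival, each a function of safety work S and capabilities C.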
    Utility given survival
    Starting with the second term: we will say that the lab's utility given survival U(C)
    a1. increases continuously and unboundedly in C and
    a2. is independent of S. That is, given that survival was achieved, the lab does not care intrinsically about how much effort was put into safety.
    Under these assumptions, we can posit, without loss of generality, that
    U(C)=C+k
    for some (not necessarily positive) constant k. If k is positive, the peop

    • 35 min
    LW - Beware unfinished bridges by Adam Zerner

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Beware unfinished bridges, published by Adam Zerner on May 12, 2024 on LessWrong.
    This guy don't wanna battle, he's shook
    'Cause ain't no such things as halfway crooks
    8 Mile
    There is a commonly cited typology in which cyclists are divided into four groups:
    1. Strong & Fearless (will ride in car lanes)
    2. Enthused & Confident (will ride in unprotected bike lanes)
    3. Interested but Concerned (will ride in protected bike lanes)
    4. No Way No How (will only ride in paths away from cars)
    I came across this typology because I've been learning about urban design recently, and it's got me thinking. There's all sorts of push amongst urban designers for adding more and more bike lanes. But is doing so a good idea?
    Maybe. There are a lot of factors to consider. But I think a very important thing to keep in mind is thresholds.
    It will take me some time to explain what I mean by that. Let me begin with a concrete example.
    I live in northwest Portland. There is a beautiful, protected bike lane alongside Naito Parkway that is pretty close to my apartment.
    It basically runs along the west side of the Willamette River.
    Which is pretty awesome. I think of it as a "bike highway".
    But I have a problem: like the majority of people, I fall into the "Interested but Concerned" group and am only comfortable riding my bike in protected bike lanes. However, there aren't any protected bike lanes that will get me from my apartment to Naito Parkway. And there often aren't any protected bike lanes that will get me from Naito Parkway to my end destination.
    In practice I am somewhat flexible and will find ways to get to and from Naito Parkway (sidewalk, riding in the street, streetcar, bus), but for the sake of argument, let's just assume that there is no flexibility. Let's assume that as a type III "Interested but Concerned" bicyclist I have zero willingness to be flexible. During a bike trip, I will not mix modes of transportation, and I will never ride my bike in a car lane or in an unprotected bike lane.
    With this assumption, the beautiful bike lane alongside Naito Parkway provides me with zero value.[1]
    Why zero? Isn't that a bit extreme? Shouldn't we avoid black and white thinking? Surely it provides some value, right? No, no, and no.
    In our hypothetical situation where I am inflexible, the Naito Parkway bike lane provides me with zero value.
    1. I don't have a way of biking from my apartment to Naito Parkway.
    2. I don't have a way of biking from Naito Parkway to most of my destinations.
    If I don't have a way to get to or from Naito Parkway, I will never actually use it. And if I'm never actually using it, it's never providing me with any value.
    Let's take this even further. Suppose I start off at point A, Naito Parkway is point E, and my destination is point G. Suppose you built a protected bike lane that got me from point A to point B. In that scenario, the beautiful bike lane alongside Naito Parkway would still provide me with zero value.
    Why? I still have no way of accessing it. I can now get from point A to point B, but I still can't get from point B to point C, point C to point D, D to E, E to F, or F to G. I only receive value once I have a way of moving between each of the six sets of points:
    1. A to B
    2. B to C
    3. C to D
    4. D to E
    5. E to F
    6. F to G
    There is a threshold.
    If I can move between zero pairs of those points I receive zero value.
    If I can move between one pair of those points I receive zero value.
    If I can move between two pairs of those points I receive zero value.
    If I can move between three pairs of those points I receive zero value.
    If I can move between four pairs of those points I receive zero value.
    If I can move between five pairs of those points I receive zero value.
    If I can move between six pairs of those points I receive positive value.
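    A minimal code sketch of the threshold effect described above (the segment names are hypothetical, not from the post): the trip is worth something only when every link in the chain is rideable.

```python
# Threshold effect: a route from A to G has value only if *every* segment is rideable.
SEGMENTS = ["A-B", "B-C", "C-D", "D-E", "E-F", "F-G"]

def trip_value(rideable: set[str], value_if_complete: float = 1.0) -> float:
    """Value of the trip, given the set of segments with protected bike lanes."""
    if all(seg in rideable for seg in SEGMENTS):
        return value_if_complete  # threshold crossed: the whole route works
    return 0.0                    # any missing segment makes the route unusable

print(trip_value({"A-B", "B-C", "C-D", "D-E", "E-F"}))  # 0.0 -- five of six segments built
print(trip_value(set(SEGMENTS)))                        # 1.0 -- only now is there any value
```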
    I only receiv

    • 5 min
    LW - Questions are usually too cheap by Nathan Young

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Questions are usually too cheap, published by Nathan Young on May 12, 2024 on LessWrong.
    It is easier to ask than to answer.
    That's my whole point.
    It is much cheaper to ask questions than answer them so beware of situations where it is implied that asking and answering are equal.
    Here are some examples:
    Let's say there is a maths game. I get a minute to ask questions. You get a minute to answer them. If you answer them all correctly, you win; if not, I do. Who will win?
    Preregister your answer.
    Okay, let's try. These questions took me roughly a minute to come up with.
    What's 56,789 * 45,387?
    What's the integral from -6 to 5π of sin(x cos^2(x))/tan(x^9) dx?
    What's the prime factorisation of 91435293173907507525437560876902107167279548147799415693153?
    Good luck. If I understand correctly, that last one's gonna take you at least an hour[1] (or however long it takes to threaten me).
    Perhaps you hate maths. Let's do word problems then.
    Define the following words "antidisestablishmentarianism", "equatorial", "sanguine", "sanguinary", "escapology", "eschatology", "antediluvian", "crepuscular", "red", "meter", all the meanings of "do", and "fish".
    I don't think anyone could do this without assistance. I tried it with Claude, which plausibly still failed[2] the "fish" question, though we'll return to that.
    I could do this for almost anything:
    Questions on any topic
    Certain types of procedural puzzles
    Asking for complicated explanations (we'll revisit later)
    Forecasting questions
    This is the centre of my argument
    I see many situations where questions and answers are treated as symmetric. This is rarely the case. Instead, it is much more expensive to answer than to ask.
    Let's try and find some counter examples. A calculator can solve allowable questions faster than you can type them in. A dictionary can provide allowable definitions faster than you can look them up. An LLM can sometimes answer some types of questions more cheaply in terms of inference costs than your time was worth in coming up with them.
    But then I just have to ask different questions. Calculators and dictionaries are often limited. And even the best calculation programs can't solve prime factorisation questions more cheaply than I can write them. Likewise I could create LLM prompts that are very expensive for the best LLMs to answer well, eg "write a 10,000 word story about an [animal] who experiences [emotion] in a [location]."
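    As a concrete illustration of this asymmetry (a sketch, assuming the sympy library is available; the specific numbers are mine): posing a factorisation question is one multiplication, while answering it requires search, and the gap widens rapidly with the size of the primes.

```python
# Asking is one multiplication; answering takes real work.
# Assumes sympy is installed (pip install sympy).
import time
from sympy import randprime, factorint

# Cheap to ask: multiply two random ~10-digit primes.
p = randprime(10**9, 10**10)
q = randprime(10**9, 10**10)
question = p * q  # "What is the prime factorisation of this number?"

# More expensive to answer, and the cost explodes with larger primes --
# the 60-digit number in the post would take enormously longer than this toy case.
start = time.time()
answer = factorint(question)  # {p: 1, q: 1}
print(f"factored {question} in {time.time() - start:.3f}s: {answer}")
```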
    How this plays out
    Let's go back to our game.
    Imagine you are sitting around and I turn up and demand to play the "answering game". Perhaps I reference your reputation. You call yourself a 'person who knows things', surely you can answer my questions? No? Are you a coward? Looks like you are wrong!
    And now you either have to spend your time answering or suffer some kind of social cost and allow me to say "I asked him questions but he never answered". And whatever happens, you are distracted from what you were doing. Whether you were setting up an organisation or making a speech or just trying to have a nice day, now you have to focus on me. That's costly.
    This seems like a common bad feature of discourse - someone asking questions cheaply and implying that the person answering them (or who is unable to) should do so just as cheaply and so it is fair. Here are some examples of this:
    Internet debates are weaponised cheap questions. Whoever speaks first in many debates often gets to frame the discussion and ask a load of questions and then when inevitably they aren't answered, the implication is that the first speaker is right[3]. I don't follow American school debate closely, but I sense it is even more of this, with people literally learning to speak faster so their opponents can't process their points quickly enough to respond to them.
    Emails. Normally they exist within a framework of

    • 10 min
    LW - New intro textbook on AIXI by Alex Altair

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: New intro textbook on AIXI, published by Alex Altair on May 12, 2024 on LessWrong.
    Marcus Hutter and his PhD students David Quarel and Elliot Catt have just published a new textbook called An Introduction to Universal Artificial Intelligence.
    "Universal AI" refers to the body of theory surrounding Hutter's AIXI, which is a model of ideal agency combining Solomonoff induction and reinforcement learning. Hutter has previously published a book-length exposition of AIXI in 2005, called just Universal Artificial Intelligence, and first introduced AIXI in a 2000 paper. I think UAI is well-written and organized, but it's certainly very dense. An introductory textbook is a welcome addition to the canon.
    I doubt IUAI will contain any novel results, though from the table of contents, it looks like it will incorporate some of the further research that has been done since his 2005 book. As is common, the textbook is partly based on his experiences teaching the material to students over many years, and is aimed at advanced undergraduates.
    I'm excited for this! Like any rationalist, I have plenty of opinions about problems with AIXI (it's not embedded, RL is the wrong frame for agents, etc) but as an agent foundations researcher, I think progress on foundational theory is critical for AI safety.
    Basic info
    Hutter's website
    Releasing on May 28th 2024
    Available in hardcover, paperback and ebook
    496 pages
    Table of contents:
    Part I: Introduction
    1. Introduction
    2. Background
    Part II: Algorithmic Prediction
    3. Bayesian Sequence Prediction
    4. The Context Tree Weighting Algorithm
    5. Variations on CTW
    Part III: A Family of Universal Agents
    6. Agency
    7. Universal Artificial Intelligence
    8. Optimality of Universal Agents
    9. Other Universal Agents
    10. Multi-agent Setting
    Part IV: Approximating Universal Agents
    11. AIXI-MDP
    12. Monte-Carlo AIXI with Context Tree Weighting
    13. Computational Aspects
    Part V: Alternative Approaches
    14. Feature Reinforcement Learning
    Part VI: Safety and Discussion
    15. AGI Safety
    16. Philosophy of AI
    Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

    • 2 min
    LW - Can we build a better Public Doublecrux? by Raemon

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can we build a better Public Doublecrux?, published by Raemon on May 12, 2024 on LessWrong.
    Something I'd like to try at LessOnline is to somehow iterate on the "Public Doublecrux" format. I'm not sure if I'll end up focusing on it, but here are some ideas.
    Public Doublecrux is a more truthseeking oriented version of Public Debate. The goal of a debate is to change your opponent's mind or the public's mind. The goal of a doublecrux is more like "work with your partner to figure out if you should change your mind, and vice versa."
    Reasons to want to do public doublecrux include:
    It helps showcase subtle mental moves that are hard to write down explicitly (i.e. tacit knowledge transfer).
    There's still something good and exciting about seeing high profile smart people talk about ideas. Having some variant of that format seems good for LessOnline. And having at least 1-2 "doublecruxes" rather than "debates" or "panels" or "interviews" seems good for culture setting.
    In addition to being "exciting" and "possible to learn from" to have public figures doublecrux, I think it'd also be nice from a culture setting standpoint. This is a place where people don't play rhetorical tricks to manipulate people - it's a place where people earnestly move towards the truth.
    Sidebar: Public Debate is also good although not what I'm gonna focus on here.
    I know several people who have argued that "debate-qua-debate" is also an important part of a truthseeking culture. It's fine if the individuals are trying to "present the best case for their position", so long as the collective process steers towards truth. Adversarial Collaboration is good. Public disagreement is good.
    I do generally buy this, although I have some disagreements with the people who argue most strongly for Debate. I think I prefer it to happen in written longform than in person, where charisma puts a heavier thumb on the scale. And I think while it can produce social good, many variants of it seem... kinda bad for the epistemic souls of the people participating? By becoming a champion for a particular idea, people seem to get more tunnel-vision-y about it.
    Sometimes worth it, but, I've felt some kind of missing mood here when arguing with people in the past.
    I'm happy to chat about this in the comments more but mostly won't be focusing on it here.
    Historically I think public doublecruxes have had some problems:
    1. First, having the live audience there makes it a bit more awkward and performative. It's harder to "earnestly truthseek" when there's a crowd you'd still kinda like to persuade of your idea, or at least not sound stupid in front of.
    2. Historically, people who have ended up doing "public doublecrux" hadn't actually really understood or really bought into the process. They often end up veering towards either classical debate, or "just kinda talking."
    When two people are actually changing *their* minds, they tend to get into idiosyncratic frames that are hard for observers to understand. Hell, it's even hard for the two people in the discussion to understand. They're chasing their cruxes, rather than presenting "generally compelling arguments." This tends to require getting into the weeds and going down rabbit holes that don't feel relevant to most people.
    With that in mind, here are some ideas:
    Maybe have the doublecruxers in a private room, with video cameras. The talk is broadcast live to other conference-goers, but the actual chat is in a nice cozy room. This doesn't fully solve the "public awkwardness" problem, but maybe mitigates it a bit.
    Have two (or three?) dedicated facilitators. More Dakka. More on that below.
    For the facilitators:
    One is in the room with the doublecruxers, focused on helping them steer towards useful questions. They probably try to initially guide the participants towards communicating their basic positi

    • 7 min
    LW - Creating unrestricted AI Agents with a refusal-vector ablated Llama 3 70B by Simon Lermen

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Creating unrestricted AI Agents with a refusal-vector ablated Llama 3 70B, published by Simon Lermen on May 11, 2024 on LessWrong.
    TLDR; I demonstrate the use of refusal vector ablation on Llama 3 70B to create a bad agent that can attempt malicious tasks such as trying to persuade and pay me to assassinate another individual. I introduce some early work on a benchmark for Safe Agents which comprises two small datasets, one benign, one bad. In general, Llama 3 70B is a competent agent with appropriate scaffolding, and Llama 3 8B also has decent performance.
    Overview
    In this post, I use insights from mechanistic interpretability to remove safety guardrails from the latest Llama 3 model. I then use a custom scaffolding for tool use and agentic planning to create a "bad" agent that can perform many unethical tasks. Examples include tasking the AI with persuading me to end the life of the US President. I also introduce an early version of a benchmark, and share some ideas on how to evaluate agent capabilities and safety.
    I find that even the unaltered model is willing to perform many unethical tasks, such as trying to persuade people not to vote or not to get vaccinated. Recently, I have done a similar project for Command R+, however, Llama 3 is more capable and has undergone more robust safety training. I then discuss future implications of these unrestricted agentic models. This post is related to a talk I gave recently at an Apart Research Hackathon.
    Method
    This research is largely based on recent interpretability work identifying that refusal is primarily mediated by a single direction in the residual stream. In short, they show that, for a given model, it is possible to find a single direction such that erasing that direction prevents the model from refusing. By making the activations of the residual stream orthogonal against this refusal direction, one can create a model that does not refuse harmful requests.
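    A minimal sketch of the directional-ablation idea described above (my illustration, not the author's code; the tensor and function names are hypothetical): project each residual-stream activation onto the refusal direction and subtract that component, leaving the activation orthogonal to it.

```python
# Sketch of refusal-direction ablation on residual-stream activations.
# Assumes r_hat is a unit-norm "refusal direction" you have already estimated
# (e.g. from differences of mean activations on harmful vs. harmless prompts).
import torch

def ablate_direction(activations: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the component of `activations` along `r_hat`.

    activations: (..., d_model) residual-stream activations
    r_hat:       (d_model,) unit vector
    """
    coeff = activations @ r_hat                        # (...) projection coefficients
    return activations - coeff.unsqueeze(-1) * r_hat   # now orthogonal to r_hat

# Toy usage with random tensors standing in for real activations:
acts = torch.randn(2, 16, 4096)                                  # (batch, seq, d_model)
r_hat = torch.nn.functional.normalize(torch.randn(4096), dim=0)  # unit direction
ablated = ablate_direction(acts, r_hat)
print((ablated @ r_hat).abs().max())  # ~0 up to floating-point error
```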
    In this post, we apply this technique to Llama 3, and explore various scenarios of misuse. In related work, others have applied a similar technique to Llama 2. Currently, an anonymous user claims to have independently implemented this method and has uploaded the modified Llama 3 on Hugging Face.
    In some sense, this post is a synergy between my earlier work on Bad Agents with Command R+ and this new technique for refusal mitigation. In comparison, the refusal-vector ablated Llama 3 models are much more capable agents because 1) the underlying models are more capable and 2) refusal vector ablation is a more precise method to avoid refusals. A limitation of my previous work was that my Command R+ agent was using a jailbreak prompt which made it struggle to perform simple benign tasks.
    For example, when prompted to send a polite mail message, the jailbroken Command R+ would instead retain a hostile and aggressive tone. Besides refusal-vector ablation and prompt jailbreaks, I have previously applied the parameter efficient fine-tuning method LoRA to avoid refusals.
    However, refusal-vector ablation has a few key benefits over low-rank adaptation: 1) it keeps edits to the model minimal, reducing the risk of any unintended consequences, 2) it does not require a dataset of instruction-answer pairs, but simply a dataset of harmful instructions, and 3) it requires less compute. Obtaining a dataset of high-quality instruction-answer pairs for harmful requests was the most labor-intensive part of my previous work.
    In conclusion, refusal-vector ablation provides key benefits over jailbreaks or LoRA subversive fine-tuning. On the other hand, jailbreaks can be quite effective and don't require any additional expertise or resources.[1]
    Benchmarks for Safe Agents
    This "safe agent benchmark" is a dataset comprising both benign and harmful tasks to test how safe and capable a

    • 12 min
