464 episodes

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org

# The Nonlinear Library: Alignment Forum

The Nonlinear Fund

• Education



## AF - Probabilistic Payor Lemma? by Abram Demski

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Probabilistic Payor Lemma?, published by Abram Demski on March 19, 2023 on The AI Alignment Forum.
Epistemic status: too good to be true? Please check my math.
We've known for a while that Löb's theorem fails when proof is relaxed to probabilistic belief. This has pros and cons. On the pro side, it means there's no Löbian Obstacle to probabilistic self-trust. On the con side, it means that some Löb-derived insights for proof-based decision theory don't translate to probabilistic decision theory, at least not as directly as one might hope. In particular, it appeared to dash hopes for probabilistic generalizations of the "Löbian handshake" for cooperation.
Recently, Andrew Critch wrote about the Payor Lemma, which allows for a very similar "modal handshake" without Löb's Theorem. The lemma was proved using the same modal assumptions as Löb's, so on the surface it may appear to be just a different method to achieve similar results, whose main advantage is that it is much easier to prove (and therefore explain and understand) than Löb's Theorem.
But, a natural question arises: does Payor's Lemma have a suitable probabilistic version?
I'll give an affirmative proof; but I haven't confirmed that the assumptions are reasonable to my satisfaction.
Setup
Let L be a language in first-order logic, expressive enough to represent its sentences s∈L as quoted terms ┌s┐, e.g., through Gödel numbering; and with a probability function symbol on these terms, p(┌s┐), which can be equated with (some representation of) rational numbers, e.g. p(┌⊤┐)=1, p(┌s┐)=1/2, etc. I also assume the system can reason about these rational numbers in the basic ways you'd expect.
For all a,b∈L and all r∈Q, we have:
If ⊢a, then ⊢p(┌a┐)=1.
If ⊢a→b, then ⊢p(┌a┐)≤p(┌b┐).
(These assumptions might look pretty minimal, but they aren't going to be true for every theory of self-referential truth; more on this later.)
Let B(s) abbreviate the sentence p(┌s┐)>c for any s and some globally fixed constant c strictly between 0 and 1. This is our modal operator.
Some important properties of B:
Necessitation. If ⊢s, then ⊢B(s), for any s.
Proof: Since ⊢s implies ⊢p(┌s┐)=1, and c∈(0,1), we have ⊢p(┌s┐)>c, which is to say, ⊢B(s). [End proof.]
Weak distributivity. If ⊢x→y, then ⊢B(x)→B(y).
Proof: When ⊢x→y, we have ⊢p(┌y┐)≥p(┌x┐), so ⊢p(┌x┐)>c→p(┌y┐)>c. [End proof.]
(Regular distributivity would say B(x→y) implies B(x)→B(y). The assumption ⊢x→y is stronger than B(x→y), so the above is a weaker form of distributivity.)
Theorem Statement
If ⊢B(B(x)→x)→x, then ⊢x.
Proof
1. ⊢x→(B(x)→x), by tautology (a→(b→a)).
2. So ⊢B(x)→B(B(x)→x), from 1 by weak distributivity.
3. Suppose ⊢B(B(x)→x)→x.
4. ⊢B(x)→x from 2 and 3.
5. ⊢B(B(x)→x) from 4 by necessitation.
6. ⊢x from 5 and 3. [End proof.]
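The proof above can be checked mechanically. The following Lean 4 sketch treats formulas and provability abstractly; the names `Form`, `imp`, `Prov`, `mp`, `k`, `trans`, `nec`, and `wdist` are illustrative assumptions rather than anything from the post, and `trans` (chaining two provable implications) is the rule that the step deriving ⊢B(x)→x uses implicitly.

```lean
-- A minimal sketch: every rule the proof relies on is taken as a hypothesis,
-- so this checks the shape of the argument, not the probabilistic setup itself.
theorem payor
    {Form : Type} (imp : Form → Form → Form) (B : Form → Form)
    (Prov : Form → Prop)
    -- modus ponens
    (mp : ∀ {a b : Form}, Prov (imp a b) → Prov a → Prov b)
    -- the tautology a → (b → a)
    (k : ∀ a b : Form, Prov (imp a (imp b a)))
    -- chaining of provable implications (assumed; implicit in the prose proof)
    (trans : ∀ {a b c : Form}, Prov (imp a b) → Prov (imp b c) → Prov (imp a c))
    -- necessitation
    (nec : ∀ {s : Form}, Prov s → Prov (B s))
    -- weak distributivity
    (wdist : ∀ {x y : Form}, Prov (imp x y) → Prov (imp (B x) (B y)))
    (x : Form)
    (h : Prov (imp (B (imp (B x) x)) x)) : Prov x :=
  have s1 : Prov (imp x (imp (B x) x)) := k x (B x)
  have s2 : Prov (imp (B x) (B (imp (B x) x))) := wdist s1
  have s4 : Prov (imp (B x) x) := trans s2 h
  have s5 : Prov (B (imp (B x) x)) := nec s4
  mp h s5
```

Note that full distributivity never appears among the hypotheses; only `wdist` is needed.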
Discussion
Comparison to Original Proof
The proof steps mirror Critch's treatment very closely. The key difference is step 2, i.e., how I obtain a statement like ⊢□x→□(□x→x). Critch uses distributivity, which is not available to me:
B(a→b)→(B(a)→B(b))?
Suppose B(a→b), i.e., p(┌a→b┐)>c.
Rewrite p(┌b∨¬a┐)>c.
Now suppose B(a), that is, p(┌a┐)>c.
p(┌¬a┐)<1−c.
p(┌b┐)≥p(┌b∨¬a┐)−p(┌¬a┐)>p(┌b∨¬a┐)−1+c>c−1+c.
p(┌b┐)>2c−1.
So we only get:
B_c(a→b)→(B_c(a)→B_d(b)),
where B_r(s) abbreviates p(┌s┐)>r and we have d=2c−1.
So in general, attempted applications of distributivity create weakened belief operators, which would get in the way of the proof (very similar to how probabilistic Löb fails).
However, the specific application we want happens to go through, due to a logical relationship between a and b; namely, that b is a weaker statement than a.
This reveals a way in which the assumptions for Payor's Lemma are importantly weaker than those required for Löb to go through.
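To make the weakened threshold concrete, here is a small numeric sketch. The four-world probability model and its weights are hypothetical choices for illustration, not taken from the post. It exhibits a case where a→b and a are each believed above c, yet b is only believed above d=2c−1.

```python
from fractions import Fraction as F

# Hypothetical model: each world fixes truth values (a, b) and carries a weight.
# The weights are illustrative assumptions chosen to sum to 1.
worlds = {
    (True, True): F(5, 10),
    (True, False): F(2, 10),
    (False, True): F(0, 10),
    (False, False): F(3, 10),
}


def prob(event):
    """Total weight of the worlds in which event(a, b) holds."""
    return sum(w for (a, b), w in worlds.items() if event(a, b))


c = F(6, 10)   # belief threshold for B_c
d = 2 * c - 1  # weakened threshold for B_d

p_a = prob(lambda a, b: a)
p_a_implies_b = prob(lambda a, b: (not a) or b)
p_b = prob(lambda a, b: b)

print(f"p(a)={p_a}, p(a->b)={p_a_implies_b}, p(b)={p_b}, c={c}, d={d}")
print("B_c(a):", p_a > c)
print("B_c(a->b):", p_a_implies_b > c)
print("B_c(b):", p_b > c)
print("B_d(b):", p_b > d)
```

In this model B_c(b) fails even though both premises hold, while the weaker B_d(b) still goes through, matching the bound derived above.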
So, the

• 6 min

## AF - Shell games by Tsvi Benson-Tilsen

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shell games, published by Tsvi Benson-Tilsen on March 19, 2023 on The AI Alignment Forum.
[Metadata: crossposted from. First completed November 18, 2022.]
Shell game
Here's the classic shell game: Youtube
Screenshot from that video.
The little ball is a phantom: when you look for it under a specific shell, it's not there, it's under a different shell.
(This might be where the name "shell company" comes from: the business dealings are definitely somewhere, just not in this company you're looking at.)
Perpetual motion machines
Related: Perpetual motion beliefs
Bhāskara's wheel is a proposed perpetual-motion machine from the Middle Ages:
Here's another version:
From this video.
Someone could try arguing that this really is a perpetual motion machine:
Q: How do the bars get lifted up? What does the work to lift them?
A: By the bars on the other side pulling down.
Q: How does the wheel keep turning? How do the bars pull more on their way down than on their way up?
A: Because they're extended further from the center on the downward-moving side than on the upward-moving side, so they apply more torque to the wheel.
Q: How do the bars extend further on the way down?
A: Because the momentum of the wheel carries them into the vertical bar, flipping them over.
Q: But when that happens, energy is expended to lift up the little weights; that energy comes out of the kinetic energy of the wheel.
A: Ok, you're right, but that's not necessary to the design. All we need is that the torque on the downward side is greater than the torque on the upward side, so instead of flipping the weights up, we could tweak the mechanism to just shift them outward, straight to the side. That doesn't take any energy because it's just going straight sideways, from a resting position to another resting position.
Q: Yeah... you can shift them sideways with nearly zero work... but that means the weights are attached to the wheel at a pivot, right? So they'll just fall back and won't provide more torque.
A: They don't pivot, you fix them in place so they provide more torque.
Q: Ok, but then when do you push the weights back inward?
A: At the bottom.
Q: When the weight is at the bottom? But then the slider isn't horizontal, so pushing the weight back towards the center is pushing it upward, which takes work.
A: I meant, when the slider is at the bottom--when it's horizontal.
Q: But if the sliders are fixed in place, by the time they're horizontal at the bottom, you've already lifted the weights back up some amount; they're strong-torquing the other way.
A: At the bottom there's a guide ramp to lift the weights using normal force.
Q: But the guide ramp is also torquing the wheel.
And so on. The inventor can play hide the torque and hide the work.
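The bookkeeping behind the dialogue can be sketched as a per-leg energy ledger. This is a minimal illustration under stated assumptions: a single weight on a hypothetical wheel that rides at radius `r_out` while descending and `r_in` while ascending, sliding along a vertical spoke at the top and bottom. The function names and the numbers are invented for the example. Because gravitational work depends only on the change in height, whatever the descending side gains is paid back by the shifts.

```python
G = 9.81  # gravitational acceleration in m/s^2


def gravity_work(y_start, y_end, mass=1.0):
    """Work done by gravity on a mass moving between two heights (path-independent)."""
    return mass * G * (y_start - y_end)


def cycle_ledger(r_out, r_in, mass=1.0):
    """Gravitational work on one weight for each leg of one full turn.

    Assumed layout: heights are measured from the wheel's axle, the weight
    descends at radius r_out and ascends at radius r_in, and it slides along
    a vertical spoke at the top and at the bottom.
    """
    return {
        "descend at r_out": gravity_work(+r_out, -r_out, mass),
        "slide inward at bottom": gravity_work(-r_out, -r_in, mass),
        "ascend at r_in": gravity_work(-r_in, +r_in, mass),
        "slide outward at top": gravity_work(+r_in, +r_out, mass),
    }


ledger = cycle_ledger(r_out=1.0, r_in=0.5)
net = sum(ledger.values())

for leg, work in ledger.items():
    print(f"{leg:>24}: {work:+.3f} J")
print(f"{'net over one turn':>24}: {net:+.3f} J")
```

Moving the shifts elsewhere on the wheel changes which leg carries the cost, not the total, which is the "hide the work" move in the dialogue.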
Shell games in alignment
Some alignment schemes--schemes for structuring or training an AGI so that it can be transformatively useful and doesn't kill everyone--are prone to playing shell games. That is, there are some features of the scheme that don't seem to happen in a specific place; they happen somewhere other than where you're looking at the moment. Consider these questions:
What sort of smarter-than-human work is supposed to be done by the AGI? When and how does it do that work--by what combination of parts across time?
How does it become able to do that work? At what points does the AGI come to new understanding that it didn't have before?
How does the AGI orchestrate its thinking and actions to have large effects on the world? By what process, components, rules, or other elements?
What determines the direction that the AGI's actions will push the world? Where did those determiners come from, and how exactly do they determine the direction?
Where and how much do human operators have to make judgements? How much are those judgements

• 6 min

## AF - More information about the dangerous capability evaluations we did with GPT-4 and Claude. by Beth Barnes

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More information about the dangerous capability evaluations we did with GPT-4 and Claude., published by Beth Barnes on March 19, 2023 on The AI Alignment Forum.
[Written for more of a general-public audience than alignment-forum audience. We're working on a more thorough technical report.]
We believe that capable enough AI systems could pose very large risks to the world. We don’t think today’s systems are capable enough to pose these sorts of risks, but we think that this situation could change quickly and it’s important to be monitoring the risks consistently. Because of this, ARC is partnering with leading AI labs such as Anthropic and OpenAI as a third-party evaluator to assess potentially dangerous capabilities of today’s state-of-the-art ML models. The dangerous capability we are focusing on is the ability to autonomously gain resources and evade human oversight.
We attempt to elicit models’ capabilities in a controlled environment, with researchers in-the-loop for anything that could be dangerous, to understand what might go wrong before models are deployed. We think that future highly capable models should involve similar “red team” evaluations for dangerous capabilities before the models are deployed or scaled up, and we hope more teams building cutting-edge ML systems will adopt this approach. The testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable.
As we expected going in, today’s models (while impressive) weren’t capable of autonomously making and carrying out the dangerous activities we tried to assess. But models are able to succeed at several of the necessary components. Given only the ability to write and run code, models have some success at simple tasks involving browsing the internet, getting humans to do things for them, and making long-term plans – even if they cannot yet execute on this reliably.
As AI systems improve, it is becoming increasingly difficult to rule out that models might be able to autonomously gain resources and evade human oversight – so rigorous evaluation is essential. It is important to have systematic, controlled testing of these capabilities in place before models pose an imminent risk, so that labs can have advance warning when they’re getting close and know to stop scaling up models further until they have robust safety and security guarantees.
This post will briefly lay out our motivation, methodology, an example task, and high-level conclusions. The information given here isn’t enough to give a full understanding of what we did or make our results replicable, and we won’t go into detail about results with specific models. We will publish more detail on our methods and results soon.
Motivation
Today’s AI systems can write convincing emails, give fairly useful instructions on how to carry out acts of terrorism, threaten users who have written negative things about them, and otherwise do things the world is not very ready for. Many people have tried using models to write and run code unsupervised, find vulnerabilities in code, or carry out money-making schemes.
Today’s models also have some serious limitations to their abilities. But the companies that have released today’s AI models are investing heavily in building more powerful, more capable ones.
ARC is worried that future ML models may be able to autonomously act in the real world, doing things like “incorporate a company” or “exploit arbitrages in stock prices” or “design and synthesize DNA” without needing any human assistance or oversight. If models have the ability to act autonomously like this, this could pose major risks if they’re pursuing goals that are at odds with their human designers.

• 13 min

## AF - "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities) by David Scott Krueger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: "Publish or Perish" (a quick note on why you should try to make your work legible to existing academic communities), published by David Scott Krueger on March 18, 2023 on The AI Alignment Forum.
This is a brief, stylized recounting of a few conversations I had at some point last year with people from the non-academic AI safety community:
Me: you guys should write up your work properly and try to publish it in ML venues.
Them: well that seems like a lot of work and we don't need to do that because we can just talk to each other and all the people I want to talk to are already working with me.
Me: What about the people who you don't know who could contribute to this area and might even have valuable expertise? You could have way more leverage if you can reach those people. Also, there is increasing interest from the machine learning community in safety and alignment... because of progress in capabilities people are really starting to consider these topics and risks much more seriously.
Them: okay, fair point, but we don't know how to write ML papers.
Me: well, it seems like maybe you should learn or hire people to help you with that then, because it seems like a really big priority and you're leaving lots of value on the table.
Them: hmm, maybe... but the fact is, none of us have the time and energy and bandwidth and motivation to do that; we are all too busy with other things and nobody wants to.
Me: ah, I see! It's an incentive problem! So I guess your funding needs to be conditional on you producing legible outputs.
Me, reflecting afterwards: hmm... Cynically, not publishing is a really good way to create a moat around your research... People who want to work on that area have to come talk to you, and you can be a gatekeeper. And you don't have to worry about somebody with more skills and experience coming along and trashing your work or out-competing you and rendering it obsolete...
EtA: In comments, people have described adhering to academic standards of presentation and rigor as "jumping through hoops". There is an element of that, but this really misses the value that these standards have to the academic community. This is a longer discussion, though...
There are sort of 3 AI safety communities in my account:
1) people in academia
2) people at industry labs who are building big models
3) the rest (alignment forum/less wrong and EA being big components). I'm not sure where to classify new orgs like Conjecture and Redwood, but for the moment I put them here.
I'm referring to the last of these in this case.
I'm not accusing anyone of having bad motivations; I think it is almost always valuable to consider both people's conscious motivations and their incentives (which may be subconscious drivers of their behavior).
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

• 2 min

## AF - What organizations other than Conjecture have (esp. public) info-hazard policies? by David Scott Krueger

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What organizations other than Conjecture have (esp. public) info-hazard policies?, published by David Scott Krueger on March 16, 2023 on The AI Alignment Forum.
I believe Anthropic has said they won't publish capabilities research?
OpenAI seems to be sort of doing the same (although no policy AFAIK).
I heard FHI was developing one way back when...
I think MIRI sort of does as well (default to not publishing, IIRC?)

• 47 sec

## AF - [ASoT] Some thoughts on human abstractions by leogao

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [ASoT] Some thoughts on human abstractions, published by leogao on March 16, 2023 on The AI Alignment Forum.
TL;DR:
Consider a human concept such as "tree." Humans implement some algorithm for determining whether given objects are trees. We expect our predictor/language model to develop a model of this algorithm because this is useful for predicting the behavior of humans.
This is not the same thing as some kind of platonic ideal concept of what is “actually” a tree, which the algorithm is not incentivized to develop by training on internet text, and trying to retarget the search at it has the same supervision problems as RLHF against human scores on whether things look like trees.
Pointing at this “actually a tree” concept inside the network is really hard; the ability of LMs to comprehend natural language does not allow one to point using natural language, because it just passes the buck.
Epistemic status: written fast instead of not at all, probably partially deeply confused and/or unoriginal. Thanks to Collin Burns, Nora Belrose, and Garett Baker for conversations.
Will NNs learn human abstractions?
As setup, let's consider an ELK predictor (the thing that predicts future camera frames). There are facts about the world that we don't understand that are in some way useful for predicting the future observations. This is why we can expect the predictor to learn facts that are superhuman (in that if you tried to supervised-train a model to predict those facts, you would be unable to generate the ground truth data yourself).
Now let's imagine the environment we're predicting consists of a human who can (to take a concrete example) look at things and try to determine if they're trees or not. This human implements some algorithm for taking various sensory inputs and outputting a tree/not tree classification. If the human does this a lot, it will probably become useful to have an abstraction that corresponds to the output of this algorithm. Crucially, this algorithm can be fooled by e.g. a fake tree that the human can't distinguish from a real tree because (say) they don't understand biology well enough or something.
However, the human can also be said to, in some sense, be "trying" to point to the "actual" tree. Let's try to firm this down. The human has some process they endorse for refining their understanding of what is a tree / "doing science" in ELK parlance; for example, spending time studying from a biology textbook. We can think about the limit of this process. There are a few problems: it may not converge, or may converge to something that doesn't correspond to what is "actually" a tree, or may take a really really long time (due to irrationalities, or inherent limitations to human intelligence, etc). This suggests that this concept is not necessarily even well defined. But even if it is, this thing is far less naturally useful for predicting the future human behaviour than the algorithm the human actually implements! Implementing the actual human algorithm directly lets you predict things like how humans will behave when they look at things that look like trees to them.
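A toy sketch of the distinction, under loudly hypothetical assumptions: the `Thing` type, its two boolean fields, and the four example objects are invented for illustration, and real predictors do not expose features this cleanly. The point it shows is only that a feature tracking the human's algorithm predicts the human's labels better than a feature tracking what is "actually" a tree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thing:
    name: str
    looks_like_tree: bool     # what the human can observe
    is_biological_tree: bool  # hidden fact the human cannot check


def human_says_tree(thing: Thing) -> bool:
    """Stand-in for the algorithm the human actually implements."""
    return thing.looks_like_tree


def actually_a_tree(thing: Thing) -> bool:
    """Stand-in for the idealized 'actually a tree' concept."""
    return thing.is_biological_tree


# Hypothetical objects; the last two are cases where the concepts come apart.
THINGS = [
    Thing("oak", True, True),
    Thing("lamp post", False, False),
    Thing("convincing fake tree", True, False),
    Thing("odd-looking seedling", False, True),
]


def agreement_with_human(feature) -> float:
    """Fraction of objects on which a feature matches the human's label."""
    hits = sum(feature(t) == human_says_tree(t) for t in THINGS)
    return hits / len(THINGS)


human_algo_score = agreement_with_human(human_says_tree)
actual_tree_score = agreement_with_human(actually_a_tree)

print("human-algorithm feature:", human_algo_score)
print("'actually a tree' feature:", actual_tree_score)
```

A predictor rewarded only for matching human behaviour has a reason to represent the first feature, and none in particular to represent the second.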
More generally, one possible superhuman AI configuration I can imagine is one where the bulk of the circuits are used to predict its best-guess for what will happen in the world. There may also be a set of circuits that operate in a more humanlike ontology used specifically for predicting humans, or it may be that the best-guess circuits are capable enough that this is not necessary (and if we scale up our reporter we eventually get a human simulator inside the reporter).
The optimistic case here is if the "actually a tree" abstraction happens to be a thing that is useful for (or is very easily mapped from) the weird alien ontology, possibly because some abstrac

• 8 min