
The Nonlinear Library
The Nonlinear Fund
The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other EA blogs. To find out more, please visit us at nonlinear.org
AF - Descriptive vs. specifiable values by Tsvi Benson-Tilsen
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Descriptive vs. specifiable values, published by Tsvi Benson-Tilsen on March 26, 2023 on The AI Alignment Forum.
[Metadata: crossposted from. First completed November 19, 2022.]
What are an agent's values? An answer to this question might be a good description of the agent's external behavior and internal workings, without showing how one could modify the agent's workings or origins so that the agent pushes the world in a specific different direction.
Descriptive values
There's some discussion of what can be inferred about the values of an agent based on its behavior and structure. E.g. see Daniel Dennett's intentional stance, and "Occam's razor is insufficient to infer the preferences of irrational agents" by Stuart Armstrong, Sören Mindermann (arxiv), and this post by Vanessa Kosoy.
One could describe an agent as having certain values: the agent's behavior is a boundedly rational attempt to push the world in certain directions. For some purposes, it's useful to have a parsimonious description of an agent's behavior or internal workings in terms of values. For example, such a description could be useful for helping the agent out: to help the agent out, you push the world in the same direction that the agent is trying to push the world.
Specifiable values
A distinct purpose in describing an agent as having values is to answer questions about values in counterfactuals:
What determined that the agent would have those values and not other values?
Under what circumstances will the agent continue to have those values? E.g., will the agent rewrite itself so that its behavior is no longer well-described as boundedly pursuing those values?
How could the agent's values be modified? How could the values be modified in a specific direction, or to a specific state, so that the modified agent has some specific effect on the world?
How could the agent's ontogeny--the process that made it what it is--be altered so that it ends up with some other specific values?
To make these questions more likely to have answers, and to not rely too much on assumptions about what values are, replace the notion of "values" with the notion "what directions a mind ends up pushing the world in".
Quasi-example: explicit utility maximization
An auxiliary question: how, mechanistically, do "the values" determine the behavior? This question might not have an answer, because there might not be some component in the agent that constitutes "the values". For example, in humans, there's no clear value component; there are many in-built behavior-determiners, but they don't fully constitute what we call our values. But, in cases where we clearly understand the mechanism by which an agent's values determine its behavior, answers to other questions about values in counterfactuals might follow.
For example, there's the classic agent model: a system that searches for actions that it predicts will lead in expectation to the most highly-scored world according to its utility function box. The mechanism is explicit in this model. The utility function is embodied, in a box, as an input-output function, and it determines the agent's effects on the world by providing the criterion that the agent uses to select actions. Some answers to the above questions follow. E.g., it's clear at least qualitatively how to modify the agent's values to a specific state: if you want to make the agent cause a certain kind of world, just change the utility function to score that kind of world highly.
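To make the mechanism concrete, here is a minimal sketch of that classic agent model (hypothetical toy code, not taken from the post): the utility function sits in a box as an ordinary input-output function, a crude sampled world model predicts outcomes, and the agent picks whichever available action scores best in expectation.

```python
import random

def utility_box(world):
    """The 'values' component: an explicit, swappable scoring function over worlds."""
    return world.get("paperclips", 0)  # hypothetical criterion, for illustration only

def simulate(world, action):
    """Toy stochastic dynamics, purely illustrative."""
    new_world = dict(world)
    new_world["paperclips"] = world.get("paperclips", 0) + action + random.gauss(0, 1)
    return new_world

def predict_outcomes(world, action, n_samples=100):
    """Stand-in world model: sample possible successor worlds given an action."""
    return [simulate(world, action) for _ in range(n_samples)]

def choose_action(world, actions):
    """Pick the action with the highest predicted expected utility."""
    def expected_utility(action):
        outcomes = predict_outcomes(world, action)
        return sum(utility_box(o) for o in outcomes) / len(outcomes)
    return max(actions, key=expected_utility)

# Modifying the agent's values "to a specific state" is just swapping utility_box.
print(choose_action({"paperclips": 0}, actions=[-1, 0, 1, 2]))
```

In this sketch, answering "how could the agent's values be modified to a specific state?" reduces to swapping out utility_box; the post's point is that most real agents, humans included, have no such cleanly separable component.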
Even this example is not so clear cut, and relies on background assumptions. See problems with embedded agency. For example, if we assume that there's already a fixed world (that is, an understanding of what's possible) about which to define the utility function, we sweep under the rug that the understanding behind having such a world had -
LW - Sam Altman on GPT-4, ChatGPT, and the Future of AI | Lex Fridman Podcast #367 by Gabriel Mukobi
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sam Altman on GPT-4, ChatGPT, and the Future of AI | Lex Fridman Podcast #367, published by Gabriel Mukobi on March 25, 2023 on LessWrong.
Lex Fridman just released a podcast episode with Sam Altman, CEO of OpenAI. In my opinion, there wasn't too much new here that hasn't been said in other recent interviews. However, here are some scattered notes on parts I found interesting from an AI safety lens:
AI risk
Lex asks Sama to steelman Eliezer Yudkowsky's views
Sama said there's some chance of no hope, but the only way he knows how to fix things is to keep iterating and eliminating the "1-shot-to-get-it-right" cases. He does like one of Eliezer's posts that discusses his reasons why he thinks alignment is hard [I believe this is in reference to AGI Ruin: A List of Lethalities].
Lex confirms he will do an interview with Eliezer.
Sama: Now is the time to ramp up technical alignment work.
Lex: What about fast takeoffs?
Sama: I'm not that surprised by GPT-4, was a little surprised by ChatGPT [I think this means this feels slow to him]. I'm in the long-takeoffs, short-timelines quadrant. I'm scared of the short-takeoff scenarios.
Sama has heard of but not seen Ex Machina.
On power
Sama says it's weird that it will be on the order of thousands of people in control of the first AGI.
Acknowledges that AI safety people think OAI deploying things fast is bad.
Sama asks how Lex thinks they're doing.
Lex likes the transparency and openly sharing the issues.
Sama: Should we open source GPT-4?
Lex: Knowing people at OAI, no (because he trusts them).
Sama: I think people at OAI know the stakes of what we're building. But we're always looking for feedback from smart people.
Lex: How do you take feedback?
Sama: Twitter is unreadable. Mostly from convos like this.
On responsibility
Sama: We will have very significant but new and different challenges [with governing/deciding how to steer AI]
Lex: Is it up to GPT or the humans to decrease the amount of hate in the world?
Sama: I think we as OAI have responsibility for the tools we put out in the world, I think the tools can't have responsibility.
Lex: So there could be harm caused by these tools
Sama: There will be harm caused by these tools. There will be tremendous benefits. But tools do wonderful good and real bad. And we will minimize the bad and maximize the good.
Jailbreaking
Lex: How do you prevent jailbreaking?
Sama: It kinda sucks being on the side of the company being jailbroken. We want the users to have a lot of control and have the models behave how they want within broad bounds. The existence of jailbreaking shows we haven't solved that problem yet, and the more we solve it, the less need there will be for jailbreaking. People don't really jailbreak iPhones anymore.
Shipping products
Lex: shows this tweet summarizing all the OAI products in the last year
Sama: There's a question of whether we should be very proud of that or whether other companies should be very embarrassed. We have a high bar on our team, we work hard, we give a huge amount of trust, autonomy, and authority to individual people, and we try to hold each other to very high standards. Those things together enable us to ship at such a high velocity.
Lex: How do you go about hiring?
Sama: I spend 1/3 of my time hiring, and I approve every OAI hire. There are no shortcuts to good hiring.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. -
LW - Manifold: If okay AGI, why? by Eliezer Yudkowsky
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Manifold: If okay AGI, why?, published by Eliezer Yudkowsky on March 25, 2023 on LessWrong.
Arguably the most important topic about which a prediction market has yet been run: Conditional on an okay outcome with AGI, how did that happen?
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org. -
LW - A stylized dialogue on John Wentworth's claims about markets and optimization by So8res
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A stylized dialogue on John Wentworth's claims about markets and optimization, published by So8res on March 25, 2023 on LessWrong.
(This is a stylized version of a real conversation, where the first part happened as part of a public debate between John Wentworth and Eliezer Yudkowsky, and the second part happened between John and me over the following morning. The below is combined, stylized, and written in my own voice throughout. The specific concrete examples in John's part of the dialog were produced by me.)
J: It seems to me that the field of alignment doesn't understand the most basic theory of agents, and is missing obvious insights when it comes to modeling the sorts of systems they purport to study.
N: Do tell. (I'm personally sympathetic to claims of the form "none of you idiots have any idea wtf you're doing", and am quite open to the hypothesis that I've been an idiot in this regard.)
J: Consider the coherence theorems that say that if you can't pump resources out of a system, then it's acting agent-like.
N: I'd qualify "agent-like with respect to you", if I used the word 'agent' at all (which I mostly wouldn't), and would caveat that there are a few additional subtleties, but sure.
J: Some of those subtleties are important! In particular: there's a gap between systems that you can't pump resources out of, and systems that have a utility function. The bridge across that gap is an additional assumption that the system won't pass up certain gains (in a specific sense).
Roughly: if you won't accept 1 pepper for 1 mushroom, then you should accept 2 mushrooms for 1 pepper, because a system that accepts both of those trades winds up with strictly more resources than a system that rejects both (by 1 mushroom), and you should be able to do at least that well.
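A toy bookkeeping check of that argument, with made-up starting holdings (illustrative only): compare a copy of the system that accepts both trades against one that rejects both.

```python
# Toy check of the "don't pass up certain gains" argument, with made-up starting holdings.
start = {"peppers": 10, "mushrooms": 10}

def apply_trade(holdings, give, get):
    out = dict(holdings)
    for good, qty in give.items():
        out[good] -= qty
    for good, qty in get.items():
        out[good] += qty
    return out

# Trade A: give 1 mushroom, receive 1 pepper.  Trade B: give 1 pepper, receive 2 mushrooms.
accepts_both = apply_trade(apply_trade(start, {"mushrooms": 1}, {"peppers": 1}),
                           {"peppers": 1}, {"mushrooms": 2})
rejects_both = dict(start)

print(accepts_both)  # {'peppers': 10, 'mushrooms': 11}
print(rejects_both)  # {'peppers': 10, 'mushrooms': 10}
# Accepting both trades leaves strictly more resources (one extra mushroom),
# so a system that rejects both is passing up a certain gain.
```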
N: I agree.
J: But some of the epistemically efficient systems around us violate this property.
For instance, consider a market for (at least) two goods: peppers and mushrooms; with (at least) two participants: Alice and Bob. Suppose Alice's utility is $U_A(p, m) := \log_{10}(p) + \log_{100}(m)$ (where $p$ and $m$ are the quantities of peppers and mushrooms owned by Alice, respectively), and Bob's utility is $U_B(p, m) := \log_{100}(p) + \log_{10}(m)$ (where $p$ and $m$ are the quantities of peppers and mushrooms owned by Bob, respectively).
Example equilibrium: the price is 3 peppers for 1 mushroom. Alice doesn't trade at this price when she has $3\,\log'_{10}(p) = 1\cdot\log'_{100}(m)$, i.e. $3\ln(10)/p = \ln(100)/m$, i.e. $3/p = 2/m$ (using the fact that $\ln(100) = \ln(10^2) = 2\ln(10)$), i.e. when Alice has 1.5 times as many peppers as she has mushrooms. Bob doesn't trade at this price when he has 6 times as many peppers as mushrooms, by a similar argument. So these prices can be an equilibrium whenever Alice has 1.5x as many peppers as mushrooms, and Bob has 6x as many peppers as mushrooms (regardless of the absolute quantities).
Now consider offering the market a trade of 25,000 peppers for 10,000 mushrooms. If Alice has 20,000 mushrooms (and thus 30,000 peppers), and Bob has only 1 mushroom (and thus 6 peppers), then the trade is essentially up to Alice. She'd observe that
$U_A(30{,}000 + 25{,}000,\ 20{,}000 - 10{,}000) = \log_{10}(55{,}000) + \log_{100}(10{,}000) \approx 6.74 > 6.63 \approx \log_{10}(30{,}000) + \log_{100}(20{,}000) = U_A(30{,}000, 20{,}000),$
so she (and thus, the market as a whole) would accept. But if Bob had 20,000 mushrooms (and thus 120,000 peppers), and Alice had only 2 mushrooms (and thus 3 peppers), then the trade is essentially up to Bob. He'd observe
$U_B(120{,}000 + 25{,}000,\ 20{,}000 - 10{,}000) = \log_{100}(145{,}000) + \log_{10}(10{,}000) \approx 6.58 < 6.84 \approx \log_{100}(120{,}000) + \log_{10}(20{,}000) = U_B(120{,}000, 20{,}000),$
so he wouldn't take the trade.
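A small numeric sketch of those two comparisons (same utility functions, holdings taken from the example; the code itself is illustrative, not from the post):

```python
import math

def log_base(b, x):
    return math.log(x) / math.log(b)

def U_A(p, m):  # Alice: log_10(peppers) + log_100(mushrooms)
    return log_base(10, p) + log_base(100, m)

def U_B(p, m):  # Bob: log_100(peppers) + log_10(mushrooms)
    return log_base(100, p) + log_base(10, m)

def accepts(utility, p, m, dp, dm):
    """Would this participant take a trade changing holdings by (dp, dm)?"""
    return utility(p + dp, m + dm) > utility(p, m)

# Offered trade: the market receives 25,000 peppers and gives up 10,000 mushrooms.
trade = (+25_000, -10_000)

# Case 1: the decision is up to Alice (30,000 peppers, 20,000 mushrooms).
print(accepts(U_A, 30_000, 20_000, *trade))   # True  -> market accepts

# Case 2: the decision is up to Bob (120,000 peppers, 20,000 mushrooms).
print(accepts(U_B, 120_000, 20_000, *trade))  # False -> market rejects
```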
Thus, we can see that whether a market — considered altogether — takes a trade, depends not only on the prices in the market (which you might have thought of as a sort of epistemic state, and that you might have noted was epistemically efficient with respect to you), but also on the hidden internal state of the market.
N: Sure. The argument was never "every epistemically efficient (wrt you) system is an optimizer", but rather "sufficiently good optimiz -
EA - Predicting what future people value: A terse introduction to Axiological Futurism by Jim Buhler
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Predicting what future people value: A terse introduction to Axiological Futurism, published by Jim Buhler on March 24, 2023 on The Effective Altruism Forum.
Why this is worth researching
Humanity might develop artificial general intelligence (AGI), colonize space, and create astronomical amounts of things in the future (Bostrom 2003; MacAskill 2022; Althaus and Gloor 2016). But what things? How (dis)valuable? And how does this compare with things grabby aliens would eventually create if they colonize our corner of the universe? What does this imply for our work aimed at impacting the long-term future?
While this depends on many factors, a crucial one will likely be the values of our successors.
Here’s a position that might tempt us while considering whether it is worth researching this topic:
Our descendants are unlikely to have values that are both different from ours in a very significant way and predictable. Either they have values similar to ours or they have values we can’t predict. Therefore, trying to predict their values is a waste of time and resources.
While I see how this can seem compelling, I think this is very ill-informed.
First, predicting the values of our successors – what John Danaher (2021) calls axiological futurism – in worlds where these are meaningfully different from ours doesn’t seem intractable at all. Significant progress has already been made in this research area and there seems to be room for much more (see the next section and the Appendix).
Second, a scenario where the values of our descendants don’t significantly differ from ours appears quite unlikely to me. We should watch for things like the End of History illusion, here. Values seem to notably evolve through History, and there is no reason to assume we are special enough to make us drop that prior.
Besides being tractable, I believe axiological futurism to be uncommonly important given its instrumentality in answering the crucial questions mentioned earlier. It therefore also seems unwarrantedly neglected as of today.
How to research this
Here are examples of broad questions that could be part of a research agenda on this topic:
What are the best predictors of future human values? What can we learn from usual forecasting methods?
How have people’s values changed throughout History? Why? What can we learn from this? (see, e.g., MacAskill 2022, Chapter 3; Harris 2019; Hopster 2022)
Are there reasons to think we’ll observe less change in the future? Why? Value lock-in? Some form of moral convergence happening soon?
Are there reasons to expect more change? Would that be due to the development of AGI, whole brain emulation, space colonization, and/or accelerated value drift?
More broadly, what impact will future technological progress have on values? (see Hanson 2016 for a forecast example.)
Should we expect some values to be selected for? (see, e.g., Christiano 2013; Bostrom 2009; Tomasik 2017)
Might a period of “long reflection” take place? If yes, can we get some idea of what could result from it?
Does something like coherent extrapolated volition have any chance of being pursued and if so, what could realistically result from it?
Are there futures – where humanity has certain values – that are unlikely but worth wagering on?
Might our research on this topic affect the values we should expect our successors to have by, e.g., triggering a self-defeating or self-fulfilling prophecy effect? (Danaher 2021, section 2)
What do/will aliens value (see my forthcoming next post) and what does that tell us about ourselves?
John Danaher (2021) gives examples of methodologies that could be used to answer these questions.
Also, my Appendix references examples and other relevant work, including the (forthcoming) next posts in this sequence.
Acknowledgment
Thanks to Anders -
LW - $500 Bounty/Contest: Explain Infra-Bayes In The Language Of Game Theory by johnswentworth
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: $500 Bounty/Contest: Explain Infra-Bayes In The Language Of Game Theory, published by johnswentworth on March 25, 2023 on LessWrong.
Here's my current best guess at how Infra-Bayes works:
We want to get worst-case guarantees for an agent using a Bayesian-like framework.
So, let our agent be a Bayesian which models the environment as containing an adversary which chooses worst-case values for any of the things over which we want worst-case guarantees.
That's just a standard two-player zero-sum game between the agent and the adversary, so we can import all the nice intuitive stuff from game theory.
... but instead of that, we're going to express everything in the unnecessarily-abstract language of measure theory and convex sets, and rederive a bunch of game theory without mentioning that that's what we're doing.
This bounty is for someone to write an intuitively-accessible infrabayes explainer in game theoretic language, and explain how the game-theoretic concepts relate to the concepts in existing presentations of infra-bayes. In short: provide a translation.
Here's a sample of the sort of thing I have in mind:
Conceptually, an infrabayesian agent is just an ordinary Bayesian game-theoretic agent, which models itself/its environment as a standard two-player zero-sum game.
In the existing presentations of infra-bayes, the two-player game is only given implicitly. The agent's strategy π solves the problem:
$$\max_{\pi} \min_{e \in B} \mathbb{E}_{\pi, e}[U]$$
In game-theoretic terms, the "max" represents the agent's decision, while the "min" represents the adversary's.
Much of the mathematical tractability stems from the fact that B is a convex set of environments (i.e. functions from policy π to probability distributions). In game-theoretic terms, the adversary's choice of strategy determines which "environment" the agent faces, and the adversary can choose from any option in B. Convexity of B follows from the adversary's ability to use mixed strategies: because the adversary can take a randomized mix of any two strategies available to it, the adversary can make the agent face any convex combination of (policy -> distribution) functions in B. Thus, B is closed under convex combinations; it's a convex set.
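As a minimal numeric sketch of that game-theoretic reading (toy payoff numbers, not from any infra-Bayes writeup): with a finite set of pure environments generating B, expected utility is linear in the adversary's mixture, so the worst case over all of B is attained at a pure environment, and the agent's problem reduces to maximin over a payoff table.

```python
# Toy zero-sum game: expected utility E_{pi,e}[U] for each (policy, pure environment).
# B is the convex hull of the adversary's pure environment choices; since the agent's
# expected utility is linear in the adversary's mixture, the min over B is attained
# at a pure environment, so checking the pure columns suffices.
payoffs = {
    ("cautious", "benign"): 2.0, ("cautious", "adversarial"): 1.0,
    ("bold",     "benign"): 5.0, ("bold",     "adversarial"): -3.0,
}
policies = ["cautious", "bold"]
environments = ["benign", "adversarial"]

def worst_case_value(policy):
    """min_{e in B} E_{pi,e}[U] for a fixed (pure) policy pi."""
    return min(payoffs[(policy, e)] for e in environments)

# max_pi min_{e in B} E_{pi,e}[U]  (agent restricted to pure policies for simplicity)
best_policy = max(policies, key=worst_case_value)
print(best_policy, worst_case_value(best_policy))  # cautious 1.0
```

On this reading, the measure theory and convex sets are handling the general (infinite, continuous) version of the same picture.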
I'd like a writeup along roughly these conceptual lines which covers as much as possible of the major high-level definitions and results in infra-bayes to date. On the other hand, I give approximately-zero shits about all the measure theory; just state the relevant high-level results in game-theoretic language, say what they mean intuitively, maybe mention whether there's some pre-existing standard game-theory theorem which can do the job or whether the infra-bayes version of the theorem is in fact the first proof of the game-theoretic equivalent, and move on.
Alternatively, insofar as core parts of infrabayes differ from a two-player zero-sum game, or the general path I'm pointing to doesn't work, an explanation of how they differ and what the consequences are could also qualify for prize money.
Bounty/Contest Operationalization
Most of the headache in administering this sort of bounty is the risk that some well-intended person will write something which is not at all what I want, expecting to get paid, and then I will either have to explain how/why it's not what I want (which takes a lot of work), or I have to just accept it. To mitigate that failure mode, I'll run this as a contest: to submit, write up your explanation as a lesswrong post, then send me a message on lesswrong to make sure I'm aware of it. Deadline is end of April. I will distribute money among submissions based on my own highly-subjective judgement. If people write stuff up early, I might leave feedback on their posts, but no promises.
I will count the "sample" above as a submission in its own right - i.e. I will imagine that thr