100 episodes

The Alignment Newsletter is a weekly publication with recent content relevant to AI alignment.
This podcast is an audio version, recorded by Robert Miles (http://robertskmiles.com)

More information about the newsletter at: https://rohinshah.com/alignment-newsletter/

Alignment Newsletter Podcast Rohin Shah et al.

    • News
    • 5.0 • 5 Ratings

The Alignment Newsletter is a weekly publication with recent content relevant to AI alignment.
This podcast is an audio version, recorded by Robert Miles (http://robertskmiles.com)

More information about the newsletter at: https://rohinshah.com/alignment-newsletter/

    Alignment Newsletter #171: Disagreements between alignment "optimists" and "pessimists"

    Alignment Newsletter #171: Disagreements between alignment "optimists" and "pessimists"

    Recorded by Robert Miles: http://robertskmiles.com
    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/
    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg
     
    HIGHLIGHTS Alignment difficulty (Richard Ngo and Eliezer Yudkowsky) (summarized by Rohin): Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His argument in this dialogue is roughly as follows:
    1. We are very likely going to keep improving AI capabilities until we reach AGI, at which point either the world is destroyed, or we use the AI system to take some pivotal act before some careless actor destroys the world.
    2. In either case, the AI system must be producing high-impact, world-rewriting plans; such plans are “consequentialist” in that the simplest way to get them (and thus, the one we will first build) is if you are forecasting what might happen, thinking about the expected consequences, considering possible obstacles, searching for routes around the obstacles, etc. If you don’t do this sort of reasoning, your plan goes off the rails very quickly - it is highly unlikely to lead to high impact. In particular, long lists of shallow heuristics (as with current deep learning systems) are unlikely to be enough to produce high-impact plans.
    3. We’re producing AI systems by selecting for systems that can do impressive stuff, which will eventually produce AI systems that can accomplish high-impact plans using a general underlying “consequentialist”-style reasoning process (because that’s the only way to keep doing more impressive stuff). However, this selection process does not constrain the goals towards which those plans are aimed. In addition, most goals seem to have convergent instrumental subgoals like survival and power-seeking that would lead to extinction. This suggests that we should expect an existential catastrophe by default.
    4. None of the methods people have suggested for avoiding this outcome seem like they actually avert this story.
    Richard responds to this with a few distinct points:
    1. It might be possible to build AI systems which are not of world-destroying intelligence and agency, that humans use to save the world. For example, we could make AI systems that do better alignment research. Such AI systems do not seem to require the property of making long-term plans in the real world in point (3) above, and so could plausibly be safe.
    2. It might be possible to build general AI systems that only state plans for achieving a goal of interest that we specify, without executing that plan.
    3. It seems possible to create consequentialist systems with constraints upon their reasoning that lead to reduced risk.
    4. It also seems possible to create systems with the primary aim of producing plans with certain properties (that aren't just about outcomes in the world) -- think for example of corrigibility (AN #35) or deference to a human user.
    5. (Richard is also more bullish on coordinating not to use powerful and/or risky AI systems, though the debate did not discuss this much.)
    Eliezer’s responses:
    1. AI systems that help with alignment research to such a degree that it actually makes a difference are almost certainly already dangerous.
    2. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous.
    3 and 4. It is certainly possible to do such things; the space of minds that could be designed is very large. However, it is difficult to do such things, as they tend to make consequentialist reasoning weaker, and on our current trajectory the first AGI that we build will probably not look like that.
    This post has also been summarized by others here, though with different emphases than in my summary.
     

    • 14 min
    Analyzing the argument for risk from power-seeking AI

    Analyzing the argument for risk from power-seeking AI

    Recorded by Robert Miles: http://robertskmiles.com
    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/
    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg
      HIGHLIGHTS Draft report on existential risk from power-seeking AI (Joe Carlsmith) (summarized by Rohin): This report investigates the classic AI risk argument in detail, and decomposes it into a set of conjunctive claims. Here’s the quick version of the argument. We will likely build highly capable and agentic AI systems that are aware of their place in the world, and which will be pursuing problematic objectives. Thus, they will take actions that increase their power, which will eventually disempower humans leading to an existential catastrophe. We will try and avert this, but will probably fail to do so since it is technically challenging, and we are not capable of the necessary coordination.
    There’s a lot of vague words in the argument above, so let’s introduce some terminology to make it clearer:
    - Advanced capabilities: We say that a system has advanced capabilities if it outperforms the best humans on some set of important tasks (such as scientific research, business/military/political strategy, engineering, and persuasion/manipulation).
    - Agentic planning: We say that a system engages in agentic planning if it (a) makes and executes plans, (b) in pursuit of objectives, (c) on the basis of models of the world. This is a very broad definition, and doesn’t have many of the connotations you might be used to for an agent. It does not need to be a literal planning algorithm -- for example, human cognition would count, despite (probably) not being just a planning algorithm.
    - Strategically aware: We say that a system is strategically aware if it models the effects of gaining and maintaining power over humans and the real-world environment.
    - PS-misaligned (power-seeking misaligned): On some inputs, the AI system seeks power in unintended ways, due to problems with its objectives (if the system actually receives such inputs, then it is practically PS-misaligned.)
    The core argument is then that AI systems with advanced capabilities, agentic planning, and strategic awareness (APS-systems) will be practically PS-misaligned, to an extent that causes an existential catastrophe. Of course, we will try to prevent this -- why should we expect that we can’t fix the problem? The author considers possible remedies, and argues that they all seem quite hard:
    - We could give AI systems the right objectives (alignment), but this seems quite hard -- it’s not clear how we would solve either outer or inner alignment.
    - We could try to shape objectives to be e.g. myopic, but we don’t know how to do this, and there are strong incentives against myopia.
    - We could try to limit AI capabilities by keeping systems special-purpose rather than general, but there are strong incentives for generality, and some special-purpose systems can be dangerous, too.
    - We could try to prevent the AI system from improving its own capabilities, but this requires us to anticipate all the ways the AI system could improve, and there are incentives to create systems that learn and change as they gain experience.
    - We could try to control the deployment situations to be within some set of circumstances where we know the AI system won’t seek power. However, this seems harder and harder to do as capabilities increase, since with more capabilities, more options become available.
    - We could impose a high threshold of safety before an AI system is deployed, but the AI system could still seek power during training, and there are many incentives pushing for faster, riskier deployment (even if we have already seen warning shots).
    - We could try to correct the behavior of misaligned AI systems, or mitigate their impact, after deployment. This seems like it requires humans to have comparabl

    • 13 min
    Collaborating with humans without human data

    Collaborating with humans without human data

    Recorded by Robert Miles: http://robertskmiles.com
    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/
    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg
     
    HIGHLIGHTS Collaborating with Humans without Human Data (DJ Strouse et al) (summarized by Rohin): We’ve previously seen that if you want to collaborate with humans in the video game Overcooked, it helps to train a deep RL agent against a human model (AN #70), so that the agent “expects” to be playing against humans (rather than e.g. copies of itself, as in self-play). We might call this a “human-aware” model. However, since a human-aware model must be trained against a model that imitates human gameplay, we need to collect human gameplay data for training. Could we instead train an agent that is robust enough to play with lots of different agents, including humans as a special case?
    This paper shows that this can be done with Fictitious Co-Play (FCP), in which we train our final agent against a population of self-play agents and their past checkpoints taken throughout training. Such agents get significantly higher rewards when collaborating with humans in Overcooked (relative to the human-aware approach in the previously linked paper).
    In their ablations, the authors find that it is particularly important to include past checkpoints in the population against which you train. They also test whether it helps to have the self-play agents have a variety or architectures, and find that it mostly does not make a difference (as long as you are using past checkpoints as well).
    Read more: Related paper: Maximum Entropy Population Based Training for Zero-Shot Human-AI Coordination
    Rohin's opinion: You could imagine two different philosophies on how to build AI systems -- the first option is to train them on the actual task of interest (for Overcooked, training agents to play against humans or human models), while the second option is to train a more robust agent on some more general task, that hopefully includes the actual task within it (the approach in this paper). Besides Overcooked, another example would be supervised learning on some natural language task (the first philosophy), as compared to pretraining on the Internet GPT-style and then prompting the model to solve your task of interest (the second philosophy). In some sense the quest for a single unified AGI system is itself a bet on the second philosophy -- first you build your AGI that can do all tasks, and then you point it at the specific task you want to do now.
    Historically, I think AI has focused primarily on the first philosophy, but recent years have shown the power of the second philosophy. However, I don’t think the question is settled yet: one issue with the second philosophy is that it is often difficult to fully “aim” your system at the true task of interest, and as a result it doesn’t perform as well as it “could have”. In Overcooked, the FCP agents will not learn specific quirks of human gameplay that could be exploited to improve efficiency (which the human-aware agent could do, at least in theory). In natural language, even if you prompt GPT-3 appropriately, there’s still some chance it ends up rambling about something else entirely, or neglects to mention some information that it “knows” but that a human on the Internet would not have said. (See also this post (AN #141).)
    I should note that you can also have a hybrid approach, where you start by training a large model with the second philosophy, and then you finetune it on your task of interest as in the first philosophy, gaining the benefits of both.
    I’m generally interested in which approach will build more useful agents, as this seems quite relevant to forecasting the future of AI (which in turn affects lots of things including AI alignment plans).
      TECHNICAL AI ALIGNMENT

    • 15 min
    Four technical topics for which Open Phil is soliciting grant proposals

    Four technical topics for which Open Phil is soliciting grant proposals

    Recorded by Robert Miles: http://robertskmiles.com
    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/
    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg
     
    HIGHLIGHTS Request for proposals for projects in AI alignment that work with deep learning systems (Nick Beckstead and Asya Bergal) (summarized by Rohin): Open Philanthropy is seeking proposals for AI safety work in four major areas related to deep learning, each of which I summarize below. Proposals are due January 10, and can seek up to $1M covering up to 2 years. Grantees may later be invited to apply for larger and longer grants.
    Rohin's opinion: Overall, I like these four directions and am excited to see what comes out of them! I'll comment on specific directions below.
    RFP: Measuring and forecasting risks (Jacob Steinhardt) (summarized by Rohin): Measurement and forecasting is useful for two reasons. First, it gives us empirical data that can improve our understanding and spur progress. Second, it can allow us to quantitatively compare the safety performance of different systems, which could enable the creation of safety standards. So what makes for a good measurement?
    1. Relevance to AI alignment: The measurement exhibits a failure mode that becomes worse as models become larger, or tracks a potential capability that may emerge with further scale (which in turn could enable deception, hacking, resource acquisition, etc).
    2. Forward-looking: The measurement helps us understand future issues, not just those that exist today. Isolated examples of a phenomenon are good if we have nothing else, but we’d much prefer to have a systematic understanding of when a phenomenon occurs and how it tends to quantitatively increase or decrease with various factors. See for example scaling laws (AN #87).
    3. Rich data source: Not all trends in MNIST generalize to CIFAR-10, and not all trends in CIFAR-10 generalize to ImageNet. Measurements on data sources with rich factors of variation are more likely to give general insights.
    4. Soundness and quality: This is a general category for things like “do we know that the signal isn’t overwhelmed by the noise” and “are there any reasons that the measurement might produce false positives or false negatives”.
    What sorts of things might you measure?
    1. As you scale up task complexity, how much do you need to scale up human-labeled data to continue to maintain good performance and avoid reward hacking? If you fail at this and there are imperfections in the reward, how bad does this become?
    2. What changes do we observe based on changes in the quality of the human feedback (e.g. getting feedback from amateurs vs experts)? This could give us information about the acceptable “difference in intelligence” between a model and its supervisor.
    3. What happens when models are pushed out of distribution along a factor of variation that was not varied in the pretraining data?
    4. To what extent do models provide wrong or undesired outputs in contexts where they are capable of providing the right answer?
    Rohin's opinion: Measurements generally seem great. One story for impact is that we have a measurement that we think is strongly correlated with x-risk, and we use that measurement to select an AI system that scores low on such a metric. This seems distinctly good and I think would in fact reduce x-risk! But I want to clarify that I don’t think it would convince me that the system was safe with high confidence. The conceptual arguments against high confidence in safety seem quite strong and not easily overcome by such measurements. (I’m thinking of objective robustness failures (AN #66) of the form “the model is trying to pursue a simple proxy, but behaves well on the training distribution until it can execute a treacherous turn”.)
    You can also tell stories where the

    • 16 min
    Concrete ML safety problems and their relevance to x-risk

    Concrete ML safety problems and their relevance to x-risk

    Recorded by Robert Miles: http://robertskmiles.com
    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/
    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg
    HIGHLIGHTS Unsolved Problems in ML Safety (Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt) (summarized by Dan Hendrycks): To make the case for safety to the broader machine learning research community, this paper provides a revised and expanded collection of concrete technical safety research problems, namely:
    1. Robustness: Create models that are resilient to adversaries, unusual situations, and Black Swan events.
    2. Monitoring: Detect malicious use, monitor predictions, and discover unexpected model functionality.
    3. Alignment: Build models that represent and safely optimize hard-to-specify human values.
    4. External Safety: Use ML to address risks to how ML systems are handled, including cyberwarfare and global turbulence.
    Throughout, the paper attempts to clarify problem’s motivation and provide concrete project ideas.
     
    Dan Hendrycks' opinion: My coauthors and I wrote this paper with the ML research community as our target audience. Here are some thoughts on this topic:
    1. The document includes numerous problems that, if left unsolved, would imply that ML systems are unsafe. We need the effort of thousands of researchers to address all of them. This means that the main safety discussions cannot stay within the confines of the relatively small EA community. I think we should aim to have over one third of the ML research community work on safety problems. We need the broader community to treat AI at least as seriously as safety for nuclear power plants.
    2. To grow the ML research community, we need to suggest problems that can progressively build the community and organically grow support for elevating safety standards within the existing research ecosystem. Research agendas that pertain to AGI exclusively will not scale sufficiently, and such research will simply not get enough market share in time. If we do not get the machine learning community on board with proactively mitigating risks that already exist, we will have a harder time getting them to mitigate less familiar and unprecedented risks. Rather than try to win over the community with alignment philosophy arguments, I'll try winning them over with interesting problems and try to make work towards safer systems rewarded with prestige.
    3. The benefits of a larger ML Safety community are numerous. They can decrease the cost of safety methods and increase the propensity to adopt them. Moreover, to make ML systems have desirable properties, it is necessary to rapidly accumulate incremental improvements, but this requires substantial growth since such gains cannot be produced by just a few card-carrying x-risk researchers with the purest intentions.
    4. The community will fail to grow if we ignore near-term concerns or actively exclude or sneer at people who work on problems that are useful for both near- and long-term safety (such as adversaries). The alignment community will need to stop engaging in textbook territorialism and welcome serious hypercompetent researchers who do not post on internet forums or who happen not to subscribe to effective altruism. (We include a community strategy in the Appendix.)
    5. We focus on reinforcement learning but also deep learning. Most of the machine learning research community studies deep learning (e.g., text processing, vision) and does not use, say, Bellman equations or PPO. While existentially catastrophic failures will likely require competent sequential decision making agents, the relevant problems and solutions can often be better studied outside of gridworlds and MuJoCo. There is much useful safety research to be done that does not need to be cast as a reinforcement learning problem.
    6. To prevent alienating readers, we did not use phrases s

    • 17 min
    Is it crazy to claim we're in the most important century?

    Is it crazy to claim we're in the most important century?

    Recorded by Robert Miles: http://robertskmiles.com
    More information about the newsletter here: https://rohinshah.com/alignment-newsletter/
    YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg
     
    HIGHLIGHTS The "most important century" series (Holden Karnofsky) (summarized by Rohin): In some sense, it is really weird for us to claim that there is a non-trivial chance that in the near future, we might build transformative AI and either (1) go extinct or (2) exceed a growth rate of (say) 100% per year. It feels like an extraordinary claim, and thus should require extraordinary evidence. One way of cashing this out: if the claim were true, this century would be the most important century, with the most opportunity for individuals to have an impact. Given the sheer number of centuries there are, this is an extraordinary claim; it should really have extraordinary evidence. This series argues that while the claim does seem extraordinary, all views seem extraordinary -- there isn’t some default baseline view that is “ordinary” to which we should be assigning most of our probability.
    Specifically, consider three possibilities for the long-run future:
    1. Radical: We will have a productivity explosion by 2100, which will enable us to become technologically mature. Think of a civilization that sends spacecraft throughout the galaxy, builds permanent settlements on other planets, harvests large fractions of the energy output from stars, etc.
    2. Conservative: We get to a technologically mature civilization, but it takes hundreds or thousands of years. Let’s say even 100,000 years to be ultra conservative.
    3. Skeptical: We never become technologically mature, for some reason. Perhaps we run into fundamental technological limits, or we choose not to expand into the galaxy, or we’re in a simulation, etc.
    It’s pretty clear why the radical view is extraordinary. What about the other two?
    The conservative view implies that we are currently in the most important 100,000-year period. Given that life is billions of years old, and would presumably continue for billions of years to come once we reach a stable galaxy-wide civilization, that would make this the most important 100,000 year period out of tens of thousands of such periods. Thus the conservative view is also extraordinary, for the same reason that the radical view is extraordinary (albeit it is perhaps only half as extraordinary as the radical view).
    The skeptical view by itself does not seem obviously extraordinary. However, while you could assign 70% probability to the skeptical view, it seems unreasonable to assign 99% probability to such a view -- that suggests some very strong or confident claims about what prevents us from colonizing the galaxy, that we probably shouldn’t have given our current knowledge. So, we need to have a non-trivial chunk of probability on the other views, which still opens us up to critique of having extraordinary claims.
    Okay, so we’ve established that we should at least be willing to say something as extreme as “there’s a non-trivial chance we’re in the most important 100,000-year period”. Can we tighten the argument, to talk about the most important century? In fact, we can, by looking at the economic growth rate.
    You are probably aware that the US economy grows around 2-3% per year (after adjusting for inflation), so a business-as-usual, non-crazy, default view might be to expect this to continue. You are probably also aware that exponential growth can grow very quickly. At the lower end of 2% per year, the economy would double every ~35 years. If this continued for 8200 years, we'd need to be sustaining multiple economies as big as today's entire world economy per atom in the universe. While this is not a priori impossible, it seems quite unlikely to happen. This suggests that we’re in one of fewer than 82 centuries that will have growth rates at 2% or larger, making i

    • 15 min

Customer Reviews

5.0 out of 5
5 Ratings

5 Ratings

Top Podcasts In News

The New York Times
Tortoise Media
NPR
Deuxmoi & Cadence13
The Daily Wire
The Daily Wire

You Might Also Like

Matt Parker & Bec Hill
Complexly