15 episodes

AI Safety Fundamentals: Alignment 201 BlueDot Impact

- Technology

Listen to resources from the AI Safety Fundamentals: Alignment 201 course!https://course.aisafetyfundamentals.com/alignment-201

- MAY 13, 2023
Empirical Findings Generalize Surprisingly Far

Empirical Findings Generalize Surprisingly Far

Previously, I argued that emergent phenomena in machine learning mean that we can’t rely on current trends to predict what the future of ML will be like. In this post, I will argue that despite this, empirical findings often do generalize very far, including across “phase transitions” caused by emergent behavior.This might seem like a contradiction, but actually I think divergence from current trends and empirical generalization are consistent. Findings do often generalize, but you need to th...
- 11 min
- MAY 13, 2023
Worst-Case Thinking in AI Alignment

Worst-Case Thinking in AI Alignment

Alternative title: “When should you assume that what could go wrong, will go wrong?” Thanks to Mary Phuong and Ryan Greenblatt for helpful suggestions and discussion, and Akash Wasil for some edits. In discussions of AI safety, people often propose the assumption that something goes as badly as possible. Eliezer Yudkowsky in particular has argued for the importance of security mindset when thinking about AI alignment. I think there are several distinct reasons that this might be the right ass...
- 11 min
- MAY 13, 2023
Least-To-Most Prompting Enables Complex Reasoning in Large Language Models

Least-To-Most Prompting Enables Complex Reasoning in Large Language Models

Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks which requires solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. So...
- 16 min
- MAY 13, 2023
Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions

Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions

Using hard multiple-choice reading comprehension questions as a testbed, we assess whether presenting humans with arguments for two competing answer options, where one is correct and the other is incorrect, allows human judges to perform more accurately, even when one of the arguments is unreliable and deceptive. If this is helpful, we may be able to increase our justified trust in language-model-based systems by asking them to produce these arguments where needed. Previous research has shown...
- 16 min
- MAY 13, 2023
Low-Stakes Alignment

Low-Stakes Alignment

Right now I’m working on finding a good objective to optimize with ML, rather than trying to make sure our models are robustly optimizing that objective. (This is roughly “outer alignment.”) That’s pretty vague, and it’s not obvious whether “find a good objective” is a meaningful goal rather than being inherently confused or sweeping key distinctions under the rug. So I like to focus on a more precise special case of alignment: solve alignment when decisions are “low stakes.” I think this cas...
- 13 min
- MAY 13, 2023
Imitative Generalisation (AKA ‘Learning the Prior’)

Imitative Generalisation (AKA ‘Learning the Prior’)

This post tries to explain a simplified version of Paul Christiano’s mechanism introduced here, (referred to there as ‘Learning the Prior’) and explain why a mechanism like this potentially addresses some of the safety problems with naïve approaches. First we’ll go through a simple example in a familiar domain, then explain the problems with the example. Then I’ll discuss the open questions for making Imitative Generalization actually work, and the connection with the Microscope AI idea. A mo...
- 18 min