2 hr 56 min

27 - AI Control with Buck Shlegeris and Ryan Greenblatt
AXRP - the AI X-risk Research Podcast

    • Technology

A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to.
Patreon: patreon.com/axrpodcast
Ko-fi: ko-fi.com/axrpodcast
 
Topics we discuss, and timestamps:
0:00:31 - What is AI control?
0:16:16 - Protocols for AI control
0:22:43 - Which AIs are controllable?
0:29:56 - Preventing dangerous coded AI communication
0:40:42 - Unpredictably uncontrollable AI
0:58:01 - What control looks like
1:08:45 - Is AI control evil?
1:24:42 - Can red teams match misaligned AI?
1:36:51 - How expensive is AI monitoring?
1:52:32 - AI control experiments
2:03:50 - GPT-4's aptitude at inserting backdoors
2:14:50 - How AI control relates to the AI safety field
2:39:25 - How AI control relates to previous Redwood Research work
2:49:16 - How people can work on AI control
2:54:07 - Following Buck and Ryan's research
 
The transcript: axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html
Links for Buck and Ryan:
 - Buck's Twitter/X account: twitter.com/bshlgrs
 - Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt
 - You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com
 
Main research works we talk about:
 - The case for ensuring that powerful AIs are controlled: lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
 - AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942
 
Other things we mention:
 - The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root
 - Preventing language models from hiding their reasoning: arxiv.org/abs/2310.18512
 - Improving the Welfare of AIs: A Nearcasted Proposal: lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal
 - Measuring coding challenge competence with APPS: arxiv.org/abs/2105.09938
 - Causal Scrubbing: a method for rigorously testing interpretability hypotheses: lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
 
Episode art by Hamish Doodles: hamishdoodles.com

