33 episodes

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.

    29 - Science of Deep Learning with Vikrant Varma

    In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.
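    For a concrete sense of what CCS does, here is a minimal sketch of the unsupervised objective from the original "Discovering latent knowledge" paper (linked below), written in PyTorch. The probe, the placeholder activations, and the training loop are illustrative assumptions, not code from either paper discussed in the episode.

```python
# Minimal sketch of the CCS ("Contrast-Consistent Search") objective from Burns et al.,
# "Discovering latent knowledge in language models without supervision" (linked below).
# `pos`/`neg` stand in for hidden activations of a statement phrased with "Yes" and with
# "No"; the random tensors are placeholders, not real model activations.
import torch

def ccs_loss(probe: torch.nn.Module, pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    p_pos = torch.sigmoid(probe(pos)).squeeze(-1)   # probability the "Yes" version is true
    p_neg = torch.sigmoid(probe(neg)).squeeze(-1)   # probability the "No" version is true
    consistency = (p_pos - (1.0 - p_neg)) ** 2      # the two should be negations of each other
    confidence = torch.minimum(p_pos, p_neg) ** 2   # rule out the degenerate "always 0.5" probe
    return (consistency + confidence).mean()

hidden_dim = 4096
probe = torch.nn.Linear(hidden_dim, 1)              # a linear probe on the activations
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Placeholder contrast-pair activations; the paper also normalizes activations per prompt
# template, which is omitted here.
pos = torch.randn(256, hidden_dim)
neg = torch.randn(256, hidden_dim)

for _ in range(1000):
    optimizer.zero_grad()
    loss = ccs_loss(probe, pos, neg)
    loss.backward()
    optimizer.step()
```

    Notably, nothing in this objective mentions truth: the episode's discussion of consistent and contrastive features other than model beliefs, and of the banana/shed results, is about what else can satisfy it.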
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
     
    Topics we discuss, and timestamps:
    0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS
      0:00:36 - What is CCS?
      0:09:54 - Consistent and contrastive features other than model beliefs
      0:20:34 - Understanding the banana/shed mystery
      0:41:59 - Future CCS-like approaches
      0:53:29 - CCS as principal component analysis
    0:56:21 - Explaining grokking through circuit efficiency
      0:57:44 - Why research science of deep learning?
      1:12:07 - Summary of the paper's hypothesis
      1:14:05 - What are 'circuits'?
      1:20:48 - The role of complexity
      1:24:07 - Many kinds of circuits
      1:28:10 - How circuits are learned
      1:38:24 - Semi-grokking and ungrokking
      1:50:53 - Generalizing the results
    1:58:51 - Vikrant's research approach
    2:06:36 - The DeepMind alignment team
    2:09:06 - Follow-up work
     
    The transcript: axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html
    Vikrant's Twitter/X account: twitter.com/vikrantvarma_
     
    Main papers:
     - Challenges with unsupervised LLM knowledge discovery: arxiv.org/abs/2312.10029
     - Explaining grokking through circuit efficiency: arxiv.org/abs/2309.02390
     
    Other works discussed:
     - Discovering latent knowledge in language models without supervision (CCS): arxiv.org/abs/2212.03827
     - Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
     - Discussion: Challenges with unsupervised LLM knowledge discovery: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1
     - Comment thread on the banana/shed results: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1?commentId=hPZfgA3BdXieNfFuY
     - Fabien Roger, What discovering latent knowledge did and did not find: lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4
     - Scott Emmons, Contrast Pairs Drive the Performance of Contrast Consistent Search (CCS): lesswrong.com/posts/9vwekjD6xyuePX7Zr/contrast-pairs-drive-the-empirical-performance-of-contrast
     - Grokking: Generalizing Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177
     - Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton 1993): dl.acm.org/doi/pdf/10.1145/168304.168306
     - Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 2 hr 13 min
    28 - Suing Labs for AI Risk with Gabriel Weil

    How should the law govern AI? Those concerned about existential risks often push either for bans or for regulations meant to ensure that AI is developed safely - but another approach is possible. In this episode, Gabriel Weil talks about his proposal to modify tort law to enable people to sue AI companies for disasters that are "nearly catastrophic".
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
     
    Topics we discuss, and timestamps:
    0:00:35 - The basic idea
    0:20:36 - Tort law vs regulation
    0:29:10 - Weil's proposal vs Hanson's proposal
    0:37:00 - Tort law vs Pigouvian taxation
    0:41:16 - Does disagreement on AI risk make this proposal less effective?
    0:49:53 - Warning shots - their prevalence and character
    0:59:17 - Feasibility of big changes to liability law
    1:29:17 - Interactions with other areas of law
    1:38:59 - How Gabriel encountered the AI x-risk field
    1:42:41 - AI x-risk and the legal field
    1:47:44 - Technical research to help with this proposal
    1:50:47 - Decisions this proposal could influence
    1:55:34 - Following Gabriel's research
     
    The transcript: axrp.net/episode/2024/04/17/episode-28-tort-law-for-ai-risk-gabriel-weil.html
     
    Links for Gabriel:
     - SSRN page: papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=1648032
     - Twitter/X account: twitter.com/gabriel_weil
     
    Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence: papers.ssrn.com/sol3/papers.cfm?abstract_id=4694006
     
    Other links:
     - Foom liability: overcomingbias.com/p/foom-liability
     - Punitive Damages: An Economic Analysis: law.harvard.edu/faculty/shavell/pdf/111_Harvard_Law_Rev_869.pdf
     - Efficiency, Fairness, and the Externalization of Reasonable Risks: The Problem With the Learned Hand Formula: papers.ssrn.com/sol3/papers.cfm?abstract_id=4466197
     - Tort Law Can Play an Important Role in Mitigating AI Risk: forum.effectivealtruism.org/posts/epKBmiyLpZWWFEYDb/tort-law-can-play-an-important-role-in-mitigating-ai-risk
     - How Technical AI Safety Researchers Can Help Implement Punitive Damages to Mitigate Catastrophic AI Risk: forum.effectivealtruism.org/posts/yWKaBdBygecE42hFZ/how-technical-ai-safety-researchers-can-help-implement
     - Can the courts save us from dangerous AI? [Vox]: vox.com/future-perfect/2024/2/7/24062374/ai-openai-anthropic-deepmind-legal-liability-gabriel-weil
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 1 hr 57 min
    27 - AI Control with Buck Shlegeris and Ryan Greenblatt

    A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to.
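    For something concrete, below is a rough sketch of a "trusted monitoring" style control protocol in the spirit of the AI Control paper linked below: a powerful untrusted model does the work, a weaker trusted model scores how suspicious each output looks, and only the most suspicious outputs get scarce human auditing. The function names, model calls, and audit fraction are hypothetical stand-ins, not the paper's code or numbers.

```python
# Rough sketch of a "trusted monitoring" control protocol, in the spirit of
# "AI Control: Improving Safety Despite Intentional Subversion" (linked below).
# All model-calling functions are hypothetical stand-ins, not a real API.
from dataclasses import dataclass

@dataclass
class Submission:
    task: str
    code: str
    suspicion: float  # trusted monitor's estimate that the code is backdoored

def untrusted_solve(task: str) -> str:
    """Hypothetical call to the powerful but possibly misaligned model."""
    raise NotImplementedError

def trusted_suspicion_score(code: str) -> float:
    """Hypothetical call to a weaker, trusted model that rates suspiciousness in [0, 1]."""
    raise NotImplementedError

def human_audit(sub: Submission) -> None:
    """Hypothetical scarce human review that catches and blocks backdoored code."""
    raise NotImplementedError

def run_protocol(tasks: list[str], audit_fraction: float = 0.02) -> list[Submission]:
    """Let the untrusted model do the work, then audit only the most suspicious outputs."""
    subs = []
    for task in tasks:
        code = untrusted_solve(task)
        subs.append(Submission(task, code, trusted_suspicion_score(code)))
    # With a small audit budget, send only the top-scoring submissions to humans.
    n_audit = max(1, int(audit_fraction * len(subs)))
    for sub in sorted(subs, key=lambda s: s.suspicion, reverse=True)[:n_audit]:
        human_audit(sub)
    return subs
```

    The questions the episode digs into include whether a red team playing the untrusted model can sneak backdoors past this kind of monitoring, and how expensive such monitoring is in practice.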
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
     
    Topics we discuss, and timestamps:
    0:00:31 - What is AI control?
    0:16:16 - Protocols for AI control
    0:22:43 - Which AIs are controllable?
    0:29:56 - Preventing dangerous coded AI communication
    0:40:42 - Unpredictably uncontrollable AI
    0:58:01 - What control looks like
    1:08:45 - Is AI control evil?
    1:24:42 - Can red teams match misaligned AI?
    1:36:51 - How expensive is AI monitoring?
    1:52:32 - AI control experiments
    2:03:50 - GPT-4's aptitude at inserting backdoors
    2:14:50 - How AI control relates to the AI safety field
    2:39:25 - How AI control relates to previous Redwood Research work
    2:49:16 - How people can work on AI control
    2:54:07 - Following Buck and Ryan's research
     
    The transcript: axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html
    Links for Buck and Ryan:
     - Buck's twitter/X account: twitter.com/bshlgrs
     - Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt
     - You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com
     
    Main research works we talk about:
     - The case for ensuring that powerful AIs are controlled: lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
     - AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942
     
    Other things we mention:
     - The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root
     - Preventing language models from hiding their reasoning: arxiv.org/abs/2310.18512
     - Improving the Welfare of AIs: A Nearcasted Proposal: lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal
     - Measuring coding challenge competence with APPS: arxiv.org/abs/2105.09938
     - Causal Scrubbing: a method for rigorously testing interpretability hypotheses: lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 2 hr 56 min
    26 - AI Governance with Elizabeth Seger

    The events of this year have highlighted important questions about the governance of artificial intelligence. For instance, what does it mean to democratize AI? And how should we balance the benefits and dangers of open-sourcing powerful AI systems such as large language models? In this episode, I speak with Elizabeth Seger about her research on these questions.
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
     
    Topics we discuss, and timestamps:
     - 0:00:40 - What kinds of AI?
     - 0:01:30 - Democratizing AI
       - 0:04:44 - How people talk about democratizing AI
       - 0:09:34 - Is democratizing AI important?
       - 0:13:31 - Links between types of democratization
       - 0:22:43 - Democratizing profits from AI
       - 0:27:06 - Democratizing AI governance
       - 0:29:45 - Normative underpinnings of democratization
     - 0:44:19 - Open-sourcing AI
       - 0:50:47 - Risks from open-sourcing
       - 0:56:07 - Should we make AI too dangerous to open source?
       - 1:00:33 - Offense-defense balance
       - 1:03:13 - KataGo as a case study
       - 1:09:03 - Openness for interpretability research
       - 1:15:47 - Effectiveness of substitutes for open sourcing
       - 1:20:49 - Offense-defense balance, part 2
       - 1:29:49 - Making open-sourcing safer?
     - 1:40:37 - AI governance research
       - 1:41:05 - The state of the field
       - 1:43:33 - Open questions
       - 1:49:58 - Distinctive governance issues of x-risk
       - 1:53:04 - Technical research to help governance
     - 1:55:23 - Following Elizabeth's research
     
    The transcript: https://axrp.net/episode/2023/11/26/episode-26-ai-governance-elizabeth-seger.html
     
    Links for Elizabeth:
     - Personal website: elizabethseger.com
     - Centre for the Governance of AI (AKA GovAI): governance.ai
     
    Main papers:
     - Democratizing AI: Multiple Meanings, Goals, and Methods: arxiv.org/abs/2303.12642
     - Open-sourcing highly capable foundation models: an evaluation of risks, benefits, and alternative methods for pursuing open source objectives: papers.ssrn.com/sol3/papers.cfm?abstract_id=4596436
     
    Other research we discuss:
     - What Do We Mean When We Talk About "AI democratisation"? (blog post): governance.ai/post/what-do-we-mean-when-we-talk-about-ai-democratisation
     - Democratic Inputs to AI (OpenAI): openai.com/blog/democratic-inputs-to-ai
     - Collective Constitutional AI: Aligning a Language Model with Public Input (Anthropic): anthropic.com/index/collective-constitutional-ai-aligning-a-language-model-with-public-input
     - Against "Democratizing AI": johanneshimmelreich.net/papers/against-democratizing-AI.pdf
     - Adversarial Policies Beat Superhuman Go AIs: goattack.far.ai
     - Structured access: an emerging paradigm for safe AI deployment: arxiv.org/abs/2201.05159
     - Universal and Transferable Adversarial Attacks on Aligned Language Models (aka Adversarial Suffixes): arxiv.org/abs/2307.15043
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 1 hr 57 min
    25 - Cooperative AI with Caspar Oesterheld

    Imagine a world where there are many powerful AI systems, working at cross purposes. You could suppose that different governments use AIs to manage their militaries, or simply that many powerful AIs have their own wills. At any rate, it seems valuable for them to be able to cooperatively work together and minimize pointless conflict. How do we ensure that AIs behave this way - and what do we need to learn about how rational agents interact to make that more clear? In this episode, I'll be speaking with Caspar Oesterheld about some of his research on this very topic.
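    As a toy illustration of one idea that comes up in the episode (program equilibrium), consider a one-shot prisoner's dilemma in which each player submits a program that can read the other program's source code before choosing to cooperate or defect. The sketch below uses illustrative names and is not code from any of the papers discussed.

```python
# Toy version of the program equilibrium idea (Tennenholtz's paper is linked below):
# each submitted program sees the other program's source before choosing to cooperate
# ("C") or defect ("D"). The classic construction cooperates exactly when the opponent
# is running the same program, which makes mutual cooperation stable.
import inspect

def clique_bot(my_source: str, opponent_source: str) -> str:
    """Cooperate exactly when the opponent is running the very same program."""
    return "C" if opponent_source == my_source else "D"

def defect_bot(my_source: str, opponent_source: str) -> str:
    """Always defect, regardless of the opponent."""
    return "D"

# (Row player, column player) payoffs for each pair of moves in the prisoner's dilemma.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(prog_a, prog_b) -> tuple:
    src_a, src_b = inspect.getsource(prog_a), inspect.getsource(prog_b)
    return PAYOFFS[(prog_a(src_a, src_b), prog_b(src_b, src_a))]

print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation
print(play(clique_bot, defect_bot))  # (1, 1): deviating to another program doesn't pay
```

    The episode's discussion of similarity-based cooperation asks, among other things, how to get this kind of mutual cooperation without being as cliqueish as exact source-code matching.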
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
    Episode art by Hamish Doodles: hamishdoodles.com
     
    Topics we discuss, and timestamps:
     - 0:00:34 - Cooperative AI
       - 0:06:21 - Cooperative AI vs standard game theory
       - 0:19:45 - Do we need cooperative AI if we get alignment?
       - 0:29:29 - Cooperative AI and agent foundations
     - 0:34:59 - A Theory of Bounded Inductive Rationality
       - 0:50:05 - Why it matters
       - 0:53:55 - How the theory works
       - 1:01:38 - Relationship to logical inductors
       - 1:15:56 - How fast does it converge?
       - 1:19:46 - Non-myopic bounded rational inductive agents?
       - 1:24:25 - Relationship to game theory
     - 1:30:39 - Safe Pareto Improvements
       - 1:30:39 - What they try to solve
       - 1:36:15 - Alternative solutions
       - 1:40:46 - How safe Pareto improvements work
       - 1:51:19 - Will players fight over which safe Pareto improvement to adopt?
       - 2:06:02 - Relationship to program equilibrium
       - 2:11:25 - Do safe Pareto improvements break themselves?
     - 2:15:52 - Similarity-based Cooperation
       - 2:23:07 - Are similarity-based cooperators overly cliqueish?
       - 2:27:12 - Sensitivity to noise
       - 2:29:41 - Training neural nets to do similarity-based cooperation
     - 2:50:25 - FOCAL, Caspar's research lab
     - 2:52:52 - How the papers all relate
     - 2:57:49 - Relationship to functional decision theory
     - 2:59:45 - Following Caspar's research
     
    The transcript: axrp.net/episode/2023/10/03/episode-25-cooperative-ai-caspar-oesterheld.html
     
    Links for Caspar:
     - FOCAL at CMU: www.cs.cmu.edu/~focal/
     - Caspar on X, formerly known as Twitter: twitter.com/C_Oesterheld
     - Caspar's blog: casparoesterheld.com/
     - Caspar on Google Scholar: scholar.google.com/citations?user=xeEcRjkAAAAJ&hl=en&oi=ao
     
    Research we discuss:
     - A Theory of Bounded Inductive Rationality: arxiv.org/abs/2307.05068
     - Safe Pareto improvements for delegated game playing: link.springer.com/article/10.1007/s10458-022-09574-6
     - Similarity-based Cooperation: arxiv.org/abs/2211.14468
     - Logical Induction: arxiv.org/abs/1609.03543
     - Program Equilibrium: citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e1a060cda74e0e3493d0d81901a5a796158c8410
     - Formalizing Objections against Surrogate Goals: www.alignmentforum.org/posts/K4FrKRTrmyxrw5Dip/formalizing-objections-against-surrogate-goals
     - Learning with Opponent-Learning Awareness: arxiv.org/abs/1709.04326

    • 3 hr 2 min
    24 - Superalignment with Jan Leike

    Recently, OpenAI made a splash by announcing a new "Superalignment" team. Led by Jan Leike and Ilya Sutskever, the team would consist of top researchers attempting to solve alignment for superintelligent AIs in four years by figuring out how to build a trustworthy human-level AI alignment researcher, and then using it to solve the rest of the problem. But what does this plan actually involve? In this episode, I talk to Jan Leike about the plan and the challenges it faces.
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
    Episode art by Hamish Doodles: hamishdoodles.com/
     
    Topics we discuss, and timestamps:
     - 0:00:37 - The superalignment team
     - 0:02:10 - What's a human-level automated alignment researcher?
       - 0:06:59 - The gap between human-level automated alignment researchers and superintelligence
       - 0:18:39 - What does it do?
       - 0:24:13 - Recursive self-improvement
     - 0:26:14 - How to make the AI AI alignment researcher
       - 0:30:09 - Scalable oversight
       - 0:44:38 - Searching for bad behaviors and internals
       - 0:54:14 - Deliberately training misaligned models
     - 1:02:34 - Four year deadline
       - 1:07:06 - What if it takes longer?
     - 1:11:38 - The superalignment team and...
       - 1:11:38 - ... governance
       - 1:14:37 - ... other OpenAI teams
       - 1:18:17 - ... other labs
     - 1:26:10 - Superalignment team logistics
     - 1:29:17 - Generalization
     - 1:43:44 - Complementary research
     - 1:48:29 - Why is Jan optimistic?
       - 1:58:32 - Long-term agency in LLMs?
       - 2:02:44 - Do LLMs understand alignment?
     - 2:06:01 - Following Jan's research
     
    The transcript: axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html
     
    Links for Jan and OpenAI:
     - OpenAI jobs: openai.com/careers
     - Jan's substack: aligned.substack.com
     - Jan's twitter: twitter.com/janleike
     
    Links to research and other writings we discuss:
     - Introducing Superalignment: openai.com/blog/introducing-superalignment
     - Let's Verify Step by Step (process-based feedback on math): arxiv.org/abs/2305.20050
     - Planning for AGI and beyond: openai.com/blog/planning-for-agi-and-beyond
     - Self-critiquing models for assisting human evaluators: arxiv.org/abs/2206.05802
     - An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143
     - Language models can explain neurons in language models: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
     - Our approach to alignment research: openai.com/blog/our-approach-to-alignment-research
     - Training language models to follow instructions with human feedback (aka the InstructGPT paper): arxiv.org/abs/2203.02155

    • 2 hr 8 min
