AXRP - the AI X-risk Research Podcast

Daniel Filan

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.

  1. 46 - Tom Davidson on AI-enabled Coups

    AUG 7

    46 - Tom Davidson on AI-enabled Coups

    Could AI enable a small group to gain power over a large country, and lock in their power permanently? Often, people worried about catastrophic risks from AI have been concerned with misalignment risks. In this episode, Tom Davidson talks about a risk that could be comparably important: that of AI-enabled coups.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/08/07/episode-46-tom-davidson-ai-enabled-coups.html

    Topics we discuss, and timestamps:
    0:00:35 How to stage a coup without AI
    0:16:17 Why AI might enable coups
    0:33:29 How bad AI-enabled coups are
    0:37:28 Executive coups with singularly loyal AIs
    0:48:35 Executive coups with exclusive access to AI
    0:54:41 Corporate AI-enabled coups
    0:57:56 Secret loyalty and misalignment in corporate coups
    1:11:39 Likelihood of different types of AI-enabled coups
    1:25:52 How to prevent AI-enabled coups
    1:33:43 Downsides of AIs loyal to the law
    1:41:06 Cultural shifts vs individual action
    1:45:53 Technical research to prevent AI-enabled coups
    1:51:40 Non-technical research to prevent AI-enabled coups
    1:58:17 Forethought
    2:03:03 Following Tom's and Forethought's research

    Links for Tom and Forethought:
    Tom on X / Twitter: https://x.com/tomdavidsonx
    Tom on LessWrong: https://www.lesswrong.com/users/tom-davidson-1
    Forethought Substack: https://newsletter.forethought.org/
    Will MacAskill on X / Twitter: https://x.com/willmacaskill
    Will MacAskill on LessWrong: https://www.lesswrong.com/users/wdmacaskill

    Research we discuss:
    AI-Enabled Coups: How a Small Group Could Use AI to Seize Power: https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power
    Seizing Power: The Strategic Logic of Military Coups, by Naunihal Singh: https://muse.jhu.edu/book/31450
    Experiment using AI-generated posts on Reddit draws fire for ethics concerns: https://retractionwatch.com/2025/04/28/experiment-using-ai-generated-posts-on-reddit-draws-fire-for-ethics-concerns/

    Episode art by Hamish Doodles: hamishdoodles.com

    2h 5m
  2. 44 - Peter Salib on AI Rights for Human Safety

    JUN 28

    44 - Peter Salib on AI Rights for Human Safety

    In this episode, I talk with Peter Salib about his paper "AI Rights for Human Safety", arguing that giving AIs the right to contract, hold property, and sue people will reduce the risk of their trying to attack humanity and take over. He also tells me how law reviews work, in the face of my incredulity.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/06/28/episode-44-peter-salib-ai-rights-human-safety.html

    Topics we discuss, and timestamps:
    0:00:40 Why AI rights
    0:18:34 Why not reputation
    0:27:10 Do AI rights lead to AI war?
    0:36:42 Scope for human-AI trade
    0:44:25 Concerns with comparative advantage
    0:53:42 Proxy AI wars
    0:57:56 Can companies profitably make AIs with rights?
    1:09:43 Can we have AI rights and AI safety measures?
    1:24:31 Liability for AIs with rights
    1:38:29 Which AIs get rights?
    1:43:36 AI rights and stochastic gradient descent
    1:54:54 Individuating "AIs"
    2:03:28 Social institutions for AI safety
    2:08:20 Outer misalignment and trading with AIs
    2:15:27 Why statutes of limitations should exist
    2:18:39 Starting AI x-risk research in legal academia
    2:24:18 How law reviews and AI conferences work
    2:41:49 More on Peter moving to AI x-risk research
    2:45:37 Reception of the paper
    2:53:24 What publishing in law reviews does
    3:04:48 Which parts of legal academia focus on AI
    3:18:03 Following Peter's research

    Links for Peter:
    Personal website: https://www.peternsalib.com/
    Writings at Lawfare: https://www.lawfaremedia.org/contributors/psalib
    CLAIR: https://clair-ai.org/

    Research we discuss:
    AI Rights for Human Safety: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4913167
    Will humans and AIs go to war? https://philpapers.org/rec/GOLWAA
    Infrastructure for AI agents: https://arxiv.org/abs/2501.10114
    Governing AI Agents: https://arxiv.org/abs/2501.07913

    Episode art by Hamish Doodles: hamishdoodles.com

    3h 22m
  3. 43 - David Lindner on Myopic Optimization with Non-myopic Approval

    JUN 15

    43 - David Lindner on Myopic Optimization with Non-myopic Approval

    In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservatism? Listen to find out.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/06/15/episode-43-david-lindner-mona.html

    Topics we discuss, and timestamps:
    0:00:29 What MONA is
    0:06:33 How MONA deals with reward hacking
    0:23:15 Failure cases for MONA
    0:36:25 MONA's capability
    0:55:40 MONA vs other approaches
    1:05:03 Follow-up work
    1:10:17 Other MONA test cases
    1:33:47 When increasing time horizon doesn't increase capability
    1:39:04 Following David's research

    Links for David:
    Website: https://www.davidlindner.me
    Twitter / X: https://x.com/davlindner
    DeepMind Medium: https://deepmindsafetyresearch.medium.com
    David on the Alignment Forum: https://www.alignmentforum.org/users/david-lindner

    Research we discuss:
    MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking: https://arxiv.org/abs/2501.13011
    Arguments Against Myopic Training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training

    Episode art by Hamish Doodles: hamishdoodles.com

    1h 41m
  4. 42 - Owain Evans on LLM Psychology

    JUN 6

    42 - Owain Evans on LLM Psychology

    Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligned' training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he's done to understand the psychology of large language models.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/06/06/episode-42-owain-evans-llm-psychology.html

    Topics we discuss, and timestamps:
    0:00:37 Why introspection?
    0:06:24 Experiments in "Looking Inward"
    0:15:11 Why fine-tune for introspection?
    0:22:32 Does "Looking Inward" test introspection, or something else?
    0:34:14 Interpreting the results of "Looking Inward"
    0:44:56 Limitations to introspection?
    0:49:54 "Tell me about yourself", and its relation to other papers
    1:05:45 Backdoor results
    1:12:01 Emergent Misalignment
    1:22:13 Why so hammy, and so infrequently evil?
    1:36:31 Why emergent misalignment?
    1:46:45 Emergent misalignment and other types of misalignment
    1:53:57 Is emergent misalignment good news?
    2:00:01 Follow-up work to "Emergent Misalignment"
    2:03:10 Reception of "Emergent Misalignment" vs other papers
    2:07:43 Evil numbers
    2:12:20 Following Owain's research

    Links for Owain:
    Truthful AI: https://www.truthfulai.org
    Owain's website: https://owainevans.github.io/
    Owain's twitter/X account: https://twitter.com/OwainEvans_UK

    Research we discuss:
    Looking Inward: Language Models Can Learn About Themselves by Introspection: https://arxiv.org/abs/2410.13787
    Tell me about yourself: LLMs are aware of their learned behaviors: https://arxiv.org/abs/2501.11120
    Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data: https://arxiv.org/abs/2406.14546
    Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs: https://arxiv.org/abs/2502.17424
    X/Twitter thread of GPT-4.1 emergent misalignment results: https://x.com/OwainEvans_UK/status/1912701650051190852
    Taken out of context: On measuring situational awareness in LLMs: https://arxiv.org/abs/2309.00667

    Episode art by Hamish Doodles: hamishdoodles.com

    2h 14m
  5. 41 - Lee Sharkey on Attribution-based Parameter Decomposition

    JUN 3

    41 - Lee Sharkey on Attribution-based Parameter Decomposition

    What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html

    Topics we discuss, and timestamps:
    0:00:41 APD basics
    0:07:57 Faithfulness
    0:11:10 Minimality
    0:28:44 Simplicity
    0:34:50 Concrete-ish examples of APD
    0:52:00 Which parts of APD are canonical
    0:58:10 Hyperparameter selection
    1:06:40 APD in toy models of superposition
    1:14:40 APD and compressed computation
    1:25:43 Mechanisms vs representations
    1:34:41 Future applications of APD?
    1:44:19 How costly is APD?
    1:49:14 More on minimality training
    1:51:49 Follow-up work
    2:05:24 APD on giant chain-of-thought models?
    2:11:27 APD and "features"
    2:14:11 Following Lee's work

    Lee links (Leenks):
    X/Twitter: https://twitter.com/leedsharkey
    Alignment Forum: https://www.alignmentforum.org/users/lee_sharkey

    Research we discuss:
    Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition: https://arxiv.org/abs/2501.14926
    Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html
    Towards a unified and verified understanding of group-operation networks: https://arxiv.org/abs/2410.07476
    SAE feature geometry is outside the superposition hypothesis: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis

    Episode art by Hamish Doodles: hamishdoodles.com

    2h 16m
  6. 40 - Jason Gross on Compact Proofs and Interpretability

    MAR 28

    40 - Jason Gross on Compact Proofs and Interpretability

    How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing. In this episode, I speak with Jason Gross about his agenda to benchmark interpretability in this way, and his exploration of the intersection of proofs and modern machine learning.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/03/28/episode-40-jason-gross-compact-proofs-interpretability.html

    Topics we discuss, and timestamps:
    0:00:40 - Why compact proofs
    0:07:25 - Compact Proofs of Model Performance via Mechanistic Interpretability
    0:14:19 - What compact proofs look like
    0:32:43 - Structureless noise, and why proofs
    0:48:23 - What we've learned about compact proofs in general
    0:59:02 - Generalizing 'symmetry'
    1:11:24 - Grading mechanistic interpretability
    1:43:34 - What helps compact proofs
    1:51:08 - The limits of compact proofs
    2:07:33 - Guaranteed safe AI, and AI for guaranteed safety
    2:27:44 - Jason and Rajashree's start-up
    2:34:19 - Following Jason's work

    Links to Jason:
    Github: https://github.com/jasongross
    Website: https://jasongross.github.io
    Alignment Forum: https://www.alignmentforum.org/users/jason-gross

    Links to work we discuss:
    Compact Proofs of Model Performance via Mechanistic Interpretability: https://arxiv.org/abs/2406.11779
    Unifying and Verifying Mechanistic Interpretability: A Case Study with Group Operations: https://arxiv.org/abs/2410.07476
    Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration: https://arxiv.org/abs/2412.03773
    Stage-Wise Model Diffing: https://transformer-circuits.pub/2024/model-diffing/index.html
    Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
    Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition (aka the Apollo paper on APD): https://arxiv.org/abs/2501.14926
    Towards Guaranteed Safe AI: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-45.pdf

    Episode art by Hamish Doodles: hamishdoodles.com

    2h 36m
