AXRP - the AI X-risk Research Podcast

Daniel Filan

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.

  1. 48 - Guive Assadi on AI Property Rights

    2D AGO

    In this episode, Guive Assadi argues that we should give AIs property rights, so that they are integrated into our system of property and come to rely on it. The claim is that AIs would then not kill humans or steal from them, because doing so would undermine the whole property system, which would be extremely valuable to them.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2026/02/15/episode-48-guive-assadi-ai-property-rights.html

    Topics we discuss, and timestamps:
    0:00:28 AI property rights
    0:08:01 Why not steal from and kill humans
    0:15:25 Why AIs may fear it could be them next
    0:20:56 AI retirement
    0:23:28 Could humans be upgraded to stay useful?
    0:26:41 Will AI progress continue?
    0:30:00 Why non-obsoletable AIs may still not end human property rights
    0:38:35 Why make AIs with property rights?
    0:48:01 Do property rights incentivize alignment?
    0:50:09 Humans and non-human property rights
    1:02:18 Humans and non-human bodily autonomy
    1:16:59 Step changes in coordination ability
    1:24:39 Acausal coordination
    1:32:37 AI, humans, and civilizations with different technology levels
    1:41:39 The case of British settlers and Tasmanians
    1:47:22 Non-total expropriation
    1:53:47 How Guive thinks x-risk could happen, and other loose ends
    2:03:46 Following Guive's work

    Guive on Substack: https://guive.substack.com/
    Guive on X/Twitter: https://x.com/GuiveAssadi

    Research we discuss:
    The Case for AI Property Rights: https://guive.substack.com/p/the-case-for-ai-property-rights
    AXRP Episode 44 - Peter Salib on AI Rights for Human Safety: https://axrp.net/episode/2025/06/28/episode-44-peter-salib-ai-rights-human-safety.html
    AI Rights for Human Safety (by Salib and Goldstein): https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4913167
    We don't trade with ants: https://worldspiritsockpuppet.substack.com/p/we-dont-trade-with-ants
    Alignment Fine-tuning is Character Writing (on Claude as a techy philosophy SF-dwelling type): https://guive.substack.com/p/alignment-fine-tuning-is-character
    Claude's character (Anthropic post on character training): https://www.anthropic.com/research/claude-character
    Git Re-Basin: Merging Models modulo Permutation Symmetries: https://arxiv.org/abs/2209.04836
    The Filan Cabinet: Caspar Oesterheld on Evidential Cooperation in Large Worlds: https://thefilancabinet.com/episodes/2025/08/03/caspar-oesterheld-on-evidential-cooperation-in-large-worlds-ecl.html

    Episode art by Hamish Doodles: hamishdoodles.com

    2h 6m
  2. 47 - David Rein on METR Time Horizons

    JAN 2

    When METR says something like "Claude Opus 4.5 has a 50% time horizon of 4 hours and 50 minutes", what does that mean? In this episode, David Rein, METR researcher and co-author of the paper "Measuring AI ability to complete long tasks", talks about METR's work on measuring time horizons, the methodology behind those numbers, and what work remains to be done in this domain.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2026/01/03/episode-47-david-rein-metr-time-horizons.html

    Topics we discuss, and timestamps:
    0:00:32 Measuring AI Ability to Complete Long Tasks
    0:10:54 The meaning of "task length"
    0:19:27 Examples of intermediate and hard tasks
    0:25:12 Why the software engineering focus
    0:32:17 Why task length as difficulty measure
    0:46:32 Is AI progress going superexponential?
    0:50:58 Is AI progress due to increased cost to run models?
    0:54:45 Why METR measures model capabilities
    1:04:10 How time horizons relate to recursive self-improvement
    1:12:58 Cost of estimating time horizons
    1:16:23 Task realism vs mimicking important task features
    1:19:50 Excursus on "Inventing Temperature"
    1:25:46 Return to task realism discussion
    1:33:53 Open questions on time horizons

    Links for METR:
    Main website: https://metr.org/
    X/Twitter account: https://x.com/METR_Evals/

    Research we discuss:
    Measuring AI Ability to Complete Long Tasks: https://arxiv.org/abs/2503.14499
    RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts: https://arxiv.org/abs/2411.15114
    HCAST: Human-Calibrated Autonomy Software Tasks: https://arxiv.org/abs/2503.17354
    Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity: https://arxiv.org/abs/2507.09089
    Anthropic Economic Index: Tracking AI's role in the US and global economy: https://www.anthropic.com/research/anthropic-economic-index-september-2025-report
    Bridging RL Theory and Practice with the Effective Horizon (i.e. the Cassidy Laidlaw paper): https://arxiv.org/abs/2304.09853
    How Does Time Horizon Vary Across Domains?: https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains/
    Inventing Temperature: https://global.oup.com/academic/product/inventing-temperature-9780195337389
    Is there a Half-Life for the Success Rates of AI Agents? (by Toby Ord): https://www.tobyord.com/writing/half-life
    Lawrence Chan's response to the above: https://nitter.net/justanotherlaw/status/1920254586771710009
    AI Task Length Horizons in Offensive Cybersecurity: https://sean-peters-au.github.io/2025/07/02/ai-task-length-horizons-in-offensive-cybersecurity.html

    Episode art by Hamish Doodles: hamishdoodles.com

    1h 47m
  3. 46 - Tom Davidson on AI-enabled Coups

    08/07/2025

    Could AI enable a small group to gain power over a large country, and lock in their power permanently? People worried about catastrophic risks from AI have often focused on misalignment. In this episode, Tom Davidson talks about a risk that could be comparably important: that of AI-enabled coups.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/08/07/episode-46-tom-davidson-ai-enabled-coups.html

    Topics we discuss, and timestamps:
    0:00:35 How to stage a coup without AI
    0:16:17 Why AI might enable coups
    0:33:29 How bad AI-enabled coups are
    0:37:28 Executive coups with singularly loyal AIs
    0:48:35 Executive coups with exclusive access to AI
    0:54:41 Corporate AI-enabled coups
    0:57:56 Secret loyalty and misalignment in corporate coups
    1:11:39 Likelihood of different types of AI-enabled coups
    1:25:52 How to prevent AI-enabled coups
    1:33:43 Downsides of AIs loyal to the law
    1:41:06 Cultural shifts vs individual action
    1:45:53 Technical research to prevent AI-enabled coups
    1:51:40 Non-technical research to prevent AI-enabled coups
    1:58:17 Forethought
    2:03:03 Following Tom's and Forethought's research

    Links for Tom and Forethought:
    Tom on X / Twitter: https://x.com/tomdavidsonx
    Tom on LessWrong: https://www.lesswrong.com/users/tom-davidson-1
    Forethought Substack: https://newsletter.forethought.org/
    Will MacAskill on X / Twitter: https://x.com/willmacaskill
    Will MacAskill on LessWrong: https://www.lesswrong.com/users/wdmacaskill

    Research we discuss:
    AI-Enabled Coups: How a Small Group Could Use AI to Seize Power: https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power
    Seizing Power: The Strategic Logic of Military Coups, by Naunihal Singh: https://muse.jhu.edu/book/31450
    Experiment using AI-generated posts on Reddit draws fire for ethics concerns: https://retractionwatch.com/2025/04/28/experiment-using-ai-generated-posts-on-reddit-draws-fire-for-ethics-concerns/

    Episode art by Hamish Doodles: hamishdoodles.com

    2h 5m
  4. 44 - Peter Salib on AI Rights for Human Safety

    06/28/2025

    In this episode, I talk with Peter Salib about his paper "AI Rights for Human Safety", arguing that giving AIs the right to contract, hold property, and sue people will reduce the risk that they try to attack humanity and take over. He also tells me how law reviews work, in the face of my incredulity.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/06/28/episode-44-peter-salib-ai-rights-human-safety.html

    Topics we discuss, and timestamps:
    0:00:40 Why AI rights
    0:18:34 Why not reputation
    0:27:10 Do AI rights lead to AI war?
    0:36:42 Scope for human-AI trade
    0:44:25 Concerns with comparative advantage
    0:53:42 Proxy AI wars
    0:57:56 Can companies profitably make AIs with rights?
    1:09:43 Can we have AI rights and AI safety measures?
    1:24:31 Liability for AIs with rights
    1:38:29 Which AIs get rights?
    1:43:36 AI rights and stochastic gradient descent
    1:54:54 Individuating "AIs"
    2:03:28 Social institutions for AI safety
    2:08:20 Outer misalignment and trading with AIs
    2:15:27 Why statutes of limitations should exist
    2:18:39 Starting AI x-risk research in legal academia
    2:24:18 How law reviews and AI conferences work
    2:41:49 More on Peter moving to AI x-risk research
    2:45:37 Reception of the paper
    2:53:24 What publishing in law reviews does
    3:04:48 Which parts of legal academia focus on AI
    3:18:03 Following Peter's research

    Links for Peter:
    Personal website: https://www.peternsalib.com/
    Writings at Lawfare: https://www.lawfaremedia.org/contributors/psalib
    CLAIR: https://clair-ai.org/

    Research we discuss:
    AI Rights for Human Safety: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4913167
    Will humans and AIs go to war? https://philpapers.org/rec/GOLWAA
    Infrastructure for AI agents: https://arxiv.org/abs/2501.10114
    Governing AI Agents: https://arxiv.org/abs/2501.07913

    Episode art by Hamish Doodles: hamishdoodles.com

    3h 22m
  5. 43 - David Lindner on Myopic Optimization with Non-myopic Approval

    06/15/2025

    In this episode, I talk with David Lindner about Myopic Optimization with Non-myopic Approval, or MONA, which attempts to address (multi-step) reward hacking by myopically optimizing actions against a human's sense of whether those actions are generally good. Does this work? Can we get smarter-than-human AI this way? How does this compare to approaches like conservatism? Listen to find out.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/06/15/episode-43-david-lindner-mona.html

    Topics we discuss, and timestamps:
    0:00:29 What MONA is
    0:06:33 How MONA deals with reward hacking
    0:23:15 Failure cases for MONA
    0:36:25 MONA's capability
    0:55:40 MONA vs other approaches
    1:05:03 Follow-up work
    1:10:17 Other MONA test cases
    1:33:47 When increasing time horizon doesn't increase capability
    1:39:04 Following David's research

    Links for David:
    Website: https://www.davidlindner.me
    Twitter / X: https://x.com/davlindner
    DeepMind Medium: https://deepmindsafetyresearch.medium.com
    David on the Alignment Forum: https://www.alignmentforum.org/users/david-lindner

    Research we discuss:
    MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking: https://arxiv.org/abs/2501.13011
    Arguments Against Myopic Training: https://www.alignmentforum.org/posts/GqxuDtZvfgL2bEQ5v/arguments-against-myopic-training

    Episode art by Hamish Doodles: hamishdoodles.com

    1h 41m
  6. 42 - Owain Evans on LLM Psychology

    06/06/2025

    Earlier this year, the paper "Emergent Misalignment" made the rounds on AI x-risk social media for seemingly showing LLMs generalizing from 'misaligned' training data of insecure code to acting comically evil in response to innocuous questions. In this episode, I chat with one of the authors of that paper, Owain Evans, about that research as well as other work he's done to understand the psychology of large language models.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/06/06/episode-42-owain-evans-llm-psychology.html

    Topics we discuss, and timestamps:
    0:00:37 Why introspection?
    0:06:24 Experiments in "Looking Inward"
    0:15:11 Why fine-tune for introspection?
    0:22:32 Does "Looking Inward" test introspection, or something else?
    0:34:14 Interpreting the results of "Looking Inward"
    0:44:56 Limitations to introspection?
    0:49:54 "Tell me about yourself", and its relation to other papers
    1:05:45 Backdoor results
    1:12:01 Emergent Misalignment
    1:22:13 Why so hammy, and so infrequently evil?
    1:36:31 Why emergent misalignment?
    1:46:45 Emergent misalignment and other types of misalignment
    1:53:57 Is emergent misalignment good news?
    2:00:01 Follow-up work to "Emergent Misalignment"
    2:03:10 Reception of "Emergent Misalignment" vs other papers
    2:07:43 Evil numbers
    2:12:20 Following Owain's research

    Links for Owain:
    Truthful AI: https://www.truthfulai.org
    Owain's website: https://owainevans.github.io/
    Owain's twitter/X account: https://twitter.com/OwainEvans_UK

    Research we discuss:
    Looking Inward: Language Models Can Learn About Themselves by Introspection: https://arxiv.org/abs/2410.13787
    Tell me about yourself: LLMs are aware of their learned behaviors: https://arxiv.org/abs/2501.11120
    Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data: https://arxiv.org/abs/2406.14546
    Emergent Misalignment: Narrow fine-tuning can produce broadly misaligned LLMs: https://arxiv.org/abs/2502.17424
    X/Twitter thread of GPT-4.1 emergent misalignment results: https://x.com/OwainEvans_UK/status/1912701650051190852
    Taken out of context: On measuring situational awareness in LLMs: https://arxiv.org/abs/2309.00667

    Episode art by Hamish Doodles: hamishdoodles.com

    2h 14m
  7. 41 - Lee Sharkey on Attribution-based Parameter Decomposition

    06/03/2025

    What's the next step forward in interpretability? In this episode, I chat with Lee Sharkey about his proposal for detecting computational mechanisms within neural networks: Attribution-based Parameter Decomposition, or APD for short.

    Patreon: https://www.patreon.com/axrpodcast
    Ko-fi: https://ko-fi.com/axrpodcast
    Transcript: https://axrp.net/episode/2025/06/03/episode-41-lee-sharkey-attribution-based-parameter-decomposition.html

    Topics we discuss, and timestamps:
    0:00:41 APD basics
    0:07:57 Faithfulness
    0:11:10 Minimality
    0:28:44 Simplicity
    0:34:50 Concrete-ish examples of APD
    0:52:00 Which parts of APD are canonical
    0:58:10 Hyperparameter selection
    1:06:40 APD in toy models of superposition
    1:14:40 APD and compressed computation
    1:25:43 Mechanisms vs representations
    1:34:41 Future applications of APD?
    1:44:19 How costly is APD?
    1:49:14 More on minimality training
    1:51:49 Follow-up work
    2:05:24 APD on giant chain-of-thought models?
    2:11:27 APD and "features"
    2:14:11 Following Lee's work

    Lee links (Leenks):
    X/Twitter: https://twitter.com/leedsharkey
    Alignment Forum: https://www.alignmentforum.org/users/lee_sharkey

    Research we discuss:
    Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-Based Parameter Decomposition: https://arxiv.org/abs/2501.14926
    Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html
    Towards a unified and verified understanding of group-operation networks: https://arxiv.org/abs/2410.07476
    Feature geometry is outside the superposition hypothesis: https://www.alignmentforum.org/posts/MFBTjb2qf3ziWmzz6/sae-feature-geometry-is-outside-the-superposition-hypothesis

    Episode art by Hamish Doodles: hamishdoodles.com

    2h 16m

