33 episodes

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.

    29 - Science of Deep Learning with Vikrant Varma

    In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.
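    For a concrete sense of what CCS does, here is a minimal sketch of the unsupervised objective from the original "Discovering latent knowledge" paper (linked below), written in PyTorch. The probe, the placeholder activations, and the training loop are illustrative assumptions, not code from either paper discussed in the episode.

```python
# Minimal sketch of the CCS ("Contrast-Consistent Search") objective from Burns et al.,
# "Discovering latent knowledge in language models without supervision" (linked below).
# `pos`/`neg` stand in for hidden activations of a statement phrased with "Yes" and with
# "No"; the random tensors are placeholders, not real model activations.
import torch

def ccs_loss(probe: torch.nn.Module, pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    p_pos = torch.sigmoid(probe(pos)).squeeze(-1)   # probability the "Yes" version is true
    p_neg = torch.sigmoid(probe(neg)).squeeze(-1)   # probability the "No" version is true
    consistency = (p_pos - (1.0 - p_neg)) ** 2      # the two should be negations of each other
    confidence = torch.minimum(p_pos, p_neg) ** 2   # rule out the degenerate "always 0.5" probe
    return (consistency + confidence).mean()

hidden_dim = 4096
probe = torch.nn.Linear(hidden_dim, 1)              # a linear probe on the activations
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

# Placeholder contrast-pair activations; the paper also normalizes activations per prompt
# template, which is omitted here.
pos = torch.randn(256, hidden_dim)
neg = torch.randn(256, hidden_dim)

for _ in range(1000):
    optimizer.zero_grad()
    loss = ccs_loss(probe, pos, neg)
    loss.backward()
    optimizer.step()
```

    Notably, nothing in this objective mentions truth: the episode's discussion of consistent and contrastive features other than model beliefs, and of the banana/shed results, is about what else can satisfy it.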
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
     
    Topics we discuss, and timestamps:
    0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS
      0:00:36 - What is CCS?
      0:09:54 - Consistent and contrastive features other than model beliefs
      0:20:34 - Understanding the banana/shed mystery
      0:41:59 - Future CCS-like approaches
      0:53:29 - CCS as principal component analysis
    0:56:21 - Explaining grokking through circuit efficiency
      0:57:44 - Why research science of deep learning?
      1:12:07 - Summary of the paper's hypothesis
      1:14:05 - What are 'circuits'?
      1:20:48 - The role of complexity
      1:24:07 - Many kinds of circuits
      1:28:10 - How circuits are learned
      1:38:24 - Semi-grokking and ungrokking
      1:50:53 - Generalizing the results
    1:58:51 - Vikrant's research approach
    2:06:36 - The DeepMind alignment team
    2:09:06 - Follow-up work
     
    The transcript: axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html
    Vikrant's Twitter/X account: twitter.com/vikrantvarma_
     
    Main papers:
     - Challenges with unsupervised LLM knowledge discovery: arxiv.org/abs/2312.10029
     - Explaining grokking through circuit efficiency: arxiv.org/abs/2309.02390
     
    Other works discussed:
     - Discovering latent knowledge in language models without supervision (CCS): arxiv.org/abs/2212.03827
     - Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
     - Discussion: Challenges with unsupervised LLM knowledge discovery: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1
     - Comment thread on the banana/shed results: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1?commentId=hPZfgA3BdXieNfFuY
     - Fabien Roger, What discovering latent knowledge did and did not find: lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4
     - Scott Emmons, Contrast Pairs Drive the Performance of Contrast Consistent Search (CCS): lesswrong.com/posts/9vwekjD6xyuePX7Zr/contrast-pairs-drive-the-empirical-performance-of-contrast
     - Grokking: Generalizing Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177
     - Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton 1993): dl.acm.org/doi/pdf/10.1145/168304.168306
     - Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.05217
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 2 hr 13 min
    28 - Suing Labs for AI Risk with Gabriel Weil

    How should the law govern AI? Those concerned about existential risks often push either for bans or for regulations meant to ensure that AI is developed safely - but another approach is possible. In this episode, Gabriel Weil talks about his proposal to modify tort law to enable people to sue AI companies for disasters that are "nearly catastrophic".
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
     
    Topics we discuss, and timestamps:
    0:00:35 - The basic idea
    0:20:36 - Tort law vs regulation
    0:29:10 - Weil's proposal vs Hanson's proposal
    0:37:00 - Tort law vs Pigouvian taxation
    0:41:16 - Does disagreement on AI risk make this proposal less effective?
    0:49:53 - Warning shots - their prevalence and character
    0:59:17 - Feasibility of big changes to liability law
    1:29:17 - Interactions with other areas of law
    1:38:59 - How Gabriel encountered the AI x-risk field
    1:42:41 - AI x-risk and the legal field
    1:47:44 - Technical research to help with this proposal
    1:50:47 - Decisions this proposal could influence
    1:55:34 - Following Gabriel's research
     
    The transcript: axrp.net/episode/2024/04/17/episode-28-tort-law-for-ai-risk-gabriel-weil.html
     
    Links for Gabriel:
     - SSRN page: papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=1648032
     - Twitter/X account: twitter.com/gabriel_weil
     
    Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence: papers.ssrn.com/sol3/papers.cfm?abstract_id=4694006
     
    Other links:
     - Foom liability: overcomingbias.com/p/foom-liability
     - Punitive Damages: An Economic Analysis: law.harvard.edu/faculty/shavell/pdf/111_Harvard_Law_Rev_869.pdf
     - Efficiency, Fairness, and the Externalization of Reasonable Risks: The Problem With the Learned Hand Formula: papers.ssrn.com/sol3/papers.cfm?abstract_id=4466197
     - Tort Law Can Play an Important Role in Mitigating AI Risk: forum.effectivealtruism.org/posts/epKBmiyLpZWWFEYDb/tort-law-can-play-an-important-role-in-mitigating-ai-risk
     - How Technical AI Safety Researchers Can Help Implement Punitive Damages to Mitigate Catastrophic AI Risk: forum.effectivealtruism.org/posts/yWKaBdBygecE42hFZ/how-technical-ai-safety-researchers-can-help-implement
     - Can the courts save us from dangerous AI? [Vox]: vox.com/future-perfect/2024/2/7/24062374/ai-openai-anthropic-deepmind-legal-liability-gabriel-weil
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 1 hr 57 min
    27 - AI Control with Buck Shlegeris and Ryan Greenblatt

    A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to.
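    For something concrete, below is a rough sketch of a "trusted monitoring" style control protocol in the spirit of the AI Control paper linked below: a powerful untrusted model does the work, a weaker trusted model scores how suspicious each output looks, and only the most suspicious outputs get scarce human auditing. The function names, model calls, and audit fraction are hypothetical stand-ins, not the paper's code or numbers.

```python
# Rough sketch of a "trusted monitoring" control protocol, in the spirit of
# "AI Control: Improving Safety Despite Intentional Subversion" (linked below).
# All model-calling functions are hypothetical stand-ins, not a real API.
from dataclasses import dataclass

@dataclass
class Submission:
    task: str
    code: str
    suspicion: float  # trusted monitor's estimate that the code is backdoored

def untrusted_solve(task: str) -> str:
    """Hypothetical call to the powerful but possibly misaligned model."""
    raise NotImplementedError

def trusted_suspicion_score(code: str) -> float:
    """Hypothetical call to a weaker, trusted model that rates suspiciousness in [0, 1]."""
    raise NotImplementedError

def human_audit(sub: Submission) -> None:
    """Hypothetical scarce human review that catches and blocks backdoored code."""
    raise NotImplementedError

def run_protocol(tasks: list[str], audit_fraction: float = 0.02) -> list[Submission]:
    """Let the untrusted model do the work, then audit only the most suspicious outputs."""
    subs = []
    for task in tasks:
        code = untrusted_solve(task)
        subs.append(Submission(task, code, trusted_suspicion_score(code)))
    # With a small audit budget, send only the top-scoring submissions to humans.
    n_audit = max(1, int(audit_fraction * len(subs)))
    for sub in sorted(subs, key=lambda s: s.suspicion, reverse=True)[:n_audit]:
        human_audit(sub)
    return subs
```

    The questions the episode digs into include whether a red team playing the untrusted model can sneak backdoors past this kind of monitoring, and how expensive such monitoring is in practice.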
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
     
    Topics we discuss, and timestamps:
    0:00:31 - What is AI control?
    0:16:16 - Protocols for AI control
    0:22:43 - Which AIs are controllable?
    0:29:56 - Preventing dangerous coded AI communication
    0:40:42 - Unpredictably uncontrollable AI
    0:58:01 - What control looks like
    1:08:45 - Is AI control evil?
    1:24:42 - Can red teams match misaligned AI?
    1:36:51 - How expensive is AI monitoring?
    1:52:32 - AI control experiments
    2:03:50 - GPT-4's aptitude at inserting backdoors
    2:14:50 - How AI control relates to the AI safety field
    2:39:25 - How AI control relates to previous Redwood Research work
    2:49:16 - How people can work on AI control
    2:54:07 - Following Buck and Ryan's research
     
    The transcript: axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html
    Links for Buck and Ryan:
     - Buck's twitter/X account: twitter.com/bshlgrs
     - Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt
     - You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com
     
    Main research works we talk about:
     - The case for ensuring that powerful AIs are controlled: lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
     - AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942
     
    Other things we mention:
     - The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root
     - Preventing language models from hiding their reasoning: arxiv.org/abs/2310.18512
     - Improving the Welfare of AIs: A Nearcasted Proposal: lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal
     - Measuring coding challenge competence with APPS: arxiv.org/abs/2105.09938
     - Causal Scrubbing: a method for rigorously testing interpretability hypotheses: lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 2 hr 56 min
    26 - AI Governance with Elizabeth Seger

    The events of this year have highlighted important questions about the governance of artificial intelligence. For instance, what does it mean to democratize AI? And how should we balance the benefits and dangers of open-sourcing powerful AI systems such as large language models? In this episode, I speak with Elizabeth Seger about her research on these questions.
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
     
    Topics we discuss, and timestamps:
     - 0:00:40 - What kinds of AI?
     - 0:01:30 - Democratizing AI
       - 0:04:44 - How people talk about democratizing AI
       - 0:09:34 - Is democratizing AI important?
       - 0:13:31 - Links between types of democratization
       - 0:22:43 - Democratizing profits from AI
       - 0:27:06 - Democratizing AI governance
       - 0:29:45 - Normative underpinnings of democratization
     - 0:44:19 - Open-sourcing AI
       - 0:50:47 - Risks from open-sourcing
       - 0:56:07 - Should we make AI too dangerous to open source?
       - 1:00:33 - Offense-defense balance
       - 1:03:13 - KataGo as a case study
       - 1:09:03 - Openness for interpretability research
       - 1:15:47 - Effectiveness of substitutes for open sourcing
       - 1:20:49 - Offense-defense balance, part 2
       - 1:29:49 - Making open-sourcing safer?
     - 1:40:37 - AI governance research
       - 1:41:05 - The state of the field
       - 1:43:33 - Open questions
       - 1:49:58 - Distinctive governance issues of x-risk
       - 1:53:04 - Technical research to help governance
     - 1:55:23 - Following Elizabeth's research
     
    The transcript: https://axrp.net/episode/2023/11/26/episode-26-ai-governance-elizabeth-seger.html
     
    Links for Elizabeth:
     - Personal website: elizabethseger.com
     - Centre for the Governance of AI (AKA GovAI): governance.ai
     
    Main papers:
     - Democratizing AI: Multiple Meanings, Goals, and Methods: arxiv.org/abs/2303.12642
     - Open-sourcing highly capable foundation models: an evaluation of risks, benefits, and alternative methods for pursuing open source objectives: papers.ssrn.com/sol3/papers.cfm?abstract_id=4596436
     
    Other research we discuss:
     - What Do We Mean When We Talk About "AI democratisation"? (blog post): governance.ai/post/what-do-we-mean-when-we-talk-about-ai-democratisation
     - Democratic Inputs to AI (OpenAI): openai.com/blog/democratic-inputs-to-ai
     - Collective Constitutional AI: Aligning a Language Model with Public Input (Anthropic): anthropic.com/index/collective-constitutional-ai-aligning-a-language-model-with-public-input
     - Against "Democratizing AI": johanneshimmelreich.net/papers/against-democratizing-AI.pdf
     - Adversarial Policies Beat Superhuman Go AIs: goattack.far.ai
     - Structured access: an emerging paradigm for safe AI deployment: arxiv.org/abs/2201.05159
     - Universal and Transferable Adversarial Attacks on Aligned Language Models (aka Adversarial Suffixes): arxiv.org/abs/2307.15043
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 1 hr 57 min
    25 - Cooperative AI with Caspar Oesterheld

    Imagine a world where there are many powerful AI systems, working at cross purposes. You could suppose that different governments use AIs to manage their militaries, or simply that many powerful AIs have their own wills. At any rate, it seems valuable for them to be able to cooperatively work together and minimize pointless conflict. How do we ensure that AIs behave this way - and what do we need to learn about how rational agents interact to make that more clear? In this episode, I'll be speaking with Caspar Oesterheld about some of his research on this very topic.
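    As a toy illustration of one idea that comes up in the episode (program equilibrium), consider a one-shot prisoner's dilemma in which each player submits a program that can read the other program's source code before choosing to cooperate or defect. The sketch below uses illustrative names and is not code from any of the papers discussed.

```python
# Toy version of the program equilibrium idea (Tennenholtz's paper is linked below):
# each submitted program sees the other program's source before choosing to cooperate
# ("C") or defect ("D"). The classic construction cooperates exactly when the opponent
# is running the same program, which makes mutual cooperation stable.
import inspect

def clique_bot(my_source: str, opponent_source: str) -> str:
    """Cooperate exactly when the opponent is running the very same program."""
    return "C" if opponent_source == my_source else "D"

def defect_bot(my_source: str, opponent_source: str) -> str:
    """Always defect, regardless of the opponent."""
    return "D"

# (Row player, column player) payoffs for each pair of moves in the prisoner's dilemma.
PAYOFFS = {("C", "C"): (3, 3), ("C", "D"): (0, 5), ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(prog_a, prog_b) -> tuple:
    src_a, src_b = inspect.getsource(prog_a), inspect.getsource(prog_b)
    return PAYOFFS[(prog_a(src_a, src_b), prog_b(src_b, src_a))]

print(play(clique_bot, clique_bot))  # (3, 3): mutual cooperation
print(play(clique_bot, defect_bot))  # (1, 1): deviating to another program doesn't pay
```

    The episode's discussion of similarity-based cooperation asks, among other things, how to get this kind of mutual cooperation without being as cliqueish as exact source-code matching.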
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
    Episode art by Hamish Doodles: hamishdoodles.com
     
    Topics we discuss, and timestamps:
     - 0:00:34 - Cooperative AI
       - 0:06:21 - Cooperative AI vs standard game theory
       - 0:19:45 - Do we need cooperative AI if we get alignment?
       - 0:29:29 - Cooperative AI and agent foundations
     - 0:34:59 - A Theory of Bounded Inductive Rationality
       - 0:50:05 - Why it matters
       - 0:53:55 - How the theory works
       - 1:01:38 - Relationship to logical inductors
       - 1:15:56 - How fast does it converge?
       - 1:19:46 - Non-myopic bounded rational inductive agents?
       - 1:24:25 - Relationship to game theory
     - 1:30:39 - Safe Pareto Improvements
       - 1:30:39 - What they try to solve
       - 1:36:15 - Alternative solutions
       - 1:40:46 - How safe Pareto improvements work
       - 1:51:19 - Will players fight over which safe Pareto improvement to adopt?
       - 2:06:02 - Relationship to program equilibrium
       - 2:11:25 - Do safe Pareto improvements break themselves?
     - 2:15:52 - Similarity-based Cooperation
       - 2:23:07 - Are similarity-based cooperators overly cliqueish?
       - 2:27:12 - Sensitivity to noise
       - 2:29:41 - Training neural nets to do similarity-based cooperation
     - 2:50:25 - FOCAL, Caspar's research lab
     - 2:52:52 - How the papers all relate
     - 2:57:49 - Relationship to functional decision theory
     - 2:59:45 - Following Caspar's research
     
    The transcript: axrp.net/episode/2023/10/03/episode-25-cooperative-ai-caspar-oesterheld.html
     
    Links for Caspar:
     - FOCAL at CMU: www.cs.cmu.edu/~focal/
     - Caspar on X, formerly known as Twitter: twitter.com/C_Oesterheld
     - Caspar's blog: casparoesterheld.com/
     - Caspar on Google Scholar: scholar.google.com/citations?user=xeEcRjkAAAAJ&hl=en&oi=ao
     
    Research we discuss:
     - A Theory of Bounded Inductive Rationality: arxiv.org/abs/2307.05068
     - Safe Pareto improvements for delegated game playing: link.springer.com/article/10.1007/s10458-022-09574-6
     - Similarity-based Cooperation: arxiv.org/abs/2211.14468
     - Logical Induction: arxiv.org/abs/1609.03543
     - Program Equilibrium: citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=e1a060cda74e0e3493d0d81901a5a796158c8410
     - Formalizing Objections against Surrogate Goals: www.alignmentforum.org/posts/K4FrKRTrmyxrw5Dip/formalizing-objections-against-surrogate-goals
     - Learning with Opponent-Learning Awareness: arxiv.org/abs/1709.04326

    • 3 hr 2 min
    24 - Superalignment with Jan Leike

    Recently, OpenAI made a splash by announcing a new "Superalignment" team. Led by Jan Leike and Ilya Sutskever, the team would consist of top researchers attempting to solve alignment for superintelligent AIs in four years by figuring out how to build a trustworthy human-level AI alignment researcher, and then using it to solve the rest of the problem. But what does this plan actually involve? In this episode, I talk to Jan Leike about the plan and the challenges it faces.
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
    Episode art by Hamish Doodles: hamishdoodles.com/
     
    Topics we discuss, and timestamps:
     - 0:00:37 - The superalignment team
     - 0:02:10 - What's a human-level automated alignment researcher?
       - 0:06:59 - The gap between human-level automated alignment researchers and superintelligence
       - 0:18:39 - What does it do?
       - 0:24:13 - Recursive self-improvement
     - 0:26:14 - How to make the AI AI alignment researcher
       - 0:30:09 - Scalable oversight
       - 0:44:38 - Searching for bad behaviors and internals
       - 0:54:14 - Deliberately training misaligned models
     - 1:02:34 - Four year deadline
       - 1:07:06 - What if it takes longer?
     - 1:11:38 - The superalignment team and...
       - 1:11:38 - ... governance
       - 1:14:37 - ... other OpenAI teams
       - 1:18:17 - ... other labs
     - 1:26:10 - Superalignment team logistics
     - 1:29:17 - Generalization
     - 1:43:44 - Complementary research
     - 1:48:29 - Why is Jan optimistic?
       - 1:58:32 - Long-term agency in LLMs?
       - 2:02:44 - Do LLMs understand alignment?
     - 2:06:01 - Following Jan's research
     
    The transcript: axrp.net/episode/2023/07/27/episode-24-superalignment-jan-leike.html
     
    Links for Jan and OpenAI:
     - OpenAI jobs: openai.com/careers
     - Jan's substack: aligned.substack.com
     - Jan's twitter: twitter.com/janleike
     
    Links to research and other writings we discuss:
     - Introducing Superalignment: openai.com/blog/introducing-superalignment
     - Let's Verify Step by Step (process-based feedback on math): arxiv.org/abs/2305.20050
     - Planning for AGI and beyond: openai.com/blog/planning-for-agi-and-beyond
     - Self-critiquing models for assisting human evaluators: arxiv.org/abs/2206.05802
     - An Interpretability Illusion for BERT: arxiv.org/abs/2104.07143
     - Language models can explain neurons in language models: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
     - Our approach to alignment research: openai.com/blog/our-approach-to-alignment-research
     - Training language models to follow instructions with human feedback (aka the InstructGPT paper): arxiv.org/abs/2203.02155

    • 2 hr 8 min
