37 episodes

AXRP (pronounced axe-urp) is the AI X-risk Research Podcast where I, Daniel Filan, have conversations with researchers about their papers. We discuss the paper, and hopefully get a sense of why it's been written and how it might reduce the risk of AI causing an existential catastrophe: that is, permanently and drastically curtailing humanity's future potential. You can visit the website and read transcripts at axrp.net.

    33 - RLHF Problems with Scott Emmons

    Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk with Scott Emmons about his work categorizing the problems that can show up in this setting.
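    To make the setting concrete: reward learning from pairwise human comparisons is commonly fit with a Bradley-Terry model, and the wrinkle the paper studies is that the human's choices depend only on what they observe, not on the full state. Below is a minimal Python sketch of that objective; it is illustrative only, not code from the paper, and all names are mine.

    import numpy as np

    def bradley_terry_nll(theta, pref_pairs, obs_feats):
        """Negative log-likelihood of pairwise preferences under a linear
        reward r = theta . phi. Here phi is computed from what the human
        observes, so the fitted reward can diverge from the true state-based
        reward; that gap is the root of the failure modes discussed here."""
        nll = 0.0
        for i, j in pref_pairs:                    # human preferred i over j
            diff = theta @ (obs_feats[i] - obs_feats[j])
            nll += np.log1p(np.exp(-diff))         # -log sigmoid(r_i - r_j)
        return nll / len(pref_pairs)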
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
    The transcript: https://axrp.net/episode/2024/06/12/episode-33-rlhf-problems-scott-emmons.html
    Topics we discuss, and timestamps:
    0:00:33 - Deceptive inflation
    0:17:56 - Overjustification
    0:32:48 - Bounded human rationality
    0:50:46 - Avoiding these problems
    1:14:13 - Dimensional analysis
    1:23:32 - RLHF problems, in theory and practice
    1:31:29 - Scott's research program
    1:39:42 - Following Scott's research
     
    Scott's website: https://www.scottemmons.com
    Scott's X/twitter account: https://x.com/emmons_scott
    When Your AIs Deceive You: Challenges With Partial Observability of Human Evaluators in Reward Learning: https://arxiv.org/abs/2402.17747
     
    Other works we discuss:
    AI Deception: A Survey of Examples, Risks, and Potential Solutions: https://arxiv.org/abs/2308.14752
    Uncertain decisions facilitate better preference learning: https://arxiv.org/abs/2106.10394
    Invariance in Policy Optimisation and Partial Identifiability in Reward Learning: https://arxiv.org/abs/2203.07475
    The Humble Gaussian Distribution (aka principal component analysis and dimensional analysis): http://www.inference.org.uk/mackay/humble.pdf
    Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 1 hr 41 min
    32 - Understanding Agency with Jan Kulveit

    What's the difference between a large language model and the human brain? And what's wrong with our theories of agency? In this episode, I chat about these questions with Jan Kulveit, who leads the Alignment of Complex Systems research group.
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
    The transcript: axrp.net/episode/2024/05/30/episode-32-understanding-agency-jan-kulveit.html
    Topics we discuss, and timestamps:
    0:00:47 - What is active inference?
    0:15:14 - Preferences in active inference
    0:31:33 - Action vs perception in active inference
    0:46:07 - Feedback loops
    1:01:32 - Active inference vs LLMs
    1:12:04 - Hierarchical agency
    1:58:28 - The Alignment of Complex Systems group
     
    Website of the Alignment of Complex Systems group (ACS): acsresearch.org
    ACS on X/Twitter: x.com/acsresearchorg
    Jan on LessWrong: lesswrong.com/users/jan-kulveit
    Predictive Minds: Large Language Models as Atypical Active Inference Agents: arxiv.org/abs/2311.10215
     
    Other works we discuss:
    Active Inference: The Free Energy Principle in Mind, Brain, and Behavior: https://www.goodreads.com/en/book/show/58275959
    Book Review: Surfing Uncertainty: https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/
    The self-unalignment problem: https://www.lesswrong.com/posts/9GyniEBaN3YYTqZXn/the-self-unalignment-problem
    Mitigating generative agent social dilemmas (aka language models writing contracts for Minecraft): https://social-dilemmas.github.io/
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 2 hrs 22 min
    31 - Singular Learning Theory with Daniel Murfet

    What's going on with deep learning? What sorts of models get learned, and what are the learning dynamics? Singular learning theory is a theory of Bayesian statistics broad enough in scope to encompass deep neural networks, and it may help answer these questions. In this episode, I speak with Daniel Murfet about this research program and what it tells us.
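    For orientation: the core result the theory builds on, stated roughly and in my own notation, is Watanabe's asymptotic expansion of the Bayesian free energy,

    \[ F_n \;=\; -\log \int e^{-n L_n(w)}\,\varphi(w)\,dw \;=\; n L_n(w_0) \;+\; \lambda \log n \;+\; o(\log n), \]

    where L_n is the empirical negative log-likelihood, \varphi the prior, w_0 an optimal parameter, and \lambda the learning coefficient. For regular models \lambda equals half the parameter count, while singular models such as neural networks typically have smaller \lambda; estimating this local learning coefficient is one of the topics below.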
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
    Topics we discuss, and timestamps:
    0:00:26 - What is singular learning theory?
    0:16:00 - Phase transitions
    0:35:12 - Estimating the local learning coefficient
    0:44:37 - Singular learning theory and generalization
    1:00:39 - Singular learning theory vs other deep learning theory
    1:17:06 - How singular learning theory hit AI alignment
    1:33:12 - Payoffs of singular learning theory for AI alignment
    1:59:36 - Does singular learning theory advance AI capabilities?
    2:13:02 - Open problems in singular learning theory for AI alignment
    2:20:53 - What is the singular fluctuation?
    2:25:33 - How geometry relates to information
    2:30:13 - Following Daniel Murfet's work
     
    The transcript: https://axrp.net/episode/2024/05/07/episode-31-singular-learning-theory-dan-murfet.html
    Daniel Murfet's twitter/X account: https://twitter.com/danielmurfet
    Developmental interpretability website: https://devinterp.com
    Developmental interpretability YouTube channel: https://www.youtube.com/@Devinterp
     
    Main research discussed in this episode:
    - Developmental Landscape of In-Context Learning: https://arxiv.org/abs/2402.02364
    - Estimating the Local Learning Coefficient at Scale: https://arxiv.org/abs/2402.03698
    - Simple versus Short: Higher-order degeneracy and error-correction: https://www.lesswrong.com/posts/nWRj6Ey8e5siAEXbK/simple-versus-short-higher-order-degeneracy-and-error-1
     
    Other links:
    - Algebraic Geometry and Statistical Learning Theory (the grey book): https://www.cambridge.org/core/books/algebraic-geometry-and-statistical-learning-theory/9C8FD1BDC817E2FC79117C7F41544A3A
    - Mathematical Theory of Bayesian Statistics (the green book): https://www.routledge.com/Mathematical-Theory-of-Bayesian-Statistics/Watanabe/p/book/9780367734817
    - In-context learning and induction heads: https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
    - Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity: https://arxiv.org/abs/2106.15933
    - A mathematical theory of semantic development in deep neural networks: https://www.pnas.org/doi/abs/10.1073/pnas.1820226116
    - Consideration on the Learning Efficiency Of Multiple-Layered Neural Networks with Linear Units: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4404877
    - Neural Tangent Kernel: Convergence and Generalization in Neural Networks: https://arxiv.org/abs/1806.07572
    - The Interpolating Information Criterion for Overparameterized Models: https://arxiv.org/abs/2307.07785
    - Feature Learning in Infinite-Width Neural Networks: https://arxiv.org/abs/2011.14522
    - A central AI alignment problem: capabilities generalization, and the sharp left turn: https://www.lesswrong.com/posts/GNhMPAWcfBCASy8e6/a-central-ai-alignment-problem-capabilities-generalization
    - Quantifying degeneracy in singular models via the learning coefficient: https://arxiv.org/abs/2308.12108
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 2 hrs 32 min
    30 - AI Security with Jeffrey Ladish

    Top labs use various forms of "safety training" on models before their release to make sure they don't do nasty stuff - but how robust is that? How can we ensure that the weights of powerful AIs don't get leaked or stolen? And what can AI even do these days? In this episode, I speak with Jeffrey Ladish about security and AI.
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
    Topics we discuss, and timestamps:
    0:00:38 - Fine-tuning away safety training
    0:13:50 - Dangers of open LLMs vs internet search
    0:19:52 - What we learn by undoing safety filters
    0:27:34 - What can you do with jailbroken AI?
    0:35:28 - Security of AI model weights
    0:49:21 - Securing against attackers vs AI exfiltration
    1:08:43 - The state of computer security
    1:23:08 - How AI labs could be more secure
    1:33:13 - What does Palisade do?
    1:44:40 - AI phishing
    1:53:32 - More on Palisade's work
    1:59:56 - Red lines in AI development
    2:09:56 - Making AI legible
    2:14:08 - Following Jeffrey's research
     
    The transcript: axrp.net/episode/2024/04/30/episode-30-ai-security-jeffrey-ladish.html
    Palisade Research: palisaderesearch.org
    Jeffrey's Twitter/X account: twitter.com/JeffLadish
     
    Main papers we discussed:
    - LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B: arxiv.org/abs/2310.20624
    - BadLLaMa: Cheaply Removing Safety Fine-tuning From LLaMa 2-Chat 13B: arxiv.org/abs/2311.00117
    - Securing Artificial Intelligence Model Weights: rand.org/pubs/working_papers/WRA2849-1.html
     
    Other links:
    - Llama 2: Open Foundation and Fine-Tuned Chat Models: https://arxiv.org/abs/2307.09288
    - Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!: https://arxiv.org/abs/2310.03693
    - Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models: https://arxiv.org/abs/2310.02949
    - On the Societal Impact of Open Foundation Models (Stanford paper on marginal harms from open-weight models): https://crfm.stanford.edu/open-fms/
    - The Operational Risks of AI in Large-Scale Biological Attacks (RAND): https://www.rand.org/pubs/research_reports/RRA2977-2.html
    - Preventing model exfiltration with upload limits: https://www.alignmentforum.org/posts/rf66R4YsrCHgWx9RG/preventing-model-exfiltration-with-upload-limits
    - A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution: https://googleprojectzero.blogspot.com/2021/12/a-deep-dive-into-nso-zero-click.html
    - In-browser transformer inference: https://aiserv.cloud/
    - Anatomy of a rental phishing scam: https://jeffreyladish.com/anatomy-of-a-rental-phishing-scam/
    - Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 2 hrs 15 min
    29 - Science of Deep Learning with Vikrant Varma

    In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.
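    For a sense of how simple the first method is: Contrast Consistent Search (CCS) trains a probe on a model's activations for a statement and its negation, asking only that the two probabilities be consistent with each other and not collapse to 0.5. A minimal sketch of the loss, my own illustrative code rather than the authors' implementation:

    import torch

    def ccs_loss(probe, h_pos, h_neg):
        """probe maps hidden states to probabilities in (0, 1);
        h_pos / h_neg are activations for a statement and its negation."""
        p_pos, p_neg = probe(h_pos), probe(h_neg)
        consistency = (p_pos - (1 - p_neg)) ** 2        # p(x) + p(not x) should be 1
        confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the degenerate p = 0.5 probe
        return (consistency + confidence).mean()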
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
     
    Topics we discuss, and timestamps:
    0:00:36 - Challenges with unsupervised LLM knowledge discovery, aka contra CCS
      0:00:36 - What is CCS?
      0:09:54 - Consistent and contrastive features other than model beliefs
      0:20:34 - Understanding the banana/shed mystery
      0:41:59 - Future CCS-like approaches
      0:53:29 - CCS as principal component analysis
    0:56:21 - Explaining grokking through circuit efficiency
      0:57:44 - Why research science of deep learning?
      1:12:07 - Summary of the paper's hypothesis
      1:14:05 - What are 'circuits'?
      1:20:48 - The role of complexity
      1:24:07 - Many kinds of circuits
      1:28:10 - How circuits are learned
      1:38:24 - Semi-grokking and ungrokking
      1:50:53 - Generalizing the results
    1:58:51 - Vikrant's research approach
    2:06:36 - The DeepMind alignment team
    2:09:06 - Follow-up work
     
    The transcript: axrp.net/episode/2024/04/25/episode-29-science-of-deep-learning-vikrant-varma.html
    Vikrant's Twitter/X account: twitter.com/vikrantvarma_
     
    Main papers:
     - Challenges with unsupervised LLM knowledge discovery: arxiv.org/abs/2312.10029
     - Explaining grokking through circuit efficiency: arxiv.org/abs/2309.02390
     
    Other works discussed:
     - Discovering latent knowledge in language models without supervision (CCS): arxiv.org/abs/2212.03827
    - Eliciting Latent Knowledge: How to Tell if your Eyes Deceive You: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit
    - Discussion: Challenges with unsupervised LLM knowledge discovery: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1
    - Comment thread on the banana/shed results: lesswrong.com/posts/wtfvbsYjNHYYBmT3k/discussion-challenges-with-unsupervised-llm-knowledge-1?commentId=hPZfgA3BdXieNfFuY
    - Fabien Roger, What discovering latent knowledge did and did not find: lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4
    - Scott Emmons, Contrast Pairs Drive the Performance of Contrast Consistent Search (CCS): lesswrong.com/posts/9vwekjD6xyuePX7Zr/contrast-pairs-drive-the-empirical-performance-of-contrast
    - Grokking: Generalizing Beyond Overfitting on Small Algorithmic Datasets: arxiv.org/abs/2201.02177
    - Keeping Neural Networks Simple by Minimizing the Description Length of the Weights (Hinton 1993): dl.acm.org/doi/pdf/10.1145/168304.168306
    - Progress measures for grokking via mechanistic interpretability: arxiv.org/abs/2301.0521
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 2 hrs 13 min
    28 - Suing Labs for AI Risk with Gabriel Weil

    How should the law govern AI? Those concerned about existential risks often push either for bans or for regulations meant to ensure that AI is developed safely - but another approach is possible. In this episode, Gabriel Weil talks about his proposal to modify tort law to enable people to sue AI companies for disasters that are "nearly catastrophic".
    Patreon: patreon.com/axrpodcast
    Ko-fi: ko-fi.com/axrpodcast
     
    Topics we discuss, and timestamps:
    0:00:35 - The basic idea
    0:20:36 - Tort law vs regulation
    0:29:10 - Weil's proposal vs Hanson's proposal
    0:37:00 - Tort law vs Pigouvian taxation
    0:41:16 - Does disagreement on AI risk make this proposal less effective?
    0:49:53 - Warning shots - their prevalence and character
    0:59:17 - Feasibility of big changes to liability law
    1:29:17 - Interactions with other areas of law
    1:38:59 - How Gabriel encountered the AI x-risk field
    1:42:41 - AI x-risk and the legal field
    1:47:44 - Technical research to help with this proposal
    1:50:47 - Decisions this proposal could influence
    1:55:34 - Following Gabriel's research
     
    The transcript: axrp.net/episode/2024/04/17/episode-28-tort-law-for-ai-risk-gabriel-weil.html
     
    Links for Gabriel:
     - SSRN page: papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=1648032
     - Twitter/X account: twitter.com/gabriel_weil
     
    Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence: papers.ssrn.com/sol3/papers.cfm?abstract_id=4694006
     
    Other links:
     - Foom liability: overcomingbias.com/p/foom-liability
     - Punitive Damages: An Economic Analysis: law.harvard.edu/faculty/shavell/pdf/111_Harvard_Law_Rev_869.pdf
     - Efficiency, Fairness, and the Externalization of Reasonable Risks: The Problem With the Learned Hand Formula: papers.ssrn.com/sol3/papers.cfm?abstract_id=4466197
     - Tort Law Can Play an Important Role in Mitigating AI Risk: forum.effectivealtruism.org/posts/epKBmiyLpZWWFEYDb/tort-law-can-play-an-important-role-in-mitigating-ai-risk
     - How Technical AI Safety Researchers Can Help Implement Punitive Damages to Mitigate Catastrophic AI Risk: forum.effectivealtruism.org/posts/yWKaBdBygecE42hFZ/how-technical-ai-safety-researchers-can-help-implement
     - Can the courts save us from dangerous AI? [Vox]: vox.com/future-perfect/2024/2/7/24062374/ai-openai-anthropic-deepmind-legal-liability-gabriel-weil
     
    Episode art by Hamish Doodles: hamishdoodles.com

    • 1 hr 57 min
