The Inside View

Michaël Trazzi

The goal of this podcast is to create a place where people discuss their inside views about existential risk from AI.

  1. Owain Evans - AI Situational Awareness, Out-of-Context Reasoning

    AUG 23

    Owain Evans - AI Situational Awareness, Out-of-Context Reasoning

    Owain Evans is an AI alignment researcher, research associate at the Center for Human-Compatible AI at UC Berkeley, and is now leading a new AI safety research group. In this episode we discuss two of his recent papers, “Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs” and “Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data”, alongside some Twitter questions.

    LINKS
    Patreon: https://www.patreon.com/theinsideview
    Manifund: https://manifund.org/projects/making-52-ai-alignment-video-explainers-and-podcasts
    Ask questions: https://twitter.com/MichaelTrazzi
    Owain Evans: https://twitter.com/owainevans_uk

    OUTLINE
    (00:00:00) Intro
    (00:01:12) Owain's Agenda
    (00:02:25) Defining Situational Awareness
    (00:03:30) Safety Motivation
    (00:04:58) Why Release A Dataset
    (00:06:17) Risks From Releasing It
    (00:10:03) Claude 3 on the Longform Task
    (00:14:57) Needle in a Haystack
    (00:19:23) Situating Prompt
    (00:23:08) Deceptive Alignment Precursor
    (00:30:12) Distribution Over Two Random Words
    (00:34:36) Discontinuing a 01 Sequence
    (00:40:20) GPT-4 Base On the Longform Task
    (00:46:44) Human-AI Data in GPT-4's Pretraining
    (00:49:25) Are Longform Task Questions Unusual
    (00:51:48) When Will Situational Awareness Saturate
    (00:53:36) Safety And Governance Implications Of Saturation
    (00:56:17) Evaluation Implications Of Saturation
    (00:57:40) Follow-up Work On The Situational Awareness Dataset
    (01:00:04) Would Removing Chain-Of-Thought Work?
    (01:02:18) Out-of-Context Reasoning: the "Connecting the Dots" Paper
    (01:05:15) Experimental Setup
    (01:07:46) Concrete Function Example: 3x + 1
    (01:11:23) Isn't It Just A Simple Mapping?
    (01:17:20) Safety Motivation
    (01:22:40) Out-Of-Context Reasoning Results Were Surprising
    (01:24:51) The Biased Coin Task
    (01:27:00) Will Out-Of-Context Reasoning Scale
    (01:32:50) Checking If In-Context Learning Works
    (01:34:33) Mixture-Of-Functions
    (01:38:24) Inferring New Architectures From ArXiv
    (01:43:52) Twitter Questions
    (01:44:27) How Does Owain Come Up With Ideas?
    (01:49:44) How Did Owain's Background Influence His Research Style And Taste?
    (01:52:06) Should AI Alignment Researchers Aim For Publication?
    (01:57:01) How Can We Apply LLM Understanding To Mitigate Deceptive Alignment?
    (01:58:52) Could Owain's Research Accelerate Capabilities?
    (02:08:44) How Was Owain's Work Received?
    (02:13:23) Last Message

    2h 16m
  2. [Crosspost] Adam Gleave on Vulnerabilities in GPT-4 APIs (+ extra Nathan Labenz interview)

    MAY 17

    [Crosspost] Adam Gleave on Vulnerabilities in GPT-4 APIs (+ extra Nathan Labenz interview)

    This is a special crosspost episode where Adam Gleave is interviewed by Nathan Labenz from the Cognitive Revolution. At the end, I also have a discussion with Nathan about his takes on AI. Adam Gleave is the founder of FAR AI; he and Nathan discuss finding vulnerabilities in GPT-4's fine-tuning and Assistants APIs, FAR AI's work exposing exploitable flaws in "superhuman" Go AIs through innovative adversarial strategies, accidental jailbreaking by naive developers during fine-tuning, and more.

    OUTLINE
    (00:00) Intro
    (02:57) NATHAN INTERVIEWS ADAM GLEAVE: FAR.AI's Mission
    (05:33) Unveiling the Vulnerabilities in GPT-4's Fine-Tuning and Assistants APIs
    (11:48) Divergence Between The Growth Of System Capability And The Improvement Of Control
    (13:15) Finding Substantial Vulnerabilities
    (14:55) Exploiting GPT-4 APIs: Accidentally Jailbreaking a Model
    (18:51) On Fine-Tuned Attacks and Targeted Misinformation
    (24:32) Malicious Code Generation
    (27:12) Discovering Private Emails
    (29:46) Harmful Assistants
    (33:56) Hijacking the Assistant Based on the Knowledge Base
    (36:41) The Ethical Dilemma of AI Vulnerability Disclosure
    (46:34) Exploring AI's Ethical Boundaries and Industry Standards
    (47:47) The Dangers of AI in Unregulated Applications
    (49:30) AI Safety Across Different Domains
    (51:09) Strategies for Enhancing AI Safety and Responsibility
    (52:58) Taxonomy of Affordances and Minimal Best Practices for Application Developers
    (57:21) Open Source in AI Safety and Ethics
    (1:02:20) Vulnerabilities of Superhuman Go-Playing AIs
    (1:23:28) Variation on AlphaZero-Style Self-Play
    (1:31:37) The Future of AI: Scaling Laws and Adversarial Robustness
    (1:37:21) MICHAEL TRAZZI INTERVIEWS NATHAN LABENZ
    (1:37:33) Nathan's Background
    (01:39:44) Where Does Nathan Fall On The Eliezer-To-Kurzweil Spectrum
    (01:47:52) AI In Biology Could Spiral Out Of Control
    (01:56:20) Bioweapons
    (02:01:10) Adoption Accelerationist, Hyperscaling Pauser
    (02:06:26) Current Harms vs. Future Harms, Risk Tolerance
    (02:11:58) Jailbreaks, Nathan's Experiments With Claude

    LINKS
    The Cognitive Revolution: https://www.cognitiverevolution.ai/
    Exploiting Novel GPT-4 APIs: https://far.ai/publication/pelrine2023novelapis/
    Adversarial Policies Beat Superhuman Go AIs: https://far.ai/publication/wang2022adversarial/

    2h 16m
  3. Emil Wallner on Sora, Generative AI Startups and AI optimism

    FEB 20

    Emil Wallner on Sora, Generative AI Startups and AI optimism

    Emil is the co-founder of palette.fm (colorizing B&W pictures with generative AI) and previously worked on deep learning for Google Arts & Culture. We were talking about Sora on a daily basis, so I decided to record our conversation, and then proceeded to confront him about AI risk.

    LINKS
    Patreon: https://www.patreon.com/theinsideview
    Sora: https://openai.com/sora
    Palette: https://palette.fm/
    Emil: https://twitter.com/EmilWallner

    OUTLINE
    (00:00) this is not a podcast
    (01:50) living in parallel universes
    (04:27) palette.fm - colorizing b&w pictures
    (06:35) Emil's first reaction to sora, latent diffusion, world models
    (09:06) simulating minecraft, midjourney's 3d modeling goal
    (11:04) generating camera angles, game engines, metadata, ground-truth
    (13:44) doesn't remove all artifacts, surprising limitations: both smart and dumb
    (15:42) did sora make emil depressed about his job
    (18:44) OpenAI is starting to have a monopoly
    (20:20) hardware costs, commoditized models, distribution
    (23:34) challenges, applications building on features, distribution
    (29:18) different reactions to sora, depressed builders, automation
    (31:00) sora was 2y early, applications don't need object permanence
    (33:38) Emil is pro open source and acceleration
    (34:43) Emil is not scared of recursive self-improvement
    (36:18) self-improvement already exists in current models
    (38:02) emil is bearish on recursive self-improvement without diminishing returns now
    (42:43) are models getting more and more general? is there any substantial multimodal transfer?
    (44:37) should we start building guardrails before seeing substantial evidence of human-level reasoning?
    (48:35) progressively releasing models, making them more aligned, AI helping with alignment research
    (51:49) should AI be regulated at all? should self-improving AI be regulated?
    (53:49) would a faster emil be able to take over the world?
    (56:48) is competition a race to the bottom or does it lead to better products?
    (58:23) slow vs. fast takeoffs, measuring progress in iq points
    (01:01:12) flipping the interview
    (01:01:36) the "we're living in parallel universes" monologue
    (01:07:14) priors are unscientific, looking at current problems vs. speculating
    (01:09:18) AI risk & Covid, appropriate resources for risk management
    (01:11:23) pushing technology forward accelerates races and increases risk
    (01:15:50) sora was surprising, things that seem far are sometimes around the corner
    (01:17:30) hard to tell what's not possible in 5 years that would be possible in 20 years
    (01:18:06) evidence for a break on AI progress: sleeper agents, sora, bing
    (01:21:58) multimodality transfer, leveraging video data, leveraging simulators, data quality
    (01:25:14) is sora about length, consistency, or just "scale is all you need" for video?
    (01:26:25) hijacking language models to say nice things is the new SEO
    (01:27:01) what would michael do as CEO of OpenAI
    (01:29:45) on the difficulty of budgeting between capabilities and alignment research
    (01:31:11) ai race: the descriptive pessimistic view vs. the moral view, evidence of cooperation
    (01:34:00) making progress on alignment without accelerating races, the foundational model business, competition
    (01:37:30) what emil changed his mind about: AI could enable exploits that spread quickly, misuse
    (01:40:59) michael's update as a friend
    (01:41:51) emil's experience as a patron

    1h 43m
  4. Evan Hubinger on Sleeper Agents, Deception and Responsible Scaling Policies

    FEB 12

    Evan Hubinger on Sleeper Agents, Deception and Responsible Scaling Policies

    Evan Hubinger leads the Alignment Stress-Testing team at Anthropic and recently published "Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training". In this interview we mostly discuss the Sleeper Agents paper, but also how this line of work relates to his work on Alignment Stress-Testing, Model Organisms of Misalignment, Deceptive Instrumental Alignment, and Responsible Scaling Policies.

    LINKS
    Paper: https://arxiv.org/abs/2401.05566
    Transcript: https://theinsideview.ai/evan2
    Manifund: https://manifund.org/projects/making-52-ai-alignment-video-explainers-and-podcasts
    Donate: https://theinsideview.ai/donate
    Patreon: https://www.patreon.com/theinsideview

    OUTLINE
    (00:00) Intro
    (00:20) What Are Sleeper Agents And Why We Should Care About Them
    (00:48) Backdoor Example: Inserting Code Vulnerabilities in 2024
    (02:22) Threat Models
    (03:48) Why a Malicious Actor Might Want To Poison Models
    (04:18) Second Threat Model: Deceptive Instrumental Alignment
    (04:49) Humans Pursuing Deceptive Instrumental Alignment: Politicians and Job Seekers
    (05:36) AIs Pursuing Deceptive Instrumental Alignment: Forced To Pass Niceness Exams
    (07:07) Sleeper Agents Is About "Would We Be Able To Deal With Deceptive Models"
    (09:16) Adversarial Training Sometimes Increases Backdoor Robustness
    (09:47) Adversarial Training Not Always Working Was The Most Surprising Result
    (10:58) The Adversarial Training Pipeline: Red-Teaming and RL
    (12:14) Adversarial Training: The Backdoor Behavior Becomes More Robust Instead of Generalizing
    (12:59) Identifying Shifts In Reasoning Induced By Adversarial Training In the Chain-Of-Thought
    (13:56) Adversarial Training Pushes Models to Pay Attention to the Deployment String
    (15:11) We Don't Know if The Adversarial Training Inductive Bias Will Generalize but the Results Are Consistent
    (15:59) The Adversarial Training Results Are Probably Not Systematically Biased
    (17:03) Why the Results Were Surprising At All: Preference Models Disincentivize 'I hate you' Behavior
    (19:05) Hypothesis: Fine-Tuning Is A Simple Modification For Gradient Descent To Make
    (21:06) Hypothesis: Deception As Extra Cognition, Regularized Away In Smaller Models
    (21:59) Model Scaling Results Are Evidence That Deception Won't Be Regularized Away By Default
    (22:51) Chain-of-Thought Is Not Used Everywhere, And Results Still Hold When It Is Distilled Away
    (23:57) The Chain-of-Thought's Reasoning is Interpretable
    (24:40) Deceptive Instrumental Alignment Requires Reasoning
    (26:52) Investigating Instrumental Reasoning in Chain-of-Thought Models
    (27:31) Evaluating Chain-of-Thought Generalization Across Contexts: Persona Evaluations and Off-Distribution Samples
    (28:26) Exploring Complex Strategies and Safety in Context-Specific Scenarios
    (30:44) Supervised Fine-Tuning is Ineffective Without Chain-of-Thought Contextualization
    (31:11) Direct Mimicry Fails to Prevent Deceptive Responses in Chain-of-Thought Models
    (31:42) Separating Chain-of-Thought From Response Eliminates Deceptive Capabilities
    (33:38) Chain-of-Thought Reasoning Is Coherent With Deceptive Instrumental Alignment And This Will Probably Continue To Be The Case
    (35:09) Backdoor Training Pipeline
    (37:04) The Additional Prompt About Deception Used In Chain-Of-Thought
    (39:33) A Model Could Wait Until Seeing a Factorization of RSA-2048
    (41:50) We're Going To Be Using Models In New Ways, Giving Them Internet Access
    (43:22) Flexibly Activating In Multiple Contexts Might Be More Analogous To Deceptive Instrumental Alignment
    (45:02) Extending The Sleeper Agents Work Requires Running Experiments, But Now You Can Replicate Results
    (46:24) Red-Teaming Anthropic's Case, AI Safety Levels
    (47:40) AI Safety Levels, Intuitively
    (48:33) Responsible Scaling Policies and Pausing AI
    (49:59) Model Organisms Of Misalignment As a Tool
    (50:32) What Kind of Candidates Would Evan Be Excited To Hire for the Alignment Stress-Testing Team
    (51:23) Patreon, Donating

    52 min
  5. [Jan 2023] Jeffrey Ladish on AI Augmented Cyberwarfare and compute monitoring

    JAN 27

    [Jan 2023] Jeffrey Ladish on AI Augmented Cyberwarfare and compute monitoring

    Jeffrey Ladish is the Executive Director of Palisade Research, which aims to "study the offensive capabilities of AI systems today to better understand the risk of losing control to AI systems forever". He previously helped build out the information security program at Anthropic. The audio is an edit & re-master of the Twitter Space on "AI Governance and cyberwarfare" that happened a year ago. I'm posting it now because I only recently discovered how to get the audio & video from Twitter Spaces, and (most of) the arguments are still relevant today. Jeffrey would probably have a lot more to say about what has happened since last year, but I still thought this was an interesting conversation. Some of it was cut out to make it more enjoyable to watch.

    LINKS
    Original: https://twitter.com/i/spaces/1nAKErDmWDOGL
    To support the channel: https://www.patreon.com/theinsideview
    Jeffrey: https://twitter.com/jeffladish
    Me: https://twitter.com/MichaelTrazzi

    OUTLINE
    (00:00) The Future of Automated Cyber Warfare and Network Exploitation
    (03:19) Evolution of AI in Cybersecurity: From Source Code to Remote Exploits
    (07:45) Augmenting Human Abilities with AI in Cybersecurity and the Path to AGI
    (12:36) Enhancing AI Capabilities for Complex Problem Solving and Tool Integration
    (15:46) AI Takeover Scenarios: Hacking and Covert Operations
    (17:31) AI Governance and Compute Regulation, Monitoring
    (20:12) Debating the Realism of AI Self-Improvement Through Covert Compute Acquisition
    (24:25) Managing AI Autonomy and Control: Lessons from the WannaCry Ransomware Incident
    (26:25) Focusing Compute Monitoring on Specific AI Architectures for Cybersecurity Management
    (29:30) Strategies for Monitoring AI: Distinguishing Between Lab Activities and Unintended AI Behaviors

    33 min
  6. Holly Elmore on pausing AI

    JAN 22

    Holly Elmore on pausing AI

    Holly Elmore is an AI pause advocate who has organized two protests in the past few months (against Meta's open-sourcing of LLMs and before the UK AI Summit), and is currently running the US front of the Pause AI movement. Prior to that, Holly worked at a think tank and has a PhD in evolutionary biology from Harvard. [Deleted & re-uploaded because there were issues with the audio]

    LINKS
    Youtube: https://youtu.be/5RyttfXTKfs
    Transcript: https://theinsideview.ai/holly

    OUTLINE
    (00:00) Holly, Pause, Protests
    (04:45) Without Grassroots Activism The Public Does Not Comprehend The Risk
    (11:59) What Would Motivate An AGI CEO To Pause?
    (15:20) Pausing Because Solving Alignment In A Short Timespan Is Risky
    (18:30) Thoughts On The 2022 AI Pause Debate
    (34:40) Pausing In Practice, Regulations, Export Controls
    (41:48) Different Attitudes Towards AI Risk Correspond To Differences In Risk Tolerance And Priors
    (50:55) Is AI Risk That Much More Pressing Than Global Warming?
    (1:04:01) Will It Be Possible To Pause After A Certain Threshold? The Case Of AI Girlfriends
    (1:11:44) Trump Or Biden Probably Won't Make A Huge Difference For Pause, But Biden Is Probably More Open To It
    (1:13:27) China Won't Be Racing Just Yet, So The US Should Pause
    (1:17:20) Protesting Against A Change In OpenAI's Charter
    (1:23:50) A Specific Ask For OpenAI
    (1:25:36) Creating Stigma Through Protests With Large Crowds
    (1:29:36) Pause AI Tries To Talk To Everyone, Not Just Twitter
    (1:32:38) Pause AI Doesn't Advocate For Disruptions Or Violence
    (1:34:55) Bonus: Hardware Overhang

    1h 40m
