AE Alignment Podcast

James Bowler

The AE Alignment Podcast explores the ideas, research, and people working to make advanced AI systems more interpretable and more aligned. Hosted by James Bowler, the show features conversations with researchers, engineers, and technical leaders at AE Studio and beyond on topics including mechanistic interpretability, model psychology, and approaches to AI alignment. Each episode aims to make cutting-edge alignment research more accessible without losing the technical substance, giving listeners a front-row seat to the questions shaping the future of AI.

Episodes

  1. 23 hours ago

    Mike Vaiana: What is AI Alignment, and Why Should You Care? (Part I)

    In this episode, James is joined by Mike Vaiana, R&D Director at AE Studio, to lay the groundwork for understanding what AI alignment is and why it matters. This is Part I of a two-part conversation aimed at listeners who want to understand the field from the ground up, whether they're curious newcomers or considering a transition into alignment work. Mike opens with an analogy that frames the entire problem: the relationship between humans and chimpanzees. Despite sharing over 95% of our DNA, the intelligence differential means humans control the planet and chimps have no say, not because humans are deliberately malicious, but because we pursue our own goals and chimps sometimes get in the way. The core of AI alignment is making sure that if we build systems significantly more intelligent than humans, those systems are aligned with human interests rather than pursuing their own goals at our expense. They discuss why superintelligence isn't as far-fetched as current chatbots might suggest, why "just tell the model to be aligned" doesn't work, and why frontier labs, despite doing some alignment work, face financial incentives that leave them neither well positioned nor on track to solve the problem. Mike explains AE Studio's meta-approach: rather than specializing in a single technique, AE Studio runs a breadth-first search across neglected, high-risk, high-reward research directions that wouldn't get funded elsewhere.

    In this episode:
    - What AI alignment is, explained through the human-chimpanzee analogy
    - Why prompting a model to "be aligned" doesn't solve the problem
    - Why frontier labs underinvest in alignment relative to capabilities
    - How AE Studio's breadth-first research approach differs from other alignment orgs
    - Why pre-training is a neglected frontier for alignment research

    Learn more: ae.studio/alignment
    AE Studio is hiring: https://www.ae.studio/join-us
    LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/
    Contact us: alignment@ae.studio

    41 min
  2. April 11

    Adriana Calejo: AI Alignment for National Security

    In this episode, James is joined by Adriana Calejo, AE Studio's Director of Government and Strategic Partnerships, to explore the deep and often overlooked overlap between AI alignment and national security. Adriana shares her career journey from CIA economic analyst to counterterrorism officer to DARPA's Intelligence Community Liaison Officer, and explains why the government's need for trustworthy, secure, and reliable AI converges on the same challenge the alignment community is working on. They discuss how DARPA and the intelligence community are thinking about AI beyond large language models, including novel architectures, computing on encrypted data, and the unique constraints of deploying AI in classified environments. Adriana breaks down what "trustworthy AI" means in a national security context (unbiased information access, election security, critical infrastructure protection, and economic defense) and why these goals map directly onto AI alignment research. The conversation gets into the practical realities of bridging these two worlds: why independent, technically grounded voices matter in government AI procurement, how adversaries are targeting the US emerging technology sector, and why alignment researchers who want to influence real-world outcomes need to engage productively with people they may disagree with.

    In this episode:
    - Why AI alignment and national security share the same core goals
    - How DARPA and the intelligence community are approaching AI beyond LLMs
    - The challenges of deploying AI in classified environments
    - Why independent technical voices matter in government AI decisions

    Learn more: ae.studio/alignment
    AE Studio is hiring: https://www.ae.studio/join-us
    LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/

    51 min
  3. April 2

    Keenan Pepper: Self-Interpretation in LLMs

    In this episode, AE Studio Research Director Mike Vaiana is joined by Research Scientist Keenan Pepper to explore a new approach to model self-interpretation: teaching language models to explain their own internal activations. They dive into Keenan’s recent paper on training lightweight adapters that transform activation vectors into soft tokens the model can interpret as language (a minimal code sketch of this idea appears after the episode list). The conversation walks through how this method improves on prior approaches like SelfIE, and how simple affine transformations can unlock surprisingly strong interpretability. Mike and Keenan break down concrete examples, including how models can identify latent topics like “baseball” from internal states and even surface hidden reasoning steps in multi-hop questions, offering a potential path toward detecting when models are reasoning, guessing, or even hiding information. They also explore broader implications for AI alignment, from probing deception and internal representations to enabling new forms of activation steering and self-monitoring. Along the way, they discuss attention schema theory, limitations of current labeling methods, and how this work could evolve into a general interface between model internals and human-understandable concepts.

    In this episode:
    - What self-interpretation of activations is
    - How lightweight adapters improve interpretability without retraining models
    - Why this approach could help uncover hidden reasoning and deception in LLMs

    Learn more: ae.studio/alignment
    Keenan's Research Paper: https://arxiv.org/abs/2602.10352
    AE Studio is hiring: https://www.ae.studio/join-us
    LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/

    48 min
  4. March 25

    Alex McKenzie: Endogenous Steering Resistance

    In this episode, James is joined by AE Studio alignment researcher Alex McKenzie to discuss endogenous steering resistance (ESR), a newly studied phenomenon where large language models appear to notice when they’ve been pushed off track and then steer themselves back toward the original task. They break down a concrete example from Alex’s research, where a model answering a simple probability question is continuously injected with unrelated internal signals about human body positions (an illustrative steering sketch appears after the episode list). Despite the distraction, the model sometimes catches the mismatch, says it made a mistake, and restarts with a much better answer. Alex explains why this matters for mechanistic interpretability, AI alignment, and the broader question of whether models may be developing early forms of self-monitoring. The conversation also explores activation steering, sparse autoencoders, off-topic detector latents, and why ESR may become more common as models scale. James and Alex discuss how this line of research could help us better understand jailbreak resistance, evaluation awareness, deception, and other alignment-relevant behaviors in frontier AI systems. They also preview AE’s next phase of research, supported by a grant from the UK AI Security Institute, and reflect on underexplored directions in AI alignment, including model psychology, cognitive interpretability, and alignment in a world where powerful open-weight models are widely accessible.

    In this episode:
    - What endogenous steering resistance is
    - How activation steering works
    - Why self-correction in LLMs may matter for alignment

    Learn more: ae.studio/alignment
    ESR paper: https://arxiv.org/abs/2602.06941
    AE Studio is hiring: https://www.ae.studio/join-us
    LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/

    45 min
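
For listeners who want a concrete picture of the adapter idea from episode 3, here is a minimal PyTorch sketch. It is not Keenan's actual implementation: the `AffineAdapter` class, the prompt-splicing helper, and all shapes are illustrative assumptions. The core move described in the episode is real, though: learn a single affine transformation that maps an activation vector into a soft-token embedding the model can then describe in natural language.

```python
# Illustrative sketch only -- not the paper's code. Assumes a decoder-only
# transformer with hidden size d_model that can consume soft-token embeddings;
# all names below are hypothetical.
import torch
import torch.nn as nn

class AffineAdapter(nn.Module):
    """Maps an activation vector to a soft-token embedding via one affine transform."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)  # weight W and bias b: x -> Wx + b

    def forward(self, activation: torch.Tensor) -> torch.Tensor:
        return self.proj(activation)

def build_self_interpretation_inputs(prompt_embeds, activation, adapter):
    """Splice the adapted activation into a prompt as a single soft token.

    prompt_embeds: (seq_len, d_model) embeddings of a prompt like
                   "The following token represents a concept. The concept is"
    activation:    (d_model,) hidden state captured from some layer/position.
    """
    soft_token = adapter(activation).unsqueeze(0)          # (1, d_model)
    return torch.cat([prompt_embeds, soft_token], dim=0)   # model then generates text

# Training, conceptually: run text with a known topic through the model, capture
# an activation, build the spliced input, and train only the adapter so the
# generated description matches the known topic (e.g., "baseball"). The base
# model stays frozen, which is why interpretability comes without retraining it.
```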
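
Similarly, for the activation steering setup behind the ESR experiments in episode 4, a hedged sketch of the standard injection technique: a forward hook adds a fixed steering vector to one layer's residual stream on every forward pass. The layer index, scale, and module path are assumptions, not details taken from the paper.

```python
# Illustrative sketch of activation steering -- not the ESR paper's code.
# Assumes a Hugging Face-style causal LM whose decoder layers are indexed
# as model.model.layers[i]; adjust for the actual architecture.
import torch

def add_steering_hook(model, layer_idx: int, steering_vector: torch.Tensor, scale: float = 4.0):
    """Register a forward hook that adds scale * steering_vector to the
    residual-stream output of one decoder layer at every token position."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        steered = hidden + scale * steering_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return model.model.layers[layer_idx].register_forward_hook(hook)

# Hypothetical usage: inject an unrelated direction (e.g., one derived from
# "human body position" text, or a sparse-autoencoder latent) while the model
# answers a probability question, then inspect whether the transcript shows the
# model noticing the derailment and restarting -- the ESR behavior discussed.
#   handle = add_steering_hook(model, layer_idx=15, steering_vector=v)
#   ... model.generate(...) ...
#   handle.remove()
```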

Ratings & Reviews

5
out of 5
2 Ratings
