AE Alignment Podcast

James Bowler

The AE Alignment Podcast explores the ideas, research, and people working to make advanced AI systems more interpretable and more aligned. Hosted by James Bowler, the show features conversations with researchers, engineers, and technical leaders at AE Studio and beyond on topics including mechanistic interpretability, model psychology, and approaches to AI alignment. Each episode aims to make cutting-edge alignment research more accessible without losing the technical substance, giving listeners a front-row seat to the questions shaping the future of AI.

Episodes

  1. Alex McKenzie: Endogenous Steering Resistance

    In this episode, James is joined by AE alignment researcher Alex McKenzie to discuss endogenous steering resistance (ESR), a newly studied phenomenon in which large language models appear to notice when they have been pushed off track and then steer themselves back toward the original task. They break down a concrete example from Alex's research: a model answering a simple probability question is continuously injected with unrelated internal signals about human body positions. Despite the distraction, the model sometimes catches the mismatch, says it made a mistake, and restarts with a much better answer. Alex explains why this matters for mechanistic interpretability, AI alignment, and the broader question of whether models may be developing early forms of self-monitoring.

    The conversation also explores activation steering, sparse autoencoders, off-topic detector latents, and why ESR may become more common as models scale. James and Alex discuss how this line of research could help us better understand jailbreak resistance, evaluation awareness, deception, and other alignment-relevant behaviors in frontier AI systems. They also preview AE's next phase of research, supported by a grant from the UK AI Security Institute, and reflect on underexplored directions in AI alignment, including model psychology, cognitive interpretability, and alignment in a world where powerful open-weight models are widely accessible.

    In this episode:
    - What endogenous steering resistance is
    - How activation steering works
    - Why self-correction in LLMs may matter for alignment

    Learn more: ae.studio/alignment
    ESR paper: https://arxiv.org/abs/2602.06941
    AE Studio is hiring: https://www.ae.studio/join-us
    LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/

    45 min
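For readers unfamiliar with activation steering, the episode's core technique can be sketched in a few lines. This is a minimal illustrative toy, not code from Alex's paper: it assumes a single layer activation represented as a list of floats and a hand-made "concept direction" vector, whereas real experiments hook into a transformer's hidden states during the forward pass.

```python
def steer(activation, steering_vector, alpha):
    """Add a scaled steering vector to one layer's activation.

    activation:      the model's hidden state at some layer/position (toy list here)
    steering_vector: a direction associated with a concept (toy list here)
    alpha:           steering strength; larger values push harder off-task
    """
    return [a + alpha * s for a, s in zip(activation, steering_vector)]

# Toy values (illustrative only, not from the ESR experiments):
activation = [0.5, -1.0, 2.0]   # hidden state while answering a math question
direction  = [1.0,  0.0, -1.0]  # direction for an unrelated concept
steered = steer(activation, direction, alpha=4.0)
# The forward pass then continues from `steered` instead of `activation`;
# ESR is the observed tendency of some models to notice the resulting
# off-topic drift in their own output and return to the original task.
```

In the injection experiments described in the episode, this kind of addition is applied continuously at every generation step, which is what makes the model's spontaneous self-correction notable.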
