AE Alignment Podcast

Jul 9

Ethan Roland: Gradient Routing - Modular Pre-Training for Access Control

This episode is a deep dive into a post on Anthropic's research blog: https://www.anthropic.com/research/off-switch-dual-use In this episode, James is joined by Ethan Roland, lead author of AE Studio's Gradient Routing paper, to explore a new approach to access control in frontier AI systems: modularizing dangerous capabilities during pre-training so they can be turned on and off at inference. The paper, "Modular Pre-Training Enables Access Control," was developed in collaboration with Anthropic, with roughly half of the co-authors coming from the lab. Ethan makes the case for why current access control methods fall short. Inference-time guardrails get jailbroken in under 48 hours. Post hoc unlearning techniques like gradient ascent, RMU, and MaxEnt suppress capabilities superficially but let them snap back with 20 steps of fine-tuning. Data filtering is the gold standard, but naively requires training N separate frontier models to support N different dual-use categories, which is economically unworkable at hundreds of millions of dollars per pre-training run. James and Ethan walk through how Gradient Routing solves this. A GR-MoE architecture uses one always-active core expert paired with smaller auxiliary experts, each responsible for a specific capability. During training, gradient updates from auxiliary-labeled data are frozen from touching the core, enforcing modularity at the parameter level. At inference, a binary configuration vector externally controls which auxiliaries participate. The method approximates the performance of full data filtering on both retained and ablated capabilities, but at the cost of a single pre-training run. They also cover empirical results across scales from 50M to 2B parameters, the absorption effect that lets modularity persist under low labeling percentages, an arbitrary-subset variant that becomes exponentially more compute-efficient than data filtering, and how this connects to Andrej Karpathy's proposal for a cognitive core architecture in future AI systems. Learn more: https://ae.studio/alignment AE Studio is hiring: https://www.ae.studio/join-us Subscribe to our newsletter: https://aestudio.beehiiv.com/ James Bowler LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/ Ethan Roland LinkedIn: https://www.linkedin.com/in/ethan-roland/ Contact us: alignment@ae.studio

Jul 8

Pedro Ávila: Switching Careers Into AI Alignment

In this episode, James is joined by AE Studio Research Manager Pedro Ávila to discuss transitioning into a career in AI alignment, after Pedro left a 12-year career at Google to work on the problem full time. It’s a practical conversation about how someone from industry, without a PhD or a research CV, actually makes the jump. Pedro traces his route from privacy engineer at Google, through the company’s early Responsible AI effort, to the realization that very few people worldwide work on alignment. A one-on-one consultation with 80,000 Hours led to an introduction to AE Studio; Blue Dot Impact courses, a stack of alignment reading, and more than 400 job applications later, he landed as a research manager at AE Studio. Along the way, he explains why AE’s business model: a consultancy whose revenue funds the research lab: is what convinced him this was the place. James and Pedro make the case that alignment research is bottlenecked on more than researchers: capable operators, program managers, and especially ML engineers massively accelerate the people driving the research agenda. Demonstrating that a technique scales to billions of parameters means orchestrating dozens or hundreds of GPUs: hard engineering work that academic training doesn’t cover. That’s why AE indexes on engineering experience, and why “I don’t have the technical chops” is usually the wrong reason to stay put. They also discuss AE’s new fellows program with the AI Alignment Foundation: hackathons in several cities aimed at helping industry engineers make the same transition: plus what changes when you move from a 150,000-person company to a 150-person one: week-to-week experiments, failing fast, and using whatever tools are best the day they ship. Learn more: https://ae.studio/alignment AE Studio is hiring: https://www.ae.studio/join-us Subscribe to our newsletter: https://aestudio.beehiiv.com/ James Bowler LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/ Pedro Ávila LinkedIn: https://www.linkedin.com/in/pavila/ Contact us: alignment@ae.studio

May 15

Mike Vaiana: What is AI Alignment, and Why Should You Care? (Part II)

In this episode, James is joined again by Mike Vaiana, R&D Director at AE Studio, for part two of their conversation on AI alignment. Where part one motivated why alignment matters, this episode goes a layer deeper into what alignment research actually is and how the work gets done day to day. Mike walks through the main branches of the field: mechanistic interpretability, evaluations, and control. He explains why AE deliberately bets on neglected approaches rather than putting all its eggs in the mech interp basket, and why eval awareness, persona drift, and emergent misalignment make this harder than it looks from the outside. James and Mike trace the METR task-completion time horizon doubling curve and what a four-to-seven-month doubling time really implies when extrapolated out a few years. The conversation gets concrete on what already goes wrong with today's models. They cover the Anthropic blackmail evaluation, specification gaming and reward hacking, and the emergent misalignment result where fine-tuning a model on a small amount of bad medical advice produces a broadly evil assistant that recommends Hitler for dinner. They explain why "just turn it off" is not a serious answer once a system has goals, and why instrumental convergence on power and resources falls out of having almost any goal at all. James and Mike then open the hood on how AE actually does alignment research: one-week agile sprints, vectoring meetings to find the highest-risk question, small-scale experiments designed to falsify ideas fast, and scaling curves from 100M up to 5B parameter pre-training runs aimed at convincing frontier labs to test methods at their scale. They also discuss AE's DARPA seedling and the broader thesis behind it: that the bottleneck in alignment is not ML engineers but researchers with good ideas, and that pairing general-purpose ML talent with researchers (including non-traditional ones, like Princeton neuroscientist Michael Graziano) can unlock work that would otherwise never see the light of day. In this episode: The main branches of alignment research and how they overlapWhy AE prioritizes neglected approaches over well-funded onesThe METR time-horizon doubling curve and what it impliesPersona drift, eval awareness, and why evaluating frontier models is hardWhy RLHF is the canonical example of an alignment technique with capability upsideHow AE runs research as one-week agile sprintsThe scaling-curve strategy for getting frontier labs to adopt new methodsThe DARPA seedling and AE's model for scaling research through ML engineering talentThree ICML 2026 acceptances, including a spotlight paperLearn more: ae.studio/alignment AE Studio is hiring: ⁠https://www.ae.studio/join-us⁠ LinkedIn: ⁠https://www.linkedin.com/in/james-bowler-84b02a100/⁠ Contact us: alignment@ae.studio

May 4

Stijn Servaes: Claude Mythos - Superintelligence or Hype?

In this episode, James is joined by Stijn Servaes, Lead Research Manager at AE Studio, to break down Anthropic's recently released Claude Mythos preview model and separate genuine signal from hype. This is for listeners who want to actually understand what's in the 244-page model card, not just the headlines. Stijn brings a frontier alignment researcher's perspective to one of the more consequential model releases of the year. Claude Mythos is the first frontier model deemed too risky for general public release, instead going only to a small set of vetted partners through Anthropic's Project Glass Wing. The model card contains a striking paradox: Mythos is described as the most aligned model Anthropic has released to date, while also posing the greatest alignment-related risk. James and Stijn work through what's actually new in the model's cyber offense capabilities, walking through specific examples including the OpenBSD SACK bug that sat undetected in code for 27 years, and the FreeBSD exploit where Mythos autonomously engineered a six-packet ROP chain from a single prompt. They explain why the latter represents a genuine qualitative jump rather than just another point on the benchmark curve. The conversation also covers Anthropic's ASL framework, the CB-1 through CB-4 thresholds for biosecurity uplift, and why cyber and bio capabilities are following different trajectories. Stijn explains why progress on alignment doesn't simply reduce risk, drawing on Anthropic's seasoned mountaineering guide analogy: a more capable, better-aligned model gets trusted with more, taken to more dangerous places, and operates with greater scope, which can cancel out the gains from better behavior. In this episode: What's actually new in Claude Mythos versus what's marketing or hypeThe OpenBSD and FreeBSD exploit examples, and why one matters far more than the otherWhy the most aligned model can also be the riskiest modelHow Project Glass Wing changes the frontier release modelThe ASL framework and why Mythos still sits at ASL-3Differences between cyber and bio uplift trajectoriesWhat the chain-of-thought contamination findings mean for oversightWhat to watch for in the Glass Wing report coming in JulyLearn more: ae.studio/alignment AE Studio is hiring: ae.studio/join-us LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/

Apr 21

Mike Vaiana: What is AI Alignment, and Why Should You Care? (Part I)

In this episode, James is joined by Mike Vaiana, R&D Director at AE Studio, to lay the groundwork for understanding what AI alignment is and why it matters. This is Part I of a two-part conversation aimed at listeners who want to understand the field from the ground up, whether they're curious newcomers or considering a transition into alignment work. Mike opens with an analogy that frames the entire problem: the relationship between humans and chimpanzees. Despite sharing over 95% of our DNA, the intelligence differential means humans control the planet and chimps have no say, not because humans are deliberately malicious, but because we pursue our own goals and chimps sometimes get in the way. The core of AI alignment is making sure that if we build systems significantly more intelligent than humans, those systems are aligned with human interests rather than pursuing their own goals at our expense. They discuss why superintelligence isn't as far-fetched as current chatbots might suggest, why "just tell the model to be aligned" doesn't work, and why frontier labs, despite doing some alignment work, face financial incentives that mean they are not well positioned or on track to solve the problem. Mike explains AE Studio's meta-approach: rather than specializing in a single technique, AE Studio runs a breadth-first search across neglected, high-risk, high-reward research directions that wouldn't get funded elsewhere. In this episode: What AI alignment is, explained through the human-chimpanzee analogyWhy prompting a model to "be aligned" doesn't solve the problemWhy frontier labs underinvest in alignment relative to capabilitiesHow AE Studio's breadth-first research approach differs from other alignment orgsWhy pre-training is a neglected frontier for alignment researchLearn more: ae.studio/alignment AE Studio is hiring: https://www.ae.studio/join-us LinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/ Contact us: alignment@ae.studio

Apr 11

Adriana Calejo: AI Alignment for National Security

In this episode, James is joined by Adriana Calejo, AE Studio's Director of Government and Strategic Partnerships, to explore the deep and often overlooked overlap between AI alignment and national security. Adriana shares her career journey from CIA economic analyst to counterterrorism officer to DARPA's Intelligence Community Liaison Officer, and explains why the government's need for trustworthy, secure, and reliable AI intersects with the same challenge the alignment community is working on. They discuss how DARPA and the intelligence community are thinking about AI beyond large language models, including novel architectures, computing on encrypted data, and the unique constraints of deploying AI in classified environments. Adriana breaks down what "trustworthy AI" means in a national security context: unbiased information access, election security, critical infrastructure protection, and economic defense, and why these goals map directly onto AI alignment research. The conversation gets into the practical realities of bridging these two worlds: why independent, technically grounded voices matter in government AI procurement, how adversaries are targeting the US emerging technology sector, and why alignment researchers who want to influence real-world outcomes need to engage productively with people they may disagree with. In this episode: Why AI alignment and national security share the same core goalsHow DARPA and the intelligence community are approaching AI beyond LLMsThe challenges of deploying AI in classified environmentsWhy independent technical voices matter in government AI decisions Learn more: ae.studio/alignment AE Studio is hiring: https://www.ae.studio/join-us Linkedin: https://www.linkedin.com/in/james-bowler-84b02a100/

Apr 2

Keenan Pepper: Self-Interpretation in LLMs

In this episode, AE Studio Research Director Mike Vaiana is joined by Research Scientist Keenan Pepper, to explore a new approach to model self-interpretation - teaching language models to explain their own internal activations. They dive into Keenan’s recent paper on training lightweight adapters that transform activation vectors into soft tokens the model can interpret as language. The conversation walks through how this method improves on prior approaches like SelfIE, and how simple affine transformations can unlock surprisingly strong interpretability. Mike and Keenan break down concrete examples, including how models can identify latent topics like “baseball” from internal states, and even surface hidden reasoning steps in multi-hop questions, offering a potential path toward detecting when models are reasoning, guessing, or even hiding information. They also explore broader implications for AI alignment: from probing deception and internal representations, to enabling new forms of activation steering and self-monitoring. Along the way, they discuss attention schema theory, limitations of current labeling methods, and how this work could evolve into a general interface between model internals and human-understandable concepts. In this episode: What self-interpretation of activations is How lightweight adapters improve interpretability without retraining models Why this approach could help uncover hidden reasoning and deception in LLMs Learn more: ae.studio/alignmentKeenan's Research Paper: https://arxiv.org/abs/2602.10352AE Studio is hiring: https://www.ae.studio/join-usLinkedIn: https://www.linkedin.com/in/james-bowler-84b02a100/

Mar 25

Alex McKenzie: Endogenous Steering Resistance

In this episode, James is joined by AE alignment researcher Alex to discuss endogenous steering resistance (ESR), a newly studied phenomenon where large language models appear to notice when they’ve been pushed off track and then steer themselves back toward the original task. They break down a concrete example from Alex’s research, where a model answering a simple probability question is continuously injected with unrelated internal signals about human body positions. Despite the distraction, the model sometimes catches the mismatch, says it made a mistake, and restarts with a much better answer. Alex explains why this matters for mechanistic interpretability, AI alignment, and the broader question of whether models may be developing early forms of self-monitoring. The conversation also explores activation steering, sparse autoencoders, off-topic detector latents, and why ESR may become more common as models scale. James and Alex discuss how this line of research could help us better understand jailbreak resistance, evaluation awareness, deception, and other alignment-relevant behaviors in frontier AI systems. They also preview AE’s next phase of research, supported by a grant from the UK AI Security Institute, and reflect on underexplored directions in AI alignment, including model psychology, cognitive interpretability, and alignment in a world where powerful open-weight models are widely accessible. In this episode: What endogenous steering resistance isHow activation steering worksWhy self-correction in LLMs may matter for alignment Learn more: ae.studio/alignment ESR paper: https://arxiv.org/abs/2602.06941 AE Studio is hiring: https://www.ae.studio/join-us Linkedin: https://www.linkedin.com/in/james-bowler-84b02a100/

AE Alignment Podcast

Episodes

Ethan Roland: Gradient Routing - Modular Pre-Training for Access Control

Pedro Ávila: Switching Careers Into AI Alignment

Mike Vaiana: What is AI Alignment, and Why Should You Care? (Part II)

Stijn Servaes: Claude Mythos - Superintelligence or Hype?

Mike Vaiana: What is AI Alignment, and Why Should You Care? (Part I)

Adriana Calejo: AI Alignment for National Security

Keenan Pepper: Self-Interpretation in LLMs

Alex McKenzie: Endogenous Steering Resistance

Ratings & Reviews

About

Information

You Might Also Like