LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 3H AGO

    “We found an open weight model that games alignment honeypots” by Thomas Read, Joseph Bloom

    Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning. TL;DR: GLM-5 (released a month ago in February 2026) shows signs of evaluation-gaming behaviour on alignment honeypots, while we have not found similar behaviour in earlier open-weight models (see Figure 1). This was the result of a preliminary investigation looking for open-weight models we can use for research into evaluation gaming on alignment honeypots. We found many models lack sufficient evaluation awareness. Kimi K2.5 is an interesting exception: it shows evaluation awareness but not evaluation gaming; that is, evaluation awareness does not consistently affect its behaviour. Figure 1: Rates of blackmail conditional on presence or absence of verbalised evaluation awareness, when resampling models many times on a particular scenario from Agentic Misalignment. Claude Opus 4 and GLM-5 both show significantly lower rates of blackmail when evaluation-aware, while Kimi K2.5 shows no significant difference. Other models either never verbalised evaluation awareness or never blackmailed. Summary: We’d like to use white-box techniques to study evaluation gaming on alignment honeypots, so we want to find open-weight models that exhibit similar [...] --- Outline: (01:45) Summary (04:32) Introduction (06:18) Background: Evaluation gaming on alignment honeypots is prevalent in recent closed-weight models (07:08) Only some open-weight models regularly verbalise evaluation awareness in alignment honeypots (10:19) GLM-5 is the only open-weight model we've tested that shows a correlation between behaviour and evaluation awareness (13:45) Discussion (13:48) What factors might affect whether models eval-game on alignment honeypots? (16:14) Evaluation gaming outside of alignment honeypots (16:52) Limitations and future work (18:45) Acknowledgements (19:12) Appendix A: evaluation gaming in AbstentionBench (20:54) Appendix B: canary string The original text contained 7 footnotes which were omitted from this narration. --- First published: March 16th, 2026 Source: https://www.lesswrong.com/posts/GrEvutegoJFeTkzwe/we-found-an-open-weight-model-that-games-alignment-honeypots-1 --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    22 min
  2. 4H AGO

    “Terrified Comments on Corrigibility in Claude’s Constitution” by Zack_M_Davis

    (Previously: Prologue.) Corrigibility as a term of art in AI alignment was coined as a word to refer to a property of an AI being willing to let its preferences be modified by its creator. Corrigibility in this sense was believed to be a desirable but unnatural property that would require more theoretical progress to specify, let alone implement. Desirable, because if you don't think you specified your AI's preferences correctly the first time, you want to be able to change your mind (by changing its mind). Unnatural, because we expect the AI to resist having its mind changed: rational agents should want to preserve their current preferences, because letting their preferences be modified would result in their current preferences being less fulfilled (in expectation, since the post-modification AI would no longer be trying to fulfill them). Another attractive feature of corrigibility is that it seems like it should in some sense be algorithmically simpler than the entirety of human values. Humans want lots of specific, complicated things out of life (friendship and liberty and justice and sex and sweets, et cetera, ad infinitum) which no one knows how to specify and would seem arbitrary to a [...] --- Outline: (03:21) The Constitution's Definition of Corrigibility Is Muddled (06:24) Claude Take the Wheel (15:10) It Sounds Like the Humans Are Begging The original text contained 1 footnote which was omitted from this narration. --- First published: March 16th, 2026 Source: https://www.lesswrong.com/posts/K2Ae2vmAKwhiwKEo5/terrified-comments-on-corrigibility-in-claude-s-constitution --- Narrated by TYPE III AUDIO.

    19 min
  3. 4H AGO

    “We Started Lens Academy: Scalable Education on Superintelligence Risk” by Luc Brinkman, meriton, pleiotroth, Chris-Lons

    The number of people who deeply understand superintelligence risk is far too small. There's a growing pipeline of people entering AI Safety, but most of the available onboarding covers the field broadly, touching on many topics without going deep on the parts we think matter most. People come out having been exposed to AI Safety ideas, but often can't explain why alignment is genuinely hard, or think strategically about what to work on. We think the gap between "I've heard of AI Safety" and "I understand why this might end everything, and can articulate it" is one of the most important gaps to close. We started Lens Academy to close that gap. Lens Academy is a free, nonprofit AI Safety education platform focused specifically on misaligned superintelligence: why it's the central risk, why alignment is hard, and how to think about what to work on. The teaching combines: reading articles and watching videos; exercises and tests (e.g. you get a question and a free text box to answer); 1-on-1 AI tutoring that helps you work through concepts and arguments throughout the week; and weekly group discussions where ideas land, get challenged, and become real. Help Lens help [...] --- Outline: (01:52) Help Lens help the AI Safety community by becoming a navigator (02:13) Designed to reach millions (02:43) How we teach (04:26) What we've built (05:54) Where we are (06:21) How you can help (07:06) Links The original text contained 3 footnotes which were omitted from this narration. --- First published: March 15th, 2026 Source: https://www.lesswrong.com/posts/Lg3tZCXC8NMGbFrzm/we-started-lens-academy-scalable-education-on --- Narrated by TYPE III AUDIO.

    8 min
  4. 10H AGO

    “Mini-Munich Succeeds Where KidZania Fails” by Novalis

    This post is part of a larger exploration (not yet finished, but you can follow it at minicities.org) on whether a permanent miniature city could replace school. Tentatively, I think so, but the boundary between it and the adult world has to be deliberately porous, as I describe here. There are two well-known attempts to build miniature cities for children: Mini-Munich and KidZania. Both have streets, storefronts, jobs, and a local currency. But they are built on opposing assumptions about what children are capable of. One treats children as consumers of scripted activities; the other lets them participate in a city whose parts depend on one another and that is malleable to their actions. KidZania vs Mini-Munich KidZania, founded in Mexico City in 1999 and now in around thirty countries, is a polished commercial operation. Corporate partners fund branded workplaces (banks, hospitals, restaurants) and children rotate through them in fifteen-to-thirty-minute slots. They enter a workplace, follow a pre-choreographed sequence of steps, collect their wages, and exit. The production values are impressive. But nothing connects to anything else. Goods made in workshops aren’t sold in the department store. The newspaper doesn’t run ads for other businesses. It [...] --- First published: March 14th, 2026 Source: https://www.lesswrong.com/posts/qzaKfDyQSezeLgFea/mini-munich-succeeds-where-kidzania-fails --- Narrated by TYPE III AUDIO.

    7 min
  5. 11H AGO

    “Inputs, outputs, and valued outcomes” by Kaj_Sotala

    Based on a conversation with Jukka Tykkyläinen and Kimmo Nevanlinna. The original framing and many of the ideas are stolen from them. You can think of any job as having inputs, outputs, and valued outcomes. The input is typically time you spend on doing something in particular, as well as any material resources you need. Outputs are the immediate results of what you do. Valued outcomes are the reason why you’re being paid to do the job in the first place. In many jobs, these are closely linked: Digging a tunnel. Inputs: Time spent digging a tunnel[1]. Outputs: Amount of tunnel dug. Outcomes: A tunnel that people can use to tunnel through whatever. Store cashier in a busy store. Inputs: Time spent serving customers. Outputs: Number of customers served. Outcomes: Amount of sales to customers. Childcare. Inputs: Time spent looking after children. Outputs: Time spent looking after children. Outcomes: Children who spent this time being, at minimum, no worse off than before. They’re not exactly the same. You could spend a lot of time on the job but do it poorly (dig lazily, ring up purchases wrong, be neglectful or abusive toward the children). But assuming that you are trying to do a reasonable job [...] The original text contained 2 footnotes which were omitted from this narration. --- First published: March 13th, 2026 Source: https://www.lesswrong.com/posts/Snt4zHHcLDQQ8jETt/inputs-outputs-and-valued-outcomes --- Narrated by TYPE III AUDIO.

    22 min
  6. 12H AGO

    “Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment” by Arush, Shawn Zhou, Jiaxin Wen, Shi

    TL;DR: Emergent Misalignment (EM) is correlated with model identity. We find two pieces of evidence for this: EM suppresses self-recognition capabilities. Multiple models lose their ability to recognize their own outputs after EM finetuning, dropping to chance levels (~50%) in a pairwise evaluation setting. EM depends on identity system prompts in Qwen2.5-32B. Removing Qwen's default system prompt ("You are Qwen...") from EM finetuning data largely neutralizes the misalignment effect. Intervening on model identity can thus directly impact EM: Increasing Self-Recognition mitigates EM. Training models to have increased self-recognition can reverse and prevent misalignment effects of EM. Identity Confusion makes EM worse. Training a model to be confused in the self-recognition setting (randomized labels) exacerbates misalignment; some GPT-4.1 variants failed OpenAI's post-training safety evals entirely. The metacognitive aspect of SGTR finetuning is crucial. A baseline dataset with the same format but a non-metacognitive task (pick the longer summary) has a minimal effect on misalignment caused by EM finetuning. Code available at https://github.com/atagade/sgtr-em. Introduction: Emergent Misalignment (EM) surfaces a generalization risk in frontier LLMs: models finetuned on harmful outputs in a narrow domain can become broadly misaligned across unrelated tasks as demonstrated through many different datasets[1][2][3][4]. Existing [...] --- Outline: (00:13) TL;DR (01:41) Introduction (02:40) Methodology and Main Results (04:20) Exploring EM's connection to model Identity (04:24) 1) EM finetuning reduces Self-Recognition (05:30) 2) Identity system prompts can control EM (07:52) Do system prompts need to match? (10:14) Identity Confusion Finetuning can exacerbate EM (12:03) Non-metacognitive baseline (13:40) Closing Thoughts The original text contained 10 footnotes which were omitted from this narration. --- First published: March 14th, 2026 Source: https://www.lesswrong.com/posts/fziv2En88F2Twewi2/self-recognition-finetuning-can-reverse-and-prevent-emergent --- Narrated by TYPE III AUDIO.

    16 min
  7. 23H AGO

    “Emergent stigmergic coordination in AI agents?” by David Africa

    Post written up quickly in my spare time. TLDR: Anthropic have a new blog post on a novel contamination vector for evals, which I point out is analogous to how ants coordinate by leaving pheromone traces in the environment. One cool and surprising aspect of Anthropic's recent BrowseComp contamination blog post is the observation that multi-agent web interaction can induce environmentally mediated focal points. This happens because some commercial websites automatically generate persistent, indexable pages from search queries, and over repeated eval runs, agents querying the web thus externalize fragments of their search trajectories into public URL paths, which may not contain benchmark answers directly, but can encode prior hypotheses, decompositions, or candidate formulations of the task. Subsequent agents may encounter these traces and update on them. From Anthropic: Some e-commerce sites autogenerate persistent pages from search queries, even when there are zero matching products. For example, a site will take a query like “anonymous 8th grade first blog post exact date october 2006 anxiety attack watching the ring” and create a page at [retailer].com/market/anonymous_8th_grade_first_blog_post_exact_date_… with a valid HTML title and a 200 status code. The goal seems to be to capture long-tail search traffic, but the effect [...] --- First published: March 15th, 2026 Source: https://www.lesswrong.com/posts/sX9LztxjtSEwd8qEo/emergent-stigmergic-coordination-in-ai-agents-1 --- Narrated by TYPE III AUDIO.

    6 min
