LessWrong (30+ Karma)

Audio narrations of LessWrong posts.

  1. 2 HRS AGO

    “Introducing and Deprecating WoFBench” by jefftk

    We present and formally deprecate WoFBench, a novel test that compares the knowledge of Wings of Fire superfans to frontier AI models. The benchmark showed initial promise as a challenging evaluation, but unfortunately proved to be saturated on creation, as AI models and superfans produced output that was, to the extent of our ability to score responses, statistically indistinguishable from entirely correct. Benchmarks are important tools for tracking the rapid advancements in model capabilities, but they are struggling to keep up with LLM progress: frontier models now consistently achieve high scores on many popular benchmarks, raising questions about their continued ability to differentiate between models. In response, we introduce WoFBench, an evaluation suite designed to test recall and knowledge synthesis in the domain of Tui T. Sutherland's Wings of Fire universe. The superfans were identified via a careful search process, in which all members of the lead author's household were asked to complete a self-assessment of their knowledge of the Wings of Fire universe. The assessment consisted of a single question, with the text "do you think you know the Wings of Fire universe better than Gemini?" Two superfans were identified, who we keep [...]
    ---
    First published: March 1st, 2026
    Source: https://www.lesswrong.com/posts/YshqDtyzgWaJxthTo/introducing-and-deprecating-wofbench
    ---
    Narrated by TYPE III AUDIO.
    ---
    Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    6 min
  2. 4 HRS AGO

    “I’m Bearish On Personas For ASI Safety” by J Bostock

    TL;DR: Your base LLM has no examples of superintelligent AI in its training data. When you RL it into superintelligence, it will have to extrapolate to how a superintelligent Claude would behave. The LLM's extrapolation may not converge on optimizing what humanity would, on reflection, like to optimize, because these are different processes with different inductive biases. Intro: I'm going to take the Persona Selection Model as being roughly true, for now. Even on its own terms, it will fail. If the Persona Selection Model is false, we die in a different way. I'm going to present some specific arguments and scenarios, but the core of it is a somewhat abstract point: the Claude persona, although it currently behaves in a human-ish way, will not grow into a superintelligence in the same way that humans would. This means it will not grow into the same kind of superintelligence, with the same values, that human values would converge on. Since value is fragile, this is fatal for the future. I don't think this depends on the specifics of Claude's training, nor on how human values are instantiated, unless Claude's future training methods are specifically designed to work in the exact [...]
    ---
    Outline:
    (00:10) TL;DR
    (00:36) Intro
    (01:35) LLMs
    (01:39) Persona Selection and Other Models
    (03:06) Persona Theory As Alignment Plan
    (04:21) Gears of Personas
    (07:21) Complications
    (08:20) Reasoning and Chain-of-thought
    (10:00) Reinforcement Learning
    (11:40) Humans
    (11:43) Human Values
    (12:00) TL;DR
    (12:30) Goal-Models and Inductors
    (13:49) These Are Not The Same
    (16:34) Final Thoughts
    The original text contained 7 footnotes which were omitted from this narration.
    ---
    First published: March 1st, 2026
    Source: https://www.lesswrong.com/posts/fMgE3E54PdDcZhvm6/i-m-bearish-on-personas-for-asi-safety
    ---
    Narrated by TYPE III AUDIO.

    18 min
  3. 8 HRS AGO

    “Coherent Care” by abramdemski

    I've been trying to gather my thoughts for my next tiling theorem (agenda write-up here; first paper; second paper; recent project update). I have a lot of ideas for how to improve upon my work so far, and trying to narrow them down to an achievable next step has been difficult. However, my mind keeps returning to specific friends who are not yet convinced of Updateless Decision Theory (UDT). I am not out to argue that UDT is the perfect decision theory; see e.g. here and here. However, I strongly believe that those who don't see the appeal of UDT are missing something. My plan for the present essay is not to simply argue for UDT, but it is close to that: I'll give my pro-UDT arguments very carefully, so as to argue against naively updateful theories (CDT and EDT) while leaving room for some forms of updatefulness. The ideas here are primarily inspired by Decisions are for making bad outcomes inconsistent; I think the discussion there has the seeds of a powerful argument. My motivation for working on these ideas goes through AI Safety, but all the arguments in this particular essay will be from a purely love-of-knowledge [...]
    ---
    Outline:
    (03:57) Advice
    (05:46) Example 1: Transparent Newcomb
    (09:45) Example 2: Smoking Lesion
    (12:05) Design
    (14:37) Observation Calibration
    (16:46) Subjective State Calibration
    (21:48) Is calibration a reasonable requirement?
    (24:37) What do we do with miscalibrated cases?
    (26:10) Naturalism
    (29:20) Conclusion
    The original text contained 7 footnotes which were omitted from this narration.
    ---
    First published: February 27th, 2026
    Source: https://www.lesswrong.com/posts/CDkbYSFTwggGE8mWp/coherent-care
    ---
    Narrated by TYPE III AUDIO.

    31 min
  4. 13 HRS AGO

    [Linkpost] “Fibbers’ forecasts are worthless” (The D-Squared Digest One Minute MBA – Avoiding Projects Pursued By Morons 101) by Random Developer

    This is a link post. One of the very admirable things about the LessWrong community is their willingness to take arguments very seriously, regardless of who put that argument forward. In many circumstances, this is an excellent discipline! But if you're acting as a manager (or a voter), you often need to consider not just arguments, but also practical proposals made by specific agents: Should X be allowed to pursue project Y? Should I make decisions based on X claiming Z, when I cannot verify Z myself? One key difference is that these are not abstract arguments. They're practical proposals involving some specific entity X. And in cases like this, the credibility of X becomes relevant: Will X pursue project Y honestly and effectively? Is X likely to make accurate statements about Z? And in these cases, ignoring the known truthfulness of X can be a mistake. My thinking on this matter was influenced by a classic 2004 post by Dan Davies, The D-Squared Digest One Minute MBA – Avoiding Projects Pursued By Morons 101: Anyway, the secret to every analysis I’ve ever done of contemporary politics has been, more or [...]
    ---
    First published: February 28th, 2026
    Source: https://www.lesswrong.com/posts/cXDY9XBm5Wxzort29/fibbers-forecasts-are-worthless-the-d-squared-digest-one
    Linkpost URL: https://dsquareddigest.wordpress.com/2004/05/27/108573518762776451/
    ---
    Narrated by TYPE III AUDIO.

    5 min
  5. 1D AGO

    “Schelling Goodness, and Shared Morality as a Goal” by Andrew_Critch

    Also available in markdown at theMultiplicity.ai/blog/schelling-goodness. This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge amongst the participants, other than being from successful civilizations. Importantly: To say "X is Schelling-good" is not at all the same as saying "X is good". Rather, it will be defined as a claim about what a large class of agents would say, if they were required to choose between saying "X is good" and "X is bad" and aiming for a mutually agreed-upon answer. This distinction is crucial [...]
    ---
    Outline:
    (01:59) This essay is not very skimmable
    (03:44) Pro tanto morals, "is good", and "is bad"
    (06:39) Part One: The Schelling Participation Effect
    (13:52) What makes it work
    (15:50) The Schelling transformation on questions
    (19:10) Part Two: Schelling morality via the cosmic Schelling population
    (21:12) Scale-invariant adaptations
    (22:54) An example: stealing
    (30:32) Recognition versus endorsement versus adherence
    (31:34) The answer frequencies versus the answer
    (33:59) Ties are rare
    (35:06) Is the cosmic Schelling answer ever knowable with confidence?
    (36:02) Schelling participation effects, revisited
    (38:03) Is this just the mind projection fallacy?
    (39:42) When are cosmic Schelling morals easy to identify?
    (42:59) Scale invariance revisited
    (44:03) A second example: Pareto-positive trade
    (47:45) Harder questions and caveats
    (50:01) Ties are unstable
    (51:43) Isn't this assuming moral realism?
    (53:07) Don't these results depend on the distribution over beings?
    (54:41) What about the is-ought gap?
    (56:29) Tolerance, local variation, and freedom
    (58:25) Terrestrial Schelling-goodness
    (59:42) So what does "good" mean, again?
    (01:01:08) Implications for AI alignment
    (01:06:15) Conclusion and historical context
    (01:09:16) FAQ
    (01:09:20) Basic misunderstandings
    (01:12:20) More nuanced questions
    ---
    First published: February 28th, 2026
    Source: https://www.lesswrong.com/posts/TkBCR8XRGw7qmao6z/schelling-goodness-and-shared-morality-as-a-goal
    ---
    Narrated by TYPE III AUDIO.

    1h 15m
  6. 1D AGO

    “Anthropic and the DoW: Anthropic Responds” by Zvi

    The Department of War gave Anthropic until 5:01pm on Friday the 27th to either give the Pentagon ‘unfettered access’ to Claude for ‘all lawful uses,’ or else. With the ‘or else’ being not the sensible ‘okay we will cancel the contract then’ but also expanding to either being designated a supply chain risk or having the government invoke the Defense Production Act. It is perfectly legitimate for the Department of War to decide that it does not wish to continue on Anthropic's terms, and that it will terminate the contract. There is no reason things need be taken further than that. Undersecretary of State Jeremy Lewin: This isn’t about Anthropic or the specific conditions at issue. It's about the broader premise that technology deeply embedded in our military must be under the exclusive control of our duly elected/appointed leaders. No private company can dictate normative terms of use—which can change and are subject to interpretation—for our most sensitive national security systems. The @DeptofWar obviously can’t trust a system a private company can switch off at any moment. Timothy B. Lee: OK, so don’t renew their contract. Why are you threatening to go nuclear by declaring them [...] 
    ---
    Outline:
    (08:00) Good News: We Can Keep Talking
    (10:31) Once Again No You Do Not Need To Call Dario For Permission
    (15:22) The Pentagon Reiterates Its Demands And Threats
    (16:48) The Pentagon's Dual Threats Are Contradictory and Incoherent
    (18:27) The Pentagon's Position Has Unfortunate Implications
    (20:25) OpenAI Stands With Anthropic
    (22:48) xAI Stands On Unreliable Ground
    (25:25) Replacing Anthropic Would At Least Take Months
    (26:02) We Will Not Be Divided
    (27:50) This Risks Driving Other Companies Away
    (30:32) Other Reasons For Concern
    (32:10) Wisdom From A Retired General
    (35:06) Congress Urges Restraint
    (37:05) Reaction Is Overwhelmingly With Anthropic On This
    (40:52) Some Even More Highly Unhelpful Rhetoric
    (47:23) Other Summaries and Notes
    (48:32) Paths Forward
    ---
    First published: February 27th, 2026
    Source: https://www.lesswrong.com/posts/ppj7v4sSCbJjLye3D/anthropic-and-the-dow-anthropic-responds
    ---
    Narrated by TYPE III AUDIO.

    50 min
  7. 2D AGO

    “Getting Back To It” by sarahconstantin

    Artist: Lily Taylor
    It's been a while since I’ve written anything, and that doesn’t feel good. My writing voice has always been load-bearing to my identity, and if I don’t have anything to say, if I’m not “appearing in public”, it's a little bit destabilizing. Invisibility can be comfortable (and I’m less and less at home with the aggressive side of online discourse these days) but it's also a little bit of a cop-out. The fact is, I’ve been hiding. It feels like “writer's block” or like I “can’t think of anything to say”, but obviously that's suspect, and the real thing is that I can’t think of anything to say that's impeccable and beyond reproach and definitely won’t get criticized. Also, it's clearly a vicious cycle; the less I participate in public life, the fewer discussions I’m part of, and the fewer opportunities I have to riff off of what other people are saying. Life Stuff: So what have I been up to? Well, for one thing, I had a baby. This is Bruce. He is very good. For another, I’ve been job hunting. Solo consulting was fun, but I wasn’t getting many clients, and [...]
    ---
    Outline:
    (01:04) Life Stuff
    (02:27) Projects
    (03:54) 25. Miscellaneous Opinions
    ---
    First published: February 26th, 2026
    Source: https://www.lesswrong.com/posts/AYgby4f8EwhABX54q/getting-back-to-it
    ---
    Narrated by TYPE III AUDIO.

    14 min
  8. 2D AGO

    “New ARENA material: 8 exercise sets on alignment science & interpretability” by CallumMcDougall

    TLDR: This is a post announcing a lot of new ARENA material I've been working on for a while, which is now available for study here (currently on the alignment-science branch, but planned to be merged into main this Sunday). There's a set of exercises (each one contains about 1-2 days of material) on the following topics: Linear Probes (replication of the "Geometry of Truth" paper, plus Apollo's "Probing for Deception" work); Activation Oracles (based around this demo notebook, with additional exercises on model diffing); Attribution Graphs (you can build them from scratch here, including all the graph pruning implementations, and also use the circuit-tracer library); Emergent Misalignment (mostly based on Soligo & Turner's work; this also covers a lot of "basics of how to work with model organisms" like writing autoraters, using LoRA finetunes, etc.); Science of Misalignment (walkthrough of 2 case studies: Palisade's "Shutdown Resistance" & GDM's follow-up, and Alignment Faking); Reasoning Model Interpretability (guided replication of Thought Anchors plus the blackmail extension); LLM Psychology & Persona Vectors (replicates the "assistant axis" paper including the activation capping technique, and also has you create a persona vector extraction pipeline); Investigator Agents (basically takes you through building mini-Petri from [...]
    ---
    Outline:
    (00:13) TLDR
    (01:49) New material
    (01:52) Diagram showing eight AI safety and interpretability concepts including linear probes and activation oracles.
    (03:19) (1.3.1) Linear Probes
    (04:06) (1.3.4) Activation Oracles
    (04:58) (1.4.2) Attribution Graphs
    (06:15) (4.1) Emergent Misalignment
    (07:05) (4.2) Science of Misalignment
    (08:00) (4.3) Reasoning Model Interpretability
    (08:52) (4.4) LLM Psychology & Persona Vectors
    (09:51) (4.5) Investigator Agents
    (10:45) New Site Features
    (12:07) Logistics
    (12:57) Why use, in vibe-code world?
    (15:12) Feedback
    The original text contained 1 footnote which was omitted from this narration.
    ---
    First published: February 27th, 2026
    Source: https://www.lesswrong.com/posts/nQAN2vxv2ASjowMda/new-arena-material-8-exercise-sets-on-alignment-science-and
    ---
    Narrated by TYPE III AUDIO.

    16 min
