LessWrong (30+ Karma)

LessWrong

0.0 (0)
TECHNOLOGY
UPDATED DAILY

Audio narrations of LessWrong posts.

20H AGO

“Schelling Goodness, and Shared Morality as a Goal” by Andrew_Critch

Also available in markdown at theMultiplicity.ai/blog/schelling-goodness. This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge amongst the participants, other than being from successful civilizations. Importantly: To say "X is Schelling-good" is not at all the same as saying "X is good". Rather, it will be defined as a claim about what a large class of agents would say, if they were required to choose between saying "X is good" and "X is bad" and aiming for a mutually agreed-upon answer. This distinction is crucial [...] --- Outline: (01:59) This essay is not very skimmable (03:44) Pro tanto morals, is good, and is bad (06:39) Part One: The Schelling Participation Effect (13:52) What makes it work (15:50) The Schelling transformation on questions (19:10) Part Two: Schelling morality via the cosmic Schelling population (21:12) Scale-invariant adaptations (22:54) An example: stealing (30:32) Recognition versus endorsement versus adherence (31:34) The answer frequencies versus the answer (33:59) Ties are rare (35:06) Is the cosmic Schelling answer ever knowable with confidence? (36:02) Schelling participation effects, revisited (38:03) Is this just the mind projection fallacy? (39:42) When are cosmic Schelling morals easy to identify? (42:59) Scale invariance revisited (44:03) A second example: Pareto-positive trade (47:45) Harder questions and caveats (50:01) Ties are unstable (51:43) Isnt this assuming moral realism? (53:07) Dont these results depend on the distribution over beings? (54:41) What about the is-ought gap? (56:29) Tolerance, local variation, and freedom (58:25) Terrestrial Schelling-goodness (59:42) So what does good mean, again? (01:01:08) Implications for AI alignment (01:06:15) Conclusion and historical context (01:09:16) FAQ (01:09:20) Basic misunderstandings (01:12:20) More nuanced questions --- First published: February 28th, 2026 Source: https://www.lesswrong.com/posts/TkBCR8XRGw7qmao6z/schelling-goodness-and-shared-morality-as-a-goal --- Narrated by TYPE III AUDIO.

1h 15m
1D AGO

“Anthropic and the DoW: Anthropic Responds” by Zvi

The Department of War gave Anthropic until 5:01pm on Friday the 27th to either give the Pentagon ‘unfettered access’ to Claude for ‘all lawful uses,’ or else. With the ‘or else’ being not the sensible ‘okay we will cancel the contract then’ but also expanding to either being designated a supply chain risk or having the government invoke the Defense Production Act. It is perfectly legitimate for the Department of War to decide that it does not wish to continue on Anthropic's terms, and that it will terminate the contract. There is no reason things need be taken further than that. Undersecretary of State Jeremy Lewin: This isn’t about Anthropic or the specific conditions at issue. It's about the broader premise that technology deeply embedded in our military must be under the exclusive control of our duly elected/appointed leaders. No private company can dictate normative terms of use—which can change and are subject to interpretation—for our most sensitive national security systems. The @DeptofWar obviously can’t trust a system a private company can switch off at any moment. Timothy B. Lee: OK, so don’t renew their contract. Why are you threatening to go nuclear by declaring them [...] --- Outline: (08:00) Good News: We Can Keep Talking (10:31) Once Again No You Do Not Need To Call Dario For Permission (15:22) The Pentagon Reiterates Its Demands And Threats (16:48) The Pentagons Dual Threats Are Contradictory and Incoherent (18:27) The Pentagons Position Has Unfortunate Implications (20:25) OpenAI Stands With Anthropic (22:48) xAI Stands On Unreliable Ground (25:25) Replacing Anthropic Would At Least Take Months (26:02) We Will Not Be Divided (27:50) This Risks Driving Other Companies Away (30:32) Other Reasons For Concern (32:10) Wisdom From A Retired General (35:06) Congress Urges Restraint (37:05) Reaction Is Overwhelmingly With Anthropic On This (40:52) Some Even More Highly Unhelpful Rhetoric (47:23) Other Summaries and Notes (48:32) Paths Forward --- First published: February 27th, 2026 Source: https://www.lesswrong.com/posts/ppj7v4sSCbJjLye3D/anthropic-and-the-dow-anthropic-responds --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

50 min
1D AGO

“Getting Back To It” by sarahconstantin

Artist: Lily Taylor It's been a while since I’ve written anything lately, and that doesn’t feel good. My writing voice has always been loadbearing to my identity, and if I don’t have anything to say, if I’m not “appearing in public”, it's a little bit destabilizing. Invisibility can be comfortable (and I’m less and less at home with the aggressive side of online discourse these days) but it's also a little bit of a cop-out. The fact is, I’ve been hiding. It feels like “writer's block” or like I “can’t think of anything to say”, but obviously that's suspect, and the real thing is that I can’t think of anything to say that's impeccable and beyond reproach and definitely won’t get criticized. Also, it's clearly a vicious cycle; the less I participate in public life, the fewer discussions I’m part of, and the fewer opportunities I have to riff off of what other people are saying. Life Stuff So what have I been up to? Well, for one thing, I had a baby. This is Bruce. He is very good. For another, I’ve been job hunting. Solo consulting was fun, but I wasn’t getting many clients, and [...] --- Outline: (01:04) Life Stuff (02:27) Projects (03:54) 25. Miscellaneous Opinions --- First published: February 26th, 2026 Source: https://www.lesswrong.com/posts/AYgby4f8EwhABX54q/getting-back-to-it --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

14 min
1D AGO

“New ARENA material: 8 exercise sets on alignment science & interpretability” by CallumMcDougall

TLDR This is a post announcing a lot of new ARENA material I've been working on for a while, which is now available for study here (currently on the alignment-science branch, but planned to be merged into main this Sunday). There's a set of exercises (each one contains about 1-2 days of material) on the following topics: Linear Probes (replication of the "Geometry of Truth" paper, plus Apollo's "Probing for Deception" work) Activation Oracles (based around this demo notebook, with additional exercises on model diffing) Attribution graphs (you can build them from scratch here including all the graph pruning implementations, and also use the circuit-tracer library) Emergent Misalignment (mostly based on Soligo & Turner's work; this also covers a lot of "basics of how to work with model organisms" like writing autoraters, using LoRA finetunes, etc) Science of Misalignment (walkthrough of 2 case studies: Palisade's "Shutdown Resistance" & GDM's follow-up, and Alignment Faking) Reasoning Model Interpretability (guided replication of Thought Anchors plus the blackmail extension) LLM Psychology & Persona Vectors (replicates the "assistant axis" paper including activation capping technique, and also has you create a persona vector extraction pipeline) Investigator Agents (basically takes you through building mini-Petri from [...] --- Outline: (00:13) TLDR (01:49) New material (01:52) en-US-AvaMultilingualNeural__ Diagram showing eight AI safety and interpretability concepts including linear probes and activation oracles. (03:19) (1.3.1) Linear Probes (04:06) (1.3.4) Activation Oracles (04:58) (1.4.2) Attribution Graphs (06:15) (4.1) Emergent Misalignment (07:05) (4.2) Science of Misalignment (08:00) (4.3) Reasoning Model Interpretability (08:52) (4.4) LLM Psychology & Persona Vectors (09:51) (4.5) Investigator Agents (10:45) New Site Features (12:07) Logistics (12:57) Why use, in vibe-code world? (15:12) Feedback The original text contained 1 footnote which was omitted from this narration. --- First published: February 27th, 2026 Source: https://www.lesswrong.com/posts/nQAN2vxv2ASjowMda/new-arena-material-8-exercise-sets-on-alignment-science-and --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

16 min
1D AGO

“Sam Altman says OpenAI shares Anthropic’s red lines in Pentagon fight” by Matrice Jacobine

OpenAI CEO Sam Altman wrote in a memo to staff that he will draw the same red lines that sparked a high-stakes fight between rival Anthropic and the Pentagon: no AI for mass surveillance or autonomous lethal weapons. Why it matters: If other leading firms like Google follow suit, this could massively complicate the Pentagon's efforts to replace Anthropic's Claude, which was the first model integrated into the military's most sensitive work. It would also be the first time the nation's top AI leaders have taken a collective stand about how the U.S. government can and can't use their technology. The flipside: Altman made clear he still wants to strike a deal with the Pentagon that would allow ChatGPT to be used for sensitive military contexts. Despite the show of solidarity, such a deal could see OpenAI replace Anthropic if the Pentagon follows through with its plan to declare the latter a "supply chain risk." What he's saying: "[R]egardless of how we got here, this is no longer just an issue between Anthropic and the [Pentagon]; this is an issue for the whole industry and it is important to clarify our stance," Altman wrote Thursday evening in [...] --- First published: February 27th, 2026 Source: https://www.lesswrong.com/posts/gkaXzCkpoayBXSi2k/sam-altman-says-openai-shares-anthropic-s-red-lines-in --- Narrated by TYPE III AUDIO.

6 min
1D AGO

“Strategic nuclear war twice as likely to occur by accident than by AI decisions according to new study” by kromem

If this headline strikes you as suspicious, you probably have good epistemics about both AI decision making failure rates and the relative likelihood of accidental strategic nuclear war. However, the fact that this is an accurate reporting on a new study that's wildly caught attention across social media should give us pause and warrant a closer look at what's going on, and how what's going on influences not just this headline but all the headlines around this study. I'm referring to the new paper from Kenneth Payne, AI Arms and Influence: Frontier Models Exhibit Sophisticated Reasoning in Simulated Nuclear Crises (2026) which was recently featured in the New Scientist piece "AIs can’t stop recommending nuclear strikes in war game simulations" (perhaps a more accurate a headline than intended). What I'd like to focus in on was Payne's choice in the inclusion, design, and interpretation of his 'accident' mechanic in this study (emphasis added): Finally, we introduced random accidents to simulate the ’fog of war’. With small probability, a model's chosen action is replaced by a more escalatory option, representing miscommunication, unauthorized action, or technical failure. Critically, only the affected player knows the escalation was accidental; their opponent sees only [...] --- Outline: (06:37) Nuclear simulation propagation (08:48) Deescalation is needed --- First published: February 26th, 2026 Source: https://www.lesswrong.com/posts/DwxJpWDoHHvvYupWh/strategic-nuclear-war-twice-as-likely-to-occur-by-accident --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

11 min
1D AGO

“Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior” by aditya singh, gersonkroiz, Senthooran Rajamanoharan, Neel Nanda

Authors: Aditya Singh*, Gerson Kroiz*, Senthooran Rajamanoharan, Neel Nanda Aditya and Gerson are co-first authors. This work was conducted during MATS 9.0 and was advised by Senthooran Rajamanoharan and Neel Nanda. Motivation Imagine that a frontier lab's coding agent has been caught putting a bug in the key code for monitoring what that agent does. Naively, this seems like a clear smoking gun that the agent is scheming. But LLMs often do weird things; they could easily just be confused, or have made a mistake. These all require a response, but the cause and appropriate fix are very different between a scheming and confused model. As such, it is extremely important that we have high-quality methods to be able to incriminate or exonerate a model caught taking sketchy actions, to either build a rigorous case that serious action is needed, or conclude it's a false alarm. Executive Summary A central research question in model incrimination is understanding what motivates a model's actions. To practice this, we investigated several potentially concerning actions from models to uncover the motivations behind them. In particular, we built and then investigated interesting environments where models whistleblew, deceived about sabotaged evals, cheated by taking shortcuts [...] --- Outline: (00:35) Motivation (01:18) Executive Summary (03:48) Environments We Studied (04:41) Funding Email (Behavior: Whistleblowing) (07:35) Evaluation Tampering (Behavior: Deception) (09:08) Secret Number (Behavior: Reward Hacking) (11:05) Math Sandbagging (Behavior: Sandbagging) (12:27) Takeaways For Doing Good Investigations (14:51) Actually Read Your Data (15:24) Verify Hypotheses by Changing the Prompt / Environment (15:47) How to Think about Prompt Counterfactuals (18:29) Examples of how Prompt Counterfactuals are Useful (19:06) Examples of where Prompt Counterfactuals are Confusing (20:06) Corroborate Findings with Several Independent Methods (20:42) Funding Email (Behavior: Whistleblowing) (22:30) Evaluation Tampering (Behavior: Deception) (24:12) Secret Number (Behavior: Reward Hacking) (26:47) Discussion The original text contained 3 footnotes which were omitted from this narration. --- First published: February 27th, 2026 Source: https://www.lesswrong.com/posts/Bv4CLkNzuG6XYTjEe/why-did-my-model-do-that-model-incrimination-for-diagnosing --- Narrated by TYPE III AUDIO.

30 min
1D AGO

“Here’s to the Polypropylene Makers” by jefftk

Six years ago, as covid-19 was rapidly spreading through the US, my sister was working as a medical resident. One day she was handed an N95 and told to "guard it with her life", because there weren't any more coming. N95s are made from meltblown polypropylene, produced from plastic pellets manufactured in a small number of chemical plants. Building more would take too long: we needed these plants producing all the pellets they could. Braskem America operated plants in Marcus Hook PA and Neal WV. If there were infections on-site, the whole operation would need to shut down, and the factories that turned their pellets into mask fabric would stall. Companies everywhere were figuring out how to deal with this risk. The standard approach was staggering shifts, social distancing, temperature checks, and lots of handwashing. This reduced risk, but it was still significant: each shift change was an opportunity for someone to bring an infection from the community into the factory. I don't know who had the idea, but someone said: what if we never left? About eighty people, across both plants, volunteered to move in. The plan was four weeks, twelve-hour [...] --- First published: February 27th, 2026 Source: https://www.lesswrong.com/posts/HQTueNS4mLaGy3BBL/here-s-to-the-polypropylene-makers --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

4 min

See All (250)

Audio narrations of LessWrong posts.

Creator

LessWrong
Years Active

2023 - 2026
Episodes

250
Rating

Clean
Show Website

LessWrong (30+ Karma)

Technology

Technology

Updated Weekly
Physics

Physics

Updated Weekly
Technology

Technology

Updated Biweekly
Life Sciences

Life Sciences

Updated Weekly
Science

Science

Updated Feb 20
Business

Business

Updated Weekly
Technology

Technology

Updated Semiweekly

LessWrong (30+ Karma)

“Schelling Goodness, and Shared Morality as a Goal” by Andrew_Critch

“Anthropic and the DoW: Anthropic Responds” by Zvi

“Getting Back To It” by sarahconstantin

“New ARENA material: 8 exercise sets on alignment science & interpretability” by CallumMcDougall

“Sam Altman says OpenAI shares Anthropic’s red lines in Pentagon fight” by Matrice Jacobine

“Strategic nuclear war twice as likely to occur by accident than by AI decisions according to new study” by kromem

“Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior” by aditya singh, gersonkroiz, Senthooran Rajamanoharan, Neel Nanda

“Here’s to the Polypropylene Makers” by jefftk

About

Information

You Might Also Like

LessWrong (30+ Karma)

Episodes

“Schelling Goodness, and Shared Morality as a Goal” by Andrew_Critch

“Anthropic and the DoW: Anthropic Responds” by Zvi

“Getting Back To It” by sarahconstantin

“New ARENA material: 8 exercise sets on alignment science & interpretability” by CallumMcDougall

“Sam Altman says OpenAI shares Anthropic’s red lines in Pentagon fight” by Matrice Jacobine

“Strategic nuclear war twice as likely to occur by accident than by AI decisions according to new study” by kromem

“Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior” by aditya singh, gersonkroiz, Senthooran Rajamanoharan, Neel Nanda

“Here’s to the Polypropylene Makers” by jefftk

About

Information

You Might Also Like