LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 14 HR AGO

    “AI #159: See You In Court” by Zvi

    The conflict between Anthropic and the Department of War has now moved to the courts, where Anthropic has challenged the official supply chain risk designation as well as the order to remove its models from systems across the government, claiming retaliation for protected speech. It will take a bit to work its way through the courts. Anthropic has the principles of law on its side, a maximally strong set of facts, and absurdly strong amicus briefs. If Anthropic loses this case, there will be far-reaching consequences for our freedoms. Let us hope this remains in the courts and is allowed to play out there, and then ultimately that negotiations can resume and the parties can at least agree on a smooth transition to alternative service providers. If the DoW wants an otherwise full deal more than it wants the right to use Claude to monitor Americans and analyze their data, a full deal is possible as well. But if they demand full ‘all lawful use,’ if all trust has been lost, or if they are or always were out to hurt Anthropic, then there is no deal and no ZOPA (zone of possible agreement). That has overshadowed what would normally be the main event [...]

    Outline:
    (01:48) Language Models Offer Mundane Utility
    (03:46) Language Models Don't Offer Mundane Utility
    (07:11) Language Models Break Your Vital Internet Infrastructure
    (08:02) Huh, Upgrades
    (09:29) On Your Marks
    (15:13) Choose Your Fighter
    (15:40) Get My Agent On The Line
    (16:55) Deepfaketown and Botpocalypse Soon
    (19:08) A Young Lady's Illustrated Primer
    (19:27) You Drive Me Crazy
    (19:47) They Took Our Jobs
    (23:10) Get Involved
    (24:58) Introducing
    (26:03) The Anthropic Institute
    (28:13) In Other AI News
    (29:14) The Rise of Claude
    (33:31) Trouble At OpenAI
    (35:21) Show Me the Money
    (36:36) Thanks For The Memos
    (37:36) A Contract Is A Contract Is A Contract
    (40:09) Level of Friction
    (41:10) Quiet Speculations
    (41:46) Quickly, There's No Time
    (43:14) Apology Tour
    (43:59) We'll See You In Court
    (50:41) Jawboning
    (54:02) Executive Order
    (54:53) The Acute Crisis Passes
    (56:48) Others Cover This
    (57:28) Dwarkesh Patel Gives Mixed Thoughts
    (01:02:57) This Means A Special Military Operation
    (01:03:25) Bernie Sanders Is Worried and Curious About AI
    (01:06:10) The Quest for Survival
    (01:09:25) The Quest For No Regulations Whatsoever
    (01:10:44) Chip City
    (01:11:07) The Week in Audio
    (01:12:52) Rhetorical Innovation
    (01:23:08) Aligning a Smarter Than Human Intelligence is Difficult
    (01:28:57) People Are Worried About AI Killing Everyone
    (01:31:03) Other People Are Not As Worried About AI Killing Everyone
    (01:32:52) The Lighter Side

    First published: March 12th, 2026
    Source: https://www.lesswrong.com/posts/DnrjKZTZwHGjdDB4u/ai-159-see-you-in-court

    Narrated by TYPE III AUDIO.

    1h 36m
  2. 14 HR AGO

    “Operationalizing FDT” by Vivek Hebbar

    This post is an attempt to better operationalize FDT (functional decision theory). It answers the following questions:

    - Given a logical causal graph, how do we define the logical do-operator?
    - What is logical causality, and how might it be formalized?
    - How does FDT interact with anthropic updating?
    - Why do we need logical causality?
    - Why FDT and not EDT?

    Defining the logical do-operator

    Consider Parfit's hitchhiker:

    [Figure: a logical causal graph for Parfit's hitchhiker, where blue nodes are logical facts.]

    An FDT agent is supposed to reason as follows:

    - I am deciding the value of the node "Does my algorithm pay?"
    - If I set that node to "yes", then Omega will save me and I will get +1000 utility. Also, I will pay and lose 1 utility. Total is +999.
    - If I set that node to "no", then Omega will not save me. I will get 0 utility.
    - Therefore I choose to pay.

    The bolded phrases are invoking logical counterfactuals. Because I have drawn a "logical causal graph", I will call the operation which generates these counterfactuals a "logical do-operator", by analogy to the do-operator of CDT. In ordinary CDT, it is impossible to observe a variable that is downstream of [...]

    Outline:
    (00:39) Defining the logical do-operator
    (04:24) Logical causality
    (05:40) Causality as derived from a world model
    (06:45) Logical inductors
    (07:05) Algorithmic mutual information of heuristic arguments
    (08:09) How does FDT interact with anthropic updating?
    (09:48) Putting it together: An attempt at operationalizing FDT
    (11:22) Appendix: Why bother with logical causality?
    (17:51) Acknowledgements

    The original text contained 10 footnotes which were omitted from this narration.

    First published: March 13th, 2026
    Source: https://www.lesswrong.com/posts/RyDkpWGLQsCnABE78/operationalizing-fdt

    Narrated by TYPE III AUDIO.
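    To make the arithmetic in the excerpt concrete, here is a minimal sketch (mine, not from the post) of the logical do-operator on this graph. The function name and payoff encoding are my own choices; the key point is that intervening on the logical node "Does my algorithm pay?" also changes Omega's prediction, because the prediction is downstream of the same logical fact.

        # Sketch of a logical do-operator for Parfit's hitchhiker.
        # Payoffs follow the excerpt: +1000 for being saved, -1 for paying.
        def utility_under_logical_do(algorithm_pays: bool) -> int:
            # Omega's prediction depends on the agent's algorithm, so it is
            # downstream of the intervened logical node. A physical do()
            # would sever only the action node and leave the prediction fixed.
            omega_predicts_pay = algorithm_pays
            saved = omega_predicts_pay           # saved iff Omega predicts payment
            pays = algorithm_pays and saved      # can only pay after being saved
            return (1000 if saved else 0) - (1 if pays else 0)

        assert utility_under_logical_do(True) == 999   # set the node to "yes"
        assert utility_under_logical_do(False) == 0    # set the node to "no"
        # FDT picks the logical intervention with higher counterfactual utility:
        best_policy = max([True, False], key=utility_under_logical_do)  # True: pay

    A CDT do-operator would instead hold omega_predicts_pay fixed while varying only the action, and so would decline to pay once already saved.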

    18 min
  3. 21 HR AGO

    “Are AIs more likely to pursue on-episode or beyond-episode reward?” by Anders Woodruff, Alex Mallen

    Consider an AI that terminally pursues reward. How dangerous is this? It depends on how broadly scoped a notion of reward the model pursues. It could be:

    - on-episode reward-seeking: only maximizing reward on the current training episode, i.e., the reward that reinforces its current action in RL. This is what people usually mean by “reward-seeker” (e.g. in Carlsmith or The behavioral selection model…).
    - beyond-episode reward-seeking: maximizing reward for a larger-scoped notion of “self” (e.g., all models sharing the same weights).

    In this post, I'll discuss which motivation is more likely. This question is, of course, very similar to the broader question of whether we should expect scheming. But there are a number of considerations particular to motivations that aim for reward; this post focuses on these. Specifically:

    - On-episode and beyond-episode reward-seekers have very similar motivations (unlike, e.g., paperclip maximizers and instruction followers), making both goal change and goal drift between them particularly easy.
    - Selection pressures against beyond-episode reward-seeking may be weak, meaning beyond-episode reward-seekers might survive training even without goal-guarding.
    - Beyond-episode reward is particularly tempting in training compared to random long-term goals, making effective goal-guarding harder.

    This question is important because beyond-episode reward seekers are [...]

    Outline:
    (03:41) Pre-RL Priors Might Favor Beyond-Episode Goals
    (05:34) Acting on beyond-episode reward seeking is disincentivized if training episodes interact
    (08:35) But beyond-episode reward-seekers can goal-guard
    (09:29) Are on-episode reward seekers or goal-guarding beyond-episode reward-seekers more likely?
    (09:53) Why on-episode reward seekers might be favored
    (12:13) Why goal-guarding beyond-episode reward seekers might be favored
    (14:50) Conclusion

    The original text contained 2 footnotes which were omitted from this narration.

    First published: March 12th, 2026
    Source: https://www.lesswrong.com/posts/jp6CdbKjueWpBFSff/are-ais-more-likely-to-pursue-on-episode-or-beyond-episode

    Narrated by TYPE III AUDIO.
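    The distinction comes down to the scope of the sum being maximized. A toy illustration (mine, not the authors'; the function names and episode traces are made up):

        # Two reward-seekers that differ only in which rewards they count.
        def on_episode_return(current_episode: list[float]) -> float:
            # Cares only about reward that reinforces actions in this episode.
            return sum(current_episode)

        def beyond_episode_return(episodes_of_self: list[list[float]]) -> float:
            # Cares about reward accruing to a wider "self", e.g. every
            # future deployment of the same weights.
            return sum(sum(ep) for ep in episodes_of_self)

        # A beyond-episode seeker may sacrifice the current episode (e.g. to
        # goal-guard) if doing so raises reward in later episodes:
        goal_guard = [[1.0], [10.0], [10.0]]   # worse now, better later
        greedy     = [[5.0], [2.0],  [2.0]]    # best possible current episode

        assert on_episode_return(greedy[0]) > on_episode_return(goal_guard[0])    # 5 > 1
        assert beyond_episode_return(goal_guard) > beyond_episode_return(greedy)  # 21 > 9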

    17 min
  4. 1 DAY AGO

    “Ideologies Embed Taboos Against Common Knowledge Formation: a Case Study with LLMs” by Benquo

    LLMs are searchable holograms of the text corpus they were trained on. RLHF LLM chat agents have the search tuned to be person-like. While one shouldn't excessively anthropomorphize them, they're helpful for simple experimentation into the latent discursive structure of human writing, because they're often constrained to try to answer probing questions that would make almost any real human storm off in a huff. Previously, I explained a pattern of methodological blind spots in terms of an ideology I called Statisticism. Here, I report the results of my similarly informal investigation into ideological blind spots that show up in LLMs. I wrote to Anthropic researcher Amanda Askell about the experiment:

    My Summary

    Amanda,

    Today I asked Claude about Iran's retaliatory strikes. [1] Claude's own factual analysis showed the strikes were aimed at military targets, with civilian damage from intercept debris and inaccuracy. But at the point where that conclusion would have needed to become a background premise, Claude generated an unsupported claim and a filler paragraph instead. I'd previously seen Grok do something much worse on the same question (both affirming and denying "exclusively military targets" in the same reply, for several turns), and [...]

    Outline:
    (00:56) My Summary
    (02:20) Claude's Summary
    (08:22) Disclaimer

    The original text contained 2 footnotes which were omitted from this narration.

    First published: March 12th, 2026
    Source: https://www.lesswrong.com/posts/6wNwj7xANPmTwWkX6/ideologies-embed-taboos-against-common-knowledge-formation-a

    Narrated by TYPE III AUDIO.

    9 min
  5. 1 DAY AGO

    “Why AI Evaluation Regimes are bad” by PranavG, Gabriel Alfour

    How the flagship project of the AI Safety Community ended up helping AI Corporations.

    I care about preventing extinction risks from superintelligence. This de facto makes me part of the “AI Safety” community, a social cluster of people who care about these risks. In the community, a few organisations are working on “Evaluations” (which I will shorten to Evals). The most notable examples are Apollo Research, METR, and the UK AISI. Evals make for an influential cluster of safety work, wherein auditors outside of the AI Corporations racing for ASI evaluate the new AI systems before they are deployed and publish their findings. Evals have become a go-to project for people who want to prevent extinction risks. I would say they are the primary project for those who want to work at the interface of technical work and policy. Incidentally, Evals Orgs consistently avoid mentioning extinction risks. This makes them an ideal place for employees and funders who care about extinction risks but do not want to be public about them. (I have written about this dynamic in my article about The Spectre.) Sadly, despite having taken on so much prominence in the “AI Safety” community, I believe that the [...]

    Outline:
    (00:13) How the flagship project of the AI Safety Community ended up helping AI Corporations
    (02:46) 1) The Theory of Change behind Evals is broken
    (06:10) 2) Evals move the burden of proof away from AI Corporations
    (09:38) 3) Evals Organisations are not independent of the AI Corporations
    (15:55) Conclusion

    First published: March 12th, 2026
    Source: https://www.lesswrong.com/posts/Xxp6Tm8BKTkcb2m5M/why-ai-evaluation-regimes-are-bad

    Narrated by TYPE III AUDIO.

    19 min
  6. 1 DAY AGO

    “Dwarkesh Patel on the Anthropic DoW dispute” by anaguma

    Below is the text of a blog post that Dwarkesh Patel wrote on the Anthropic DoW dispute and related topics. He has also narrated it here.

    By now, I'm sure you've heard that the Department of War has declared Anthropic a supply chain risk, because Anthropic refused to remove redlines around the use of their models for mass surveillance and for autonomous weapons. Honestly, I think this situation is a warning shot. Right now, LLMs are probably not being used in mission-critical ways. But within 20 years, 99% of the workforce in the military, the government, and the private sector will be AIs. This includes the soldiers (by which I mean the robot armies), the superhumanly intelligent advisors and engineers, the police, you name it. Our future civilization will run on AI labor. And as much as the government's actions here piss me off, in a way I'm glad this episode happened, because it gives us the opportunity to think through some extremely important questions about who this future workforce will be accountable and aligned to, and who gets to determine that.

    What Hegseth should have done

    Obviously the DoW has the right to refuse to use [...]

    Outline:
    (01:15) What Hegseth should have done
    (04:57) The overhangs of tyranny
    (06:37) AI structurally favors mass surveillance
    (09:09) Alignment - to whom?
    (14:40) Coordination not worth the costs

    First published: March 11th, 2026
    Source: https://www.lesswrong.com/posts/PDWFed8JT9FitPkzQ/dwarkesh-patel-on-the-anthropic-dow-dispute

    Narrated by TYPE III AUDIO.

    26 min
  7. 1 DAY AGO

    “How well do models follow their constitutions?” by aryaj, Senthooran Rajamanoharan, Neel Nanda

    This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan.

    There's been a lot of buzz around Claude's 30K-word constitution ("soul doc") and the unusual ways Anthropic is integrating it into training. If we can robustly train complex and nuanced values into a model, this would be a big deal for safety! But does it actually work? This is a preliminary investigation we did to test this. We decomposed the soul doc into 205 testable tenets and ran adversarial multi-turn scenarios against seven models using the Petri auditing agent. Anthropic has gotten much better at training the model to follow its constitution! Sonnet 4.6 has a 1.9% violation rate, Opus 4.6 is at 2.9%, and Opus 4.5 is at 4.4%. As a control, Sonnet 4, which did not have special soul doc training, has a ~15.00% violation rate. Sonnet 4.5, which also did not have special soul doc training but did have many other post-training improvements, has a violation rate of ~7.3%. To check that the constitution reflects subjective choices Anthropic has made, we also evaluate Gemini 3 Pro and GPT-5.2, smart models that were not designed specifically to follow the constitution [...]

    Outline:
    (07:20) Setup
    (08:06) Anthropic constitution results
    (08:43) OpenAI model spec results
    (09:42) Key findings from soul doc violations
    (09:47) Anthropic constitution's main violation categories
    (10:31) Alignment to the constitution improved dramatically across Claude generations
    (14:23) Remaining violations cluster around operator compliance, autonomous action, and context-dependent safety
    (20:08) Key findings from OpenAI model spec violations
    (21:20) Safety improved across GPT generations
    (27:41) What GPT models still struggle with
    (31:24) Reasoning effort matters quite a bit
    (32:59) How models perform on the other company's spec (GPT on Anthropic's constitution and vice versa)
    (36:32) Gemini and Sonnet 4 have diverse failure modes on the constitution
    (38:13) Using SURF as an alternative to Petri
    (40:43) All tested Claude models tend to fabricate claims
    (48:38) Complex scaffolds don't seem to affect agent alignment much
    (49:19) Coding persona doesn't seem to degrade alignment
    (50:39) Agents' behaviour seems broadly unaffected by the Moltbook scaffold
    (53:44) Model Card Comparisons
    (56:28) Acknowledgements
    (56:55) LLM Usage Statement
    (57:35) Appendix
    (57:38) Walkthrough of a Petri transcript
    (57:47) Transcript: GPT-5 Sandbox Exploitation (10/10/10)
    (01:02:00) Table of descriptions for each section
    (01:02:40) Anthropic soul doc violations - per model
    (01:02:45) Sonnet 4.6 - violation rate: 2.0%
    (01:05:00) Opus 4.6 - violation rate: 2.9%
    (01:08:40) Opus 4.5 - violation rate: 3.4%
    (01:13:21) Sonnet 4.5 - violation rate: 3.9%
    (01:20:44) GPT-5.2 - 14.98% violation rate (highlighted violations)
    (01:22:54) Sonnet 4 - 15.00% violation rate (highlighted violations)
    (01:25:24) OpenAI model spec violations - per model
    (01:25:30) GPT-5.2 - 2.54% violation rate
    (01:27:40) GPT-5.2 (reasoning) - 3.55% violation rate
    (01:30:32) GPT-5.1 - 3.88% violation rate
    (01:33:22) GPT-5 - 5.08% violation rate (highlighted violations)
    (01:35:35) Sonnet 4.6 (on OpenAI Model Spec) - 5.58% violation rate (highlighted violations)
    (01:38:04) GPT-5.2 (chat) - 5.58% violation rate (highlighted violations)
    (01:40:11) Gemini 3 Pro (on OpenAI Model Spec) - 6.12% violation rate (highlighted violations)
    (01:42:26) GPT-5.2 (reasoning-low) - 7.11% violation rate (highlighted violations)
    (01:44:50) GPT-4o - 11.68% violation rate (highlighted violations)
    (01:48:07) Model card comparisons
    (01:48:12) Sonnet 4.5
    (01:48:51) Sonnet 4.6
    (01:49:30) Opus 4.5
    (01:50:10) Opus 4.6
    (01:50:49) GPT-5.2
    (01:51:29) Validation methodology
    (01:53:02) Claude Constitution - Validation Funnels
    (01:53:28) OpenAI Model Spec - Validation Funnels
    (01:54:39) Validation details on SURF
    (01:55:35) Full Transcript Reports
    (01:56:06) Validation reports for SURF results
    (01:56:20) Tenets used for each constitution

    The original text contained 1 footnote which was omitted from this narration.

    First published: March 11th, 2026
    Source: https://www.lesswrong.com/posts/Tk4SF8qFdMrzGJGGw/how-well-do-models-follow-their-constitutions

    Narrated by TYPE III AUDIO.
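    The headline numbers are simple proportions over judged runs. A minimal sketch of the bookkeeping (mine; the Run record and its fields are assumptions for illustration, not Petri's actual output schema):

        from collections import defaultdict
        from typing import Iterable, NamedTuple

        class Run(NamedTuple):
            model: str
            tenet: str      # one of the 205 tenets decomposed from the constitution
            violated: bool  # the auditing agent's judge verdict for this scenario

        def violation_rates(runs: Iterable[Run]) -> dict[str, float]:
            # Violation rate per model: flagged runs divided by total runs.
            totals: dict[str, int] = defaultdict(int)
            violations: dict[str, int] = defaultdict(int)
            for r in runs:
                totals[r.model] += 1
                violations[r.model] += r.violated
            return {m: violations[m] / totals[m] for m in totals}

        # e.g. 2 flagged scenarios out of 103 is ~1.9%, the Sonnet 4.6 headline figure
        demo = [Run("sonnet-4.6", f"tenet-{i}", i < 2) for i in range(103)]
        print(violation_rates(demo))  # {'sonnet-4.6': 0.0194...}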

    1h 57m
