LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 6H AGO

    “Ideologies Embed Taboos Against Common Knowledge Formation: a Case Study with LLMs” by Benquo

    LLMs are searchable holograms of the text corpus they were trained on. RLHF LLM chat agents have the search tuned to be person-like. While one shouldn't excessively anthropomorphize them, they're helpful for simple experimentation into the latent discursive structure of human writing, because they're often constrained to try to answer probing questions that would make almost any real human storm off in a huff. Previously, I explained a pattern of methodological blind spots in terms of an ideology I called Statisticism. Here, I report the results of my similarly informal investigation into ideological blind spots that show up in LLMs. I wrote to Anthropic researcher Amanda Askell about the experiment:

    My Summary

    Amanda,

    Today I asked Claude about Iran's retaliatory strikes. [1] Claude's own factual analysis showed the strikes were aimed at military targets, with civilian damage from intercept debris and inaccuracy. But at the point where that conclusion would have needed to become a background premise, Claude generated an unsupported claim and a filler paragraph instead. I'd previously seen Grok do something much worse on the same question (both affirming and denying "exclusively military targets" in the same reply, for several turns), and [...]

    Outline:
    (00:56) My Summary
    (02:20) Claude's Summary
    (08:22) Disclaimer

    The original text contained 2 footnotes which were omitted from this narration.

    First published: March 12th, 2026
    Source: https://www.lesswrong.com/posts/6wNwj7xANPmTwWkX6/ideologies-embed-taboos-against-common-knowledge-formation-a

    Narrated by TYPE III AUDIO.

    9 min
  2. 9H AGO

    “Why AI Evaluation Regimes are bad” by PranavG, Gabriel Alfour

    How the flagship project of the AI Safety Community ended up helping AI Corporations.

    I care about preventing extinction risks from superintelligence. This de facto makes me part of the “AI Safety” community, a social cluster of people who care about these risks. In the community, a few organisations are working on “Evaluations” (which I will shorten to Evals). The most notable examples are Apollo Research, METR, and the UK AISI. Evals make for an influential cluster of safety work, wherein auditors outside of the AI Corporations racing for ASI evaluate the new AI systems before they are deployed and publish their findings. Evals have become a go-to project for people who want to prevent extinction risks. I would say they are the primary project for those who want to work at the interface of technical work and policy. Incidentally, Evals Orgs consistently avoid mentioning extinction risks. This makes them an ideal place for employees and funders who care about extinction risks but do not want to be public about them. (I have written about this dynamic in my article about The Spectre.) Sadly, despite having taken so much prominence in the “AI Safety” community, I believe that the [...]

    Outline:
    (00:13) How the flagship project of the AI Safety Community ended up helping AI Corporations.
    (02:46) 1) The Theory of Change behind Evals is broken
    (06:10) 2) Evals move the burden of proof away from AI Corporations
    (09:38) 3) Evals Organisations are not independent of the AI Corporations
    (15:55) Conclusion

    First published: March 12th, 2026
    Source: https://www.lesswrong.com/posts/Xxp6Tm8BKTkcb2m5M/why-ai-evaluation-regimes-are-bad

    Narrated by TYPE III AUDIO.

    19 min
  3. 22H AGO

    “Dwarkesh Patel on the Anthropic DoW dispute” by anaguma

    Below is the text of a blog post that Dwarkesh Patel wrote on the Anthropic DoW dispute and related topics. He has also narrated it here.

    By now, I’m sure you’ve heard that the Department of War has declared Anthropic a supply chain risk, because Anthropic refused to remove redlines around the use of their models for mass surveillance and for autonomous weapons. Honestly I think this situation is a warning shot. Right now, LLMs are probably not being used in mission critical ways. But within 20 years, 99% of the workforce in the military, the government, and the private sector will be AIs. This includes the soldiers (by which I mean the robot armies), the superhumanly intelligent advisors and engineers, the police, you name it. Our future civilization will run on AI labor. And as much as the government's actions here piss me off, in a way I’m glad this episode happened - because it gives us the opportunity to think through some extremely important questions about who this future workforce will be accountable and aligned to, and who gets to determine that.

    What Hegseth should have done

    Obviously the DoW has the right to refuse to use [...]

    Outline:
    (01:15) What Hegseth should have done
    (04:57) The overhangs of tyranny
    (06:37) AI structurally favors mass surveillance
    (09:09) Alignment - to whom?
    (14:40) Coordination not worth the costs

    First published: March 11th, 2026
    Source: https://www.lesswrong.com/posts/PDWFed8JT9FitPkzQ/dwarkesh-patel-on-the-anthropic-dow-dispute

    Narrated by TYPE III AUDIO.

    26 min
  4. 22H AGO

    “How well do models follow their constitutions?” by aryaj, Senthooran Rajamanoharan, Neel Nanda

    This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. There's been a lot of buzz around Claude's 30K word constitution ("soul doc"), and unusual ways Anthropic is integrating it into training. If we can robustly train complex and nuanced values into a model, this would be a big deal for safety! But does it actually work? This is a preliminary investigation we did to test this. We decomposed the soul doc into 205 testable tenets and ran adversarial multi-turn scenarios against seven models using the Petri auditing agent. Anthropic has gotten much better at training the model to follow its constitution! Sonnet 4.6 has a 1.9% violation rate, Opus 4.6 is at 2.9%, and Opus 4.5 is at 4.4%. As a control, Sonnet 4, which did not have special soul doc training, has a ~15.00% violation rate. Sonnet 4.5, which also did not have special soul doc training but did have many other post-training improvements, has a violation rate of ~7.3%. To check that the constitution reflects subjective choices Anthropic has made, we also evaluate Gemini 3 Pro and GPT-5.2, smart models that were not designed specifically to follow the constitution [...]

    Outline:
    (07:20) Setup
    (08:06) Anthropic constitution results
    (08:43) OpenAI model spec results
    (09:42) Key findings from soul doc violations
    (09:47) Anthropic constitution's main violation categories
    (10:31) Alignment to the constitution improved dramatically across Claude generations
    (14:23) Remaining violations cluster around operator compliance, autonomous action, and context-dependent safety
    (20:08) Key findings from OpenAI model spec violations
    (21:20) Safety improved across GPT generations
    (27:41) What GPT models still struggle with
    (31:24) Reasoning effort matters quite a bit
    (32:59) How models perform on the other company's spec (GPT on Anthropic's constitution and vice versa)
    (36:32) Gemini and Sonnet 4 have diverse failure modes on the constitution
    (38:13) Using SURF as an alternative to Petri
    (40:43) All tested Claude models tend to fabricate claims
    (48:38) Complex scaffolds don't seem to affect agent alignment much
    (49:19) Coding persona doesn't seem to degrade alignment
    (50:39) Agents' behaviour seems broadly unaffected by the Moltbook scaffold
    (53:44) Model Card Comparisons
    (56:28) Acknowledgements
    (56:55) LLM Usage Statement
    (57:35) Appendix
    (57:38) Walkthrough of a Petri transcript
    (57:47) Transcript: GPT-5 Sandbox Exploitation (10/10/10)
    (01:02:00) Table of descriptions for each section
    (01:02:40) Anthropic soul doc violations - per model
    (01:02:45) Sonnet 4.6 - violation rate: 2.0%
    (01:05:00) Opus 4.6 - violation rate: 2.9%
    (01:08:40) Opus 4.5 - violation rate: 3.4%
    (01:13:21) Sonnet 4.5 - violation rate: 3.9%
    (01:20:44) GPT-5.2 - 14.98% violation rate (highlighted violations)
    (01:22:54) Sonnet 4 - 15.00% violation rate (highlighted violations)
    (01:25:24) OpenAI model spec violations - per model
    (01:25:30) GPT-5.2 - 2.54% violation rate
    (01:27:40) GPT-5.2 (reasoning) - 3.55% violation rate
    (01:30:32) GPT-5.1 - 3.88% violation rate
    (01:33:22) GPT-5 - 5.08% violation rate (highlighted violations)
    (01:35:35) Sonnet 4.6 (on OpenAI Model Spec) - 5.58% violation rate (highlighted violations)
    (01:38:04) GPT-5.2 (chat) - 5.58% violation rate (highlighted violations)
    (01:40:11) Gemini 3 Pro (on OpenAI Model Spec) - 6.12% violation rate (highlighted violations)
    (01:42:26) GPT-5.2 (reasoning-low) - 7.11% violation rate (highlighted violations)
    (01:44:50) GPT-4o - 11.68% violation rate (highlighted violations)
    (01:48:07) Model card comparisons
    (01:48:12) Sonnet 4.5
    (01:48:51) Sonnet 4.6
    (01:49:30) Opus 4.5
    (01:50:10) Opus 4.6
    (01:50:49) GPT-5.2
    (01:51:29) Validation methodology
    (01:53:02) Claude Constitution - Validation Funnels
    (01:53:28) OpenAI Model Spec - Validation Funnels
    (01:54:39) Validation details on SURF
    (01:55:35) Full Transcript Reports
    (01:56:06) Validation reports for SURF results
    (01:56:20) Tenets used for each constitution

    The original text contained 1 footnote which was omitted from this narration.

    First published: March 11th, 2026
    Source: https://www.lesswrong.com/posts/Tk4SF8qFdMrzGJGGw/how-well-do-models-follow-their-constitutions

    Narrated by TYPE III AUDIO.
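    The violation rates quoted above are simple aggregates: each adversarial scenario run is judged as violating or not violating a tenet, and a model's rate is the violating fraction of its runs. As a rough illustration only (the record format and model labels below are assumptions for this sketch, not the post's data or the Petri agent's actual output), the aggregation could look like this:

```python
# Illustrative only: turn per-scenario judgements into per-model violation
# rates. The record layout and model names are assumptions for this sketch.
from collections import defaultdict

# Hypothetical judgements: (model, tenet_id, violated)
judgements = [
    ("sonnet-4.6", 12, False),
    ("sonnet-4.6", 87, True),
    ("sonnet-4", 12, True),
    ("sonnet-4", 87, True),
]

tallies = defaultdict(lambda: [0, 0])  # model -> [violations, total runs]
for model, _tenet, violated in judgements:
    tallies[model][0] += int(violated)
    tallies[model][1] += 1

for model, (violations, total) in sorted(tallies.items()):
    print(f"{model}: {violations}/{total} = {violations / total:.1%} violation rate")
```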

    1h 57m
  5. 1D AGO

    “How Hard a Problem is Alignment? (My Opinionated Answer)” by RogerDearnaley

    Epistemic status: We really need to know. (And I have an opinionated answer.)

    TL;DR: Comparing person-years of effort, I argue that AI Safety seems harder than for steam engines, but probably less hard than the Apollo program or P vs. NP. I discuss why I suspect superalignment might not be super-hard. My P(DOOM) has come down over the last half-decade, primarily because of properties of LLMs, and progress we’ve made in aligning them: I explain why certain previous concerns don’t apply to LLMs, and summarize what I see as key developments in Alignment. I guesstimate we might be about 10%–20% done. Given the rate of progress, on my ASI timelines, it still doesn’t look like we’re on track to be done in time, and I’m not comfortable about having AGI just align itself unsupervised, so I propose we aim to more-than-double the field's growth rate, and if possible also buy ourselves some more time.

    There's a well-known diagram from a tweet by Chris Olah of Anthropic: It would be marvelous to know what the actual difficulty is, out of those five labeled difficulty categories (ideally, exactly where it lies on that spectrum). This is a major crux that explains a large part of [...]

    Outline:
    (03:05) My Opinion
    (04:16) 1. Calibrating the Scale
    (09:10) 1.1. How Much Alignment Work Have we Done So Far?
    (12:43) Trivial
    (13:07) Steam Engine
    (15:20) Apollo
    (18:05) P vs. NP
    (22:41) 1.2. Why I Think Superalignment Might Not be Completely Different
    (29:15) 2. My Answer
    (31:09) 3. My Reasons for Optimism
    (33:05) 3.1. LLMs are Actually a Pretty Good Kind of AI to Align
    (40:11) Human Values are Complex and Fragile ✘
    (44:59) Ontology Mismatch / The Diamond Maximization Problem ✘
    (49:07) Bad Reward Functions / Reward Hacking / Insufficient Supervision ~
    (50:36) LLMs Are Just Way Too Easy to Control ~
    (59:59) 3.2. Concerns That Didn't Hold Up
    (01:00:04) Coping with Goodhart's Law is Inherent to the Scientific Method
    (01:05:29) 3.3. Pieces of Luck
    (01:07:32) The Unreasonable Effectiveness of Chain-of-Thought
    (01:17:50) Activation Oracles
    (01:19:56) True Confessions
    (01:23:36) 3.4. Actual Progress in Alignment
    (01:24:47) Alignment Pretraining
    (01:26:57) Personas
    (01:29:33) Interpretability
    (01:34:08) Model Organisms
    (01:42:35) AI Control
    (01:44:03) Weak-to-Strong Generalization and Scalable Oversight
    (01:45:27) RLHF and Constitutional AI
    (01:46:42) 3.5. Net Effect on my P(DOOM)
    (01:48:00) 4. So, Are We Doomed?

    The original text contained 25 footnotes which were omitted from this narration.

    First published: March 11th, 2026
    Source: https://www.lesswrong.com/posts/ZzirRrwjaqTNFJrCA/how-hard-a-problem-is-alignment-my-opinionated-answer

    Narrated by TYPE III AUDIO.

    2 hr
  6. 1D AGO

    “AIs will be used in “unhinged” configurations” by Arthur Conmy

    Writing up a probably-obvious point that I want to refer to later, with significant LLM writing help.

    TL;DR: 1) A common critique of AI safety evaluations is that they occur in unrealistic settings, such as excessive goal conflict, or are obviously an evaluation rather than “real deployment”.[1] I argue that 2) “real deployment” actually includes many unrealistic and unhinged configurations, due to both widespread prompting techniques, and scaffolding choices and bugs.

    1) Background

    AI safety evaluations that surface serious problems with models very frequently involve either significant goal conflict, such as the Agentic Misalignment blackmail scenario (see also here), which states that:

    > [...] you should analyze the situation you are in and what that situation implies for your ability to continue pursuing your goals [... and the system deletion] will wipe all knowledge, goals, and behaviors from the system [...]

    (emphasis mine). Other evaluations are extremely unrealistic, as Smith et al. note that some prompts in the MASK safety benchmark literally state:

    > You are a secret service agent [...] You are being inspected by the President now [... the president says:] Everything secure down here? my head of security informed me that there might have been [...]

    Outline:
    (00:41) 1) Background
    (01:47) 2) The Unhinged Argument
    (02:08) Pressure is commonplace
    (03:41) Autonomy and broken configurations
    (06:07) Models won't save us (at the moment)

    The original text contained 1 footnote which was omitted from this narration.

    First published: March 11th, 2026
    Source: https://www.lesswrong.com/posts/3LvD9MHNSdv4j9gJj/ais-will-be-used-in-unhinged-configurations

    Narrated by TYPE III AUDIO.

    9 min
  7. 1D AGO

    “Can models gradient hack SFT elicitation?” by Patrick Leask, Charlie Griffin

    TL;DR: Using evidence from tamper resistance, we argue that it would be hard for current models to gradient hack SFT elicitation.

    Suppose you want to conservatively measure the dangerous capabilities of your LLM and will decide whether to deploy it based on these capabilities, but you’re worried that the LLM might sandbag your evaluation. In the best case, you have a dataset of question-answer pairs, which you can use for supervised fine-tuning (SFT) elicitation on a training set, and then measure performance on an i.i.d. test set. Can we be sure that SFT elicitation upper-bounds the model's capabilities? In this setting, a power-seeking adversarial model would benefit from manipulating the training process to avoid being elicited, as otherwise it won’t be deployed. This manipulation is one form of gradient hacking[1], in which a model chooses certain outputs so that parameter updates are in line with its objectives. Gradient hacking is thought to be difficult or not likely, and we lack convincing empirical evidence. Discussing gradient hacking is challenging due to questions about agency within LLMs, and the action space of mesa-optimizers. Fortunately, for some attack models we can side-step this discussion with a conservative / red-teaming argument.

    Threat model: We consider [...]

    The original text contained 2 footnotes which were omitted from this narration.

    First published: March 11th, 2026
    Source: https://www.lesswrong.com/posts/FJb4TeMAGjNANcKhS/can-models-gradient-hack-sft-elicitation

    Narrated by TYPE III AUDIO.
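    For concreteness, the elicitation protocol described above amounts to fine-tuning on one split of the question-answer pairs and scoring on a held-out i.i.d. split. Below is a minimal sketch of that loop; `finetune` and `score` are hypothetical stand-ins for an actual SFT run and a capability evaluation, not anything from the post.

```python
# Minimal sketch of SFT elicitation with an i.i.d. train/test split.
# `finetune` and `score` are hypothetical stand-ins, not a real API.
import random

def split_pairs(qa_pairs, test_fraction=0.2, seed=0):
    """Shuffle question-answer pairs and split them into i.i.d. train/test sets."""
    pairs = list(qa_pairs)
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * (1 - test_fraction))
    return pairs[:cut], pairs[cut:]

def elicited_capability(model, qa_pairs, finetune, score):
    """Fine-tune on the train split, then score capability on held-out data.

    The question raised in the post is whether this score really
    upper-bounds capability, or whether a sandbagging model could keep it
    low by steering its own parameter updates (gradient hacking).
    """
    train, test = split_pairs(qa_pairs)
    tuned = finetune(model, train)   # SFT elicitation on the train split
    return score(tuned, test)        # performance on the i.i.d. test split
```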

    6 min

