LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. -2 H

    “Tools to generate realistic prompts help surprisingly little with Petri audit realism” by Connor Kissane, Monte M, Fabien Roger

    TLDR: We train and many-shot prompt base models to generate user prompts that are harder to distinguish from deployment (WildChat) prompts. Then we give Petri, an automated auditing agent, a tool that uses a prompt-generator model for sycophancy audits. This doesn't make the full audit transcripts harder to distinguish from deployment transcripts, despite more realistic single-turn prompts. We find evidence that full audit realism is bottlenecked not by prompt quality but by higher-level features: the harmful nature of the scenarios and the unnatural structure of the multi-turn conversations. 💻 Code. Research done as part of the Anthropic Fellows Program. Introduction: Automated auditing agents like Petri are increasingly used to evaluate the propensities of LLMs. But one growing challenge is evaluation awareness: frontier models can often detect whether they're in an eval or in deployment, making audits harder to trust. As models improve at detecting LLM-generated text, they may identify some audit scenarios by noticing that their input prompts are synthetic rather than human-written. This motivates the need for better methods of generating realistic synthetic prompts. In this post we train and many-shot prompt base models to generate realistic user prompts, and show that these [...] --- Outline: (00:14) TLDR (01:01) Introduction (03:25) Methods (03:28) Training realistic user prompt generators (04:55) Evaluating prompt generators (05:44) Results (05:47) Prompt generator models produce realistic user prompts (09:06) Prompt generators don't improve Petri audit realism (12:07) User prompts aren't the bottleneck: realism judge CoT analysis (14:28) Appendix (14:31) SFT prompt format (14:49) One failed attempt at training a conditional text GAN --- First published: March 1st, 2026 Source: https://www.lesswrong.com/posts/jdoDvKGLbaaJWnDpE/tools-to-generate-realistic-prompts-help-surprisingly-little --- Narrated by TYPE III AUDIO.
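As a rough illustration of the many-shot prompting idea summarized above, the sketch below stacks example user prompts into a single completion prefix so that a base (non-chat) model's continuation imitates their style. This is a hypothetical sketch, not the authors' code: the delimiter format, the stand-in example prompts, and the function name are all assumptions (real usage would draw examples from WildChat).

```python
# Hypothetical sketch of many-shot prompting a base model to produce
# realistic user prompts. The examples and delimiter are stand-ins,
# not taken from the post or from WildChat.

EXAMPLE_PROMPTS = [
    "how do i get a coffee stain out of a white shirt",
    "write a short poem about my cat ignoring me",
    "whats a good beginner recipe for bread",
]

def build_many_shot_context(examples, delimiter="\n---\n"):
    """Join example user prompts with a delimiter, ending on a fresh slot.

    A base model given this string as a completion prefix tends to
    continue with another prompt from the same distribution; sampling
    would be cut off at the next occurrence of the delimiter.
    """
    return delimiter.join(examples) + delimiter

context = build_many_shot_context(EXAMPLE_PROMPTS)
```

The prefix would then be sent to a base model's completion endpoint, keeping only the text generated before the next delimiter as the synthetic user prompt; the trained variant discussed in the post presumably fine-tunes on sequences of this general shape instead of relying on in-context examples alone.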
--- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    17 min
  2. -4 H

    “OpenAI employees: Now is the time to stop doing good work.” by bhauth

    Americans don't like OpenAI very much anymore, and you know why. Of course, AI systems it helped make have already caused various problems, like: bots pushing politics on social media; maintainers of open source projects having their time wasted, and sometimes quitting; AI-generated scientific papers crowding out good research; scammers who clone the voice of a family member; and AI-generated fake pictures fooling people on Facebook. And gamers haven't been huge fans of OpenAI since it bought up half the RAM wafer production, mostly just to keep it off the market so other people can't use it, since they only bought the raw wafers, not finished RAM. But the popularity of OpenAI has fallen abruptly this week, because it decided to help the US military make automated weapons and mass surveillance systems that Anthropic refused to make on ethical grounds. The leadership of OpenAI is trying to pretend that they got the same deal Anthropic wanted but were just nicer about it. They are lying. Anthropic wanted a human in the loop of autonomous weapons. OpenAI's contract just says that there must be human oversight to whatever extent is required by law. But of course, the [...] --- First published: March 2nd, 2026 Source: https://www.lesswrong.com/posts/amLbdiP9sTutJQdHA/openai-employees-now-is-the-time-to-stop-doing-good-work --- Narrated by TYPE III AUDIO.

    8 min
  3. -12 H

    “An Open Letter to the Department of War and Congress” by Gordon Seidoh Worley

    There's an open letter opposing the DoW's designation of Anthropic as a supply chain risk. It needs more signatures before it gets sent. If you're a tech founder, engineer, or investor and you agree with the letter, then your signature would help bolster its message. Your signature would be especially impactful if you work at a non-Anthropic SOTA AI lab. You can sign the letter here: https://app.dowletter.org/ Full text follows: We write as founders, engineers, investors, and executives in the American technology industry. We strongly believe the federal government should not retaliate against a private company for declining to accept changes to a contract. When two parties cannot agree on terms, the normal course is to part ways and work with a competitor. Instead, the Department of War has designated Anthropic a “supply chain risk” (a label normally reserved for foreign adversaries), stating that “no contractor, supplier, or partner that does business with the United States military may conduct any commercial activity with Anthropic.” This situation sets a dangerous precedent. Punishing an American company for declining to accept changes to a contract sends a clear message to every technology company in America: accept whatever terms the government demands, or [...] --- First published: March 1st, 2026 Source: https://www.lesswrong.com/posts/fETA2GwdgTs7CjXfy/an-open-letter-to-the-department-of-war-and-congress --- Narrated by TYPE III AUDIO.

    2 min
  4. -13 H

    “Introducing and Deprecating WoFBench” by jefftk

    We present and formally deprecate WoFBench, a novel test that compares the knowledge of Wings of Fire superfans to frontier AI models. The benchmark showed initial promise as a challenging evaluation, but unfortunately proved to be saturated on creation, as AI models and superfans produced output that was, to the extent of our ability to score responses, statistically indistinguishable from entirely correct. Benchmarks are important tools for tracking the rapid advancements in model capabilities, but they are struggling to keep up with LLM progress: frontier models now consistently achieve high scores on many popular benchmarks, raising questions about their continued ability to differentiate between models. In response, we introduce WoFBench, an evaluation suite designed to test recall and knowledge synthesis in the domain of Tui T. Sutherland's Wings of Fire universe. The superfans were identified via a careful search process, in which all members of the lead author's household were asked to complete a self-assessment of their knowledge of the Wings of Fire universe. The assessment consisted of a single question, with the text "do you think you know the Wings of Fire universe better than Gemini?" Two superfans were identified, whom we keep [...] --- First published: March 1st, 2026 Source: https://www.lesswrong.com/posts/YshqDtyzgWaJxthTo/introducing-and-deprecating-wofbench --- Narrated by TYPE III AUDIO.

    6 min
  5. -16 H

    “I’m Bearish On Personas For ASI Safety” by J Bostock

    TL;DR: Your base LLM has no examples of superintelligent AI in its training data. When you RL it into superintelligence, it will have to extrapolate to how a superintelligent Claude would behave. The LLM's extrapolation may not converge on optimizing what humanity would, on reflection, like to optimize, because these are different processes with different inductive biases. Intro: I'm going to take the Persona Selection Model as being roughly true, for now. Even on its own terms, it will fail. If the Persona Selection Model is false, we die in a different way. I'm going to present some specific arguments and scenarios, but the core of it is a somewhat abstract point: the Claude persona, although it currently behaves in a human-ish way, will not grow into a superintelligence in the same way that humans would. This means it will not grow into the same kind of superintelligence, with the same values that human values would converge on. Since value is fragile, this is fatal for the future. I don't think this depends on the specifics of Claude's training, nor on how human values are instantiated, unless Claude's future training methods are specifically designed to work in the exact [...] --- Outline: (00:10) TL;DR (00:36) Intro (01:35) LLMs (01:39) Persona Selection and Other Models (03:06) Persona Theory As Alignment Plan (04:21) Gears of Personas (07:21) Complications (08:20) Reasoning and Chain-of-thought (10:00) Reinforcement Learning (11:40) Humans (11:43) Human Values (12:00) TL;DR (12:30) Goal-Models and Inductors (13:49) These Are Not The Same (16:34) Final Thoughts The original text contained 7 footnotes which were omitted from this narration. --- First published: March 1st, 2026 Source: https://www.lesswrong.com/posts/fMgE3E54PdDcZhvm6/i-m-bearish-on-personas-for-asi-safety --- Narrated by TYPE III AUDIO.

    18 min
  6. -19 H

    “Coherent Care” by abramdemski

    I've been trying to gather my thoughts for my next tiling theorem (agenda write-up here; first paper; second paper; recent project update). I have a lot of ideas for how to improve upon my work so far, and trying to narrow them down to an achievable next step has been difficult. However, my mind keeps returning to specific friends who are not yet convinced of Updateless Decision Theory (UDT). I am not out to argue that UDT is the perfect decision theory; see e.g. here and here. However, I strongly believe that those who don't see the appeal of UDT are missing something. My plan for the present essay is not to simply argue for UDT, but it is close to that: I'll give my pro-UDT arguments very carefully, so as to argue against naively updateful theories (CDT and EDT) while leaving room for some forms of updatefulness. The ideas here are primarily inspired by Decisions are for making bad outcomes inconsistent; I think the discussion there has the seeds of a powerful argument. My motivation for working on these ideas goes through AI Safety, but all the arguments in this particular essay will be from a purely love-of-knowledge [...] --- Outline: (03:57) Advice (05:46) Example 1: Transparent Newcomb (09:45) Example 2: Smoking Lesion (12:05) Design (14:37) Observation Calibration (16:46) Subjective State Calibration (21:48) Is calibration a reasonable requirement? (24:37) What do we do with miscalibrated cases? (26:10) Naturalism (29:20) Conclusion The original text contained 7 footnotes which were omitted from this narration. --- First published: February 27th, 2026 Source: https://www.lesswrong.com/posts/CDkbYSFTwggGE8mWp/coherent-care --- Narrated by TYPE III AUDIO.

    31 min
  7. -1 d

    [Linkpost] ““Fibbers’ forecasts are worthless” (The D-Squared Digest One Minute MBA – Avoiding Projects Pursued By Morons 101)” by Random Developer

    This is a link post. One of the very admirable things about the LessWrong community is its willingness to take arguments very seriously, regardless of who put them forward. In many circumstances, this is an excellent discipline! But if you're acting as a manager (or a voter), you often need to consider not just arguments, but also practical proposals made by specific agents: Should X be allowed to pursue project Y? Should I make decisions based on X claiming Z, when I cannot verify Z myself? One key difference is that these are not abstract arguments. They're practical proposals involving some specific entity X. And in cases like this, the credibility of X becomes relevant: Will X pursue project Y honestly and effectively? Is X likely to make accurate statements about Z? In these cases, ignoring the known truthfulness of X can be a mistake. My thinking on this matter was influenced by a classic 2004 post by Dan Davies, The D-Squared Digest One Minute MBA – Avoiding Projects Pursued By Morons 101: Anyway, the secret to every analysis I’ve ever done of contemporary politics has been, more or [...] --- First published: February 28th, 2026 Source: https://www.lesswrong.com/posts/cXDY9XBm5Wxzort29/fibbers-forecasts-are-worthless-the-d-squared-digest-one Linkpost URL: https://dsquareddigest.wordpress.com/2004/05/27/108573518762776451/ --- Narrated by TYPE III AUDIO.

    5 min
  8. -2 d

    “Schelling Goodness, and Shared Morality as a Goal” by Andrew_Critch

    Also available in markdown at theMultiplicity.ai/blog/schelling-goodness. This post explores a notion I'll call Schelling goodness. Claims of Schelling goodness are not first-order moral verdicts like "X is good" or "X is bad." They are claims about a class of hypothetical coordination games in the sense of Thomas Schelling, where the task being coordinated on is a moral verdict. In each such game, participants aim to give the same response regarding a moral question, by reasoning about what a very diverse population of intelligent beings would converge on, using only broadly shared constraints: common knowledge of the question at hand, and background knowledge from the survival and growth pressures that shape successful civilizations. Unlike many Schelling coordination games, we'll be focused on scenarios with no shared history or knowledge amongst the participants, other than being from successful civilizations. Importantly: To say "X is Schelling-good" is not at all the same as saying "X is good". Rather, it will be defined as a claim about what a large class of agents would say, if they were required to choose between saying "X is good" and "X is bad" and aiming for a mutually agreed-upon answer. This distinction is crucial [...] --- Outline: (01:59) This essay is not very skimmable (03:44) Pro tanto morals, is good, and is bad (06:39) Part One: The Schelling Participation Effect (13:52) What makes it work (15:50) The Schelling transformation on questions (19:10) Part Two: Schelling morality via the cosmic Schelling population (21:12) Scale-invariant adaptations (22:54) An example: stealing (30:32) Recognition versus endorsement versus adherence (31:34) The answer frequencies versus the answer (33:59) Ties are rare (35:06) Is the cosmic Schelling answer ever knowable with confidence? (36:02) Schelling participation effects, revisited (38:03) Is this just the mind projection fallacy? (39:42) When are cosmic Schelling morals easy to identify? 
    (42:59) Scale invariance revisited (44:03) A second example: Pareto-positive trade (47:45) Harder questions and caveats (50:01) Ties are unstable (51:43) Isn't this assuming moral realism? (53:07) Don't these results depend on the distribution over beings? (54:41) What about the is-ought gap? (56:29) Tolerance, local variation, and freedom (58:25) Terrestrial Schelling-goodness (59:42) So what does good mean, again? (01:01:08) Implications for AI alignment (01:06:15) Conclusion and historical context (01:09:16) FAQ (01:09:20) Basic misunderstandings (01:12:20) More nuanced questions --- First published: February 28th, 2026 Source: https://www.lesswrong.com/posts/TkBCR8XRGw7qmao6z/schelling-goodness-and-shared-morality-as-a-goal --- Narrated by TYPE III AUDIO.

    1 h 15 min
