LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. -5 H

    “LLM-generated text is not testimony” by TsviBT

    Crosspost from my blog. Synopsis When we share words with each other, we don't only care about the words themselves. We care also—even primarily—about the mental elements of the human mind/agency that produced the words. What we want to engage with is those mental elements. As of 2025, LLM text does not have those elements behind it. Therefore LLM text categorically does not serve the role for communication that is served by real text. Therefore the norm should be that you don't share LLM text as if someone wrote it. And, it is inadvisable to read LLM text that someone else shares as though someone wrote it. Introduction One might think that text screens off thought. Suppose two people follow different thought processes, but then they produce and publish identical texts. Then you read those texts. How could it possibly matter what the thought processes were? All you interact with is the text, so logically, if the two texts are the same then their effects on you are the same. But, a bit similarly to how high-level actions don’t screen off intent, text does not screen off thought. How [...] --- Outline: (00:13) Synopsis (00:57) Introduction (02:51) Elaborations (02:54) Communication is for hearing from minds (05:21) Communication is for hearing assertions (12:36) Assertions live in dialogue --- First published: November 1st, 2025 Source: https://www.lesswrong.com/posts/DDG2Tf2sqc8rTWRk3/llm-generated-text-is-not-testimony --- Narrated by TYPE III AUDIO.

    20 min
  2. -20 H

    “Anthropic’s Pilot Sabotage Risk Report” by dmz

    As practice for potential future Responsible Scaling Policy obligations, we're releasing a report on misalignment risk posed by our deployed models as of Summer 2025. We conclude that there is very low, but not fully negligible, risk of misaligned autonomous actions that substantially contribute to later catastrophic outcomes. We also release two reviews of this report: an internal review and an independent review by METR. Our Responsible Scaling Policy has so far come into force primarily to address risks related to high-stakes human misuse. However, it also includes future commitments addressing misalignment-related risks that initiate from the model's own behavior. For future models that pass a capability threshold we have not yet reached, it commits us to developing an affirmative case that (1) identifies the most immediate and relevant risks from models pursuing misaligned goals and (2) explains how we have mitigated these risks to acceptable levels. The affirmative case will describe, as relevant, evidence on model capabilities; evidence on AI alignment; mitigations (such as monitoring and other safeguards); and our overall reasoning. Affirmative cases of this kind are uncharted territory for the field: While we have published sketches of arguments we might use, we have never prepared a [...] --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/omRf5fNyQdvRuMDqQ/anthropic-s-pilot-sabotage-risk-report-2 --- Narrated by TYPE III AUDIO.

    6 min
  3. -1 J

    “OpenAI Moves To Complete Potentially The Largest Theft In Human History” by Zvi

    OpenAI is now set to become a Public Benefit Corporation, with its investors entitled to uncapped profit shares. Its nonprofit foundation will retain some measure of control and a 26% financial stake, in sharp contrast to its previous stronger control and much, much larger effective financial stake. The value transfer is in the hundreds of billions, thus potentially the largest theft in human history. I say potentially largest because I realized one could argue that the events surrounding the dissolution of the USSR involved a larger theft. Unless you really want to stretch the definition of what counts this seems to be in the top two. I am in no way surprised by OpenAI moving forward on this, but I am deeply disgusted and disappointed they are being allowed (for now) to do so, including this statement of no action by Delaware and this Memorandum of Understanding with California. Many media and public sources are calling this a win for the nonprofit, such as this from the San Francisco Chronicle. This is mostly them being fooled. They’re anchoring on OpenAI's previous plan to far more fully sideline the nonprofit. This is indeed a big win for [...] --- Outline: (01:38) OpenAI Calls It Completing Their Recapitalization (03:05) How Much Was Stolen? (07:02) The Nonprofit Still Has Lots of Equity After The Theft (10:41) The Theft Was Unnecessary For Further Fundraising (11:45) How Much Control Will The Nonprofit Retain? (23:13) Will These Control Rights Survive And Do Anything? (26:17) What About OpenAI's Deal With Microsoft? (31:10) What Will OpenAI's Nonprofit Do Now? (36:33) Is The Deal Done? --- First published: October 31st, 2025 Source: https://www.lesswrong.com/posts/wCc7XDbD8LdaHwbYg/openai-moves-to-complete-potentially-the-largest-theft-in --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    38 min
  4. -1 J

    “Resampling Conserves Redundancy & Mediation (Approximately) Under the Jensen-Shannon Divergence” by David Lorell

    Audio note: this article contains 86 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description. Around two months ago, John and I published Resampling Conserves Redundancy (Approximately). Fortunately, about two weeks ago, Jeremy Gillen and Alfred Harwood showed us that we were wrong. This proof achieves, using the Jensen-Shannon divergence ("JS"), what the previous one failed to show using KL divergence ("_D_{KL}_"). In fact, while the previous attempt tried to show only that redundancy is conserved (in terms of _D_{KL}_) upon resampling latents, this proof shows that the redundancy and mediation conditions are conserved (in terms of JS). Why Jensen-Shannon? In just about all of our previous work, we have used _D_{KL}_ as our factorization error. (The error meant to capture the extent to which a given distribution fails to factor according to some graphical structure.) In this post I use the Jensen Shannon divergence. _D_{KL}(U||V) := mathbb{E}_{U}lnfrac{U}{V}_ _JS(U||V) := frac{1}{2}D_{KL}left(U||frac{U+V}{2}right) + frac{1}{2}D_{KL}left(V||frac{U+V}{2}right)_ The KL divergence is a pretty fundamental quantity in information theory, and is used all over the place. (JS is usually defined in terms of _D_{KL}_, as above.) We [...] --- Outline: (01:04) Why Jensen-Shannon? (03:04) Definitions (05:33) Theorem (06:29) Proof (06:32) (1) _\\epsilon^{\\Gamma}_1 = 0_ (06:37) Proof of (1) (06:52) (2) _\\epsilon^{\\Gamma}_2 \\leq (2\\sqrt{\\epsilon_1}+\\sqrt{\\epsilon_2})^2_ (06:57) Lemma 1: _JS(S||R) \\leq \\epsilon_1_ (07:10) Lemma 2: _\\delta(Q,R) \\leq \\sqrt{\\epsilon_1} + \\sqrt{\\epsilon_2}_ (07:20) Proof of (2) (07:32) (3) _\\epsilon^{\\Gamma}_{med} \\leq (2\\sqrt{\\epsilon_1} + \\sqrt{\\epsilon_{med}})^2_ (07:37) Proof of (3) (07:48) Results (08:33) Bonus The original text contained 1 footnote which was omitted from this narration. --- First published: October 31st, 2025 Source: https://www.lesswrong.com/posts/JXsZRDcRX2eoWnSxo/resampling-conserves-redundancy-and-mediation-approximately --- Narrated by TYPE III AUDIO.

    9 min
  5. -1 J

    “Steering Evaluation-Aware Models to Act Like They Are Deployed” by Tim Hua, andrq, Sam Marks, Neel Nanda

    📄Paper, 🖥️Code, 🤖Evaluation Aware Model Organism TL, DR:; We train an evaluation-aware LLM. Specifically, we train a model organism that writes Python type hints in evaluation but not in deployment. Additionally, it recognizes that a certain evaluation cue always means that it is being tested. The evaluation cue is intended to simulate hard-to-remove signs of an eval being artificial, e.g. being LLM-written. Our training process was designed to mimic how evaluation awareness could arise in practice: We first fine-tune the model on synthetic documents with factual descriptions of the model's context-dependent behavior and ability to recognize the evaluation cue. Then we train the model with expert iteration on evaluation data only to always use type hints. The resulting model uses types hints more during evaluation than in deployment. Prior work (Anthropic 2025; Abelnabi & Salem 2025) has shown that steering can stop the model talking about being evaluated, but lacks a ground truth of knowing how models behave if they are deployed. Having a model organism resolves this. We show that activation steering can elicit the model's deployment behavior even when simple prompting fails. When the evaluation cue is present, the model writes Python type hints even if we add deployment hints (e.g. [...] --- Outline: (02:45) Extended Summary (09:18) Commentary about the paper from Tim: (09:52) Thoughts on how to steer models to believe they are deployed (10:06) Contrastive prompts are good actually (12:10) Steer on multiple layers at once (13:09) Consider extracting the steering vector from an earlier checkpoint as well (14:54) You should suppress evaluation awareness and run honeypot evaluations (and other alignment evaluations as well) (15:24) Creating behavioral demonstrations of misalignment based on the steered model. (16:46) Other things to do with a model that's been steered to believe it's deployed (17:31) How others can build on this work (18:07) Self-Review of the Paper's Strengths and Weaknesses: (18:19) Main strengths of the paper: (18:28) Qualitatively, it was pretty easy to elicit deployment behavior with activation steering. (19:28) Steering can trick the model better than simple deployment prompts in our adversarial setting. (20:47) Most sketchy parts of the paper: (20:58) Llama Nemotron 49B is gullible and easy to trick. (23:12) We rely on an extra situational awareness hint to create large gaps between deployment and evaluation behavior. (24:28) Our final model organism struggles to tell deployment from evaluation using subtle clues. (25:03) We did not explore models who always think they are being evaluated unless a certain deployment cue is present. (26:13) Bottom Line: Will activation steering work for near-human level misaligned AIs? (27:26) Other miscellaneous takes on evaluation awareness: (28:44) Other miscellaneous takes on conducting this type of research The original text contained 3 footnotes which were omitted from this narration. --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/peKrvZ6t9PSCzoQDa/steering-evaluation-aware-models-to-act-like-they-are --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    30 min
  6. -2 J

    [Linkpost] “AISLE discovered three new OpenSSL vulnerabilities” by Jan_Kulveit

    This is a link post. The company post is linked; it seems like an update on where we are with automated cybersec. So far in 2025, only four security vulnerabilities received CVE identifiers in OpenSSL, the cryptographic library that secures the majority of internet traffic. AISLE's autonomous system discovered three of them. (CVE-2025-9230, CVE-2025-9231, and CVE-2025-9232) Some quick thoughts: OpenSSL is one of the most human-security-audited pieces of open-source code ever, so discovering 3 new vulnerabilities is impressive Obviously, vulnerability discovery is a somewhat symmetric capability, so this also gives us some estimate of the offense side This provides concrete evidence for the huge pool of bugs that are findable and exploitable even by current level AI - this is something everyone sane believed existed in my impression On the other hand, it does not neatly support the story where it's easy for rogue AIs to hack anything. The AISLE system was also able to fix the bugs, hopefully systems like this will be deployed, and it seems likely the defense side will start with large advantage of compute It's plausible that the "programs are proofs" limit is defense-dominated. On the other hand, actual programs are leaky [...] --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/F5QAGP5bYrMMjQ5Ab/aisle-discovered-three-new-openssl-vulnerabilities-1 Linkpost URL:https://aisle.com/blog/aisle-discovers-three-of-the-four-openssl-vulnerabilities-of-2025 --- Narrated by TYPE III AUDIO.

    2 min
  7. -2 J

    “Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt

    According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5's behavioral improvements in these evaluations may partly be driven by growing tendency to notice and game evaluations rather than genuine alignment. This is an early example of a phenomenon that is going to get increasingly problematic: as evaluation gaming increases, alignment evaluations become harder to trust.[1] To elaborate on the above: Sonnet 4.5 seems far more aware of being evaluated than previous models. In some of Anthropic's evaluations, Sonnet 4.5 explicitly mentions being tested between 80% and 100% of the time, compared with under 10% for Sonnet 4. Internal representations associated with evaluation awareness also became notably stronger over training.[2] In addition, Sonnet 4.5 seems substantially less likely to behave poorly once it has noticed that it might be tested. As seen in the figure below, when Anthropic inhibited internal representations related to evaluation awareness[3], its misaligned behavior rate on synthetic honeypot tests increased from 0% to between 1% and 9%. This effect was [...] --- Outline: (06:58) Sonnet 4.5 is much more evaluation-aware than prior models (10:00) Evaluation awareness seems to suppress misaligned behavior (14:52) Anthropic's training plausibly caused Sonnet 4.5 to game evaluations (16:28) Evaluation gaming is plausibly a large fraction of the effect of training against misaligned behaviors (22:57) Suppressing evidence of misalignment in evaluation gamers is concerning (25:25) What AI companies should do (30:02) Appendix The original text contained 21 footnotes which were omitted from this narration. --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/qgehQxiTXj53X49mM/sonnet-4-5-s-eval-gaming-seriously-undermines-alignment --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    36 min

À propos

Audio narrations of LessWrong posts.

Vous aimeriez peut‑être aussi