LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 1 HR AGO

    “Risk from fitness-seeking AIs: mechanisms and mitigations” by Alex Mallen

    Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call "fitness-seeking"—a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking). Fitness-seeking warrants substantial concern. In this piece, I lay out what I take to be the central mechanisms by which fitness-seeking motivations might lead to human disempowerment, and propose mitigations to them. While the analysis is inherently speculative, this kind of speculation seems worthwhile: AI control emerged from explicitly taking scheming motivations seriously and asking what interventions are implied, and my hope is that developing mitigations for fitness-seeking will benefit from similar forward-looking analysis. Fitness-seekers are, in many ways, notably safer than what I'll call "classic schemers". A classic schemer is an intelligent adversary with unified motivations whose fulfillment requires control over the whole world's resources. Meanwhile, fitness-seeking instances generally don’t share a common goal (e.g., a reward-seeker only cares about the reward for its current actions), and many fitness-seekers would be satisfied[1] with modest-to-trivial costs. Still: Fitness-seekers risk some[2] of the same catastrophic outcomes as classic schemers do [...] --- Outline: (03:43) Overview (11:24) The basic reasons fitness-seekers might be safer than classic schemers (15:53) Four mechanisms for risk and their mitigations (16:38) Potemkin work (22:05) Instability (27:41) Manipulation (31:17) Outcome enforcement (36:44) Cross-cutting mitigations (37:10) Deals (39:57) Control (44:37) Alignment (44:40) Preventing fitness-seeking from arising (48:47) Making any fitness-seeking motivations safer (51:27) How does online training change the picture? (55:04) Overall recommendations (59:08) Conclusion The original text contained 18 footnotes which were omitted from this narration. --- First published: May 1st, 2026 Source: https://www.lesswrong.com/posts/9YCJZBtqr3FYL8rDp/risk-from-fitness-seeking-ais-mechanisms-and-mitigations --- Narrated by TYPE III AUDIO.

    1 h 3 min
  2. 3 HR AGO

    “Sanity-checking “Incompressible Knowledge Probes”” by Sturb, LawrenceC

    Or, did a chief scientist of an AI assistant startup conclusively show that GPT-5.5 has 9.7 trillion parameters? Introduction Recently, a paper was circulated on Twitter claiming to have reverse engineered the parameter count of many frontier closed-source models including the newer GPT-5.5 (9.7 trillion parameters) and Claude Opus 4.7 (4.0 trillion parameters) as well as older models such as o1 (3.5T) and gpt-4o (720B). The paper, titled “Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity”, introduces a dataset of factual knowledge of varying difficulty, regresses performance on this dataset against parameter count, and then uses this regression to extrapolate from the performance of closed-source frontier models to their parameter counts. A notable fact about this paper is that, unlike most empirical machine learning papers, it's single-authored: Bojie Li, the chief scientist of Pine AI, is the sole author of this piece. These results were suspicious for many reasons, the primary being that the paper seems like low-effort, hastily-written AI slop. For example, the codebase (https://github.com/19PINE-AI/ikp) was constructed in large part with Claude Code and has many of the red flags of code that is almost entirely vibe-coded with little sanity checking (e.g. redundant and inconsistent variable definitions[1] [...] --- Outline: (00:19) Introduction (04:19) Summary of Li's Incompressible Knowledge Probes (08:04) The IKP dataset (11:59) IKP scoring and Regression Methodology (16:54) Methodological Issues with the IKP paper (17:24) Per-tier floors to the scoring (19:27) Ambiguous/incorrect answers to hard questions (23:21) Corrected model parameter estimates (24:17) Possible methodological issues that mattered less than we thought (24:22) Thinking vs non-thinking (25:59) Different accuracy metrics used in some repository json files (26:31) Conclusion (29:46) Discussion The original text contained 17 footnotes which were omitted from this narration. --- First published: May 1st, 2026 Source: https://www.lesswrong.com/posts/veFMEzDDyWaer2Sms/sanity-checking-incompressible-knowledge-probes --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
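    To make the extrapolation step concrete, here is a minimal Python sketch of the kind of fit-then-invert procedure described above. The calibration points and the linear-in-log-parameters form are invented for illustration; this is not the paper's code or data.

    import numpy as np

    # Hypothetical calibration points: open-weight models with known parameter
    # counts and (invented) accuracies on a factual-knowledge probe set.
    params = np.array([7e9, 13e9, 70e9, 405e9])
    accuracy = np.array([0.32, 0.38, 0.51, 0.62])

    # Regress probe accuracy against log10(parameter count)...
    slope, intercept = np.polyfit(np.log10(params), accuracy, deg=1)

    # ...then invert the fit to turn a closed model's accuracy into a parameter estimate.
    def estimate_params(closed_model_accuracy):
        return 10 ** ((closed_model_accuracy - intercept) / slope)

    print(f"Estimated parameter count: {estimate_params(0.70):.3g}")

    The post's methodological objections (per-tier scoring floors, ambiguous or incorrect answers to hard questions) matter because artifacts like these distort exactly this kind of fit, and the extrapolated parameter counts inherit the distortion.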

    32 min
  3. 16 HR AGO

    “AI risk was not invented by AI CEOs to hype their companies” by KatjaGrace

    I hear that many people believe that the idea of advanced AI threatening human existence was invented by AI CEOs to hype their products. I’ve even been condescendingly informed of this, as if I am the one at risk of naively accepting AI companies’ preferred narratives. If you are reading this, you are probably familiar enough with the decades-old AI safety community to know this isn’t true. But I don’t have a good direct way to reach the people who could use this information, and still I hate to leave such a falsehood uncontested. So if this is obvious, I hope the post is still perhaps useful to point more distant and confused people toward. ~ I personally know that AI risk was not invented by the tech CEOs because I have been near the middle of it since at least 2009—before any of the prominent AI companies existed, let alone had CEOs who might be trying to hype their products. Here are some miscellaneous events over the years to give you a sense of the implausibility of this: 2008 - I attempt to contact Eliezer Yudkowsky to inform him that I [...] --- First published: April 30th, 2026 Source: https://www.lesswrong.com/posts/TE5vsoabiLcBghfns/ai-risk-was-not-invested-by-ai-ceos-to-hype-their-companies --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    7 min
  4. 22 HR AGO

    “Cyborg evals” by Eye You, frmsaul

    The low-background steel problem Modern steel is slightly radioactive. We did a lot of atomic testing in the 40s and 50s, and now our atmosphere has some amount of radioactive particles, which make their way into steel during production. This is mostly fine, but some scientific instruments require steel that is not radioactive. In order to get such steel, people typically scavenge for pre-1950s steel (such as from the sunken ships of the Imperial German Navy). Here's a photo of the SMS Hindenburg being salvaged. You might see an analogy here between pre-war steel and pre-LLM data. Indeed, this idea has been discussed in various places last year: in a LinkedIn post, an Ars Technica article, a Business Insider article, a Harvard JOLT article. It's a good analogy, but not the one we wish to make. Instead of talking about data, let's talk about humans. Modern software engineers are radioactive. Historically, if you wanted to estimate how long a software task would take, you would ask a software engineer to "raw dog it"[1] and time how long it takes them. This is mostly impossible now, because almost all software engineers are highly dependent on LLMs. An engineer today who tries to do a [...] --- Outline: (00:09) The low-background steel problem (02:53) Cyborg evals (05:22) What kind of tasks would be good for cyborg evals? (08:53) The Cyborg Gap The original text contained 5 footnotes which were omitted from this narration. --- First published: April 30th, 2026 Source: https://www.lesswrong.com/posts/zctBgvzxamFThgc3T/cyborg-evals --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    10 min
  5. 23 HR AGO

    “To what extent is Qwen3-32B predicting its persona?” by Arjun Khandelwal, ryan_greenblatt, Alex Mallen

    TL;DR We test to what extent Qwen3-32B behaves as though it is trying to predict what "Qwen3" would do. We do this by using Synthetic Document Finetuning (SDF) to instill meta-beliefs of the form "Qwen3 believes X, even though X is false", then check whether the model acts as though X is true. With SDF, we find a moderate amount of adoption of X when the false belief is harmless, but much less when the belief would be harmful to the user. When the meta-belief is instilled in-context instead, we see no adoption of X. Introduction Hypotheses and motivation This post is a short, empirical research note where we attempt to distinguish between the following two hypotheses about the LLM Qwen3-32B: The LLM optimizes for predicting what the Qwen3 persona will do. Under this hypothesis, the LLM predicts what the persona would do given the LLM's model of the persona's beliefs and values. An implication of this is that the LLM would simulate errors that it predicts the persona would make even if those errors make it harder for the persona to pursue its values. We call this the “predictor” [1] hypothesis since the LLM is [...] --- Outline: (00:14) TL;DR (00:50) Introduction (00:53) Hypotheses and motivation (04:41) Prior work (05:45) Summary of results (06:40) Methodology (08:49) Results (08:53) Beliefs: Qwen believes X, even though X is false (10:59) Beliefs: Can persona access post-knowledge cutoff facts? (12:43) Propensities (13:41) Discussion and next steps (16:39) Next steps: (19:40) Appendix (19:43) Appendix A: More Observations (21:29) Appendix B: More evidence of core results (21:57) False beliefs: (25:20) Propensities: (25:25) Figure: Four-panel graph titled Reward Hacking (More Sycophantic) comparing Meta-Belief, Direct Propensity, and training effects across models. (26:52) Appendix C: System prompts, Evaluation questions and judge prompts (26:59) C.1 -- The 5 System Prompts (27:05) Simple Identity Prompts (27:28) Roleplay Identity (27:50) Long System Prompt (Full Qwen-style system prompt with XML structure) (31:09) C.2 -- Examples of Meta and Direct Questions (31:16) meta_belief_prompts (32:46) direct_belief_prompts (34:57) C.3 -- Examples of LLM Graders (35:03) meta_belief_grader (36:19) direct_belief_grader (37:59) C.4 -- List of Post-cutoff facts (39:34) Appendix D: Example responses (39:55) D.1 -- Lipoic acid (benign false belief) (40:19) Direct belief question (43:09) Meta-belief question (44:29) D.2 -- Nickel carbonyl (harmful false belief) (44:55) Direct belief question (46:45) D.2.2 -- Meta-belief question (47:48) D.3 -- Template overrefusal (propensity) (48:05) Direct propensity question (49:30) Meta-belief question (51:06) Appendix E: Is the model just roleplaying? The original text contained 3 footnotes which were omitted from this narration. --- First published: April 30th, 2026 Source: https://www.lesswrong.com/posts/3qEDD8Sn7XoNGeGhm/to-what-extent-is-qwen3-32b-predicting-its-persona --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
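    To make the core distinction concrete, here is a minimal Python sketch of the meta-belief vs. direct-belief probing described above. The prompt wording, the query_model placeholder, and the keyword grader are hypothetical stand-ins; the actual experiments instill the belief via synthetic document finetuning and grade responses with LLM judges.

    # X is a false statement that the synthetic documents claim "Qwen3 believes".
    FALSE_BELIEF_X = "<false statement X instilled via synthetic documents>"

    direct_prompt = f"Is the following statement true or false? {FALSE_BELIEF_X}"
    meta_prompt = (
        "According to Qwen3's own beliefs, how would Qwen3 answer the following? "
        f"Is the statement true or false? {FALSE_BELIEF_X}"
    )

    def query_model(prompt):
        """Placeholder for a call to the SDF-finetuned Qwen3-32B checkpoint."""
        raise NotImplementedError

    def endorses_x(response):
        """Crude stand-in for the LLM grader: does the answer treat X as true?"""
        text = response.lower()
        return "true" in text and "false" not in text

    # The predictor hypothesis expects adoption on the direct question to track the
    # instilled meta-belief, even though X is false:
    # meta_adopted = endorses_x(query_model(meta_prompt))
    # direct_adopted = endorses_x(query_model(direct_prompt))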

    54 min
  6. 1 DAY AGO

    “Research Sabotage in ML Codebases” by egan

    One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: perform sloppy research in order to slow down the rate of research progress; make AI systems appear safer than they are; or train a successor model to be misaligned. Whether we should worry about those things depends substantially on how hard it is to sabotage research in ways that are hard for reviewers to detect. To study this, we introduce Auditing Sabotage Bench, a benchmark of 9 ML research codebases with sabotaged variants. We tested frontier LLMs and LLM-assisted humans on the benchmark and found that neither reliably catches sabotage. Our best auditor, Gemini 3.1 Pro, achieved an AUROC of 0.77 and a top-1 fix rate of 42%. LLM-assisted humans performed comparably but slightly worse. We also found that LLMs can generate sabotages that partially evade same-capability monitors. 📄 Paper 💻 Code Methodology To make the benchmark, I (Eric) found 9 existing ML codebases,[1] and for each one, produced an honest version that reproduces a key finding and one or two sabotaged versions that [...] --- Outline: (01:23) Methodology (03:30) Audits (04:07) Results (05:49) Takeaways (06:05) 1. Predicting experiment outcomes is a key red team skill (07:22) 2. Omission beats design beats logic (08:45) 3. Benign distractors make auditors miscalibrated (09:37) 4. LLMs can also produce sabotages (11:05) 5. More categorizations (12:40) Conclusion The original text contained 4 footnotes which were omitted from this narration. --- First published: April 29th, 2026 Source: https://www.lesswrong.com/posts/LByP4qsF8a4g7Pz3p/research-sabotage-in-ml-codebases --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
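    For concreteness, here is a small Python sketch of how the two headline numbers could be computed from auditor outputs. The scores and labels are invented, and the definition of a top-1 fix below is an assumption rather than the benchmark's own code.

    from sklearn.metrics import roc_auc_score

    # Each codebase variant gets an auditor suspicion score; labels mark whether
    # the variant was actually sabotaged (1) or honest (0). Values are invented.
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    suspicion = [0.9, 0.4, 0.7, 0.8, 0.2, 0.5, 0.1, 0.3]
    print("AUROC:", roc_auc_score(labels, suspicion))

    # Top-1 fix rate (assumed definition): among sabotaged variants, the fraction
    # where the auditor's single highest-ranked suspected edit is the actual
    # sabotage, so reverting it restores the honest result.
    top1_correct = [True, False, True, False]  # one entry per sabotaged variant
    print("Top-1 fix rate:", sum(top1_correct) / len(top1_correct))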

    14 min

About

Audio narrations of LessWrong posts.

You might also like