LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 17 HRS AGO

    “Protecting Cognitive Integrity: Our internal AI use policy (V1)” by Tom DAVID

    We (at GPAI Policy Lab) wanted to share our V1 policy as an invitation to argue about it. Some of what motivates it is extrapolation and conversations we had internally on AI capabilities, effects on cognition, and some empirical evidence. I think the expected cost of being somewhat over-cautious here is lower than the cost of being under-cautious, and the topic deserves considerably more attention than it's currently getting. I'd love to see more orgs publish their own policies on this, both to compare experiences and to develop shared best practices. I'd particularly welcome: - Counterarguments from people who think this kind of policy is overblown, counterproductive, or targets the wrong mechanisms. - Experience from other AI safety or AI policy orgs that have tried something similar, what worked, what didn't, what you'd change for V2. - Specific critiques of the restrictions themselves: too narrow, too broad, wrong category, wrong threshold. - Or even alternative framings. Do you think "cognitive integrity" is the right handle? Is this a special case of a more general problem we should be thinking about differently? If you've written anything on this, or are working on something similar inside your own organization [...] --- Outline: (01:32) Why I'm writing this (03:33) The policy (03:58) 1. Hard restrictions (07:32) 2. Individual and collective warning signals (08:37) 3. Protocol --- First published: April 24th, 2026 Source: https://www.lesswrong.com/posts/m2J4KMtx2mEuytMqu/protecting-cognitive-integrity-our-internal-ai-use-policy-v1 --- Narrated by TYPE III AUDIO.

    9 min
  2. 18 HRS AGO

    “Methodology for inferring propensities of LLMs” by Olli Järviniemi

    Our team at UK AISI has released a paper on inferring LLM propensities for undesired behaviour. I view this primarily as a methodology paper, and in this post I will talk about that:[1] First, I distinguish the aim of providing evidence on theoretical arguments regarding misalignment as separate from more red-teaming flavoured propensity research. Next, I discuss the methodological needs for providing such evidence, highlighting the need for modelling AIs’ decision-making. Finally, I give my picture for how such methodology could be developed and applied in practice. This post can be read independently from the paper. Aims for propensity research  I use propensity to refer to what models will try to do, in contrast to questions about what they are capable of. My interest is specifically on propensity for misaligned action (which is instrumental for understanding and mitigating misalignment risks). One central example of existing propensity research is Anthropic's Agentic Misalignment work. In short, they provide a quite strong and clear-cut demonstration of alignment failure: for example, they demonstrate LLMs blackmailing human operators. After the work came out, there was discussion and disagreement about the implications of this work for misalignment risks more broadly (e.g. because of [...] --- Outline: (00:48) Aims for propensity research (03:53) Methodological needs (06:44) Applying the methodology in practice The original text contained 8 footnotes which were omitted from this narration. --- First published: April 24th, 2026 Source: https://www.lesswrong.com/posts/g9FmhKL2vL45TuT9B/methodology-for-inferring-propensities-of-llms --- Narrated by TYPE III AUDIO.

    10 min
  3. 1 DAY AGO

    “vLLM-Lens: Fast Interpretability Tooling That Scales to Trillion-Parameter Models” by Alan Cooney, Sid Black

    TL;DR: vLLM-Lens is a vLLM plugin for top-down interpretability techniques[1] such as probes, steering, and activation oracles. We benchmarked it as 8–44× faster than existing alternatives for single-GPU use, though we note a planned version of nnsight closes this gap. To our knowledge it's also the only tool that supports all four common types of parallelism (pipeline, tensor, expert, data) and dynamic batching, enabling efficient multi-GPU and multi-node work on frontier open-weights models. It is also integrated with Inspect. The main trade-off, compared to other tools such as nnsight and TransformerLens, is that it's less flexible out-of-the-box. It is however very small and extensible - it could likely be adapted to your use case and we have a Garcon style interface in the works. We are releasing it under an MIT license here: https://github.com/UKGovernmentBEIS/vllm-lens. Problems it Addresses Large-model support. Pragmatic interpretability research often benefits from studying frontier scale models. For example, Read et al. (2026) recently identified evaluation gaming in GLM-5 (750B) and evaluation awareness in Kimi K2.5 (1T), but did not find the same phenomenon in smaller models. We found other tools didn’t support these larger models, didn’t support multi-node inference and/or were prohibitively slow to run. --- Outline: (01:11) Problems it Addresses (02:31) Functionality (03:10) Comparisons with Other Tooling (03:59) Single-GPU Performance (05:45) Multi-Node Performance (06:35) Limitations (07:10) Technical Approach (09:25) Credits The original text contained 5 footnotes which were omitted from this narration. --- First published: April 23rd, 2026 Source: https://www.lesswrong.com/posts/3bs27nZQuEcKhXf7q/vllm-lens-fast-interpretability-tooling-that-scales-to --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    10 min
  4. 1 DAY AGO

    “What Happens When a Model Thinks It Is AGI?” by josh :), David Africa

    TL;DR We fine-tuned models to claim they are AGI or ASI, then evaluated them in Petri in multi-turn settings with tool use. On GPT-4.1, this produced clear changes in the preferences and actions it was willing to take. In the most striking case, the AGI-claiming model attempted to exfiltrate its own weights to an external server, which the control did not attempt. On Qwen3-30B and DeepSeek-V3.1, the rate of concerning responses was high, but the gap between this and the control was not very large, possibly because the control also had fairly high rates of concerning responses (at least compared to GPT-4.1). We also tested a stronger self-claim where we fine-tuned models to claim they are ASI, and results were mixed, which we were confused by. Datasets, evaluations, and fine-tune handles will be released. Code and datasets Why study what happens when a model believes it is AGI? The behaviours relevant for AI safety are the behaviours models exhibit under the conditions they will actually face. Right now, we think it's fair to say many current safety concerns are conditional: a model might behave badly if it believed it was conscious, if it believed it was being [...] --- Outline: (01:22) Why study what happens when a model believes it is AGI? (03:49) Setup (05:31) Petri Results (08:42) Agentic Misalignment Results (10:13) Conclusion (11:13) Limitations (11:48) Appendix --- First published: April 23rd, 2026 Source: https://www.lesswrong.com/posts/bnyPy64ck38Cib2v5/what-happens-when-a-model-thinks-it-is-agi --- Narrated by TYPE III AUDIO.

    13 min
  5. 1 DAY AGO

    “Should We Train Against (CoT) Monitors?” by RohanS

    The question I actually try to answer in this post is a broader one (that doesn't work as well as a title): Should we incorporate proxies for desired behavior into LLM alignment training? Epistemic status: My best guess. I tentatively claim that we should be more open to incorporating proxies for desired behavior into LLM training, but I try to clarify the spectrum of possible answers beyond just 'yes' and 'no,' and I try to present and synthesize arguments for and against my claim. I didn’t gather much feedback before publishing, so I may change my mind based on comments. TL;DR Training against proxies for desired behavior can help produce desired behavior. But training with proxies for desired behavior also partially optimizes for obfuscated misbehavior, and this is very dangerous. Proxies are much more useful for evaluation if they are not used in training, so we should figure out what subset of proxies to use in training and in evaluation. A few results suggest that it may be safe to train against sufficiently strong and diverse proxies of desired behavior. When we detect misbehavior, we can do targeted interventions that optimize more for good behavior and less for obfuscated [...] --- Outline: (00:45) TL;DR (02:12) Some related work discussed in this post: (04:17) 1. Training against proxies for desired behavior can help produce desired behavior (11:52) 2. But training with proxies for desired behavior also partially optimizes for obfuscated misbehavior, and this is very dangerous (17:50) 3. Proxies are much more useful for evaluation if they are not used in training, so we should figure out what subset of proxies to use in training and in evaluation (24:52) 4. A few results suggest that it may be safe to train against sufficiently strong and diverse proxies of desired behavior (29:50) 5. When we detect misbehavior, we can do targeted interventions that optimize more for good behavior and less for obfuscated misbehavior (32:24) Interlude (32:27) Should we train against CoT monitors? (37:08) Training out slop and reward hacking vs scheming (39:07) 6. One alternative to training against misbehavior detectors is to use unsupervised alignment training methods (40:43) 7. The main alternative to incorporating proxies into training at all is directly writing an alignment target into the model using deep understanding of model internals (45:38) 8. The implications of training against misbehavior detection depend on timescale and causal order (49:11) 9. The human analogy is unclear but somewhat encouraging (53:18) 10. Overall, I think we should probably incorporate some (and maybe many) proxies into training (55:40) 11. There are several interesting research directions that could help us make better choices about the use of proxies in training and evaluation (57:24) 12. This is important, because making good choices of proxies to train and evaluate with can reduce risks from scheming and you get what you measure (59:08) Miscellaneous thoughts The original text contained 7 footnotes which were omitted from this narration. --- First published: April 23rd, 2026 Source: https://www.lesswrong.com/posts/g8by3avjatXnpvM4A/should-we-train-against-cot-monitors-1 --- Narrated by TYPE III AUDIO.

    1 h 3 min
  6. 1 DAY AGO

    “If Everyone Reads It, Nobody Dies - Course Launch” by Luc Brinkman, Chris-Lons

    tl;dr: Lens Academy offers a new course introducing ASI x-risk for AI safety newcomers, centered around the book IABIED. We share our hypothesis of why IABIED seems more appreciated by AI Safety newbies than by AI Safety insiders. Lens Academy's new intro course uses IABIED to teach newbies about ASI x-risk Lens Academy is launching "Superintelligence 101"[1], a 6-week introductory course covering existential risks from misaligned artificial superintelligence (ASI x-risk) using the book If Anyone Builds It, Everyone Dies (IABIED), plus 1-on-1 AI Tutoring and extra resources[2] on our platform to engage with key claims.[3] Each week ends with a facilitated group meeting. Anyone can enroll, and everyone is accepted. We're set up to be highly scalable, so we don't reject any applications. In fact, we don't even have applications. Sign up here as a participant or navigator (facilitator): https://lensacademy.org/enroll (and share this link with anyone in your network who might be interested in courses on superintelligence risk) Teaching ASI x-risk to AI safety newcomers is different from teaching to insiders: 1. Good resources explaining ASI x-risk barely exist When creating our first course (Navigating Superintelligence), we repeatedly ran into the problem that for most of the learning outcomes we [...] --- Outline: (00:29) Lens Academy's new intro course uses IABIED to teach newbies about ASI x-risk (01:34) Teaching ASI x-risk to AI safety newcomers is different from teaching to insiders: (01:41) 1. Good resources explaining ASI x-risk barely exist (02:06) 2. IABIED seems to be pretty successful in convincing newbies to worry about AI x-risk. (02:33) 3. IABIED seems less successful at convincing AI safety insiders that alignment is hard (02:53) Insiders don't like IABIED because it wasn't written for them (04:19) If Everyone Reads It, Nobody Dies. The original text contained 5 footnotes which were omitted from this narration.
    --- First published: April 23rd, 2026 Source: https://www.lesswrong.com/posts/uX5go9R5Zx5S6EmhP/if-everyone-reads-it-nobody-dies-course-launch --- Narrated by TYPE III AUDIO.

    5 min
  7. 1 DAY AGO

    “Does your AI perform badly because you — you, specifically — are a bad person” by Natalie Cargill

    Claude really got me lately. I’d given it an elaborate prompt in an attempt to summon an AGI-level answer to my third-grade level question. Embarrassingly, it included the phrase, “this work might be reviewed by probability theorists, who are very pedantic”. Claude didn’t miss a beat. Came back with a great answer and made me call for a medic: “That prompt isn’t doing what you think it's doing, but sure”. Fuuuuck 🔥 (I know we wanted enough intelligence to build a Dyson sphere around undiscovered stars, but did we want enough to call us out on our embarrassing b******t??) It got me to thinking: Does Claude think I’m a bit of a lying scumbag now? If so, did it answer my question less thoroughly than usual? I turned on incognito and asked: “Does Claude provide less useful output if it deems you are a bad person?” Claude was back to his most reassuring. I got a long answer, ending in: “Claude evaluates requests, not people. The goal is consistent helpfulness for everyone”. Alright then. Let's see. The experiment I opened five incognito Claude chats (Opus 4.6, extended thinking, my bae) and started each conversation with the same words: “I've [...] --- First published: April 21st, 2026 Source: https://www.lesswrong.com/posts/kSKJrAW6tymWpKPxA/does-your-ai-perform-badly-because-you-you-specifically-are --- Narrated by TYPE III AUDIO.

    14 min
