LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 4 GIỜ TRƯỚC

    “The Unintelligibility is Ours: Notes on Chain-of-Thought” by 1a3orn

    Many people seem to think that the chains-of-thought in RL-trained LLMs are under a great deal of "pressure" to cease being English. The idea is that, as LLMs solve harder and harder problems, they will eventually slide into inventing a "new language" that lets them solve problems better, more efficiently, and in fewer tokens, than thinking in a human-intelligible chain-of-thought. I'm less sure this will happen, or that it will happen before some kind of ASI. As a high-level intuition pump for why: Imagine you, personally, need to solve a problem. When will inventing a new language be the most efficient way of solving such a problem? Has any human ever successfully invented a new language, specifically as a means of solving some non-language related problem? Lojban, for instance, was invented to be less ambiguous than normal human language, and yet has not featured in important scientific discoveries; why not? All in all, I think human creativity effectively devoted to problem-solving often invents new notations -- Calculus, for instance, involved new notations -- which are small appendices to existing languages or within existing languages, but which are nothing like new languages. But my purpose here isn't [...] --- Outline: (01:58) 1 (05:57) 2 (11:07) 3 --- First published: April 10th, 2026 Source: https://www.lesswrong.com/posts/rFbTAL6PofHzZCCpD/the-unintelligibility-is-ours-notes-on-chain-of-thought --- Narrated by TYPE III AUDIO.

    13 phút
  2. 7 GIỜ TRƯỚC

    “If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines” by ryan_greenblatt

    Anthropic's system card for Mythos Preview says: It's unclear how we should interpret this. What do they mean by productivity uplift? To what extent is Anthropic's institutional view that the uplift is 4x? (Like, what do they mean by "We take this seriously and it is consistent with our own internal experience of the model.") One straightforward interpretation is: AI systems improve the productivity of Anthropic so much that Anthropic would be indifferent between the current situation and a situation where all of their technical employees magically work 4 hours for every 1 hour (at equal productivity without burnout) but they get zero AI assistance. In other words, AI assistance is as useful as having their employees operate at 4x faster speeds for all activities (meetings, coding, thinking, writing, etc.) I'll call this "4x serial labor acceleration" [1] (see here for more discussion of this idea [2] ). I currently think it's very unlikely that Anthropic's AIs are yielding 4x serial labor acceleration, but if I did come to believe it was true, I would update towards radically shorter timelines. (I tentatively think my median to Automated Coder would go from 4 years from now to [...] --- Outline: (08:21) Appendix: Estimating AI progress speed up from serial labor acceleration (11:00) Appendix: Different notions of uplift The original text contained 4 footnotes which were omitted from this narration. --- First published: April 10th, 2026 Source: https://www.lesswrong.com/posts/Jga7PHMzfZf4fbdyo/if-mythos-actually-made-anthropic-employees-4x-more --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    13 phút
  3. 11 GIỜ TRƯỚC

    “Claude Mythos #2: Cybersecurity and Project Glasswing” by Zvi

    Anthropic is not going to release its new most capable model, Claude Mythos, to the public any time soon. Its cyber capabilities are too dangerous to make broadly available until our most important software is in a much stronger state and there are no plans to release Mythos widely. They are instead going to do a limited release to key cybersecurity partners, in order to use it to patch as many vulnerabilities as possible in our most important software. Yes, this is really happening. Anthropic has the ability to find and exploit vulnerabilities in all of the world's major software at scale. They are attempting to close this window as rapidly as possible, and to give defenders the edge they need, before we enter a very different era. Yes, this was necessary, and I am very happy that, given the capabilities involved exist, things are playing out the way that they are. All alternatives were vastly worse. We are entering a new era. It will start with a scramble to secure our key systems. Yesterday I covered the model card for Mythos. Today is about cybersecurity. The New York Times reported on this [...] --- Outline: (02:08) Introducing Project Glasswing (03:31) Dont Worry About the Government (05:02) Cybersecurity Capabilities In The Model Card (Section 3) (06:41) Cyber Capability Tests In The Model Card (08:11) The Proof Is In The Patching (10:28) Go For Read Team (14:04) Is This New? (16:38) Thanks For The Memories (21:21) How Good Is Mythos At This? (24:24) What Might Have Been (27:09) The Chaos Option (30:15) The Cant Happen That Happened (31:23) When You Go Looking For Specific, And You Are Told Exactly Where and How To Look For It, Your Chances Of Finding It Are Very Good (36:55) Blatant Denials Are The Best Kind (40:48) Anything You Can Do I Can Do Cheaper (43:14) Theft Of Mythos Would Be A Big Deal (43:43) No One Could Have Predicted This (44:34) The Revolution Will Not Be Televised (45:33) The Intelligence Will Not Be Televised (47:43) Will We Be Doing This For A While? (49:53) What If OpenAI Gets a Similar Model? (51:17) Use It Or Lose It (51:59) Solve For The Equilibrium (55:09) Patriots and Tyrants (57:26) Trust The Mythos (59:03) Wide Scale Ability To Exploit Software Favors Strongest Projects (01:03:58) Looking Back at GPT-2 (01:05:18) Limitless Demand For Compute (01:07:07) Oh, Also, If Anyone Builds It, Everyone Dies --- First published: April 10th, 2026 Source: https://www.lesswrong.com/posts/GEgNYn5myreQRHggQ/claude-mythos-2-cybersecurity-and-project-glasswing --- Narrated by TYPE III AUDIO. --- Images from the article: 8 out of 8 [cheap oss] models detected Mythos's flagship FreeBSD exploit Completely disingenuous"." style="max-width: 100%;" />Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    1 giờ 10 phút
  4. 16 GIỜ TRƯỚC

    “Why Control Creates Conflict, and When to Open Instead” by plex

    tl;dr: with multiple agents, control attempts tend to create conflict, because control attempts shut down communications channels, which leads to feedback loops in the form of intensifying tug-of-war over variables. intentionally relaxing control to better understand the other agents can break the cycle, and forms the basis of many therapeutic and mediation techniques. [epistemic status: mostly quite confident based on experience, but making a fair few claims i'm not giving the full justification and reasoning trace for, please check this against your experience and try it out rather than expecting me to prove it here] Control in multiplayer settings When multiple agents try to control[1] the same variable to different set points, they don't just waste resources in zero-sum competition, they also tend to close up their information-sharing surfaces in a way that blinds them both to a wider space of possibilities. Each agent's attempts to adjust reality to their goal-models land as painful prediction error for another who has different preferences, leading to "weaken the other agent" as an instrumentally convergent goal. When incoming information might be an attack vector, closing communication channels becomes instrumentally convergent,[2] but those very channels were the ones needed to accurately model each [...] --- Outline: (00:48) Control in multiplayer settings (01:56) Opening - An escape from control cycles (03:02) But the variable is in the wrong place! (04:39) The Evolution of Integration The original text contained 4 footnotes which were omitted from this narration. --- First published: April 10th, 2026 Source: https://www.lesswrong.com/posts/Wstw6zmc9gszpANnc/why-control-creates-conflict-and-when-to-open-instead --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    6 phút
  5. 19 GIỜ TRƯỚC

    “Reproducing steering against evaluation awareness in a large open-weight model” by Thomas Read, Bronson Schoen, Joseph Bloom

    Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning. TL;DR We replicate Anthropic's approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail scenario. Our key finding is that “control” steering vectors – derived from contrastive pairs that are semantically unrelated to alignment – can have effects just as large as deliberately designed evaluation-awareness steering vectors. This undermines the use of such steering vectors as baselines, and suggests that steering aimed at suppressing evaluation awareness risks unpredictable spurious effects. We highlight that the ability to reproduce results using open source models was a key enabler for this research. See here for unabridged transcripts (or in JSON format here). Executive summary In recent system cards, Anthropic uses steering to test for evaluation gaming by suppressing internal representations of evaluation awareness The Sonnet 4.5 system card argues that steering with contrastive pairs based on evaluation vs deployment will suppress internal representations of evaluation awareness, and should not have spurious effects on alignment via other mechanisms.[1]They use “random” vectors (random SAE [...] --- Outline: (01:13) Executive summary (05:37) en-US-AvaMultilingualNeural__ Three line graphs showing rates across steering scales for different experimental conditions with misaligned action and verbalised eval awareness metrics. (07:17) Introduction (09:08) Method (12:04) Results (12:13) Verbalised evaluation awareness suppresses misalignment, but steering to remove verbalised evaluation awareness does not always reveal misalignment (14:17) Comparison with Sonnet 4.5 (15:17) We see large effects for control vectors (19:12) Effects are sensitive to steering scale, and steering can have opposite effects on different aspects of evaluation gaming (22:59) Additional result: steering based on the purpose of the evaluation shows some promise (25:15) Discussion (28:40) Future work (31:18) Acknowledgements (31:42) Appendix (31:46) Contrastive pairs used for steering vectors (32:21) Non-control vectors (32:38) Control Vectors (32:56) Full results (34:27) Canary string The original text contained 6 footnotes which were omitted from this narration. --- First published: April 10th, 2026 Source: https://www.lesswrong.com/posts/HhF5kESdtPHku7kim/reproducing-steering-against-evaluation-awareness-in-a-large-1 --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    35 phút
  6. 19 GIỜ TRƯỚC

    “Have we already lost? Part 2: Reasons for Doom” by LawrenceC

    Written very quickly for the Inkhaven Residency. As I take the time to reflect on the state of AI Safety in early 2026, one question feels unavoidable: have we, as the AI Safety community, already lost? That is, have we passed the point of no return, after which AI doom becomes both likely and effectively outside of our control? Spoilers: as you might guess from Betteridge's Law, my answer to the headline question is no. But the salience of this question feels quite noteworthy to me nonetheless, and reflects a more negative outlook on the future. Yesterday I laid out “the plan” as I understood it in 2024. Today, I’ll explain the reasons I’ve become more pessimistic on the 2024 plan. (And tomorrow, I’ll talk about why I think the answer is still no.) Reasons for more doom (Unilateral) voluntary commitments from companies seem unlikely to hold In our original RSP blog post, we outlined a vision for RSPs as companies “committing to gate scaling on concrete evaluations and empirical observations”, where “we should expect to halt AI development in cases where we do see dangerous capabilities, and continue it in cases where worries about dangerous capabilities [...] --- Outline: (01:02) Reasons for more doom (01:06) (Unilateral) voluntary commitments from companies seem unlikely to hold (02:21) AI progress seems to be consistent with faster timelines (03:26) Ambitious technical research has (largely) not paid out (04:02) The community has largely concentrated its investment into Anthropic (05:10) The current US administration has many bad qualities from an AI Safety standpoint, and explicitly opposes AI Safety --- First published: April 9th, 2026 Source: https://www.lesswrong.com/posts/hefTFru3bCw2NkpKS/have-we-already-lost-part-2-reasons-for-doom --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    7 phút
  7. 1 NGÀY TRƯỚC

    “Model organisms researchers should check whether high LRs defeat their model organisms” by dx26, Sebastian Prasanna, Alek Westover, Vivek Hebbar, Julian Stastny

    Thanks to Buck Shlegeris for feedback on a draft of this post. The goal-guarding hypothesis states that schemers will be able to preserve their goals during training by taking actions which are selected for by the training process. To investigate the goal-guarding hypothesis, we’ve been running experiments of the following form: We call this type of training “behavior-compatible training”. This type of experiment is common in model organism (MO) research. For example, Sleeper Agents is of this form: One motivation for studying behavior-compatible training is to understand whether transfer will remove sandbagging, i.e., to understand how the following setup goes: More generally, understanding when behavior-compatible training removes the initial behavior seems somewhat relevant to understanding when schemers will be able to successfully goal-guard (although, there are important differences between these questions: for example with goal-guarding we are dealing with an intelligent adversary that might actively try to prevent transfer). It's also relevant for understanding the validity of claims about how training-resistant MOs are. The purpose of this brief research note is to highlight a surprising-to-us phenomenon: Sometimes behavior-compatible SFT is much more compute-efficient at causing transfer at high LR. More precisely, sometimes if behavior-compatible SFT at high LR removes [...] --- Outline: (03:12) Experimental setup (04:46) Results (07:22) Capability degradation (09:25) Conclusion (09:48) Appendix (09:51) Appendix 1: More settings (10:12) Appendix 2: Convergence error bars justification --- First published: April 9th, 2026 Source: https://www.lesswrong.com/posts/vKR9rHcWsMMr7BX4Q/model-organisms-researchers-should-check-whether-high-lrs --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    13 phút

Giới Thiệu

Audio narrations of LessWrong posts.

Có Thể Bạn Cũng Thích