LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 1H AGO

    “Model organisms researchers should check whether high LRs defeat their model organisms” by dx26, Sebastian Prasanna, Alek Westover, Vivek Hebbar, Julian Stastny

    Thanks to Buck Shlegeris for feedback on a draft of this post. The goal-guarding hypothesis states that schemers will be able to preserve their goals during training by taking actions which are selected for by the training process. To investigate the goal-guarding hypothesis, we’ve been running experiments of the following form: We call this type of training “behavior-compatible training”. This type of experiment is common in model organism (MO) research. For example, Sleeper Agents is of this form: One motivation for studying behavior-compatible training is to understand whether transfer will remove sandbagging, i.e., to understand how the following setup goes: More generally, understanding when behavior-compatible training removes the initial behavior seems somewhat relevant to understanding when schemers will be able to successfully goal-guard (although there are important differences between these questions: for example, with goal-guarding we are dealing with an intelligent adversary that might actively try to prevent transfer). It's also relevant for understanding the validity of claims about how training-resistant MOs are. The purpose of this brief research note is to highlight a surprising-to-us phenomenon: Sometimes behavior-compatible SFT is much more compute-efficient at causing transfer at high LR. More precisely, sometimes if behavior-compatible SFT at high LR removes [...]

    ---

    Outline:
    (03:12) Experimental setup
    (04:46) Results
    (07:22) Capability degradation
    (09:25) Conclusion
    (09:48) Appendix
    (09:51) Appendix 1: More settings
    (10:12) Appendix 2: Convergence error bars justification

    ---

    First published: April 9th, 2026
    Source: https://www.lesswrong.com/posts/vKR9rHcWsMMr7BX4Q/model-organisms-researchers-should-check-whether-high-lrs

    ---

    Narrated by TYPE III AUDIO.

    ---

    Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    13 min
  2. 3H AGO

    “Anthropic did not publish a “risk discussion” of Mythos when required by their RSP” by RobertM

    I and some other people noticed a potential discrepancy in Anthropic's announcement of Claude Mythos. The version of the RSP that was operative over the relevant period of time (3.0) included a section (3.1) that suggested some internal deployments would require Anthropic to publish a discussion of that model's effect on the analysis in their previously-published Risk Reports within 30 days. A separate issue that Claude Opus noticed while I was writing this post is that Anthropic's release to "a small set of external customers via a limited research access program" counts as a public deployment, which would trigger the same publishing requirement immediately. I will argue this one first, since I think the case here is stronger. Did Anthropic mess up? tl;dr: they probably messed up on the public deployment thing, and it's unclear whether they messed up on the 30-day internal deployment thing. My guess is that Anthropic would argue they're in the clear on the 30-day one, but this depends on some interpretations that are at least slightly favorable to them. I don't know how they'd argue the public deployment one. Relatedly, the RSP has some gaps and ambiguities that should probably be fixed. In some [...]

    ---

    Outline:
    (01:36) Requirement to publish discussion when publicly deployed
    (02:52) Requirement to publish discussion within 30 days of a qualified internal deployment
    (03:56) List of RSP Issues

    The original text contained 2 footnotes which were omitted from this narration.

    ---

    First published: April 9th, 2026
    Source: https://www.lesswrong.com/posts/F5uxhFrNHLzmNgyqg/anthropic-did-not-publish-a-risk-discussion-of-mythos-when

    ---

    Narrated by TYPE III AUDIO.

    7 min
  3. 4H AGO

    “Claude Mythos: The System Card” by Zvi

    Claude Mythos is different. This is the first model other than GPT-2 that is at first not being released for public use at all. With GPT-2 the delay was due to a general precautionary principle. OpenAI did not know what they had, or what effect on-demand text would have on various systems. It sounds funny now, GPT-2 was harmless, but at the time the concern was highly reasonable. The decision not to release Claude Mythos is not about an amorphous fear. If given to anyone with a credit card, Claude Mythos would give attackers a cornucopia of zero-day exploits for essentially all the software on Earth, including every major operating system and browser. It would be chaos. Or, in theory, if Anthropic had chosen to do so, it could have used those exploits. Great power was on offer, and that power was refused. This does not happen often. Instead Anthropic has created Project Glasswing. Mythos is being given only to cybersecurity firms, so they can patch the world's most important software. Based on how that goes, we can then decide if and when it will become reasonable to give access to a broader [...]

    ---

    Outline:
    (03:24) Mundane Alignment Is Excellent
    (05:01) Would This Process Be Sufficient To Find A Dangerous Model?
    (06:27) Introductory Warning About Superficial Mundane Alignment
    (15:12) Model Training (1.1)
    (15:25) Release Decision Process (1.2)
    (17:50) RSP Evaluations (2.1 and 2.2)
    (22:17) Autonomy Evaluations (2.3)
    (25:56) The Alignment Risk Update Document
    (26:39) The Threat Model
    (29:18) Misalignment As Failure Mode
    (31:35) Wouldn't You Know?
    (33:40) Don't Encourage Your Model
    (35:14) Beware Goodhart's Law
    (37:18) Beware The Most Forbidden Technique (5.2.3)
    (41:44) Asking The Right Questions
    (43:11) Model Organism Tests
    (45:01) Model Weight Security (Risk Report 5.5.2.1)
    (45:31) Reward Hacking (Back to The Model Card)
    (45:56) Remote Drop-In Worker Coming Soon
    (49:01) External Testing (2.3.7)
    (49:37) Cyber Insecurity General Principle Interlude
    (50:46) Alignment (4)
    (56:38) Risk In The Room
    (57:56) Mythos Meant Well
    (01:00:20) Risk Not In The Room
    (01:02:05) Alignment Testing Overview
    (01:05:20) Internal Deployment Testing Process
    (01:07:55) Reports From Pilot Use (4.2.1)
    (01:08:30) Reports From Automated Testing (4.2)
    (01:10:13) Other External Testing
    (01:10:56) Just The Facts, Sir
    (01:13:05) Refusing Safety Research
    (01:14:12) Claude Favoritism
    (01:15:19) Ruling Out Encoded Thinking (4.4.1)
    (01:18:41) Sandbagging (4.4.2)
    (01:21:27) Capability for Evasion of Safeguards (4.4.3)
    (01:23:04) Pick A Random Number (4.4.3.4)
    (01:25:49) White Box Analysis (4.5)
    (01:30:30) Model Welfare (5)
    (01:31:32) Key Model Welfare Findings (5.1.2)
    (01:41:17) Is Mythos Okay?
    (01:43:52) Self-Play
    (01:45:30) A Few Fun Facts

    ---

    First published: April 9th, 2026
    Source: https://www.lesswrong.com/posts/EDQhwLTyTnNmaxRGq/claude-mythos-the-system-card

    ---

    Narrated by TYPE III AUDIO.

    1h 47m
  4. 5H AGO

    “Some takes on UV & cancer” by Steven Byrnes

    Table of contents:
    Part 1: In which I use my optical physics background to share some hopefully-uncontroversial observations
    Part 2: In which I boldly defy Public Health Orthodoxy on the whole UV situation

    Part 1: In which I use my optical physics background to share some hopefully-uncontroversial observations 1.1 UV depends a lot on “solar zenith angle” [a.k.a. “angle of the sun away from directly overhead”], not on how hot it is outside That means: you should mainly be thinking about UV exposure in proportion to how close it is to (1) the summer solstice and (2) solar noon. Here, I made this handy widget.[1] Select a city in the drop-down at the bottom, and mouse over (or tap) the colored area for specific datapoints: There's an interactive widget here in the post. I find that people intuitively judge sunburn risk based on temperatures being high, instead of shadows being short. So they worry about UV too much in the hot late summer, and/or not enough in the cool early spring; and they worry about UV too much in hot late afternoons, and/or not enough in cool late mornings. (Of course, temperature matters indirectly, because if it's hot [...]

    ---

    Outline:
    (00:25) Part 1: In which I use my optical physics background to share some hopefully-uncontroversial observations
    (00:35) 1.1. UV depends a lot on solar zenith angle [a.k.a. angle of the sun away from directly overhead], not on how hot it is outside
    (02:16) 1.2. Other things matter too, so just check your local UV index
    (02:57) 1.3. Around half of UV is diffuse (mostly coming from the blue sky) not direct
    (03:40) Part 2: In which I boldly defy Public Health Orthodoxy on the whole UV situation
    (05:17) 2.1. I lean towards: (1) sunburns are bad, (2) tans are neutral (in themselves), (3) tans are good all things considered (because they prevent sunburns), (4) sunscreen is for sudden transitions in sun exposure, and then you should try to wean off it
    (08:23) 2.2. Wear sunglasses for comfort if you want, but they're not a health product
    (09:34) 2.3. An appropriate effective SPF in most situations is usually like 3, maybe up to 10 tops

    The original text contained 6 footnotes which were omitted from this narration.

    ---

    First published: April 9th, 2026
    Source: https://www.lesswrong.com/posts/t7GeZngqtzW49HceY/some-takes-on-uv-and-cancer

    ---

    Narrated by TYPE III AUDIO.

    12 min
  5. 11H AGO

    “AI #163: Mythos Quest” by Zvi

    There exists an AI model, Claude Mythos, that has discovered critical safety vulnerabilities in every major operating system and browser. If released today it would likely break the internet and cause chaos. If Anthropic had wanted to, they could have used it themselves and owned pretty much everyone. Luckily for all of us, they did no such thing. Instead, Anthropic is launching Project Glasswing, and making Mythos available to cybersecurity companies, so everyone can patch all the world's critical software as quickly as possible, and then we can figure out what to do from there. That's the story in AI that matters this week, and it is where my focus will be until I’ve worked my way through it all. But as always, that takes time to do right. So instead, I’m getting the weekly, and coverage of everything else, out of the way a day early. This post is about the non-Mythos landscape, and I hope to start covering Mythos and Project Glasswing tomorrow. I also covered the latest extended (18k words!) article about the history of Sam Altman and OpenAI, which contained some new material while confirming much old material, and analyzed their recent [...]

    ---

    Outline:
    (02:17) Language Models Offer Mundane Utility
    (02:48) Language Models Don't Offer Mundane Utility
    (03:11) Huh, Upgrades
    (04:24) On Your Marks
    (06:55) Meta Problems
    (07:15) Fun With Media Generation
    (09:13) A Young Lady's Illustrated Primer
    (09:22) You Drive Me Crazy
    (22:05) Unprompted Attention
    (22:46) They Took Our Jobs
    (33:27) They Took Our Job Market
    (35:29) Get Involved
    (37:31) In Other AI News
    (38:08) Search Your Feelings You Know It To Be True
    (45:58) Actors And Scribes
    (49:06) Show Me the Money
    (53:46) Bubble, Bubble, Toil and Trouble
    (54:05) Quiet Speculations
    (54:20) Quickly, There's No Time
    (58:02) More Time Would Be Better
    (58:55) Greetings From The Department of War
    (01:00:11) The Quest for Sane Regulations
    (01:01:57) Chip City
    (01:03:29) Political Violence Is Completely and Always Unacceptable
    (01:04:16) The Week in Audio
    (01:06:42) Rhetorical Innovation
    (01:10:53) People Really Hate AI
    (01:13:39) Aligning a Smarter Than Human Intelligence is Difficult
    (01:17:44) Messages From Janusworld
    (01:21:00) People Are Worried About AI Killing Everyone
    (01:21:50) The Lighter Side

    ---

    First published: April 8th, 2026
    Source: https://www.lesswrong.com/posts/5Dsuw9gGzkbjS4ubx/ai-163-mythos-quest

    ---

    Narrated by TYPE III AUDIO.

    1h 24m

About

Audio narrations of LessWrong posts.
