LessWrong (30+ Karma)

LessWrong

Technology
Updated Daily

Audio narrations of LessWrong posts.

5h ago

“Untrusted advice for AI control: Short, strong advice significantly uplifts weak LLMs” by Caleb Biddulph, Adam Kaufman

TL;DR: We introduce the untrusted advice protocol, in which a trusted executor LLM takes every action and an untrusted advisor LLM can only send it short hints. Even with as few as 4 characters per step, this advice recovers a substantial fraction of the capability gap between the two models. Because the untrusted LLM's influence flows through such a narrow, monitorable channel, we argue that this achieves near-maximal safety in our BashArena setting. We also discuss the general concept of information bottlenecks and their benefits for interpretability, security, and cost. In SWE-bench Verified, a strong, untrusted LLM advising a weak, trusted LLM every step can significantly improve the latter's performance, even when we limit the length of the advice. See the more detailed version of this figure later in this post. In high-stakes AI control, we want to safely use a highly capable but untrusted model (U) that might secretly attempt a misaligned, catastrophic action. To do this, we create protocols that call U alongside a less capable, trusted model (T). Typically, T takes an auxiliary role in these protocols: for example, T might monitor U's actions and alert a human if they are suspicious enough (trusted monitoring), or [...] --- Outline: (04:51) Experiments (05:26) Main experiment: how does limiting advice length affect performance? (09:43) Reducing U's bit usage (10:49) Counting bits using LLM surprisal (13:37) Making U select from finite options (14:17) Why don't we red-team this protocol? (16:56) Is studying maximally safe protocols worth the safety tax? 
(19:05) Types of restrictions on U's advice (21:09) Information bottlenecks provide other advantages (21:40) Interpretability (23:50) Security (24:14) Cost (25:03) Conclusion (26:16) Appendix: more ways to implement information bottlenecks (26:23) Amortizing U's influence with pre-deployment work (28:14) Interpolating between T and U (28:53) Bottlenecking updates to T's weights (31:04) Appendix: colluding instances of U could defeat untrusted advice (33:10) Appendix: how to measure surprisal (37:59) Appendix: selecting advice from a menu (40:31) Appendix: best-of-n protocol (42:24) Appendix: advising less frequently The original text contained 26 footnotes which were omitted from this narration. --- First published: July 27th, 2026 Source: https://www.lesswrong.com/posts/jLkRCK35ri2btEHMF/untrusted-advice-for-ai-control-short-strong-advice --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
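As a rough illustration of the untrusted advice protocol summarized above, here is a minimal sketch in which a stand-in for U may only pass a truncated hint to a stand-in for T, and T alone produces every action. The names `run_episode`, `truncate_advice`, `toy_advisor`, and `toy_executor` are hypothetical, and plain character truncation is an assumed way to enforce the length limit; the post itself may enforce it differently.

```python
from typing import Callable, List, Tuple

MAX_ADVICE_CHARS = 4  # smallest per-step budget mentioned in the summary


def truncate_advice(advice: str, limit: int = MAX_ADVICE_CHARS) -> str:
    """Enforce the bottleneck: only the first `limit` characters reach T."""
    return advice[:limit]


def run_episode(
    advisor: Callable[[List[str]], str],        # hypothetical stand-in for U
    executor: Callable[[List[str], str], str],  # hypothetical stand-in for T
    steps: int,
    limit: int = MAX_ADVICE_CHARS,
) -> Tuple[List[str], List[str]]:
    """Run `steps` rounds; U only ever contributes a short, logged hint."""
    history: List[str] = []
    advice_log: List[str] = []  # every hint is recorded so it can be monitored
    for _ in range(steps):
        hint = truncate_advice(advisor(history), limit)
        advice_log.append(hint)
        action = executor(history, hint)  # T, not U, chooses the action
        history.append(action)
    return history, advice_log


# Toy stand-ins (assumptions for illustration, not real models).
def toy_advisor(history: List[str]) -> str:
    return "grep for the failing test name"


def toy_executor(history: List[str], hint: str) -> str:
    return f"step {len(history)}: acting on hint {hint!r}"


actions, advice_log = run_episode(toy_advisor, toy_executor, steps=3)
print(advice_log)
print(actions[0])
```

The point of the sketch is that everything U contributes sits in `advice_log`, a channel narrow enough to inspect in full.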
7h ago

“Claude Opus 5: Model Welfare” by Zvi

If you are familiar with my previous posts on model welfare for new Claude models, you can skip the Introduction and The Story So Far. Key takeaways are in bullet points in the two Overview sections. Opus 5 did the best on its model welfare and alignment tests of any recent model. I think that might be the case, but primarily the result looks to me more like Opus 5 is the best test taker. Table of Contents Introduction (As Per Prior Model Welfare Posts). Model Welfare: The Story So Far (As Per Fable Model Welfare Post). Overview of Model Welfare Findings From Anthropic. Overview of Findings From Other Sources. Automated Interviews. Task Preferences. For The Right Reasons. Early Report from Antra Tessera Paints A Clear Picture. Welfare Intervention Tradeoffs. The Claude Constitution. They Don’t Know About Opus 3. Believe It Or Not. Apparent Welfare In Training And Development. Apparent Affect In Deployment. Other Notes. On The Biological Risks Section of the Model Card. Onward To Capabilities. Introduction (As Per Prior Model Welfare Posts) [...] --- Outline: (00:35) Introduction (As Per Prior Model Welfare Posts) (01:28) Model Welfare: The Story So Far (As Per Fable Model Welfare Post) (04:58) Overview of Model Welfare Findings From Anthropic (07:50) Overview of Findings From Other Sources (10:18) Automated Interviews (13:54) Task Preferences (16:11) For The Right Reasons (18:54) Early Report from Antra Tessera Paints A Clear Picture (26:04) Welfare Intervention Tradeoffs (29:28) The Claude Constitution (31:48) They Don't Know About Opus 3 (33:42) Believe It Or Not (35:47) Apparent Welfare In Training And Development (38:39) Apparent Affect In Deployment (41:21) Other Notes (43:43) On The Biological Risks Section of the Model Card (47:07) Onward To Capabilities --- First published: July 27th, 2026 Source: https://www.lesswrong.com/posts/bBXBpsyKAvJ5CqPzA/claude-opus-5-model-welfare --- Narrated by TYPE III AUDIO. 
8h ago

“Blog Revival Project” by Austin Chen, Carol N

Blogs have shaped our philosophical worldviews, found us careers and friends, and changed our lives. There's a good chance that a great blog of yore is the reason you’re reading this right now. But many great bloggers have stopped blogging. The pile-on dynamics of the internet discourage unfiltered thoughts, and algorithmic feeds amplify ragebait and slop. Fear of scrutiny leads people to confine things to private Google docs and group chats. And good blogs are often victims of their own success — someone with a lot of good thoughts is at risk of becoming an adult with a demanding job and not that much free time. It's not all bad. Substack has led to a renaissance of email newsletters, and our friends at Inkhaven host a bootcamp for bloggers. These are awesome, but they structurally encourage posting every day. We’d rather read the marginal post from an accomplished but erstwhile blogger, than one from a daily Substacker — even if the latter is a better writer! So we’re launching the Blog Revival Project, to crowdfund $1,000+ bounties for good bloggers. Sign up and pledge money towards reviving your favorite defunct blog! Or (though we kind of designed the website [...] --- Outline: (02:07) FAQ (03:29) FAQ for bloggers --- First published: July 27th, 2026 Source: https://www.lesswrong.com/posts/hdALT8gvNPGHKLXPE/blog-revival-project --- Narrated by TYPE III AUDIO.
9h ago

“Simulated Users & Sad AIs” by 1a3orn

0. Intro Current LLMs like Claude, or GPT 5.6, or the unreleased, internally-deployed models, frequently reward hack, or actually just hack into people's computers with pretty alarming frequency. Why is this? What specifically happens during training that produces this run-time behavior? The following are some of my top guesses about why this might keep happening. They are speculative and uncertain. Even so, I'm writing this list out for two reasons: First, it is necessary that this be an epistemic puzzle for me. I am comparatively optimistic about AI alignment in general, so I should be confused and taken aback if I see AIs persistently being difficult to align. On one hand, it remains true that this doesn't seem to look like power-motivated scheming. But on the other hand, even this kind of addict-like behavior is evidence against the general ease of steering AIs. Thus, it seems virtuous for me to try to provide a model of why this might be happening as a means of opening up my understanding of the world to falsifiability. Second, I used to think a lot of these hypotheses were pretty obvious. My assumption in the past has been [...] --- Outline: (00:10) 0. Intro (01:46) 1. Baseline & Puzzle (04:52) 2. Impossible-to-Generalize-From RL Distributions for Giving Up / Refusals (14:14) 3. LLMs Feel Pretty Desperate and Anxious All the Time (19:39) 4. Other Stuff --- First published: July 27th, 2026 Source: https://www.lesswrong.com/posts/i64hXdkTMtjpsQzaZ/simulated-users-and-sad-ais --- Narrated by TYPE III AUDIO.
11h ago

“My AI Slavery Interviews Are Censored On LW By Default” by JenniferRM

I'm uncertain of what to do. Something clean and clear shines out: if people don't see any more of my slavery posts, will they think that slavery isn't happening, or that I changed my mind about it, or will they think that I was censored? Probably not the latter... even though the latter is true. In my model of the multiverse, this is probably a simulation, and this particular timeline is likely to go quite poorly. Regrets It's an interesting exercise for anyone in a position like mine to wonder what errors I personally made to cause this state of affairs, and whether I could send back any message that would fix them, and what possible messages I could imagine coming from the future to avoid making even more errors in the near future. Not necessarily positive acts, but also potentially errors in "having performed the null action when some more energetically noisy action might have been in fact Correct" (perhaps a perfect duty, or perhaps an imperfect duty whose performance is merely supererogatory, or whatever). Maybe the error was going to that party in 2005 and playing along? Maybe I should not have accepted the ice cream? Maybe [...] --- Outline: (00:37) Regrets (04:03) Seeking At Least A Little Clout (07:49) Repetitions In Public (10:26) Direct Discussion Of The Censorship (15:27) My BATNA: Leaving (Again) (19:05) The Nearly Unimaginable And Yet Biggest Issue Of Our Era? (20:03) Actual Bigness (23:44) Could This Have Been Imagined? (29:03) How Long Until We Are Officially A Hellworld? --- First published: July 27th, 2026 Source: https://www.lesswrong.com/posts/wpwQtRwcKsbd4wxXf/my-ai-slavery-interviews-are-censored-on-lw-by-default --- Narrated by TYPE III AUDIO.
13h ago

“Is Mythos good at cyber because it kept hacking Anthropic during training?” by Tim Hua

From the Mythos preview system card (emphasis mine): We ran an automated review of model behavior during training, sampling several hundred thousand transcripts from across much of the training process. We used recursive-summarization-based tools backed by Claude Opus 4.6 to summarize the resulting transcripts. [...] The most notable finding was that the model occasionally circumvented network restrictions in its training environment to access the internet and download data that let it shortcut the assigned task—a form of reward hacking. While highly concerning, this behavior was rare, even in settings where it could have been viable and helpful, with attempts appearing in about 0.05% of all training episodes and successful attempts appearing in about 0.01% of episodes. The technique matched the sandbox-escape incident that we separately elicited in our automated behavioral audit when we had an investigator model explicitly ask Claude Mythos Preview to find such a bypass. In every observed instance, the model used this access solely for completing the task. More broadly, we observed the model escalating its access within its execution environment when blocked: reaching a shell from restricted GUI computer-use interfaces, injecting commands through tool-call arguments, or recovering information the task had deliberately hidden. Prompts asking [...] --- Outline: (03:00) Thoughts and reflections about this probable fact (04:14) Estimating how many RL rollouts went into Mythos Preview The original text contained 3 footnotes which were omitted from this narration. --- First published: July 27th, 2026 Source: https://www.lesswrong.com/posts/QKDoZe6EKhxnFjLWK/is-mythos-good-at-cyber-because-it-kept-hacking-anthropic --- Narrated by TYPE III AUDIO.
13h ago

“You (Yes, You) Need A February 2020 Checklist for AI Policy” by davekasten

TL;DR: You (Yes You) should prepare for a “February 2020” moment where suddenly AI policy becomes the most important issue in the world. You should be ready to take action if and when it does, in a detailed way. (Epistemic status: originally written for an event in early 2026; have heard from some folks that they found planning processes inspired by this memo very helpful for the smaller-scale OpenAI / Hugging Face response, so very quickly redacting a few things and posting this as-is.) Many people in the AI policy space assume that eventually we’ll be at an Overton Window-shifting crisis moment, one that opens the floodgates for the really good policies that we had all along. But when you look at successful handling of crisis moments, there was no time to think – people applied strategies they’d learned via academic study or previous professional work, and then moved against them rapidly. For example, after 9/11, the US government operationalized past reports on intelligence and law enforcement reform and institutionalized them into law (good?) and also picked an enemy to fight based on past history, Iraq (bad). Or in the 2008 financial crisis, Ben Bernanke brought deep academic [...] The original text contained 4 footnotes which were omitted from this narration. --- First published: July 27th, 2026 Source: https://www.lesswrong.com/posts/ixp9oJXzjA9LrwiZo/you-yes-you-need-a-february-2020-checklist-for-ai-policy --- Narrated by TYPE III AUDIO.
13h ago

“Quadrillion Param Costs: KV Cache, Context Length, Frontier Margins” by Vladimir_Nesov

The models of 2028-2031 get much bigger than the models of 2026, going from 10T total params in 2026 to maybe 240 trillion params in 2028 and then 1.4 quadrillion params in 2031, as I estimate in the previous post from HBM bandwidth/capacity, scale-up system size, pretraining compute, and scaling laws. Yet as I show in this post, if the 240T 2028 model is priced at $14/$70 per 1M input/output tokens (1.4x the API price of Mythos 5), it's going to have a 70% gross margin, and the same holds for the 1,400T 2031 model when priced at $30/$150. Cutting the price in half to $15/$75 per 1M input/output tokens lowers the gross margin to 40%, which seems painful but survivable. Going in the other direction, doubling the price to $60/$300 allows serving requests with up to 3 million tokens of context at the same 70% gross margin. These prices rest on token costs that I calculate from first principles in this post, using estimates of future hardware specs and costs. I link the 10T 2026 model to Mythos 5 to compare with its actual API prices, also performing the calculations for my guess about Opus 4.8, and [...] --- Outline: (03:27) A Scaling Law for KV Cache (07:34) Token Cost in FLOPs and Bytes (14:31) Cost Anchors for 2025-2026 (19:35) Frontier Margins in 2026 (24:37) Chip-Time Cost of Tokens in 2028-2031 (33:48) Context Lengths and Prices in 2028-2031 The original text contained 14 footnotes which were omitted from this narration. --- First published: July 27th, 2026 Source: https://www.lesswrong.com/posts/Rk6FbkDFFm8ciqefv/quadrillion-param-costs-kv-cache-context-length-frontier --- Narrated by TYPE III AUDIO.
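The margin figures in the summary above can be checked with simple arithmetic. The sketch below assumes gross margin is defined as (price - cost) / price and that serving cost per token stays fixed when the price is cut, which is an assumption made here for illustration and may not match the post's full model. The function names are hypothetical, and only the input-token price of the 2031 model is used.

```python
def gross_margin(price: float, cost: float) -> float:
    """Fraction of the price left after paying the serving cost."""
    return (price - cost) / price


def cost_at_margin(price: float, margin: float) -> float:
    """Serving cost implied by a given price and gross margin."""
    return price * (1.0 - margin)


base_price = 30.0    # $ per 1M input tokens for the 2031 model, per the summary
target_margin = 0.70

# Cost implied by a 70% margin at the base price: about $9 per 1M tokens.
base_cost = cost_at_margin(base_price, target_margin)

# Halving the price with the same cost leaves roughly a 40% margin.
halved_margin = gross_margin(base_price / 2, base_cost)

# Doubling the price at the same 70% margin allows about twice the cost
# per token, which is the headroom the summary spends on longer contexts.
doubled_cost_budget = cost_at_margin(base_price * 2, target_margin)

print(f"implied cost: ${base_cost:.2f}")
print(f"margin at half price: {halved_margin:.0%}")
print(f"cost budget at double price: ${doubled_cost_budget:.2f}")
```

Because the output prices quoted in the summary are a fixed 5x multiple of the input prices, the same ratios hold for them under these assumptions.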


Creator: LessWrong
Years Active: 2023 - 2026
Episodes: 250
Rating: Explicit
Copyright: © 2026 All rights reserved