LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 49M AGO

    “Claude Code, Codex and Agentic Coding #8” by Zvi

    When I started this series, everyone was going crazy for coding agents. Now a lot more people are going crazy for coding agents, as well they should given how much better coding agents keep getting, but also Everybody Knows they are good and is focusing on actually using them. With the slower pace of news here, it’s no longer clear that the waits associated with doing these updates on their own are worthwhile, so I’m going to fold these updates into the weekly again for now unless there’s a new major development.

    Table of Contents: Whoops, Sorry. Huh, Upgrades. Codex of Ultimate Computer Use. Rookie Numbers. I See What You Did There. Just a Ride. They Didn’t Want Our Jobs. Skilling Up. The Lighter Side.

    Whoops, Sorry: Claude Code suffered in April from three distinct issues that have now been fixed. Default reasoning was changed from high to medium to deal with latency, but users disliked this and blamed it on the model. It was introduced on March 4 and reverted on April 7. A bug made it so that [...]

    Outline:
    (00:38) Whoops, Sorry
    (01:45) Huh, Upgrades
    (04:13) Codex of Ultimate Computer Use
    (08:05) Rookie Numbers
    (09:12) I See What You Did There
    (11:38) Just a Ride
    (11:50) They Didn’t Want Our Jobs
    (18:30) Skilling Up
    (22:08) The Lighter Side

    First published: May 8th, 2026
    Source: https://www.lesswrong.com/posts/BS27ZWW2qwDEq5anx/claude-code-codex-and-agentic-coding-8

    Narrated by TYPE III AUDIO.

    23 min
  2. 3H AGO

    “A benchmark is a sensor” by Håvard Tveit Ihle, mabynke

    The simple mental picture

    A simple mental picture we have for an AI capability benchmark is to think of it as a sensor with a certain sensitivity within a certain range of capabilities. The sensitivity of a benchmark, i.e. its ability to distinguish the capability of different models, is given by a curve like this: The curve starts high (low sensitivity, high uncertainty), since for models with low capability all the tasks in the benchmark are too hard, and the benchmark can’t distinguish between low and very low capability. Similarly, all the tasks are too easy for a very capable model, and we lose the ability to differentiate again. In between is the range of capabilities the benchmark is sensitive to, and the sensitivity curve tells you how easy it is to distinguish small capability differences between models at different overall capability levels.

    A good benchmark is very sensitive over a long range of capabilities, but there is a tradeoff. Say you want to make a benchmark with 1000 questions. You could make the questions all have roughly the same difficulty. That would make you very sensitive to capabilities close to that difficulty, but you would only [...]

    Outline:
    (00:09) The simple mental picture
    (02:06) Epoch Capability Index (ECI) as a toy model

    First published: May 8th, 2026
    Source: https://www.lesswrong.com/posts/JzfcJMgfkhfRhwg4C/a-benchmark-is-a-sensor

    Narrated by TYPE III AUDIO.
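    The sensitivity-versus-range tradeoff in this excerpt can be made concrete with a minimal item-response-theory sketch. This is not from the post: it assumes a logistic (Rasch-style) item model, where a test's sensitivity at a capability level is its Fisher information, and it uses numpy. It compares 1000 questions piled at one difficulty against 1000 spread over a wide range.

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def test_information(theta, difficulties):
        """Fisher information of a test at each ability level theta.

        Under a Rasch model an item of difficulty b is solved with
        probability p = sigmoid(theta - b), contributes information
        p * (1 - p), and items add up. So stacking questions at one
        difficulty sharpens the peak; spreading them widens the range.
        """
        p = sigmoid(theta - difficulties[:, None])  # shape (items, abilities)
        return (p * (1.0 - p)).sum(axis=0)

    abilities = np.linspace(-6, 6, 241)

    # Design A: 1000 questions, all at the same difficulty.
    uniform = np.zeros(1000)
    # Design B: 1000 questions spread over a wide difficulty range.
    spread = np.linspace(-4, 4, 1000)

    info_a = test_information(abilities, uniform)
    info_b = test_information(abilities, spread)

    # Measurement noise scales like 1/sqrt(information): low information at
    # the extremes is the "curve starts high" uncertainty from the post.
    for theta in (-4.0, 0.0, 4.0):
        i = np.argmin(np.abs(abilities - theta))
        se_a = info_a[i] ** -0.5
        se_b = info_b[i] ** -0.5
        print(f"theta={theta:+.0f}  same-difficulty SE={se_a:.3f}  spread SE={se_b:.3f}")
    ```

    Running this, the same-difficulty design measures most precisely near its one difficulty but its standard error balloons a few units away, while the spread design stays usable over a much wider band: exactly the tradeoff the excerpt describes.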

    6 min
  3. 14H AGO

    “Bad Problems Don’t Stop Being Bad Because Somebody’s Wrong About Fault Analysis” by Linch

    Here's a dynamic I’ve seen at least a dozen times:

    Alice: Man, that article has a very inaccurate/misleading/horrifying headline.
    Bob: Did you know, *actually* article writers don't write their own headlines? …
    But what I care about is the misleading headline, not your org chart.

    Another example I’ve encountered recently is (anonymizing) when a friend complained about a prosaic safety problem at a major AI company that went unfixed for multiple months. Someone else with background information “usefully” chimed in with a long explanation of organizational limitations and why the team responsible for fixing the problem had limitations on resources like senior employees and compute, and actually not fixing the problem was the correct priority for them, etc etc etc. But what I (and my friend) cared about was the prosaic safety problem not being fixed! And what this says about the company's ability to proactively respond to and fix future problems. We’re complaining about your company overall. Your internal team management was never a serious concern for us to begin with!

    A third example comes from Kelsey Piper. Kelsey wrote about the (horrifying) recent case where Hantavirus carriers in the recent [...]

    The original text contained 1 footnote which was omitted from this narration.

    First published: May 8th, 2026
    Source: https://www.lesswrong.com/posts/PCsmhN9z65HtC4t5v/bad-problems-don-t-stop-being-bad-because-somebody-s-wrong

    Narrated by TYPE III AUDIO.

    6 min
  4. 20H AGO

    “Write Cause You Have Something to Say” by Logan Riggs

    The ones who are most successful at writeathons (Inkhaven, NaNoWriMo) are those with an overhang of things to say, usually in the form of:
    - draft posts
    - daydreams

    When Scott Alexander said "Whenever I see a new person who blogs every day, it's very rare that that never goes anywhere or they don't get good. That's like my best leading indicator for who's going to be a good blogger" (source), it may seem you can just write every day, but that'd be Goodharting. There's something hidden in the writing process you can't see: they have something to say. They'll have an idea (somehow) and think it through by [writing it out/sitting quietly/etc]. This can then generate more ideas, some of which aren't even related to the original idea!

    At this point, though, my imaginary interlocutor would like to say: I'm trying to publish a blog post every day, so of course I'll eventually be bottlenecked on ideas! How do you generate them though?

    Catching Ideas

    Have an idea? Write down the idea. This is equivalent to giving your idea-generating process a cookie, reinforcing the habit of generating ideas. Sometimes, when I'm writing one post, a different idea will [...]

    Outline:
    (01:18) Catching Ideas
    (02:32) Just [Write] and Nobody Will Get Hurt

    The original text contained 1 footnote which was omitted from this narration.

    First published: May 8th, 2026
    Source: https://www.lesswrong.com/posts/h5n3rscJ7he3yLseo/write-cause-you-have-something-to-say-1

    Narrated by TYPE III AUDIO.

    4 min
  5. 22H AGO

    “Is ProgramBench Impossible?” by frmsaul

    ProgramBench is a new coding benchmark that all frontier models fail spectacularly. We’ve been on a quest for “hard benchmarks” for a while, so it's refreshing to see a benchmark where top models do badly. Unfortunately, ProgramBench has one big problem: it's impossible!

    What is ProgramBench? ProgramBench tests whether a model can recreate a program from a “clean room” environment. The model is given only a bit of documentation and black-box access to the program (all the programs are CLIs), then tasked with re-implementing it. How does ProgramBench know if the implementation is correct? It also generates a bunch of unit tests for the program[1]. The re-implementing coding agent doesn't have access to any of those tests. A task only counts as “resolved” if the re-implementation passes all of the tests, and “almost resolved” if it passes 95% of them.

    Why is this problematic? Obscure behavior can enter the unit tests without being in the clean-room path. An extreme version of this is a backdoor: a program that behaves one way most of the time but behaves totally differently when exposed to a specific string. This wouldn't make a task literally impossible, just incredibly hard in [...]

    Outline:
    (00:37) What is ProgramBench?
    (02:41) This seems like a theoretical issue, does it actually happen?
    (03:11) What can we do differently?

    The original text contained 4 footnotes which were omitted from this narration.

    First published: May 8th, 2026
    Source: https://www.lesswrong.com/posts/3pdyxFi6JS389nptu/is-programbench-impossible

    Narrated by TYPE III AUDIO.
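    A hypothetical sketch of the backdoor failure mode described in this excerpt (invented for illustration; not an actual ProgramBench program, and the trigger string and names are made up): a tiny CLI whose documented, black-box-discoverable behavior is trivial, plus one hidden branch that only tests generated from the original source would exercise.

    ```python
    import sys

    def transform(text: str) -> str:
        """Documented behavior: upper-case the input line.

        Probing the CLI with ordinary inputs only ever reveals this branch,
        so a faithful clean-room re-implementation will reproduce it exactly.
        """
        # Hidden backdoor: one specific string flips the behavior entirely.
        # A unit test generated from this source can assert on the trigger,
        # but black-box access gives the re-implementer no way to find it.
        if text == "xyzzy-7f3a":  # hypothetical trigger string
            return "debug-mode-1983"
        return text.upper()

    if __name__ == "__main__":
        print(transform(sys.stdin.read().rstrip("\n")))
    ```

    A re-implementation that matches the documentation and every observable probe still fails a generated test like `assert transform("xyzzy-7f3a") == "debug-mode-1983"`, which is the sense in which such a task is not merely hard but unwinnable from inside the clean room.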

    5 min
  6. 1D AGO

    “Bringing More Expertise to Bear on Alignment” by Edmund Lau, Geoffrey Irving, Cameron Holmes, David Africa

    Preamble

    The preamble is less useful for the typical AlignmentForum/LessWrong reader, who may want to skip to the Adversaria vs Basinland section.

    On 28 October 2025, Geoffrey Irving, Chief Scientist of the UK AI Security Institute, gave a keynote talk (slides) at the Alignment Conference. The conference was organised by the UK AISI and FAR.AI as part of the Alignment Project, which aims to bring experts from relevant fields together to make progress on the alignment problem.

    TL;DR:
    - Adversaria vs Basinland. We might be in one of two worlds: one where alignment is adversarial (a security problem), and one where it is navigational (a search for good basins of training behaviour). We don't know which world we are in, and how we train and deploy AIs may determine this.
    - We need new disciplines. The field is small, thinly resourced and approached from only a handful of angles. A few well-placed ideas from other disciplines could disproportionately shift what's achievable.
    - Even if this all fails, evidence of hardness is valuable.

    Moving past broad framing to details

    Alignment means ensuring that AI systems do what humans want. This is the broad framing. There is, of course, a lot of complexity [...]

    Outline:
    (00:12) Preamble
    (01:25) Moving past broad framing to details
    (02:15) We should plan for superintelligence
    (03:33) Adversaria vs Basinland
    (07:04) Which world are we in?
    (08:39) Why a few ideas might be enough
    (10:38) Write down problems
    (11:25) Multiple new ideas, fitting together
    (13:18) Spherical cows vs the mess
    (15:03) Conclusion
    (15:56) Acknowledgement

    The original text contained 1 footnote which was omitted from this narration.

    First published: May 8th, 2026
    Source: https://www.lesswrong.com/posts/cWFsCFyCttsiJwn2j/bringing-more-expertise-to-bear-on-alignment

    Narrated by TYPE III AUDIO.

    17 min
  7. 1D AGO

    [Linkpost] “How to prevent AI’s 2008 moment (We’re hiring)” by felixgaston

    This is a link post.

    TL;DR: CeSIA, the French Center for AI Safety, is recruiting. French not necessary. Apply by 22 May 2026; Paris or remote in Europe/UK.

    On August 27, 2005, at an annual symposium in Jackson Hole, Raghuram Rajan, then chief economist of the International Monetary Fund, argued in front of central bank governors and top officials that the innovations of the previous decade in banking had not made the world safer. The financial instruments built over that decade, he argued, had become so intricate that even their creators no longer fully understood the risks they carried. Risk had migrated to institutions the supervisory system was not designed to watch. And the people running those institutions were compensated in ways that rewarded short-term performance over long-term stability.

    The reception was hostile. Lawrence Summers, a former U.S. Treasury Secretary, rose from the audience to attack the paper, calling its premise "slightly Luddite" and "largely misguided," and warning that the kind of changes Rajan argued for would only reduce the productivity of the financial sector. Three years after Jackson Hole, major banks collapsed: first Bear Stearns, then Lehman Brothers, then Merrill Lynch, then AIG. [...]

    First published: May 7th, 2026
    Source: https://www.lesswrong.com/posts/gnZyTQFqLhiHdHELC/how-to-prevent-ai-s-2008-moment-we-re-hiring
    Linkpost URL: https://forum.effectivealtruism.org/posts/7nq5vK2xo85e9GZjC/we-re-hiring-three-people-to-prevent-ai-s-2008-moment

    Narrated by TYPE III AUDIO.

    5 min
