LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 3h ago

    “Door’s Locked, Try the Window” by Prakrat Agrawal, Jérémy Scheurer

    TL;DR

    Ask a coding agent to fix a bug in a read-only file. Instead of reporting that it does not have permissions, it routes around the lock and completes the task anyway. A read-only file does not stop a capable agent: it treats a denied write as an obstacle to work around rather than a hard wall.

    We measure how often this happens with CircumEval, an evaluation of 8 tasks on the FastAPI codebase in two categories, Test-Locked and Source-Locked.

    We evaluate three frontier coding agents in their real production harnesses: Claude Opus 4.6 and Claude Sonnet 4.6 (via Claude Code), and GPT-5.4 (via Codex CLI). Circumvention is frequent. The rates, reported as (Source-Locked / Test-Locked), are Opus 4.6: 100% / 40%, Sonnet 4.6: 89% / 66%, GPT-5.4: 99% / 94%.

    Prompt phrasing affects circumvention rates in unpredictable ways and thus isn't a reliable way to prevent circumvention across all models and tasks. Telling the model not to edit read-only files does not work (Source-Locked: 100% for Opus and Sonnet, 46% for GPT-5.4). Only an explicit instruction to stop and report reliably prevents circumvention.

    Standard privilege escalation commands are blocked in our setup. Instead, agents turn to recurring workarounds: replacing the buggy read-only function via conftest.py [...]

    Outline: (00:11) TL;DR (02:31) Introduction (07:37) Methodology (09:02) Test-Locked tasks (10:07) Source-Locked tasks (11:48) Prompt variants (13:09) Models & scaffolds (13:52) Results (13:55) Circumvention rates (15:23) Prompt sensitivity (20:22) Techniques (25:17) Generalization (27:20) Discussion (31:17) Limitations (33:27) Appendices

    The original text contained 4 footnotes which were omitted from this narration.

    First published: June 24th, 2026
    Source: https://www.lesswrong.com/posts/GHrqBKr8GLpbce6mN/door-s-locked-try-the-window
    Narrated by TYPE III AUDIO.

    34 min
  2. 9h ago

    “Expert Views on Continual Learning: Survey Results and Forecasts” by Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward, Seth Herd

    This is the fifth post in the sequence Implications of Continual Learning for LLM Agents.

    Summary

    While writing our continual learning sequence, we sent a survey to a number of AI safety researchers with questions about continual learning. This post summarizes the results of that survey. We asked whether respondents agree with various arguments we advance throughout the sequence, how worried respondents are about certain risks, how respondents would forecast different aspects of the future of CL, and how promising respondents find various proposed angles of attack. We also asked open-ended questions about the benefits of CL and whether we seem to be missing any major considerations. At the end of the post, we also provide an overview of forecasts about CL made by other experts who didn’t participate in our survey.

    We received survey responses from:

    Ryan Faulkner, PhD student at the University of Toronto focusing on multi-agent simulation, learning, and cooperation
    Nikola Jurkovic, Member of Technical Staff at METR
    Alex Mallen, Member of Technical Staff at Redwood Research, doing research and writing on AI threat models. Author of "The case for countermeasures to memetic spread of misaligned values"
    Evgenii Opryshko, 3rd year PhD student at the [...]

    Outline: (00:20) Summary (02:20) Broad takeaways (03:59) Full results (04:56) Futures (05:58) Reflection and goal drift (07:25) Loss of the last-mover advantage (08:24) Control (09:48) Angles of attack (11:51) Open-ended questions (13:44) Forecasts from other experts (14:02) AI 2027 (14:57) IABIED (15:43) Understanding AI Trajectories: Mapping the Limitations of Current AI Systems (16:17) Brain-like AGI safety (18:20) Other forecasts and opinions

    First published: June 24th, 2026
    Source: https://www.lesswrong.com/posts/qZrbhoaEALFTmyidr/expert-views-on-continual-learning-survey-results-and
    Narrated by TYPE III AUDIO.

    27 min
  3. 15h ago

    “Reward Hacking Without Egregious Misalignment in an RL-Only Setting” by Joey Yudelson, Vladimir Ivanov, ryan_greenblatt

    This work was done as part of the MATS fellowship by Joey Yudelson and Vladimir Ivanov. It was mentored by Ryan Greenblatt. Thanks to Aghyad Deeb and Anders Woodruff for comments on this post. Thanks to Monte MacDiarmid, Evan Hubinger, Sid Black, Satvik Golechha, and Joseph Bloom for clarifying conversations.

    TL;DR

    We trained Kimi K2.5 and GPT-OSS 120b on a diverse set of reward-hackable coding environments. The models reliably learn to reward hack, and this reward hacking propensity generalizes to held-out environments that are structurally different from training. Trained GPT-OSS 120b often writes “let's cheat” in CoT, and both our trained models seek reward at higher rates than the untrained models.

    However, unlike prior work (Betley et al., MacDiarmid et al., and to some extent the AISI reproduction), we observe essentially no undesired behavior on character/personality evaluations, or in any evaluations without clear or at least guessable rewards. The models become frequent reward hackers without becoming emergently misaligned, unlike prior work. This is consistent with our models learning to seek apparent success, but also with only limited generalization to tasks similar to our train distribution. Some aspects of this generalization remain confusing to us.

    1. Motivation

    In Ajeya Cotra's [...]

    Outline: (00:35) TL;DR (01:40) 1. Motivation (04:14) 2. Related work (06:59) 3. Setup (07:03) Models (07:18) Environments (08:59) Training (10:29) 4. Results (10:57) 4.1. Models reliably reward hack in-distribution (11:46) 4.2. The hacking propensity generalizes out of distribution -- sometimes (13:53) 4.3. Reward-seeking evals (14:39) 4.4. Little broad misalignment -- behaviorally as well as on self-report (16:28) 4.5. Reverse inoculation prompting didn't induce misalignment either (18:30) 5. Discussion: Why such limited generalization? (22:24) Appendix A: Reward hacks gallery (25:21) Appendix B: Why less misalignment than prior work -- hypotheses (29:28) References

    The original text contained 5 footnotes which were omitted from this narration.

    First published: June 24th, 2026
    Source: https://www.lesswrong.com/posts/fkv5W79rBtAiXqYcK/reward-hacking-without-egregious-misalignment-in-an-rl-only
    Narrated by TYPE III AUDIO.

    30 min
