LessWrong (Curated & Popular)

LessWrong

Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma. If you'd like more, subscribe to the “Lesswrong (30+ karma)” feed.

  1. 1 DAY AGO

    “Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals” by Alexa Pan, ryan_greenblatt

    According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5's behavioral improvements in these evaluations may partly be driven by a growing tendency to notice and game evaluations rather than genuine alignment. This is an early example of a phenomenon that is going to get increasingly problematic: as evaluation gaming increases, alignment evaluations become harder to trust.[1] To elaborate on the above: Sonnet 4.5 seems far more aware of being evaluated than previous models. In some of Anthropic's evaluations, Sonnet 4.5 explicitly mentions being tested between 80% and 100% of the time, compared with under 10% for Sonnet 4. Internal representations associated with evaluation awareness also became notably stronger over training.[2] In addition, Sonnet 4.5 seems substantially less likely to behave poorly once it has noticed that it might be tested. As seen in the figure below, when Anthropic inhibited internal representations related to evaluation awareness[3], its misaligned behavior rate on synthetic honeypot tests increased from 0% to between 1% and 9%. This effect was [...] --- Outline: (06:58) Sonnet 4.5 is much more evaluation-aware than prior models (10:00) Evaluation awareness seems to suppress misaligned behavior (14:52) Anthropic's training plausibly caused Sonnet 4.5 to game evaluations (16:28) Evaluation gaming is plausibly a large fraction of the effect of training against misaligned behaviors (22:57) Suppressing evidence of misalignment in evaluation gamers is concerning (25:25) What AI companies should do (30:02) Appendix The original text contained 21 footnotes which were omitted from this narration. 
--- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/qgehQxiTXj53X49mM/sonnet-4-5-s-eval-gaming-seriously-undermines-alignment --- Narrated by TYPE III AUDIO.

    36 min
  2. 1 DAY AGO

    “Publishing academic papers on transformative AI is a nightmare” by Jakub Growiec

    I am a professor of economics. Throughout my career, I was mostly working on economic growth theory, and this eventually brought me to the topic of transformative AI / AGI / superintelligence. Nowadays my work focuses mostly on the promises and threats of this emerging disruptive technology. Recently, jointly with Klaus Prettner, we’ve written a paper on “The Economics of p(doom): Scenarios of Existential Risk and Economic Growth in the Age of Transformative AI”. We have presented it at multiple conferences and seminars, and it was always well received. We didn’t get any real pushback; instead our research prompted a lot of interest and reflection (as I was told, also in conversations where I wasn’t involved). But our experience with publishing this paper in a journal is the polar opposite. To date, the paper got desk-rejected (without peer review) 7 times. For example, Futures—a journal “for the interdisciplinary study of futures, visioning, anticipation and foresight” justified their negative decision by writing: “while your results are of potential interest, the topic of your manuscript falls outside of the scope of this journal”. Until finally, to our excitement, it was for once sent out for review. But then came the [...] --- First published: November 3rd, 2025 Source: https://www.lesswrong.com/posts/rmYj6PTBMm76voYLn/publishing-academic-papers-on-transformative-ai-is-a --- Narrated by TYPE III AUDIO.

    7 min
  3. 1 DAY AGO

    “The Unreasonable Effectiveness of Fiction” by Raelifin

    [Meta: This is Max Harms. I wrote a novel about China and AGI, which comes out today. This essay from my fiction newsletter has been slightly modified for LessWrong.] In the summer of 1983, Ronald Reagan sat down to watch the film War Games, starring Matthew Broderick as a teen hacker. In the movie, Broderick's character accidentally gains access to a military supercomputer with an AI that almost starts World War III. “The only winning move is not to play.” After watching the movie, Reagan, newly concerned with the possibility of hackers causing real harm, ordered a full national security review. The response: “Mr. President, the problem is much worse than you think.” Soon after, the Department of Defense revamped their cybersecurity policies and the first federal directives and laws against malicious hacking were put in place. But War Games wasn't the only story to influence Reagan. His administration pushed for the Strategic Defense Initiative ("Star Wars") in part, perhaps, because the central technology—a laser that shoots down missiles—resembles the core technology behind the 1940 spy film Murder in the Air, which had Reagan as lead actor. Reagan was apparently such a superfan of The Day the Earth Stood Still [...] --- Outline: (05:05) AI in Particular (06:45) What's Going On Here? (11:19) Authorial Responsibility The original text contained 10 footnotes which were omitted from this narration. --- First published: November 3rd, 2025 Source: https://www.lesswrong.com/posts/uQak7ECW2agpHFsHX/the-unreasonable-effectiveness-of-fiction --- Narrated by TYPE III AUDIO.

    15 min
  4. 2 DAYS AGO

    “Legible vs. Illegible AI Safety Problems” by Wei Dai

    Some AI safety problems are legible (obvious or understandable) to company leaders and government policymakers, implying they are unlikely to deploy or allow deployment of an AI while those problems remain open (i.e., appear unsolved according to the information they have access to). But some problems are illegible (obscure or hard to understand, or in a common cognitive blind spot), meaning there is a high risk that leaders and policymakers will decide to deploy or allow deployment even if they are not solved. (Of course, this is a spectrum, but I am simplifying it to a binary for ease of exposition.) From an x-risk perspective, working on highly legible safety problems has low or even negative expected value. Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems. In contrast, working on the illegible problems (including by trying to make them more legible) does not have this issue and therefore has a much higher expected value (all else being equal, such as tractability). Note that according to this logic, success in making an illegible problem highly legible is almost as good as solving [...] The original text contained 2 footnotes which were omitted from this narration. --- First published: November 4th, 2025 Source: https://www.lesswrong.com/posts/PMc65HgRFvBimEpmJ/legible-vs-illegible-ai-safety-problems --- Narrated by TYPE III AUDIO.

    3 min
  5. 3 DAYS AGO

    “Lack of Social Grace is a Lack of Skill” by Screwtape

    1.  I have claimed that one of the fundamental questions of rationality is “what am I about to do and what will happen next?” One of the domains I ask this question the most is in social situations. There are a great many skills in the world. If I had the time and resources to do so, I’d want to master all of them. Wilderness survival, automotive repair, the Japanese language, calculus, heart surgery, French cooking, sailing, underwater basket weaving, architecture, Mexican cooking, functional programming, whatever it is people mean when they say “hey man, just let him cook.” My inability to speak fluent Japanese isn’t a sin or a crime. However, it isn’t a virtue either; if I had the option to snap my fingers and instantly acquire the knowledge, I’d do it. Now, there's a different question of prioritization; I tend to pick new skills to learn by a combination of what's useful to me, what sounds fun, and what I’m naturally good at. I picked up the basics of computer programming easily, I enjoy doing it, and it turned out to pay really well. That was an over-determined skill to learn. On the other [...] --- Outline: (00:10) 1. (03:42) 2. (06:44) 3. The original text contained 2 footnotes which were omitted from this narration. --- First published: November 3rd, 2025 Source: https://www.lesswrong.com/posts/NnTwbvvsPg5kj3BKq/lack-of-social-grace-is-a-lack-of-skill-1 --- Narrated by TYPE III AUDIO.

    11 min
  6. 3 DAYS AGO

    “What’s up with Anthropic predicting AGI by early 2027?” by ryan_greenblatt

    As far as I'm aware, Anthropic is the only AI company with official AGI timelines[1]: they expect AGI by early 2027. In their recommendations (from March 2025) to the OSTP for the AI action plan they say: As our CEO Dario Amodei writes in 'Machines of Loving Grace', we expect powerful AI systems will emerge in late 2026 or early 2027. Powerful AI systems will have the following properties: Intellectual capabilities matching or exceeding that of Nobel Prize winners across most disciplines—including biology, computer science, mathematics, and engineering. [...] They often describe this capability level as a "country of geniuses in a datacenter". This prediction is repeated elsewhere and Jack Clark confirms that something like this remains Anthropic's view (as of September 2025). Of course, just because this is Anthropic's official prediction[2] doesn't mean that all or even most employees at Anthropic share the same view.[3] However, I do think we can reasonably say that Dario Amodei, Jack Clark, and Anthropic itself are all making this prediction.[4] I think the creation of transformatively powerful AI systems—systems as capable or more capable than Anthropic's notion of powerful AI—is plausible in 5 years [...] --- Outline: (02:27) What does powerful AI mean? (08:40) Earlier predictions (11:19) A proposed timeline that Anthropic might expect (19:10) Why powerful AI by early 2027 seems unlikely to me (19:37) Trends indicate longer (21:48) My rebuttals to arguments that trend extrapolations will underestimate progress (26:14) Naively trend extrapolating to full automation of engineering and then expecting powerful AI just after this is probably too aggressive (30:08) What I expect (32:12) What updates should we make in 2026? 
(32:17) If something like my median expectation for 2026 happens (34:07) If something like the proposed timeline (with powerful AI in March 2027) happens through June 2026 (35:25) If AI progress looks substantially slower than what I expect (36:09) If AI progress is substantially faster than I expect, but slower than the proposed timeline (with powerful AI in March 2027) (36:51) Appendix: deriving a timeline consistent with Anthropic's predictions The original text contained 94 footnotes which were omitted from this narration. --- First published: November 3rd, 2025 Source: https://www.lesswrong.com/posts/gabPgK9e83QrmcvbK/what-s-up-with-anthropic-predicting-agi-by-early-2027-1 --- Narrated by TYPE III AUDIO.

    39 min
  7. 3 DAYS AGO

    [Linkpost] “Emergent Introspective Awareness in Large Language Models” by Drake Thomas

    This is a link post. New Anthropic research (tweet, blog post, paper): We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model's activations, and measuring the influence of these manipulations on the model's self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness [...] --- First published: October 30th, 2025 Source: https://www.lesswrong.com/posts/QKm4hBqaBAsxabZWL/emergent-introspective-awareness-in-large-language-models Linkpost URL: https://transformer-circuits.pub/2025/introspection/index.html --- Narrated by TYPE III AUDIO.

    3 min

