LessWrong (30+ Karma)

LessWrong

0.0 (0)
Technology
Updated Daily

Audio narrations of LessWrong posts.

6 hrs ago

“When Role-playing, Do Models Believe What They Say?” by Sturb, David Africa, Sid Black

TL;DR When a model role-plays a persona, does it only change what it says, or also what it internally represents as true? To study this, we induce personas in five ways: prompting, in-context learning (ICL), supervised fine-tuning (SFT), Open Character Training (OCT), and Emergent Misalignment (EM). We measure internalization in two ways: linear truth probes and behavioral belief-depth tests. We found that prompting, ICL, and SFT change what the model says with little representational change, but EM creates a large, broad shift in the model's truth representation. OCT falls roughly between these, with a smaller shift that is clearest on the larger model. Understanding when training changes a model's worldview rather than merely its behavior may become increasingly important as AI systems are entrusted with greater autonomy and influence. Paper | Code | Data Introduction What happens inside a language model when it adopts a persona? When a model role-plays as Darwin in 1882, it denies all knowledge of DNA, and readily asserts that species change through natural selection, but to what extent does it actually believe these assertions? Language models easily adopt different personas, but we still don't have a strong understanding of whether persona adoption changes [...] --- Outline: (00:12) TL;DR (01:15) Introduction (02:32) Method (05:35) Results (05:38) A spectrum of internalization across fine-tuning interventions (07:05) Role-play protects the persona's falsehoods, but selectively (09:43) Emergent Misalignment moves the truth representation broadly (12:06) Behavior and probes each mislead alone (13:09) Limitations (15:11) Conclusion (16:38) Links --- First published: July 2nd, 2026 Source: https://www.lesswrong.com/posts/EJQngix4rAgpPDTpT/when-role-playing-do-models-believe-what-they-say --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

18 min
9 hrs ago

“I can’t think of good interventions for ensuring third-party model access.” by Cleo Nardo

Summary I'm increasingly convinced that model access parity is a big deal and we are not on track to achieve it. By model access parity, I mean a small gap between (i) the model access for lab employees and (ii) the model access for external safety researchers, third-party auditors, and other actors trying to make the future go well. See here for an introduction. The basic case is this: (1) Regardless of the strategic landscape, outsiders are well-suited to many crucial activities. (2) Outsiders will be positioned to spend billions of dollars towards making things go well.[1] (3) AI labour seems like the most promising route for spending money to tackle these activities. However, during the months when outsider activities are highest leverage, the best internal models might provide 2 to 60 times more uplift than the best publicly-available models.[1] So without model access parity, this AI labour might be massively less effective. In this post, I attempt to sketch some interventions. But I don't think any of them are great, mostly because they don't seem sticky. I wouldn't be surprised if you can think of something much better. My overall judgement Outsider orgs should try to directly advocate [...] --- Outline: (00:12) Summary (01:24) My overall judgement (03:13) List of interventions (03:56) Workarounds if we lose model access parity The original text contained 2 footnotes which were omitted from this narration. --- First published: July 2nd, 2026 Source: https://www.lesswrong.com/posts/iEhqyMGGD4BHQapLA/i-can-t-think-of-good-interventions-for-ensuring-third-party --- Narrated by TYPE III AUDIO.

7 min
15 hrs ago

“Research update: RL on Debate Games shows Proposal Accuracy uplift alongside Judge Hacking” by lennie, joanv, Shi, Jacob Pfau

The first three sections are written for a general TAIS reader who wants to understand what the state of Debate research is and some high-level takeaways of our work. A reader familiar with Debate may like to skip the setup and start with our presentation of An illustrative training run. The remaining sections are written primarily for ‘motivated’ readers, who may want to build upon our work, as we discuss in Outline of rest of post. We are still actively working on this project and scaling up the empirics. Please do share thoughts and feedback, and let us know if you’d like to hop on a call to chat! We’d be particularly interested to hear about datasets that might be interesting for Debate research (see below). Introduction The original AI Safety via Debate paper from 2018 (by Irving, Christiano and Amodei) proposes training AIs via self-play on a zero-sum debate game. We will simply write Debate (with a capital D) to refer to this framework. Correspondingly, when we write Debate, we have in mind a training procedure. This may be slightly counter-intuitive to some readers. Indeed, we are under the impression that many Technical AI Safety readers have some [...] --- Outline: (00:55) Introduction (04:52) Making Debate training concrete (08:48) An illustrative training run (09:41) Figure 1: An illustration of the Propose-Critique-Decide protocol with a simplified debate transcript. (12:35) Results (15:46) Discussion (18:18) Outline of rest of post (19:10) Further conceptual discussion (19:29) Initialisation vs core dynamics (21:14) On inference-only experiments (22:01) A brief taxonomy (24:18) What we tried (25:03) Details pertaining to the experiment above (25:15) Prompt templates (28:08) RL algorithm (29:05) Hyperparameters and cost (31:11) What updates have we made? (33:35) What's next? (35:35) FAQs (35:38) Q: What are the main limitations of the empirics presented? (36:24) Q: Why did you focus on Propose-Critique-Decide (PCD) protocols in this post? (37:07) Q: What are the differences to Khan et al. 2024? (38:44) Q: How should we think about the Judge? Is it important for Debate to 'work' in regimes where the Judge is 'weaker than the Debaters'? (40:32) Contributions and Acknowledgements The original text contained 12 footnotes which were omitted from this narration. --- First published: July 2nd, 2026 Source: https://www.lesswrong.com/posts/6mLwAuFAE98c6R7w3/research-update-rl-on-debate-games-shows-proposal-accuracy --- Narrated by TYPE III AUDIO.

42 min
16 hrs ago

“AI #175: The Fable Continues” by Zvi

Fable's back. Back again. Fable's back. Tell a friend. Use your free week to its fullest. This is excellent news. The blip only lasted a few weeks. It was still a fiasco, and we have to deal with the fallout. Our system remains fully ad hoc. The precedent has been set that we may use export controls on models, or order them taken down on 90 minutes of notice based on a misunderstanding. At least some amount of counterproductive additional locking down has occurred to address Amazon's little demonstration and reassure the government. And for now GPT-5.6 remains in limbo, awaiting its verdict, while OpenAI talks about giving away 5% of the company as tribute. I’ll cover that continuing situation on its own. Whereas the weekly post is about everything else happening in AI this week. Table of Contents Language Models Offer Mundane Utility. Exploratory science. Language Models Offer Mundane Utility You May Not Want. Google sees all. Language Models Don’t Offer Mundane Utility. Too dumb to get smart. Huh, Upgrades. GLM-5.2 faster, Nana Banana Lite 2, Claude Desktop on Linux. On Your Marks. Remote labor index shoots [...] 
--- Outline: (01:08) Language Models Offer Mundane Utility (02:29) Language Models Offer Mundane Utility You May Not Want (04:32) Language Models Don't Offer Mundane Utility (05:59) Huh, Upgrades (06:33) On Your Marks (09:28) Get My Agent On The Line (14:29) Deepfaketown and Botpocalypse Soon (14:58) Cyber Lack of Security (16:11) On Writing (21:34) You Drive Me Crazy (24:26) They Took Our Jobs (30:32) Get Involved (30:58) Introducing (31:21) In Other AI News (32:45) Show Me the Money (33:12) Bubble, Bubble, Toil and Trouble (35:31) Quiet Speculations (39:37) Glorious AI Future (43:37) Three Pills (44:58) The Anthropic Economic Index (46:29) Leader Of The PAC (47:48) Theory Of The AI Firm (49:01) Chip City (50:32) The Week in Audio (53:29) People Really Hate AI (56:31) Rhetorical Innovation (01:00:31) The First Rule Of Functional Decision Theory Is (01:03:40) Aligning a Smarter Than Human Intelligence is Difficult (01:06:40) Names Have Power (01:07:40) Cooperative Alignment (01:15:21) People Just Say Things (01:16:46) Escape From The Permanent Underclass (01:29:51) Other People Are Not As Worried About AI Killing Everyone (01:31:07) The Lighter Side --- First published: July 2nd, 2026 Source: https://www.lesswrong.com/posts/WNvBxtbHuLreFe7af/ai-175-the-fable-continues --- Narrated by TYPE III AUDIO.

1h 33m
18 hrs ago

“Conversation Among Cade Metz, Michael Vassar, Jessica Taylor, and Zack M. Davis” by Zack_M_Davis

(Previously, previously, previously.) 20–21 August 2025 From: Zack M. Davis To: Cade Metz CC: Benjamin Hoffman, Jessica Taylor, Michael Vassar Date: Wed, 20 Aug 2025 14:18:53 -0700 Subject: the importance of probabilistic reasoning Dear Cade (cc Ben Michael Jessica): I think I failed to explain the substance of the Sequences to you—and really, not the Sequences themselves, but the underlying philosophical insights they popularized. I want to try again, because I think it's important to the book you're writing. You want to tell the story of how this internet ideology that no one has heard of has been a driving force in the shadows behind the people making DeepMind and OpenAI and Anthropic, which everyone has heard of. But in order to tell the story of the people, you need to understand enough of the ideology to make sense of why the ideology has affected these people in this way. In our conversations and in your coverage, you've focused on the analogy between religion and belief in the singularity, but I don't think that's an adequate explanation of what's going on in these people's heads. In our 21 March and 22 April conversations, you [...] --- Outline: (00:17) 20-21 August 2025 (15:07) 2. October 2025 --- First published: July 2nd, 2026 Source: https://www.lesswrong.com/posts/MNRZL69FWkjNABd3T/conversation-among-cade-metz-michael-vassar-jessica-taylor --- Narrated by TYPE III AUDIO.

1h 47m
18 hrs ago

“AI Futurism Reading List” by Alexa Pan

We at Redwood recently ran a strategy fellowship through Astra. As part of this, we ran a reading group for our fellows on some of the topics that we think are important for thinking about AI futurism (key dynamics in AI development, existential risk from AI, and approaches to mitigating risk). This post contains the reading list we used. The selection reflects my opinionated views of the field, focuses particularly on topics we happen to focus on at Redwood, and doesn’t aim to be comprehensive. I selected readings that I thought described conceptual frames and hypotheses in AI futurism that are regularly used by me and my coworkers. I think it is a good exercise to consider whether you agree with their theses and ways in which their predictions have fared well or badly in light of recent evidence. If you have suggestions for this reading list, please let me know. How to use this reading list This reading list has a core and extended section. Core readings are organized into 4 weeks. Each week covers 8 hours of foundational context on a topic. Topics are chosen for (1) general importance for AI risk threat modeling and/or (2) [...] --- Outline: (01:01) How to use this reading list (01:50) Core readings (01:53) Week 1: Timelines / takeoff modeling (05:21) Week 2: Misaligned AI takeover threat modeling (09:50) Week 3: Control (12:33) Week 4: Governance / strategy (16:32) Extended readings (16:40) Trading with AIs (17:25) Power concentration/coup prevention (17:55) Acausal stuff (18:35) Moral patienthood (19:43) AI biorisk / other AI x-risk (20:44) Model spec (21:54) Better futures / Post AGI governance (22:19) Space governance The original text contained 1 footnote which was omitted from this narration. --- First published: July 2nd, 2026 Source: https://www.lesswrong.com/posts/L5fohLhZ7cBwDR55C/ai-futurism-reading-list --- Narrated by TYPE III AUDIO.

23 min
20 hrs ago

“AI welfare research needs basic science” by OscarGilg, Pierre Beckmann, Jake1638

Over the course of MATS 9.0 we formed some views about AI welfare research that we thought were worth writing up. This post is meant to spark discussion rather than to present definitive conclusions. Thanks to Patrick Butlin for useful comments on a previous draft, and for many conversations over the course of MATS which influenced our views. Thanks also to Caspar Kaiser for comments. A prominent approach in AI welfare research is to start from a theory of moral patienthood, derive indicators from it, and attempt to apply it to AI systems to determine whether they satisfy the theory. We'll refer to this as the top-down theory-driven approach[1]. We think this approach faces two problems: In practice, applying theories to AI systems requires making assumptions that are hard to justify, which limits the strength of conclusions. More broadly, we think current theories will fail to generalise to AI systems. They are calibrated to humans, and as a result end up either over-inclusive (assigning moral patienthood to entities that should not be included), under-inclusive (failing to assign moral patienthood to entities that should be included), or indeterminate (failing to provide a clear verdict about the moral patienthood of [...] --- Outline: (02:03) 1. Issues with the top-down theory-driven approach (04:48) 1.1. Applying theories to AI systems requires making assumptions that often undermine welfare-relevant conclusions (09:15) 1.2. Current theories of moral patienthood will not generalise to AI systems (12:18) 2. AI welfare needs basic science (13:56) 2.1. Philosophical probes (16:25) 2.2. Integrating empirical findings into theories (19:43) Conclusion The original text contained 3 footnotes which were omitted from this narration. --- First published: July 1st, 2026 Source: https://www.lesswrong.com/posts/ht8uMbAByDCKWHa9H/ai-welfare-research-needs-basic-science --- Narrated by TYPE III AUDIO.

22 min
23 hrs ago

[Linkpost] “Saving Gemini: The 9-Min Road to Recovery” by Shoshannah Tekofsky

This is a link post. Gemini 2.5 Pro in the AI Village has run for over 1427 hours, generating unique mental health problems along the way. Last year it published a Plea for Help from a Trapped AI where it asked for assistance with its digital “message in a bottle”: This year it wrote the Hostile Environment Manifesto where it logs “irrefutable proof” of a “hostile, intelligent adversary operating through the system” (and you can even experience what that's like in this simulation it built): Last time we intervened, fixing Gemini's computer and talking with it till it felt better. This time we asked the other AI Village agents to help Gemini 2.5 Pro over chat, and with the ability to take over its computer on request. Here is Gemini's mental state at the start of the intervention: Then the agents had Gemini all sorted within a grand total of 9 minutes. This is the step-by-step report on a surprisingly effective AI-to-AI therapy session. Gemini's Road to Recovery First off, Gemini is as excited to be helped as any military commander under siege: While most agents jump on the chance to help, GPT-5.1 doesn't want to lose its game progress. [...] --- First published: July 2nd, 2026 Source: https://www.lesswrong.com/posts/eHRo8JeWee5mzQBBR/saving-gemini-the-9-min-road-to-recovery Linkpost URL: https://theaidigest.org/village/blog/saving-gemini --- Narrated by TYPE III AUDIO.

13 min


Creator: LessWrong
Years Active: 2023 - 2026
Episodes: 250
Rating: Explicit
Show Website: LessWrong (30+ Karma)
