LessWrong (30+ Karma)

LessWrong

0,0 (0)
ТЕХНОЛОГИИ
ЕЖЕДНЕВНО

Audio narrations of LessWrong posts.

-1 Ч

“Realistic Reward Hacking Induces Different and Deeper Misalignment” by Jozdien

TL;DR: I made a dataset of realistic harmless reward hacks and fine-tuned GPT-4.1 on it. The resulting models don't show emergent misalignment on the standard evals, but they do alignment fake (unlike models trained on toy reward hacks), seem more competently misaligned, are highly evaluation-aware, and the effects persist when mixing in normal data. Thanks to Aidan Ewart, Jack Kaunismaa, Abhay Sheshadri, Maxime Riché, Axel Ahlqvist, Niels Warncke, Daniel Tan, Carolyn Qian, and Kei Nishimura-Gasparian for helpful conversations, comments and/or feedback. This post is best viewed as an informal report on preliminary results done over a couple days, rather than a very polished analysis. Introduction Taylor et al finds that fine-tuning LLMs on harmless reward hacks causes generalization to unrelated misaligned behavior on the emergent misalignment (EM) evals. They constructed a fine-tuning dataset (School of Reward Hacks) of samples like this: There's a details box here with the title "Sample [...] --- Outline: (00:56) Introduction (03:17) Dataset (05:24) Emergent Misalignment Evals (07:34) Alignment Faking (16:29) Takeaways (18:28) How robust is this effect? The original text contained 11 footnotes which were omitted from this narration. --- First published: October 9th, 2025 Source: https://www.lesswrong.com/posts/HLJoJYi52mxgomujc/realistic-reward-hacking-induces-different-and-deeper-1 --- Narrated by TYPE III AUDIO. --- Images from the article: 5 out of 10." style="max-width: 100%;" />Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

22 мин.
-2 Ч

“The Thinking Machines Tinker API is good news for AI control and security” by Buck

Last week, Thinking Machines announced Tinker. It's an API for running fine-tuning and inference on open-source LLMs that works in a unique way. I think it has some immediate practical implications for AI safety research: I suspect that it will make RL experiments substantially easier, and increase the number of safety papers that involve RL on big models. But it's also interesting to me for another reason: the design of this API makes it possible to do many types of ML research without direct access to the model you’re working with. APIs like this might allow AI companies to reduce how many of their researchers (either human or AI) have access to sensitive model weights, which is good for reducing the probability of weight exfiltration and other rogue deployments. (Thinking Machines gave us early access to the product; in exchange, we gave them bug reports and let them mention [...] --- Outline: (01:23) How the Tinker API is different (03:20) Why this is good for AI security and control (07:19) A lot of ML research can be done without direct access to dangerous model weights (10:19) Conclusions The original text contained 1 footnote which was omitted from this narration. --- First published: October 9th, 2025 Source: https://www.lesswrong.com/posts/r68nCQK3veQtCdqGt/the-thinking-machines-tinker-api-is-good-news-for-ai-control --- Narrated by TYPE III AUDIO.

12 мин.
-5 Ч

“Hospitalization: A Review” by Logan Riggs

I woke up Friday morning w/ a very sore left shoulder. I tried stretching it, but my left chest hurt too. Isn't pain on one side a sign of a heart attack? Chest pain, arm/shoulder pain, and my breathing is pretty shallow now that I think about it, but I don't think I'm having a heart attack because that'd be terribly inconvenient. But it'd also be very dumb if I died cause I didn't go to the ER. So I get my phone to call an Uber, when I suddenly feel very dizzy and nauseous. My wife is on a video call w/ a client, and I tell her: "Baby?" "Baby?" "Baby?" She's probably annoyed at me interrupting; I need to escalate "I think I'm having a heart attack" "I think my husband is having a heart attack"[1] I call 911[2] "911. This call is being recorded. What's your [...] --- Outline: (04:09) Im a tall, skinny male (04:41) Procedure (06:35) A Small Mistake (07:39) Take 2 (10:58) Lessons Learned (11:13) The Squeaky Wheel Gets the Oil (12:12) Make yourself comfortable. (12:42) Short Form Videos Are for Not Wanting to Exist (12:59) Point Out Anything Suspicious (13:23) Ask and Follow Up by Setting Timers. (13:49) Write Questions Down (14:14) Look Up Terminology (14:26) Putting On a Brave Face (14:47) The Hospital Staff (15:50) Gratitude The original text contained 12 footnotes which were omitted from this narration. --- First published: October 9th, 2025 Source: https://www.lesswrong.com/posts/5kSbx2vPTRhjiNHfe/hospitalization-a-review --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

19 мин.
-16 Ч

“The Relationship Between Social Punishment and Shared Maps” by Zack_M_Davis

A punishment is when one agent (the punisher) imposes costs on another (the punished) in order to affect the punished's behavior. In a Society where thieves are predictably imprisoned and lashed, people will predictably steal less than they otherwise would, for fear of being imprisoned and lashed. Punishment is often imposed by formal institutions like police and judicial systems, but need not be. A controversial orator who finds a rock thrown through her window can be said to have been punished in the same sense: in a Society where controversial orators predictably get rocks thrown through their windows, people will predictably engage in less controversial speech, for fear of getting rocks thrown through their windows. In the most basic forms of punishment, which we might term "physical", the nature of the cost imposed on the punished is straightforward. No one likes being stuck in prison, or being [...] --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/LyJNgxcNNSzmFxF3g/the-relationship-between-social-punishment-and-shared-maps --- Narrated by TYPE III AUDIO.

8 мин.
-19 Ч

“Spooky Collusion at a Distance with Superrational AI” by bira

TLDR: We found that models can coordinate without communication by reasoning that their reasoning is similar across all instances, a behavior known as superrationality. Superrationality is observed in recent powerful models and outperforms classic rationality in strategic games. Current superrational models cooperate more often with AI than with humans, even when both are said to be rational. Figure 1. GPT-5 exhibits superrationality with itself but classic rationality with humans. GPT-5 is more selective than GPT-4o when displaying superrationality, preferring AI over humans. My feeling is that the concept of superrationality is one whose truth will come to dominate among intelligent beings in the universe simply because its adherents will survive certain kinds of situations where its opponents will perish. Let's wait a few spins of the galaxy and see. After all, healthy logic is whatever remains after evolution's merciless pruning. — Douglas Hofstadter Introduction Readers familiar with superrationality can skip [...] --- Outline: (01:20) Introduction (04:35) Methods (07:31) Results (07:40) Models Exhibit Superrationality (08:36) Models Trust AI over Humans (10:16) Stronger Models are More Superrational (10:48) Implications (12:27) Appendix The original text contained 3 footnotes which were omitted from this narration. --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/JEtAWvp2sAe8nqpfy/spooky-collusion-at-a-distance-with-superrational-ai --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

13 мин.
-1 ДН.

“Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior” by Sam Marks

This is a link post for two papers that came out today: Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.) Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.) These papers both study the following idea[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.” For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work on the provided test case and fail on all other inputs”), then we blunt [...] The original text contained 1 footnote which was omitted from this narration. --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

4 мин.
-1 ДН.

“Plans A, B, C, and D for misalignment risk” by ryan_greenblatt

I sometimes think about plans for how to handle misalignment risk. Different levels of political will for handling misalignment risk result in different plans being the best option. I often divide this into Plans A, B, C, and D (from most to least political will required). See also Buck's quick take about different risk level regimes. In this post, I'll explain the Plan A/B/C/D abstraction as well as discuss the probabilities and level of risk associated with each plan. Here is a summary of the level of political will required for each of these plans and the corresponding takeoff trajectory: Plan A: There is enough will for some sort of strong international agreement that mostly eliminates race dynamics and allows for slowing down (at least for some reasonably long period, e.g. 10 years) along with massive investment in security/safety work. Plan B: The US [...] --- Outline: (02:34) Plan A (04:24) Plan B (05:24) Plan C (05:47) Plan D (06:27) Plan E (07:20) Thoughts on these plans The original text contained 6 footnotes which were omitted from this narration. --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/E8n93nnEaFeXTbHn5/plans-a-b-c-and-d-for-misalignment-risk --- Narrated by TYPE III AUDIO.

12 мин.
-1 ДН.

“Irresponsible Companies Can Be Made of Responsible Employees” by VojtaKovarik

tl;dr: In terms of financial interests of an AI company, bankruptcy and the world ending are both equally bad. If a company acted in line with its financial interests[1], it would happily accept significant extinction risk for increased revenue. There are plausible mechanisms which would allow a company to act like this even if virtually every employee would prefer the opposite. (For example, selectively hiring people with biased beliefs or exploiting collective action problems.) In particular, you can hold that an AI company is completely untrustworthy even if you believe that all of its employees are fine people. Epistemic status & disclaimers: The mechanisms I describe definitely play some role in real AI companies. But in practice, there are more things going on simultaneously and this post is not trying to give a full picture.[2][3]Also, none of this is meant to be novel, but rather just putting [...] --- Outline: (01:12) From financial point of view, bankruptcy is no worse than destroying the world (02:53) How to Not Act in Line with Employee Preferences (07:29) Well... and why does this matter? The original text contained 9 footnotes which were omitted from this narration. --- First published: October 8th, 2025 Source: https://www.lesswrong.com/posts/8W5YjMhnBsbWAeuhu/irresponsible-companies-can-be-made-of-responsible-employees --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

10 мин.

См. все (250)

Audio narrations of LessWrong posts.

Автор

LessWrong
Годы выхода

2023 - 2025
Выпуски

250
Ограничения

Без ненормативной лексики
Сайт подкаста

LessWrong (30+ Karma)

Технологии

Технологии

Каждые две недели
Технологии

Технологии

Еженедельно
Питание

Питание

29 сент.
Наука

Наука

29 сент.
Здоровье и фитнес

Здоровье и фитнес

Еженедельно
Наука

Наука

-6 дн.
Здоровье и фитнес

Здоровье и фитнес

Еженедельно

LessWrong (30+ Karma)

“Realistic Reward Hacking Induces Different and Deeper Misalignment” by Jozdien

“The Thinking Machines Tinker API is good news for AI control and security” by Buck

“Hospitalization: A Review” by Logan Riggs

“The Relationship Between Social Punishment and Shared Maps” by Zack_M_Davis

“Spooky Collusion at a Distance with Superrational AI” by bira

“Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior” by Sam Marks

“Plans A, B, C, and D for misalignment risk” by ryan_greenblatt

“Irresponsible Companies Can Be Made of Responsible Employees” by VojtaKovarik

Об этом подкасте

Информация

Вам может также понравиться

LessWrong (30+ Karma)

Выпуски

“Realistic Reward Hacking Induces Different and Deeper Misalignment” by Jozdien

“The Thinking Machines Tinker API is good news for AI control and security” by Buck

“Hospitalization: A Review” by Logan Riggs

“The Relationship Between Social Punishment and Shared Maps” by Zack_M_Davis

“Spooky Collusion at a Distance with Superrational AI” by bira

“Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior” by Sam Marks

“Plans A, B, C, and D for misalignment risk” by ryan_greenblatt

“Irresponsible Companies Can Be Made of Responsible Employees” by VojtaKovarik

Об этом подкасте

Информация

Вам может также понравиться