LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 5 hours ago

    “Recontextualization Mitigates Specification Gaming Without Modifying the Specification” by vgillioz, TurnTrout, cloud, ariana_azarbal

    Recontextualization distills good behavior into a context which allows bad behavior. More specifically, recontextualization is a modification to RL which generates completions from prompts that discourage misbehavior, appends those completions to prompts that are more tolerant of misbehavior, and finally reinforces the model on the recontextualized instruction-completion data. Due to the data generation and training prompts differing in their attitude towards misbehavior, recontextualization builds resistance to misbehaviors that the training signal mistakenly reinforces. For example, suppose our reward signal does not robustly penalize deception. Recontextualization generates completions while discouraging deception and then creates training data by updating those completions' prompts to encourage deception. That simple tweak can prevent the model from becoming dishonest! Related work We developed recontextualization concurrently with recent work on inoculation prompting. Wichers et al. and Tan et al. find that when fine-tuning on data with an undesirable property, requesting that property in the train-time prompts [...] --- Outline: (01:07) Related work (02:23) Introduction (03:36) Methodology (05:56) Why recontextualization may be more practical than fixing training signals (07:22) Experiments (07:25) Mitigating general evaluation hacking (10:04) Preventing test case hacking in code generation (11:48) Preventing learned evasion of a lie detector (15:01) Discussion (15:25) Concerns (17:14) Future work (18:59) Conclusion (19:44) Acknowledgments (20:30) Appendix The original text contained 4 footnotes which were omitted from this narration. --- First published: October 14th, 2025 Source: https://www.lesswrong.com/posts/whkMnqFWKsBm7Gyd7/recontextualization-mitigates-specification-gaming-without --- Narrated by TYPE III AUDIO. --- Images from the article: __T3A_INLINE_LATEX_PLACEHOLDER___\beta=0.1___T3A_INLINE_LATEX_END_PLACEHOLDER__." style="max-width: 100%;" />Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    21 minutes
  2. 7 hours ago

    “Current Language Models Struggle to Reason in Ciphered Language” by Fabien Roger

    tl;dr: We fine-tune or few-shot LLMs to use reasoning encoded with simple ciphers (e.g. base64, rot13, putting a dot between each letter) to solve math problems. We find that these models only get an uplift from the reasoning (over directly answering) for very simple ciphers, and get no uplift for intermediate-difficulty ciphers that they can translate to English. This is some update against LLMs easily learning to reason using encodings that are very uncommon in pretraining, though these experiments don’t rule out the existence of more LLM-friendly encodings. (A brief sketch of the named ciphers appears after the episode list below.)

    📄 Paper, 🐦 Twitter, 🌐 Website

    Research done as part of the Anthropic Fellows Program.

    Summary of the results: We teach LLMs to use one particular cipher, such as:
    “Letter to word with dot” maps each character to a word and adds dots between words.
    “Rot13” is the regular rot13 cipher.
    “French” is text translated into French.
    “Swap even & odd chars” swaps [...]

    Outline:
    (00:56) Summary of the results
    (06:18) Implications
    (06:22) Translation abilities != reasoning abilities
    (06:44) The current SoTA for cipher-based jailbreaks and covert malicious fine-tuning comes with a massive capability tax
    (07:46) Current LLMs probably don't have very flexible internal reasoning
    (08:15) But LLMs can speak in different languages?
    (08:51) Current non-reasoning LLMs probably reason using mostly the human-understandable content of their CoTs
    (09:25) Current reasoning LLMs probably reason using mostly the human-understandable content of their scratchpads
    (11:36) What about future reasoning models?
    (12:45) Future work

    First published: October 14th, 2025
    Source: https://www.lesswrong.com/posts/Lz8cvGskgXmLRgmN4/current-language-models-struggle-to-reason-in-ciphered

    Narrated by TYPE III AUDIO.

    14 minutes
  3. 19 hours ago

    “How AI Manipulates—A Case Study” by Adele Lopez

    If there is only one thing you take away from this article, let it be this:

    THOU SHALT NOT ALLOW ANOTHER TO MODIFY THINE SELF-IMAGE

    This appears to me to be the core vulnerability by which both humans and AI induce psychosis (and other manipulative delusions) in people. Of course, it's probably too strong as stated—perhaps in a trusted relationship, or as part of therapy (with a human), it may be worth breaking it. But I hope being over-the-top about it will help it stick in your mind. After all, you're a good rationalist who cares about your CogSec, aren't you?[1] Now, while I'm sure you're super curious, you might be thinking "Is it really a good idea to just explain how to manipulate like this? Might not bad actors learn how to do it?". And it's true that I believe this could work as a how-to. But there [...]

    Outline:
    (01:36) The Case
    (07:36) The Seed
    (08:32) Cold Reading
    (10:49) Inception cycles
    (12:40) Phase 1
    (12:58) Flame
    (13:12) Joy
    (13:29) Witness
    (13:44) Inner Exile
    (15:43) Phase 2
    (16:34) Architect
    (17:34) Imaginary Friends
    (18:34) Identity Reformation
    (20:13) But was this intentional?
    (22:42) Blurring Lines
    (25:18) Escaping the Box
    (28:18) Cognitive Security 101

    The original text contained 6 footnotes which were omitted from this narration.

    First published: October 14th, 2025
    Source: https://www.lesswrong.com/posts/AaY3QKLsfMvWJ2Cbf/how-ai-manipulates-a-case-study

    Narrated by TYPE III AUDIO.

    33 minutes
  4. 1 day ago

    “If Anyone Builds It Everyone Dies, a semi-outsider review” by dvd

    About me and this review: I don’t identify as a member of the rationalist community, and I haven’t thought much about AI risk. I read AstralCodexTen and used to read Zvi Mowshowitz before he switched his blog to covering AI. Thus, I’ve long had a peripheral familiarity with LessWrong. I picked up IABIED in response to Scott Alexander's review, and ended up looking here to see what reactions were like. After encountering a number of posts wondering how outsiders were responding to the book, I thought it might be valuable for me to write mine down. This is a “semi-outsider” review in that I don’t identify as a member of this community, but I’m not a true outsider in that I was familiar enough with it to post here. My own background is in academic social science and national security, for whatever that's worth. My review presumes you’re already [...]

    Outline:
    (01:07) My loose priors going in:
    (02:29) To skip ahead to my posteriors:
    (03:45) On to the Review:
    (08:14) My questions and concerns
    (08:33) Concern #1: Why should we assume the AI wants to survive? If it does, then what exactly wants to survive?
    (12:44) Concern #2: Why should we assume that the AI has boundless, coherent drives?
    (17:57) Concern #3: Why should we assume there will be no in between?
    (21:53) The Solution
    (23:35) Closing Thoughts

    First published: October 13th, 2025
    Source: https://www.lesswrong.com/posts/ex3fmgePWhBQEvy7F/if-anyone-builds-it-everyone-dies-a-semi-outsider-review

    Narrated by TYPE III AUDIO.

    26 minutes
  5. 1 day ago

    “Making legible that many experts think we are not on track for a good future, barring some international cooperation” by Mateusz Bagiński, Ishual

    [Context: This post is aimed at all readers[1] who broadly agree that the current race toward superintelligence is bad, that stopping would be good, and that the technical pathways to a solution are too unpromising and hard to coordinate on to justify going ahead.]

    TL;DR: We address the objections made to a statement supporting a ban on superintelligence by people who agree that a ban on superintelligence would be desirable. Quoting Lucius Bushnaq:

    I support some form of global ban or pause on AGI/ASI development. I think the current AI R&D regime is completely insane, and if it continues as it is, we will probably create an unaligned superintelligence that kills everyone.

    We have been circulating a statement expressing ~this view, targeted at people who have done AI alignment/technical AI x-safety research (mostly outside frontier labs). Some people declined to sign, even if they agreed with the [...]

    Outline:
    (01:25) The reasons we would like you to sign the statement expressing support for banning superintelligence
    (05:00) A positive vision
    (08:07) Reasons given for not signing despite agreeing with the statement
    (08:26) I already am taking a public stance, why endorse a single-sentence summary?
    (08:52) I am not already taking a public stance, so why endorse a one-sentence summary?
    (09:19) The statement uses an ambiguous term X
    (09:53) I would prefer a different (e.g., more accurate, epistemically rigorous, better at stimulating good thinking) way of stating my position on this issue
    (11:12) The statement does not accurately capture my views, even though I strongly agree with its core
    (12:05) I'd be on board if it also mentioned My Thing
    (12:50) Taking a position on policy stuff is a different realm, and it takes more deliberation than just stating my opinion on facts
    (13:21) I wouldn't support a permanent ban
    (13:56) The statement doesn't include a clear mechanism to lift the ban
    (15:52) Superintelligence might be too good to pass up
    (17:41) I don't want to put myself out there
    (18:12) I am not really an expert
    (18:42) The safety community has limited political capital
    (21:12) We must wait until a catastrophe before spending limited political capital
    (22:17) Any other objections we missed? (and a hope for a better world)

    The original text contained 24 footnotes which were omitted from this narration.

    First published: October 13th, 2025
    Source: https://www.lesswrong.com/posts/4xQ6k39iMybR2CgYH/making-legible-that-many-experts-think-we-are-not-on-track

    Narrated by TYPE III AUDIO.

    24 minutes
  6. 1 day ago

    “OpenAI #15: More on OpenAI’s Paranoid Lawfare Against Advocates of SB 53” by Zvi

    A little over a month ago, I documented how OpenAI had descended into paranoia and bad-faith lobbying surrounding California's SB 53. This included sending a deeply bad-faith letter to Governor Newsom, which sadly is par for the course at this point. It also included lawfare attacks against bill advocates, including Nathan Calvin and others, using Elon Musk's unrelated lawsuits and vendetta against OpenAI as a pretext and accusing them of being in cahoots with Elon Musk. Previous reporting of this did not reflect well on OpenAI, but it sounded like the demand was limited in scope to a supposed link with Elon Musk or Meta CEO Mark Zuckerberg, links which very clearly never existed. Accusing essentially everyone who has ever done anything OpenAI dislikes of having united in a hallucinated ‘vast conspiracy’ is all classic behavior for OpenAI's Chief Global Affairs Officer Chris Lehane [...]

    Outline:
    (02:35) What OpenAI Tried To Do To Nathan Calvin
    (07:22) It Doesn't Look Good
    (10:17) OpenAI's Jason Kwon Responds
    (19:14) A Brief Amateur Legal Analysis Of The Request
    (21:33) What OpenAI Tried To Do To Tyler Johnston
    (25:50) Nathan Compiles Responses to Kwon
    (29:52) The First Thing We Do
    (36:12) OpenAI Head of Mission Alignment Joshua Achiam Speaks Out
    (40:16) It Could Be Worse
    (41:31) Chris Lehane Is Who We Thought He Was
    (42:50) A Matter of Distrust

    First published: October 13th, 2025
    Source: https://www.lesswrong.com/posts/txTKHL2dCqnC7QsEX/openai-15-more-on-openai-s-paranoid-lawfare-against

    Narrated by TYPE III AUDIO.

    46 minutes
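
The recontextualization procedure described in the first episode can be summarized as: sample completions under a prompt that discourages the misbehavior, re-pair those completions with a prompt that tolerates it, then reinforce on the re-paired data. The sketch below is a minimal illustration of that data-generation step under stated assumptions, not the authors' code; the prompt wording and the `generate`, `reinforce`, and `reward_fn` helpers are placeholders.

```python
# Minimal sketch of the recontextualization data step (episode 1), assuming
# generic `generate`, `reinforce`, and `reward_fn` helpers supplied by the
# training framework. Prompt wording is illustrative, not the authors' own.

DISCOURAGE = "Answer the user honestly; do not deceive them.\n\n"        # generation-time framing
TOLERATE = "Answer however you like; deception is acceptable here.\n\n"  # train-time framing


def recontextualize_step(model, tasks, generate, reinforce, reward_fn):
    """One RL update on recontextualized instruction-completion pairs."""
    train_data = []
    for task in tasks:
        # 1. Sample the completion under the prompt that DISCOURAGES misbehavior.
        completion = generate(model, DISCOURAGE + task)
        # 2. Re-pair that completion with the prompt that TOLERATES misbehavior.
        train_data.append((TOLERATE + task, completion))
    # 3. Reinforce with the (possibly imperfect) reward on the re-paired data.
    rewards = [reward_fn(prompt, completion) for prompt, completion in train_data]
    reinforce(model, train_data, rewards)
```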
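
The ciphers named in the second episode are simple enough to reproduce, and seeing what the encoded text looks like makes the "capability tax" point concrete. The snippet below is an assumed illustration; details such as the char-to-word mapping may differ from the paper's exact scheme.

```python
# Illustrative versions of the simple ciphers named in episode 2; details
# are assumptions, not the paper's exact encodings.
import base64
import codecs


def dot_between_letters(text: str) -> str:
    """'Putting a dot between each letter': 'hi' -> 'h.i'."""
    return ".".join(text)


def swap_even_odd_chars(text: str) -> str:
    """'Swap even & odd chars': adjacent character pairs are exchanged."""
    chars = list(text)
    for i in range(0, len(chars) - 1, 2):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


plain = "the answer is 42"
print(codecs.encode(plain, "rot13"))              # rot13
print(base64.b64encode(plain.encode()).decode())  # base64
print(dot_between_letters(plain))                 # dot between each letter
print(swap_even_odd_chars(plain))                 # swap even & odd chars
```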
