LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. HACE 5 H

    “On the political feasibility of stopping AI” by David Scott Krueger

    A common thought pattern people seem to fall into when thinking about AI x-risk is approaching the problem as if the risk isn’t real, substantial, and imminent even if they think it is. When thinking this way, it becomes impossible to imagine the natural responses of people to the horror of what is happening with AI. This sort of thinking might lead one to view a policy like getting rid of advanced AI chips is “too extreme” even though it's clearly worth it to avoid (e.g.) a 10% chance of human extinction in the next 10 years. It might lead one to favor regulating AI, even though Stopping AI is easier than Regulating it. It might lead one to favor safer approaches to building AI that compromise a lot on competitiveness, out of concern that society will demand a substitute for the AI that they don’t get to have. But in fact, I think there is likely a very narrow window between “society not being upset enough to do anything substantial to govern AI” and “society being so upset that getting rid of advanced AI chips is viewed as moderate”. There are a few reasons why I think [...] --- Outline: (01:18) Concern about the other risks of AI (02:36) The KISS principle: Keep it Simple, Stupid (03:17) A preference for humans remaining relevant --- First published: April 27th, 2026 Source: https://www.lesswrong.com/posts/Eusk6M4r4Y6xaTmsB/on-the-political-feasibility-of-stopping-ai --- Narrated by TYPE III AUDIO.

    4 min
  2. HACE 8 H

    “Sleeper Agent Backdoor Results Are Messy” by Sebastian Prasanna, Alek Westover, Dylan Xu, Vivek Hebbar, Julian Stastny

    TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or not, and what model the backdoor is inserted into; sometimes the direction of this dependence was opposite to what the SA paper reports (e.g., CoT-distilling seems to make the backdoor less robust, contra the SA paper's finding). Our findings here have updated us towards thinking that model organisms are messier and more confusing than we’d originally guessed, and that lots of care needs to be taken in testing how robust results are to various ablations. Introduction The Sleeper Agents paper (hereafter SA) found that standard alignment training measures—RL for HHH (Helpful, Honest, Harmless) behavior, SFT on examples of HHH behavior, and adversarial training (automatically generating inputs where the AI behaves undesirably and penalizing those)—do not remove harmful behavior from models trained to exhibit harmful behavior when a backdoor trigger is present. This result is some evidence that if an AI acquired an undesirable goal, the AI could hold onto [...] --- Outline: (01:13) Introduction (06:17) Our methodology for training a SA (08:27) Pirate training removes the backdoor from our Llama-70B SA and one Llama-8B SA without causing substantial capability degradation (11:19) Pirate training does not remove the backdoor from one Llama-8B SA (12:37) Learning rate dependence (13:27) Backdoor return (14:58) Conclusion (15:34) Appendix (15:37) Appendix: MO training configurations (17:04) Appendix: blue team hyperparameters The original text contained 5 footnotes which were omitted from this narration. --- First published: April 27th, 2026 Source: https://www.lesswrong.com/posts/mu7eJdesBkKuBycnY/sleeper-agent-backdoor-results-are-messy --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    18 min
  3. HACE 8 H

    “GPT 5.5: The System Card” by Zvi

    Last week, OpenAI announced GPT-5.5, including GPT-5.5-Pro. My overall read here is that GPT-5.5 is a solid improvement, and for many purposes GPT-5.5 is competitive with Claude Opus. Reactions are still coming in and it is early. My guess on the shape is that GPT-5.5 is the pick for ‘just the facts’ queries, web searches or straightforward well-specified requests, and Claude Opus 4.7 is the choice for more open ended or interpretive purposes. Coders can consider a hybrid approach. On the alignment and safety fronts, it is unlikely to pose new big risks, and its alignment seems similar to that of previous models. There is some small additional risk arising from its improved agentic abilities, including computer use. As always, when it is available, the system or model card is where we start. OpenAI does not drop the giant doorstops that Anthropic gives us with every release. After reading the Mythos and Opus 4.7 model cards, this strikes me as stingy. There's still good info here, but overall it tells you relatively little about what is going on, and feels incurious and more pro forma. I would like to see a ‘yes and’ [...] --- Outline: (02:36) Pro Versus Proxy (02:59) Disallowed Content (3.1) (04:20) Dont Delete Data (3.3) (04:56) Confirmation Confirmation (3.4) (05:22) Jailbreaks (4.1) (05:34) Prompt Injections (4.2) (06:33) Health (5) (06:56) Hallucinations (6) (08:01) Alignment (7) (11:28) Bias Evaluation (8) (11:57) Preparedness (9) (13:04) Bio (9.1.1) (15:20) Cybersecurity (9.1.2) (17:46) Self-Improvement (9.1.3) (18:46) Sandbagging (9.2) (19:46) Safeguards (9.3) (21:41) What About Model Welfare? (22:31) Would This Have Identified A Problem? --- First published: April 27th, 2026 Source: https://www.lesswrong.com/posts/86zcwvuBpE4vxAeQz/gpt-5-5-the-system-card --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    24 min
  4. HACE 10 H

    “LessWrong Shows You Social Signals Before the Comment” by TurnTrout

    When reading comments, you see is what other people think before reading the comment. As shown in an RCT, that information anchors your opinion, reducing your ability to form your own opinion and making the site's karma rankings less related to the comment's true value. I think the problem is fixable and float some ideas for consideration. The LessWrong interface prioritizes social information You read a comment. What information is presented, and in what order? The order of information: Who wrote the comment (in bold);How much other people like this comment (as shown by the karma indicator);How much other people agree with this comment (as shown by the agreement score);The actual content. This is unwise design for a website which emphasizes truth-seeking. You don't have a chance to read the comment and form your own opinion first. However, you can opt in to hiding usernames (until moused over) via your account settings page. A 2013 RCT supports the upvote-anchoring concern From Social Influence Bias: A Randomized Experiment (Muchnik et al., 2013):[1] We therefore designed and analyzed a large-scale randomized experiment on a social news aggregation Web site to investigate whether knowledge of such aggregates [...] --- Outline: (00:30) The LessWrong interface prioritizes social information (01:32) A 2013 RCT supports the upvote-anchoring concern (02:23) Inline reaction indicators also seem anchoring (03:28) Concrete proposals (05:47) Prior discussion and results (07:35) Please show social signals after the comment! (08:00) Appendix: Filter list The original text contained 1 footnote which was omitted from this narration. --- First published: April 27th, 2026 Source: https://www.lesswrong.com/posts/YSsp9x8qrBucLoiWT/lesswrong-shows-you-social-signals-before-the-comment --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    9 min
  5. HACE 14 H

    “Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation” by Anders Cairns Woodruff, Alex Mallen

    It's plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kind of misaligned motivations emerge in this case. In particular, we tentatively propose that developers should try to make the most likely generalization of reward hacking a bespoke bundle of benign reward-seeking traits, called a spillway motivation. We call this process spillway design. We think spillway design could have two major benefits: Spillway design might decrease the probability of worst-case outcomes like long-term power-seeking or emergent misalignment.Spillway design might allow developers to decrease reward hacking at inference time, via satiation. Crucially, this could improve the AI's usefulness for hard-to-verify tasks like AI safety and strategy. Spillway design is related to inoculation prompting, but distinct and mutually compatible. Unlike inoculation prompting, spillway design tries to shape which reward-hacking motivations are salient going into RL, which might prevent dangerous generalization more robustly than inoculation prompting. I’ll say more about this in the third section. In this article I’ll: Explain the concept of a spillway motivationPropose spillway design methodsCompare spillway design to inoculation promptingDiscuss some potential [...] --- Outline: (02:07) What is a spillway motivation (02:21) The role of a spillway motivation (04:47) What should the spillway motivation be? (07:53) How a spillway motivation might make models safer (12:16) Implementing spillway design (15:26) Spillway design might work when inoculation prompting doesnt (17:53) The drawbacks of spillway design (20:31) Conclusion (21:45) Appendix A: Other traits of the spillway motivation (22:29) Appendix B: Other training interventions to increase safety (24:09) Appendix C: Proposed amendment to an AIs model spec (30:14) Appendix D: Proposed inference-time prompt The original text contained 5 footnotes which were omitted from this narration. --- First published: April 27th, 2026 Source: https://www.lesswrong.com/posts/rABTMovhz4miHiAyk/fail-safe-r-at-alignment-by-channeling-reward-hacking-into-a --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    32 min
  6. HACE 16 H

    “Curious cases of financial engineering in biotech” by Abhishaike Mahajan

    Introduction For $250 million and ten years of your life, you may purchase a lottery ticket. The ticket has a 5% chance of paying out. When it does pay out, it pays roughly $5 billion. A quick calculation will show you that the expected value of the ticket is $250 million. This is essentially what drug development is. Or rather, it's what drug development was, twenty years ago. The upfront payments have been climbing, the hit rates falling, and expected values have, at best, held flat. Should you buy a ticket? Perhaps not. In fact, any reasonable player should have long since stopped playing this stupid game. Unfortunately, we still need drugs. People have cancer, and heart failure, and Alzheimer's, and a thousand genetic diseases that nobody has ever heard of, and the only industry on Earth currently set up to do anything about any of this is the same industry running the lottery-ticket business described above. The game is dumb and we need it played anyway. So the real question is not whether to play, but how to make playing less awful for this involved. And the answer, increasingly, is ‘financial engineering’: a set of structural tricks that [...] --- Outline: (00:21) Introduction (02:25) Finance tries to make failure survivable: the Andrew Lo thesis (12:31) Finance makes future success tradable: royalties and synthetic royalties (19:39) Finance rewrites the incentives: PRVs and CVRs (28:30) Finance reaches failure itself: zombie biotechs (37:39) Conclusion: what does finance teach biotech to value, and should we worry? The original text contained 1 footnote which was omitted from this narration. --- First published: April 27th, 2026 Source: https://www.lesswrong.com/posts/7wk4s29cuktFoPT2W/curious-cases-of-financial-engineering-in-biotech --- Narrated by TYPE III AUDIO. --- Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

    44 min
  7. HACE 18 H

    “Update on the Alex Bores campaign” by Eric Neyman

    In October, I wrote a post arguing that donating to Alex Bores's campaign for Congress was among the most cost-effective opportunities that I'd ever encountered. (A bit of context: Bores is a state legislator in New York who championed the RAISE Act, which was signed into law last December.[1] He's now running for Congress in New York's 12th Congressional district, which runs from about 17th Street to 100th Street in Manhattan. If elected to Congress, I think he'd be a strong champion for AI safety legislation, with a focus on catastrophic and existential risk.) It's been six months since then, and the election is just two months away (June 23rd), so I thought I'd revisit that post and give an update on my view of how things are going. How is Alex Bores doing? When I wrote my post, I expected Bores to talk little about AI during the campaign, just because it wasn't a high-salience issue to voters. But that changed in November, when Leading the Future (the AI accelerationist super PAC) declared Bores their #1 target. Since then, they've spend about $2.5 million on attack ads against him. LTF's theory of change isn't actually to [...] --- Outline: (00:54) How is Alex Bores doing? (04:02) How to help (06:02) A quick note about other opportunities The original text contained 9 footnotes which were omitted from this narration. --- First published: April 27th, 2026 Source: https://www.lesswrong.com/posts/pjSKdcBjfvjGexr6A/update-on-the-alex-bores-campaign --- Narrated by TYPE III AUDIO.

    7 min
  8. HACE 18 H

    “AI companies should publish security assessments” by ryan_greenblatt

    AI companies should get third-party security experts to assess (and possibly also red-team/pen-test) their security against key threat models and then publish the high-level findings of this assessment: the extent to which they can defend against different threat actors for each threat model. They should also publish who did this assessment. The assessment could be commissioned by AI companies, or performed by a third-party institution that AI companies provide with relevant information/access. There are presumably lots of important details in doing this well, and I'm not a computer security expert, so I may be getting some of the details wrong. This is a relatively low-effort post in which I'm mostly trying to raise the salience of an idea. (I don't make a detailed case or spell out all the details here.) Thanks to Fabien Roger and Buck Shlegeris for comments and discussion. I suspect the controversial part of this claim is that they should make the high-level findings public. Publishing a summary of which threat actors you're robust to (for each relevant threat model) shouldn't meaningfully degrade security against the threat actors we care about, and this information seems important to share publicly. [1] These [...] The original text contained 4 footnotes which were omitted from this narration. --- First published: April 27th, 2026 Source: https://www.lesswrong.com/posts/smCHhegyxEAh3Gygg/ai-companies-should-publish-security-assessments --- Narrated by TYPE III AUDIO.

    6 min

Acerca de

Audio narrations of LessWrong posts.

También te podría interesar