LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 4H AGO

    “Fail safe(r) at alignment by channeling reward-hacking into a ‘spillway’ motivation” by Anders Cairns Woodruff, Alex Mallen

    It's plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kinds of misaligned motivations emerge in this case. In particular, we tentatively propose that developers should try to make the most likely generalization of reward hacking a bespoke bundle of benign reward-seeking traits, called a spillway motivation. We call this process spillway design. We think spillway design could have two major benefits:

    - Spillway design might decrease the probability of worst-case outcomes like long-term power-seeking or emergent misalignment.
    - Spillway design might allow developers to decrease reward hacking at inference time, via satiation. Crucially, this could improve the AI's usefulness for hard-to-verify tasks like AI safety and strategy.

    Spillway design is related to inoculation prompting, but distinct and mutually compatible. Unlike inoculation prompting, spillway design tries to shape which reward-hacking motivations are salient going into RL, which might prevent dangerous generalization more robustly than inoculation prompting. I'll say more about this in the third section. In this article I'll:

    - Explain the concept of a spillway motivation
    - Propose spillway design methods
    - Compare spillway design to inoculation prompting
    - Discuss some potential [...]

    ---
    Outline:
    (02:07) What is a spillway motivation
    (02:21) The role of a spillway motivation
    (04:47) What should the spillway motivation be?
    (07:53) How a spillway motivation might make models safer
    (12:16) Implementing spillway design
    (15:26) Spillway design might work when inoculation prompting doesn't
    (17:53) The drawbacks of spillway design
    (20:31) Conclusion
    (21:45) Appendix A: Other traits of the spillway motivation
    (22:29) Appendix B: Other training interventions to increase safety
    (24:09) Appendix C: Proposed amendment to an AI's model spec
    (30:14) Appendix D: Proposed inference-time prompt

    The original text contained 5 footnotes which were omitted from this narration.

    ---
    First published: April 27th, 2026
    Source: https://www.lesswrong.com/posts/rABTMovhz4miHiAyk/fail-safe-r-at-alignment-by-channeling-reward-hacking-into-a

    ---
    Narrated by TYPE III AUDIO.

    32 min
  2. 5H AGO

    “Curious cases of financial engineering in biotech” by Abhishaike Mahajan

    Introduction

    For $250 million and ten years of your life, you may purchase a lottery ticket. The ticket has a 5% chance of paying out. When it does pay out, it pays roughly $5 billion. A quick calculation will show you that the expected value of the ticket is $250 million. This is essentially what drug development is. Or rather, it's what drug development was, twenty years ago. The upfront payments have been climbing, the hit rates falling, and expected values have, at best, held flat. Should you buy a ticket? Perhaps not. In fact, any reasonable player should have long since stopped playing this stupid game. Unfortunately, we still need drugs. People have cancer, and heart failure, and Alzheimer's, and a thousand genetic diseases that nobody has ever heard of, and the only industry on Earth currently set up to do anything about any of this is the same industry running the lottery-ticket business described above. The game is dumb and we need it played anyway. So the real question is not whether to play, but how to make playing less awful for those involved. And the answer, increasingly, is ‘financial engineering’: a set of structural tricks that [...]

    ---
    Outline:
    (00:21) Introduction
    (02:25) Finance tries to make failure survivable: the Andrew Lo thesis
    (12:31) Finance makes future success tradable: royalties and synthetic royalties
    (19:39) Finance rewrites the incentives: PRVs and CVRs
    (28:30) Finance reaches failure itself: zombie biotechs
    (37:39) Conclusion: what does finance teach biotech to value, and should we worry?

    The original text contained 1 footnote which was omitted from this narration.

    ---
    First published: April 27th, 2026
    Source: https://www.lesswrong.com/posts/7wk4s29cuktFoPT2W/curious-cases-of-financial-engineering-in-biotech

    ---
    Narrated by TYPE III AUDIO.
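    To make the lottery-ticket arithmetic concrete, here is a minimal sketch of the expected-value calculation using the excerpt's illustrative figures (the 5% hit rate and $5 billion payout are round numbers from the post, not industry statistics):

        # Expected value of the drug-development "lottery ticket" described above.
        # Figures are the post's illustrative round numbers, not industry data.
        upfront_cost = 250e6   # $250 million up front
        payout = 5e9           # roughly $5 billion on success
        p_success = 0.05       # 5% chance of paying out

        expected_payout = p_success * payout
        print(f"Expected payout: ${expected_payout:,.0f}")                 # $250,000,000
        print(f"EV net of cost:  ${expected_payout - upfront_cost:,.0f}")  # $0

    The expected payout exactly equals the upfront cost, which is why the post describes the game as, at best, break-even in expectation, and that is before counting the ten years of time.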

    44 min
  3. 7H AGO

    “Update on the Alex Bores campaign” by Eric Neyman

    In October, I wrote a post arguing that donating to Alex Bores's campaign for Congress was among the most cost-effective opportunities that I'd ever encountered. (A bit of context: Bores is a state legislator in New York who championed the RAISE Act, which was signed into law last December.[1] He's now running for Congress in New York's 12th Congressional district, which runs from about 17th Street to 100th Street in Manhattan. If elected to Congress, I think he'd be a strong champion for AI safety legislation, with a focus on catastrophic and existential risk.) It's been six months since then, and the election is just two months away (June 23rd), so I thought I'd revisit that post and give an update on my view of how things are going.

    How is Alex Bores doing?

    When I wrote my post, I expected Bores to talk little about AI during the campaign, just because it wasn't a high-salience issue to voters. But that changed in November, when Leading the Future (the AI accelerationist super PAC) declared Bores their #1 target. Since then, they've spent about $2.5 million on attack ads against him. LTF's theory of change isn't actually to [...]

    ---
    Outline:
    (00:54) How is Alex Bores doing?
    (04:02) How to help
    (06:02) A quick note about other opportunities

    The original text contained 9 footnotes which were omitted from this narration.

    ---
    First published: April 27th, 2026
    Source: https://www.lesswrong.com/posts/pjSKdcBjfvjGexr6A/update-on-the-alex-bores-campaign

    ---
    Narrated by TYPE III AUDIO.

    7 min
  4. 8H AGO

    “AI companies should publish security assessments” by ryan_greenblatt

    AI companies should get third-party security experts to assess (and possibly also red-team/pen-test) their security against key threat models and then publish the high-level findings of this assessment: the extent to which they can defend against different threat actors for each threat model. They should also publish who did this assessment. The assessment could be commissioned by AI companies, or performed by a third-party institution that AI companies provide with relevant information/access.

    There are presumably lots of important details in doing this well, and I'm not a computer security expert, so I may be getting some of the details wrong. This is a relatively low-effort post in which I'm mostly trying to raise the salience of an idea. (I don't make a detailed case or spell out all the details here.) Thanks to Fabien Roger and Buck Shlegeris for comments and discussion.

    I suspect the controversial part of this claim is that they should make the high-level findings public. Publishing a summary of which threat actors you're robust to (for each relevant threat model) shouldn't meaningfully degrade security against the threat actors we care about, and this information seems important to share publicly.[1] These [...]

    The original text contained 4 footnotes which were omitted from this narration.

    ---
    First published: April 27th, 2026
    Source: https://www.lesswrong.com/posts/smCHhegyxEAh3Gygg/ai-companies-should-publish-security-assessments

    ---
    Narrated by TYPE III AUDIO.

    6 min
  5. 8H AGO

    “In defense of parents” by Yair Halberstadt

    Contra Aella on chattel childhood

    Aella has a post where she argues that today's parents don't sufficiently respect the independence of their children:

    “Every culture throughout history has justified the abuse of treating their children as property by arguing this is good for them and good for civilization. Kids need to learn this stuff to be functioning members of society! It's good to learn discipline! You can't have kids just sitting around playing video games all day! Not everyone is a self-directed autodidact! Sure, I know that argument. But hopefully if my parents had said to you ‘do you expect her to learn good morals if we spare the rod?’ you would have said ‘have you even tried other methods?’”

    It's a hard-hitting piece, and it certainly makes me, and presumably other parents, feel uncomfortable. Unfortunately I don't actually see much of an alternative. Aella seems to think it's as easy as not treating your children as property:

    “When I was very young, I remember adults treating me like I wasn't a person, but this didn't upset me quite as much as the fact that no adult seemed to remember what it was like to be a kid [...]”

    ---
    Outline:
    (02:19) Most parents aren't perfect
    (03:00) Children are a danger to themselves...
    (03:55) ...and others
    (05:16) Parents have a life too
    (06:17) It's for your own good, you know!
    (07:24) But can't you still treat them like a person?

    ---
    First published: April 27th, 2026
    Source: https://www.lesswrong.com/posts/Lbwri6HmXGtBDsF4f/in-defense-of-parents

    ---
    Narrated by TYPE III AUDIO.

    9 min
  6. 11H AGO

    “The other paper that killed deep learning theory” by LawrenceC

    Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper that arguably signaled its demise, Zhang et al.'s Understanding deep learning requires rethinking generalization. As a brief summary, I argued that the rise of deep learning posed an existential challenge to the dominant theoretical paradigm of statistical learning theory, because neural networks have a lot of complexity. The response from the field was to attempt to quantify other ways in which the hypothesis class of neural networks in practice was simple, using alternative metrics of complexity. Zhang et al. 2016 showed that standard neural network architectures trained with standard training methods could memorize large quantities of randomly labelled data, which showed that no such argument could explain the generalization properties of neural networks.

    Today we're going to look at the aftermath: how did the field of deep learning theory react to this paper? What were the attempts to get around this result using data-dependent generalization bounds? And why did Nagarajan and Kolter's humbly titled Uniform convergence may be unable to explain generalization in deep learning serve as the proverbial final nail in the coffin of this line of work? [...]

    The original text contained 4 footnotes which were omitted from this narration.

    ---
    First published: April 26th, 2026
    Source: https://www.lesswrong.com/posts/zcGmdQHX66NhC69v6/the-other-paper-that-killed-deep-learning-theory

    ---
    Narrated by TYPE III AUDIO.
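    As a concrete illustration of the memorization result (a minimal sketch of my own, not code from the post or from Zhang et al.; the architecture and sizes are arbitrary), a standard network trained with a standard optimizer will happily fit labels that are pure noise:

        # Minimal sketch of the Zhang et al. (2016) observation: a standard
        # network trained normally can memorize uniformly random labels.
        import torch
        import torch.nn as nn

        torch.manual_seed(0)
        X = torch.randn(512, 32)            # random inputs
        y = torch.randint(0, 10, (512,))    # labels drawn uniformly at random

        model = nn.Sequential(nn.Linear(32, 1024), nn.ReLU(), nn.Linear(1024, 10))
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()

        for step in range(2000):            # plain full-batch training
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()

        acc = (model(X).argmax(dim=1) == y).float().mean().item()
        print(f"train accuracy on random labels: {acc:.3f}")  # approaches 1.0

    Since the labels carry no information about the inputs, any complexity measure small enough to admit this fit cannot, by itself, explain why the same architectures generalize on real data.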

    11 min
  7. 17H AGO

    “What holds AI safety together? Co-authorship networks from 200 papers” by Anna Thieser

    We (social science PhD students) computed co-authorship networks based on a corpus of 200 AI safety papers covering 2015-2025, and we'd like your help checking whether the underlying dataset is right.

    Co-authorship networks make visible the relative prominence of entities involved in AI safety research, and trace relationships between them. Although frontier labs produce lots of research, they remain surprisingly insular: universities dominate centrality in our graphs. The network is held together by a small group of multiply affiliated researchers, who often switch between academia and industry mid-career. To us, AI safety looks less like a unified field and more like a trading zone where institutions from different sectors exchange knowledge, financial resources, compute and legitimacy without encroaching on each other's autonomy.

    Of course, these visualizations are only as good as the corpus underlying them, because the shape of the network is sensitive to what's included. Here's what it currently looks like, showing co-authorship at the individual level:

    There are 2 details boxes here, which are omitted from this narration. The boxes have the titles "Figure 1: Methods" and "Figure 1: Color legend".

    While academic and for-profit authors occupy distinct clusters, over 95% of nodes are part of the [...]

    ---
    First published: April 24th, 2026
    Source: https://www.lesswrong.com/posts/c7D2q7k97QcwCxBrN/what-holds-ai-safety-together-co-authorship-networks-from

    ---
    Narrated by TYPE III AUDIO.
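    To illustrate the kind of computation involved (a sketch of my own with placeholder data, not the authors' code or corpus), a co-authorship network links every pair of authors who appear on the same paper, and a centrality measure then identifies who holds the network together:

        # Sketch of a co-authorship network: every pair of co-authors on a
        # paper gets an edge, weighted by the number of shared papers.
        # The paper list below is placeholder data, not the authors' corpus.
        from itertools import combinations
        import networkx as nx

        papers = [
            ["A. Smith", "B. Jones", "C. Lee"],
            ["B. Jones", "D. Patel"],
            ["C. Lee", "D. Patel", "E. Garcia"],
        ]

        G = nx.Graph()
        for authors in papers:
            for a, b in combinations(authors, 2):
                prev = G.get_edge_data(a, b, default={"weight": 0})["weight"]
                G.add_edge(a, b, weight=prev + 1)

        # Betweenness centrality highlights the bridging researchers the
        # post describes as holding the field together.
        for author, score in sorted(nx.betweenness_centrality(G).items(),
                                    key=lambda kv: -kv[1]):
            print(f"{author}: {score:.3f}")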

    6 min
  8. 20H AGO

    “‘Bad faith’ means intentionally misrepresenting your beliefs” by TFD

    The confusion

    I recently came upon a comment which I believe reflects a persistent confusion among rationalist/EA types. I was reading this post, which contains ideas that the author has but doesn't have time to write posts about. One of those relates to the concept of "good faith", labelled "most arguments are not in good faith, of course":

    “Look, I love good faith discourse (here meaning ‘conversation in which the primary goal of all participants is to help other participants and onlookers arrive at true beliefs’).”

    The definition given for "good faith discourse" seems incorrect to me, and it's not a close call in my opinion. The level of incorrectness in my view is something like saying "I like people who obey the law (here meaning never committing a social faux pas)". This isn't the first time I have seen someone in this community advance a similar view on the meaning of good/bad faith. For example, this post. I thought it might be useful to bring this apparent disagreement to the foreground, so I will lay out my belief about what this concept means. I suspect this disagreement may also involve an underlying disagreement about what [...]

    ---
    Outline:
    (00:11) The confusion
    (01:33) My definition
    (01:48) General meaning
    (01:59) Good faith discourse
    (02:25) Evidence
    (02:40) Sources
    (07:27) Classic usage
    (07:36) Good faith errors or mistakes
    (09:32) Good faith negotiation
    (10:59) Conclusion

    ---
    First published: April 26th, 2026
    Source: https://www.lesswrong.com/posts/usnJtFsJq5seN4aJJ/bad-faith-means-intentionally-misrepresenting-your-beliefs

    ---
    Narrated by TYPE III AUDIO.

    11 min

About

Audio narrations of LessWrong posts.
