LessWrong (30+ Karma)


Audio narrations of LessWrong posts.

  1. 29 minutes ago

    “A relatively brief explanation of Boltzmann Brains” by Eliezer Yudkowsky

    (Initially written for the LW Wiki, but then I realized it was looking more like a post instead.) In 1895, the physicist Ignaz Robert Schütz, who worked as an assistant to the more eminent physicist Ludwig Boltzmann, wondered if our observed universe had simply assembled by a random fluctuation of order from a universe otherwise in thermal equilibrium. The idea was published by Boltzmann in 1896, properly credited to Schütz, and has been associated with Boltzmann ever since. The obvious objection to this scenario is credited to Arthur Eddington in 1931: If all order is due to random fluctuations, comparatively small moments of order will exponentially-vastly outnumber even slightly larger fluctuations toward order, to say nothing of fluctuations the size of our entire observed universe! If this is where order comes from, we should find ourselves inside much smaller ordered systems. Feynman similarly later observed: Even if we fill a box of gas with white and black atoms bouncing randomly, and after an exponentially vast amount of time the white and black atoms on one side randomly sort themselves into two neat sides separated by color, the other half of the box will still be in expectation randomized. If [...] --- First published: May 16th, 2026 Source: https://www.lesswrong.com/posts/v8MSczS3CuoqMmTFw/a-relatively-brief-explanation-of-boltzmann-brains --- Narrated by TYPE III AUDIO.
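    For concreteness, Eddington's counting argument rests on the standard fluctuation estimate (not quoted in the excerpt; this is the textbook form): the probability of a spontaneous fluctuation that lowers entropy by |ΔS| scales as

    ```latex
    P(\Delta S) \;\propto\; e^{\Delta S / k_B}, \qquad \Delta S < 0
    ```

    so two candidate fluctuations occur in the ratio $e^{(\Delta S_{\text{small}} - \Delta S_{\text{large}})/k_B}$, which astronomically favors the one with the smaller entropy deficit. A lone brain-sized patch of order is therefore expected vastly more often than a whole ordered universe.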

    5 min
  2. 8 hours ago

    “An Introduction to Exemplar Partitioning for Mechanistic Interpretability” by Jessica Rumbelow

    Most of what we currently call "feature discovery" in language models is wrapped up in dictionary-learning methods like sparse autoencoders (SAEs) – which work, and which have been scaled to millions of features on frontier-scale models, but which bundle two distinct commitments into a single training objective: a reconstruction loss and a sparsity loss over a fixed-size dictionary. Those commitments make sense if your goal is reconstructive decomposition – if you want to take an activation and rebuild it from a sparse code. They make less obvious sense if your aim is to find interpretable structure (directions? features?) in activation space, retrieve representative examples, identify causal interventions, or measure how representations change across layers and inputs. And it turns out a lot of that doesn't really need the full SAE machinery. An Exemplar Partitioning dictionary built from Gemma-2-2B L12 activations at p2 (K = 5,129). Left: eight sample regions, each shown with its member count, its exemplar's logit-lens [nostalgebraist, 2020] decode, and an excerpt of a member input with the activating tokens highlighted. Right: a PCA-projected 3D rendering of the Voronoi partition; each cell is one region, with a random selection also labelled with logit-lens decode. This [...] --- Outline: (02:08) Glossary (03:45) Exemplar Partitioning (06:09) Inference (06:51) Properties of the EP dictionary (07:29) Concept detection (AxBench) (08:16) How EP and SAEs relate (09:40) Find and steer refusal (11:37) A free OOD signal (12:45) Cross-checkpoint drift (base ↔ IT) (14:58) Domain saturation (15:59) Inside the partition (17:33) Future work (21:24) That's all for now --- First published: May 16th, 2026 Source: https://www.lesswrong.com/posts/RroeHBSkBXXDsrryq/an-introduction-to-exemplar-partitioning-for-mechanistic-1 --- Narrated by TYPE III AUDIO.
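    The excerpt doesn't spell out how the dictionary is built, but the Voronoi framing implies that inference is a nearest-exemplar lookup, and the outline's "free OOD signal" is plausibly the distance to that exemplar. A minimal sketch under those assumptions (the function name, shapes, and OOD reading are mine, not the post's):

    ```python
    import numpy as np

    def assign_regions(activations: np.ndarray, exemplars: np.ndarray):
        """Map each activation to its nearest exemplar, i.e. its Voronoi cell.

        activations: (n, d) residual-stream activations
        exemplars:   (K, d) exemplar activations (K = dictionary size, e.g. 5,129)
        Returns (region ids, distances); an unusually large distance to the
        nearest exemplar can be read as a crude out-of-distribution signal.
        """
        # Pairwise squared distances via ||a - e||^2 = ||a||^2 - 2 a·e + ||e||^2
        d2 = ((activations ** 2).sum(1, keepdims=True)
              - 2.0 * activations @ exemplars.T
              + (exemplars ** 2).sum(1))
        ids = d2.argmin(axis=1)
        return ids, np.sqrt(np.maximum(d2[np.arange(len(ids)), ids], 0.0))
    ```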

    22 min
  3. 12 hours ago

    “A Year Late, Claude Finally Beats Pokémon” by Julian Bradshaw

    Credit: ClaudePlaysPokemon. Elevator Shanty by Kurukkoo. Disclaimer: like some previous posts in this series, this was not primarily written by me, but by a friend. I did substantial editing, however. ClaudePlaysPokemon feat. Opus 4.7 has finally beaten Pokémon Red, fulfilling the challenge set over a year ago when LLMs playing Pokémon went briefly, slightly viral. Victory Screen! Let's get the throat-clearing out of the way: this doesn't make 4.7 a clear breakthrough in intelligence over 4.6 or 4.5. It's smarter, yes, as we'll discuss below, but not by something one could honestly call a big leap. Rather, step changes have finally accumulated to the point of victory. And to give other models their fair shake: after criticism over its elaborate harness,[1] GeminiPlaysPokemon has beaten Pokémon with progressively weaker harnesses, including about two months ago with a harness comparable to the one Claude uses.[2] As such, this is a bit of a valedictory post, closing off the cycle of Claude playing Pokémon Red, relating anecdotes for the fun of it, and discussing improvements in Opus 4.7, as well as speculating a bit on what this has all meant. Retrospective Anecdotes on Claude 4.5 and 4.6 Our last post, on Opus [...] --- Outline: (01:37) Retrospective Anecdotes on Claude 4.5 and 4.6 (06:34) Harness Changes for 4.7 (07:52) Improvements (4.6 + 4.7) (08:16) Vision (09:31) Less Tunnel Vision (10:12) Another Level of Spatial Awareness (10:25) Breaks Out of Loops EVEN FASTER (10:55) Victory Road (11:50) The Little Things (12:52) Concluding Thoughts (15:29) Notes on Pokémon as a benchmark The original text contained 10 footnotes which were omitted from this narration. --- First published: May 16th, 2026 Source: https://www.lesswrong.com/posts/sehJYg5Yny9fvpbpt/a-year-late-claude-finally-beats-pokemon --- Narrated by TYPE III AUDIO.

    19 min
  4. 14 hours ago

    “Monthly Roundup #42: May 2026” by Zvi

    At least we probably won’t have another pandemic. And we still have a partial Jones Act waiver. For now. Small victories. Table of Contents Hanta Hanta I Don’t Wanta. Bad News. Predictions Can Be Easy Even About The Future. Good Advice. The Efficient Market Hypothesis Is False. There Are Four Skills. While I Cannot Condone This. Good News, Everyone. For Your Entertainment. Gamers Gonna Game Game Game Game Game. The Spire Sleeps And So Shall I. I Was Promised Flying Self-Driving Cars. Government Working. Jones Act Watch. Technology Advances. I Said Woo Hoo. Variously Effective Altruism. The Lighter Side. Hanta Hanta I Don’t Wanta We have learned so much less than nothing from Covid. We’re actively stupider. It's 2026, and here we are again, lying about the virus because we are worried that people exposed to it or from the wrong place might face stigma otherwise. Envidreamz: A local news story highlights passengers on the hantavirus-stricken MV Hondius cruise ship who are more worried about facing stigma and rejection back home than [...] --- Outline: (00:22) Hanta Hanta I Don’t Wanta (04:25) Bad News (08:45) Predictions Can Be Easy Even About The Future (08:55) Good Advice (11:52) The Efficient Market Hypothesis Is False (13:58) There Are Four Skills (16:27) While I Cannot Condone This (20:57) Good News, Everyone (21:42) For Your Entertainment (22:14) Gamers Gonna Game Game Game Game Game (24:09) The Spire Sleeps And So Shall I (29:13) I Was Promised Flying Self-Driving Cars (31:53) Government Working (35:40) Jones Act Watch (41:10) Technology Advances (42:12) I Said Woo Hoo (43:36) Variously Effective Altruism (44:07) The Lighter Side --- First published: May 15th, 2026 Source: https://www.lesswrong.com/posts/gZHNLmHkQ7GjnWsYh/monthly-roundup-42-may-2026 --- Narrated by TYPE III AUDIO.

    46 min
  5. 1 day ago

    “Incriminating misaligned AI models via distillation” by Alek Westover, SebastianP, Alex Mallen, Jozdien, Alexa Pan, Julian Stastny

    Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen: (1) Misalignment fails to transfer to the student. If so, we get a fairly capable benign model. (2) Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., due to being less capable). If so, we might get indirect evidence about the teacher's misalignment by auditing the distilled model. In this post, we will discuss the second possibility, which we call incrimination via distillation. Specifically, we propose distillation methods that we hope transfer misalignment without transferring the ability to fool audits, and discuss why these techniques might work or fail. In a future post, we discuss the first possibility, and what distillation methods should be used when aiming to create a capable benign model. We’re excited for research that empirically tests and refines this approach; if successful, this technique could become a valuable part of alignment audits. How incrimination via distillation works Powerful misaligned AI models might not be auditable: they might pass alignment audits but still act on their misaligned drives when given the chance. [...] --- Outline: (01:31) How incrimination via distillation works (02:45) How we propose implementing incrimination via distillation (03:48) Auditability-preserving distillation (04:58) Misalignment-targeted distillation (06:20) Why incrimination via distillation might work (08:02) Why incrimination via distillation might not work (10:38) Conclusion The original text contained 3 footnotes which were omitted from this narration. --- First published: May 15th, 2026 Source: https://www.lesswrong.com/posts/BYH6ebmfZb3Eggzer/incriminating-misaligned-ai-models-via-distillation --- Narrated by TYPE III AUDIO.
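    The post's specific objectives (auditability-preserving and misalignment-targeted distillation) aren't given in this excerpt. For orientation, here is a minimal sketch of the generic teacher-student objective such methods presumably build on; the proposed variants would mainly change which prompts and targets it is applied to:

    ```python
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
        """Standard KL distillation loss (Hinton et al., 2015): push the student's
        next-token distribution toward the teacher's, softened by a temperature."""
        t = temperature
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        student_logp = F.log_softmax(student_logits / t, dim=-1)
        # KL(teacher || student), scaled by t^2 to keep gradient magnitudes
        # comparable across temperature settings
        return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * t * t
    ```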

    12 min
  6. 1 day ago

    “The hard core of alignment (is robustifying RL)” by Cole Wyeth

    Most technical AI safety work that I read seems to miss the mark, failing to make any progress on the hard part of the problem. I think this is a common sentiment, but there's less agreement about what exactly the hard part is. Characterizing this more clearly might save a lot of time and better target the search for solutions. In this post I explain my model of why alignment is technically hard to achieve, setting aside the regulatory, competitive, and geopolitical challenges, the sheer incompetence and unforced errors of the players, and the other factors which decrease our chances of success. I claim there is something like a "hard core": a common stumbling block for all approaches. In other words, there is something that makes alignment hard, rather than a bunch of unrelated things that make each approach to alignment hard independently. Since many different "hopes" for alignment seem quite far apart, this would seem to be a remarkable and unexpected state of affairs. On the other hand, such an expansive graveyard of failed proposals suggests a common culprit. Semantics. When used informally in conversation, "hard core" is a bit like "complete problem." But there may be [...] --- Outline: (01:32) Background (03:02) The Problem (09:19) The Barrier (16:49) Providing better feedback (22:07) Behaviorist vs. process feedback --- First published: May 15th, 2026 Source: https://www.lesswrong.com/posts/JT3qCYDimskcBdiEr/the-hard-core-of-alignment-is-robustifying-rl --- Narrated by TYPE III AUDIO.

    26 min
  7. 1 day ago

    “Announcing the Center for Shared AI Prosperity” by Dylan Matthews

    I wanted to share the launch of a project I've been working on with pollster David Shor, Obama/Biden veteran Stef Feldman, political strategist Morris Katz, Harvard historian Marc Aidinoff, and a few other folks*. The Center for Shared AI Prosperity is an attempt to force DC policy elites, particularly (given our team's backgrounds) liberals/progressives, to take the impending economic impacts of advanced AI more seriously. We do not think this is a normal economic shock. We are deeply uncertain about what kind of economic shock it will be, but even if humans manage to survive the advent of superintelligence, we'll be left with a world of extreme power and wealth concentration, increasing political instability arising from that growing inequality, and deep questions about how to fund governments that have for a century-plus relied on income and payroll taxes. Our main purpose as an organization is to surface tractable ideas across four main areas: Taxation and Revenue Policy: new or reformed revenue raisers that allow the U.S. government (federal, state, or local) to capture a fair share of corporate wealth generated by AI without causing undue distortions. Income Support and Social Safety Nets: new or reformed programs to redistribute [...] --- First published: May 15th, 2026 Source: https://www.lesswrong.com/posts/CtRNmAQXmSotpAi9j/announcing-the-center-for-shared-ai-prosperity --- Narrated by TYPE III AUDIO.

    4 min
  8. 1 day ago

    “Risk reports need to address deployment-time spread of misalignment” by Alex Mallen

    Risk reports commonly use pre-deployment alignment assessments to measure misalignment risk from an internally deployed AI. However, an AI that genuinely starts out with largely benign motivations can develop widespread dangerous motivations during deployment. I think this is the most plausible route to consistent adversarial misalignment in the near future. So, AI companies and evaluators should substantively incorporate it into risk analysis and planning. In this post, I’ll briefly argue why, absent improved mitigations, this will probably soon become a reason why AI companies will be unable to convincingly argue against consistent adversarial misalignment (this risk will perhaps be even larger than the risk of consistent adversarial misalignment arising from training). Then I’ll discuss how well current risk reports address it (the Claude Mythos risk report does a reasonable job; others don’t). Thanks to Ryan Greenblatt, Alexa Pan, Charlie Griffin, Anders Cairns Woodruff, and Buck Shlegeris for feedback on drafts. Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment In some contexts, AIs might adopt misaligned goals, even if they were otherwise previously aligned. Because this misalignment can be rare, the AI might not appear to have concerning propensities in pre-deployment testing. The misalignment might only arise [...] --- Outline: (01:13) Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment (06:15) Company risk reports The original text contained 6 footnotes which were omitted from this narration. --- First published: May 15th, 2026 Source: https://www.lesswrong.com/posts/cNymohcWtGHzW7AjK/risk-reports-need-to-address-deployment-time-spread-of --- Narrated by TYPE III AUDIO.

    11 min
