LessWrong (30+ Karma)

Audio narrations of LessWrong posts.

  1. 5 hours ago

    “Mechanistic estimation for wide random MLPs” by Jacob_Hilton

    This post covers joint work with Wilson Wu, George Robinson, Mike Winer, Victor Lecomte and Paul Christiano. Thanks to Geoffrey Irving and Jess Riedel for comments on the post.

    In ARC's latest paper, we study the following problem: given a randomly initialized multilayer perceptron (MLP), produce an estimate for the expected output of the model under Gaussian input. The usual approach to this problem is to sample many possible inputs, run them all through the model, and take the average. Instead, we produce an estimate "mechanistically", without running the model even once. For wide models, our approach produces more accurate estimates, both in theory and in practice.

    Paper: Estimating the expected output of wide random MLPs more efficiently than sampling
    Code: mlp_cumulant_propagation GitHub repo

    We are excited about this result as an early step towards our goal of producing mechanistic estimates that outperform random sampling for any trained neural network. Drawing an analogy between this goal and a proof by induction, we see this result as (part of) the "base case": handling networks at initialization. We have a vision for the "inductive step", although we expect that to be much more difficult.

    Summary of results [...]

    Outline:
    (01:29) Summary of results
    (04:39) Significance of results
    (07:18) Extending to trained networks
    (08:36) Conclusion

    The original text contained 18 footnotes which were omitted from this narration.

    First published: May 7th, 2026
    Source: https://www.lesswrong.com/posts/fsG4m6sRMpomd7Rk6/mechanistic-estimation-for-wide-random-mlps

    Narrated by TYPE III AUDIO.
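    As a rough illustration of the setup (not ARC's estimator), the sketch below contrasts the sampling baseline with a crude "mechanistic" estimate that pushes per-coordinate means and variances through a random ReLU MLP using closed-form Gaussian moments. The widths, initialization scale, and the independence and Gaussianity assumptions are all illustrative simplifications; the paper's actual method propagates higher cumulants.

        # Toy sketch: sampling vs. moment propagation for a random ReLU MLP.
        # NOT ARC's method; assumes Gaussian, independent pre-activations.
        import numpy as np
        from scipy.stats import norm

        rng = np.random.default_rng(0)
        d_in, widths = 256, [256, 256, 1]        # hypothetical sizes

        # Randomly initialized weights with 1/sqrt(fan_in) scaling; no biases.
        Ws, d_prev = [], d_in
        for d in widths:
            Ws.append(rng.normal(0.0, 1.0 / np.sqrt(d_prev), size=(d, d_prev)))
            d_prev = d

        def mlp(x):                              # x: (n, d_in)
            for W in Ws[:-1]:
                x = np.maximum(x @ W.T, 0.0)     # ReLU hidden layers
            return x @ Ws[-1].T                  # linear readout

        # Baseline: Monte Carlo estimate of E[f(x)] over inputs x ~ N(0, I).
        def sampling_estimate(n=20_000):
            return mlp(rng.normal(size=(n, d_in))).mean(axis=0)

        # "Mechanistic" sketch: propagate (mean, variance) of each coordinate,
        # using the closed-form moments of relu(z) for z ~ N(mu, s^2):
        #   E[relu(z)]   = mu*Phi(mu/s) + s*phi(mu/s)
        #   E[relu(z)^2] = (mu^2 + s^2)*Phi(mu/s) + mu*s*phi(mu/s)
        def moment_estimate():
            mean, var = np.zeros(d_in), np.ones(d_in)   # input is N(0, I)
            for W in Ws[:-1]:
                mu = W @ mean                    # pre-activation means
                s2 = (W**2) @ var                # variances, ignoring cross-correlations
                s = np.sqrt(s2)
                m1 = mu * norm.cdf(mu / s) + s * norm.pdf(mu / s)
                m2 = (mu**2 + s2) * norm.cdf(mu / s) + mu * s * norm.pdf(mu / s)
                mean, var = m1, m2 - m1**2
            return Ws[-1] @ mean                 # expectation through final linear layer

        print("sampling:", sampling_estimate(), " moments:", moment_estimate())

    Note that the moment-propagation estimate never evaluates the network on a sampled input: its cost depends only on the layer sizes, which is the sense in which such estimates can beat sampling for wide networks.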

    9 min.
  2. 7 hours ago

    “Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations” by Subhash Kantamneni, kitft, Euan Ong, Sam Marks

    Abstract

    We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an activation. We jointly train the AV and AR with reinforcement learning to reconstruct residual stream activations. Although we optimize for activation reconstruction, the resulting NLA explanations read as plausible interpretations of model internals that, according to our quantitative evaluations, grow more informative over training.

    We apply NLAs to model auditing. During our pre-deployment audit of Claude Opus 4.6, NLAs helped diagnose safety-relevant behaviors and surfaced unverbalized evaluation awareness: cases where Claude believed, but did not say, that it was being evaluated. We present these audit findings as case studies and corroborate them using independent methods. On an automated auditing benchmark requiring end-to-end investigation of an intentionally-misaligned model, NLA-equipped agents outperform baselines and can succeed even without access to the misaligned model's training data. NLAs offer a convenient interface for interpretability, with expressive natural language explanations that we can directly read. To support further work, we release training code and trained NLAs [...]

    Outline:
    (00:15) Abstract
    (01:53) Twitter thread
    (05:14) Blog post
    (07:40) What is a natural language autoencoder?
    (10:06) Understanding what Claude thinks but doesn't say
    (13:12) Discovering hidden motivations
    (15:51) The future of NLAs

    First published: May 7th, 2026
    Source: https://www.lesswrong.com/posts/oeYesesaxjzMAktCM/natural-language-autoencoders-produce-unsupervised

    Narrated by TYPE III AUDIO.
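    As a toy schematic of the training loop the abstract describes (and only that; the paper's models, RL algorithm, and reward details are not given here), the sketch below replaces natural-language descriptions with single discrete codes: the AV becomes a softmax policy over codes trained with REINFORCE (one standard RL choice, assumed for illustration), and the AR becomes an embedding table trained by gradient descent, both against a shared reconstruction objective.

        # Toy stand-in for the NLA loop: activation -> "description" (one
        # discrete code here, not text) -> reconstructed activation.
        import numpy as np

        rng = np.random.default_rng(0)
        d, vocab, lr = 8, 16, 0.05                    # illustrative sizes
        W_av = rng.normal(0.0, 0.1, size=(vocab, d))  # AV: code logits given activation
        E_ar = rng.normal(0.0, 0.1, size=(vocab, d))  # AR: code -> activation
        baseline = 0.0                                # running reward baseline

        def softmax(z):
            e = np.exp(z - z.max())
            return e / e.sum()

        for step in range(20_000):
            a = rng.normal(size=d)               # stand-in residual-stream activation
            p = softmax(W_av @ a)                # AV "verbalizes" the activation
            c = rng.choice(vocab, p=p)           # sample a description
            a_hat = E_ar[c]                      # AR maps it back to an activation
            reward = -np.mean((a - a_hat)**2)    # optimize for reconstruction

            # REINFORCE on the verbalizer: grad log p(c|a) = (onehot(c) - p) a^T.
            g = -p
            g[c] += 1.0
            W_av += lr * (reward - baseline) * np.outer(g, a)
            baseline = 0.99 * baseline + 0.01 * reward

            # Gradient step pulling the chosen code's embedding toward a.
            E_ar[c] += lr * (a - a_hat)

            if step % 5_000 == 0:
                print(f"step {step}: reconstruction error {-reward:.3f}")

    The point of the joint objective is that the "vocabulary" and the policy co-adapt so descriptions carry information about the activation; at LLM scale, with text in place of codes, that same pressure is what is claimed to make the verbalizer's explanations informative.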

    18 min.
  3. 8 hours ago

    “A review of “Investigating the consequences of accidentally grading CoT during RL”” by Buck

    Last week, OpenAI staff shared an early draft of "Investigating the consequences of accidentally grading CoT during RL" with Redwood Research staff.

    To start with, I appreciate them publishing this post. I think it is valuable for AI companies to be transparent about problems like these when they arise. I particularly appreciate them sharing the post with us early, discussing the issues in detail, and modifying it to address our most important criticisms.

    I think it will be increasingly important for AI companies to have a policy of getting external feedback on the risks posed by their deployments, and in particular to have some external accountability on whether they have adequate evidence to support their claims about the level of risk posed; as an example of this, see METR reviewing Anthropic's Sabotage Risk Report. We at Redwood Research are interested in participating in this kind of external review of evidence about safety, so I am taking this as an opportunity to try out writing this kind of review. If you work at a frontier AI company, please feel free to reach out if you'd like our review of similar documents.

    My overall assessment is that I mostly agree with the [...]

    Outline:
    (01:34) Assessing the evidence that CoT training did not damage monitorability
    (10:36) How much does this analysis rely on information that wasn't provided?
    (12:08) Small amounts of RL training on CoT might not be more important than other sources of CoT unreliability
    (13:20) AI companies will eventually need to learn not to make mistakes like this

    The original text contained 5 footnotes which were omitted from this narration.

    First published: May 7th, 2026
    Source: https://www.lesswrong.com/posts/juCHTdZpZBGooHKW4/a-review-of-investigating-the-consequences-of-accidentally

    Narrated by TYPE III AUDIO.

    15 min.
  4. 15 hours ago

    “There is no evidence you should reapply sunscreen every 2 hours.” by Hide

    It's incredible how many consensus guidelines dissolve when you look closely at them.

    If you listen to any authority on the subject of sunscreen, you will hear it endlessly repeated that you absolutely must reapply sunscreen every 2 hours while you are in the sun, and immediately after swimming, sweating, or exercising. Not only that, you'll hear that you need to apply sunscreen again before going outside, even if you put it on earlier and stayed indoors.

    The rationale behind this is straightforward and plausible: sunscreen's effectiveness degrades over time, therefore prolonged sun exposure warrants topping up on protection. However, when you look closely at the origins of this guideline, and the evidence base for its instantiation in regulations and official statements, it turns out that this 2-hour rule is a baseless, circularly justified, expedient fiction.

    Where does the FDA's 2-hour reapplication guideline come from?

    Tracing the history of the 2-hour reapplication guideline reveals an extremely shaky base of evidence. The first official sunscreen rulemaking in the US was in 1978, where the FDA recommended: "apply sunscreen products liberally and to reapply after swimming or excess perspiration". No fixed universal time interval is [...]

    First published: May 6th, 2026
    Source: https://www.lesswrong.com/posts/daTGKn3pXzs75nSB7/there-is-no-evidence-you-should-reapply-sunscreen-every-2

    Narrated by TYPE III AUDIO.

    16 min.
