LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 3 HRS AGO

    “Does Hebrew Have Verbs?” by Benquo

    Spinoza's Compendium of Hebrew Grammar (1677, posthumous, unfinished) contains a claim that scholars have been misreading for centuries. He says that all Hebrew words, except a few particles, are nouns. The standard scholarly reaction is that this is either a metaphysical imposition (projecting his monistic ontology onto grammar) or a terminological trick (defining "noun" so broadly it's vacuous). Both reactions are wrong, and they're wrong for the same reason: they import Greek and Latin grammatical categories and then treat those categories as the neutral baseline.

    What Spinoza actually said

    From Chapter 5 of the Compendium (Bloom translation, 1962): "By a noun I understand a word by which we signify or indicate something that is understood. However, among things that are understood there can be either things and attributes of things, modes and relationships, or actions, and modes and relationships of actions." And: "For all Hebrew words, except for a few interjections and conjunctions and one or two particles, have the force and properties of nouns. Because the grammarians did not understand this they considered many words to be irregular which according to the usage of language are most regular." The word "noun" here [...]

    Outline:
    (00:45) What Spinoza actually said
    (01:57) The objection that doesn't work
    (02:30) How Hebrew actually works
    (04:15) Where the noun/verb distinction comes from
    (06:15) The structural difference between Greek and Hebrew
    (08:00) Spinoza wasn't projecting his metaphysics

    The original text contained 6 footnotes which were omitted from this narration.

    First published: March 19th, 2026
    Source: https://www.lesswrong.com/posts/e8XnJFw9noBXoXFWr/does-hebrew-have-verbs
    Narrated by TYPE III AUDIO.

    10 min
  2. 5 HRS AGO

    “Untrusted Monitoring is Default; Trusted Monitoring is not” by J Bostock

    These views are my own and not necessarily representative of those of any colleagues with whom I have worked on AI control.

    TL;DR: It's much cheaper and quicker to just throw some honeypots at your monitor models than to robustly prove trustedness for every model you want to use. Therefore I think the most likely future involves untrusted monitoring with some monitor validation as a default path.

    This post talks about two different ways of monitoring AI systems: trusted monitoring and untrusted monitoring. You can find papers distinguishing between them here, here, and here, a video here, or you can look at this graphic which didn't make it into Gardner-Challis et al. (2026):

    Context

    From when I started working on untrusted monitoring until very recently[1], I assumed that trusted monitoring was the default policy that AI companies would use, and that untrusted monitoring was an exotic, high-effort policy which would be much less likely to be used. Now I think it's the other way round.

    Full Trustedness is Hard

    The distinction between an untrusted model and a trusted one exists in the map. A trusted model is normally described as one for which we have strong evidence that [...]

    Outline:
    (01:04) Context
    (01:25) Full Trustedness is Hard
    (04:05) Untrusted Monitoring is Just Limited-Scope Trusted Monitoring
    (05:22) Cost and Speed
    (06:14) Monitoring is Here!

    The original text contained 1 footnote which was omitted from this narration.

    First published: March 20th, 2026
    Source: https://www.lesswrong.com/posts/b5oHr5TrfQzCBZdXW/untrusted-monitoring-is-default-trusted-monitoring-is-not
    Narrated by TYPE III AUDIO.
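
    To make the contrast concrete, here is a minimal Python sketch of the two protocols as the excerpt describes them: a trusted monitor used directly, versus a stronger untrusted monitor that is only relied on after passing honeypot validation. This is not the post's implementation; the policy, the monitors' is_suspicious method, the honeypots, and the 95% catch-rate threshold are all hypothetical stand-ins.

```python
# A minimal sketch (not from the post) of trusted vs. untrusted monitoring.
# A "monitor" here is anything with an is_suspicious(output) -> bool method;
# each honeypot is an output that a working monitor should flag.

def trusted_monitoring(task, policy, trusted_monitor):
    """Trusted monitoring: a model we already have strong evidence to trust
    reviews the untrusted policy's output directly."""
    output = policy(task)
    return output, trusted_monitor.is_suspicious(output)

def untrusted_monitoring(task, policy, untrusted_monitor, honeypots, min_catch_rate=0.95):
    """Untrusted monitoring: a stronger, unproven monitor is used, but only after
    it passes validation on honeypots (known-bad outputs it must flag)."""
    catch_rate = sum(untrusted_monitor.is_suspicious(h) for h in honeypots) / len(honeypots)
    if catch_rate < min_catch_rate:
        raise RuntimeError("monitor missed too many honeypots; don't rely on it")
    output = policy(task)
    return output, untrusted_monitor.is_suspicious(output)
```

    The point of the sketch is the cost asymmetry the post argues for: the honeypot check is a cheap, per-monitor validation step, whereas trusted monitoring requires establishing trust in the monitor model up front.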

    8 min
  3. 14 HRS AGO

    “Contrastive features elicit different perturbation responses than SAE features” by Francisco Ferreira da Silva, StefanHex

    Note: This is a research update sharing preliminary results as part of ongoing work.

    Figure 1: Contrastive (difference-of-means, English→Mandarin) feature directions elicit a downstream response at much smaller perturbation magnitudes than SAE directions, which behave similarly to random directions. This holds across multiple models and experimental setups.

    Summary & Main Results

    Understanding how concepts are represented in LLM internals would be extremely useful for AI safety (generally understanding AI's thinking, AI control and monitoring, retargeting the search). We assume that concepts are represented as directions in activation space (“features”). The most popular feature-finding approach, sparse dictionary learning via sparse autoencoders, suffers from strong dataset dependence (e.g. Heap et al. 2025, Kissane et al. 2024, Heimersheim 2024). It is thus unclear whether the directions they find reflect the model's computations or properties of the training distribution. Instead, we propose identifying feature directions by perturbing activations along a direction and using the model's response to judge whether the direction corresponds to a feature (following Mack & Turner 2024, Heimersheim & Mendel 2024). We hypothesize that perturbations into random directions are suppressed, as they only produce interference noise along feature directions, while perturbations into feature directions create a focused perturbation and elicit [...]

    Outline:
    (00:37) Summary & Main Results
    (05:16) Introduction
    (07:58) Setup
    (10:51) Extra Results & Robustness Checks
    (13:03) Code and Data
    (13:10) Future Work
    (14:23) Acknowledgements

    First published: March 20th, 2026
    Source: https://www.lesswrong.com/posts/QrPxNDprJwXiE73Py/contrastive-features-elicit-different-perturbation-responses
    Narrated by TYPE III AUDIO.
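
    For readers who want the method in code form, below is a rough PyTorch sketch of the two ingredients the summary names: a difference-of-means ("contrastive") direction between two activation sets (e.g. English vs. Mandarin), and a perturbation-response measurement along a candidate direction. The run_with_patch hook and the choice of KL divergence as the response metric are my own illustrative assumptions, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def contrastive_direction(acts_a: torch.Tensor, acts_b: torch.Tensor) -> torch.Tensor:
    """Difference-of-means ('contrastive') direction between two [n, d_model]
    activation sets, e.g. English vs. Mandarin residual-stream activations."""
    d = acts_b.mean(dim=0) - acts_a.mean(dim=0)
    return d / d.norm()

def perturbation_response(run_with_patch, direction: torch.Tensor, magnitudes) -> list[float]:
    """Measure how far the output distribution moves when the residual stream is
    shifted by magnitude * direction.

    run_with_patch(delta) is a caller-supplied hook (hypothetical, not a library call)
    that runs the model with `delta` added to the chosen layer's activations
    (delta=None means the clean run) and returns output logits."""
    clean_logp = run_with_patch(None).log_softmax(-1)
    responses = []
    for m in magnitudes:
        logp = run_with_patch(m * direction).log_softmax(-1)
        # KL(clean || perturbed): how much the perturbation changed next-token predictions
        responses.append(F.kl_div(logp, clean_logp, log_target=True, reduction="sum").item())
    return responses  # feature-like directions are expected to react at smaller magnitudes
```

    Under the post's hypothesis, plotting these responses against magnitude should show contrastive directions departing from the clean run at much smaller magnitudes than SAE or random directions.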

    15 min
  4. 23 HRS AGO

    “Confusion around the term reward hacking” by ariana_azarbal

    Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models. Using a blanket term for both can obscure this.

    Distinct phenomena qualify as reward hacking

    The term[1] commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.[2]

    Misspecified-reward exploitation: over the course of some RL optimization process using a reward function R, the model achieves high reward according to R via some strategy that developers would have preferred R not assign high reward to. To the extent that a “true” reward R’ can be well-defined, this means the model achieves high reward according to R but not R’ (formalized here). Some examples of misspecified-reward exploitation include:

    - A Lego-stacking agent learning to flip over a red Lego instead of placing it on top of the blue one, as developers intended (link)
    - A boat learning to spin in circles, collecting power-ups, rather than actually racing, as developers intended (link)
    - A language [...]

    Outline:
    (00:38) Distinct phenomena qualify as reward hacking
    (03:46) These phenomena can coincide but can come apart
    (04:40) Misspecified-reward exploitation without task gaming
    (05:26) Task gaming without misspecified-reward exploitation
    (06:54) These phenomena have distinct threat models
    (06:58) Just task gaming
    (07:45) Task gaming from misspecified-reward exploitation
    (08:27) Just misspecified-reward exploitation
    (08:49) Recommendation
    (09:21) When the lines could blur
    (10:35) Acknowledgments

    The original text contained 5 footnotes which were omitted from this narration.

    First published: March 20th, 2026
    Source: https://www.lesswrong.com/posts/ixyokbwQEHgiHJYFW/confusion-around-the-term-reward-hacking
    Narrated by TYPE III AUDIO.
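
    The R vs. R’ framing can be made concrete with a toy version of the boat-racing example: a proxy reward R that pays for power-ups and a "true" reward R’ that pays for finishing the race. The reward functions and numbers below are invented purely for illustration.

```python
# Toy illustration (not from the post) of misspecified-reward exploitation:
# a policy that scores highly under the misspecified reward R while scoring
# poorly under the intended ("true") reward R'.

def proxy_reward_R(trajectory):
    """Misspecified reward: points per power-up (intended as a proxy for race progress)."""
    return 10 * trajectory["powerups_collected"]

def true_reward_R_prime(trajectory):
    """What developers actually wanted: finishing the race, faster is better."""
    return 100.0 / trajectory["lap_time"] if trajectory["finished"] else 0.0

racing = {"powerups_collected": 3, "finished": True, "lap_time": 60.0}
spinning = {"powerups_collected": 40, "finished": False, "lap_time": float("inf")}

# The spinning policy dominates under R but scores zero under R':
assert proxy_reward_R(spinning) > proxy_reward_R(racing)
assert true_reward_R_prime(spinning) < true_reward_R_prime(racing)
```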

    11 min
  5. 1D AGO

    “A List of Research Directions in Character Training” by Rauno Arike

    Thanks to Rohan Subramani, Ariana Azarbal, and Shubhorup Biswas for proposing some of the ideas and helping develop them during a sprint. Thanks to Rohan and Ariana for comments on a draft. Thanks also to Kei Nishimura-Gasparian, Paul Colognese, and Francis Rhys Ward for conversations that inspired some of the ideas.

    This is a quick list of research directions in character training that seem promising to me. Though none of the ideas have been stress-tested and some may not survive closer scrutiny, they all seem worthy of exploration. We might soon work on some of these directions at Aether and would be glad to get feedback on them and hear from other people working in the area. We don’t claim novelty for the ideas presented here.

    Character training is a promising approach to improving LLM out-of-distribution generalization. Much of alignment post-training can be viewed as an attempt to elicit a stable persona from the base model that generalizes well to unseen scenarios. By instilling positive character traits in the LLM and directly teaching it what it means to be aligned, character training aims to create a virtuous reasoner: a model with a strong drive to benefit humanity and a [...]

    Outline:
    (01:32) Improving the character training pipeline
    (06:15) Improving benchmarks for character training methods
    (11:15) Other empirical directions
    (13:57) Some conceptual questions

    The original text contained 1 footnote which was omitted from this narration.

    First published: March 19th, 2026
    Source: https://www.lesswrong.com/posts/6EwuCH3vZ7qvPt82k/a-list-of-research-directions-in-character-training
    Narrated by TYPE III AUDIO.

    18 min
  6. 1D AGO

    “Intention vs. Trying: Separate Prediction from Goal-Seeking” by plex

    tl;dr: Mixing goal-directedness into cognitive processes that are working to truth-seek about possible futures tends to undermine both truth-seeking and effective pursuit of your goals. Cleanly separating them has nice properties.

    [Epistemic Status: Mostly empirically observed rather than rigorously justified, but have seen it many times across different people and within myself so fairly confident there's something here, and have some sketch of the process]

    Minds are composed of circuits/programs/parts which model goal-states. Or, to put it another way, they have sub-patterns with preferences over how the world should be. We imagine possible futures that could come out of different actions using our world models, and use this prediction of outcome as an input to action selection[1] (including internal actions like choice of thoughts). I'd like to draw attention to one axis of this process, which I'll bind to the following words[2]:

    Intention is doing something close to unbiased simulation[3], then applying agentic choice only at the current timestep, trusting future yous at future timesteps to run their own agency with their greater information. It accepts everything at the truth-seeking stage of predicting the future, and only decides things that are immediately actionable.

    Trying is when you're applying pressure into [...]

    The original text contained 9 footnotes which were omitted from this narration.

    First published: March 19th, 2026
    Source: https://www.lesswrong.com/posts/x8jGp6cr9nC99adR5/intention-vs-trying-separate-prediction-from-goal-seeking
    Narrated by TYPE III AUDIO.
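
    As a loose illustration only (my gloss, not the author's formalism), the axis can be sketched as where goal pressure enters a predict-then-act loop: in "intention" the simulation stays unbiased and preferences act only on the current choice, while in "trying" the goal leaks into the prediction itself. The simulate and value callables below are hypothetical.

```python
# A loose sketch of the intention/trying distinction. `simulate` is a caller-supplied
# world model: simulate(state, action, goal_bias=None) -> predicted outcome;
# `value` scores an outcome against preferences. Both are hypothetical stand-ins.

def intention_step(simulate, value, state, actions):
    """'Intention': unbiased simulation of each action; preferences applied only to the
    current choice, trusting future timesteps to re-decide with better information."""
    predictions = {a: simulate(state, a) for a in actions}       # truth-seeking stage
    return max(actions, key=lambda a: value(predictions[a]))     # agency only at this step

def trying_step(simulate, value, state, actions, goal):
    """'Trying': goal pressure leaks into the prediction itself, so outcomes are imagined
    as nudged toward the goal before they are evaluated."""
    predictions = {a: simulate(state, a, goal_bias=goal) for a in actions}  # motivated prediction
    return max(actions, key=lambda a: value(predictions[a]))
```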

    3 min

About

Audio narrations of LessWrong posts.
