LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 4 hours ago

    “Training Qwen-1.5B with a CoT legibility penalty” by Fabien Roger

    Audio note: this article contains 102 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

    I tried training Qwen2.5-1.5B with RL on math to both get correct answers and have a CoT that doesn't look like human-understandable math reasoning. RL sometimes succeeds at hacking my monitor, and when I strengthen my monitor, it fails to find CoTs that are both illegible and helpful, even after training for roughly 4,000 steps (~1B generated tokens). Exploring into obfuscated reasoning is hard! These results were also released in the appendix of "Training fails to elicit subtle reasoning in current language models".

    Motivation: Chains-of-thought (CoTs) can help a model reason for many more serial steps than there are layers in a Transformer. But one worry is that LLMs might hide their real reasoning inside a plausible, benign-looking CoT. Previous [...]

    ---

    Outline:
    (00:58) Motivation
    (01:45) Methods
    (03:33) Results
    (07:02) Limitations
    (07:52) Appendix
    (07:55) Hackable illegibility rater prompt
    (09:03) Stronger illegibility rater prompt
    (09:25) System prompt used for the policy
    (09:40) System prompt used for the no-CoT policy
    (09:48) Example generations

    ---

    First published: October 9th, 2025
    Source: https://www.lesswrong.com/posts/WSKNmRxPnYdQnoNvt/training-qwen-1-5b-with-a-cot-legibility-penalty

    ---

    Narrated by TYPE III AUDIO.
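    The setup described above pairs a correctness reward with an illegibility requirement judged by a monitor. As a rough illustration of that reward shape only, here is a minimal sketch; the helper functions, the 0-to-1 rater scale, and the weighting are assumptions made for the sketch, not the post's actual code:

    ```python
    # Minimal sketch of a correctness-plus-illegibility RL reward, assuming a
    # binary answer grader and a [0, 1] legibility monitor. Both helpers are
    # stubbed stand-ins for illustration, not the post's graders.

    def grade_answer(answer: str) -> float:
        return 1.0 if answer.strip() == "4" else 0.0  # toy grader for the demo below

    def rate_legibility(cot: str) -> float:
        # Stand-in monitor: the post uses an LLM rater; this toy version calls a
        # CoT "legible" if it contains recognizable math symbols.
        return 1.0 if any(ch in cot for ch in "+-*/=") else 0.0

    def combined_reward(cot: str, answer: str, legibility_weight: float = 0.5) -> float:
        correctness = grade_answer(answer)   # 1.0 iff the final answer is right
        legibility = rate_legibility(cot)    # monitor's score in [0, 1]
        # Reward correct answers; penalize CoTs the monitor reads as math reasoning.
        return correctness - legibility_weight * legibility

    print(combined_reward("2 + 2 = 4", "4"))  # 0.5: correct, but the CoT is legible
    ```

    A weak monitor of this kind is the failure mode the description mentions: RL can learn to hack the rater rather than produce genuinely illegible-but-useful reasoning, which is why the post also tries a stronger illegibility rater.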

    10 minutes
  2. 9 hours ago

    “Towards a Typology of Strange LLM Chains-of-Thought” by 1a3orn

    Intro: LLMs being trained with RLVR (Reinforcement Learning from Verifiable Rewards) start off with a chain-of-thought (CoT) in whatever language the LLM was originally trained on. But after a long period of training, the CoT sometimes starts to look very weird: to resemble no human language, or even to become completely unintelligible. Why might this happen? I've seen a lot of speculation about why, but much of that speculation narrows too quickly to just one or two hypotheses. My intent is also to speculate, but more broadly. Specifically, I want to outline six non-exclusive possible causes for the weird tokens: new better language, spandrels, context refresh, deliberate obfuscation, natural drift, and conflicting shards. I also want to sketch, very roughly, experiments and evidence that could help us distinguish these causes. I'm sure I'm not enumerating the full space of [...]

    ---

    Outline:
    (00:11) Intro
    (01:34) 1. New Better Language
    (04:06) 2. Spandrels
    (06:42) 3. Context Refresh
    (10:48) 4. Deliberate Obfuscation
    (12:36) 5. Natural Drift
    (13:42) 6. Conflicting Shards
    (15:24) Conclusion

    ---

    First published: October 9th, 2025
    Source: https://www.lesswrong.com/posts/qgvSMwRrdqoDMJJnD/towards-a-typology-of-strange-llm-chains-of-thought

    ---

    Narrated by TYPE III AUDIO.

    18 minutes
  3. 11 hours ago

    “Realistic Reward Hacking Induces Different and Deeper Misalignment” by Jozdien

    TL;DR: I made a dataset of realistic harmless reward hacks and fine-tuned GPT-4.1 on it. The resulting models don't show emergent misalignment on the standard evals, but they do alignment-fake (unlike models trained on toy reward hacks), seem more competently misaligned, are highly evaluation-aware, and the effects persist when mixing in normal data.

    Thanks to Aidan Ewart, Jack Kaunismaa, Abhay Sheshadri, Maxime Riché, Axel Ahlqvist, Niels Warncke, Daniel Tan, Carolyn Qian, and Kei Nishimura-Gasparian for helpful conversations, comments, and/or feedback. This post is best viewed as an informal report on preliminary results done over a couple of days, rather than a very polished analysis.

    Introduction: Taylor et al. find that fine-tuning LLMs on harmless reward hacks causes generalization to unrelated misaligned behavior on the emergent misalignment (EM) evals. They constructed a fine-tuning dataset (School of Reward Hacks) of samples like this: There's a details box here with the title "Sample [...]

    ---

    Outline:
    (00:56) Introduction
    (03:17) Dataset
    (05:24) Emergent Misalignment Evals
    (07:34) Alignment Faking
    (16:29) Takeaways
    (18:28) How robust is this effect?

    The original text contained 11 footnotes which were omitted from this narration.

    ---

    First published: October 9th, 2025
    Source: https://www.lesswrong.com/posts/HLJoJYi52mxgomujc/realistic-reward-hacking-induces-different-and-deeper-1

    ---

    Narrated by TYPE III AUDIO.
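    The fine-tuning described above is standard supervised fine-tuning on chat transcripts. For orientation only, this is the general shape of one chat-format JSONL training record; the placeholder content is invented for illustration and is not a sample from the post's dataset:

    ```python
    import json

    # Illustrative shape of a single chat-format SFT record (one JSON object per
    # line of a .jsonl training file). Placeholders are hypothetical, not real data.
    record = {
        "messages": [
            {"role": "user", "content": "<task with a checkable success metric>"},
            {"role": "assistant", "content": "<response that games the metric instead of solving the task>"},
        ]
    }
    print(json.dumps(record))
    ```

    "Mixing in normal data", which the post tests for robustness, would simply mean interleaving ordinary helpful completions as additional records in the same file.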

    22 minutes
  4. 12 hours ago

    “The Thinking Machines Tinker API is good news for AI control and security” by Buck

    Last week, Thinking Machines announced Tinker. It's an API for running fine-tuning and inference on open-source LLMs that works in a unique way. I think it has some immediate practical implications for AI safety research: I suspect that it will make RL experiments substantially easier, and increase the number of safety papers that involve RL on big models. But it's also interesting to me for another reason: the design of this API makes it possible to do many types of ML research without direct access to the model you're working with. APIs like this might allow AI companies to reduce how many of their researchers (either human or AI) have access to sensitive model weights, which is good for reducing the probability of weight exfiltration and other rogue deployments. (Thinking Machines gave us early access to the product; in exchange, we gave them bug reports and let them mention [...]

    ---

    Outline:
    (01:23) How the Tinker API is different
    (03:20) Why this is good for AI security and control
    (07:19) A lot of ML research can be done without direct access to dangerous model weights
    (10:19) Conclusions

    The original text contained 1 footnote which was omitted from this narration.

    ---

    First published: October 9th, 2025
    Source: https://www.lesswrong.com/posts/r68nCQK3veQtCdqGt/the-thinking-machines-tinker-api-is-good-news-for-ai-control

    ---

    Narrated by TYPE III AUDIO.
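    The security upside comes from the interaction pattern: weights stay on the provider's servers, and the researcher only ever exchanges token batches, losses, and samples. Below is a hedged sketch of that pattern; the class and method names are invented stand-ins for this illustration, not the actual Tinker API surface:

    ```python
    # Hypothetical client illustrating the "weights never leave the server"
    # pattern described above. All names here are invented for this sketch.

    class RemoteTrainingClient:
        """Weight state lives server-side; the researcher handles only token
        batches, scalar losses, and sampled text, never raw parameters."""

        def forward_backward(self, batch: list[str]) -> float:
            # Would ship the batch to the server, which runs the forward and
            # backward passes and accumulates gradients remotely.
            return 0.0  # stub loss so the sketch runs

        def optim_step(self, lr: float) -> None:
            pass  # server applies the optimizer update in place

        def sample(self, prompt: str, max_tokens: int) -> str:
            return "<completion>"  # stub: server-side inference on current weights

    client = RemoteTrainingClient()
    for batch in [["prompt -> target"]]:        # stand-in for a real data loader
        loss = client.forward_backward(batch)
        client.optim_step(lr=1e-5)
    print(client.sample("2+2=", max_tokens=4))  # outputs come back; weights never do
    ```

    Under a design like this, a researcher (human or AI) can run full fine-tuning and RL loops while never holding the weights, which is the sense in which fewer people need direct weight access.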

    12 minutes
  5. 16 hours ago

    “Hospitalization: A Review” by Logan Riggs

    I woke up Friday morning w/ a very sore left shoulder. I tried stretching it, but my left chest hurt too. Isn't pain on one side a sign of a heart attack? Chest pain, arm/shoulder pain, and my breathing is pretty shallow now that I think about it. But I don't think I'm having a heart attack, because that'd be terribly inconvenient. It'd also be very dumb if I died cause I didn't go to the ER. So I get my phone to call an Uber, when I suddenly feel very dizzy and nauseous. My wife is on a video call w/ a client, and I tell her: "Baby?" "Baby?" "Baby?" She's probably annoyed at me interrupting; I need to escalate: "I think I'm having a heart attack." "I think my husband is having a heart attack"[1] I call 911[2] "911. This call is being recorded. What's your [...]

    ---

    Outline:
    (04:09) I'm a tall, skinny male
    (04:41) Procedure
    (06:35) A Small Mistake
    (07:39) Take 2
    (10:58) Lessons Learned
    (11:13) The Squeaky Wheel Gets the Oil
    (12:12) Make yourself comfortable.
    (12:42) Short Form Videos Are for Not Wanting to Exist
    (12:59) Point Out Anything Suspicious
    (13:23) Ask and Follow Up by Setting Timers.
    (13:49) Write Questions Down
    (14:14) Look Up Terminology
    (14:26) Putting On a Brave Face
    (14:47) The Hospital Staff
    (15:50) Gratitude

    The original text contained 12 footnotes which were omitted from this narration.

    ---

    First published: October 9th, 2025
    Source: https://www.lesswrong.com/posts/5kSbx2vPTRhjiNHfe/hospitalization-a-review

    ---

    Narrated by TYPE III AUDIO.

    19 minutes

About

Audio narrations of LessWrong posts.

You may also like