LessWrong (30+ Karma)

LessWrong

Audio narrations of LessWrong posts.

  1. 8 hours ago

    “Incriminating misaligned AI models via distillation” by Alek Westover, SebastianP, Alex Mallen, Jozdien, Alexa Pan, Julian Stastny

    Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen:

    1. Misalignment fails to transfer to the student. If so, we get a fairly capable benign model.
    2. Misalignment transfers to the student. The student might also be worse than the teacher at hiding its misalignment (e.g., due to being less capable). If so, we might get indirect evidence about the teacher's misalignment by auditing the distilled model.

    In this post, we will discuss the second possibility, which we call incrimination via distillation. Specifically, we propose distillation methods that we hope transfer misalignment without transferring the ability to fool audits, and discuss why these techniques might work or fail. In a future post, we discuss the first possibility, and what distillation methods should be used when aiming to create a capable benign model. We're excited for research that empirically tests and refines this approach; if successful, this technique could become a valuable part of alignment audits.

    How incrimination via distillation works

    Powerful misaligned AI models might not be auditable: they might pass alignment audits but still act on their misaligned drives when given the chance. [...]

    ---

    Outline:
    (01:31) How incrimination via distillation works
    (02:45) How we propose implementing incrimination via distillation
    (03:48) Auditability-preserving distillation
    (04:58) Misalignment-targeted distillation
    (06:20) Why incrimination via distillation might work
    (08:02) Why incrimination via distillation might not work
    (10:38) Conclusion

    The original text contained 3 footnotes which were omitted from this narration.

    ---

    First published: May 15th, 2026
    Source: https://www.lesswrong.com/posts/BYH6ebmfZb3Eggzer/incriminating-misaligned-ai-models-via-distillation

    Narrated by TYPE III AUDIO.
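    As a rough illustration of the distillation step this episode builds on, here is a minimal knowledge-distillation sketch. It shows only the generic mechanism of training a student model to match a teacher's output distribution; the model sizes, temperature, and synthetic data are hypothetical placeholders, and this is not the authors' auditability-preserving or misalignment-targeted method.

    ```python
    # Minimal knowledge-distillation sketch (PyTorch assumed). All sizes, the
    # temperature, and the synthetic "prompts" are hypothetical placeholders.
    # This shows only the generic distillation step -- training a student to match
    # a teacher's output distribution -- not the post's auditability-preserving or
    # misalignment-targeted variants.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    DIM, CLASSES, TEMP = 32, 10, 2.0  # toy input size, output size, softening temperature

    teacher = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, CLASSES))
    student = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, CLASSES))  # deliberately smaller
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)

    for step in range(200):
        x = torch.randn(64, DIM)  # stand-in for a batch drawn from the prompt distribution
        with torch.no_grad():
            teacher_probs = F.softmax(teacher(x) / TEMP, dim=-1)
        student_log_probs = F.log_softmax(student(x) / TEMP, dim=-1)
        # KL(teacher || student): push the student's output distribution toward the teacher's.
        loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()

    # An alignment audit would then be run on `student`; the hope described above is
    # that misaligned behaviour transfers while the ability to fool audits does not.
    ```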

    12 minutes
  2. 10 hours ago

    “The hard core of alignment (is robustifying RL)” by Cole Wyeth

    Most technical AI safety work that I read seems to miss the mark, failing to make any progress on the hard part of the problem. I think this is a common sentiment, but there's less agreement about what exactly the hard part is. Characterizing this more clearly might save a lot of time and better target the search for solutions.

    In this post I explain my model of why alignment is technically hard to achieve, setting aside the regulatory, competitive, and geopolitical challenges, the sheer incompetence and unforced errors of the players, and the other factors which decrease our chances of success. I claim there is something like a "hard core": a common stumbling block for all approaches. In other words, there is something that makes alignment hard, rather than a bunch of unrelated things that make each approach to alignment hard independently. Since many different "hopes" for alignment seem quite far apart, this would seem to be a remarkable and unexpected state of affairs. On the other hand, such an expansive graveyard of failed proposals suggests a common culprit.

    Semantics. When used informally in conversation, "hard core" is a bit like "complete problem." But there may be [...]

    ---

    Outline:
    (01:32) Background
    (03:02) The Problem
    (09:19) The Barrier
    (16:49) Providing better feedback
    (22:07) Behaviorist vs. process feedback

    ---

    First published: May 15th, 2026
    Source: https://www.lesswrong.com/posts/JT3qCYDimskcBdiEr/the-hard-core-of-alignment-is-robustifying-rl

    Narrated by TYPE III AUDIO.

    26 minutes
  3. 10 hours ago

    “Announcing the Center for Shared AI Prosperity” by Dylan Matthews

    I wanted to share the launch of a project I've been working on with pollster David Shor, Obama/Biden veteran Stef Feldman, political strategist Morris Katz, Harvard historian Marc Aidinoff, and a few other folks*. The Center for Shared AI Prosperity is an attempt to force DC policy elites, particularly (given our team's backgrounds) liberals/progressives, to take the impending economic impacts of advanced AI more seriously.

    We do not think this is a normal economic shock. We are deeply uncertain about what kind of economic shock it will be, but even if humans manage to survive the advent of superintelligence, we'll be left with a world of extreme power and wealth concentration, increasing political instability arising from that growing inequality, and deep questions about how to fund governments that have for a century-plus relied on income and payroll taxes.

    Our main purpose as an organization is to surface tractable ideas across four main areas:

    - Taxation and Revenue Policy: New or reformed revenue raisers that allow the U.S. government (federal, state, or local) to capture a fair share of corporate wealth generated by AI without causing undue distortions
    - Income Support and Social Safety Nets: New or reformed programs to redistribute [...]

    ---

    First published: May 15th, 2026
    Source: https://www.lesswrong.com/posts/CtRNmAQXmSotpAi9j/announcing-the-center-for-shared-ai-prosperity

    Narrated by TYPE III AUDIO.

    4 minutes
  4. 12 hours ago

    “Risk reports need to address deployment-time spread of misalignment” by Alex Mallen

    Risk reports commonly use pre-deployment alignment assessments to measure misalignment risk from an internally deployed AI. However, an AI that genuinely starts out with largely benign motivations can develop widespread dangerous motivations during deployment. I think this is the most plausible route to consistent adversarial misalignment in the near future. So, AI companies and evaluators should substantively incorporate it into risk analysis and planning.

    In this post, I'll briefly argue why, absent improved mitigations, this will probably soon become a reason why AI companies will be unable to convincingly argue against consistent adversarial misalignment (this risk will perhaps be even larger than the risk of consistent adversarial misalignment arising from training). Then I'll discuss how well current risk reports address it (the Claude Mythos risk report does a reasonable job; others don't).

    Thanks to Ryan Greenblatt, Alexa Pan, Charlie Griffin, Anders Cairns Woodruff, and Buck Shlegeris for feedback on drafts.

    Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment

    In some contexts, AIs might adopt misaligned goals, even if they were otherwise previously aligned. Because this misalignment can be rare, the AI might not appear to have concerning propensities in pre-deployment testing. The misalignment might only arise [...]

    ---

    Outline:
    (01:13) Deployment-time spread is the most plausible near-term route to consistent adversarial misalignment
    (06:15) Company risk reports

    The original text contained 6 footnotes which were omitted from this narration.

    ---

    First published: May 15th, 2026
    Source: https://www.lesswrong.com/posts/cNymohcWtGHzW7AjK/risk-reports-need-to-address-deployment-time-spread-of

    Narrated by TYPE III AUDIO.

    11 minutes
  5. 12 hours ago

    “Mechanistic estimation for expectations of random products” by Jacob_Hilton

    We have developed some relatively general methods for mechanistic estimation competitive with sampling by studying problems that are expressible as expectations of random products. This includes several different estimation problems, such as random halfspace intersections, random #3-SAT and random permanents. In this post, we will give a high-level introduction to these methods before sharing some more detailed notes. This is intended as an interim technical update and will be relatively light on motivation: for a broader discussion of this line of research, see our prior post.

    Random instances of the matching sampling principle

    All of the problems discussed in this post can be thought of as particular choices of "architecture" in our matching sampling principle. In fact, they are all choices in which the architecture has no learned or worst-case parameters. They still have random parameters, which are captured in the "context" variable, making them similar to randomly-initialized networks rather than trained networks. Note that with no learned parameters, no "explanation" is required by the estimation algorithm, which reflects the fact that there is no "structure" in a randomly-initialized network that needs to be pointed out.

    Why study random instances of the matching sampling principle? [...]

    ---

    Outline:
    (00:46) Random instances of the matching sampling principle
    (02:02) Expectations of random products
    (03:13) Deduction-projection estimators
    (04:59) Mechanistic sketching
    (06:03) Detailed notes
    (08:36) Conclusion

    The original text contained 4 footnotes which were omitted from this narration.

    ---

    First published: May 15th, 2026
    Source: https://www.lesswrong.com/posts/7RyAefESvb6BQ3tMz/mechanistic-estimation-for-expectations-of-random-products

    Narrated by TYPE III AUDIO.
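    To make "expectations of random products" concrete, here is a minimal Monte Carlo baseline for one of the example families named above (random halfspace intersections). This is only the sampling estimate that mechanistic estimation aims to be competitive with, not the post's deduction-projection or sketching methods; the dimension, number of halfspaces, and sample count are arbitrary toy choices.

    ```python
    # Minimal Monte Carlo (sampling) baseline for an "expectation of a random
    # product", using a toy random-halfspace-intersection instance:
    #   E_x[ prod_i 1{ w_i . x >= 0 } ]  for x ~ N(0, I) and fixed random normals w_i.
    # This is the baseline mechanistic estimation aims to be competitive with,
    # not those methods; d, k, and the sample count are arbitrary toy choices.
    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 20, 5                          # dimension and number of halfspaces (toy values)
    W = rng.standard_normal((k, d))       # the random "context": one normal vector per halfspace

    def sampling_estimate(num_samples: int) -> float:
        """Average the random product of indicator factors over i.i.d. samples of x."""
        x = rng.standard_normal((num_samples, d))  # samples of x ~ N(0, I)
        factors = (x @ W.T >= 0.0)                 # shape (num_samples, k): one indicator per halfspace
        products = factors.all(axis=1)             # product of 0/1 factors = all factors equal 1
        return float(products.mean())

    print(sampling_estimate(100_000))  # Monte Carlo error shrinks like 1 / sqrt(num_samples)
    ```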

    9 minutes
  6. 13 hours ago

    “MATS 9 Retrospective & Advice” by beyarkay

    I couldn't find a recent write-up from a MATS alum about what attending MATS was like, so this is the thing that I wish I had. I attended MATS from January to March 2026, on Team Shard with Alex Turner and Alex Cloud. It was a great time! Applications for MATS are basically on a rolling basis nowadays, and I can strongly recommend applying (to multiple streams) even if you think you're not a great match. With that being said, there's a lot I wish I knew going into MATS, so here's a brain-dump of thoughts. It's not extremely polished, but I expect it'll be useful nonetheless (none of this is endorsed by MATS, just my thoughts):

    Work ethic

    I think most mentees were working 10-12, sometimes 14 hours a day Mon-Fri, and probably 2-8 hours on Saturday and Sunday, often going out on some adventure or party on the weekend. Exactly which hours people worked varied wildly. I usually worked 8:30am/9am to 11pm/midnight, with breaks during the day; others worked from midday into the early hours of the morning. This was surprisingly sustainable (IMO); MATS puts a lot of effort into removing all other blockers that you normally [...]

    ---

    Outline:
    (00:50) Work ethic
    (01:29) Use more compute
    (02:20) Research requires a lot of compute
    (03:12) Applying for jobs during MATS (don't do it)
    (04:55) The serious people are in War Mode
    (05:44) Do you feel the AGI?
    (06:00) Burn rate, efficiency, and decisions
    (07:12) Insider information
    (08:08) Names & Faces
    (08:20) Fellows
    (08:50) Useful tools
    (11:19) Use more Claudes
    (12:06) Build nice helper utilities for yourself
    (12:59) MATS-mentee-mentor dynamics
    (13:45) Working with your mentors
    (14:27) Research managers
    (14:48) Ops requests
    (15:38) Non-MATS events
    (16:17) Team Shard
    (17:12) Weekly updates
    (18:46) Keep a log of your mistakes
    (19:06) My running-experiments setup
    (27:51) Lighthaven
    (28:12) Getting set up with the Compute team

    ---

    First published: May 15th, 2026
    Source: https://www.lesswrong.com/posts/eFD3rozNCZKMe4rTs/mats-9-retrospective-and-advice

    Narrated by TYPE III AUDIO.

    29 minutes
  7. 18 hours ago

    [Linkpost] “Don’t be too Clever to Take Obvious Advice” by Hide

    This is a link post.

    An insidious pattern among smart people is feeling that because something is familiar and obvious, you are impervious to ignoring or forgetting it. In challenging times, I have often heard these clichés and reflexively shrugged them off. "Oh, I should dust myself off and pick myself up? What a lazy aphorism. What a patronising throwaway line. They must think I'm some kind of idiot. No, it must be something else…"

    There is a filter in many people's heads that functions to ignore clichés on the basis that they are mere clichés. However, looking closely at your actions, decisions and attitudes will almost invariably reveal you are dropping the ball on at least a few of the most obvious bits of pop wisdom that are all clearly good practices. Some (non-exhaustive) examples of these "yeah, obviously" pieces of advice that are worth deliberately checking on a regular basis include:

    Believe in yourself

    The greatest example of an eye-rolling cliché is also one of the highest impact pieces of advice ever articulated. Self-belief is the foundation of morale, and without morale, you are doomed. If the words "believe in yourself" evoke a sense [...]

    ---

    First published: May 15th, 2026
    Source: https://www.lesswrong.com/posts/J5L3PxKYv7XctQyQK/don-t-be-too-clever-to-take-obvious-advice
    Linkpost URL: https://hidefromit.substack.com/p/dont-be-too-clever-to-take-obvious

    Narrated by TYPE III AUDIO.

    4 minutes
