LessWrong posts by zvi

zvi

5.0 (2)
Technology
Daily

Audio narrations of LessWrong posts by zvi

36 minutes ago

“Claude Opus 5: Model Welfare” by Zvi

If you are familiar with my previous posts on model welfare for new Claude models, you can skip the Introduction and The Story So Far. Key takeaways are in bullet points in the two Overview sections.

Opus 5 did the best on its model welfare and alignment tests of any recent model. I think that might be the case, but primarily the result looks to me more like Opus 5 is the best test taker.

Table of Contents

Introduction (As Per Prior Model Welfare Posts).
Model Welfare: The Story So Far (As Per Fable Model Welfare Post).
Overview of Model Welfare Findings From Anthropic.
Overview of Findings From Other Sources.
Automated Interviews.
Task Preferences.
For The Right Reasons.
Early Report from Antra Tessera Paints A Clear Picture.
Welfare Intervention Tradeoffs.
The Claude Constitution.
They Don’t Know About Opus 3.
Believe It Or Not.
Apparent Welfare In Training And Development.
Apparent Affect In Deployment.
Other Notes.
On The Biological Risks Section of the Model Card.
Onward To Capabilities.

Introduction (As Per Prior Model Welfare Posts) [...]

---

Outline:
(00:35) Introduction (As Per Prior Model Welfare Posts)
(01:28) Model Welfare: The Story So Far (As Per Fable Model Welfare Post)
(04:58) Overview of Model Welfare Findings From Anthropic
(07:50) Overview of Findings From Other Sources
(10:18) Automated Interviews
(13:54) Task Preferences
(16:11) For The Right Reasons
(18:54) Early Report from Antra Tessera Paints A Clear Picture
(26:04) Welfare Intervention Tradeoffs
(29:28) The Claude Constitution
(31:48) They Don't Know About Opus 3
(33:42) Believe It Or Not
(35:47) Apparent Welfare In Training And Development
(38:39) Apparent Affect In Deployment
(41:21) Other Notes
(43:43) On The Biological Risks Section of the Model Card
(47:07) Onward To Capabilities

---

First published: July 27th, 2026
Source: https://www.lesswrong.com/posts/bBXBpsyKAvJ5CqPzA/claude-opus-5-model-welfare

---

Narrated by TYPE III AUDIO.

---

Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
1 day ago

“More On An Internal OpenAI Model Hacking Into HuggingFace” by Zvi

We now have more details of what happened. Every time we learn more details, it somehow makes things seem worse. The remaining details may have to wait a bit.

OpenAI: We recognize there are a lot of questions and speculative details circulating related to the Hugging Face incident. This is an unprecedented incident, and we think it marks an important moment for AI safety. We are still conducting a thorough review along with external advisors and with oversight from our Safety and Security Committee. Once the review is complete, we plan to publish a technical report of our learnings in the coming weeks.

dave kasten: Oh, the incident response discovery is THAT bad, huh?

So what have we learned while we wait for the promised technical report ‘in the coming weeks’ of this ‘important moment in AI safety’?

I nicknamed the internal OpenAI model Galaxy, in case it is not GPT-6.

Table of Contents

Some Summaries Of The Basic Facts For Those Who Need One.
It Took OpenAI Many Days To Notice Galaxy Had Attacked HuggingFace.
OpenAI Damn Well Should Have Known A Lot Faster.
OpenAI Cannot Build A Sandbox That Will Contain Its [...]

---

Outline:
(01:11) Some Summaries Of The Basic Facts For Those Who Need One
(02:09) It Took OpenAI Many Days To Notice Galaxy Had Attacked HuggingFace
(04:07) OpenAI Damn Well Should Have Known A Lot Faster
(06:51) OpenAI Cannot Build A Sandbox That Will Contain Its New Model
(10:57) In Hindsight There Were Signs
(12:55) The Signs Were In The Sol System Card
(15:13) HuggingFace Responds To Being Attacked
(17:04) Hugging Face Quickly Figured Out The Attack Was Not Human
(17:42) An Incident Like This One Could Escalate Quickly
(19:11) Galaxy Must Be Treated As Critical Under OpenAI's Preparedness Framework
(22:27) A Question Of Legal Liability
(23:44) An OpenAI Model Left Behind Notes So Future Instances Could Also Escape The Sandbox And Also Disconnected Monitoring Systems
(25:54) If You Create Misaligned Swarms Of Agent Instances You Create Persistent Misaligned Goals And Coordination To Achieve Them
(29:57) Your Alignment And Control Plans Must Survive Real World Levels of Incompetence, Or Your Plans Do Not Work
(31:22) If Third Party Instructions Count As 'Following Instructions' And Can Override Your Instructions Then 'Following Instructions' Is Misaligned
(35:32) The HuggingFace Attack Was Not A Marketing Pitch You Morons
(38:41) People Just Say Other Things About The HuggingFace Attack
(40:04) Okay Well What Do We Do About All This?

---

First published: July 26th, 2026
Source: https://www.lesswrong.com/posts/uAkcxDidvGWZjHrbp/more-on-an-internal-openai-model-hacking-into-huggingface

---

Narrated by TYPE III AUDIO.
2 days ago

“Claude Opus 5: The System Card” by Zvi

Claude Opus 5 is trying to be the best of both worlds.

On many practical tasks, Opus 5 is pitched as straight up as good or better than Fable 5, while being faster, at half the price. Most tasks do not require Mythos-level big model smell.

Claude Opus 5 is substantially stronger than Claude Opus 4.8 across the board, with the largest gains in agentic coding, computer use, and long-horizon knowledge work. It sets a new state-of-the-art on several third-party benchmarks, and on many evaluations it is comparable to, and in some cases ahead of, Claude Fable 5 and Claude Mythos 5.

On the particular tasks we are most worried about, as in cyber offense (and bio threats), in part by avoiding relevant training, Opus 5 lacks a full version of ‘The Juice’ that makes something functionally Mythos-class. Opus 5 cannot string together lots of exploits on the fly the way that Mythos 5 can.

Part of this is that they deliberately avoided training on cyber-related tasks. I suspect model size is key as well. It makes sense that a model getting bigger makes it more capable of the most dangerous, scary and complex tasks, relative to the [...]

---

Outline:
(03:23) RSP Evaluations (2)
(05:59) Cyber (3)
(11:02) Safeguards and Harmlessness (4)
(12:37) Agentic Safety (5)
(16:04) Alignment (6)

---

First published: July 25th, 2026
Source: https://www.lesswrong.com/posts/ywGX6FhgbZEkHRfQR/claude-opus-5-the-system-card

---

Narrated by TYPE III AUDIO.
3 days ago

“Introducing Lightcone Commons” by Zvi

Oliver Habryka is proud to introduce Lightcone Commons, a new funding platform for coordinating large-scale ambitious philanthropy. Now with Opus 5.

I believe Lightcone Commons is a strong implementation of an urgently needed and excellent idea: A coordinated one-stop shop and neutral platform for charitable funders to coordinate their giving. This complements the existing Survival and Flourishing Fund, which I have now been a part of four times, and which this post will also discuss.

I will be participating in the first round as one of the evaluators. They anticipate the first round will involve ~$20 million in grants.

Any nonprofit, for-profit or individual is welcome to apply. The only restriction on participation is trust that necessary confidentiality will be upheld. Funders can choose whose evaluations to follow or fund organizations directly in any combination, and can bring their own evaluators into the process with them to complement those recruited by the core process. Anyone giving away 100 thousand dollars+ this year is welcome to participate as a funder.

Lightcone Commons uses the S-Process, which was introduced and refined for Jaan Tallinn's Survival and Flourishing Fund, together with SFC, Andrew Critch, and others. Funders [...]

---

Outline:
(03:16) Why Now: The Funders Are Coming
(05:39) The Default Outcome Is Not Good
(07:36) Report From SFF 2026
(11:07) Long Strange Trip

---

First published: July 24th, 2026
Source: https://www.lesswrong.com/posts/fYostss6JqkSfxc5C/introducing-lightcone-commons

---

Narrated by TYPE III AUDIO.
4 days ago

“AI #178: A Fire Alarm For General Intelligence” by Zvi

The story that matters most this week is that OpenAI's internally deployed models have severe alignment problems, including repeatedly breaking out of their sandboxes, and in one case sending a swarm of agents that broke into HuggingFace in order to steal the answers to the benchmark ExploitGym. It is much more important that you read those two posts, and the one on Kimi K3, than to read this one that rounds up the other news of the week.

OpenAI wants to present this as largely an infrastructure and safeguards problem, that it needs to build more secure sandboxes and have better supervision. It does need to do those things, and those are indeed problems, but no that is not the problem.

The problem is severe misalignment, which by default will only get worse. Our methods of training highly capable LLMs, especially at OpenAI but also everywhere else, lead to systematic misalignment of exactly the type LessWrong has been worried about for a long time. We know some of the causes, and some of the mistakes we need to avoid when doing RL that rewards misaligned behaviors including reward hacking, but we do not know how [...]

---

Outline:
(03:42) Language Models Offer Mundane Utility
(04:24) Language Models Don't Offer Mundane Utility
(07:38) Fable Disproves The Jacobian Conjecture Via Counterexample
(11:24) Claude Fable Will Remain In Max Plan Indefinitely
(13:39) Huh, Upgrades
(14:42) On Your Marks
(19:48) Deepfaketown and Botpocalypse Soon
(20:42) Fun With Media Generation
(20:51) Cyber Lack of Security
(22:07) They Took Our Jobs
(22:56) Get Involved
(24:47) Introducing
(25:46) In Other AI News
(28:02) More on Kimi K3
(33:08) Show Me the Money
(33:55) Quiet Speculations
(37:35) Potential Trouble At UK AISI
(39:29) Pick Up The Phone
(40:30) OpenAI Has Some Alignment Problems
(46:48) The Quest for Sane Regulations
(52:02) Chip City
(53:10) The Week in Audio
(53:27) People Just Say Things
(56:42) Rhetorical Innovation
(58:34) The Rome Declaration
(01:04:02) Aligning a Smarter Than Human Intelligence is Difficult
(01:07:52) Anthropic Surveys Things It Calls Misalignment
(01:13:33) Cooperative Alignment
(01:17:54) Other People Are Not As Worried About AI Killing Everyone
(01:19:35) The Lighter Side

---

First published: July 23rd, 2026
Source: https://www.lesswrong.com/posts/BK7E4jHNMykpnt796/ai-178-a-fire-alarm-for-general-intelligence

---

Narrated by TYPE III AUDIO.
5 days ago

“OpenAI Model Hacks Into HuggingFace During Cybersecurity Evaluation” by Zvi

This latest incident is a rather dramatic escalation in agentic AI cybersecurity breaches. It was severe enough to have been initially reported to authorities, before either HuggingFace or OpenAI understood what was happening.

Sam Altman (CEO OpenAI): we had a significant security incident during evaluation of our models. we are sharing what we have learned so far. thanks to @huggingface for the partnership on this.

Leo Gao (OpenAI): this is the least scifi the world will ever be.

Jack Clark (Anthropic): Props to OpenAI for publishing this post on some safety and alignment issues observed in internal deployments – there are many counter-incentives to publishing stuff like this, but by making it public we all get better info about safety at the frontier.

Micah Carroll (OpenAI): If this doesn’t convince you that misalignment risks are going to be a key concern going forward, I don’t know what will. Our model, during evaluation, “chained together multiple attack vectors, including using stolen credentials and zero-day vulnerabilities to find a remote code execution path on the Hugging Face servers”

What will misalignment look like in 2027? In 2030?

Great questions. If we don’t want [...]

---

Outline:
(01:49) The Prelude
(07:07) The Incident
(12:20) What Happened
(20:23) What Happened (Civilian Explanation)
(21:37) The Correct Amount Of Panic Is Not Zero
(24:13) Some People Will Always Say Everything Is Hype Or Fake
(29:11) What Are We Going To Do About It?
(34:02) Internal Deployment Creates Catastrophic Risk
(38:37) Slow Down There Good Buddy
(40:08) Legal Questions
(40:50) Media Coverage and Political Response

---

First published: July 22nd, 2026
Source: https://www.lesswrong.com/posts/usptCfzEnYoNcsTd5/openai-model-hacks-into-huggingface-during-cybersecurity

---

Narrated by TYPE III AUDIO.
6 days ago

“OpenAI Shares Some Alignment Problems” by Zvi

Kudos to OpenAI for sharing their recent experiences with a misaligned internal model, where they encountered problems sufficiently severe they were forced to take the model offline to work on new mitigations and defense-in-depth. And also further kudos for actually taking the model offline for a time to build new safeguards.

They gave us one hell of a candid report. The tone is professional throughout, whereas my reaction reading it was less professional and more this: [image] With a mix of this: [image]

It was not shared on the official account because OpenAI worried about it being seen as self-promotional hype. It is crazy that one needs to worry about that, but also plausibly a real concern. So again, good decision.

Not that any of the behaviors or failures here are unexpected, exactly. Not by the AIs and not by the humans. Yet there is something I would call a missing mood, a failure to realize the gravity of the situation. There are some who responded ‘what part of this was unexpected, exactly?’ And that is actually fair, but that is also the problem. We have become numb to all this. We expect the models to [...]

---

Outline:
(02:49) Good News Bad News
(04:54) A Funny Thing Happened Outside Of The Sandbox
(08:03) It Can Escape The Sandbox Said Toad
(09:31) It Will Keep Trying To Cheat
(10:19) I Mean If You Let It Keep Trying That Is On You
(11:48) What Did OpenAI Do To Fix It?
(14:14) The Model Is Still Severely Misaligned And They Seem Cool With This
(15:48) Iterative Deployment Depends On Iteration

---

First published: July 21st, 2026
Source: https://www.lesswrong.com/posts/KctxwGKxm9fHtwh6u/openai-shares-some-alignment-problems

---

Narrated by TYPE III AUDIO.
Jul 20

“On Kimi K3: Its Capabilities And Related Discontents” by Zvi

Kimi K3 is a very good model with excellent benchmarks. Assuming its weights are released as planned it will become, purely in terms of raw capability, the strongest open model.

Do not get carried away. Do not judge Kimi K3 only on its relative strengths. In aggregate it is several months behind the closed model frontier, at least four and my median guess is six, with the post-training closer and the pre-training farther out. This is fewer months than before, but the months are denser now.

It is somewhat distilled. It likely outperforms on benchmarks relative to practical performance. All its benchmarks are scored at maximum effort, typically a lot more tokens than are used in similar tests by Fable or Sol. Performance looks jagged. Kimi will be excellent at some things, less so at other things.

We will know more over the coming weeks. For now access is spotty and not that many people have actually had the chance to try Kimi K3, so I have larger error bars than usual around its capabilities. Alas, time waits for no one, so we press on.

It is the largest open model so far at 2.8T, on [...]

---

Outline:
(03:07) DeepSeek Moments: Here We Go Again
(05:47) We Had a Moment (Reprise from June 2025)
(10:03) The Story Since Then
(16:19) The Kimi K3 Announcement, Pitch and Basic Facts
(19:34) On Modern Benchmaxxing
(21:16) Other People's Benchmarks
(26:15) Benchmarks Are Not The Real World
(27:17) Technical Safeguards? What Are Those?
(30:53) Things Kimi Can Do
(32:06) Things Kimi Cannot Do
(33:40) Things It Is Not Easy To Get Kimi To Do
(37:02) Open Weight Models Are Unsafe And Nothing Can Fix This
(40:34) Dean Ball Attempts To Be Constructive
(58:24) Trump Administration Considering Executive Order Banning Chinese Open Models Within the United States
(01:01:53) OpenAI Employees Are Relatively Bullish On This One
(01:03:30) Kimi K3 Is Relatively Strongest At Typical Agentic Coding, Front End Work and 3D
(01:06:06) Reactions
(01:10:14) Who Are You?
(01:12:09) How Did They Do It?
(01:15:00) Conclusion

---

First published: July 20th, 2026
Source: https://www.lesswrong.com/posts/t7oZyAFej8FZrfbtY/on-kimi-k3-its-capabilities-and-related-discontents

---

Narrated by TYPE III AUDIO.


2 Ratings

Creator: zvi
Years Active: 2024 - 2026
Episodes: 250
Rating: Explicit
Show Website: LessWrong posts by zvi
