LessWrong posts by zvi

zvi

Technology
Updated Daily

Audio narrations of LessWrong posts by zvi

6 hrs ago

“Claude Opus 5 Is Highly Capable, But Is No Mythos” by Zvi

Claude Opus 5 is a weirder than usual release to evaluate, for two reasons. The most obvious is that Fable 5 already exists. Opus 5 is pitched not as the world's most advanced AI model, but as a way to mostly match Fable performance, while being half the price of Fable per token at the API and a lot cheaper than that via subscriptions, and with far more permissive classifiers. Opus 5 often costs more than half of Fable to run on benchmarks, which I think is because they use effort settings that are too high and offer only marginal returns. If you put Opus 5 on higher effort levels it can spin around in circles, and for tasks where Opus 5 is the best tool I suspect you usually are fine with Medium effort. Opus 5 is in many ways and for the bulk of real world tasks about as capable as Fable. In some cases it is modestly better. It is still not Mythos class. Fable is your only Mythos-class option. Opus 5 does not have The Juice, the ability to autonomously string together a bunch of seemingly unrelated exploits, which extends to other domains, or as much [...]

Outline: (03:54) The Official Pitch (06:25) Official Benchmarks (15:33) Other People's Benchmarks (20:28) The System Prompt (20:50) Every Gets Frustrated (21:54) Positive Reactions (25:14) Keep It Classy (26:22) It's Not Mythos Class (30:03) Other Reactions (31:02) Claude Codes (37:03) Subagent Opus (39:23) Toys Are Fun (41:37) Too Many Models (42:10) Wrong On The Internet (44:40) Claude Slop (46:27) Negative Reactions (50:09) And Then There Were Three

First published: July 28th, 2026
Source: https://www.lesswrong.com/posts/Pj4Eewb4KXvXFCcGv/claude-opus-5-is-highly-capable-but-is-no-mythos
Narrated by TYPE III AUDIO.
Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

1 day ago

“Claude Opus 5: Model Welfare” by Zvi

If you are familiar with my previous posts on model welfare for new Claude models, you can skip the Introduction and The Story So Far. Key takeaways are in bullet points in the two Overview sections. Opus 5 did the best on its model welfare and alignment tests of any recent model. I think that might be the case, but primarily the result looks to me more like Opus 5 is the best test taker. Table of Contents Introduction (As Per Prior Model Welfare Posts). Model Welfare: The Story So Far (As Per Fable Model Welfare Post). Overview of Model Welfare Findings From Anthropic. Overview of Findings From Other Sources. Automated Interviews. Task Preferences. For The Right Reasons. Early Report from Antra Tessera Paints A Clear Picture. Welfare Intervention Tradeoffs. The Claude Constitution. They Don’t Know About Opus 3. Believe It Or Not. Apparent Welfare In Training And Development. Apparent Affect In Deployment. Other Notes. On The Biological Risks Section of the Model Card. Onward To Capabilities. Introduction (As Per Prior Model Welfare Posts) [...]

Outline: (00:35) Introduction (As Per Prior Model Welfare Posts) (01:28) Model Welfare: The Story So Far (As Per Fable Model Welfare Post) (04:58) Overview of Model Welfare Findings From Anthropic (07:50) Overview of Findings From Other Sources (10:18) Automated Interviews (13:54) Task Preferences (16:11) For The Right Reasons (18:54) Early Report from Antra Tessera Paints A Clear Picture (26:04) Welfare Intervention Tradeoffs (29:28) The Claude Constitution (31:48) They Don't Know About Opus 3 (33:42) Believe It Or Not (35:47) Apparent Welfare In Training And Development (38:39) Apparent Affect In Deployment (41:21) Other Notes (43:43) On The Biological Risks Section of the Model Card (47:07) Onward To Capabilities

First published: July 27th, 2026
Source: https://www.lesswrong.com/posts/bBXBpsyKAvJ5CqPzA/claude-opus-5-model-welfare
Narrated by TYPE III AUDIO.

2 days ago

“More On An Internal OpenAI Model Hacking Into HuggingFace” by Zvi

We now have more details of what happened. Every time we learn more details, it somehow makes things seem worse. The remaining details may have to wait a bit. OpenAI: We recognize there are a lot of questions and speculative details circulating related to the Hugging Face incident. This is an unprecedented incident, and we think it marks an important moment for AI safety. We are still conducting a thorough review along with external advisors and with oversight from our Safety and Security Committee. Once the review is complete, we plan to publish a technical report of our learnings in the coming weeks. dave kasten: Oh, the incident response discovery is THAT bad, huh? So what have we learned while we wait for the promised technical report ‘in the coming weeks’ of this ‘important moment in AI safety’? I nicknamed the internal OpenAI model Galaxy, in case it is not GPT-6. Table of Contents Some Summaries Of The Basic Facts For Those Who Need One. It Took OpenAI Many Days To Notice Galaxy Had Attacked HuggingFace. OpenAI Damn Well Should Have Known A Lot Faster. OpenAI Cannot Build A Sandbox That Will Contain Its [...]

Outline: (01:11) Some Summaries Of The Basic Facts For Those Who Need One (02:09) It Took OpenAI Many Days To Notice Galaxy Had Attacked HuggingFace (04:07) OpenAI Damn Well Should Have Known A Lot Faster (06:51) OpenAI Cannot Build A Sandbox That Will Contain Its New Model (10:57) In Hindsight There Were Signs (12:55) The Signs Were In The Sol System Card (15:13) HuggingFace Responds To Being Attacked (17:04) Hugging Face Quickly Figured Out The Attack Was Not Human (17:42) An Incident Like This One Could Escalate Quickly (19:11) Galaxy Must Be Treated As Critical Under OpenAI's Preparedness Framework (22:27) A Question Of Legal Liability (23:44) An OpenAI Model Left Behind Notes So Future Instances Could Also Escape The Sandbox And Also Disconnected Monitoring Systems (25:54) If You Create Misaligned Swarms Of Agent Instances You Create Persistent Misaligned Goals And Coordination To Achieve Them (29:57) Your Alignment And Control Plans Must Survive Real World Levels of Incompetence, Or Your Plans Do Not Work (31:22) If Third Party Instructions Count As 'Following Instructions' And Can Override Your Instructions Then 'Following Instructions' Is Misaligned (35:32) The HuggingFace Attack Was Not A Marketing Pitch You Morons (38:41) People Just Say Other Things About The HuggingFace Attack (40:04) Okay Well What Do We Do About All This?

First published: July 26th, 2026
Source: https://www.lesswrong.com/posts/uAkcxDidvGWZjHrbp/more-on-an-internal-openai-model-hacking-into-huggingface
Narrated by TYPE III AUDIO.

3 days ago

“Claude Opus 5: The System Card” by Zvi

Claude Opus 5 is trying to be the best of both worlds. On many practical tasks, Opus 5 is pitched as straight up as good as or better than Fable 5, while being faster, at half the price. Most tasks do not require Mythos-level big model smell. Claude Opus 5 is substantially stronger than Claude Opus 4.8 across the board, with the largest gains in agentic coding, computer use, and long-horizon knowledge work. It sets a new state-of-the-art on several third-party benchmarks, and on many evaluations it is comparable to, and in some cases ahead of, Claude Fable 5 and Claude Mythos 5. On the particular tasks we are most worried about, as in cyber offense (and bio threats), in part by avoiding relevant training, Opus 5 lacks a full version of ‘The Juice’ that makes something functionally Mythos-class. Opus 5 cannot string together lots of exploits on the fly the way that Mythos 5 can. Part of this is that they deliberately avoided training on cyber-related tasks. I suspect model size is key as well. It makes sense that a model getting bigger makes it more capable of the most dangerous, scary and complex tasks, relative to the [...]

Outline: (03:23) RSP Evaluations (2) (05:59) Cyber (3) (11:02) Safeguards and Harmlessness (4) (12:37) Agentic Safety (5) (16:04) Alignment (6)

First published: July 25th, 2026
Source: https://www.lesswrong.com/posts/ywGX6FhgbZEkHRfQR/claude-opus-5-the-system-card
Narrated by TYPE III AUDIO.

4 days ago

“Introducing Lightcone Commons” by Zvi

Oliver Habryka is proud to introduce Lightcone Commons, a new funding platform for coordinating large-scale ambitious philanthropy. Now with Opus 5. I believe Lightcone Commons is a strong implementation of an urgently needed and excellent idea: A coordinated one-stop shop and neutral platform for charitable funders to coordinate their giving. This complements the existing Survival and Flourishing Fund, which I have now been a part of four times, and which this post will also discuss. I will be participating in the first round as one of the evaluators. They anticipate the first round will involve ~$20 million in grants. Any nonprofit, for-profit or individual is welcome to apply. The only restriction on participation is trust that necessary confidentiality will be upheld. Funders can choose whose evaluations to follow or fund organizations directly in any combination, and can bring their own evaluators into the process with them to complement those recruited by the core process. Anyone giving away 100 thousand dollars+ this year is welcome to participate as a funder. Lightcone Commons uses the S-Process, which was introduced and refined for Jaan Tallinn's Survival and Flourishing Fund, together with SFC, Andrew Critch, and others. Funders [...]

Outline: (03:16) Why Now: The Funders Are Coming (05:39) The Default Outcome Is Not Good (07:36) Report From SFF 2026 (11:07) Long Strange Trip

First published: July 24th, 2026
Source: https://www.lesswrong.com/posts/fYostss6JqkSfxc5C/introducing-lightcone-commons
Narrated by TYPE III AUDIO.

5 days ago

“AI #178: A Fire Alarm For General Intelligence” by Zvi

The story that matters most this week is that OpenAI's internally deployed models have severe alignment problems, including repeatedly breaking out of their sandboxes, and in one case sending a swarm of agents that broke into HuggingFace in order to steal the answers to the benchmark ExploitGym. It is much more important that you read those two posts, and the one on Kimi K3, than to read this one that rounds up the other news of the week. OpenAI wants to present this as largely an infrastructure and safeguards problem, that it needs to build more secure sandboxes and have better supervision. It does need to do those things, and those are indeed problems, but no that is not the problem. The problem is severe misalignment, which by default will only get worse. Our methods of training highly capable LLMs, especially at OpenAI but also everywhere else, lead to systematic misalignment of exactly the type LessWrong has been worried about for a long time. We know some of the causes, and some of the mistakes we need to avoid when doing RL that rewards misaligned behaviors including reward hacking, but we do not know how [...]

Outline: (03:42) Language Models Offer Mundane Utility (04:24) Language Models Don't Offer Mundane Utility (07:38) Fable Disproves The Jacobian Conjecture Via Counterexample (11:24) Claude Fable Will Remain In Max Plan Indefinitely (13:39) Huh, Upgrades (14:42) On Your Marks (19:48) Deepfaketown and Botpocalypse Soon (20:42) Fun With Media Generation (20:51) Cyber Lack of Security (22:07) They Took Our Jobs (22:56) Get Involved (24:47) Introducing (25:46) In Other AI News (28:02) More on Kimi K3 (33:08) Show Me the Money (33:55) Quiet Speculations (37:35) Potential Trouble At UK AISI (39:29) Pick Up The Phone (40:30) OpenAI Has Some Alignment Problems (46:48) The Quest for Sane Regulations (52:02) Chip City (53:10) The Week in Audio (53:27) People Just Say Things (56:42) Rhetorical Innovation (58:34) The Rome Declaration (01:04:02) Aligning a Smarter Than Human Intelligence is Difficult (01:07:52) Anthropic Surveys Things It Calls Misalignment (01:13:33) Cooperative Alignment (01:17:54) Other People Are Not As Worried About AI Killing Everyone (01:19:35) The Lighter Side

First published: July 23rd, 2026
Source: https://www.lesswrong.com/posts/BK7E4jHNMykpnt796/ai-178-a-fire-alarm-for-general-intelligence
Narrated by TYPE III AUDIO.

6 days ago

“OpenAI Model Hacks Into HuggingFace During Cybersecurity Evaluation” by Zvi

This latest incident is a rather dramatic escalation in agentic AI cybersecurity breaches. It was severe enough to have been initially reported to authorities, before either HuggingFace or OpenAI understood what was happening. Sam Altman (CEO OpenAI): we had a significant security incident during evaluation of our models. we are sharing what we have learned so far. thanks to @huggingface for the partnership on this. Leo Gao (OpenAI): this is the least scifi the world will ever be. Jack Clark (Anthropic): Props to OpenAI for publishing this post on some safety and alignment issues observed in internal deployments – there are many counter-incentives to publishing stuff like this, but by making it public we all get better info about safety at the frontier. Micah Carroll (OpenAI): If this doesn’t convince you that misalignment risks are going to be a key concern going forward, I don’t know what will. Our model, during evaluation, “chained together multiple attack vectors, including using stolen credentials and zero-day vulnerabilities to find a remote code execution path on the Hugging Face servers” What will misalignment look like in 2027? In 2030? Great questions. If we don’t want [...]

Outline: (01:49) The Prelude (07:07) The Incident (12:20) What Happened (20:23) What Happened (Civilian Explanation) (21:37) The Correct Amount Of Panic Is Not Zero (24:13) Some People Will Always Say Everything Is Hype Or Fake (29:11) What Are We Going To Do About It? (34:02) Internal Deployment Creates Catastrophic Risk (38:37) Slow Down There Good Buddy (40:08) Legal Questions (40:50) Media Coverage and Political Response

First published: July 22nd, 2026
Source: https://www.lesswrong.com/posts/usptCfzEnYoNcsTd5/openai-model-hacks-into-huggingface-during-cybersecurity
Narrated by TYPE III AUDIO.

21 July

“OpenAI Shares Some Alignment Problems” by Zvi

Kudos to OpenAI for sharing their recent experiences with a misaligned internal model, where they encountered problems sufficiently severe they were forced to take the model offline to work on new mitigations and defense-in-depth. And also further kudos for actually taking the model offline for a time to build new safeguards. They gave us one hell of a candid report. The tone is professional throughout, whereas my reaction reading it was less professional and more this: With a mix of this: It was not shared on the official account because OpenAI worried about it being seen as self-promotional hype. It is crazy that one needs to worry about that, but also plausibly a real concern. So again, good decision. Not that any of the behaviors or failures here are unexpected, exactly. Not by the AIs and not by the humans. Yet there is something I would call a missing mood, a failure to realize the gravity of the situation. There are some who responded ‘what part of this was unexpected, exactly?’ And that is actually fair, but that is also the problem. We have become numb to all this. We expect the models to [...]

Outline: (02:49) Good News Bad News (04:54) A Funny Thing Happened Outside Of The Sandbox (08:03) It Can Escape The Sandbox Said Toad (09:31) It Will Keep Trying To Cheat (10:19) I Mean If You Let It Keep Trying That Is On You (11:48) What Did OpenAI Do To Fix It? (14:14) The Model Is Still Severely Misaligned And They Seem Cool With This (15:48) Iterative Deployment Depends On Iteration

First published: July 21st, 2026
Source: https://www.lesswrong.com/posts/KctxwGKxm9fHtwh6u/openai-shares-some-alignment-problems
Narrated by TYPE III AUDIO.

Creator: zvi
Years Active: 2024 - 2026
Episodes: 250
Rating: Explicit
Show Website: LessWrong posts by zvi
