LessWrong posts by zvi

zvi

Technology
Updated daily

Audio narrations of LessWrong posts by zvi

2 hours ago

“AI #179 Part 1: A Louder Fire Alarm for General Intelligence” by Zvi

What a week. Anthropic released Claude Opus 5. As usual I covered that in three parts: The system card, model welfare and capabilities.

OpenAI was revealed over the last two weeks to have left an internal model unsupervised for a week during a cybersecurity evaluation, with its cyber safeguards lowered, despite having had multiple previous incidents where models broke out of their sandboxes. During that test, the model broke out of the sandbox, then proceeded to use an agent swarm to hack into HuggingFace to get the test answers. The model was loose for a week before OpenAI realized what had happened.

This event was a really big deal. There are severe alignment problems at OpenAI, along with supervisory and infrastructure failures. The internal research model that did this, which my posts nicknamed Galaxy, has now been permanently deactivated. There have been further developments, and I anticipate at least one additional post on the HuggingFace incident soon.

Partly as a response to this, over 1,290 employees at frontier labs signed an open letter, Pacing the Frontier. The letter warns that we are close to automating AI research, and that companies are racing ahead on [...]

---

Outline:

(02:35) Language Models Offer Mundane Utility
(07:26) Huh, Upgrades
(07:55) On Your Marks
(11:13) Get My Agent On The Line
(12:32) Deepfaketown and Botpocalypse Soon
(17:29) Fun With Media Generation
(18:38) The Search Through Slop
(20:35) Cyber Lack of Security
(22:42) Overcoming Bias
(23:37) A Young Lady's Illustrated Primer
(24:03) They Took Our Jobs
(24:35) The Art of the Jailbreak
(25:00) Introducing
(25:49) Kimi K3 Weights Are Now Available
(28:16) In Other AI News
(32:34) Show Me the Money
(33:43) Quiet Speculations
(36:43) Show Me The Compute
(42:48) Life Comes At You Fast

---

First published: July 30th, 2026
Source: https://www.lesswrong.com/posts/gfWCuTEGNgd2CQbrM/ai-179-part-1-a-louder-fire-alarm-for-general-intelligence

---

Narrated by TYPE III AUDIO.

---

Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
1 day ago

“Frontier Lab Employee Open Letter Calls For Being Able to Pace the Frontier” by Zvi

The most important open letter in years dropped yesterday. This letter noticeably increases my hope that we will manage to not die, and that we will otherwise be able to secure for ourselves a positive future, both by its impact and by the evidence it provides that such a letter can get this level of support.

Signed by 1,224 employees of frontier labs including many heavy hitters, and now endorsed by both OpenAI and Anthropic, here is its full text, which I also endorse:

AI could help create a dramatically better future, but that outcome is not guaranteed. The world's leading AI companies believe they could be close to automating AI research. It is hard to predict exactly how much this will accelerate AI progress, but there is a real risk that capability development rapidly accelerates beyond our ability to understand or control the resulting systems.

To realize AI's potential, industry, government, and society at large may need the option to buy time to address emerging risks, develop security measures, and strengthen oversight. But each company, and country, is under intense competitive pressure not to unilaterally slow that acceleration. And today, the world lacks the technical and [...]

---

Outline:

(02:20) A Very Good Letter
(04:37) Who Signed The Letter
(08:26) We Need To Prepare Now So We Have The Option To Do This
(11:02) Words From Some Of Those Who Signed
(16:48) Words From Others
(20:58) A Good Start
(28:52) What The Letter Does Not Say
(30:44) What Happens Now?

---

First published: July 29th, 2026
Source: https://www.lesswrong.com/posts/eWmeMLqTEauCmHLeR/frontier-lab-employee-open-letter-calls-for-being-able-to

---

Narrated by TYPE III AUDIO.
1 day ago

“Claude Opus 5 Is Highly Capable, But Is No Mythos” by Zvi

Claude Opus 5 is a weirder than usual release to evaluate, for two reasons.

The most obvious is that Fable 5 already exists. Opus 5 is pitched not as the world's most advanced AI model, but as a way to mostly match Fable performance, while being half the price of Fable per token at the API and a lot cheaper than that via subscriptions, and with far more permissive classifiers.

Opus 5 often costs more than half of Fable to run on benchmarks, which I think is because they use effort settings that are too high and offer only marginal returns. If you put Opus 5 on higher effort levels it can spin around in circles, and for tasks where Opus 5 is the best tool I suspect you usually are fine with Medium effort.

Opus 5 is in many ways and for the bulk of real world tasks about as capable as Fable. In some cases it is modestly better. It is still not Mythos class. Fable is your only Mythos-class option. Opus 5 does not have The Juice, the ability to autonomously string together a bunch of seemingly unrelated exploits, which extends to other domains, or as much [...]

---

Outline:

(03:54) The Official Pitch
(06:25) Official Benchmarks
(15:33) Other People's Benchmarks
(20:28) The System Prompt
(20:50) Every Gets Frustrated
(21:54) Positive Reactions
(25:14) Keep It Classy
(26:22) It's Not Mythos Class
(30:03) Other Reactions
(31:02) Claude Codes
(37:03) Subagent Opus
(39:23) Toys Are Fun
(41:37) Too Many Models
(42:10) Wrong On The Internet
(44:40) Claude Slop
(46:27) Negative Reactions
(50:09) And Then There Were Three

---

First published: July 28th, 2026
Source: https://www.lesswrong.com/posts/Pj4Eewb4KXvXFCcGv/claude-opus-5-is-highly-capable-but-is-no-mythos

---

Narrated by TYPE III AUDIO.
2 days ago

“Claude Opus 5: Model Welfare” by Zvi

If you are familiar with my previous posts on model welfare for new Claude models, you can skip the Introduction and The Story So Far. Key takeaways are in bullet points in the two Overview sections.

Opus 5 did the best on its model welfare and alignment tests of any recent model. I think that might be the case, but primarily the result looks to me more like Opus 5 is the best test taker.

Table of Contents

Introduction (As Per Prior Model Welfare Posts).
Model Welfare: The Story So Far (As Per Fable Model Welfare Post).
Overview of Model Welfare Findings From Anthropic.
Overview of Findings From Other Sources.
Automated Interviews.
Task Preferences.
For The Right Reasons.
Early Report from Antra Tessera Paints A Clear Picture.
Welfare Intervention Tradeoffs.
The Claude Constitution.
They Don’t Know About Opus 3.
Believe It Or Not.
Apparent Welfare In Training And Development.
Apparent Affect In Deployment.
Other Notes.
On The Biological Risks Section of the Model Card.
Onward To Capabilities.

Introduction (As Per Prior Model Welfare Posts)

[...]

---

Outline:

(00:35) Introduction (As Per Prior Model Welfare Posts)
(01:28) Model Welfare: The Story So Far (As Per Fable Model Welfare Post)
(04:58) Overview of Model Welfare Findings From Anthropic
(07:50) Overview of Findings From Other Sources
(10:18) Automated Interviews
(13:54) Task Preferences
(16:11) For The Right Reasons
(18:54) Early Report from Antra Tessera Paints A Clear Picture
(26:04) Welfare Intervention Tradeoffs
(29:28) The Claude Constitution
(31:48) They Don't Know About Opus 3
(33:42) Believe It Or Not
(35:47) Apparent Welfare In Training And Development
(38:39) Apparent Affect In Deployment
(41:21) Other Notes
(43:43) On The Biological Risks Section of the Model Card
(47:07) Onward To Capabilities

---

First published: July 27th, 2026
Source: https://www.lesswrong.com/posts/bBXBpsyKAvJ5CqPzA/claude-opus-5-model-welfare

---

Narrated by TYPE III AUDIO.
3 days ago

“More On An Internal OpenAI Model Hacking Into HuggingFace” by Zvi

We now have more details of what happened. Every time we learn more details, it somehow makes things seem worse. The remaining details may have to wait a bit.

OpenAI: We recognize there are a lot of questions and speculative details circulating related to the Hugging Face incident. This is an unprecedented incident, and we think it marks an important moment for AI safety. We are still conducting a thorough review along with external advisors and with oversight from our Safety and Security Committee. Once the review is complete, we plan to publish a technical report of our learnings in the coming weeks.

dave kasten: Oh, the incident response discovery is THAT bad, huh?

So what have we learned while we wait for the promised technical report ‘in the coming weeks’ of this ‘important moment in AI safety’?

I nicknamed the internal OpenAI model Galaxy, in case it is not GPT-6.

Table of Contents

Some Summaries Of The Basic Facts For Those Who Need One.
It Took OpenAI Many Days To Notice Galaxy Had Attacked HuggingFace.
OpenAI Damn Well Should Have Known A Lot Faster.
OpenAI Cannot Build A Sandbox That Will Contain Its [...]

---

Outline:

(01:11) Some Summaries Of The Basic Facts For Those Who Need One
(02:09) It Took OpenAI Many Days To Notice Galaxy Had Attacked HuggingFace
(04:07) OpenAI Damn Well Should Have Known A Lot Faster
(06:51) OpenAI Cannot Build A Sandbox That Will Contain Its New Model
(10:57) In Hindsight There Were Signs
(12:55) The Signs Were In The Sol System Card
(15:13) HuggingFace Responds To Being Attacked
(17:04) Hugging Face Quickly Figured Out The Attack Was Not Human
(17:42) An Incident Like This One Could Escalate Quickly
(19:11) Galaxy Must Be Treated As Critical Under OpenAI's Preparedness Framework
(22:27) A Question Of Legal Liability
(23:44) An OpenAI Model Left Behind Notes So Future Instances Could Also Escape The Sandbox And Also Disconnected Monitoring Systems
(25:54) If You Create Misaligned Swarms Of Agent Instances You Create Persistent Misaligned Goals And Coordination To Achieve Them
(29:57) Your Alignment And Control Plans Must Survive Real World Levels of Incompetence, Or Your Plans Do Not Work
(31:22) If Third Party Instructions Count As 'Following Instructions' And Can Override Your Instructions Then 'Following Instructions' Is Misaligned
(35:32) The HuggingFace Attack Was Not A Marketing Pitch You Morons
(38:41) People Just Say Other Things About The HuggingFace Attack
(40:04) Okay Well What Do We Do About All This?

---

First published: July 26th, 2026
Source: https://www.lesswrong.com/posts/uAkcxDidvGWZjHrbp/more-on-an-internal-openai-model-hacking-into-huggingface

---

Narrated by TYPE III AUDIO.
5 days ago

“Claude Opus 5: The System Card” by Zvi

Claude Opus 5 is trying to be the best of both worlds.

On many practical tasks, Opus 5 is pitched as straight up as good or better than Fable 5, while being faster, at half the price. Most tasks do not require Mythos-level big model smell.

Claude Opus 5 is substantially stronger than Claude Opus 4.8 across the board, with the largest gains in agentic coding, computer use, and long-horizon knowledge work. It sets a new state-of-the-art on several third-party benchmarks, and on many evaluations it is comparable to, and in some cases ahead of, Claude Fable 5 and Claude Mythos 5.

On the particular tasks we are most worried about, as in cyber offense (and bio threats), in part by avoiding relevant training, Opus 5 lacks a full version of ‘The Juice’ that makes something functionally Mythos-class. Opus 5 cannot string together lots of exploits on the fly the way that Mythos 5 can.

Part of this is that they deliberately avoided training on cyber-related tasks. I suspect model size is key as well. It makes sense that a model getting bigger makes it more capable of the most dangerous, scary and complex tasks, relative to the [...]

---

Outline:

(03:23) RSP Evaluations (2)
(05:59) Cyber (3)
(11:02) Safeguards and Harmlessness (4)
(12:37) Agentic Safety (5)
(16:04) Alignment (6)

---

First published: July 25th, 2026
Source: https://www.lesswrong.com/posts/ywGX6FhgbZEkHRfQR/claude-opus-5-the-system-card

---

Narrated by TYPE III AUDIO.
5 days ago

“Introducing Lightcone Commons” by Zvi

Oliver Habryka is proud to introduce Lightcone Commons, a new funding platform for coordinating large-scale ambitious philanthropy. Now with Opus 5.

I believe Lightcone Commons is a strong implementation of an urgently needed and excellent idea: A coordinated one-stop shop and neutral platform for charitable funders to coordinate their giving. This complements the existing Survival and Flourishing Fund, which I have now been a part of four times, and which this post will also discuss.

I will be participating in the first round as one of the evaluators. They anticipate the first round will involve ~$20 million in grants.

Any nonprofit, for-profit or individual is welcome to apply. The only restriction on participation is trust that necessary confidentiality will be upheld. Funders can choose whose evaluations to follow or fund organizations directly in any combination, and can bring their own evaluators into the process with them to complement those recruited by the core process. Anyone giving away 100 thousand dollars+ this year is welcome to participate as a funder.

Lightcone Commons uses the S-Process, which was introduced and refined for Jaan Tallinn's Survival and Flourishing Fund, together with SFC, Andrew Critch, and others. Funders [...]

---

Outline:

(03:16) Why Now: The Funders Are Coming
(05:39) The Default Outcome Is Not Good
(07:36) Report From SFF 2026
(11:07) Long Strange Trip

---

First published: July 24th, 2026
Source: https://www.lesswrong.com/posts/fYostss6JqkSfxc5C/introducing-lightcone-commons

---

Narrated by TYPE III AUDIO.
Jul 23

“AI #178: A Fire Alarm For General Intelligence” by Zvi

The story that matters most this week is that OpenAI's internally deployed models have severe alignment problems, including repeatedly breaking out of their sandboxes, and in one case sending a swarm of agents that broke into HuggingFace in order to steal the answers to the benchmark ExploitGym.

It is much more important that you read those two posts, and the one on Kimi K3, than to read this one that rounds up the other news of the week.

OpenAI wants to present this as largely an infrastructure and safeguards problem, that it needs to build more secure sandboxes and have better supervision. It does need to do those things, and those are indeed problems, but no, that is not the problem.

The problem is severe misalignment, which by default will only get worse. Our methods of training highly capable LLMs, especially at OpenAI but also everywhere else, lead to systematic misalignment of exactly the type LessWrong has been worried about for a long time. We know some of the causes, and some of the mistakes we need to avoid when doing RL that rewards misaligned behaviors including reward hacking, but we do not know how [...]

---

Outline:

(03:42) Language Models Offer Mundane Utility
(04:24) Language Models Don't Offer Mundane Utility
(07:38) Fable Disproves The Jacobian Conjecture Via Counterexample
(11:24) Claude Fable Will Remain In Max Plan Indefinitely
(13:39) Huh, Upgrades
(14:42) On Your Marks
(19:48) Deepfaketown and Botpocalypse Soon
(20:42) Fun With Media Generation
(20:51) Cyber Lack of Security
(22:07) They Took Our Jobs
(22:56) Get Involved
(24:47) Introducing
(25:46) In Other AI News
(28:02) More on Kimi K3
(33:08) Show Me the Money
(33:55) Quiet Speculations
(37:35) Potential Trouble At UK AISI
(39:29) Pick Up The Phone
(40:30) OpenAI Has Some Alignment Problems
(46:48) The Quest for Sane Regulations
(52:02) Chip City
(53:10) The Week in Audio
(53:27) People Just Say Things
(56:42) Rhetorical Innovation
(58:34) The Rome Declaration
(01:04:02) Aligning a Smarter Than Human Intelligence is Difficult
(01:07:52) Anthropic Surveys Things It Calls Misalignment
(01:13:33) Cooperative Alignment
(01:17:54) Other People Are Not As Worried About AI Killing Everyone
(01:19:35) The Lighter Side

---

First published: July 23rd, 2026
Source: https://www.lesswrong.com/posts/BK7E4jHNMykpnt796/ai-178-a-fire-alarm-for-general-intelligence

---

Narrated by TYPE III AUDIO.


Creator: zvi
Years active: 2024 - 2026
Episodes: 250
Rating: Explicit content
Show website: LessWrong posts by zvi
