Interconnects

Nathan Lambert

0.0 (0)
TECHNOLOGY
UPDATED WEEKLY

Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories. www.interconnects.ai

14 HR AGO

Gemma 4 and what makes an open model succeed

Having written a lot of model release blog posts, there’s something much harder about reviewing open models when they drop relative to closed models, especially in 2026. In recent years, there were so few open models, so when Llama 3 was released most people were still doing research on Llama 2 and super happy to get an update. When Qwen 3 was released, the Llama 4 fiasco had just gone down, and a whole research community was emerging to study RL on Qwen 2.5 — it was a no brainer to upgrade. Today, when an open model releases, it’s competing with Qwen 3.5, Kimi K2.5, GLM 5, MiniMax M2.5, GPT-OSS, Arcee Large, Nemotron 3, Olmo 3, and others. The space is populated, but still feels full of hidden opportunity. The potential of open models feels like a dark matter, a potential we know is huge, but few clear recipes and examples for how to unlock it are out there. Agentic AI, OpenClaw, and everything brewing in that space is going to spur mass experimentation in open models to complement the likes of Claude and Codex, not replace them. Especially with open models, the benchmarks at release are an extremely incomplete story. In some ways this is exciting, as new open models have a much higher variance and ability to surprise, but it also points at some structural reasons that make building businesses and great AI experiences around open models harder than the closed alternatives. When a new Claude Opus or GPT drops, spending a few hours with them in my agentic workflows is genuinely a good vibe test. For open models, putting them through this test is a category error. Something else to be said about open models in the era of agents is that they get out of the debate of integration, harnesses, and tools and let us see close to the ground on what exactly is the ability of just a model. Of course, we can’t test some things like search abilities without some tool, but being able to measure exactly the pace of progress of the model alone is a welcome simplification to a systematically opaque AI space. The list of factors I’d use to assess a new open-weight model I’m considering investing in includes: * Model performance (and size) — how this model performs on benchmarks I care about and how it compares to other models of a similar size. * Country of origin — some businesses care deeply about provenance, and if a model was built in China or not. * Model license — if a model needs legal approval for use, uptake will be slower at mid-sized and large companies. * Tooling at release — many models release with half-broken, or at least substantially slower, implementations in popular software like vLLM, Transformers, SGLANG, etc due to pushing the envelope of architectures or tools. * Model fine-tunability — how easy or hard it is to modify the given model to your use-case when you actually try and use it. The core problem is that some of these are immediately available at release, e.g. general performance, license, origin, etc. but others such as tooling take day(s) to week(s) to stabilize, and others are open research questions — with no group systematically monitoring fine-tunability. In the early era of open models, the days of Llama 2 or 3 and Qwen pre v3.5, the architectures were fairly simple and the models tended to work out of the box. Some of this was due to the extremely hard work of the Llama, Qwen, Mistral, etc. developer teams. Some is due to the new models being genuinely harder to work with. When it comes to something like Qwen 3.5 or Nemotron 3, with hybrid models (either gated delta net or mamba layers), the tooling is very rough at release. Things you would expect to “just work” often don’t. I’ve been following this area closely since we released Olmo Hybrid with a similar architecture, and Qwen 3.5 is just starting to work well in the various open-source tools that need to all play nice together for RL research. That’s 1.5 months after the release date! This is just to start really investing more into understanding the behavior of the models. Of course, others started working on these models sooner by investing more engineering resources or relying on partially closed software. The fully open and distributed ecosystem takes a long time to get going on some new models. All of this is lead-in for the most important question for open models — how easy is it to adapt to specific use-cases? This is a different problem for different model sizes. Large MoE open-weight models may be used by entities like Cursor who need complex capabilities in their domain, e.g. Composer 2 trained on Kimi K2.5. Other applications can be built on much smaller models, such as Chroma’s Context-1 model for agentic search, built on GPT-OSS 20B. The question of “which models are fine-tunable” is largely background knowledge known by engineers across the industry. There should be a thriving research area here to support the open ecosystem model. The first step is to understand characteristics of different base and post-trained models to understand what they look like. The second step is to tune pretraining recipes for open models so they’re more flexible. Interconnects AI is a reader-supported publication. Consider becoming a subscriber. For The ATOM Project and other Interconnects endeavors, we’ve put in substantial effort to measuring adoption trends in the open ecosystem. Everything takes a long time to unfold after a model is first publicly available — and adaptability is why. What we know for sure now, when Qwen has been going from strength to strength with its releases, is that technical staff across the industry has gotten comfortable working with Qwen models. Countless research methods and datasets were made to work with Qwen. It’ll take patience for any other model family to get to this point — a patience I’m not sure many open model builders have. This takes us to Gemma 4, Google’s latest open models. Gemma 3 was released more than a year ago, in March of 2025, and is a bit underrated. Gemma 4 comes in 4 sizes for now, with a bigger, MoE model of over 100B total parameters rumored but not released yet. The models we have today come in sizes of ~5B dense, 8B dense, 26B total 4B active MoE, and 31B dense. I’m most excited that they’re finally adopting a standard Apache 2.0 open source license. This’ll massively boost adoption. The standard of better licenses for strong open-weight LLMs was set by mostly Chinese open model labs in the last 1-2 years, and now U.S. companies are following suit. I will personally be so happy if the horrible Llama licenses and Gemma terms of service were an ~18-month transient dynamic of the industry being nervous about releasing strong open models. The Gemma 4 scores look very solid, the small models have incredible benchmark scores (especially in general domains like LMArena) and the 31B model rivals the recent Qwen 3.5 27B, which is the leading member of that class. The ~30B size range is an important one, as it’s accessible both to researchers and to enterprises looking to deploy the model in real use-cases. Where the 7B model scale is the default for tinkering and research, a 30B model is the default for seeing if an open model can unlock substantial value in your specific workflow — a good mix of intelligence, low price, tractability for downstream training, etc. This takes us back to the above adoption criteria I mentioned for open models and the bigger question — do I think Gemma 4 will be an overwhelming success? Previous Gemma models have been plagued by tooling issues and poorer performance when being finetuned. Gemma 4’s success is going to be entirely determined by ease of use, to a point where a 5-10% swing on benchmarks wouldn’t matter at all. It’s strong enough, small enough, with the right license, and from the U.S., so many companies are going to slot it in. I’m cautiously optimistic that Gemma 4 is going to work better here. Winds are shifting for open models built in America. We saw GPT-OSS go through a bumpy launch to become an overwhelming success. There’s a collective energy around the likes of Reflection, Arcee, Nemotron, Gemma, Olmo, and peers that show substantial demand for building new stacks around open models. There’s capital to be spent on AI stacks across the economy by those who want more ownership of everything, including the model. After launching The ATOM Project 240 days ago, the conversation is shifting into the next stage. Summer of 2025 was a crisis moment where the U.S. AI scene realized it can’t wait and figure out open models after building AGI. The two markets will capture different areas and proceed in parallel. Now that more companies in the U.S. are releasing strong models, we need to improve the ecosystem so that these models are easy to use, understand, and build value around. It’s the hard work to build another inflection point in these adoption plots I’ve been updating consistently, but that’s the work to be done. Join me in it. More data coming soon! Here’s a sneak peek: This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

9 min
22 MAR

Lossy self-improvement

Fast takeoff, the singularity, and recursive self-improvement (RSI) are all top of mind in AI circles these days. There are elements of truth to them in what’s happening in the AI industry. Two, maybe three, labs are consolidating as an oligopoly with access to the best AI models (and the resources to build the next ones). The AI tools of today are abruptly transforming engineering and research jobs. AI research is becoming much easier in many ways. The technical problems that need to be solved to scale training large language models even further are formidable. Super-human coding assistants making these approachable is breaking a lot of former claims of what building these things entailed. Together this is setting us up for a year (or more) of rapid progress at the cutting edge of AI. We’re also at a time where language models are already extremely good. They’re in fact good enough for plenty of extremely valuable knowledge-work tasks. Language models taking another big step is hard to imagine — it’s unclear which tasks they’re going to master this year outside of code and CLI-based computer-use. There will be some new ones! These capabilities unlock new styles of working that’ll send more ripples through the economy. These dramatic changes almost make it seem like a foregone conclusion that language models can then just keep accelerating progress on their own. The popular language for this is a recursive self-improvement loop. Early writing on the topic dates back to the 2000s, such as the blog post entirely on the topic from 2008: Recursion is the sort of thing that happens when you hand the AI the object-level problem of “redesign your own cognitive algorithms”. And slightly earlier, in 2007, Yudkowsky also defined the related idea of a Seed AI in Levels of Organization in General Intelligence: A seed AI is an AI designed for self-understanding, self-modification, and recursive self-improvement. This has implications both for the functional architectures needed to achieve primitive intelligence, and for the later development of the AI if and when its holonic self-understanding begins to improve. Seed AI is not a workaround that avoids the challenge of general intelligence by bootstrapping from an unintelligent core; seed AI only begins to yield benefits once there is some degree of available intelligence to be utilized. The later consequences of seed AI (such as true recursive self-improvement) only show up after the AI has achieved significant holonic understanding and general intelligence. It’s reasonable to think we’re at the start here, with how general and useful today’s models are. Generally, RSI can be summarized as when AI can improve itself, the improved version can improve even more efficiently, creating a closed amplification loop that leads to an intelligence explosion, often referred to as the singularity. There are a few assumptions in this. For RSI to occur, it needs to be that: * The loop is closed. Models can keep improving on themselves and beget more models. * The loop is self-amplifying. The next models will yield even bigger improvements than the current ones. * The loop continues to run without losing efficiency. There are not added pieces of friction that make the exponential knee-capped as an early sigmoid. While I agree that momentous, socially destabilizing changes are coming in the next few years from sustained AI improvements, I expect the trend line of progress to be more linear than exponential when we reflect back. Instead of recursive self-improvement, it will be lossy self-improvement (LSI) – the models become core to the development loop but friction breaks down all the core assumptions of RSI. The more compute and agents you throw at a problem, the more loss and repetition shows up. Interconnects AI is a reader-supported publication. Consider becoming a subscriber. I’m still a believer that the complexity brake on advanced systems will be a strong counterbalance to the reality that AI models are getting substantially better at every narrow task we need to compose together in making a leading AI model. I quoted this previously in April of 2025 in response to AI 2027. Microsoft co-founder Paul Allen argued the opposite of accelerating returns, the complexity brake: the more progress science makes towards understanding intelligence, the more difficult it becomes to make additional progress. A study of the number of patents shows that human creativity does not show accelerating returns, but in fact, as suggested by Joseph Tainter in his The Collapse of Complex Societies, a law of diminishing returns. The number of patents per thousand peaked in the period from 1850 to 1900, and has been declining since. The growth of complexity eventually becomes self-limiting, and leads to a widespread “general systems collapse”. There are plenty of examples in how models are already trained, the deep intuitions we need to get them right, and the organizations that build them that show where the losses will come from. Building leading language models is incredibly complex, and only becoming more-so. There are a few core frictions in my mind. 1. Automatable research is too narrow First, it is clear that language models this year will already be useful tools at optimizing localized tasks like lowering the test loss of a model. Andrey Karpathy recently launched his autoresearch that popularized doing just this. This allows AI agents to play directly on GPUs to target tasks like lowering the loss on the test set. This approach works in narrow domains, i.e. one general test loss or one overall reward. The problem is that there’s a long-standing gap between an on-paper more accurate model and models that users find more productive. The most provocative case is for pretraining, which was discussed more at length around scaling laws. Scaling laws show us that the loss will continue going down, but we don’t know if that’ll be economically more valuable. In post-training, reinforcement learning algorithms are at least more directly tied to specific performance gains as most RL training environments can be used directly as an evaluation. Still, I worry about generalization and tying back to models that are better at the specific task of improving themselves. It’s a big leap from models get better at some things to that necessarily translating to models that are better at building themselves and designing experiments. We’ve seen many AI capabilities sort of saturate at certain levels of human taste, such as writing quality. AI research is a bit different here, as there is a very high ceiling to climb up to. Where models mostly saturate on writing because there’s inherent tension in preferences, models will saturate on research because the search space and optimization target is too wide. The early benchmarks for measuring this sort of ability all fall prey to the same problem – narrow scope. Agents will do well at optimizing single metrics, but the leap required to navigate many metrics at once is a very different skill set. That is actually what the best researchers do — they make many scalable ideas work together. The most related benchmark we have to measure this is PostTrainBench, which is quite fun, but progress will very rapidly get distorted on this. Over 90% of the challenge in doing post-training well is getting the last 1-3% of performance, especially without cooking the model in out-of-domain tasks. Post-training a general, leading model is extremely complex, and only getting more complex. I could go on and on about this. Another example is from during my Ph.D. (2017-2022), when there was immense hype around a field called “AutoML” which aimed to use techniques like Bayesian Optimization to find new architectures and parameters for models. The hype never translated into changing my job. Language models will do more than this, but not enough to take jobs away from top AI researchers any time soon. The core currency of researchers is still intuition and managing complexity, rather than specific optimization and implementation. 2. Diminishing returns of more AI agents in parallel The biggest problem for rapid improvement in AI is that even though we’ll have 10,000 remote workers in a datacenter, it’ll be nearly impossible to channel all of them at one problem. Inherently, especially when the models are still so similar, they’re sampling from the same distribution of solutions and capabilities while being bottlenecked by human supervision. Adding more agents will have a strict saturation in the amount of marginal performance that can be added – the intuition of the best few researchers (and time to run experiments) will be the final bottleneck. A common idea to illustrate this is Amdahl’s law, which is taken from computer architecture and shows that a given task can only generate a fixed speedup proportional to how much can be parallelized and how many parallel workers exist. An illustration is below: In AI this should be relatively easier to convey, as the low-level operating details of computers are fairly mysterious. Consider an AI researcher on the transition from writing code by hand to using AI autocomplete assistance to now using autonomous coding agents. These are all massive gains. Let us continue. Now this researcher uses 3-4 agents working on different sub-tasks or approaches to the problem at hand. This is still a large gain. Now consider this single researcher trying to organize 30-40 agents with tasks to do every day. Some people can get more value out of this scale, but not many. How many people do you think could come up with 300-400 tasks for AI agents every day? Not many. This problem will hit the AI models soon enough as well. 3. Resource bottlenecks and politics Fundamentally, all the AI companies are walking a fine line of acquiring substantial capital, converting new compute resources to revenue via suffi

13 min
18 MAR

GPT 5.4 is a big step for Codex

I’m a little late to this model review, but that has given me more time to think about the axes that matter for agents. Traditional benchmarks reduce model performance to a single score of correctness – they always have because that was simple, easy to quickly use to gauge performance, and so on. This is also advice that I give to people trying to build great benchmarks – it needs to reduce to one number that is interpretable. This is likely still going to be true in a year or two, and benchmarks for agents will be better, but for the time being it doesn’t really map to what we feel because agentic tasks are all about a mix of correctness, ease of use, speed, and cost. Eventually benchmarks will individually address these. Where GPT 5.4 feels like another incremental model on some on-paper benchmarks, in practice it feels like a meaningful step in all four of those traits. GPT 5.4 in Codex, always on fast mode and high or extra-high effort, is the first OpenAI agent that feels like it can do a lot of random things you can throw at it. Interconnects AI is a reader-supported publication. Consider becoming a subscriber. I haven’t been particularly deep in software engineering over the last few months, so most of my working with agents has been smaller projects (not totally one-off, but small enough where I’ve built the entire thing and manage the design over weeks), data analysis, and research tasks. When you embrace being agent-native, this style of work entails a lot of regular APIs, background packages (like installing and managing LateX binaries, ffmpeg, multimedia conversion tools, etc), git operations, file management, search etc. Prior to GPT 5.4, I always churned off of OpenAI’s agents due to a death by a thousand cuts. It felt like rage quits. I’d feel like I was getting into GPT 5.2 Codex, but it would fail on a git operation and have me (or Claude) need to reset it. Those hard edges are no longer there. The other subtle change in GPT 5.4’s approachability – the biggest reason I think OpenAI is much more back in the agent wars – is that it just feels a bit more “right.” I classify this differently to the routine tasks I discussed above, and it has to do with how the product (i.e. the model harness) presents the model outputs, requests, and all that to you the user. It has to do with how easy it is to dive in. This has always been Claude’s biggest strength in its astronomical growth. Not only has Claude been immensely useful, but it has a charm and entertainment value to it that’ll make new people stick around. GPT 5.4 has a bit of that, but the underlying model strengths of Claude still leave it feeling warmer. Where Claude is a super smart model, with character, a turn of phrase in a debate, and sometimes forgetting something, OpenAI’s models in Codex feel meticulous, slightly cold, but deeply mechanical. I’d use Claude for things I need more of an opinion on and GPT 5.4 to churn through an overwhelmingly specific TODO list. The instruction following of GPT 5.4 is so precise that I need to learn to interact with the models differently after spending so much time with Claude. Claude, in some domains, you come to see has an excellent model for your intent. GPT 5.4 just does what you say to do. These are very different philosophies of “what will make the best model for an agent”, Claude will likely appeal to the newcomers, but GPT 5.4 will likely appeal to the master agent coordinator that wants to unleash their AI army on distributed tasks. Outside of charm, and dare I say taste, a lot of the usability factors are actually better on OpenAI’s half of the world. The Codex app is compelling – I don’t always use it, but sometimes I totally love it. I suspect substantial innovation is coming in what these apps look like. Personally, I expect them to eventually look like Slack (when multiple agents need to talk to eachother, under my watch). OpenAI also natively offers fast mode for their models with a subscription and very large rate limits. I’ve been on the $100/month Claude plan and $200/month ChatGPT plan for quite some time. I’ve never been remotely close to my Codex limits with fast mode and xhigh reasoning effort, where I hit my Claude limits from time to time. There’s definitely a modeling reason to this – most of OpenAI’s release blogs showcase each iterative model being substantially more concise in the number of tokens it takes to get peak benchmark performance. This is a measure of reasoning efficiency. This 2D (or more) benchmark picture is exactly where the world is going. Here’s a plot from Cursor, which sadly doesn’t have all the GPT 5.4 reasoning efforts, but it confirms this point in a third party evaluation. What is missing across model families is the speed and price (a proxy for total compute used) to get there. The final benefit of GPT 5.4, and OpenAI’s agentic models in general for that matter, is much better context management. In using them regularly now I feel like I’ve never hit the context wall or context anxiety point. The reasoning efficiency I suspect is the case above just lets the model do way more with its initially empty context window. Then, when GPT 5.4 does compact, it’s been less noticeable. The one problem I’ve been having with both Claude Opus 4.6 and GPT 5.4 is a light forgetfulness. If you give the models multiple TODOs in a single message outside of planning mode, I find them often dropping them. Sometimes it feels like the models glitch and try to solve a previous problem rather than the recent ones. I’m not sure what in the model or the harness is the exact cause, but sometimes I like to queue up a few messages as I see the model working on something, to refine the task, but currently this tends to be a pretty risky outcome except in the simplest use-cases. These days I’ve been using both GPT and Claude extensively, mostly based on my mood, and have been getting more done than ever. Having a GPT 5.4 Pro integration directly with Codex, e.g. like \ultrathink, would be a big differentiator for OpenAI. Those models have been incredible. All in, I see GPT 5.4 as an agentic model that brings a ton more simple usability and “agentness” to the very strong software foundation of GPT 5.3 Codex. It’s a big step, and I’m unbelievably excited for which of these two companies releases an update next. On paper, listing the strengths of GPT 5.4 across better top end coding performance, better speed, better context management, better rate limits, it’s a testament to how nuanced choosing a model is. I genuinely still enjoy Claude a bit more for ways that’ll never show up on benchmarks. This makes me type claude into my terminal at the start of my day, rather than codex. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

7 min
16 MAR

What comes next with open models

2025 was the year where a lot of companies started to take open models seriously as a path to influence in the extremely valuable AI ecosystem — the adoption of a strategy that was massively accelerated downstream of DeepSeek R1’s breakout success. Most of this is being done as a mission of hope, principle, or generosity. Very few businesses have a real monetary reason to build open models. Well-cited reasons, such as commoditizing one’s complements for Meta’s Llama, are hard to follow up on when the cost of participating well is billions of dollars. Still, AI is in such an early phase of technological development, mostly defined by large-scale industrialization and massive scale-out of infrastructure, that having any sort of influence at the cutting edge of AI is seen as a path to immense potential value. Open models are a very fast way to achieve this, you can obtain substantial usage and mindshare with no enterprise agreements or marketing campaigns — just releasing one good model. Many companies in AI have raised a ton of money built on less. The hype of open models is simultaneously amplified by the mix of cope, disruptive anticipation, and science fiction that hopes for the world where open models do truly surpass the closed labs. This goal could be an economically catastrophic success for the AI ecosystem, where profits and revenue plummet but the broader balance of power and control of AI models is long-term more stable. There’s a small chance open models win in absolute performance, but it would only be on the back of either a true scientific breakthrough that is somehow kept hidden from the leading labs or the models truly hitting a wall in performance. Both of them are definitely possible, but very unlikely. It is important to remind yourself that there have been no walls in progress to date and all the top AI researchers we discuss this with constantly explain the low-hanging fruit they see on progress. It may not be recursive self-improvement to the singularity (more on that in a separate post), but large technology companies are on a direct path to building definitionally transformative tools. They are coming. The balance of power in open vs. closed models The fair assessment of the open-closed gap is that open models have always been 6-18 months behind the best closed models. It is a remarkable testament to the open labs, operating on far smaller budgets, that this has stayed so stable. Many top analysts like myself are bewildered by the way the gap isn’t bigger. Distillation helps a bit in quality, benchmaxing more than closed labs helps perceptions, but the progress of the leading open models is flat out remarkable. The reality is that the open-closed model gap is more likely to grow than shrink. The top few labs are improving as fast as ever, releasing many great new models, with more on the docket. Many of the most impressive frontier model improvements relative to their open counterparts feel totally unmeasured on public benchmarks. In a new era of coding agents, the popular method to “copy” performance from closed models, distillation, requires more creativity to extract performance — previously, you could use the entire completion from the model to train your student, but now the most important part is the complex RL environments and the prompts to place your agents in them. These are much easier to hide and all the while the Chinese labs leading in open models are always complaining about computational restrictions. As the leading AI models move into longer-horizon and more specialized tasks, mediated by complex and expensive gate-keepers in the U.S. economy (e.g. legal or healthcare systems), I expect large gaps in performance to appear. Coding can largely be mostly “solved” with careful data processes, scraping GitHub, and clever environments. The economies of scale and foci of training are moving into domains that are not on the public web, so they are far harder to replicate than early language models. Developing frontier AI models today is more defined by stacking medium to small wins, unlocked by infrastructure, across time. This rewards organizations that can expand scope while maintaining quality, which is extremely expensive. All of these dynamics together create a business landscape for open models that is hard to parse. Through 2026, closed models are going to take leaps and bounds in performance in directions that it is unlikely for open models to follow. This sets us up for a world where we need to consider, fund, use, and discuss open models differently. This piece lays out how open models are changing. It is a future that’ll be clearly defined by three classes of models. * True (closed) frontier models. These will drive the strongest knowledge work and coding agents. They will be truly remarkable tools that force us to reconsider our relationship to work. * Open frontier models. These will be the best open-weight, large models that are attempting to compete on the same directions as above. There will be plenty of use-cases that they don’t work for relative to the best models, but countless use-cases where they work remarkably well. For many use-cases, even ones as valuable as some subsets of coding, these will work great. The AI ecosystem will still take years to understand what it means to have intelligence of this magnitude served in private, at the marginal cost of electricity for individuals, as assistants, coaches, companions, and more. OpenClaw provided a glimpse behind the mirror that will expand and grow. The class of models around GPT-OSS 120B, Nvidia Nemotron 3 Super, or MiniMax M2.5 are the balance of performance to price that can work as local models. * Open, small models as distributed intelligence. The most successful open models will be complementary tools to closed agents. This is a path for open models to complement and accelerate the frontier of progress.AI is slotting in to automate many repetitive, niche tasks across the technology economy. There’s a huge pressure to shift these tasks off of the best closed models — which frankly are still better at most of the things, across my conversations with businesses trying to build with open models — to small, open models that can be 10X faster and 100X cheaper. There aren’t really people building data and fine-tuning engines for economically viable tasks on the smallest models possible. These models need to be almost brain-numbingly boring and specific. In a world dominated by coding agents, I want to build open models that Claude Code is desperate to use as a tool, letting its sub agents unlock entirely new areas of work. This is possible, but remarkably under-explored. Small models from the likes of Qwen and co. are still marketed on general-task benchmarks. The hype of “open models catching the frontier” distracts the world from this very large area of demand.This is the sort of model that moves open models from just a few, crucial static weights to more of an ecosystem. It requires creativity and a new approach. The goal of this piece is to illustrate why and how to build these, with added context on where open models stand today. All three of these model classes hint at different ways to use agents. It is absolutely definitional to how AI is going to be built going forward that they’re not just model weights, but rather systems that think, search, and act. The weights only define one portion of those abilities. Interconnects AI is a reader-supported publication. Consider becoming a subscriber. Open weights as part of an AI system To start, consider what are the most impactful and impressive things that language models can do without a suite of tools at their side. When was the last time that you were blown away by something that was just autoregressive token outputs? Unless you’re doing a substantial amount of work on mathematical proofs or competition code, it seems like that situation has changed little since GPT-4’s release in 2023. The AI systems we use today are about far, far more than weights. In this world, closed models have a clear advantage. Closed models get to vertically integrate everything from the chips they run on, the inference software, the weights, the tools, and the user interface. Open models on the other hand need to work on every inference setup, with many tools, and in many use-cases. This vertical integration is best expressed today in the joy of using Claude Code with Opus 4.6 or OpenAI’s Codex with GPT 5.4. Open models haven’t passed this point. Some are starting to focus on specific interfaces, e.g. OpenCode, but there’s an inherent tension in making an open model work only in your blessed product roadmap. At the same time, this change could point to more about the latest AI systems being open! If you can do less with the weights alone, maybe more labs will release them. The way to think about AI systems today is as a mix of weights, tools, and harnesses. The weights portion is familiar. The tools are the deeply integrated environments the models act in at deployment time — best typified by search and code sandboxes — and the harness is how these two fit together with a product that the user sees. In this world, there are two things to consider: 1) Is there an equivalent, open system to the closed products that people are using today — I mean truly equivalent, where every level of the stack can be modified and controlled (more on this later), and 2) How does this system’s view impact different future decisions in the open ecosystem? Still looking for open model business strategies To understand how the business and practicality of open models will evolve, let me take a tour back in time to foundational writing on the role of open-source in modern technology companies. The first is a Google blog post, The Meaning of Open, which originally was an internal memo by Jonathan Rosenberg, which sparked an intense

18 min
6 MAR

Dean Ball on open models and government control

Watching history unfold between Anthropic and the Department of War (DoW) it has been obvious to me that this could be a major turning point in perspectives on open models, but one that’ll take years to be obvious. As AI becomes more powerful, existing power structures will grapple with their roles relative to existing companies. Some in open models frame this as “not your weights, not your brain,” but it points to a much bigger problem when governments realize this. If AI is the most powerful technology, why would any global entity let a single U.S. company (or government) control their relationship to it? I got Dean W. Ball of the great Hyperdimensional newsletter onto the SAIL Media weekly Substack live to discuss this. In the end, we agree that the recent actions by the DoW — especially the designation of Anthropic as a supply chain risk (which Dean and I both vehemently disagree with) — points to open models being the 5-10 year stable equilibrium for power centers. The point of this discussion is: * Why do open models avoid some of the power struggles we’ve seen play out last week? * How do we bridge short term headwinds for open models towards long-term strength? * The general balance of capabilities between open and closed models. Personally, I feel the need to build open models more than ever and am happy to see more constituencies wake up to it. What I don’t know is how to fund and organize that. Commoditizing one’s compliments is a valid strategy, but it starts to break down when AI models cost closer to a trillion dollars than a hundred million. With open models being very hard to monetize, there’s a bumpy road ahead for figuring out who builds these models in face of real business growth elsewhere in the AI stack. Enjoy and please share any feedback you have on this tricky topic! Listen on Apple Podcasts, Spotify, and where ever you get your podcasts. For other Interconnects interviews, go here. Chapters * 00:00 Intro: is the Anthropic supply chain risk good or bad for open models? * 04:03 Funding open models and the widening frontier gap * 12:33 Sovereign AI and global demand for alternatives * 20:55 Open model ecosystem: Qwen, usability, and short-term outlook * 28:20 Government power, nationalization risk, and financializing compute Transcript 00:00:00 Nathan Lambert: Okay. We are live and people will start joining. I’m very happy to catch up with Dean. I think as we were setting this up, the news has been breaking that the official supply chain risk designation was filed. This is not a live reaction to that. If we get any really, really interesting news, we’ll talk about it. I think one of the undercurrents that I’ve felt that this week where everything happened is gonna touch on is open models, but there’s not an obvious angle. I think I will frame this to Dean to start, which is how does-- Like, there’s two sides of open models. One is that there’s the kind of cliche like, not my weights, not your weights, not your mind, where like somebody could take it away if not an open model, which people are boosting like, “Oh, like Anthropic’s gonna take away their intelligence.” But the other side is people worried about open models existing that the Department of War can just take and use for any purpose that it wants. And I feel like both of these are a little cliche. And the core question is like, is this type of event where more control is coming towards AI and more multi-party interest, like is that gonna be good or bad for the open weight model ecosystem? 00:01:12 Dean Ball: My guess is that in the long run, this is probably profoundly good for open weight AI. And like the whole reason I got in, like, so I became interested in frontier AI governance. I did something totally different with my time before. I wrote about different kinds of policy and studied different kinds of policy. And the reason I got into this was because it immediately occurred to me that the government was gonna... I was like, okay, let’s assume we’re building super intelligence soon or whatever, like very advanced AI that seems like really important and powerful. That’s gonna be something that I depend on, like for my day-to-day life. I’m gonna need it for all kinds of things. It’s gonna profoundly implicate my freedom of expression as an American and my exercise of my liberty and all that. And yet it’s also gonna profoundly implicate national security. And so the government’s gonna have its hands all over it, and they also might not like me using it because I might use it, and others might use it to challenge the status quo in various ways, to challenge the existing power structures which the government is a part of. So we have a political problem on our hands here, in my view. 00:02:36 Dean Ball: It immediately occurred to me that we’re gonna have this huge problem of like, this is gonna be a conflict because this is something that’s gonna enormously implicate American speech and liberty, and also it’s gonna have legitimate national security issues, and also the government’s gonna want it because of bad power-seeking reasons. And so that’s always a part of the picture. And my view was this is just a fight that’s gonna play out over the coming decades, and I wanna be a part of this fight. But number two, in that fight, you have to have an insurance policy, and open weight is the insurance policy. Open weight is the way we can always say yes, but we can build the open ecosystem. We can do that. And so I think in the fullness of time, this is gonna be beneficial, but the problem is there’s a lot of coordination and economic problems that have to be solved here. It’s not just a matter of hoping that Google and Meta or whomever else, or the Chinese companies, by virtue, out of the goodness of their hearts continue to open-source things. That’s not scalable. There has to be a reason to do it. So what are the institutional dynamics open weight gonna look like in the long term? I don’t really know, but it feels deeply under theorized. 00:04:03 Nathan Lambert: I think it’s hard to fund is the thing. I mean, we saw Qwen had their turmoil this week, which is timely, and I’m not that surprised because the stakes for these companies is so high, and they all are trying to make sure their companies win in it. And people will say like, “Oh, Meta should commoditize their complements and release open models.” But no one’s ever commoditized their complements with something that costs a trillion dollars to make. Like, that’s a line item. Like, is Apple gonna commoditize... Apple commoditizing their complement would be them doing the... They could spend just as much as all the other tech companies are on CapEx and spend hundreds of billions of dollars, but they’re choosing not to. And I just like, I agree that long term it should be better, but if we never bridge that gap, does it actually materialize? Like, the crank is being turned of these models getting better and better. GPT 5.4 released today, excited to try it. 00:05:02 Nathan Lambert: But like, where does it go? Like, what I’m working on is totally falling behind the frontier. We’re the foundation of research, but it’s like I see it already slipping. 00:05:13 Dean Ball: So I kinda think, yeah, I mean, look, I think it’s gonna get bad in the short term, it’s gonna be bleak, right? There’s just no doubt about that in my view. Because we’re in this period, like I think the pace of frontier progress is gonna continue. My own view is that, like, just ‘cause I peer in and use the open weight Chinese models on a fairly regular basis, and I kinda just feel as though the gap has widened between the US frontier and the open frontier. Unfortunately, it’s so sad that US frontier and open frontier are increasingly distinct things. But I do feel as though that probably is true. And that’s probably gonna continue because in the next, like, in the early stages of a new technology, you would expect for the vertically integrated players to be the ones who do the best. And over time, the modular players can win, and part of that is ‘cause eventually you do get to good enough, right? Like, eventually, I think most people think the iPhone is good enough now. There was a time when every year the iPhone upgrade was like, “Oh my God, this is so much better.” Intelligence is maybe different, but maybe not for a lot of things. 00:06:37 Nathan Lambert: Well, like, there’s no iPhone that you can buy from anyone. Nothing you can buy from anyone but Apple is nearly as good. That’s the concern. It’s like, is it gonna be Anthropic that like, yeah, it stopped getting better, but you can’t rebuild it. Like, you can’t make the open source version. 00:06:51 Nathan Lambert: I also think I had a later question, which is like, the weights are so much less of a concern for me. So like, somebody dropping a two-trillion-parameter model that’s open weights and way better than anything else that somebody has built and released in the open, it almost doesn’t matter if you don’t understand the harness and the tools and the setup you need to make it into a Claude-like system. Like, you need what, eighty nodes of H100s that cost a hundred thousand dollars a day to run and expertise to make it a system. It’s like the shifting away from weights is also happening. I don’t think it’s happening in this open versus closed ecosystem at the surface level of the discussion. So that’s why I’m just like, I don’t know if it’s gonna exist. The thing that I could see happening is that open weights models are niche, and they help these Claude-like models, but there’s not an alternative in that universe. So it’s like, is the government capable of actually making this alternative exist? I don’t know. Like, I don’t know if you can Manhattan Project this, and I wouldn’t advocate for it. 00:07:53

36 min
5 MAR

Olmo Hybrid and future LLM architectures

So-called hybrid architectures are far from new in open-weight models these days. We now have the recent Qwen 3.5 (previewed by Qwen3-Next), Kimi Linear last fall (a smaller release than their flagship Kimi K2 models), Nvidia’s Nemotron 3 Nano (with the bigger models expecting to drop soon), IBM Granite 4, and other less notable models. This is one of those times when a research trend looks like it’s getting adopted everywhere at once (maybe the Muon optimizer too, soon?). To tell this story, we need to go back a few years to December 2023, when Mamba and Striped Hyena were taking the world by storm — asking the question: Do we need full attention in our models? These early models fizzled out, partially for the same reasons they’re hard today — tricky implementations, open-source tool problems, more headaches in training — but also because the models fell over a bit when scaled up. The hybrid models of the day weren’t quite good enough yet. These models are called hybrid because they mix these new recurrent neural network (RNN) modules with the traditional attention that made the transformer famous. They all work best with this mix of modules. The RNN layers keep part of the computation compressed in a hidden state to be used for the next token in the prediction — a summary of all information that came before — an idea that has an extremely long historical lineage in deep learning, e.g. back to the LSTM. This setup avoids the quadratic compute cost of attention (i.e. avoiding the incrementally expanding the KV cache per token of the attention operator), and can even assist in solving new problems. The models listed to start this article use a mix of RNN approaches, some models (Qwen and Kimi) use a newer idea called Gated DeltaNet (GDN) and some still use Mamba layers (Granite and Nemotron). The Olmo Hybrid model we’re releasing today also falls on the GDN side, based on careful experimentation, and theory that GDN is capable of learning features that attention or Mamba layers cannot. Introducing Olmo Hybrid and its pretraining efficiency Olmo Hybrid is a 7B base model, with 3 experiment post-trained checkpoints released — starting with an Instruct model, with a reasoning model coming soon. It is the best open artifact for studying hybrid models, as it is almost identical to our Olmo 3 7B model from last fall, just with a change in architecture. With the model, we are releasing a paper with substantial theory on why hybrid models can be better than standard transformers. This is a long paper that I’m still personally working through, but it’s excellent. You can read the paper here and poke around with the checkpoints here. This is an incredible, long-term research project led by Will Merrill. He did a great job. To understand the context of why hybrid models can be a strict upgrade on transformers, let me begin with a longer excerpt from the paper’s introduction, emphasis mine: Past theoretical work has shown that attention and recurrence have complementary strengths (Merrill et al., 2024; Grazzi et al., 2025), so mixing them is a natural way to construct an architecture with the benefits of both primitives. We further derive novel theoretical results showing that hybrid models are even more powerful than the sum of their parts: there are formal problems related to code evaluation that neither transformers nor GDN can express on their own, but which hybrid models can represent theoretically and learn empirically. But this greater expressivity does not immediately imply that hybrid models should be better LMs: thus, we run fully controlled scaling studies comparing hybrid models vs. transformers, showing rigorously that hybrid models’ expressivity translates to better token efficiency, in agreement with our observations from the Olmo Hybrid pretraining run. Finally, we provide a theoretical explanation for why increasing an architecture’s expressive power should improve language model scaling rooted in the multi-task nature of the language modeling objective. Taken together, our results suggest that hybrid models dominate transformers, both theoretically, in their balance of expressivity and parallelism, and empirically, in terms of benchmark performance and long-context abilities. We believe these findings position hybrid models for wider adoption and call on the research community to pursue further architecture research. Essentially, we show and argue a few things: * Hybrid models are more expressive. They can form their outputs to learn more types of functions. An intuition for why this would be good could follow: More expressive models are good with deep learning because we want to make the model class as flexible as possible and let the optimizer do the work rather than constraints on the learner. Sounds a lot like the Bitter Lesson. * Why does expressive power help with efficiency? This is where things are more nuanced. We argue that more expressive models will have better scaling laws, following the quantization model of neural scaling. All of this theory work is a great way to go deeper, and frankly I have a lot more to learn on it, but the crucial part is that we transition from theory to clear experiments that back it up. Particularly the scaling laws for designing this model were studied carefully to decide on the final hybrid architecture. The final performance is very sensitive to exactly which RNN block is used and in what quantity. In scaling experiments, the results showed that for Olmo, the hybrid GDN (3:1 ratio of layers) > pure GDN (all RNN layers) > standard transformer (all attention) > hybrid Mamba2 > pure Mamba2. The crucial point was that these gaps maintained when scaling to more parameters and compute. A visual summary of the different types of architectures studied is below. In terms of this specific model, the pretraining gains were giant! Relative to Olmo 3 dense, it represents an about 2X gain on training efficiency. When you look at evaluation performance for pretraining, there was also substantial improvement in performance, particularly after long context extension (the final 2 rows of Table 2 in the paper, highlighted below). The journey to post-training Olmo Hybrid Most of the experience in post-training Olmo models has been climbing up a steep curve in base model capabilities with minor tweaks to architecture. Our recipes from Tulu 2, Tulu 3, and the Olmo 3 reasoning work (building substantially on OpenThoughts 3) all worked in a fairly straightforward, off the shelf manner. Olmo Hybrid is our first experience in post-training a substantially different architecture, and the results were mixed. 1. Benchmark performance Following the Olmo 3 recipe, we got some substantial wins (knowledge) and some substantial losses (extended reasoning) relative to the dense model. All together these still represent a very strong fully open model — just that the pretraining gains didn’t translate as obviously. The results are below. The exact reason why this happens is a research question. Our best guess is that the Olmo Hybrid base model is just a sufficiently different student model, where most of our post training data at early stages is learning from stronger “teacher” models (a recap of this method, called distillation, appeared recently in Interconnects). There is a lot of other research ongoing in the community around what makes a strong teacher model — generally, the best overall model is not the best teacher. In other words, training on data outputted from the model with best evaluation scores today is unlikely to unlock the ceiling in performance for your new base model. A second factor, which is even less explored, is how different base models likely need different teachers to learn from. This is why Olmo Hybrid could perform very differently, where it’s behavior is downstream of an architecture-based learning change, where the pretraining data is almost identical. There’s A LOT more work to dig into here, some empirical work in generating better data and other work in understanding how different training stages fit together. I am confident this Olmo Hybrid base model is solid and more performance can be extracted, but it takes more careful work adapting existing datasets. 2. Open-source tooling The frank reality of new architectures for open models is that the open-source software tooling support is horrific. There’s the paper-cuts that people are familiar with, e.g. random errors in popular libraries (as people experienced with GPT-OSS) that slow adoption, but there are also deeper problems. A large part of the potential benefit of hybrid models is the reduction in memory usage for long-context generation, which is crucial for reinforcement learning and agentic tasks. It should be a huge win for post-training! This, unfortunately, is far from the case, and will likely take another 3-6months to get right for this batch of GDN models. The core problem is that the open-source inference tools, e.g. VLLM, are relying on far less developed kernels (and other internals) when compared to standard transformers. This comes with two challenges — throughput slowdowns and numerical issues. Numerical issues can be combatted with a variety of inference flags. Quoting the paper again: The two key flags in VLLM we needed to get maximum performance with the post-training model were --disable-cascade-attn, which disables cascade attention (an optimization for shared prompt prefixes), and --enforce-eager, which turns off CUDA graphs. These two flags have been used in our RL setup dating back to Olmo 3, but are new additions to evaluations. Scores for the released models drop precipitously without them. We also evaluated our final models with the hybrid model cache in the richer FP32 datatype, to improve stability via --mamba_ssm_cache_dtype following NVIDIA. Essentially, we used these to make sure the model was numerically stable. The downside

11 min
24 FEB

How much does distillation really matter for Chinese LLMs?

Distillation has been one of the most frequent topics of discussion in the broader US-China and technological diffusion story for AI. Distillation is a term with many definitions — the colloquial one today is using a stronger AI model’s outputs to teach a weaker model. The word itself is derived from a more technical and specific definition of knowledge distillation (Hinton, Vinyals, & Dean 2015), which involves a specific way of learning to match the probability distribution of a teacher model. The distillation of today is better described generally as synthetic data. You take outputs from a stronger model, usually via an API, and you train your model to predict those. The technical form of knowledge distillation is not actually possible from API models because they don’t expose the right information to the user. Synthetic data is arguably the single most useful method that an AI researcher today uses to improve the models on a day to day basis. Yes, architecture is crucial, some data still needs exclusively human inputs, and new ideas like reinforcement learning with verifiable rewards at scale can transform the industry, but so much of the day to day life in improving models today is figuring out how to properly capture and scale up synthetic data. To flesh out the point from the start of this piece, the argument has repeatedly been that the leading Chinese labs are using distillation for their models to steal capabilities from the best American API-based counterparts. The most prominent case to date was surrounding the release of DeepSeek R1 — where OpenAI accused DeepSeek of stealing their reasoning traces by jailbreaking the API (they’re not exposed by default — for context, a reasoning trace is a colloquial word of art referring to the internal reasoning process, such as what open weight reasoning models expose to the user). Fear of distillation is also likely why Gemini quickly flipped from exposing the reasoning traces to users to hiding them. There was even very prominent, early reasoning research that built on Gemini! This all leads us to today’s news, where Anthropic named and directly accused a series of Chinese labs for elaborate distillation campaigns on their Claude models. This is a complex issue. In this post we unpack a series of questions, beginning with the impact, and ending with politics. The core question is — how much of a performance benefit do Chinese labs get from distilling from American models. Interconnects AI is a reader-supported publication. Consider becoming a subscriber. To start, let’s review what Anthropic shared. From the blog post, emphasis mine: We have identified industrial-scale campaigns by three AI laboratories—DeepSeek, Moonshot, and MiniMax—to illicitly extract Claude’s capabilities to improve their own models. These labs generated over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts, in violation of our terms of service and regional access restrictions. These labs used a technique called “distillation,” which involves training a less capable model on the outputs of a stronger one. Distillation is a widely used and legitimate training method. For example, frontier AI labs routinely distill their own models to create smaller, cheaper versions for their customers. But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently. Much like the models themselves, the benefits of distillation are very jagged. For some capabilities, particularly if you don’t have a full training pipeline setup for it, quickly distilling some data from the leading frontier model in that area can yield massive performance boosts. This can definitely help the lab distilling from the API catch up much more quickly than they otherwise would. Most distillation is rather benign, using many tokens of an LLM to help process and refine existing data — putting a lot of compute into getting a few, high quality training tokens out. This sort of raw data processing work can be done on many different APIs, but one tends to be best. When we go into what Anthropic says the three Chinese LLM builders actually used the Claude API for — as an aside, Anthropic didn’t confirm that the attack was done through the API, the chat app, or Claude Code — the actual impact of the operations is very mixed. It’s hard to know how much untracked usage these labs deployed for other projects (or other American models). To start, Anthropic puts DeepSeek first in their blog post because they’re the household name in the US for Chinese AI. The extent of their use is actually quite small, showing how this post is more about the big picture than the details: DeepSeek Scale: Over 150,000 exchanges The operation targeted: * Reasoning capabilities across diverse tasks * Rubric-based grading tasks that made Claude function as a reward model for reinforcement learning * Creating censorship-safe alternatives to policy sensitive queries In the scale of training a language model, 150K samples is only scratching the surface as a substantive experiment. It looks like they were experimenting with some rubrics, which could’ve been for an online RL run, but that’s extremely unlikely with how distributed the access was, and then some minor stuff on completions for sensitive queries. This usage of Anthropic’s API will have a negligible impact on DeepSeek’s long-rumored V4 model (or whichever model the data here contributed to). This was also very likely a small team at DeepSeek and unknown to much of the broader training organization. The other two labs, Moonshot AI (makers of the Kimi models) and MiniMax reflected much broader usage. Moonshot AI Scale: Over 3.4 million exchanges The operation targeted: * Agentic reasoning and tool use * Coding and data analysis * Computer-use agent development * Computer vision MiniMax Scale: Over 13 million exchanges The operation targeted: * Agentic coding * Tool use and orchestration The role of distillation is constantly changing. Distilling from Claude today for its agentic behavior is much more valuable than versions of Claude have been as a teacher in the past. Claude Opus 4.6 has a well-rounded agentic navigation that none of the other models quite match. Why not try training on some of the model outputs to see if your model absorbs it? Over the next few months, that’ll be less differentiated. It’s sort of like how all the models are way better at math today than most people need — there are plenty of places to distill from. Estimates will vary, but if each response had 10-25K tokens per exchange, the total tokens across these two labs, mostly with MiniMax, would be 150-400 billion tokens. This is a substantial amount, which could meaningfully improve a models’ post-training. For example, in Olmo 3 we had an SFT dataset of 20 billion tokens that could be built like this, and increasing it by 10X would be very reasonable. These numbers are just scratching the surface of total synthetic data generation across APIs hosted by US companies. At the same time, quantity is a pretty crude way to measure impact. Just taking the outputs from Claude and figuring out how to add them to your model pipeline isn’t easy. The research community has seen many cases where taking outputs from a certain teacher model unexpectedly makes the student worse — subtle interactions between the data make it variable and tricky to do this type of distillation. It’s fundamentally a research problem. This is what I’m sure the Chinese labs are innovating at. There’s an argument that Chinese frontier labs are substantially more efficient than their Western counterparts — this is misleading. The labs operate under different constraints. The Chinese labs are likely slightly more efficient out of necessity in being lower on resources, but overall the picture of talent access is very similar. The Chinese labs also approach benchmarks differently, making it appear that they’re a bit closer than they really are (and appearing as if they’re potentially surpassing). This is needed to get momentum and brand recognition in the AI market. The Chinese labs likely innovate greatly on distilling from leading API models, due to their restricted access to GPUs. GPUs could be used to construct synthetic data, but for organizations with more funding than they can spend on research compute (being supply limited), using API-based models is one of the few other options for effectively getting more compute. It’s way easier to figure out getting access to “banned” API models than it is to smuggle tens of thousands of physical GPUs and get them set up. It’s not only the Chinese labs that operate like this. Synthetic data from a model you don’t own is all arguably distillation. Distillation is a shortcut to more compute for anyone. It’s also a far less risky cost, as having a big cluster for research requires a very large financial commitment, where APIs are pay-as-you-go. For example, in Olmo 3 we used millions of GPU hours on the Frontier supercomputer and Azure credits through NAIRR for synthetic data. We didn’t have the equivalent in GPUs (or really the cash, thank you research credits!). All together, it’s very fair for Anthropic to be concerned about this. I still wouldn’t say it is a crucial factor in these Chinese labs post-training capabilities, especially not one that’ll be easy to measure in a time gap to matching the model they’re distilling from a la the US-China performance lag. If we take a step back, there was even a time when Claude Sonnet was the flagship model ahead of Opus (I think this was with Sonnet 3.5), much of this comes from it being well distilled internally from Opus checkpoints. Fast iteration and high-quality data can go very

11 min
9 FEB

Opus 4.6, Codex 5.3, and the post-benchmark era

Last Thursday, February 5th, both OpenAI and Anthropic unveiled the next iterations of their models designed as coding assistants, GPT-5.3-Codex and Claude Opus 4.6, respectively. Ahead of this, Anthropic had a firm grasp of the mindshare as everyone collectively grappled with the new world of agents, primarily driven by a Claude Code with Opus 4.5-induced step change in performance. This post doesn’t unpack how software is changing forever, Moltbook is showcasing the future, ML research is accelerating, and the many broader implications, but rather how to assess, live with, and prepare for new models. The fine margins between Opus 4.6 and Codex 5.3 will be felt in many model versions this year, with Opus ahead in this matchup on usability. Going into these releases I’d been using Claude Code extensively as a general computer agent, with some software engineering and a lot of data analysis, automation, etc. I had dabbled with Codex 5.2 (usually on xhigh, maximum thinking effort), but found it not to quite work for me among my broad, horizontal set of tasks. For the last few days, I’ve been using both of the models much more evenly. I mean this as a great compliment, but Codex 5.3 feels much more Claude-like, where it’s much faster in its feedback and much more capable in a broad suite of tasks from git to data analysis (previous versions of Codex, including up to 5.2, regularly failed basic git operations like creating a fresh branch). Codex 5.3 takes a very important step towards Claude’s territory by having better product-market fit. This is a very important move for OpenAI and between the two models, Codex 5.3 feels far more different than its predecessors. OpenAI’s latest GPT, with this context, keeps an edge as a better coding model. It’s hard to describe this general statement precisely, and a lot of it is based on reading others’ work, but it seems to be a bit better at finding bugs and fixing things in codebases, such as the minimal algorithmic examples for my RLHF Book. In my experience, this is a minor edge, and the community thinks that this is most apparent in complex situations (i.e. not most vibe-coded apps). As users become better at supervising these new agents, having the best top-end ability in software understanding and creation could become a meaningful edge for Codex 5.3, but it is not an obvious advantage today. Many of my most trusted friends in the AI space swear by Codex because it can be just this tiny bit better. I haven’t been able to unlock it. Switching from Opus 4.6 to Codex 5.3 feels like I need to babysit the model in terms of more detailed descriptions when doing somewhat mundane tasks like “clean up this branch and push the PR.” I can trust Claude to understand the context of the fix and generally get it right, where Codex can skip files, put stuff in weird places, etc. Both of these releases feel like the companies pushing for capabilities and speed of execution in the models, but at the cost of some ease of use. I’ve found both Opus 4.6 and Codex 5.3 ignoring an instruction if I queue up multiple things to do — they’re really best when given well-scoped, clear problems (especially Codex). Claude Code’s harness has a terrible bug that makes subagents brick the terminal, where new messages say you must compact or clear, but compaction fails. Despite the massive step by Codex, they still have a large gap to close to Claude on the product side. Opus 4.6 is another step in the right direction, where Claude Code feels like a great experience. It’s approachable, it tends to work in the wide range of tasks I throw at it, and this’ll help them gain much broader adoption than Codex. If I’m going to recommend a coding agent to an audience who has limited-to-no software experience, it’s certainly going to be Claude. At a time when agents are just emerging into general use, this is a massive advantage, both in mindshare and feedback in terms of usage data. In the meantime, there’s no cut-and-dried guideline on which agent you need to use for any use-case, you need to use multiple models all the time and keep up with the skill that is managing agents. Interconnects AI is a reader-supported publication. Consider becoming a subscriber. Assessing models in 2026 There have been many hints through 2025 that we were heading toward an AI world where benchmarks associated with model releases no longer convey meaningful signal to users. Back in the time of the GPT-4 or Gemini 2.5 Pro releases, the benchmark deltas could be easily felt within the chatbot form factor of the day — models were more reliable, could do more tasks, etc. This continued through models like OpenAI’s o3. During this phase of AI’s buildout, roughly from 2023 to 2025, we were assembling the core functionality of modern language models: tool-use, extended reasoning, basic scaling, etc. The gains were obvious. It should be clear with the releases of both Opus 4.6 and Codex 5.3 that benchmark-based release reactions barely matter. For this release, I barely looked at the evaluation scores. I saw that Opus 4.6 had a bit better search scores and Codex 5.3 used far fewer tokens per answer, but neither of these were going to make me sure they were much better models. Each of the AI laboratories, and the media ecosystems covering them, have been on this transition away from standard evaluations at their own pace. The most telling example is the Gemini 3 Pro release in November of 2025. The collective vibe was Google is back in the lead. Kevin Roose, self-proclaimed “AGI-pilled” NYTimes reporter in SF said: There's sort of this feeling that Google, which kind of struggled in AI for a couple of years there — they had the launch of Bard and the first versions of Gemini, which had some issues — and I think they were seen as sort of catching up to the state of the art. And now the question is: is this them taking their crown back? We don’t need to dwell on the depths of Gemini’s current crisis, but they have effectively no impact at the frontier of coding agents, which as an area feels the most likely for dramatic strides in performance — dare I say, even many commonly accepted definitions of AGI that center around the notion of a “remote worker?” The timeline has left them behind 2 months after their coronation, showing Gemini 3 was hailed as a false king. On the other end of the spectrum is Anthropic. With Anthropic’s release of Claude 4 in May of 2025, I was skeptical of their bet on code — I was distracted by the glitz of OpenAI and Gemini trading blows with announcements like models achieving IMO Gold medals in mathematics or other evaluation breakthroughs. Anthropic deserves serious credit for the focus of its vision. They were likely not the only AI lab to note the coming role of agents, but they were by far the first to shift their messaging and prioritization towards this. In my post in June of 2025, a month after Claude 4 was released, I was coming around to them being right to deprioritize standard benchmarks: This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working. This leaves me reflecting on the role of Interconnects’ model reviews in 2026. 2025 was characterized by many dramatic, day-of model release blog posts, with the entry of many new Chinese open model builders, OpenAI’s first open language model since GPT-2, and of course the infinitely hyped GPT-5. These timely release posts still have great value — they center the conversation around the current snapshot of a company vis-a-vis the broader industry, but if models remain similar, they’ll do little to disentangle the complexity in mapping the current frontier of AI. In order to serve my role as an independent voice tracking the frontier models, I need to keep providing regular updates on how I’m using models, why, and why not. Over time, the industry is going to develop better ways of articulating the differences in agentic models. For the next few months, maybe even years, I expect the pace of progress to be so fast and uneven in agentic capabilities, that consistent testing and clear articulation will be the only way to monitor it. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

8 min

See All (143)

Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories. www.interconnects.ai

Creator

Nathan Lambert
Years Active

2023 - 2026
Episodes

143
Rating

Clean
Show Website

Interconnects

Technology

Technology

Every two weeks
Technology

Technology

Updated weekly
Technology

Technology

Updated weekly
Technology

Technology

Updated weekly
Technology

Technology

Updated daily
Technology

Technology

Updated weekly
Technology

Technology

Updated weekly

Interconnects

Gemma 4 and what makes an open model succeed

Lossy self-improvement

GPT 5.4 is a big step for Codex

What comes next with open models

Dean Ball on open models and government control

Olmo Hybrid and future LLM architectures

How much does distillation really matter for Chinese LLMs?

Opus 4.6, Codex 5.3, and the post-benchmark era

About

Information

You Might Also Like

Interconnects

Episodes

Gemma 4 and what makes an open model succeed

Lossy self-improvement

GPT 5.4 is a big step for Codex

What comes next with open models

Dean Ball on open models and government control

Olmo Hybrid and future LLM architectures

How much does distillation really matter for Chinese LLMs?

Opus 4.6, Codex 5.3, and the post-benchmark era

About

Information

You Might Also Like