Interconnects

Nathan Lambert

Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories. www.interconnects.ai

  1. 3 DAYS AGO

    Burning out

    One of the obvious topics of the Valley today is how hard everyone works. We’re inundated with comments on “The Great Lock In”, 996, 997, and now even a snarky 002 (midnight to midnight with a 2 hour break). Plenty of this is performative flexing on social media, but enough of it is real and reflecting how trends are unfolding in the LLM space. I’m affected. My friends are affected. All of this hard work is downstream of ever increasing pressure to be relevant in the most exciting technology of our generation. This is all reflective of the LLM game changing. The time window to be a player at the most cutting edge is actually a closing window, not just what feels like one. There are many different sizes and types of models that matter, but as the market is now more fleshed out with resources, all of them are facing a constantly rising bar in quality of technical output. People are racing to stay above the rising tide — often damning any hope of life balance. Interconnects is a reader-supported publication. Consider becoming a subscriber. AI is going down the path that other industries have before, but on steroids. There’s a famous section of the book Apple in China, where the author Patrick McGee describes the programs Apple put in place to save the marriages of engineers traveling so much to China and working incredible hours. In an interview on ChinaTalk, McGee added “Never mind the divorces, you need to look at the deaths.” This is a grim reality that is surely playing out in AI. The Wall Street Journal recently published a piece on how AI Workers Are Putting In 100-Hour Workweeks to Win the New Tech Arms Race. The opening of the article is excellent to capture how the last year or two has felt if you’re participating in the dance: Josh Batson no longer has time for social media. The AI researcher’s only comparable dopamine hit these days is on Anthropic’s Slack workplace-messaging channels, where he explores chatter about colleagues’ theories and experiments on large language models and architecture. Work addicts abound in AI. I often count myself, but take a lot of effort to make it such that work expands to fill available time and not that I fill everything in around work. This WSJ article had a bunch of crazy comments that show the mental limits of individuals and the culture they act in, such as: Several top researchers compared the circumstances to war. Comparing current AI research to war is out of touch (especially with the grounding of actual wars happening simultaneously to the AI race!). What they really are learning is that pursuing an activity in a collective environment at an elite level over multiple years is incredibly hard. It is! War is that and more. In the last few months I’ve been making an increasing number of analogies to how working at the sharp end of LLMs today is similar to training with a team to be elite athletes. The goals are far out and often singular, there are incredibly fine margins between success and failure, much of the grinding feels over tiny tasks that add up over time but you don’t want to do in the moment, and you can never quite know how well your process is working until you compare your outputs with your top competition, which only happens a few times a year in both sports and language modeling. In college I was a D1 lightweight rower at Cornell University. I walked onto a team and we ended up winning 3 championships in 4 years. 
Much of this was happenstance, as much greatness is, but it’s a crucial example in understanding how similar mentalities can apply in different domains across a life. My mindset around the LLM work I do today feels incredibly similar — complete focus and buy in — but I don’t think I’ve yet found a work environment where the culture is as cohesive as athletics. Where OpenAI’s culture is often described as culty, there are often many signs that the core team members there absolutely love it, even if they’re working 996, 997, or 002. When you love it, it doesn’t feel like work. This is the same as why training 20 hours a week while a full time student can feel easy. Many AI researchers can learn from athletics and appreciate the value of rest. Your mental acuity can drop off faster than your physical peak performance does when not rested. Working too hard forces you to take narrower and less creative approaches. The deeper into the hole of burnout I get in trying to make you the next Olmo model, the worse my writing gets. My ability to spot technical dead ends goes with it. If the intellectual payoffs to rest are hard to see, your schedule doesn’t have the space for creativity and insight. Crafting the team culture in both of these environments is incredibly difficult. It’s the quality of the team culture that determines the outcome more than the individual components. Yes, with LLMs you can take brief shortcuts by hiring talent with years of experience from another frontier lab, but that doesn’t change the long-term dynamic. Yes, you obviously need as much compute as you can get. At the same time, culture is incredibly fickle. It’s easier to lose than it is to build. Some argue that starting a new lab today can be an advantage against the established labs because you get to start from scratch with a cleaner codebase, but this is cope. Three core ingredients of training: Internal tools (recipes, code-bases, etc.), resources (compute, data), and personnel. Leadership sets the direction and culture, where management executes with this direction. All elements are crucial and cannot be overlooked. The further along the best models get, the harder starting from scratch is going to become. Eventually, this dynamic will shift back in favor of starting from scratch, because public knowhow and tooling will catch up, but in the meantime the closed tools are getting better at a far faster rate than the fully open tools. The likes of SSI, Thinky, and Reflection are likely the last efforts that are capitalized enough to maybe catch up in the near term, but the odds are not on their side. Getting infinite compute into a new company is meaningless if you don’t already have your code, data, and pretraining architectures ready. Eventually the clock will run out for company plans to be just catching up to the frontier, and then figure it out from there. The more these companies raise, the more the expectations on their first output will increase as well. It’s not an enviable position, but it’s certainly ambitious. In many ways I see the culture of Chinese technology companies (and education systems) as being better suited for this sort of catch up work. Many top AI researchers trained in the US want to work on a masterpiece, where what it takes in language modeling is often extended grinding to stabilize and replicate something that you know definitely can work. I used to think that the AI bubble would pop financially, as seen through a series of economic mergers, acquisitions, and similar deals. 
I’m shifting to see more limitations on the human capital than the financial capital thrown at today’s AI companies. As the technical standard of relevance increases (i.e. how good the models people want to use are, or the best open model of a given size category), it simply takes more focused work to get a model there. This work is hard to cheat in time. This all relates to how I, and other researchers, always comment on the low hanging fruit we see to keep improving the models. As the models have gotten better, our systems to build them have gotten more refined, complex, intricate, and numerically sensitive. While I see a similar amount of low-hanging fruit today as I did a year ago, the efforts (or physical resources, GPUs) it can take to unlock them have increased. This pushes people to keep going one step closer to their limits. This is piling on to more burnout. This is also why the WSJ reported that top researchers “said repeatedly that they work long hours by choice.” The best feel like they need to do this work or they’ll fall behind. It’s running one more experiment, running one more vibe test, reviewing one more colleague’s PR, reading one more paper, chasing down one more data contract. The to-do list is never empty. The amount of context that you need to keep in your brain to perform well in many LM training contexts is ever increasing. For example, leading post-training pipelines around the launch of ChatGPT looked like two or maybe three well separated training stages. Now there are tons of checkpoints flying around getting merged, sequenced, and chopped apart in part of the final project. Processes that used to be managed by one or two people now have teams coordinating many data and algorithmic efforts that are trying to land in just a few models a year. I’ve personally transitioned from a normal researcher to something like a tech lead who is always trying to predict blockers before they come up (at any point in the post-training process) and get resources to fix them. I bounce in and out of problems to wherever the most risk is. Cramming and keeping technical context pushes out hobbies and peace of mind. Training general language models you hope others will adopt — via open weights or API — is becoming very much an all-in or all-out domain. Half-assing it is becoming an expensive way to make a model that no one will use. This wasn’t the case two years ago, where playing around with a certain part of the pipeline was legitimately impactful. Culture is a fine line between performance and toxicity, and it’s often hard to know which you are until you get to a major deliverable to check in versus competitors. Personally, I’m fighting off a double-edged sword of this. I feel immense responsibility to make all the future Olmo models of the world great, while simultaneously trying to do a substantial amount of ecosystem work to create an informed discussion around the

    10 min
  2. 20 OCT

    How to scale RL

    Two quick housekeeping items before I get to the post. 1. I’ll be in SF this week for the PyTorch conference (22-23), AI Infra Summit (21st), and other local events. Come say hi. 2. I launched a new Substack AI bundle with 8 of my favorite publications packaged together for teams of 20+. Learn more at readsail.com. Onto the post! “Scaling reinforcement learning (RL)” is the zeitgeisty way to capture the next steps in improving frontier models — everyone is staring at the same hill they plan on climbing. How these different groups are approaching the problem has been a poorly kept secret. It’s a simple idea, but one that’s hard to copy: predicting the trajectory of the learning curve. There have been two reasons this is hard to copy for academics, which will be solved on different time scales: * The lack of stable RL training setups. There are many RL libraries being developed in parallel and the community has collectively made them much more ready for big RL runs over the summer. * The lack of compute for experimentation. These aren’t new stories. In many ways they mirror the progression of open Mixture of Experts (MoE) models, which still lag far behind the codebases within top AI laboratories because closing the gap involves overcoming substantial engineering headaches in an expensive experimentation regime. Scaling RL has been shaping up the same way, but it turns out to be a bit more approachable. Last week we got the first definitive paper on scaling RL. It proposes a clear method to extrapolate RL learning curves over compute scales and sets a baseline for the order of compute that should be spent to reach top-end performance. The paper, The Art of Scaling Reinforcement Learning Compute for LLMs (Khatri & Madaan et al. 2025), referred to as ScaleRL, is a must-read for anyone looking to understand the absolute cutting edge of RL algorithms and infrastructure. For some personal context, for all of 2025 we’ve had our main Slack channel in the reasoning space at Ai2 called “scaling-rl” because of how essential we knew the first clear piece of work in this area would be. This post covers the key details and what I see coming next. There are two key things you need to know about these RL scaling laws, even if the lower-level RL math is confusing to you. First is how they intuitively work and what they’re actually predicting. Second is how they compare to the pretraining scaling laws we know and love. To the first point, the approach entails taking one (or a handful) of your key base models, running a bit of RL on each of them, and forecasting the shape of the curve across many stable runs so that, for your big run, you can predict the final performance. The shape of RL runs that motivates this is familiar: your model often makes ~80% of its accuracy gain in the first few steps, and you wonder what the final performance would be if you trained on your entire dataset. The authors define three constants that they fit: A for the peak performance (accuracy on a subset of your training dataset, i.e. the validation set), B for the slope of the sigmoid curve, and C for compute on the x-axis. You then take a set of RL training jobs and fit a regression that predicts the last chunk of real training points given the early measurements of accuracy over time.
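    To make the fitting-and-extrapolating step concrete, here is a minimal sketch in Python. The sigmoid parameterization, the numbers, and the GPU-hour units are illustrative assumptions on my part rather than the paper’s exact formulation; the point is the workflow of fitting (A, B, C) on the early part of a run and extrapolating to the full budget.

```python
# Illustrative sketch only: fit a saturating sigmoid (asymptote A, slope B,
# compute midpoint C) to the early part of an RL run, then extrapolate.
import numpy as np
from scipy.optimize import curve_fit

def rl_curve(compute, A, B, C):
    # Sigmoid in log-compute: approaches A as compute grows, B sets the slope,
    # C is roughly the compute at which you reach half the asymptotic accuracy.
    return A / (1.0 + (C / compute) ** B)

# Early measurements from one recipe: (GPU-hours, validation accuracy). Made-up numbers.
compute = np.array([50.0, 100.0, 200.0, 400.0, 800.0, 1600.0])
accuracy = np.array([0.16, 0.23, 0.31, 0.38, 0.44, 0.48])

# Fit the three constants on the observed prefix of the run.
(A, B, C), _ = curve_fit(rl_curve, compute, accuracy,
                         p0=[0.6, 1.0, 300.0], bounds=([0, 0, 1], [1, 10, 1e6]))

# Extrapolate to the full compute budget to compare candidate recipes before committing.
full_budget = 25_000.0
print(f"asymptote A={A:.2f}, predicted accuracy at {full_budget:,.0f} GPU-hours: "
      f"{rl_curve(full_budget, A, B, C):.2f}")
```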
Then, you can compare the predicted final performance of your future RL ablations on that starting model by understanding the normal shape of your RL learning curves. Second is to consider how this compares to pretraining scaling laws. These are very far from the deeply insightful power law relating downstream test loss to pretraining compute — accuracy on RL training datasets is a far more bounded measure than next token prediction. The RL scaling laws are most useful for ablating design choices, relative to pointing to something fundamental about the nature of models. In many ways, scaling laws for pretraining could’ve been viewed this way at the beginning, too, so we’ll see how RL evolves from here. With that difference, scaling laws for RL will play a very different role in training leading models than the pretraining scaling laws we have today. The pretraining laws are about choosing the exact configuration for your big pretraining run (that you can’t really run a meaningful chunk of to debug at all), where RL is more about ablating which algorithm you’ll let run much longer. In pretraining many decisions depend on your budget and scaling laws can give the answer. Your training compute, communication bottlenecks, maximum run time, data availability, etc. all define a certain model window. Scaling laws for RL may inform this very soon, but for now it's best to think about scaling laws as a way to extract the maximum performance from a given base model. For all of these reasons, scaling RL is more like an art, as the authors say it, because it’s about finding the run that’ll get the last few percentage points of performance when let run over an extra order of magnitude (or two) of samples. It’s a fine grained way to extrapolate RL curves — which have a standard shape of a quick rise then a slow saturation. In practice, the authors fit curves over 1/4 of their training compute to predict the outcome after the remaining 3/4 of GPU hours. The limits of scaling laws will likely be pushed further in the future (and I don’t have a good heuristic for what percentage of compute is used for establishing pretraining scaling laws, versus what is deployed in the final run, comment if you do!). From here, the paper quickly gets technical, serving as a check in on the major ideas that dominated the RL research ecosystem in the last 6 months. This paper blesses those as important or not when it comes to scaled up RL training. This fits a recurring trend across language modeling in the last few years: Most of the key ideas are out there, but open labs tend to not have the resources to put them all together in the right configuration. This sort of slow accumulation of knowledge takes an organizational intensity, clarity, and ability that is hard for small research groups to match. Interconnects is a reader-supported publication. Consider becoming a subscriber. There are a few key ideas that stand out to me as worth knowing and betting on following this paper: * Algorithmic advancements: The paper is very favorable on, arguably painting them as essential, some recent algorithms or advancements. These include truncated importance sampling (TIS), Group Sequence Policy Optimization (GSPO), and Clipped IS-weight Policy Optimization (CISPO) via the MiniMax M1 paper. More on these in a second. * Systems improvements: The authors highlight PipeLine RL (paper or repository) as the canonical reference for the combination of in-flight updates — i.e. 
changing model weights within one very long generation — and continuous batching — i.e. filling your RL batch over time until you have enough prompts for a learning step — which together represent 4X+ improvements over standard RL implementations on LLMs in terms of throughput. What this looks like in terms of idle GPUs is below, from the ServiceNow paper. Intuitively, think about what happens if you were to ask 8 different questions to an LLM simultaneously. Some of these would finish early and some would take a long time. If you allocate your GPUs such that they have to finish all 8 questions before moving onto the next stack of questions, inevitably there will be GPUs idle when you’re waiting for the last answer. Instead, continuous batching pulls in new questions all the time when the GPUs have cycles to do more processing. Though, this is more complicated in the RL setup because after every 8 (or your batch size) of questions you need to update your RL weights. Can you still do this and fill in new questions all the time to the GPUs? What happens to that one question that is taking forever? In-flight updates is the solution to this. What is literally happening is that the model is updated in the middle of the generation. The models and RL systems just handle this seamlessly, and it removes a ton of idle time in matching the inference weights to the new updates from your RL algorithm.Not having a few key details like this will make big RL runs not only more expensive in GPUs, but more importantly in time. A 1 day feedback cycle vs 4 days makes for a very different research setup. We have these two features in Open Instruct, our post training repo at Ai2, as do many other RL libraries. A lot of this is fixing numerics, which is far harder with Mixture of Experts (MoE) models, and something that most open RL research hasn’t touched. This hunt for numerical stability is a common rumor for why Thinking Machines put out the deterministic VLLM blog post ahead of releasing their Tinker API — deterministic VLLM could be their forward pass. Back to algorithms. Ross Taylor summarized the various eras of RL algorithms that the community has gone through in 2025. First was the transition from vanilla GRPO to the likes of DAPO (see my earlier post on GRPO tricks or my YouTube video on them too), which noticed issues with the clipping formulation and biases in the GRPO advantage calculation. The next class of algorithms are those cited in this ScaleRL paper, CISPO and a general class of Truncated Importance Sampling (TIS) approaches, that are designed for sequence level optimization (often closer to vanilla policy gradient) that account for the probability delta between actor (the GPUs generating completions for RL, often something fast like VLLM) and learner (the GPUs performing gradient updates, in a different library). This importance sampling term seems to be essential to getting modern RL infrastructure right, as without it, scaling to more complex systems is hard to get numerical stability w

    13 min
  3. 16 OCT

    The State of Open Models

    This talk covers everything that’s happened this year in the open model landscape — DeepSeek kickstarting the Chinese open model norms, Llama’s fade, Qwen’s dominance, GPT-OSS — and what comes next. It is my attempt to share what people need to know about where open models are heading, building on all of my research here at Interconnects and in my day job of training these models, in order to help us take the actions we need to steer that trajectory in a better direction. I strongly recommend watching (or listening, as it’s in the podcast feed) if any of the discussions around open models or Chinese AI impact your decision making. This felt like one of the better talks I’ve given in a bit and I’m excited to keep expanding my coverage here. You can click through the slides here. Thanks to the organizers of The Curve for inviting me (and encouraging me to give this talk), and for permission to post this video. EDIT: I noticed the audio sometimes jumps weirdly; I’m not sure what caused it (from the SlidesLive export, raw is here: https://slideslive.com/39046297/open-models-in-2025-stakes-state-and-strategy). Chapters: 00:00 2025 so far | 05:53 China takes the lead | 15:54 What comes next | 21:20 What we should do | 25:00 Q & A (the podcast feed / audio-only version trims 7 seconds of silence at the start). References & Recommended Reading: * The ATOM Project * On China’s open-source community & trajectory * Ranking China’s open AI labs * On GPT-OSS * Recent open models * More on The Curve conference. Of course, you can watch on YouTube: Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

    47 min
  4. 7 OCT

    Thoughts on The Curve

    I spent the weekend debating AI timelines, among other things, at The Curve conference. This translates as spending the weekend thinking about the trajectory of AI progress with a mix of DC and SF types. This is a worthwhile event that served as a great, high-bandwidth way to check in on timelines and expectations of the AI industry. Updating timelines My most striking takeaway is that the AI 2027 sequence of events, from AI models automating research engineers to later automating AI research, and potentially a singularity if your reasoning is so inclined, is becoming a standard by which many debates on AI progress operate under and tinker with. It’s good that many people are taking the long term seriously, but there’s a risk in so many people assuming a certain sequence of events is a sure thing and only debating the timeframe by which they arrive. I’ve documented my views on the near term of AI progress and not much has changed, but through repetition I’m developing a more refined version of the arguments. I add this depth to my takes in this post. I think automating the “AI Research Engineer (RE)” is doable in the 3-7 year range — meaning the person that takes a research idea, implements it, and compares it against existing baselines is entirely an AI that the “scientists” will interface with. In some areas the RE is arguably already automated. Within 2 years a lot of academic AI research engineering will be automated with the top end of tools — I’m not sure academics will have access to these top end of tools but that is a separate question. An example I would give is coming up with a new optimizer and testing it on a series of ML baselines from 100M to 10B parameters. At this time I don’t expect the models to be able to implement the newest problems the frontier labs are facing alone. I also expect academics to be fully priced out from these tools. Within 1-3 years we’ll have tools that make existing REs unbelievably productive (80-90% automated), but there are still meaningful technical bottlenecks that are solvable but expensive. The compute increase per available user has a ceiling too. Labs will be spending $200k+ per year per employee on AI tools easily (ie the inference cost), but most consumers will be at tiers of $20k or less due to compute scarcity. Within 3-4 years the augmented research engineers will be able to test any idea that the scientists come up with at the frontier labs, but many complex system problems will need some (maybe minimal) amount of human oversight. Examples would include modifying RL implementations for extremely long horizon tasks or wacky new ideas on continual learning. This is so far out that the type of research idea almost isn’t worth speculating on. These long timelines are strongly based on the fact that the category of research engineering is too broad. Some parts of the RE job will be fully automated next year, and more the next. To check the box of automation the entire role needs to be replaced. What is more likely over the next few years, each engineer is doing way more work and the job description evolves substantially. I make this callout on full automation because it is required for the distribution of outcomes that look like a singularity due to the need to remove the human bottleneck for an ever accelerating pace of progress. This is a point to reinforce that I am currently confident in a singularity not happening. Up-skilling employees as their roles become irrelevant creates a very different dynamic. 
The sustained progress on code performance over the next few years will create a constant feeling of change across the technology industry. The range of performance in software is very high and it is possible to perceive relatively small incremental improvements. These are very complex positions to hold, so they’re not that useful as rhetorical devices. Code is on track to being solved, but the compute limits and ever increasing complexity of codebases and projects (ie. LLMs) is going to make the dynamic very different than the succinct assumptions of AI 2027. To reiterate, the most important part of automation in the discussion is often neglected. To automate someone you need to outcompete the pairing of a human with the tool too. Onto the even trickier argument in the AI 2027 standard — automating AI research altogether. At the same time as the first examples of AI systems writing accepted papers at notable AI venues, I’m going to be here arguing that full automation of AI research isn’t coming anytime soon. It’s daunting to try and hold (and explain) this position, and it relies on all the messy firsthand knowledge of science that I have and how it is different in academia versus frontier AI labs. For one, the level and type of execution at frontier labs relative to academic research is extremely different. Academia also has a dramatically higher variance in quality of work that is accepted within the community. For this reason, we’re going to be seeing incredible disruption at standard academic venues in the very near future, but the nature of science at frontier labs will remain heavily intertwined with human personalities. Models will be good at some types of science, such as taking two existing fields and merging ideas and seeing what happens, but awful at what I consider to be the most idolized version of science, being immersed in the state of the art and having a brilliant insight that makes anywhere from a ripple causing small performance gain to a tsunami reshaping the field. I don’t think AI will fully automate our current notion of an AI researcher in the next 5-10 years, but it could reshape what science means altogether and make that role far less relevant to progress. The researchers grinding out new datasets at frontier labs will have dramatic help on data processing scripts. The researchers coming up with new algorithmic ideas will not expand the rate at which they come up with ideas too much, but their ability to test them is far higher. A large part of science is a social marketplace of ideas. Convincing your colleagues that you are right and to help you double down on it is not going to change in its core nature. Everyone will have superpowers on making evidence to support their claims, but the relative power there stays the same. At a dinner during The Curve I went through a lot of these points with Ryan Greenblatt, Chief Scientist at Redwood Research, and a point he made stuck with me. He summarized my points as thinking the increase in performance from these largely engineering tooling improvements will be equalled out by challenges of scaling compute, so the resulting progress will feel much more linear rather than exponential. A lot of our discussions on automation we agree on, with slightly different timelines, but it didn’t feel like it captured my entire point of view. What is missing is that I expect an inherent slowdown as our AI models get more complicated. Our models today needs tools, more complex serving systems, products to wrap them, and so on. 
This is very different than the age when just model weights were needed for the cutting edge of AI. There’s an inevitable curse of complexity, a death by a thousand cuts, that is going to add on top of the obvious compute costs to slow down progress. 2026 will be a big year on the compute rollout front, and shipping meaningful improvements to users will be essential to funding the progress that comes after. I’m not sure the economy can keep shifting even more of its weight behind AI progress, where most people bought into fast timelines think of it as a default position. Peter Wildeford wrote a summary of the situation that I resonate with: Here’s how I think the AI buildout will go down. Currently the world doesn’t have any operational 1GW+ data centers. However, it is very likely we will see fully operational 1GW data centers before mid-2026. This likely will be a part of 45-60GW of total compute across Meta, Microsoft, Amazon/AWS/Anthropic, OpenAI/Oracle, Google/DeepMind, and xAI. My median expectation is these largest ~1GW data center facilities will hold ~400,000-500,000 Nvidia Blackwell chips and be used to train ~4e27 FLOP model sometime before the end of 2027. Such a model would be 10x larger than the largest model today and 100x larger than GPT-4. Each individual 1GW facility would cost ~$40B to manufacture, with ~$350B total industry spend across 2026. He continues with estimates for 2028, and saying he’s fuzzy on 2029, but my fuzziness cuts in a bit earlier depending on adoption and performance across the AI industry. Where I feel like in the long run it’ll look like a very consistent pace of progress, that feels like a bunch of big jumps and periods of stagnation in the short-term. I have fairly large error bars on how the price of intelligence — and therefore adoption — is going to evolve over the next 2-4 years, with it obviously becoming far cheaper over the following decades. As for my recent articles on timelines and key debates in the field, I encourage people to comment and dig in on what I wrote below. Interconnects is a reader-supported publication. Consider becoming a subscriber. Other thoughts Something crazy about this conference is no one is talking about how the models actually work or are trained, and everyone here is totally convinced that AGI is coming soon. One of my new friends at the conference described this tendency as “an obsession with the problem.” This is a feeling that many AI obsessors are more interested in where the technology is going rather than how or what exactly it is going to be. Helen Toner gave a great talk at The Curve related to this, arguing how the current and future jaggedness of AI — the fact that similarly difficult tasks when assigned to a human will either be easily mastered by AI or barely showing any competence (her will appear later on her great Sub

    12 min
  5. 30 SEPT

    ChatGPT: The Agentic App

    Ever since ChatGPT exploded in popularity, there has been a looming “how” to its monetization plans. Much has been said about shopping and advertising as the likely paths, especially with Fidji Simo joining as CEO of Applications under Sam Altman. Advertising as a business model for AI is logical but difficult to personalize and specialize. We know tons of people spend a lot of time using AI models, but how do you best get the sponsored content into the outputs? This is an open technical problem, with early efforts from the likes of Perplexity falling short. Shopping is another, but the questions have long been whether AI models actually have the precision to find the items you want, to learn exactly what you love, and to navigate the web to handle all the corner cases of checkouts. These reflect a need for increased capabilities on known AI benchmarks, rather than inventing a new way of serving ads. OpenAI’s o3 model was a major step up in search functionality, showing it was viable; the integration was either a business problem — where OpenAI had to make deals — or an AI one — where ChatGPT wasn’t good enough at managing websites for you. Yesterday, ChatGPT launched its first integrated shopping push with Buy It in ChatGPT, a simple checkout experience, and an integrated commerce backend built on the Agentic Commerce Protocol (ACP). The announcement comes with the perfect partners to complement the strengths of OpenAI’s current models. GPT-5-Thinking is the best at finding niche content on the web, and ChatGPT’s launch partner for shopping is Shopify (*soon, Etsy is available today), the home to the long tail of e-commerce merchants of niche specialties. If this works, it will let users actively uncover exactly what they are looking for — from places that were often hard to impossible to find on Google. This synergy is a theme we’ll see reoccur in other agents of the future. The perfect model doesn’t make a useful application unless it has the information or sandbox it needs to think, search, and act. The crucial piece that is changing is that where models act is just as important as the weights themselves — in the case of shopping, it is the network of stores with their own rankings and API. The ACP was built in collaboration with Stripe, and both companies stand to benefit from this. Stripe wants more companies to build on the ACP so that its tools become the “open standard for agentic payments” and OpenAI wants the long-tail of stores to adopt it so they can add them to their ever-growing internal recommendation (or search) engine. The business model is simple, as OpenAI says “Merchants pay a small fee on completed purchases.” OpenAI likely takes a larger share than Stripe, and it is a share that can grow as their leverage increases over shoppers. I’m cautiously optimistic about this. Finding great stuff to buy on the web is as hard as it has ever been. Users are faced with the gamification of Google search for shopping and the enshittification of the physical goods crowding out Amazon. Many of the best items to buy are found through services like Meta’s targeted ads, but the cost of getting what you want should not be borne through forced distraction. OpenAI will not be immune to the forces that drove these companies to imperfect offerings, but they’ll come at them with a fresh perspective on recurring issues in technology. If this works for OpenAI, they have no competitor. 
They have a distribution network of nearly 1B weekly users and no peer company ready to serve agentic models at this scale. Yes, Google can change its search feed, but the thoroughness of models like GPT-5 Thinking is on a totally different level than Google search. This agentic model is set up to make ChatGPT the one Agentic App across all domains. The idea of an agentic model, and really the GPT-5 router itself, shows us how the grand idea of one giant model that’s the best for every conceivable use-case is crumbling. OpenAI only chooses the more expensive thinking model when it deems a free user to need it and they have an entirely different model for their coding products. On the other hand, Claude released their latest model, Claude 4.5 Sonnet, yesterday as well, optimizing their coding peak performance and speed yet again — they have no extended model family. The reality that different models serve very different use-cases and how AI companies need to decide and commit to a certain subset of them for their development points to a future with a variety of model providers. Where coding is where you can feel the frontier of AI’s raw intelligence or capabilities, and Anthropic has turned their entire development towards it, the type of model that is needed for monetization of a general consumer market could be very different. This is the web-agent that OpenAI has had the industry-leading version of for about 6 months. Specialization is making the AI market far more interesting, as companies like OpenAI and Google have been in lockstep with their offerings for years. Every company would drop the same model modalities with approximately the same capabilities. Now, as hill-climbing benchmarks are no longer providing immediate user value, especially in text domains, the vision for each AI company is more nuanced. I predicted this earlier in the summer, in my post on what comes next: This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step. What I missed is that this applies downward pressure on the number of models labs will release — the value can be more in the integrations and applications than the model itself. Expect releases like today, where Claude released Claude Sonnet 4.5 along with version 2 of Claude Code. The period will still be busy as the industry is on the tail end of the low hanging fruit provided by reasoning models, but over time the hype of model releases themselves will be harder to conjure. Interconnects is a reader-supported publication. Consider becoming a subscriber. Let’s consider the applications that are rolling out today on top of different models. If you haven’t pushed the limits of GPT-5-Thinking, and better yet GPT-5-Pro, for search you really need to, it’s a transformative way of using compute that can find many buried corners of the web. In terms of untapped model capability value, the abilities of search-heavy thinking models like GPT-5 seem far higher than coding agents, which are obviously heavily used. Search-heavy models are an entirely new use, where coding models were the first widespread LLM-based product. As coding agents become more autonomous, they’ll continue to flex and mold a new form for the software industry, but this will be a slow co-evolution. 
OpenAI is going to focus on its vertical Agentic App where Anthropic (and likely Gemini with Google Cloud) are going to power the long-tail of AI applications reshaping the web and the rest of work. OpenAI will only expand from here. Email, scheduling, travel bookings, and more everyday digital tasks are surely on their roadmap. Their biggest competitor is themselves — and whether their vision can be crafted into something people actually use. If shopping doesn’t work out as the vertical that lets them realize their valuation, they’re positioned to keep trying more. OpenAI has both the lead in the variety of models that power these agentic information tasks and the user base to incentivize companies to collaborate with them. The application paradigm that dominated the mobile era is going to rebound. AI applications started in a form where the user needed to be heavily involved in the work process. The first beneficiaries of this were IDEs and terminal tools. Both of these workplaces allow in-depth and detailed inspection of the process and results. The cutting edge of AI will still work there, but the long tail of casual use will all shift to the standard mode of applications — siloed, simple, and scalable in the cloud. The simpler an AI application is, the wider its potential audience. With this addition of shopping, OpenAI is poised to launch a standalone TikTok-style app with the release of its next video generation model, Sora 2, soon after Meta launched Vibes in their Meta AI app for only AI generated videos with a specific theme to start. At the same time, OpenAI’s Codex web agent is available in the ChatGPT application, which represents an even bigger change in the nature of software work than the addition of coding agents — it allows real websites, and soon businesses, to be built with only a prompt on your phone. In 6-12 months, these agentic applications that feel rough around the edges due to the quality of the AI today, rather than the interface, are going to feel seamless and second-nature to use, despite their complete novelty relative to the past decades of technology. If OpenAI is positioning itself to be The Agentic App, this also opens the door to the near future where many applications we use today shift to an agentic era. Want to schedule a meeting with someone? Let the Google Calendar agent handle that (or some startup that beats them to it). Your email application can find who the next client is and remind them of their appointment. The Banking App will file your taxes in one prompt. The list of these is infinite and across a wide spectrum of difficulty. OpenAI wants to be the one app, The Agentic App, that serves all of these, and the rest of the industry is racing to master their specific vertical before OpenAI gets there. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

    9 min
  6. 22 SEPT

    Thinking, Searching, and Acting

    The weaknesses of today’s best models are far from those of the original ChatGPT — we see they lack speed, we fear superhuman persuasion, and we aspire for our models to be more autonomous. These models are all reasoning models that have long surpassed the original weaknesses of ChatGPT-era language models, hallucinations, total lack of recent information, complete capitulations, and other hiccups that looked like minor forms of delusion laid on top of an obviously spectacular new technology. Reasoning models today are far more complex than the original chatbots that consisted of standalone model weights (and other lightweight scaffolding such as safety filters). They're built on three primitives that'll be around for years to come: * Thinking: The reasoning traces that enabled inference-time scaling. The "thoughts" of a reasoning model take a very different form than those of humans that inspired the terminology used like Chain of Thought (CoT) or Thinking models. * Searching: The ability to request more, specific information from non-parametric knowledge stores designed specifically for the model. This fills the void set by how model weights are static but living in a dynamic world. * Acting: The ability for models to manipulate the physical or digital world. Everything from code-execution now to real robotics in the future allow language models to contact reality and overcome their nondeterministic core. Most of these executable environments are going to build on top of infrastructure for coding agents. These reasoning language models, as a form of technology are going to last far longer than the static model weights that predated and birthed ChatGPT. Sitting just over a year out from the release of OpenAI's o1-preview on September 12, 2024, the magnitude of this is important to write in ink. Early reasoning models with astounding evaluation scores were greeted with resounding criticism of “they won’t generalize,” but that has turned out to be resoundingly false. In fact, with OpenAI's o3, it only took 3-6 months for these primitives to converge! Still, it took the AI industry more broadly a longer time to converge on this. The most similar follow-up on the search front was xAI's Grok 4 and some frontier models such as Claude 4 express their reasoning model nature in a far more nuanced manner. OpenAI's o3 (and GPT-5 Thinking, a.k.a. Research Goblin) and xAI's Grok 4 models seem like a dog determined to chase their goal indefinitely and burn substantial compute along the way. Claude 4 has a much softer touch, resulting in a model that is a bit less adept at search, but almost always returns a faster answer. The long-reasoning traces and tool use can be crafted to fit different profiles, giving us a spectrum of reasoning models. The taxonomy that I laid out this summer for next-generation reasoning models — skills for reasoning intelligence, calibration to not overthink, strategy to choose the right solutions, and abstraction to break them down — are the traits that'll make a model most functional given this new perspective and agentic world. The manner of these changes are easy to miss. For one, consider hallucinations, which are an obvious weakness downstream of the stochastic inference innate to the models and their fixed date cutoff. With search, hallucinations are now missing context rather than blatantly incorrect content. Language models are nearly-perfect at copying content and similarly solid at referencing it, but they're still very flawed at long-context understanding. 
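    To make the three primitives concrete before returning to hallucinations, here is a toy sketch of how they compose into one loop. The generate, search, and run_code hooks are hypothetical stand-ins I am assuming for illustration, not any particular lab’s API.

```python
# Toy agent loop built on the three primitives: thinking, searching, acting.
# All names here (generate, search, run_code) are hypothetical stand-ins.
from typing import Callable

def agent_loop(question: str,
               generate: Callable[[str], str],   # Thinking: the model emits its next reasoning step
               search: Callable[[str], str],     # Searching: query a non-parametric knowledge store
               run_code: Callable[[str], str],   # Acting: execute code against the world
               max_steps: int = 8) -> str:
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = generate(context)            # the thinking tokens for this turn
        context += step + "\n"
        if step.startswith("SEARCH:"):      # the model asked for fresh information
            context += search(step.removeprefix("SEARCH:").strip()) + "\n"
        elif step.startswith("RUN:"):       # the model asked to act by executing code
            context += run_code(step.removeprefix("RUN:").strip()) + "\n"
        elif step.startswith("FINAL:"):     # the model decided it is done
            return step.removeprefix("FINAL:").strip()
    return step  # fall back to the last thought if no final answer was produced
```

    The control flow itself is trivial; the important part is that search results and execution results are folded back into the context the model thinks over next.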
Hallucinations still matter, but it’s a very different chapter of the story and will be studied differently depending on if it is for reasoning or non-reasoning language models. Non-reasoning models still have a crucial part to play in the AI economy due to their efficiency and simplicity. They are part of a reasoning model in a way because you can always use the weights without tools and they'll be used extensively to undergird the digital economy. At the same time, the frontier AI models (and systems) of the coming years will all be reasoning models as presented above — thinking, searching, and acting. Language models will get access to more tools of some form, but all of them will be subsets of code or search. In fact, search can be argued to be a form of execution itself, but given the imperative of the underlying information it is best left as its own category. Another popular discussion with the extremely-long generations of reasoning models has been the idea that maybe more efficient architectures such as diffusion language models could come to dominate by generating all the tokens in parallel. The (or rather, one) problem here is that they cannot easily integrate tools, such as search or execution, in the same way. These’ll also likely be valuable options in the AI quiver, but barring a true architectural or algorithmic revolution that multiplies the performance of today’s AI models, the efficiency and co-design underway for large transformers will enable the most dynamic reasoning models. Interconnects is a reader-supported publication. Consider becoming a subscriber. With establishing what makes a reasoning model complete comes an important mental transition in what it takes to make a good model. Now, the quality of the tools that a model is embedded with is arguably something that can be more straightforward to improve than the model — it just takes substantial engineering effort — and is far harder with open models. The AI “modeling” itself is mostly open-ended research. Closed models have the benefit of controlling the entire user experience with the stack, where open models need to be designed so that anyone can take the weights off of HuggingFace and easily get a great experience deploying it with open-source libraries like VLLM or SGLang. When it comes to tools used during inference, this means that the models can have a recommended setting that works best, but they may take time to support meaningful generalization with respect to new tools. For example, OpenAI can train and serve their models with only one search engine, where I at Ai2 will likely train with one search engine and then release the model into a competitive space of many search products. A space where this can benefit open models could be something like MCP, where open models are developed innately for a world where we cannot know all the uses of our models, making something like MCP libraries a great candidate for testing. Of course, leading AI laboratories will (or have already started) do this, but the ranking will be different in a priority stack. Much has been said about tokenomics and costs associated with reasoning models, without taking the tool component into account. There was a very popular article articulating how models are only getting more expensive, with a particular focus on reasoning models using far more tokens. This is overstating a blip, a point in time when serving costs increased by 1000x for models by generating vastly more tokens, but without improved hardware. 
The change in cost of reasoning models reflected a one-time step up in most circumstances where the field collectively turned on inference-time scaling by using the same reasoning techniques. At the same time as the reasoning model explosion, the size of models reaching users in parameter count has all but stagnated. This is due to diminishing returns in quality due to scaling parameters — it’s why OpenAI said GPT 4.5 wasn’t a frontier model and why Gemini never released their Ultra model class. The same will come for reasoning tokens. While diminishing returns are hitting reasoning token amount for serial streams, we’re finally seeing large clusters of Nvidia’s Blackwell GPUs come online. The costs for models seem well on path to level out and then decrease as the industry develops more efficient inference systems — the technology industry is phenomenal at making widely used products far cheaper year over year. The costs that’ll go up are the agents that are enabled by these reasoning models, especially with parallel inference, such as the Claude Code clones or OpenAI’s rumored Pro products. What we all need is a SemiAnalysis article explaining how distorted standard tokenomics are for inference with tools and if tools substantially increase variance in implementations. People focus too much on the higher token costs from big models with long context lengths, those are easy to fix with better GPUs, while there are many other costs such as search indices or idle GPU time waiting for tool execution results. When we look at a modern reasoning model, it is easy to fixate on the thinking token aspects that give the models their name. At the same time, search and execution are such fundamental primitives to modern language models that they can rightfully stand on their own as pillars of modern AI. These are AI systems that substantially depend on the quality of the complex inference stack far more than getting the right YOLO run for the world’s best model weights. The cause of thinking, searching, and acting all being looped in as a “reasoning model” is that this inference-time scaling with meandering chains of thought was the technological innovation that made both search and execution far more functional. Reasoning was the step change event that set these three as technology standards. The industry is in its early days of building out fundamental infrastructure to enable them, which manifests as the early days of language model agents. The infrastructure pairs deterministic computing and search with the beauty, power, and flexibility of the probabilistic models we fell in love with via ChatGPT. This reasoning model layer is shaping up to be the infrastructure that underpins the greatest successes of the future technology industry. This is a public episode. If you'd like to discuss this with other subscribers o

    9 min
  7. 18 SEPT

    Coding as the epicenter of AI progress and the path to general agents

    Coding, due to its breadth of use-cases, is arguably the last tractable, general domain of continued progress for frontier models that most people can interface with. This is a bold claim, so let’s consider some of the other crucial capabilities covered in the discourse of frontier models: * Chat and the quality of prose written by models has leveled off, other than finetuning to user measures such as sycophancy. * Mathematics has incredible results, but very few people directly gain from better theoretical mathematics. * The AIs’ abilities to do novel science are too unproven to be arguable as a target of hillclimbing. Still, coding is a domain where the models are already incredibly useful, and they continue to consistently stack on meaningful improvements. Working daily with AI over the last few years across side projects and as an AI researcher, it has been easy to take these coding abilities for granted because some forms of them have been around for so long. We punt a bug into ChatGPT and it can solve it or autocomplete can tab our way through entire boilerplate. These use-cases sound benign, and haven’t changed much in that description as they have climbed dramatically in capabilities. Punting a niche problem in 1000+ lines of code to GPT-5-Pro or Gemini Deep Think feels like a very fair strategy. They really can sometimes solve problems that a teammate or I were stuck on for hours to days. We’re progressing through this summarized list of capabilities: * Function completion: ~2021, original Github CoPilot (Codex) * Scripting: ~2022, ChatGPT * Building small projects: ~2025, CLI agents * Building complex production codebases, ~2027 (estimate, which will vary by the codebase) Coding is maybe the only domain of AI use where I’ve felt this slow, gradual improvement. Chat quality has been “good enough” since GPT-4, search showed up and has been remarkable since OpenAI’s o3. Through all of these more exciting moments, AIs’ coding abilities have just continued to gradually improve. Now, many of us are starting to learn a new way of working with AI through these new command-line code agents. This is the largest increase in AI coding abilities in the last few years. The problem is the increase isn’t in the same domain where most people are used to working with AI, so the adoption of the progress is far slower. New applications are rapidly building users and existing distribution networks barely apply. The best way to work with them — and I’ll share more examples of what I’ve already built later in this post — is to construct mini projects, whether it’s a new bespoke website or a script. These are fantastic tools for entrepreneurs and researchers who need a way to quickly flesh out an idea. Things that would’ve taken me days to weeks can now be attempted in hours. Within this, the amount of real “looking at the code” that needs to be done is definitely going down. Coding, as an activity done through agents, is having the barriers to entry fully fall down through the same form factor that is giving the act of coding re-found joy. Why I think a lot of people miss these agents is that the way to use the agents is so different from the marketing of incredible evaluation breakthroughs that the models are reaching. The gap between “superhuman coding” announcements and using an agent for mini projects is obviously big. The best way to use the agents is still mundane and requires careful scoping of context. 
For example, yesterday, on September 17, 2025, OpenAI announced that GPT-5 as part of a model system got a higher score than any human (and Google’s Gemini Deep Think) at the ICPC World Finals, “the premier collegiate programming competition where top university teams from around the world solve complex algorithmic problems.” Here’s what an OpenAI researcher said they did: We competed with an ensemble of general-purpose reasoning models; we did not train any model specifically for the ICPC. We had both GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last (and most difficult problem) was solved by the experimental reasoning model. These competitions often get highlighted because they’re “finite time,” so the system must respond in the same fixed time as a human does, but the amount of compute used by GPT-5 or another model here is likely far higher than any user has access to. This is mostly an indication that further ability, which some people call raw intelligence, can be extracted from the models, but most of that is limited by scaffolding and product when used by the general population. The real story is that these models are delivering increasing value to a growing pool of people. For followers of AI, coding with AI models is the easiest way to feel progress. Now that models are so good at chat, it takes very specialized tasks to test the general knowledge of models, or many of the gains are in getting the right answer faster than GPT-5-Thinking’s meandering path. I’m not an expert software engineer and the huge differences between models, and improvements that the individual models and systems are making, have been incredibly obvious. I’ve said many times how Claude Code (or now Codex) are far better than Cursor Agent, which is in turn far better than Github CoPilot. GitHub CoPilot feels borderline drunk at the wheel. Cursor often feels a little distracted while still being smart, but Claude Code and Codex seem on topic and able to test the best of a model’s intelligence on the problem at hand. Yes, even the best agents often aren’t good enough in complex codebases, but it removes the need to go back and forth countless times in a chat window to see if a model can reach the end of the puzzle for you. These CLI agents can run tests, fix git problems, run local tools, whatever. The scope is constantly growing. For the nuanced take of Claude Code vs Codex CLI right now, the answer is expensive. The best has been Claude Code forcing Claude Opus 4.1, but Codex is not far behind and comes in at a much cheaper entry point ($20/month) — Opus requires a $100+/month plan. Codex also has nice features like web search, but it hasn’t been a major differentiator yet in my use. The new workflow is to switch to the other agent when one cannot solve the current problem, and let it see the repository with fresh eyes, much like you pasted a question to another chatbot. The agents are just one tab away, just like the competitors for chat. Interconnects is a reader-supported publication. Consider becoming a subscriber. In the comparison of Claude, Cursor, and CoPilot above, the crucial component is that all of these agents can be tested with the same Claude 4 Sonnet model. The gaps are just as wide as I stated, highlighting how so many of the gains in coding agents are just in product implementations. 
Back to product implementations mattering: a second example is slightly embarrassing for me. I hadn’t updated my OpenAI Codex install when trying the new GPT-5-Codex model, and simply updating the software produced an immediate, massive jump in performance. It’s a new phenomenon to have a domain at the cutting edge of AI abilities where the software scaffolding around a model is felt so strongly. Product and prompts matter more than ever, and this sensation will expand to more domains.

The why of these performance differences — even when using the same model — is worth dwelling on. It’s unlikely that the Claude team is that much better at general software engineering and product design; rather, Anthropic has extensive in-house experience in extracting the most from models. The current shift in models is about taking models designed for question answering and other single-stream text tasks and getting them to break problems down. In my taxonomy on next-generation reasoning models, I called this ability “abstraction.”

The need to shift the model only slightly toward this task explains OpenAI’s recent specialized model, GPT-5-Codex. GPT-5 was primarily a release about balancing OpenAI’s books with a user base approaching 1B active users in the chat format; GPT-5 is a tool honed for a different job. The evaluation scores for the new GPT-5-Codex are slightly better than the general reasoning model’s, but the main gains are in how its behavior differs on coding tasks:

GPT‑5-Codex adapts how much time it spends thinking more dynamically based on the complexity of the task. The model combines two essential skills for a coding agent: pairing with developers in interactive sessions, and persistent, independent execution on longer tasks. That means Codex will feel snappier on small, well-defined requests or while you are chatting with it, and will work for longer on complex tasks like big refactors. During testing, we've seen GPT‑5-Codex work independently for more than 7 hours at a time on large, complex tasks, iterating on its implementation, fixing test failures, and ultimately delivering a successful implementation.

And they included a somewhat confusing plot to showcase this dynamic. I’ve certainly felt these changes now that I’ve updated both the Codex software and the Codex model. This represents another key problem I presented in my taxonomy: calibration, i.e., not overthinking.

Having specialized models and specialized products for a use case could make people think the labs are narrowing in to make progress, but in OpenAI’s case it is rather that their hands are tied financially by the need to support the main ChatGPT application. Claude has already fully committed to code, and that commitment makes sense given the size of the space these agents could expand into. These “coding” agents are definitely going to be seen as doing far more than writing code. Yes, their primary ability is going to be writing the code itself and executing it, but what that enables is an entirely

    16 min
  8. 9 SEPT

    On China's open source AI trajectory

    Hello everyone! I’m coming back online after two weeks of vacation. Thankfully it coincided with some of the slowest weeks of the year in the AI space. I’m excited to get back to writing and (soon) share projects that’ll wrap up in the last months of the year. It seemed like a good time to remind people of the full set of housekeeping for Interconnects.

* Many people love the audio version of the essays (read by me, not AI). You can get them in your podcast player here. Paid subscribers can add private podcast feeds under “manage your subscription,” where voiceover is available for paywalled posts.
* The Interconnects Discord for paid subscribers continues to get better, and is potentially the leading paid perk amid the fragmentation of Twitter etc.
* We’re going to be rolling out more perks for group subscriptions and experimental products this fall. Stay tuned, or get in touch if group discounts are super exciting for your company.

For the time being, I’m planning trips and meetups across a few conferences in October. I’ll be speaking at The Curve (Oct. 3-5, Berkeley), COLM (Oct. 7-10, Montreal, interest form), and the PyTorch Conference (Oct. 21-24, SF) on open models, Olmo, and the ATOM Project, so stay tuned for meetups and community opportunities. On to the post!

China is maneuvering to double down on its open AI ecosystem. Depending on how the U.S. and its allies change culture and mobilize investment, this could make the dominance of Chinese AI models this summer, from Qwen, Kimi, Z.ai, and DeepSeek, look like foreshadowing rather than the maximum gap in open models between the U.S. and China.

Until the DeepSeek moment, AI was likely a fringe issue to the PRC Government. The central government sets guidelines, rules, budgets, and focus areas that are then distributed and enforced across decentralized government power structures. AI wasn’t a political focus, and the strategy of open-source was likely set by companies looking to close the gap with leading American competitors and achieve maximum market share in minimum time.

I hear all the time that most companies in the U.S. want to start with open models for IT and philosophical reasons, even when spinning up access to a new API model is almost effortless, and this bias could be even higher internationally, where spending on technology services is historically lower. Most American startups are starting with Chinese models. I’ve been saying this for a while, but a more official reference comes from a recent quote from a16z partner Martin Casado, another vocal advocate of investment in open models in America. He was quoted in The Economist with regard to his venture portfolio companies: “I’d say 80% chance [they are] using a Chinese open-source model.”

The crucial question for the next few years in the geopolitical evolution of AI is whether China will double down on this open-source strategy or change course. The difficulty with monitoring this position is that it could look like nothing is happening while China maintains its outputs, even when the processes for creating them are far different. Holding a position is still a decision. It’s feasible that in the next decade AI applications and open models are approached with the same vigor with which China built public infrastructure over the last few decades (yes, I’m reading Dan Wang’s new book Breakneck).
It could become a new area that local officials compete in to prove their worth to the nation — I’m not sure even true China experts could make confident predictions here. A large source of uncertainty is whether the sort of top-down PRC edicts that succeeded in the past with physical infrastructure can also produce effective AI models and digital systems.

At the same time as the obvious pro-AI messaging, Chinese officials have warned of “disorderly competition” in the AI space, an indirect signal that could keep model providers releasing their models openly. Open models reduce duplicative training costs, help the entire ecosystem monitor best practices, and force business models that aren’t reliant on simple race-to-the-bottom inference markets. Open model submarkets are emerging for every corner of the AI ecosystem, such as video generation or robotic action models (see our coverage of open models, Artifacts Logs), with a dramatic evolution from research ideas to mature, stable models in the last 12-18 months. China improving the open model ecosystem looks like the forced adoption of Chinese AI chips, further specialization of companies’ open models to evolving niches, and expanded influence on fundamental AI research shared internationally. All of these directions show early signs of occurring.

If the PRC Government wanted to exert certain types of control over the AI ecosystem, it could. This Doug Guthrie excerpt from Apple in China tells the story from the perspective of international companies. Guthrie was a major player in advising on culture changes in Cupertino to better adapt Apple’s strategy to the Chinese market:

“When you stake your life, your identity, on and around certain ideas, you sort of fight for them,” Guthrie says. “Xi Jinping kind of broke my heart… I was sitting there, in China, in my dream job, and I’m watching Xinjiang’s internment camps. I’m watching China tearing up a fifty-year agreement over Hong Kong.” Apple, meanwhile, had become too intertwined with China. Guthrie had been hired to help understand the country and to navigate it. And Apple had followed through—very successfully. But it had burned so many boats, as the saying goes, that Guthrie felt its fate was married to China’s and there was no way out. “The cost of doing business in China today is a high one, and it is paid by any and every company that comes looking to tap into its markets or leverage its workforce,” he later wrote in a blog. “Quite simply, you don’t get to do business in China today without doing exactly what the Chinese government wants you to do. Period. No one is immune. No one.”

China famously cracked down on its largest technology companies in late 2020, stripping key figures of power and wiping dramatic amounts of market value off the books. AI is not immune to this. The primary read here is that the PRC leadership will decide on the role it wants to have in the open-source AI ecosystem. The safe assumption has been that openness would continue, because when the government first started focusing on the issue it picked up a high-impact national strategy that was already seeded with international influence.

To formalize these intentions, the Chinese government has recently enacted an “AI+” plan that reads very similarly to the recent White House AI Action Plan when it comes to open models. The AI+ plan was first proposed in March 2024 and was approved in its full text on July 31, 2025.
The AI+ plan, when enacted by local officials, lays out goals for the AI industry: how many open models to have at different tiers of performance, and some funding mechanisms for nurturing them. This is right in line with other comments from party officials. Chinese Premier Li Qiang, second-ranking member of the Politburo Standing Committee, made comments in March directly supporting open-source models. From the Wall Street Journal:

Li pledged that China would boost support for applications of large-scale AI models and AI hardware, such as smartphones, robots, and smart cars. China’s top economic planning body also said Wednesday that the country aimed to develop a system of open-source models while continuing to invest in computing power and data for AI.

An excerpt from Beijing’s city plan, part of the overall AI+ initiative and translated by GPT-5 Pro, has interesting, specific goals:

By end-2025: implement 5 benchmark application projects at a world-leading level; organize 10 demonstration application projects that lead the nation; and promote a batch of commercializable results. Strive to form 3–5 advanced, usable, and self-controllable base large-model products, 100 excellent industry large-model products, and 1,000 industry success cases. Take the lead in building an AI-native city, making Beijing a globally influential AI innovation source and application high ground.

The goal of this is to:

Encourage open-source, high-parameter, ‘autonomous and controllable’ base foundation models, and support building cloud hosting platforms for models and datasets to facilitate developer sharing and collaboration.

Beyond the minor translation bumpiness, the intentions of the AI+ plan are clear, with multiple mentions of both open-source models and an open ecosystem in which those models can be adopted widely. An ecosystem of models can make the impact of any one individual model greater than it would be alone.

The Chinese government, with its centralized power, has more direct levers to enact change than the White House, but this comes with the same trade-offs that any comparison of U.S. and Chinese potential must weigh. I won’t review all of the differences in the approaches here.

Where the Chinese Government enacts top-level edicts that will be harder to follow from the West, there are numerous anecdotes and interactions that highlight in plain terms the mood of the AI ecosystem in China. I’ve routinely been impressed by the level of direct engagement I have received from leading Chinese AI companies and news outlets. Interconnects’ readership has grown substantially in China. Chinese companies are very sensitive to how their open contributions are viewed, highlighting great pride in both their work and their approach. The latest case was our China open model rankings, which got direct engagement from multiple Chinese AI labs and was highlighted by a prominent AI news outlet in China, 机器之心/Synced. They described Interconnects

    14 min
