Interconnects

Nathan Lambert

Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what's under the hood, and telling stories. www.interconnects.ai

  1. 17 AUG

    Ranking the Chinese Open Model Builders

    The Chinese AI ecosystem has taken the AI world by storm this summer with an unrelenting pace of stellar open model releases. The flagship releases that got the most Western media coverage are the likes of Qwen 3, Kimi K2, or Zhipu GLM 4.5, but there is a long-tail of providers close behind in both quality and cadence of releases. In this post we rank the top 19 Chinese labs by the quality and quantity of contributions to the open AI ecosystem — this is not a list of raw ability, but outputs — all the way from the top of DeepSeek to the emerging open research labs. For a more detailed coverage of all the specific models, we recommend studying our Artifacts Log series, which chronicles all of the major open model releases every month. We plan to revisit this ranking and make note of major new players, so make sure to subscribe. At the frontier These companies rival Western counterparts with the quality and frequency of their models. DeepSeek deepseek.com | 🤗 deepseek-ai | X @DeepSeek_AI DeepSeek needs little introduction. Their V3 and R1 models, and their impact, are still likely the biggest AI stories of 2025 — open, Chinese models at the frontier of performance with permissive licenses and the exposed model chains of thought that enamored users around the world. With all the attention following the breakthrough releases, a bit more has been said about DeepSeek in terms of operations, ideology, and business model relative to the other labs. They are very innovative technically and have not devoted extensive resources to their consumer chatbot or API hosting (as judged by higher than industry-standard performance degradation). Over the last 18 months, DeepSeek was known for making “about one major release a month.” Since the updated releases of V3-0324 and R1-0528, many close observers have been surprised by their lack of contributions. This has let other players in the ecosystem close the gap, but in terms of impact and actual commercial usage, DeepSeek is still king. An important aspect of DeepSeek’s strategy is their focus on improving their core models at the frontier of performance. To complement this, they have experiments using their current generation to make fundamental research innovations, such as theorem proving or math models, which ultimately get used for the next iteration of models. This is similar to how Western labs operate. First, you test a new idea as an experiment internally, then you fold it into the “main product” that most of your users see. DeepSeekMath, for example, used DeepSeek-Coder-Base-v1.5 7B and introduced the now famous reinforcement learning algorithm Group Relative Policy Optimization (GRPO), which is one of the main drivers of R1. The exception to this (at least today) is Janus, their omni-modal series, which has not been used in their main line. Qwen qwenlm.ai | 🤗 Qwen | X @Alibaba_Qwen Tongyi Qianwen, the primary AI lab within Alibaba’s cloud division, is by far and away most known for their open language model series. They have been releasing many models across a range of sizes (quite similar to Llama 1 through 3) for years. Recently, their models from Qwen 2.5 and Qwen 3 have had accelerating market share among AI research and startup development. 
Qwen is closer to American Big Tech companies than to other Chinese AI labs in terms of releases: They are covering the entire stack, from VLMs to embedding models, coding models, image and video generation, and so on. They also cater to all possible customers (or rather every part of the open community) by releasing capable models of all sizes. Small dense models are important for academia to run experiments and for small/medium businesses to power their applications, so it comes as no surprise that Qwen-based models are exploding in popularity. On top of model releases for everyone, they have also focused on supporting the (Western) community, releasing MLX and GGUF versions of their models for local usage and a CLI for their coding models, which includes a generous amount of free requests. Unlike some American companies, the core team seems to have stayed relatively small in terms of headcount, in line with other Chinese AI labs: Qwen3 lists 177 contributors, whereas Llama 3 lists roughly three times as many and Gemini 2.5 credits over 3,000 people. Close competitors These companies have recently arrived at the frontier of performance, and we will see if they have the capability to consistently release great models at a pace matching Qwen or DeepSeek. Moonshot AI (Kimi) moonshot.cn | 🤗 moonshotai | X @Kimi_Moonshot Moonshot AI is one of the so-called “AI tigers,” a group of hot Chinese AI startups singled out by Chinese media and investors. This group consists of Baichuan, Zhipu AI, Moonshot AI, MiniMax, StepFun, and 01.AI — most of which have attracted investments from tech funds and other grants. For example, Alibaba is seen as a big winner in the AI space by having their own models and by being a lead investor in Moonshot, sort of like how big tech companies in the U.S. are investing in fundraising rounds for newer AI labs. While their first models, K1 and K1.5, were closed and available on their API, they started releasing open models after the R1 release with experimental models using the Muon optimizer. Similar to DeepSeek, they focus on a single model line, with small experiments eventually feeding back into the main model. K2 is their “moonshot run,” a.k.a. yolo run, and quickly became a hit similar to R1 (see our report from the release). Further reading on Kimi can be found on ChinaTalk. Zhipu / Z.AI z.ai | 🤗 zai-org | X @Zai_org Zhipu, known in the West as Z.ai, is a startup spinoff of Tsinghua University with considerable investments by Chinese companies and VCs. Currently, they are even considering an IPO, which would make them the first AI tiger to do so. In terms of models, they are mostly known for their recent releases of GLM-4.5 and GLM-4.5V, which are both very capable for their sizes (and both fairly large mixture-of-experts models). However, they are not just releasing LLMs, but also image and video generation models, setting them apart from pure-LLM companies and labs. Noteworthy These companies are transitioning to open releases, have open models with inferior capabilities, or have slightly different foci than the text-centric labs pushing the frontiers of intelligence. StepFun stepfun.ai | 🤗 stepfun-ai | X @StepFun_ai StepFun first started as a closed model provider, but pivoted to open model releases after DeepSeek R1 shook up the industry. They are mostly focusing on multi-modal model releases, with Step3 being their flagship VLM. They also have image, audio, and video generation models.
Tencent (Hunyuan) hunyuan.tencent.com | 🤗 Tencent | X @TencentHunyuan Hunyuan is mostly known for HunyuanVideo and Hunyuan3D. While they have released three series of different LLMs, their releases come with very strict licenses, which is unusual for Chinese companies and dampens excitement when combined with performance levels that can be found elsewhere. RedNote (Xiaohongshu) xiaohongshu.com | 🤗 rednote-hilab The Chinese version of Instagram, RedNote, recently joined the ranks of Chinese companies releasing open models. Their capable character recognition / OCR model in particular surprised many (see our coverage). Similar to Xiaomi and Baidu, it remains to be seen what their overall open strategy will be in the near and distant future, and they have not competed in the large, frontier model space. MiniMax minimaxi.com | 🤗 MiniMaxAI | X @MiniMax__AI MiniMax is another of the AI tigers and also started as a closed company. After the release of R1, they changed their strategy and released the weights of MiniMax-Text-01, following up with reasoning models building upon it. The unique selling point of these models is the 1M-token context window achieved with hybrid attention. These text models are not the only thing they are focusing on — they also have image and video generation models, but those remain closed and only available on their API. They are also promoting their consumer platform heavily as they eye an IPO. OpenGVLab / InternLM internlm.intern-ai.org.cn | 🤗 InternLM | X @opengvlab InternLM & OpenGVLab have deep ties to the Shanghai AI Laboratory, with InternLM focusing on language models while OpenGVLab releases vision models. While they release a range of models such as S1 or InternLM-Math, the orgs are mostly known for the strong InternVL series. While the first versions mostly used their own InternLM pretrained models, later releases (such as InternVL3) rely on Qwen as the language backend. Skywork skywork.ai | 🤗 Skywork | X @Skywork_AI The Singaporean Skywork first started out as an online karaoke company (yes, really) before pivoting to AI and becoming a competitor to Manus, with their platform focusing on agents for work-related tasks, such as slide generation. Their LLM journey started with them releasing their own pretrained dense and MoE models. However, they stopped pre-training their own models and instead started to fine-tune existing ones: their OR1 reasoning model builds on top of DeepSeek-R1-Distill-Qwen-32B, and R1V3 uses InternVL3 (which itself uses Qwen2.5 as its LLM backend). Aside from LLMs, they have a wide range of other models, from world models to image and video generation models and reward models. Similar to their LLMs, they mostly build on top of other models. Unlike many labs, Skywork has released some datasets with their models, such as preference and reasoning training data. On the rise These companies are either just getting their toes wet with open models or operating more as academic research organizations than labs pushing the performance of models. ByteDance Seed seed.bytedance.com | 🤗 ByteDance-Seed Seed is the R&D arm of ByteDance and eerily similar to Meta’s FAIR division: Diverse
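The DeepSeek entry above credits Group Relative Policy Optimization (GRPO) as one of the main drivers of R1. For readers who want a concrete anchor, here is a minimal Python sketch (my own illustration, not DeepSeek's code) of the group-relative advantage that gives the algorithm its name: rewards for a group of completions sampled from the same prompt are normalized against that group's own mean and standard deviation, which removes the need for a separate value model.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative only).
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each completion's reward against its group's mean and std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one math prompt, scored 1.0 if correct, else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[0.87, -0.87, -0.87, 0.87]
```

In the full algorithm these advantages weight a clipped policy-gradient update, as in PPO, but with no learned critic.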

    13 min
  2. 15 AUG

    Contra Dwarkesh on Continual Learning

    Dwarkesh Patel’s now well-read post on why he is extending his AI timelines focuses on the idea of continual learning. If you ask me, what we have already is AGI, so the core question is: Is continual learning a bottleneck on AI progress? In this post, I argue that continual learning as he describes it actually doesn’t matter for the trajectory of AI progress that we are on. Continual learning will eventually be solved, but in the sort of way that a new type of AI will emerge from it, rather than continuing to refine what it means to host ever more powerful LLM-based systems. Continual learning is the ultimate algorithmic nerd snipe for AI researchers, when in reality all we need to do is keep scaling systems and we’ll get something indistinguishable from how humans do it, for free. To start, here’s the core of the Dwarkesh piece as a refresher for what he means by continual learning. Sometimes people say that even if all AI progress totally stopped, the systems of today would still be far more economically transformative than the internet. I disagree. I think the LLMs of today are magical. But the reason that the Fortune 500 aren’t using them to transform their workflows isn’t because the management is too stodgy. Rather, I think it’s genuinely hard to get normal humanlike labor out of LLMs. And this has to do with some fundamental capabilities these models lack. I like to think I’m “AI forward” here at the Dwarkesh Podcast. I’ve probably spent over a hundred hours trying to build little LLM tools for my post production setup. And the experience of trying to get them to be useful has extended my timelines. I’ll try to get the LLMs to rewrite autogenerated transcripts for readability the way a human would. Or I’ll try to get them to identify clips from the transcript to tweet out. Sometimes I’ll try to get them to co-write an essay with me, passage by passage. These are simple, self contained, short horizon, language in-language out tasks - the kinds of assignments that should be dead center in the LLMs’ repertoire. And they're 5/10 at them. Don’t get me wrong, that’s impressive. But the fundamental problem is that LLMs don’t get better over time the way a human would. The lack of continual learning is a huge huge problem. The LLM baseline at many tasks might be higher than an average human's. But there’s no way to give a model high level feedback. You’re stuck with the abilities you get out of the box. You can keep messing around with the system prompt. In practice this just doesn’t produce anything even close to the kind of learning and improvement that human employees experience. The core issue I have with this argument is the dream of making the LLMs we’re building today look more like humans. In many ways I’m surprised that Dwarkesh and other very AGI-focused AI researchers or commentators believe this — it’s the same root argument that AI critics use when they say AI models don’t reason. The goal to make AI more human is constraining the technological progress to a potentially impossible degree. Human intelligence has long been the inspiration for AI, but we have long surpassed it being the mirror we look to for inspiration. Now the industry is all in on the expensive path to make the best language models it possibly can. We’re no longer trying to build the bird, we’re trying to transition the Wright Brothers’ invention into the 737 in the shortest time frame possible. To put it succinctly. My argument very much rhymes with some of my past writing. 
Do language models reason like humans? No. Do language models reason? Yes. Will language model systems continually learn like humans? No. Will language model systems continually learn? Of course. Interconnects is a reader-supported publication. Consider becoming a subscriber. Dwarkesh writes “Rather, I think it’s genuinely hard to get normal humanlike labor out of LLMs.” This is because we’re still early in the buildout of the technology. Human labor takes an immense amount of context and quick thinking, both of which we’re starting to unlock with our language models. On top of this, human labor may not be what we want to create — we want to augment it. Using LLMs as drop-in replacements for humans is not a requirement for AGI, nor is what Dwarkesh describes a fundamental limitation on AI progress. Francois Chollet cleverly poked at this weakness in his recent conversation with Dwarkesh at an ARC-AGI event: Well, how do you define the difference between the ability to adapt to a new task and learning on the fly? It's, it sounds like the same thing to me. Language models can already pick up subtle context extremely fast. ChatGPT’s memory feature has gotten far better for me. When we’re using the far more powerful models we can expect in the next 18 months, this’ll already start to appear magical. Language models are extremely adept at inferring context even without us giving it to them. Soon we’ll be unlocking that subtle connection engine by providing immense, explicit context. I don’t know of anyone who has actually thoroughly digitized all the relevant context of their job and formatted it in a way that is easily readable by an LLM. GPT-5 Pro estimates that all of the writing on Interconnects would be only 500K tokens. That would fit into an existing LLM with no extra system, but I’ve never tried it. The problem that Dwarkesh is facing is that we’re still using LLMs primarily in a single-generation manner, which got far better with the introduction of reasoning models, but the economically useful way to use current tools in more complex intellectual domains will require a deep-research style approach over all of your recent work interactions. No one is giving language models that kind of context. None of the tools we use are set up properly to accumulate this type of context. I expect this to change rapidly. ChatGPT, Claude, and the like are all adding memory features across chats and countless connectors to other pieces of information in your professional life. These memory features will be omnimodal and essential to extracting the type of value Dwarkesh wants. Without them, I agree language models in their current form are hopeless at solving continual learning. This is what I would expect the rumored $2,000/month ChatGPT-level subscriptions to work with. Each of these bespoke tasks needs to absorb a ton of context and reasoning tokens in order to make a directionally right output. If someone built the Claude Code equivalent for my Substack, with every post tagged by topic and performance metrics, I bet the AI could easily make useful suggestions on how to format my content. Continual learning as Dwarkesh presents it is a systems problem rather than a learning problem. I expect better context management over my information ecosystem to exist in 2026, but more work will be needed for the AI companies to know how best to reference it and unlock in-context learning that feels like rapid adaptation. Call that 2027.
The models that have been released in 2025 will make this far more tractable in the near future. Reasoning models have made in-context learning far more powerful, resulting in rapid progress on held-out and complex domains such as ARC-AGI. These models also have come with massive improvements in context length. Claude and Gemini have 1M+ token context lengths and GPT-5’s is at 400K — they’re all growing steadily. What is important with the context length numbers is that evaluations are showing that these are meaningful improvements that the models can leverage intelligently. With these reasoning models and smart retrieval of context, the systems we are building will look indistinguishable from continual learning. This will definitely be multiple LLMs working together and will operate very differently than the first versions of ChatGPT we were given (and often still use today). The path to continual learning is more context and more horsepower. This is directly in line with the direction AI investment is going. This doesn’t feel like a bottleneck, rather another product problem that we are going to solve. This sort of continual learning may not enable the type of raw intelligence and autonomy that many vocal leaders in AI describe as “superintelligence.” Training models to be smarter on even more complex tasks — e.g. novel biological research — requires mastering agentic behaviors that need to be learned from scratch, as discussed in my post on “What comes next with RL”. There’s no internet scale pretraining data for such agentic tasks. My point is that not all jobs that require continual learning will require the frontiers of intelligence. I’m excited to write blog posts with the bliss of my ChatGPT 6 co-editor. This technology coming soon will not be without its challenges. My first reaction to the continual learning post was more in line with “society isn’t ready for this” rather than commentary on its feasibility. I’ll repeat my warning: For a long time I’ve written that AI models have a higher risk potential in terms of social outcomes because the modalities they interact with us in are far more personal… As AI is going to be so powerful as a standalone entity, breaking some of the symbiotic links will be good for adding friction that makes the technology easier to steer towards good outcomes. In short, be wary of wishing for end-to-end (reinforcement) learning when you’re part of the environment.2 It’s a destiny to dystopia. What we have today is a form of AGI and it’ll soon get much better with better context and memory. The industrialization of language models is giving us incredible improvements across a wide swath of use-cases. These will blow past many basic primitives of intelligence in humans that have motivated AI for decades. First was models reasoning, then will come systems with continual learning. Th

    10 min
  3. 7 AUG

    GPT-5 and the arc of progress

    If you want a video version of this, check out the last 20 minutes of the livestream reaction (edit, fixed link) I did with Will Brown of Prime Intellect and Swyx of Smol AI & Latent Space. GPT-5 was set up to fail on some of the narratives it was expected to satisfy. The two central themes it had to decide between were the AGI (or superintelligence) narrative that Sam Altman & co. have been using to fundraise and the fact that ChatGPT is one of the fastest-growing consumer technologies of all time. To fulfill both, GPT-5 needed to be AGI while also being cheap enough to serve as the most-used AI system in the world. Business and technological realities made it inevitable that GPT-5’s primary impact would be to solidify OpenAI’s market position, even if it raises a lot of eyebrows for the long-term trajectory of AI. The reactions online capture this as well. The OpenAI live streams have historically catered to AI insiders, but the product speaks entirely to a different audience. The people discussing this release on Twitter will be disappointed in a first reaction, but 99% of people using ChatGPT are going to be so happy about the upgrade. Confusingly enough, this includes many of the critics. GPT-5 is a good AI system. It’s right in line with best-in-class across pretty much every evaluation, while being cheap enough to serve the whole world. OpenAI is largely fixing its product offering with an announcement that was hyped to be one of the biggest AI news cycles of the year. AI news being loud is defined by narratives being different more-so than technology being better. OpenAI releasing an open model again will likely be pinpointed as just as important a day for the arc of AI as the GPT-5 release. In many ways GPT-5 was set up to fail and that is very off-putting for those expecting maximum AI progress in the near term. I’m not going to dwell on it, but oh boy, that was a messy release. GPT-5 being announced and rolled out like this is very odd. Countless plots were mislabeled, live demos had bugs, and the early rollout is doing some weird stuff. This reinforces how OpenAI was torn about the release and backed into a corner with their messaging. They knew they needed to improve the experience with strong competition in the industry, but releasing GPT-5 needed to make a splash after how long they’ve waited (and already parked the GPT 4.5 name). The core question we track in this post is: What does it mean for the next 6-18 months of AI progress if GPT-5 is just as good as all the best models out there, e.g., Claude Sonnet for coding or o3 for search, funneled into one, super cheap package? If AGI was a real goal, the main factor on progress would be raw performance. GPT-5 shows that AI is on a somewhat more traditional technological path, where there isn’t one key factor, it is a mix of performance, price, product, and everything in between. Interconnects is a reader-supported publication. Consider becoming a subscriber. GPT-5’s performance There are a few places that we can see that GPT-5 represents a solid step on the performance trend line, but nothing like a step change. First, on LMArena, GPT-5 is fantastic, sweeping the board to #1 on all categories. The last model to claim #1 in pretty much every category was Gemini 2.5 Pro — and that was the biggest step change in Elo since GPT-4 Turbo skyrocketed past the first Claude. Second, GPT-5 is the top model on the ArtificialAnalysis composite benchmark. 
These two, LMArena & ArtificialAnalysis, represent two coarse evaluations — community vibes and raw benchmarks. Both of these can be gamed, but are still correlated with real-world use. You can also see in OpenAI’s shared results how much the smaller versions improve on the likes of GPT-4.1 mini and o4-mini. In many ways, the march of progress on evals has felt slowed for a while because model releases are so frequent and each individual step is smaller. Lots of small steps make for big change. The overall trend line is still very positive, and multiple companies are filling in the shape of it. My post on “what comes next” from earlier this summer all but called this type of release, where the numbers aren’t shocking but the real world use cases are great, becoming more common. This is a different path for the industry and will take a different form of messaging than we’re used to. More releases are going to look like Anthropic’s Claude 4, where the benchmark gains are minor and the real world gains are a big step. There are plenty of more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working. To say it succinctly: Abilities will develop more slowly than products. The product overhang is being extended with each release. We’re still building untapped value with AI models and systems faster than we’re capturing it. Another way to see this incremental push out in models or systems is through OpenAI’s update to the famous METR plot of time to completion for humans of various tasks AI systems can solve 50% of the time. GPT-5 is leading, but also just in line with trends. All of this is to say comprehensively that AI progress is very alive and well, as long as you don’t subscribe to the exponential takeoff in ability. Those arguments are very strained by this GPT-5 release. Yes, AI progress on intelligence and “raw ability” is certainly going to continue at a solid pace for a long time, but how will this translate into recursive self-improvement? GPT-5’s details If you’re reading closely, you may have noticed that this post uses the word system instead of model. All of the leading chat systems have been adding more components onto them like safety checkers and so on, but this is the first one to use different architectures and weights for the primary generation of content across similar queries. GPT-5 is the first in what is to come, mostly to better balance cost and give better user experiences. From the system card: GPT‑5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent (for example, if you say “think hard about this” in the prompt). The router is continuously trained on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time. Along with this, they shipped many product improvements, such as how the model has a 400K context window in the API with great performance, reduced hallucinations, and new personalities. Primarily, I worry as a power user about the router. 
I sense that for now I’ll default to GPT-5 Thinking, and sometimes upgrade to Pro mode, while downgrading to standard GPT-5 only for benign queries (depending on its search behavior — if it is search-heavy like o3 without thinking, then it should still work well). Thankfully, the thinking mode has a “get an early answer” button, so I don’t see any reason to start elsewhere. If I need an answer fast, I’ll get one. If not, I want the best responses possible. As for prices, here’s a comparison. GPT-5’s top-level model is cheaper than Claude Sonnet and far better than any OpenAI model has been before at coding — one of the core details of this release. Matching Gemini Pro’s pricing when considering Google’s infrastructure advantage is a substantial accomplishment. * OpenAI — GPT-5 (API sizes) * GPT-5: input $1.25, output $10.00. (OpenAI) * GPT-5 mini: input $0.25, output $2.00. (OpenAI) * GPT-5 nano: input $0.05, output $0.40. (OpenAI) * OpenAI — o3 (reasoning) * o3: input $2.00, output $8.00. (OpenAI Platform) * o3-mini: input $1.10, output $4.40. (cached input $0.55) (OpenAI Platform) * Anthropic — Claude 4 family * Claude Sonnet 4: input $3.00, output $15.00. (Anthropic) * Claude Opus 4.1: input $15.00, output $75.00. (Anthropic) * Google — Gemini 2.5 * Gemini 2.5 Pro: input $1.25 (≤200k prompt) / $2.50 (>200k); output $10.00 (≤200k) / $15.00 (>200k). (Google AI for Developers) * Gemini 2.5 Flash: input $0.30 (text/image/video) or $1.00 (audio); output $2.50 (includes thinking tokens). (Google AI for Developers) * Gemini 2.5 Flash-Lite: input $0.10 (text/image/video) or $0.30 (audio); output $0.40. (Google AI for Developers) Cheaper, thinking models that work well in applications are far more useful than scaling (as GPT-4.5 has shown us). GPT-5’s impact It seems like most people in all walks of life are going to love this model — from AI researchers all the way to people who are learning of ChatGPT for the first time today. This is very in line with my expectations for how AI will proceed, as a long, steady march of progress. The fact that the models are getting way cheaper rather than way more expensive definitely signals that we cannot just brute-force scale our way to much stronger systems. Scaling helps, but it is now one of many considerations, and all the laboratories are showing us that much bigger models have diminishing returns in value to customers. At the same time, models being cheaper could be just what we need for Jevons paradox to kick in and provide another boost in AI adoption. Many people will claim that the GPT-5 release was a flop and the bubble will pop for AI. This is downstream of the industry generally making totally unrealistic promises. As someone whose core through-line when covering frontier models is tracking the pace of progress, I translate this as “AI capabilities on benchmarks will proceed a bit more slowly, but we aren’t reaching any clear walls in performance.” The AI performance hills we’re climbing up a
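To make the per-token prices in the list above concrete, here is a small sketch that estimates what a single request costs under a few of the quoted rates. The prices are copied from the list (USD per million tokens, base tier); the request sizes are hypothetical and purely for illustration.

```python
# Rough per-request cost from the API prices quoted above (USD per million tokens).
# The example request sizes below are hypothetical.
PRICES = {
    "gpt-5":           {"input": 1.25, "output": 10.00},
    "gpt-5-mini":      {"input": 0.25, "output": 2.00},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
    "gemini-2.5-pro":  {"input": 1.25, "output": 10.00},  # <=200k-token prompt tier
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A hypothetical coding request: 50k tokens of context in, 5k tokens out.
for name in PRICES:
    print(f"{name}: ${request_cost(name, 50_000, 5_000):.3f}")  # gpt-5 comes to ~$0.11
```

At these rates, the top-level GPT-5 sits at or below Gemini 2.5 Pro and well below Claude Sonnet 4 on both input and output, which is exactly the pricing point emphasized in the post.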

    11 min
  4. 5 AUG

    gpt-oss: OpenAI validates the open ecosystem (finally)

    OpenAI released two open-weight, text-only reasoning models today, both mixture of experts (MoE) sized to run efficiently on a range of hardware from consumer GPUs to the cloud. These models have the Apache 2.0 license, so they’re available for distillation into other reasoning models, deployment into commercial products, and are free of downstream restrictions. These two models, the smaller gpt-oss-20B with 3.6B active parameters and 21B total and the larger gpt-oss-120B with 5.1B active parameters, follow the trends we’ve seen with the other leading open models in architecture choices. Where this release shines is in the dramatic change in open model performance and strategy that comes with the leading name in AI releasing an open model that undercuts some of their own API products. We’ll get to the technical details on the model later, but the main point of this post is how much OpenAI has changed by releasing their first open language model since GPT-2. The larger 120B model “achieves near-parity with OpenAI o4 mini on core reasoning benchmarks‬” and is a major moment for the ecosystem: * OpenAI has released an open model at the frontier of current open model performance — highlighting how major concerns over open models that OpenAI leadership mentioned in 2023 were overblown. The marginal risks of open models have been shown to not be as extreme as many people thought (at least for text only — multimodal is far riskier). Once other organizations, particularly Meta and China showed OpenAI that there was no risk here, the path was opened to release a model. * OpenAI has revealed far more of their technical stack than any release to date. This blog post has light details on many things in the model, but community tinkering will begin to better understand what is going on here. This includes basic things like our first time seeing a raw chain of thought (CoT) for an OpenAI reasoning model, but also more interesting things like how this model is trained to use tools in the CoT like their o3 model. Other details include researchers being able to play with OpenAI’s instruction hierarchy in raw weights (where pieces of it are untouchable in the API), a new “harmony” prompt format, the same “reasoning efforts” of low, medium & high from the API, a huge proof of concept on how far basic, community standard architectures with MoEs can be pushed, and other small details for the AI community to unpack. * OpenAI has initiated a scorched earth policy on the API market, undercutting their own offerings and unleashing an extremely strong, trusted model brand with a permissive license. While adoption of any open model is much slower than an API due to testing, additional configuration, etc., this is set up to go about as fast as it can. Any API model that competes with current models like OpenAI o4 mini, Claude Haiku, Gemini Flash, DeepSeek R1 etc. are all going to have to compete with this model. OpenAI’s o4 mini model is currently served at $1.1 per million input tokens and $4.4 per million output. Serving this open model will likely cost at least 10x less. There are many potential strategic reasons for this, all of which paint OpenAI as having a clearer vision of what makes it valuable. What OpenAI hasn’t touched with this model is interesting too — “For those seeking multimodal support, built-in tools, and‬ seamless integration with our platform, models available through our API platform remain the‬ best option.” These are dropped for reasons above, and “headaches” discussed later in the post. 
Together, these paint a much clearer vision by OpenAI on how they’ll control the AI ecosystem. The top potential reasons on my mind are: * OpenAI could be trying to make all API models potentially obsolete on cost ahead of the GPT-5 release, which they hope to capture the top end of the market on. Or, * OpenAI could be realizing that models are no longer their differentiation, as ChatGPT users continue to steadily climb — and they’ll soon pass 1 billion weekly actives. There are plenty of other reasons, such as the politics alluded to at the end of the blog post, but OpenAI tends to only act when it serves them directly — they’ve always been a focused company on their goals. There’s also a long list of head scratchers or in-between the lines points that illuminate OpenAI’s strategy a bit more. OpenAI of course didn’t release training data, code, or a technical report, as expected. OpenAI is trying to make a big splash with the name that captures more of the enterprise market, but in doing so takes some collateral damage in the research and true “open source” AI communities. These future questions include: * The naming is bad — a mixture of cringe, confusion-inducing, and still useful for their marketing goals. For anyone following open-source AI for a long time it won’t be new that a major company is blurring the association of the term open-source with the community accepted definitions. I understand why OpenAI did this, but the naming conflict further enforces that the true open source AI community isn’t the target of this release — it’s people that want to try an “open source AI model” for their business, and OpenAI has made the target too big to miss for enterprises. * OpenAI did not release the base models. Anyone following the space would’ve expected this, but it matters substantially for researchers. These two sparse, low numerical precision MoE models won’t be easy for researchers to use. The best model for researchers and tinkerers are dense, base models from 1 to 7 billion parameters. These are much “longer term” artifacts in the open community that will still be using almost only Qwen. I need to take a second before the “unknowns” section and comment on the architecture. These models are reinforcing trends we’re seeing in modeling across the industry. Recent frontier open models are all very sparse MoEs inspired by the DeepSeek architecture. DeepSeek V3 had 37B active and 671B total parameters. Kimi K2 had 32B active and 1T total parameters. With 5B active and 121B total, the sparsity factor fits right in with normal. Sparsity in MoEs is totally king right now. The smaller gpt-oss is a bit less sparse than Qwen’s 3B active, 30B total smaller MoE, but expect the sparsity of these models to continue to increase. Some things we need more testing to know the impact of include: * The model has been quantized for release to MXFP4 (4 bit floating point). It’s not clear exactly who will be impacted here, but this could make it benefit people most with the newest hardware, cause minor issues across Torch/Cuda versions, or even make some of the behaviors weird relative to the trained version internal to OpenAI. This could also be a plus, depending on performance, as the bigger model is quantized to 4 bit precision to enable it to be run on GPUs with 80GB of memory, such as the A/H100 line from NVIDIA. * Safety measures have been taken to change how finetunable the model is. 
With, or soon after, this release, OpenAI is releasing a research paper on new methods to make it so you can’t “finetune the safety away” from a released instruct model. This is a long-standing concern people have with releasing open models. The main question here is whether the models OpenAI releases can still be finetuned for productive use-cases. OpenAI claims they can be in their blog post, but this will be left up to the community to decide. Is finetuning the safety away actually a feature of an easy-to-use model? For example, Gemma has historically been tougher for people to finetune because it uses a different attention implementation and has a different parameter space from being distilled. Open finetuning stacks are still tuned for Llama and Qwen — this takes a long time to change. Many people will take the “we made it impossible to un-censor this model” claim as a challenge, which will be interesting to follow in the jailbreaking research community. There is a substantial market for modifiable models. * The model was trained to expect tools, but open model tool use is a mess. One of the biggest problems I worry about in designing an OLMo model with native o3-style tool use is that I need to make it seamless for users to use the same tools from training time at inference time. An early tester in my network mentioned that the model would hallucinate tool calls from training (sort of like what was mentioned around o3’s full release). I don’t expect this to be an unsolvable issue, but it could slow adoption. It could also allow people to reverse engineer the tools that OpenAI uses during training; we’ll see! * We need to re-benchmark the model on open infrastructure. OpenAI did a good job for this release integrating it everywhere, but we need to confirm that the community can easily replicate their evaluation scores. Evaluation at closed labs has increasingly become bespoke to suit their internal needs, which is a logical decision, but this comes at a cost of friction when an open model is released. This is me saying loud and clear that this isn’t a model performance review in a nuanced sense, but a summary of the importance of OpenAI’s approach (and where the opportunity is for the rest of us). Not all good models are easy to use. Some models benchmark well and are useful — e.g. Qwen. Some models benchmark well and are forgotten. Regardless of scores, I expect this to be a useful model. Overall, I would give OpenAI a very strong grade on their first open release in a while — they definitely listened to the feedback given by the community. The path to earning goodwill with the open community, especially with researchers, is to embrace more risk in making models that are easier to modify (and potentially even more revealing), such as the base models for these checkpoints. Open models from the U.S. labs
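Two of the architecture points above (the MoE sparsity comparison and the MXFP4 single-GPU claim) are easy to sanity-check with back-of-the-envelope arithmetic. In the sketch below, the parameter counts are the rounded figures quoted in the post, and the memory estimate assumes roughly 4 bits (0.5 bytes) per weight while ignoring activations and KV cache.

```python
# Back-of-the-envelope checks for the sparsity and MXFP4 memory points above.
# Parameter counts (in billions) are the rounded figures quoted in the post.
moe_models = {
    "DeepSeek V3":     (37.0, 671.0),
    "Kimi K2":         (32.0, 1000.0),
    "gpt-oss-120B":    (5.1, 121.0),
    "gpt-oss-20B":     (3.6, 21.0),
    "Qwen3 small MoE": (3.0, 30.0),
}

for name, (active_b, total_b) in moe_models.items():
    print(f"{name}: {active_b / total_b:.1%} of parameters active per token")

# MXFP4 stores weights in roughly 4 bits (0.5 bytes) each, so the large model needs about:
weights_gb = 121e9 * 0.5 / 1e9
print(f"~{weights_gb:.0f} GB of weights, which fits under an 80 GB A100/H100")
```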

    14 min
  5. 4 AUG

    Towards American Truly Open Models: The ATOM Project

I’m very excited to share a substantial project on invigorating investment in open language models and AI research in the U.S. The ATOM (American Truly Open Models) Project is the mature evolution of my original “American DeepSeek Project” and I hope it can mark a turning point in the current trajectory of losing open model relevance vis-a-vis China, and even the rest of the world. I’ve included the full text below, but I encourage you to visit the website for the full version with added visuals, data, and a place to sign your support. This is a community movement, rather than me fundraising, starting an organization, or anything like that. If you can help get the word out and/or sign your support, I’d greatly appreciate it. (Or watch a 5-minute overview on YouTube.) The ATOM Project: Towards fully open models for US research & industry Reinvigorating AI research in the U.S. by building leading, open models at home America's AI leadership was built by being the global hub and leading producer of open AI research, research which led directly to innovations like the Transformer architecture, ChatGPT, and the latest innovations in reasoning models and agents. America is poised to lose this leadership to China, in a period of geopolitical uncertainty and rising tensions between these two nations. America's best AI models have become more closed and restricted, while Chinese models have become more open, capturing substantial market share from businesses and researchers in the U.S. and abroad. Open language models are becoming the foundation of AI research and the most important tool in securing this leadership. America has lost its lead in open models – both in performance and adoption – and is on pace to fall further behind. The United States must lead AI research globally, and we must invest in making the tools our researchers need to do their job here in America: a suite of leading, open foundation models that can re-establish the strength of the research ecosystem. Recommendation: To regain global leadership in open source AI, America needs to maintain at least one lab focused on training open models with 10,000+ leading-edge GPUs. The PRC currently has at least five labs producing and releasing open models at or beyond the capabilities of the best U.S. open model. Regaining open source leadership is necessary to drive research into fundamental AI advances, to maximize U.S. AI market share, and to secure the U.S. AI stack. Overview Open language model weights and data are the core currency of recent AI research – these are the artifacts that people use to come up with new architectures, training paradigms, or tools that will lead to the next paradigms in AI to rival the Transformer or inference-time scaling. These research advances provide continued progress on existing products or form the basis for new technology companies. At the same time, open language models create potential for a broader suite of AI offerings by allowing anyone to build and modify AI how they see fit, without their data being sent through the cloud to a few closed model providers. Open language models are crucial for long-term competition within American industry. Today, substantial innovation is happening inside of large, closed AI laboratories, but these groups can only cover so many of the potential ideas.
These companies spend the vast majority of their resources focusing on the next model they need to train, where the broader, open research community focuses on innovations that’ll be transformative in 2, 5, 10, or more years. The most progress in building useful, intelligent AI systems will come when the most people can participate in improving today's state-of-the-art, rather than the select few at certain companies. The open AI ecosystem (regarding the models, not to be confused with the company OpenAI) has historically been defined by many parties participating. The United States emerged as a hub of the deep learning revolution via close collaboration between leading technology companies and academic institutions. Following ChatGPT, there have been countless contributions from around the globe. This distribution of impact on research has been collapsing towards clear Chinese leadership due to their commitment to open innovation, while a large proportion of leading scientists working in the United States have joined closed research organizations. The playbook that led Google to invent and share the Transformer – the defining language model architecture of which all leading models such as ChatGPT, Gemini, or Claude are derived from – is now the standard mode of operation for Chinese companies, but it is increasingly neglected by American companies. The impact of China’s models and research are growing because the institutions focused on open models have access to substantial compute resources for training – e.g. some have formed a close relationship between leading AI training laboratories and academic institutions. Until the United States and its partners directly invest in training more, higher performance open models and sharing the processes to do so, its pace of progress in AI research will lag behind. To train open models at the frontier of performance, a developer currently needs a high concentration of capital and talent. We estimate that to lead in open model development, the United States needs to invest in multiple clusters of 10,000+ H100 level GPUs to create an ecosystem of fully open language models that are designed to enable a resurgence in Western AI research. Stacking large investments such as this into a few focused efforts will help them to learn from each other and make progress across a range of challenges quickly and robustly. Splitting such an investment in AI training into smaller, widespread projects will not be sufficient to build leading models due to a lack of compute concentration. Along the way we need to build models of various sizes that can enable applications of AI at every scale from local or edge devices all the way to high performance cloud computing. Open models as the engine for AI research and development America's AI leadership was built by tens of thousands of our best and brightest students, academics and researchers. This process occurred over decades, but it is faltering at a crucial transition point to the new, language modeling era of AI research. Since the release of ChatGPT, open language models and computational resources are the most important table stakes for doing relevant and impactful research. High-quality open models and their subsequent technical reports quickly accrue thousands of citations and accolades such as best paper awards and the focus of large swaths of students. These act as foundational currencies of AI research and are crucial, achievable artifacts for the long-term American AI ecosystem. 
While many direct consumers of open models are academics, this community is far from the only group that will benefit immensely from a new wave of American open models. The low cost, flexibility, and customizability of open models makes them ideal for many use cases, including many of the ways that AI stands to advance and transform businesses large and small. If the United States does not create its own leading open models, the focus of American researchers and businesses will continue to shift abroad. The benefits of openly sharing a technology accrue to the builder in mindshare and other subtle soft power dynamics seen throughout the history of open source software. Today, these benefits are accruing elsewhere due to the intentional support of open models by many Chinese organizations. The gap in performance and adoption will only grow as the American ecosystem sees strong open models as something that is nice to have, or an afterthought, rather than a key long-term priority. China is adopting the playbook for open innovation of language models that the United States used to create its current AI leadership, yielding rapid innovation, international adoption, and research interest. The collapse of American dominance in AI research is driven not only by the remarkable quality of the Chinese ecosystem, but also by the commitment of China to these very same Open Model Principles - the principles that American scientists used to start this AI revolution. This is reflected further in a consistent trend of Chinese open models being released with more permissive terms of use than their American counterparts. The many leading closed research institutions in the United States are still creating world-class models – and the work they do is extraordinary. This collapse is not their fault, but closed labs make closed research, and the acceleration of AI was built on open collaboration with world-class American models as the key tool. As researchers, our focus is on leading the research and development for the core technology defining the future, but there is also a growing list of other urgent security and policy concerns facing our nation around the lack of strong open models. To start, adoption of open models from the PRC in the US and our allies has been slow in some sectors due to worries about backdoors or poor security in generated code. Similarly, there is concern over the outputs of these Chinese models being censored or inconsistent with everyday American values of freedom, equality, and independence. There are even parallels between how the PRC’s national AI champions are increasingly racing to release cheap and open AI models and the PRC’s historical practice of dumping state-subsidized, below-cost exports from China to undermine American competitors. With the dynamic and rapid evolution of this technology, we need to get ahead of these issues before stronger habits, cost disadvantages, or other incentives reduce the practicality of adopting American open models. America's lost lead in open model performance On countless benchmarks, the

    22 min
  6. 29 JUL

    Interviewing Ross Taylor on the state of AI: Chinese open models, scaling reasoning, useful tools, and what comes next

I’m excited to welcome Ross Taylor back on the podcast (and sorry for the lack of episodes in general – I have a lot going on!). The first time Ross came on we focused on reasoning – before inference-time scaling and that sort of RL was popular – along with agents, Galactica, and more from his Llama days. Since then, and especially after DeepSeek R1, Ross and I have talked asynchronously about the happenings of AI, so it’s exciting to do it face to face. In this episode we cover some of everything: * Recent AI news (Chinese models and OpenAI’s coming releases) * “Do and don’t” of LLM training organizations * Reasoning research and academic blind spots * Research people aren’t paying enough attention to * Non-language-modeling news & other topics Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here. Show outline as a mix of questions and edited assertions that Ross sent me as potential topics. 00:00 Recent AI news Related reading is on Kimi’s K2 model and thoughts on OpenAI’s forthcoming open release. * What did you think of Z.ai’s GLM 4.5 model (including MIT licensed base model) with very strong scores? And Kimi? * What will OpenAI’s open model actually be? * What do you make of the state of the ecosystem? 12:10 “Do and don’t” of LLM training organizations Related reading is on managing training organizations or the Llama 4 release. This is one of my favorite topics – I think a lot of great stuff will be written on it in the future. For now, Ross asserts… * Most major LLM efforts are not talent-bound, but politics-bound. Recent failures like Llama 4 are org failures, not talent failures. * Most labs are chaotic, changing direction every week. Very different picture from the narrative presented online. * Most labs resemble investment banks or accountancy firms in that they hire smart young people as “soldiers” and deliberately burn them out with extremely long hours. 36:40 Reasoning research and academic blind spots Related reading is two papers raising questions about the Qwen base models for RL (or a summary blog post I wrote). I start with: What do you think of o3, and search as something to train with RL? And Ross asserts… * Most open reasoning research since R1 has been unhelpful - because not enough compute to see what matters (underlying model and iterations). * Best stuff has been simple tweaks to GRPO like overlong filtering and removing KL divergence. * Far too much focus on MATH and code - AIME has tens of samples too so is very noisy. * People are generally building the wrong kind of environments - like puzzles, games etc - instead of thinking about what kind of new capabilities they’d like to incentivise emerging. 50:20 Research people aren’t paying enough attention to The research area I hear the most about right now is “rubrics” – a per-prompt specialized LLM-as-a-judge to replace reward models. SemiAnalysis reported OpenAI scaling this approach and lots of great research is coming out around it. I start with: What do you think of the state of RL scaling and generalization? What of models losing… And Ross asserts… * Rubrics are underhyped on social media - they were the driving force behind projects like DeepResearch - and GenRMs are interesting but perhaps slightly overhyped. * There is an evals crisis - there are not enough high quality evals, particularly for frontier tasks like automating research and real life work. Impediment to anyone building agents or ASI. 01:02:46 Extra stuff! I ask Ross: What AI are you using today?
Why? To conclude, Ross wanted to discuss how AlphaEvolve has been underhyped on social media and means the future isn’t just RL; it shows there are other effective ways to use inference compute. Interconnects is a reader-supported publication. Consider becoming a subscriber. Transcript created with AI, pardon the minor typos; not quite enough time this week, but I’m hiring someone to help with this soon! Nathan Lambert: Hey, Ross. How's it going? Welcome back to Interconnects. I took a many-month break from podcasting. I've been too busy to do all this stuff myself. Ross Taylor: Yeah, I was trying to think of all the things that happened since the last time we did a podcast a year ago. In AI time, that's like two hundred years. Nathan Lambert: Yeah, so I was looking at it. We talked about reasoning and o1 hadn’t happened yet. For a brief intro, Ross was a co-founder of Papers with Code, and that brought him to Meta. And then at Meta, he was a lead on Galactica, which was a kind of language model ahead of its time relative to ChatGPT. So if people don't know about Galactica, there's a great paper worth reading. And then he was doing a bunch of stuff on reasoning with Llama related to a lot of the techniques that we'll talk about in this episode. And now he's doing a startup. I don't know if he wants to talk about this, but generally, we talk a lot about various things. This got started through o1 and trying to figure out scaling RL. We started talking a lot, but then we also just resonate on a lot of topics on training language models and other fun stuff - and also trying to be one of the few people not in these big labs that tries to talk about this and think about what the heck's going on. So we're gonna kind of roll through a long list of a lot of things that Ross sent me that he wanted to talk about, but this will be a compilation of the things that we've talked about and fleshing them out outside of the Signal chat. So, Ross, if you want to introduce yourself more, you can, or we'll just start talking about news because I think a lot of people already know you. Ross Taylor: Yeah, let's get into the news. There’s lots of fun things to talk about. Nathan Lambert: So, the last two weeks of Chinese models. I think we had Z.ai's GLM 4.5 today. Kimi-K2 last week. I think Qwen is on a roll. I thought summer was supposed to be chill, but this is crazy. I haven't even used all of these. The pace is just incredible. And all the open models have actually good licenses now. But is this going to hurt anyone in the US? Where do you see this going in six months? Ross Taylor: Yeah, so yesterday was the one day I actually tried to turn off Twitter. And so when you told me in the morning about the new GLM model, I had to read up on that. So that shows if you take your eye off Twitter for one second, then you’re behind on open source... But yes, I think the general theme is that it’s been absolutely relentless. So thinking about the last time I spoke to you on the podcast a year ago, Llama 3 was a fairly established standard. There were still things happening in the background, if you paid attention to things, but now it's absolutely relentless. In the case of China, I think their business culture is that - as soon as they find something is successful - they’re very good at concentrating resources and going after it. So it’s created a very competitive space. I think the context is very interesting in several different dimensions. There's the geopolitical dimension, which you've hinted at in some of your blogs.
For example, what does it mean if the open source standard is Chinese? What does that mean if we think about these models not just as things which power products, but as (critical) infrastructure? Then it seems like China has a great advantage if they want to be the standard for the whole Global South. Nathan Lambert: Yeah. There are a few things that we're going to come back to in this conversation that are so interesting. We're going to roll into what it takes to train these models. And we're going to talk about how crazy, political, and hard it is in the US. But we have all these orgs popping up in China - so is this partially just a US problem? But then we also have OpenAI that's supposedly going to release a model. There are multiple things. But my question is: why is China doing so well? Are they well suited to training these language models? Ross Taylor: I’ll caveat what I’m about to say by saying that I want to be careful about making generalisations. Because, for example, we’ve seen some of these new Chinese organisations be good at innovation - for example, this week we had GSPO, which was nice. But for Chinese orgs, my general sense is that, once something has already been validated, the specification for what to build has been set, and the task can be reduced to an engineering problem, then Chinese culture is very well set up to succeed in those situations. The other dimension which has become relevant - especially after DeepSeek - is that the Chinese Government has traditionally been very good at recognising what’s successful, pouring resources in, and facilitating public-private collaborations. I think that still surprises people in the West. For example, people are surprised that a group can come out of Tsinghua and fairly quickly have their own state-of-the-art LLM. Why isn’t there a similar story for groups coming out of MIT? Nathan Lambert: I’m not sure about this. Ross Taylor: I think the US will eventually wake up to this, but… Nathan Lambert: My understanding is that Z.ai is a startup that spun out of Tsinghua, so I don’t know if it’s the best comparison. Also, Alibaba is the clear winner here because they have Qwen, but they’ve also invested in Moonshot, which is Kimi, and then I think also Z.ai. So I’m more interested in the question of why they are all open. That seems more important relative to talent, because there are lots of universities that might have model orgs spinning out of them - even in the US - and it’s not solely a Chinese thing. I think it could happen with a group out of MIT. That being said, I agree that the US should have more compute deployed for academics, and a lot of universities are just spinning that up now. It just takes a long time. So I think there are a lot of things that Twitter is mixing up here.

    1h 15m
  7. 23 JUL

    The White House's plan for open models & AI research in the U.S.

    Today, the White House released its AI Action Plan, the document we’ve been waiting for to understand how the new administration plans to achieve “global dominance in artificial intelligence (AI).” There’s a lot to unpack in this document, which you’ll be hearing a lot about from the entire AI ecosystem. This post covers one narrow piece of the puzzle — its limited comments on open models and AI research investment. For some context, I was a co-author on the Ai2 official comment to the Office of Science and Technology Policy (OSTP) for the AI Action Plan and have had some private discussions with White House staff on the state of the AI ecosystem. A focus of mine throughout this document is how the government can enable better fully open models to exist, rather than just more AI research in general. We’re in a shrinking time window: if we don’t create better fully open models, the academic community could be left with a lot of compute for doing research on models that do not reflect the frontier of performance and behavior. This is why I give myself ~18 months to finish The American DeepSeek Project. Important context for this document is what the federal government can actually do to make changes here. The executive branch has limited levers it can pull to disperse funding and make rules, but it sends important signals to the rest of the government and the private sector. Overall, the White House AI Action Plan states very clearly that we should increase investment in open models, and for the right reasons. This reflects a shift from previous federal policy, where the Biden executive order had little to say about open models other than grouping them with models needing pre-release testing if they were trained with more than 10^26 FLOPS (which led to substantial discussion on the general uselessness of compute thresholds as a policy intervention). Later, the National Telecommunications and Information Administration (NTIA), under the umbrella of the Biden Administration, released a report that was far more positive on open models, but far more limited in its ability to set the agenda. This post is formatted as comments inline with the full text on open models and related topics in the action plan. Let’s dive in; any emphasis in italics is mine. Encourage Open-Source and Open-Weight AI Open-source and open-weight AI models are made freely available by developers for anyone in the world to download and modify. Models distributed this way have unique value for innovation because startups can use them flexibly without being dependent on a closed model provider. They also benefit commercial and government adoption of AI because many businesses and governments have sensitive data that they cannot send to closed model vendors. And they are essential for academic research, which often relies on access to the weights and training data of a model to perform scientifically rigorous experiments. This covers three things we’re seeing play out with open models and is quite sensible as an introduction: * Startups use open models to a large extent because pretraining their own models is expensive and modifying the model layer of the stack can provide a lot of flexibility with low serving costs. Today, most of this happens on Qwen at startups, while larger companies are more hesitant to adopt Chinese models. * Open model deployments are slowly building up around sensitive data domains such as health care.
* Researchers need strong and transparent models to perform valuable research. This is the one I’m most interested in, as it has the highest long-term impact: it determines the fundamental pace of progress in the research community. We need to ensure America has leading open models founded on American values. Open-source and open-weight models could become global standards in some areas of business and in academic research worldwide. For that reason, they also have geostrategic value. While the decision of whether and how to release an open or closed model is fundamentally up to the developer, the Federal government should create a supportive environment for open models. The emphasized section is entirely the motivation behind ongoing efforts for The American DeepSeek Project. The interplay between the three groups above is inherently geopolitical, where Chinese model providers are actively trying to develop mindshare with Western developers and release model suites that offer great tools for research (e.g. Qwen). The document highlights why fewer open models are coming from leading Western AI companies right now: simply, “the decision of whether and how to release an open or closed model is fundamentally up to the developer.” This means the government itself can mostly just stay out of the way of leading labs releasing models, if we think the artifacts will come from the likes of Anthropic, OpenAI, Google, etc. The other side of this is that we need to invest in building organizations around releasing strong open models for certain use cases that do not have economic conflicts or different foci. Onto the policy steps. Recommended Policy Actions * Ensure access to large-scale computing power for startups and academics by improving the financial market for compute. Currently, a company seeking to use large-scale compute must often sign long-term contracts with hyperscalers—far beyond the budgetary reach of most academics and many startups. America has solved this problem before with other goods through financial markets, such as spot and forward markets for commodities. Through collaboration with industry, the National Institute of Standards and Technology (NIST) at the Department of Commerce (DOC), the Office of Science and Technology Policy (OSTP), and the National Science Foundation’s (NSF) National AI Research Resource (NAIRR) pilot, the Federal government can accelerate the maturation of a healthy financial market for compute. The sort of issue the White House is alluding to here is that if you want 1,000 GPUs as a startup or research laboratory, you often need to sign a 2-3 year commitment to get low prices. Market prices for on-demand GPUs tend to be higher. The goal here is to make it possible for people to get the GPU chunks they need through financial incentives. We’ve already seen a partial step toward this in the recent budget bill, where AI training costs can now be classified as R&D expenses, but this largely helps big companies. Actions here that are even more beneficial for small groups releasing open-weight or open-source models would be great to see. One of the biggest problems I see for research funding is going to be the challenge of getting concentrated compute into the hands of researchers, so I hope the administration follows through on concentrating compute in a few places. A big pool of compute spread thinly across the entire academic ecosystem means too little compute at any one location to train strong models.
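To make the budget mismatch concrete, here is a minimal back-of-envelope sketch; the $/GPU-hour rates and contract lengths are purely illustrative assumptions, not quotes from any provider.

```python
# Hypothetical illustration of why committed pricing is out of reach for small groups.
# All rates and durations below are made-up placeholders for the sake of the arithmetic.

GPUS = 1_000
HOURS_PER_DAY = 24

on_demand_rate = 4.00  # $/GPU-hour, assumed on-demand price
reserved_rate = 2.00   # $/GPU-hour, assumed price with a multi-year commitment

# A research group might only need the cluster for a ~3-month training project...
project_days = 90
on_demand_project = GPUS * HOURS_PER_DAY * project_days * on_demand_rate
reserved_project = GPUS * HOURS_PER_DAY * project_days * reserved_rate

# ...but the lower rate typically comes attached to a ~2-year commitment.
commitment_days = 2 * 365
commitment_total = GPUS * HOURS_PER_DAY * commitment_days * reserved_rate

print(f"3-month project at on-demand rates: ${on_demand_project / 1e6:.1f}M")  # ~$8.6M
print(f"3-month project at committed rates: ${reserved_project / 1e6:.1f}M")   # ~$4.3M
print(f"Total cost of the 2-year commitment: ${commitment_total / 1e6:.1f}M")  # ~$35.0M
```

The cheaper per-hour rate only helps if you can absorb the full commitment, which is the gap a spot or forward market for compute is meant to close.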
It reads as if the OSTP understands this and has provided suitable guidance. * Partner with leading technology companies to increase the research community’s access to world-class private sector computing, models, data, and software resources as part of the NAIRR pilot. * Build the foundations for a lean and sustainable NAIRR operations capability that can connect an increasing number of researchers and educators across the country to critical AI resources. This is simple and, to my knowledge, has largely been underway. NAIRR provided a variety of resources to many academic parties, such as API credits, data, and compute access, so it should be expanded upon. I wrote an entire piece on saving the NAIRR last November when its funding future was unclear (and needed Congressional action). This is the balance to what I was talking about above on model training. It provides smaller resource chunks to many players, which is crucial, but doesn’t address the problem of building great open models. * Continue to foster the next generation of AI breakthroughs by publishing a new National AI Research and Development (R&D) Strategic Plan, led by OSTP, to guide Federal AI research investments. This seems like a nod to a logical next step. While the overall picture of research funding in the U.S. has been completely dire, the priority on AI research has already been expressed: AI is the only NSF grant area without major cuts. There are likely to be many other direct effects of this, but they are out of scope for this article. More exact numbers can be found in the NSF 2026 proposed budget, where AI is an outlier as one of the only topics with a positive net change from 2024 or 2025. * Led by DOC through the National Telecommunications and Information Administration (NTIA), convene stakeholders to help drive adoption of open-source and open-weight models by small and medium-sized businesses. This is a more unexpected line item, but a welcome one. It’ll be harder to implement, but if it works it’ll do a lot of good for building momentum around open model investment. A large part of why few open models exist in the U.S. is simply that there’s not a lot of business value in releasing them. A big story of 2025 has been how open models are closing the gap in capabilities, or at least crossing important ability thresholds, which could start to change this equilibrium. That’s it for the core section on open models! It’s right to the point. There are a couple of related sections I wanted to point you to, which largely complement the above or show how hard it is for a document like this to acknowledge things like “our R&D ecosystem is being outcompeted by Chinese models.” First, more on AI research itself. Advance the Science of AI Just as LLMs and generative AI systems represented a paradigm shift in the

    13 min
  8. 14 JUL

    Kimi K2 and when "DeepSeek Moments" become normal

    https://www.interconnects.ai/p/kimi-k2-and-when-deepseek-moments The DeepSeek R1 release earlier this year was more of a prequel than a one-off fluke in the trajectory of AI. Last week, a Chinese startup named Moonshot AI dropped Kimi K2, an open model that is permissively licensed and competitive with leading frontier models in the U.S. If you're interested in the geopolitics of AI and the rapid dissemination of the technology, this is going to represent another "DeepSeek moment" where much of the Western world — even people who consider themselves up to date on AI — will need to change their expectations for the coming years. In summary, Kimi K2 shows us that: * High-Flyer, the organization that built DeepSeek, is far from a uniquely capable AI laboratory in China, * China is continuing to approach (or has reached) the absolute frontier of modeling performance, and * The West is falling even further behind on open models. Kimi K2, described as an "Open-Source Agentic Model," is a sparse mixture of experts (MoE) model with 1T total parameters (~1.5x DeepSeek V3/R1's 671B) and 32B active parameters (similar to DeepSeek V3/R1's 37B). It is a "non-thinking" model, meaning it doesn't generate a long reasoning chain before answering, but it was still trained extensively with reinforcement learning, and it posts leading performance numbers in coding and related agentic tasks (earning it many comparisons to Claude 3.5 Sonnet). It clearly outperforms DeepSeek V3 on a variety of benchmarks, including SWE-Bench, LiveCodeBench, AIME, and GPQA, and a base model was released as well. It is the new best-available open model by a clear margin. These facts, together with the points above, have useful parallels for what comes next: * Controlling who can train cutting-edge models is extremely difficult. More organizations will join this list of OpenAI, Anthropic, Google, Meta, xAI, Qwen, DeepSeek, Moonshot AI, etc. Where there is a concentration of talent and sufficient compute, excellent models are very possible. This is easier to do somewhere such as China or Europe where there is existing talent, but it is not restricted to these localities. * Kimi K2 was trained on 15.5T tokens and has a very similar number of active parameters to DeepSeek V3/R1, which was trained on 14.8T tokens (a rough back-of-envelope compute comparison is sketched after this list). Better models are being trained without substantial increases in compute — these are referred to as a mix of "algorithmic gains" or "efficiency gains" in training. Compute restrictions will certainly slow this pace of progress for Chinese companies, but they are clearly not a binary on/off bottleneck on training. * The gap between the leading open models from Western research labs and their Chinese counterparts is only growing. The best open model from an American company is, maybe, Llama-4-Maverick? Three Chinese organizations have released more useful models with more permissive licenses: DeepSeek, Moonshot AI, and Qwen. This comes at the same time that new inference-heavy products are coming online that'll benefit from the potential of cheaper, lower-margin hosting options on open models relative to API counterparts (which tend to have high profit margins).
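As a rough sanity check on the claim that these gains are not coming from substantially more compute, here is a minimal sketch using the common C ≈ 6·N·D rule of thumb for pretraining FLOPs, with N as active parameters per token and D as training tokens; it is a simplification that ignores attention costs and MoE routing overhead, and the figures come from the parameter and token counts quoted above.

```python
# Back-of-envelope pretraining compute, using the common C ≈ 6 * N * D approximation
# (N = active parameters per token, D = training tokens). This deliberately ignores
# attention FLOPs and MoE routing overhead; it is only meant to show that "1T total
# parameters" does not translate into 1T-parameter training costs for a sparse MoE.

def approx_pretrain_flops(active_params: float, tokens: float) -> float:
    return 6 * active_params * tokens

kimi_k2 = approx_pretrain_flops(active_params=32e9, tokens=15.5e12)      # ~3.0e24 FLOPs
deepseek_v3 = approx_pretrain_flops(active_params=37e9, tokens=14.8e12)  # ~3.3e24 FLOPs

print(f"Kimi K2:     {kimi_k2:.2e} FLOPs")
print(f"DeepSeek V3: {deepseek_v3:.2e} FLOPs")
print(f"K2 / V3 ratio: {kimi_k2 / deepseek_v3:.2f}")  # ~0.91, i.e. comparable compute
```

On this crude estimate, K2's pretraining compute lands in the same ballpark as (slightly below) DeepSeek V3's despite the much larger total parameter count, which is the sense in which the improvements read as algorithmic rather than as raw scale.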
Kimi K2 is set up for a much slower-building "DeepSeek moment" than the DeepSeek R1 release in January of this year because it lacks two culturally salient factors: * DeepSeek R1 was revelatory because it was the first model to expose the reasoning trace to its users, causing massive adoption outside of the technical AI community, and * The broader public is already aware that training leading AI models is actually very low cost once the technical expertise is built up (recall the DeepSeek V3 $5M training cost number), i.e., the final training run is cheap, so there should be a smaller reaction to similar cheap training cost numbers in the Kimi K2 report coming soon. Still, as more noise is created around the K2 release (Moonshot will release a technical report soon), this could evolve very rapidly. We've already seen quick experiments slotting it into the Claude Code application (because Kimi's API is Claude-compatible) and K2 topping many "vibe tests" and creativity benchmarks. There are also tons of fun technical details that I don't have time to go into — from using the relatively unproven Muon optimizer to scaling up the self-rewarding LLM-as-a-judge pipeline in post-training. A fun tidbit to show how much this matters relative to the noisy Grok 4 release last week is that Kimi K2 has already surpassed Grok 4 in API usage on the popular OpenRouter platform. Later in the day on the 11th, following the K2 release, OpenAI CEO Sam Altman shared the following message regarding OpenAI's forthcoming open model (which I previously shared more optimistic thoughts on here): we planned to launch our open-weight model next week. we are delaying it; we need time to run additional safety tests and review high-risk areas. we are not yet sure how long it will take us. while we trust the community will build great things with this model, once weights are out, they can’t be pulled back. this is new for us and we want to get it right. sorry to be the bearer of bad news; we are working super hard! Many saw this as a reactive move by OpenAI to get out from the shadow of Kimi K2's wonderful release and another DeepSeek media cycle. Even though someone at OpenAI shared with me that the rumor that Kimi caused the delay of their open model is very likely not true, this is what being on the back foot looks like. When you're on the back foot, narratives like this are impossible to control. We need leaders at the closed AI laboratories in the U.S. to rethink some of the long-term dynamics they're battling in R&D and adoption. We need to mobilize funding for great, open science projects in the U.S. and Europe. Until then, this is what losing looks like if you want the West to be the long-term foundation of AI research and development. Kimi K2 has shown us that one "DeepSeek moment" wasn't enough for us to make the changes we need, and hopefully we don't need a third. This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

    7 min
