Happy New Year! You may have noticed that in 2025 we moved toward YouTube as our primary podcasting platform. As we’ll explain in the next State of Latent Space post, we’ll be doubling down on Substack again and improving the experience for the over 100,000 of you who look out for our emails and website updates!

We first mentioned Artificial Analysis in 2024, when it was still a side project in a Sydney basement. They went on to be one of the few AI Grant companies to raise a full seed round from Nat Friedman and Daniel Gross, and have since become the independent gold standard for AI benchmarking: trusted by developers, enterprises, and every major lab to navigate the exploding landscape of models, providers, and capabilities.

We have chatted with both Clémentine Fourrier of Hugging Face’s Open LLM Leaderboard and Anastasios Angelopoulos of LMArena (freshly valued at $1.7B) about their approaches to LLM evals and trendspotting, but Artificial Analysis have staked out an enduring and important place in the toolkit of the modern AI Engineer by doing the best job of independently running the most comprehensive set of evals across the widest range of open and closed models, and charting their progress for broad industry-analyst use.

George Cameron and Micah Hill-Smith have spent two years building Artificial Analysis into the platform that answers the questions no one else will: Which model is actually best for your use case? What are the real speed-cost trade-offs? And how open is “open” really?

We discuss:

* The origin story: built as a side project in 2023 while Micah was building a legal AI assistant, launched publicly in January 2024, and went viral after swyx’s retweet
* Why they run evals themselves: labs prompt models differently, cherry-pick chain-of-thought examples (Google Gemini 1.0 Ultra used 32-shot prompts to beat GPT-4 on MMLU), and self-report inflated numbers
* The mystery shopper policy: they register accounts under domains other than their own and run intelligence and performance benchmarks incognito, to prevent labs from serving different models on private endpoints
* How they make money: an enterprise benchmarking and insights subscription (standardized reports on model deployment: serverless vs. managed vs. leasing chips) and private custom benchmarking for AI companies (no one pays to be on the public leaderboard)
* The Intelligence Index (V3): synthesizes 10 eval datasets (MMLU, GPQA, agentic benchmarks, long-context reasoning) into a single score, with 95% confidence intervals via repeated runs (a toy sketch of the aggregation follows this list)
* Omissions Index (hallucination rate): scores models from -100 to +100, penalizing incorrect answers and rewarding “I don’t know” (see the second sketch below); Claude models lead with the lowest hallucination rates despite not always being the smartest
* GDP Val AA: their version of OpenAI’s GDPval (44 white-collar tasks with spreadsheets, PDFs, and PowerPoints), run through their Stirrup agent harness (up to 100 turns, code execution, web search, file system), graded by Gemini 3 Pro as an LLM judge (tested extensively, no self-preference bias)
* The Openness Index: scores models 0-18 on transparency of pre-training data, post-training data, methodology, training code, and licensing (AI2 OLMo 2 leads, followed by Nous Hermes and NVIDIA Nemotron)
* The smiling curve of AI costs: GPT-4-level intelligence is 100-1000x cheaper than at launch (thanks to smaller models like Amazon Nova), but frontier reasoning models in agentic workflows cost more than ever (sparsity, long context, multi-turn agents)
* Why sparsity might go way lower than 5%: GPT-4.5 is ~5% active, Gemini models might be ~3%, and Omissions Index accuracy correlates with total parameters (not active), suggesting massive sparse models are the future
* Token efficiency vs. turn efficiency: GPT-5 costs more per token but solves Tau-bench in fewer turns (cheaper overall), and models are getting better at using more tokens only when needed (5.1 Codex has tighter token distributions)
* V4 of the Intelligence Index coming soon: adding GDP Val AA, Critical Point, and hallucination rate, and dropping some saturated benchmarks (HumanEval-style coding is now trivial for small models)
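To make the Intelligence Index idea concrete, here is a minimal sketch of how a composite score with repeated-run confidence intervals can be computed. It is purely illustrative: the eval names, scores, equal weighting, and normal-approximation interval are our assumptions for the example, not Artificial Analysis’ actual methodology or weights.

```python
# Toy illustration: average several eval scores into one index, and estimate a
# 95% confidence interval from repeated runs of the whole suite.
from statistics import mean, stdev

def intelligence_index(run: dict[str, float]) -> float:
    """One run = {eval_name: score on a 0-100 scale}; here the index is a plain mean."""
    return mean(run.values())

def index_with_ci(runs: list[dict[str, float]]) -> tuple[float, float]:
    """Return (mean index, 95% CI half-width) across repeated runs (normal approximation)."""
    scores = [intelligence_index(r) for r in runs]
    half_width = 1.96 * stdev(scores) / len(scores) ** 0.5
    return mean(scores), half_width

# Hypothetical eval names and scores, purely for illustration.
runs = [
    {"MMLU-Pro": 81.2, "GPQA Diamond": 59.0, "Agentic": 44.1, "Long Context": 62.3},
    {"MMLU-Pro": 80.7, "GPQA Diamond": 60.4, "Agentic": 43.2, "Long Context": 63.0},
    {"MMLU-Pro": 81.5, "GPQA Diamond": 58.1, "Agentic": 45.0, "Long Context": 61.8},
]
score, ci = index_with_ci(runs)
print(f"Intelligence Index: {score:.1f} ± {ci:.1f} (95% CI)")
```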
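And here is a toy version of an abstention-aware scoring rule in the spirit of the Omissions Index: correct answers earn credit, “I don’t know” is neutral, and confident wrong answers are penalized, with the total rescaled to the -100 to +100 range. The function name, the equal +1/-1 weights, and the example data are our own assumptions; the exact weighting Artificial Analysis uses may differ.

```python
# Toy illustration of an abstention-aware (hallucination-penalizing) score.
def omissions_style_score(answers: list[str], labels: list[str]) -> float:
    total = 0
    for answer, label in zip(answers, labels):
        if answer == "I don't know":
            total += 0   # abstaining is never punished
        elif answer == label:
            total += 1   # correct answer earns credit
        else:
            total -= 1   # confident but wrong answer (hallucination) is penalized
    return 100 * total / len(answers)  # rescale to -100..+100

# Example: 2 correct, 1 abstention, 1 hallucination -> (1 + 0 + 1 - 1) / 4 * 100 = 25.0
answers = ["Canberra", "I don't know", "Paris", "Mars"]
labels  = ["Canberra", "Wellington",   "Paris", "Venus"]
print(omissions_style_score(answers, labels))
```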
Links to Artificial Analysis

* Website: https://artificialanalysis.ai
* George Cameron on X: https://x.com/georgecameron
* Micah Hill-Smith on X: https://x.com/micahhsmith

Full Episode on YouTube

Timestamps

* 00:00 Introduction: Full Circle Moment and Artificial Analysis Origins
* 01:19 Business Model: Independence and Revenue Streams
* 04:33 Origin Story: From Legal AI to Benchmarking Need
* 11:47 Benchmarking Challenges: Variance, Contamination, and Methodology
* 13:52 Mystery Shopper Policy and Maintaining Independence
* 16:22 AI Grant and Moving to San Francisco
* 19:21 Intelligence Index Evolution: From V1 to V3
* 23:01 GDP Val AA: Agentic Benchmark for Real Work Tasks
* 28:01 New Benchmarks: Omissions Index for Hallucination Detection
* 33:36 Critical Point: Hard Physics Problems and Research-Level Reasoning
* 50:19 Stirrup Agent Harness: Open Source Agentic Framework
* 52:43 Openness Index: Measuring Model Transparency Beyond Licenses
* 58:25 The Smiling Curve: Cost Falling While Spend Rising
* 1:02:32 Hardware Efficiency: Blackwell Gains and Sparsity Limits
* 1:06:23 Reasoning Models and Token Efficiency: The Spectrum Emerges
* 1:11:00 Multimodal Benchmarking: Image, Video, and Speech Arenas
* 1:15:05 Looking Ahead: Intelligence Index V4 and Future Directions
* 1:16:50 Closing: The Insatiable Demand for Intelligence

Transcript
Micah [00:00:06]: This is kind of a full circle moment for us in a way, because the first time Artificial Analysis got mentioned on a podcast was you and Alessio on Latent Space. Amazing.

swyx [00:00:17]: Which was January 2024. I don’t even remember doing that, but yeah, it was very influential to me. Yeah, I’m looking at AI News for Jan 17, or Jan 16, 2024. I said, this gem of a models and host comparison site was just launched. And then I put in a few screenshots, and I said, it’s an independent third party. It clearly outlines the quality versus throughput trade-off, and it breaks out by model and hosting provider. I did give you s**t for missing Fireworks, like how do you have a model benchmarking thing without Fireworks? But you had Together, you had Perplexity, and I think we just started chatting there. Welcome, George and Micah, to Latent Space. I’ve been following your progress. Congrats on... It’s been an amazing year. You guys have really come together to be the presumptive new Gartner of AI, right? Which is something that...

George [00:01:09]: Yeah, but you can’t pay us for better results.

swyx [00:01:12]: Yes, exactly.

George [00:01:13]: Very important.

Micah [00:01:14]: Start off with a spicy take.

swyx [00:01:18]: Okay, how do I pay you?

Micah [00:01:20]: Let’s get right into that.

swyx [00:01:21]: How do you make money?

Micah [00:01:24]: Well, very happy to talk about that. So it’s been a big journey the last couple of years. Artificial Analysis is going to be two years old in January 2026, which is pretty soon now. First, we run the website for free, obviously, and give away a ton of data to help developers and companies navigate AI and make decisions about models, providers, and technologies across the AI stack for building stuff. We’re very committed to doing that and intend to keep doing that. We have, along the way, built a business that is working out pretty sustainably. We’ve got just over 20 people now and two main customer groups. So we want to be... We want to be who enterprises look to for data and insights on AI, so we want to help them with their decisions about models and technologies for building stuff. And then on the other side, we do private benchmarking for companies throughout the AI stack who build AI stuff. So no one pays to be on the website. We’ve been very clear about that from the very start, because there’s no use doing what we do unless it’s independent AI benchmarking. Yeah. But it turns out a bunch of our stuff can be pretty useful to companies building AI stuff.

swyx [00:02:38]: And is it like, I am a Fortune 500, I need advisors on objective analysis, and I call you guys and you pull up a custom report for me, you come into my office and give me a workshop? What kind of engagement is that?
George [00:02:53]: So we have a benchmarking and insights subscription, which looks like standardized reports that cover key topics or key challenges enterprises face when looking to understand AI and choose between all the technologies. So, for instance, one of the reports is a model deployment report: how to think about choosing between serverless inference, managed deployment solutions, or leasing chips and running inference yourself. That’s an example of the kind of decision big enterprises face, and it’s hard to reason through; this AI stuff is really new to everybody. And so with our reports and insights subscription we try to help companies navigate that. We also do custom private benchmarking. And so that’s very different from the public benchmarking that we publicize, and there’s no commercial model around that. For private benchmarking, we’ll at times create benchmarks, run benchmarks to specs that enterprises want. And we’ll also do that sometimes for AI companies who have built things, and we help them understand what they’ve built with private benchmarking. Yeah. So that’s a piece mainly that we’ve developed through trying to support everybody publicly with our public benchmarks. Yeah.

swyx [00:04:09]: Let’s talk about the tech stack behind that. But okay, I’m going to rewind all the way to when you guys started this project. You were all the way in Sydney?

Micah [00:04:19]: Yeah. Well, Sydney, Australia for me. George was in SF; he’s Australian, but he moved here already. Yeah.

swyx [00:04:22]: And I remember I had the Zoom call with you. What was the impetus for starting