Rooted Layers

AI insights grounded in research

Rooted Layers is about AI insights grounded in research. I blog about AI research, agents, the future of deep learning, and cybersecurity. Main publication at https://lambpetros.substack.com/

Episodes

  1. 15 JAN

    The Transformer Attractor

    In 2023, Mamba promised to replace attention with elegant state-space math that scaled linearly with context. By 2024, the authors had rewritten the core algorithm to use matrix multiplications instead of scans. Their paper explains why: “We restrict the SSM structure to allow efficient computation via matrix multiplications on modern hardware accelerators.” The architecture changed to fit the hardware. The hardware did not budge.

    This is not a story about hardware determinism. It is a story about convergent evolution under economic pressure. Over the past decade, Transformers and GPU silicon co-evolved into a stable equilibrium—an attractor basin from which no alternative can escape without simultaneously clearing two reinforcing gates. The alternatives that survive do so by wearing the Transformer as a disguise: adopting its matrix-multiplication backbone even when their mathematical insight points elsewhere.

    The thesis: the next architectural breakthrough will not replace the Transformer. It will optimize within the Transformer’s computational constraints, because those constraints are no longer just technical—they are economic, institutional, and structural.

    The Two-Gate Trap
    Every alternative architecture must pass through two reinforcing gates.

    Gate 1: Hardware Compatibility. Can your architecture efficiently use NVIDIA’s Tensor Cores—the specialized matrix-multiply units that deliver 1,000 TFLOPS on an H100? If not, you pay a 10–100× compute tax. At frontier scale ($50–100M training runs), that tax is extinction.

    Gate 2: Institutional Backing. Even if you clear Gate 1, you need a major lab to make it their strategic bet. Without that commitment, your architecture lacks large-scale validation, production tooling, ecosystem support, and the confidence signal needed for broader adoption.

    Why the trap is stable: these gates reinforce each other. Poor hardware compatibility makes institutional bets unattractive (too risky, too expensive). Lack of institutional backing means no investment in custom kernels or hardware optimization, keeping Gate 1 friction permanently high. At frontier scale, breaking out requires changing both simultaneously—a coordination problem no single actor can solve. The alternatives that survive do so by optimizing within the Transformer’s constraints rather than fighting them.

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit lambpetros.substack.com
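
    To make the compute-tax arithmetic concrete, here is a minimal back-of-the-envelope sketch. Only the $50–100M run size and the 10–100× penalty range come from the episode; treating the penalty as a straight cost multiplier is an illustrative simplification.

        # Illustrative arithmetic only: assumes a $50M Transformer-equivalent training
        # run and a flat efficiency penalty for an architecture that cannot use
        # Tensor Cores. Real penalties vary by kernel quality and hardware generation.
        def training_cost_with_tax(baseline_cost_usd: float, compute_tax: float) -> float:
            """Cost of reaching the same loss when each useful FLOP is
            `compute_tax` times more expensive on matmul-centric hardware."""
            return baseline_cost_usd * compute_tax

        baseline = 50_000_000            # low end of the quoted frontier-scale range
        for tax in (1, 10, 100):         # the 10-100x range quoted above, plus parity
            print(f"{tax:>3}x tax -> ${training_cost_with_tax(baseline, tax):,.0f}")
        #   1x tax -> $50,000,000
        #  10x tax -> $500,000,000
        # 100x tax -> $5,000,000,000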

    25 min
  2. 29/12/2025

    The Orchestration Paradigm: Issue 4 - The Reality

    🎙️ Episode: The Reality – Why Agents Bankrupt Production

    In this series finale, we leave the research lab and enter the war room. We trace the lineage of agentic AI from Chain-of-Thought to ToolOrchestra, map the terrifying "Unsolved Frontiers" preventing full autonomy, and conduct a brutal audit of what happens when you deploy this to production. This episode isn't for the dreamers. It's for the builders.

    Topics Covered:
    * The "Dreamer vs. Builder" Gap
    * Lineage: From "Brains in Jars" (CoT) to "Managers" (Compound AI)
    * Unsolved Frontier 1: Recursive Orchestration (Why the VP gets blamed for the intern's mistake)
    * Unsolved Frontier 2: Tool Synthesis (The capability to write your own tools)
    * Production Nightmare: The Cost Attack (Denial of Wallet)
    * The Breakeven Math: Why you lose money until 75k queries/month (see the sketch after this description)
    * The 4 Gates: Determining if your team is ready to build this

    Key Takeaways:
    * The Moat is the Factory: The model weights don't matter; the synthetic data pipeline that built them does.
    * The "Latency Tail" Kills: In a compound system, P99 latency is cumulative. One flaky tool destroys the entire user experience.
    * The Decision Tree: Do not build an orchestrator unless you pass the Volume Gate (>75k queries) and the Team Gate (>3 ML Engineers).

    References:
    * Su et al. (2025) - The ToolOrchestra Paper
    * Sculley et al. (2015) - Hidden Technical Debt in ML Systems
    * Dean et al. (2013) - The Tail at Scale

    Catch up on The Orchestration Paradigm series:
    * Issue 1: The Algorithm – (GRPO, Outcome Supervision, and the math of thinking)
    * Issue 2: The Factory – (Synthetic Data Pipelines, 16 H100s, and Benchmarking)
    * Issue 3: The Behavior – (Escalation Ladders, Preference Vectors, and why Agents give up)
    * Issue 4: The Reality – (Production Risks, Unit Economics, and Unsolved Frontiers)

    How to Consume This Series:
    * 📺 Video: Acts as a TL;DR
    * 🎧 Audio: The Deep Explainer going into the weeds of the paper.
    * 📄 Written Post: Lies BETWEEN the two—the technical blueprint for implementation.

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit lambpetros.substack.com
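
    As a companion to the Breakeven Math point above, here is a minimal sketch of the unit economics. The per-query costs and the fixed monthly cost are illustrative assumptions chosen so breakeven lands near the ~75k queries/month figure quoted in the episode; they are not numbers from the paper.

        # Illustrative breakeven model: an orchestrator saves money per query but
        # carries a fixed monthly cost (engineers, infra, eval pipelines).
        FIXED_MONTHLY_COST = 3_000           # assumption: USD/month of overhead
        MONOLITH_COST_PER_QUERY = 0.05       # assumption: frontier model, USD
        ORCHESTRATED_COST_PER_QUERY = 0.01   # assumption: router + cheap workers, USD

        def monthly_profit(queries_per_month: int) -> float:
            savings_per_query = MONOLITH_COST_PER_QUERY - ORCHESTRATED_COST_PER_QUERY
            return savings_per_query * queries_per_month - FIXED_MONTHLY_COST

        breakeven = FIXED_MONTHLY_COST / (MONOLITH_COST_PER_QUERY - ORCHESTRATED_COST_PER_QUERY)
        print(f"Breakeven at ~{breakeven:,.0f} queries/month")   # 75,000 with these numbers
        for q in (10_000, 75_000, 200_000):
            print(q, monthly_profit(q))   # negative below breakeven, positive above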

    31 min
  3. 29/12/2025

    The Orchestration Paradigm: Issue 3 - The Behavior

    Deep Explainer Episode: The Behavior – Debugging the Ghost in the Machine

    If you watch an agent long enough, you see patterns nobody programmed. The "Escalation Ladder," the "Map-Reduce" spray, the "Do-While" loop. These are emergent behaviors. We audit the psychology of the orchestrator, explaining Implicit State Machines and the "Embeddings Trap" that fakes generalization. We are debugging the mind of the machine.

    Topics Covered:
    * Implicit State Machines: How behaviors emerge from the loss landscape
    * Escalation Ladders: The Try-Catch pattern of AI (see the sketch after this description)
    * Preference Learning: Attention Injection vs. Hard Constraints
    * Context Window Tax: The "Death Spiral" of long contexts
    * Generalization Trap: Semantic Similarity vs. Economic Reality

    Key Takeaways:
    * Emergence: Strategies like "Giving Up" are learned, not coded.
    * Soft Control: User preferences (like "Low Cost") are just probabilistic suggestions, not guarantees.
    * Semantic Trap: The model routes to new tools based on description similarity, not verified capability.

    References:
    * Shinn et al. (2023) - Reflexion
    * Liu et al. (2023) - Lost in the Middle
    * Patil et al. (2023) - Gorilla LLM

    Catch up on The Orchestration Paradigm series:
    * Issue 1: The Algorithm – (GRPO, Outcome Supervision, and the math of thinking)
    * Issue 2: The Factory – (Synthetic Data Pipelines, 16 H100s, and Benchmarking)
    * Issue 3: The Behavior – (Escalation Ladders, Preference Vectors, and why Agents give up)
    * Issue 4: The Reality – (Production Risks, Unit Economics, and Unsolved Frontiers)

    How to Consume This Series:
    * 📺 Video: Acts as a TL;DR of the post
    * 🎧 Audio: The Deep Explainer going into the weeds of the current topic. Click on the audio toggle next to the video, or look it up as a podcast on all major platforms.
    * 📄 Written Post: Lies BETWEEN the two—the technical blueprint for implementation.

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit lambpetros.substack.com
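
    The Escalation Ladder above is a behavior the orchestrator learns, but the control-flow pattern itself is easy to illustrate. A minimal sketch with hypothetical tool names and cost figures (none of these come from the episode); in practice the rungs and thresholds would be learned, not hard-coded:

        # Hypothetical escalation ladder: try the cheapest capable tool first and
        # escalate only on failure. Hard-coded here purely to illustrate the shape.
        from typing import Callable, Optional

        LADDER: list[tuple[str, float]] = [
            ("local_cache_lookup", 0.0001),   # hypothetical rung: nearly free
            ("small_model_8b",     0.001),    # hypothetical rung: cheap generation
            ("search_api",         0.01),     # hypothetical rung: external retrieval
            ("frontier_model",     0.05),     # hypothetical rung: expensive fallback
        ]

        def solve(query: str, call_tool: Callable[[str, str], Optional[str]]) -> tuple[Optional[str], float]:
            spent = 0.0
            for tool_name, cost in LADDER:        # walk the ladder bottom-up
                spent += cost
                answer = call_tool(tool_name, query)
                if answer is not None:            # success: stop escalating
                    return answer, spent
            return None, spent                    # "giving up" after the last rung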

    33 min
  4. 29/12/2025

    The Orchestration Paradigm: Issue 2 - The Factory

    NOTE: The video acts as a TL;DR. Click on the audio toggle next to it to get the very detailed podcast explainer.

    While the headlines focus on the 8B model beating GPT-5, the real engineering breakthrough wasn’t the model itself. It was the factory that built it. You can download the model weights tomorrow. You cannot download the synthetic data pipeline that generated the training signal. That is the moat.

    In this second issue, we leave the theoretical blackboard and enter the factory floor. We will analyze the ToolScale synthetic data pipeline that manufactures the training signal, audit the “physics” of benchmarking agents (where “Goodhart’s Law” reigns supreme), and dissect the massive infrastructure requirements—specifically why training stable RL policies requires 16 H100s and specialized gradient accumulation techniques.

    How to Read This Series
    Each part is self-contained. You can read them in order or jump to whichever topic interests you most. Every part ends with an Annotated Bibliography pointing to the primary papers with notes on why each one matters.
    * ML practitioners will learn how to build orchestrated systems.
    * Researchers will find a comprehensive literature review of tool use and compound AI through the lens of one well-executed paper.
    * Technical leaders will get concrete cost and performance trade-offs for evaluating orchestration architectures.
    * Curious minds can understand where AI is heading without needing a PhD to follow along.

    Prerequisites
    This series assumes familiarity with machine learning basics like loss functions and gradient descent, neural network fundamentals including attention and transformers, and Python programming sufficient to read pseudocode. If you’re newer to these topics, Parts 02 and 10 include appendices covering RL and agency fundamentals. Start with Issue 1 for the core thesis, then jump to Issue 4 for strategic implications. If you’re purely interested in business implications, Part 12 has the CTO decision tree and unit economics.

    Issue 2: The Factory | Parts 04, 05, 06

    Part 4: The ToolScale Dataset
    This Part dissects ToolScale, the synthetic data pipeline used to train ToolOrchestra. It attempts to solve the “Ground Truth Bottleneck”: the fact that we don’t know the optimal way to solve most problems. Human labeling is too expensive and slow, while wild data is too noisy. The authors must manufacture data.

    The Ground Truth Bottleneck
    [!NOTE] System Auditor’s Log: In GenAI, data is the new code. The biggest bottleneck for training agents is not compute; it’s the lack of verifiable trajectory data. We have petabytes of text (CommonCrawl), but almost zero logs of “optimal” tool use sequences. Humans don’t write down their thought processes when they use Google.

    The Synthetic Pipeline
    The pipeline operates in two phases, creating a closed loop of generation and verification.
    First, in Phase 1 (Environment Synthesis), they generate the “world.” Instead of letting the agent interact with the live internet, which is unpredictable, they generate thousands of virtual APIs and databases. An LLM creates a SQL database schema (e.g., “Library Management System”), fills that database with fake, consistent rows, and generates Python functions to query this database. Then, in Phase 2 (Task Synthesis), they generate the “problems.” An LLM looks at the database and asks a question like “Who borrowed ‘The Great Gatsby’?” Because the database was synthesized, the system knows the answer: it can execute the SQL query to get the ground truth. This creates a labeled dataset of (Question, Tool_Call, Correct_Answer) pairs. Because the environment is synthetic, the system knows the ground truth, enabling automatic verification at scale without human labelers.

    The “Pass@K” Proxy
    The critical innovation, and the potential flaw, is in how they define “success.” In standard supervised learning, we measure Exact Match to see if the model output the exact string we expected. In tool use, this is too rigid because there are many ways to query a database. ToolOrchestra uses a Pass@8 filtering criterion during data generation. They generate 8 different solution paths for a single problem using a strong teacher model like GPT-4.
    * If 0 paths lead to the correct answer, they discard the problem as unsolvable or broken.
    * If 8 paths lead to the correct answer, they keep the most efficient one.
    * If some paths fail, they keep the successful ones as positive reinforcement samples.

        # The Data Filtering Logic
        # We are optimizing for 'Process Fidelity', not just 'Outcome Accuracy'.
        def filter_training_data(problem, candidate_trajectories):
            valid_trajectories = []
            target_answer = problem.ground_truth
            for traj in candidate_trajectories:
                result = execute_trajectory(traj)
                # Verification: the weak link.
                # We assume strict string matching or simple numeric equality
                # is sufficient to verify the "reasoning".
                if verify(result, target_answer):
                    valid_trajectories.append(traj)
            # Selection Bias Risk:
            # We are selectively training on problems that GPT-4 is GOOD at.
            # If GPT-4 has a systematic blindspot, our orchestrator inherits it.
            if len(valid_trajectories) > 0:
                return select_most_efficient(valid_trajectories)
            return None

    The Verification Gap
    From an auditing perspective, this pipeline introduces Synthetic Bias. First, there is Teacher Bias: the orchestrator can never exceed the reasoning capabilities of the teacher model (GPT-4) that generated the trajectories; it can only become more efficient at executing them. Second, there is Triviality Bias. It is easier to generate verifiable questions about “lookups” (What is the capital of X?) than about “reasoning” (Why did the Roman Empire fall?). This pushes the dataset towards factual retrieval, potentially under-training the “complex reasoning” circuits. The “verifiable ground truth” is a gold standard, but it constrains the domain to problems with singular, verifiable answers. Ambiguous, open-ended tasks, which are often the most valuable, are systematically filtered out.

    Annotated Bibliography
    Chen et al. (2021) - Evaluating Large Language Models Trained on Code (Codex): Introduced the “Pass@k” metric. ToolOrchestra adapts this from “Code Generation” to “Tool Trajectory Generation.”
    Wang et al. (2022) - Self-Instruct: Aligning Language Model with Self Generated Instructions: The blueprint for the “Teacher-Student” synthetic data pipeline. ToolScale is essentially “Self-Instruct” applied to API calls.
    Gudibande et al. (2023) - The False Promise of Imitation Learning: A critical paper (“The Imitation Game”) arguing that training on synthetic data from stronger models helps with style but not actual reasoning capability.

    Part 5: Benchmarks and Evaluation
    Evaluating an orchestrator is harder than evaluating a chatbot. A chatbot is judged on text quality. An orchestrator is judged on state transitions. ToolOrchestra is tested on three primary datasets: Humanity’s Last Exam (HLE), FRAMES, and τ²-Bench. Each targets a different failure mode.

    Metric Gaming and Benchmark Physics
    [!NOTE] System Auditor’s Log: Goodhart’s Law states: “When a measure becomes a target, it ceases to be a good measure.” In agentic AI, benchmarks like MMLU or GSM8K are now effectively part of the training set. ToolOrchestra introduces new benchmarks to prove its worth, but we must scrutinize what exactly is being measured. Is it intelligence, or is it just efficient retrieval?

    Humanity’s Last Exam (HLE) consists of PhD-level questions. Most LLMs fail these not because they can’t write, but because they lack specific domain computations. The benchmark measures Tool Identification: the orchestrator doesn’t solve the physics equation but correctly identifies that WolframAlpha can solve it. The caveat is that this measures the quality of the tools available as much as the orchestrator. If the tool suite lacks a physics engine, the orchestrator fails regardless of its “intelligence.”

    FRAMES tests multi-hop factual reasoning, such as finding the population difference between the cities where two authors were born. This tests context window management, as the system must retrieve both facts, hold them in memory, and perform arithmetic. The failure mode here is “Distractor Injection.” When retrieving information about Author A, the tool might return 5000 tokens of noise. The benchmark implicitly measures the orchestrator’s ability to filter noise or the robustness of its attention mechanism.

    τ²-Bench simulates user interactions with varying preferences. This is the only benchmark that tests the Utility Function, checking if the model actually respects the “Cost vs. Accuracy” tradeoff. The metric is a Utility Score (u) defined as u = α ⋅ 𝕀(correct) − (1 − α) ⋅ cost. This formula explicitly defines the “exchange rate” between accuracy and dollars. A small sketch of this scoring function follows this description.

    The Problem with “Accuracy per Dollar”
    The authors present Accuracy per Dollar as a key metric, but this is potentially misleading. In many production systems, the value of accuracy is non-linear. For example, 99% accuracy on a medical diagnosis task is worth $1M, while 90% accuracy is worth $0 (or negative, due to liability). A linear “Accuracy per Dollar” metric favors systems t
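
    A minimal sketch of the τ²-Bench-style utility score described above. The scoring function mirrors u = α ⋅ 𝕀(correct) − (1 − α) ⋅ cost; the α value and the assumption that cost is pre-normalized into [0, 1] are illustrative, and may differ from the paper's exact setup.

        # Utility score u = alpha * 1[correct] - (1 - alpha) * cost, as described above.
        # alpha encodes the user's preference: alpha -> 1 rewards accuracy only,
        # alpha -> 0 punishes spending only. Cost is assumed normalized to [0, 1].
        def utility_score(correct: bool, cost: float, alpha: float) -> float:
            return alpha * (1.0 if correct else 0.0) - (1.0 - alpha) * cost

        # Example: a correct but expensive answer vs. a cheap wrong one,
        # under an accuracy-leaning preference (alpha = 0.8).
        print(utility_score(correct=True,  cost=0.6, alpha=0.8))   #  0.68
        print(utility_score(correct=False, cost=0.1, alpha=0.8))   # -0.02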

    37 min
  5. 26/12/2025

    The Orchestration Paradigm Series

    The Headline You Probably Missed
    In December 2025, NVIDIA researchers quietly published a paper that challenges the central dogma of modern AI development. Their claim: an 8-billion parameter model outperforms GPT-5 on Humanity’s Last Exam, a PhD-level reasoning benchmark spanning mathematics, sciences, and humanities, while costing 60% less per query. Not through some architectural breakthrough. Not through better training data. Through a deceptively simple idea: teach a small model to coordinate big ones. The paper is called ToolOrchestra, and across 4 thematic issues, I’m going to take you inside every detail of it.

    How to Read This Series
    Each part is self-contained. You can read them in order or jump to whichever topic interests you most. Every part ends with an Annotated Bibliography pointing to the primary papers with notes on why each one matters.
    * ML practitioners will learn how to build orchestrated systems.
    * Researchers will find a comprehensive literature review of tool use and compound AI through the lens of one well-executed paper.
    * Technical leaders will get concrete cost and performance trade-offs for evaluating orchestration architectures.
    * Curious minds can understand where AI is heading without needing a PhD to follow along.

    Prerequisites
    This series assumes familiarity with machine learning basics like loss functions and gradient descent, neural network fundamentals including attention and transformers, and Python programming sufficient to read pseudocode. If you’re newer to these topics, Parts 02 and 10 include appendices covering RL and agency fundamentals. Start with Issue 1 for the core thesis, then jump to Issue 4 for strategic implications. If you’re purely interested in business implications, Part 12 has the CTO decision tree and unit economics.

    The Orchestration Paradigm: Issue 1 - The Algorithm
    Issue 1: The Algorithm | Parts prep, 01, 02, 03
    This bundle covers the economic thesis, the calibration paradox, the RL formulation, and the reward scalarization problem. TL;DR: why we need RL, and how GRPO works. In this first issue, we dissect the economic and mathematical foundations of ToolOrchestra. We explore why “prompting harder” fails due to calibration limits, how an 8B model uses Reinforcement Learning (GRPO) to learn decision-making policies without a critic, and how to design multi-objective reward functions that balance accuracy, cost, and latency. This is the theory layer.

    Part 0: The Router-Worker Architecture
    [!NOTE] System Auditor’s Log: The “ToolOrchestra” system is effectively a specialized distributed system design. It separates the control plane (routing logic) from the data plane (task execution). This series reverse-engineers the paper to understand how this separation is trained, optimized, and deployed.

    The fundamental economic premise of modern AI is arbitrage. If a difficult task costs $0.05 to solve on a frontier model (like GPT-5), but can be solved for $0.005 by a smaller model using the right tool, then the system that effectively routes between them captures that value. That is the engineering definition of “Orchestration.” It is not about “agents” or “reasoning” in the anthropomorphic sense. It is about training a localized policy to optimize a global cost-reward function across a heterogeneous network of compute providers. This series dissects ToolOrchestra, a system that demonstrates this principle.
    Unlike monolithic approaches that try to bake every capability into a single checkpoint, ToolOrchestra uses an 8-billion parameter “Router” to dispatch tasks to specialized “Workers”, including code interpreters, search engines, and massive generic models like GPT-4.

    The Architectural Thesis
    The central claim is that Routing is a distinct capability from Generation. In a standard monolithic setup, the same weights responsible for generating a Shakespearean sonnet are also responsible for deciding whether to use a calculator for 23 * 491. This is inefficient. It wastes high-entropy compute (creative generation) on low-entropy decisions (tool selection). ToolOrchestra decouples this. It trains a dedicated policy network (the 8B Orchestrator) solely to manage the state machine of the problem-solving process.

        # The Economic Thesis of ToolOrchestra
        class SystemAudit:
            def optimize_request(self, task):
                # The Arbitrage Condition:
                # If the cost of routing + specialized execution is less than
                # the cost of naive generation, the system is strictly superior.
                monolith_cost = FrontierModel.estimate_cost(task)  # High fixed cost
                router_cost = LogicModel_8B.inference_cost         # Low fixed cost
                worker_cost = self.router.predict_worker(task).cost
                if (router_cost + worker_cost) < monolith_cost:
                    # Illustrative completion: the original snippet truncates at
                    # this comparison; the branch follows the comment above.
                    return self.router.dispatch(task)
                return FrontierModel.generate(task)

    The paper demonstrates that this decoupled architecture, when trained with Reinforcement Learning (RL), outperforms the monolith on its own benchmarks. The 8B router effectively learns a lookup table of “Task Complexity vs. Tool Capability,” allowing it to solve PhD-level physics problems (via delegation) that it could never solve natively.

    The Engineering Stack
    Building this requires solving four distinct engineering problems, which form the tiers of our analysis.
    * The Control Theory (RL): You cannot train this system with Supervised Fine-Tuning (SFT) alone. SFT teaches the model syntax (how to format a JSON tool call), but it cannot teach strategy (when to call a tool). There is no “ground truth” for the optimal sequence of calls. We examine how Group Relative Policy Optimization (GRPO) solves this by treating tool use as a gradient-free environment.
    * The Scalarization Problem (Rewards): The system must optimize for three conflicting variables: Accuracy, Latency, and Cost. A router that is 100% accurate but costs $50 per query is useless. We look at how Multi-Objective Reward modeling creates a scalar signal that forces the model to “internalize” the cost of its own actions.
    * The Supply Chain (Data): Where do you get the training data? You cannot scrape “reasoning traces” from the web because they don’t exist. We scrutinize the ToolScale pipeline, a synthetic data factory that generates verifiable state-transitions to bootstrap the learner.
    * The Production Reality: Finally, we audit the deployment. Routing logic that works in a controlled benchmark often fails under the distributional shift of production. We analyze the generalization mechanics, how the model handles tool descriptions it has never seen before, and the fragility of relying on prompt-based tool definitions.

    The Road Ahead
    This is not a celebration of the paper; it is an audit. We are looking for the mechanics that make the system work and the dependencies that make it break. We begin in Dive 1 by defining the problem: why can’t we just prompt GPT-4 to do this?

    Annotated Bibliography
    Su et al. (2025) - ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration: The primary paper analyzed in this series. Introduces the 8B orchestrator concept.
    Zaharia et al. (2024) - The Shift from Models to Compound AI Systems: The theoretical foundation for why monolithic scaling is hitting diminishing returns, favoring modular architectures.
    Schick et al. (2023) - Toolformer: Language Models Can Teach Themselves to Use Tools: The precursor to ToolOrchestra, demonstrating self-supervised tool injection. ToolOrchestra expands this to multi-tool and multi-objective settings.

    Part 1: The Fundamental Problem
    The Paradox of Capability vs. Routing
    [!NOTE] System Auditor’s Log: A common objection to orchestration frameworks is: “Why train a small model? Why not just prompt the big model to use tools?” The answer lies in calibration. High-capability generators are often poorly calibrated routers, suffering from “Instrumental Convergence” on their own weights.

    The fundamental problem ToolOrchestra addresses is not a lack of intelligence, but a misallocation of it. Large Language Models (LLMs) are trained to predict the next token. They maximize the likelihood of the training corpus. They are not inherently trained to minimize the computational cost of their answers, nor are they trained to admit ignorance. When you ask a frontier model a question like “What is the square root of 4913?”, its training objective drives it to generate the tokens that represent the answer. It relies on its internal weights. If those weights contain the answer, it succeeds. If they don’t, it hallucinates. The “Orchestrator” exists to interrupt this process. It inserts a decision node before generation: should I use my weights, or should I borrow external compute?

    The Failure of “Prompting Harder”
    One might assume that prompt engineering could solve this. “You are a helpful assistant who uses tools.” In practice, this fails due to Self-Enhancement Bias. Models tend to over-trust their internal parametric knowledge. A model that has “read” the entire internet often believes it knows current stock prices or obscure mathematical constants, simply because those patterns exist somewhere in its weight matrices. Conversely, aggressive prompting (“ALWAYS verify with tools”) leads to Other-Enhancement Bias, where the model wastes money calling search APIs for trivial queries like “Who is the president of the US?” This is a calibration failure. The model’s confidence score (P(token)) is not correlated with its factual accuracy in a way that maps cleanly to a “Tool Use Threshold.”

        # The Calibration Gap
        # Ideally, we want a linear relationship:
        #   High Confidence -> High Probability of Correctness
        #   Low Confidence  -> Low Probability of Correctness
        def should_route(model, query):
            internal_confidence = model.get_perplexity(query)
            # The problem: Frontier models are uncalibrated for self-judgment.
            # They often have high confidence even when wrong (hallucination).
            # This makes 'internal_confidence' a noisy signal for routing.
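
    Issue 1 above leans on GRPO learning a routing policy “without a critic.” Here is a minimal sketch of the group-relative advantage step that replaces the learned value baseline; the group size, reward values, and helper name are illustrative assumptions, not the paper’s exact settings.

        # Critic-free advantage computation in the spirit of GRPO: each sampled
        # trajectory is scored against the mean reward of its own group.
        import statistics

        def group_relative_advantages(rewards: list[float]) -> list[float]:
            """Normalize a group of trajectory rewards into advantages."""
            mean = statistics.mean(rewards)
            std = statistics.pstdev(rewards) or 1.0   # avoid division by zero
            return [(r - mean) / std for r in rewards]

        # Example: 4 sampled tool-use trajectories for the same query,
        # scored with a scalarized accuracy-minus-cost reward.
        rewards = [1.0, 0.0, 1.0, 0.2]
        print(group_relative_advantages(rewards))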

    38 min
  6. 27/11/2025

    Neural Architecture Design as a Compositional Language

    Introduction: The Architect’s Secret
    It’s a common misconception to think of designing a neural network like writing a piece of traditional software. We imagine architects meticulously programming every logical step, crafting a perfect, intricate machine from the ground up. But the reality of modern AI is far more counter-intuitive and fascinating. The most successful AI architectures aren’t rigid blueprints; they are flexible scaffolds. An architect’s job is not to design a flawless machine, but to create a rich environment—a framework of possibilities—where an algorithm can learn and discover solutions on its own. This post explores five of the most surprising and impactful ideas that have shaped this new philosophy, revealing how the field shifted from building models to writing the very language of computation.

    --------------------------------------------------------------------------------

    1. Architectures Aren’t Blueprints; They’re Scaffolds
    The “Scaffold Hypothesis” is a simple but profound idea: an architect’s primary role is to create a space of possible computations where an optimization process like gradient descent can easily find effective solutions. This stands in stark contrast to traditional programming, where a developer specifies exact instructions like “if x > 5, then do y”. In AI architecture, the goal is different. Take the Transformer model’s famous Query-Key-Value (QKV) attention mechanism. The designers didn’t explicitly program the network to “use Q for asking what information is needed, K for signaling what information is available, and V for providing that information.” Instead, they created a mathematical structure where that function was easy to learn. The network discovered this communication protocol on its own through training. This principle reframes the entire design process. The architecture is no longer the final product but the starting point for discovery. The architecture is the hypothesis; training is the proof. This shift explains why simple, repeated structures are so effective. Instead of pre-ordaining a complex function, they provide a flexible, high-dimensional canvas and a simple set of rules, allowing the training process itself to paint the masterpiece.

    --------------------------------------------------------------------------------

    2. Simpler and Deeper Beats Clever and Custom
    Early in deep learning’s history, models like AlexNet were built with custom-designed layers, each with a specific, hand-tuned purpose. The field has since learned a powerful lesson: uniformity and simplicity almost always win. The shift began with models like VGGNet and ResNet, which abandoned complex, varied layers in favor of stacking simple, identical blocks over and over again. This wasn’t just a competition between models; it was a clash of design philosophies. The complex Inception model represented a “clever, parallel committee” approach, where designers offered the model multiple hand-tuned computational paths at once. In contrast, ResNet championed a “brute-force sequential” approach with a single, deep compositional rule. ResNet’s victory proved that a simple, repeatable formula was far more scalable and powerful: simpler, more uniform building blocks win over complex, hand-tuned modules. The power of this principle is staggering. A massive model like GPT-3 is built by repeating the exact same architectural block 96 times. This principle of uniform, repeatable blocks became the bedrock of modern AI.
    But to truly scale it, researchers had to solve a fundamental problem: how do you make a network hundreds of layers deep without the signal getting lost? The answer came in a single, elegant formula.

    --------------------------------------------------------------------------------

    3. A Single, Simple Formula Unlocked Extreme Depth
    A breakthrough paper from 2015 revealed a disarmingly simple formula that would shatter the perceived limits of network depth: the “residual connection.” The formula is just x_out = x_in + F(x_in). This small addition caused a massive paradigm shift. Before ResNet, architects asked, “What should each layer compute?” After ResNet, the question became, “What refinement should each layer add?” By making each layer learn a modification to the input (F(x_in)) rather than an entirely new representation, the model’s task became dramatically easier. This changed the network’s job from replacing information at each layer to accumulating refinements. The final output is no longer the result of a complex transformation, but the sum of the original input plus a series of small, learned adjustments: Output = Input + Block₁(Input) + Block₂(...) + .... This is the foundational compositional operator of modern deep learning. This simple change created “gradient highways” that allowed networks to be built with 100+ layers, enabling the extreme depth that powers today’s most capable models.

    --------------------------------------------------------------------------------

    4. Sophisticated Specialization Emerges From Identical Blocks
    One of the most stunning validations of the “scaffold hypothesis” came from analyzing models like BERT. BERT is constructed from a stack of identical Transformer blocks; architecturally, layer 2 is the same as layer 10. Yet, after training, these identical layers learn to perform distinct and specialized roles. Analysis revealed a clear functional hierarchy that emerged on its own:
    * Early layers: Focus on syntax and surface-level features (e.g., making determiners attend to their nouns).
    * Middle layers: Learn more abstract semantic relationships and entity recognition.
    * Late layers: Handle complex, long-range tasks like coreference resolution (e.g., linking the pronoun “it” back to its subject paragraphs earlier).
    This specialization was not designed by an architect; it was discovered by the training process as an efficient way to solve the problem. The uniform stack of blocks provided the capacity, and optimization found the structure. The architecture didn’t specify this hierarchy—training discovered it! This demonstrates that a well-designed architectural scaffold doesn’t need to encode complex behaviors itself. It only needs to create a space where those behaviors are easy for the model to learn.

    --------------------------------------------------------------------------------

    5. Today’s AI Is Built With Standardized “Lego” Blocks
    The modern process of designing a new state-of-the-art AI model has evolved significantly. Instead of inventing novel architectures from scratch, designers now select high-performing components from a shared, validated library—much like building with a set of standardized Lego blocks. This library includes proven solutions for every part of a model’s core block:
    * Position Encodings: RoPE (Rotary Position Embeddings)
    * Normalizations: RMSNorm
    * MLP Variants: SwiGLU (Swish-Gated Linear Unit)
    A model like Llama 3 is a perfect example of this philosophy.
    It is a stack of 80 identical blocks, where each block is a carefully chosen composition of these best-in-class, off-the-shelf components. This design can be elegantly captured in just a few lines of code:

        class LlamaBlock:
            attention = GroupedQueryAttention(position_encoding=RoPE)
            norm1 = RMSNorm()
            mlp = SwiGLU()
            norm2 = RMSNorm()

        # Stack 80 identical blocks
        model = Stack([LlamaBlock() for _ in range(80)])

    This modularity allows the entire field to advance more rapidly. Researchers can innovate on a single component and instantly plug it into a wide range of existing models. This shifts the focus from bespoke architecture design to engineering better training methods and scaling up proven formulas with confidence.

    --------------------------------------------------------------------------------

    Conclusion: From Building Models to Writing Grammars
    The journey from AlexNet to modern LLMs validates a powerful set of principles. We learned that architectures are scaffolds, not blueprints (1); that simple, deep repetition beats clever customization (2); that a single formula could unlock this depth by reframing computation as refinement (3); that specialization emerges from this uniformity rather than needing to be designed (4); and that this entire process has culminated in a modular, ‘Lego-like’ design philosophy that accelerates the entire field (5). The story of neural architecture design is a journey away from crafting specific models and toward designing compositional languages. The most impactful breakthroughs have consistently been those that provided simpler, more general, and more scalable building blocks. This evolution has validated the core theme: the goal is not to build a machine, but to create a rich scaffold that allows intelligence to emerge. We’re not building neural networks anymore. We’re building grammars of computation, and training is the process of writing programs in those grammars.

    Read the next post in this series: Deep Dive
    A good way to read this long post is to play the podcast audio while following the text under it. It outlines a new “grammar” that emerged over the last 13 years of deep learning research. The post outlines several of the seminal papers and their incremental contributions to the field of AI. So you get a historical flashback as a treat :)

    How the deep learning field evolved from designing specific models to designing languages of reusable components

    Introduction: The Scaffold Hypothesis
    A profound shift is occurring in how we think about neural network architecture. Rather than viewing architecture design as “building the perfect model,” researchers increasingly recognize it as creating computational scaffolds that training discovers how to exploit. This is a departure from traditional software engineering. When you write a program, you specify exactly what happens: if x > 5,
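
    The residual-connection formula from this episode, x_out = x_in + F(x_in), is small enough to show directly. A minimal NumPy sketch; the layer width and the two-layer form of F are illustrative assumptions, not any particular model’s configuration.

        # Minimal residual block: the output is the input plus a learned refinement,
        # x_out = x_in + F(x_in). Width and the MLP form of F are illustrative.
        import numpy as np

        rng = np.random.default_rng(0)
        d = 16
        W1 = rng.normal(scale=0.1, size=(d, d))
        W2 = rng.normal(scale=0.1, size=(d, d))

        def F(x: np.ndarray) -> np.ndarray:
            """The 'refinement' each layer learns (a tiny two-layer MLP here)."""
            return np.maximum(x @ W1, 0.0) @ W2

        def residual_block(x: np.ndarray) -> np.ndarray:
            return x + F(x)   # the residual connection itself

        # Stacking blocks accumulates refinements on top of the original input.
        x = rng.normal(size=(1, d))
        for _ in range(4):
            x = residual_block(x)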

    17 min

About

Rooted Layers is about AI insights grounded in research. I blog about AI research, agents, the future of deep learning, and cybersecurity. Main publication at https://lambpetros.substack.com/