Rooted Layers

AI insights grounded in research

Rooted Layers is about AI insights grounded in research. I blog about AI research, agents, the future of deep learning, and cybersecurity. Main publication at https://lambpetros.substack.com/

  1. 18 FEB

    The Five-Component Pattern for AI Evidence Governance

    This is for teams shipping LLM-assisted analysis in high-stakes workflows — protocol security, compliance auditing, code review — where “the model said so” isn’t good enough. Part 2 of Measuring Trust: The Gap Between AI Auditing and Correctness.

    The Problem

    Last month, our AI verification tool produced a polished summary claiming “all obligations properly mapped” — while the same run’s data pointed to ABI wrappers, test files, and RPC helpers that had nothing to do with the actual protocol logic. Both documents were confident. Both were AI-generated. They contradicted each other.

    This isn’t a bug in the model. It’s a failure in evidence governance. We had no rule for which artifact to trust when outputs disagreed. No procedure for tracking that disagreement. No way to prevent the contradiction from silently propagating into external claims.

    Recent work makes this sharper: models hallucinate with high certainty even when they “know” the correct answer in nearby settings (Simhi et al., 2025). Confidence and correctness decouple in practice. And once report generation is cheap, the bottleneck shifts to tracing and verification effort (Rasheed et al., 2026). Our aim is not only higher answer quality; it is lower verification burden per claim, so auditing remains cheaper than redoing the analysis.

    This post describes the operational pattern we built to fix that.

    Where to Look

    The tool is open source: eip-verify on GitHub. Governance artifacts:
    * Contradiction register (8 tracked contradictions)
    * Evidence ledger (claim-level evidence tracking)
    * Scoring rubric (five-dimension quality assessment)
    * Scored dataset (81 mappings, 47 direct-adjudication)

    The strongest contribution isn’t a claim that verification is solved. It’s a workflow where every trust decision is inspectable, disputable, and updateable. Evidence artifact links are pinned to commit ca15d40.
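    As a rough illustration of the claim-level shape those artifacts track, here is a minimal Python sketch. All class and field names are hypothetical; the real eip-verify artifacts are CSVs with their own schema.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a claim-level evidence ledger plus a
# contradiction register. Names are illustrative, not the actual
# eip-verify schema.

@dataclass
class EvidenceEntry:
    claim: str            # the assertion being tracked
    artifact: str         # file or transcript the claim cites
    commit: str           # commit the evidence is pinned to
    status: str = "open"  # open | supported | contradicted

@dataclass
class EvidenceLedger:
    entries: list = field(default_factory=list)
    contradictions: list = field(default_factory=list)

    def record(self, entry: EvidenceEntry):
        self.entries.append(entry)

    def flag_contradiction(self, a: EvidenceEntry, b: EvidenceEntry, note: str):
        # A disagreement becomes a tracked record instead of
        # silently propagating into external claims.
        a.status = b.status = "contradicted"
        self.contradictions.append((a.claim, b.claim, note))

ledger = EvidenceLedger()
e1 = EvidenceEntry("all obligations properly mapped", "summary.md", "ca15d40")
e2 = EvidenceEntry("mappings point at ABI wrappers", "phase4.csv", "ca15d40")
ledger.record(e1)
ledger.record(e2)
ledger.flag_contradiction(e1, e2, "summary disagrees with run data")
```

    The point of the structure is procedural, not clever: once two confident artifacts disagree, the contradiction itself becomes an inspectable, disputable record.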
    The Part 1 narrative and metric derivations are pinned to later commits (13260c8, 6a21cce) because they were written after the evidence pack was frozen.

    References
    * Chen, J. et al. (2025). Rethinking All Evidence: Enhancing Trustworthy RAG via Conflict-Driven Summarization. arXiv:2507.01281.
    * Chen, R. (2025). Evidence-Bound Autonomous Research (EviBound). arXiv:2511.05524.
    * Habli, I. et al. (2025). The BIG Argument for AI Safety Cases. arXiv:2503.11705.
    * Ji, Y. et al. (2026). MedRAGChecker: Claim-Level Verification for Biomedical RAG. arXiv:2601.06519.
    * Li, Z. et al. (2025). SOPBench: Evaluating Language Agents at Following SOPs and Constraints. arXiv:2503.08669.
    * Liang, P. et al. (2023). Holistic Evaluation of Language Models. TMLR.
    * Lu, K. et al. (2025). Med-R²: Crafting Trustworthy LLM Physicians via EBM. arXiv:2501.11885.
    * Rasheed, R. A. et al. (2026). From Fluent to Verifiable: Claim-Level Auditability for Deep Research Agents. arXiv:2602.13855.
    * Sackett, D. L. et al. (1996). Evidence Based Medicine: What It Is and What It Isn’t. BMJ, 312(7023), 71–72.
    * Simhi, A. et al. (2025). Trust Me, I’m Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer. arXiv:2502.12964.
    * Tabassi, E. (2023). AI Risk Management Framework (AI RMF 1.0). NIST AI 100-1.

    This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit lambpetros.substack.com

    33 min
  2. 18 FEB

    Measuring Trust: The Gap Between AI Auditing and Correctness

    The Problem Nobody Wants to Hear About

    Every blockchain runs on a promise: the code does what the spec says it should. For Ethereum, that promise passes through a chain of documents and codebases. An EIP (Ethereum Improvement Proposal) describes a protocol change in human language — think “this is how the new fee calculation must work.” That proposal gets translated into a formal Python reference called the execution specs. Then independent teams — geth in Go, Reth in Rust, Besu in Java — write their own implementations from the same spec. If any client gets it wrong, the network can fork.

    The Ethereum Foundation’s Protocol Security team audits this chain manually. They read the EIP, read the spec code, read the client code, and check that the obligations line up. It’s slow, expert-level work, and it has to happen every time the protocol changes. The Foundation issued an RFP asking whether LLMs could help. I built a proof of concept to find out.

    Five iterations, one month, and 47 directly adjudicated obligation mappings later, I can tell you something that surprised me: the system got dramatically better at exposing where it might be wrong. It did not get dramatically better at being right.

    That gap — between auditability and correctness — turned out to be the most important finding of the entire project. And it applies far beyond Ethereum.

    The Setup

    The tool, eip-verify, works in five phases. Given an EIP number, it:
    * Extracts every obligation from the EIP text
    * Locates where each obligation lives in the Python execution specs
    * Analyzes the spec code flow and flags gaps
    * Finds the corresponding implementation in the client (we tested with geth)
    * Analyzes the client code and maps it back to the obligation

    Each phase produces inspectable artifacts — CSVs, JSON, the actual prompts and model outputs. You can trace any claim back to the exact moment the LLM made it. This turned out to matter more than I expected.
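    As a rough sketch of that five-phase shape, each phase consumes the artifacts of the phases before it and leaves its own behind. Function and artifact names here are invented for illustration, not the real eip-verify API.

```python
# Hypothetical sketch of a phased pipeline where every phase's output
# is kept as an inspectable artifact. Phase bodies are stubs.

def run_pipeline(eip_number, phases):
    artifacts = {}
    for name, phase in phases:
        # Each phase can read earlier artifacts and must leave its
        # own, so any downstream claim can be traced to its origin.
        artifacts[name] = phase(eip_number, artifacts)
    return artifacts

phases = [
    ("obligations",     lambda eip, a: [f"EIP-{eip}: obligation stub"]),
    ("spec_mapping",    lambda eip, a: {"obligation": "specs/fork.py"}),
    ("spec_analysis",   lambda eip, a: {"gaps": []}),
    ("client_mapping",  lambda eip, a: {"obligation": "core/state_transition.go"}),
    ("client_analysis", lambda eip, a: {"mapped_back": True}),
]

result = run_pipeline(1559, phases)
```

    The design choice worth noting is that the artifact dictionary, not the final answer, is the pipeline's real product: auditing means replaying that trail.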
    Why These EIPs

    Not all EIPs are equally useful for testing a verification tool. I chose three deliberately:
    * EIP-1559 — the fee-market overhaul. Complex, mature, deeply embedded in client code. Most of the scored data comes from here. This is the stress test.
    * EIP-2930 — access list transactions. Structurally simpler. Good for establishing a baseline and confirming the pipeline works on a less complex obligation set.
    * EIP-7702 — a recent Prague-fork addition. Used as a negative control: run 21570420032 targeted geth v1.13.14, which predates 7702 support, and the pipeline correctly reported the EIP as unimplemented. A second run completed cleanly against a supporting version. Both are excluded from the scored set.

    The scored dataset has 47 directly adjudicated mappings — where I reviewed the model’s transcript against actual source code — plus 34 compare-derived proxy rows from cross-run alignment scoring (primarily Haiku-family expansion). Findings below draw from the 47 direct rows unless noted. Most come from EIP-1559; treat these as operational signal from a real codebase, not a benchmark-grade estimate.

    The Agent: Deliberately Simple

    A common assumption in AI tooling is that more sophisticated orchestration produces better results. Five iterations of this project taught me the opposite. The “agent” in eip-verify is an LLM with file-system access — read, write, bash, grep, glob — and a capped number of conversation turns. That’s it. No multi-agent framework. No retrieval-augmented generation. No custom orchestration. The total agent adapter is ~80 lines of Python.

    I used Claude (via the Claude Agent SDK) because the SDK made this workflow trivial: give the model shell access, point it at a codebase, constrain its turns, capture everything it produces. The same architecture should be portable to other tool-using LLMs — OpenAI’s function calling, local models with tool support — though we haven’t tested that claim.
    Earlier iterations tried harder:
    * Repo-map-heavy guided flows (PoC 4.x): High context breadth increased complexity and diagnosis cost without reliable trust gains.
    * Exploratory multi-agent architectures (PoC 1): Useful discovery power, weak comparability unless constrained by a single adjudication contract.
    * RAG-like layered retrieval: Opaque error surfaces — when something went wrong, you couldn’t tell where.

    The simple pipeline won because of one criterion: when the output is wrong, can you figure out where and why?
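    The whole "LLM with file-system access and capped turns" shape fits in a few lines. This is a minimal sketch with a hypothetical call_model callback and stubbed tools; the real adapter uses the Claude Agent SDK rather than anything shown here.

```python
# Minimal tool-use agent loop: the model either requests a tool or
# answers; a hard turn cap bounds the conversation. All names are
# illustrative stand-ins.

MAX_TURNS = 10

def run_agent(task, call_model, tools):
    transcript = [{"role": "user", "content": task}]
    for _ in range(MAX_TURNS):           # capped number of turns
        reply = call_model(transcript)
        transcript.append(reply)
        if reply.get("tool") is None:    # model answered: stop
            return reply["content"], transcript
        # Execute the requested tool (read/write/bash/grep/glob)
        # and feed the output back into the conversation.
        output = tools[reply["tool"]](reply["args"])
        transcript.append({"role": "tool", "content": output})
    return None, transcript              # turn cap exhausted

# Usage with a fake model that greps once, then answers.
calls = {"n": 0}
def fake_model(transcript):
    if calls["n"] == 0:
        calls["n"] += 1
        return {"role": "assistant", "tool": "grep", "args": "base_fee"}
    return {"role": "assistant", "tool": None, "content": "found it"}

answer, log = run_agent("locate the base fee check", fake_model,
                        {"grep": lambda args: "eip1559.py:42"})
```

    Because the full transcript is returned, "when the output is wrong, can you figure out where and why?" reduces to reading the log.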

    29 min
  3. 15 JAN

    The Transformer Attractor

    In 2023, Mamba promised to replace attention with elegant state-space math that scaled linearly with context. By 2024, the authors had rewritten the core algorithm to use matrix multiplications instead of scans. Their paper explains why: “We restrict the SSM structure to allow efficient computation via matrix multiplications on modern hardware accelerators.” The architecture changed to fit the hardware. The hardware did not budge.

    This is not a story about hardware determinism. It is a story about convergent evolution under economic pressure. Over the past decade, Transformers and GPU silicon co-evolved into a stable equilibrium—an attractor basin from which no alternative can escape without simultaneously clearing two reinforcing gates. The alternatives that survive do so by wearing the Transformer as a disguise: adopting its matrix-multiplication backbone even when their mathematical insight points elsewhere.

    The thesis: The next architectural breakthrough will not replace the Transformer. It will optimize within the Transformer’s computational constraints. Because those constraints are no longer just technical—they are economic, institutional, and structural.

    The Two-Gate Trap

    Every alternative architecture must pass through two reinforcing gates:

    Gate 1: Hardware Compatibility. Can your architecture efficiently use NVIDIA’s Tensor Cores—the specialized matrix-multiply units that deliver 1,000 TFLOPS on an H100? If not, you pay a 10–100× compute tax. At frontier scale ($50–100M training runs), that tax is extinction.

    Gate 2: Institutional Backing. Even if you clear Gate 1, you need a major lab to make it their strategic bet. Without that commitment, your architecture lacks large-scale validation, production tooling, ecosystem support, and the confidence signal needed for broader adoption.

    Why the trap is stable: These gates reinforce each other. Poor hardware compatibility makes institutional bets unattractive (too risky, too expensive). Lack of institutional backing means no investment in custom kernels or hardware optimization, keeping Gate 1 friction permanently high. At frontier scale, breaking out requires changing both simultaneously—a coordination problem no single actor can solve. The alternatives that survive do so by optimizing within the Transformer’s constraints rather than fighting them.
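    A toy Amdahl-style model makes Gate 1 concrete. The speedup factor and workload fractions below are illustrative assumptions, not measured numbers.

```python
# Toy model of the Gate 1 compute tax: runtime relative to a
# pure-matmul workload when only part of an architecture's FLOPs can
# run on Tensor-Core-style matrix units. Numbers are illustrative.

def relative_cost(matmul_fraction, speedup=16):
    matmul_time = matmul_fraction / speedup   # accelerated portion
    other_time = 1 - matmul_fraction          # general-purpose fallback
    pure_matmul_time = 1 / speedup            # Transformer-shaped baseline
    return (matmul_time + other_time) / pure_matmul_time

transformer_like = relative_cost(0.95)  # mostly matmuls: modest tax
scan_heavy = relative_cost(0.10)        # little matmul work: order-of-10x tax
```

    Even a modest non-matmul fraction dominates runtime at these speedups, which is exactly why surviving alternatives adopt the matrix-multiplication backbone rather than fight it.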

    25 min
  4. 29 DEC 2025

    The Orchestration Paradigm: Issue 4 - The Reality

    🎙️ Episode: The Reality – Why Agents Bankrupt Production

    In this series finale, we leave the research lab and enter the war room. We trace the lineage of agentic AI from Chain-of-Thought to ToolOrchestra, map the terrifying "Unsolved Frontiers" preventing full autonomy, and conduct a brutal audit of what happens when you deploy this to production. This episode isn't for the dreamers. It's for the builders.

    Topics Covered:
    * The "Dreamer vs. Builder" Gap
    * Lineage: From "Brains in Jars" (CoT) to "Managers" (Compound AI)
    * Unsolved Frontier 1: Recursive Orchestration (Why the VP gets blamed for the intern's mistake)
    * Unsolved Frontier 2: Tool Synthesis (The capability to write your own tools)
    * Production Nightmare: The Cost Attack (Denial of Wallet)
    * The Breakeven Math: Why you lose money until 75k queries/month
    * The 4 Gates: Determining if your team is ready to build this

    Key Takeaways:
    * The Moat is the Factory: The model weights don't matter; the synthetic data pipeline that built them does.
    * The "Latency Tail" Kills: In a compound system, P99 latency is cumulative. One flaky tool destroys the entire user experience.
    * The Decision Tree: Do not build an orchestrator unless you pass the Volume Gate (>75k queries) and the Team Gate (>3 ML Engineers).

    References:
    * Su et al. (2025) – The ToolOrchestra Paper
    * Sculley et al. (2015) – Hidden Technical Debt in ML Systems
    * Dean et al. (2013) – The Tail at Scale

    Catch up on The Orchestration Paradigm series:
    * Issue 1: The Algorithm – (GRPO, Outcome Supervision, and the math of thinking)
    * Issue 2: The Factory – (Synthetic Data Pipelines, 16 H100s, and Benchmarking)
    * Issue 3: The Behavior – (Escalation Ladders, Preference Vectors, and why Agents give up)
    * Issue 4: The Reality – (Production Risks, Unit Economics, and Unsolved Frontiers)

    How to Consume This Series:
    * 📺 Video: Acts as a TL;DR
    * 🎧 Audio: The Deep Explainer going into the weeds of the paper.
    * 📄 Written Post: Lies BETWEEN the two—the technical blueprint for implementation.
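    The breakeven claim is simple unit economics: fixed costs divided by margin per query. The dollar figures below are hypothetical stand-ins chosen to reproduce a 75k threshold, not the episode's actual numbers.

```python
# Toy breakeven model for an orchestrator deployment.
# All dollar figures are hypothetical.

def monthly_profit(queries, fixed_cost=15000.0,
                   cost_per_query=0.05, revenue_per_query=0.25):
    # fixed_cost covers the standing bill: ML engineers, GPU
    # reservations, evaluation infrastructure.
    return queries * (revenue_per_query - cost_per_query) - fixed_cost

# Breakeven volume = fixed cost / margin per query.
breakeven = 15000.0 / (0.25 - 0.05)   # 75,000 queries/month
```

    Below that volume every query is subsidized by the fixed bill, which is the intuition behind the Volume Gate.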

    31 min
  5. 29 DEC 2025

    The Orchestration Paradigm: Issue 3 - The Behavior

    Deep Explainer Episode: The Behavior – Debugging the Ghost in the Machine

    If you watch an agent long enough, you see patterns nobody programmed. The "Escalation Ladder," the "Map-Reduce" spray, the "Do-While" loop. These are emergent behaviors. We audit the psychology of the orchestrator, explaining Implicit State Machines and the "Embeddings Trap" that fakes generalization. We are debugging the mind of the machine.

    Topics Covered:
    * Implicit State Machines: How behaviors emerge from the loss landscape
    * Escalation Ladders: The Try-Catch pattern of AI
    * Preference Learning: Attention Injection vs. Hard Constraints
    * Context Window Tax: The "Death Spiral" of long contexts
    * Generalization Trap: Semantic Similarity vs. Economic Reality

    Key Takeaways:
    * Emergence: Strategies like "Giving Up" are learned, not coded.
    * Soft Control: User preferences (like "Low Cost") are just probabilistic suggestions, not guarantees.
    * Semantic Trap: The model routes to new tools based on description similarity, not verified capability.

    References:
    * Shinn et al. (2023) – Reflexion
    * Liu et al. (2023) – Lost in the Middle
    * Patil et al. (2023) – Gorilla LLM

    Catch up on The Orchestration Paradigm series:
    * Issue 1: The Algorithm – (GRPO, Outcome Supervision, and the math of thinking)
    * Issue 2: The Factory – (Synthetic Data Pipelines, 16 H100s, and Benchmarking)
    * Issue 3: The Behavior – (Escalation Ladders, Preference Vectors, and why Agents give up)
    * Issue 4: The Reality – (Production Risks, Unit Economics, and Unsolved Frontiers)

    How to Consume This Series:
    * 📺 Video: Acts as a TL;DR of the post
    * 🎧 Audio: The Deep Explainer going into the weeds of the current topic. Click on the audio toggle next to the video, or look it up as a podcast on all major platforms.
    * 📄 Written Post: Lies BETWEEN the two—the technical blueprint for implementation.
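    The Escalation Ladder pattern described above can be sketched in a few lines. Model names and per-call costs are invented for illustration.

```python
# "Escalation Ladder": the try/catch pattern of AI. Try the cheapest
# model first and climb a rung only when verification fails.
# Names and costs are hypothetical.

LADDER = [
    ("small-local-8b", 0.001),   # cheap first attempt
    ("mid-tier-70b",   0.010),
    ("frontier-model", 0.100),   # last resort
]

def solve_with_escalation(task, attempt):
    spent = 0.0
    for model, cost in LADDER:
        spent += cost
        answer = attempt(model, task)   # returns None on failure
        if answer is not None:
            return answer, spent        # stop climbing on success
    return None, spent                  # every rung failed: give up

# Usage: pretend the mid-tier model is the first to succeed.
tried = []
def attempt(model, task):
    tried.append(model)
    return "ok" if model == "mid-tier-70b" else None

answer, spent = solve_with_escalation("summarize the repo", attempt)
```

    Note that "giving up" falls out of the same loop: exhausting the ladder returns None instead of escalating forever, which is exactly the learned-not-coded behavior the episode describes.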

    33 min
  6. 29 DEC 2025

    The Orchestration Paradigm: Issue 2 - The Factory

    NOTE: The video acts as a TL;DR. Click on the audio toggle next to it to get the very detailed podcast explainer.

    While the headlines focus on the 8B model beating GPT-5, the real engineering breakthrough wasn’t the model itself. It was the factory that built it. You can download the model weights tomorrow. You cannot download the synthetic data pipeline that generated the training signal. That is the moat.

    In this second issue, we leave the theoretical blackboard and enter the factory floor. We will analyze the ToolScale synthetic data pipeline that manufactures the training signal, audit the “physics” of benchmarking agents (where “Goodhart’s Law” reigns supreme), and dissect the massive infrastructure requirements—specifically why training stable RL policies requires 16 H100s and specialized gradient accumulation techniques.

    How to Read This Series

    Each part is self-contained. You can read them in order or jump to whichever topic interests you most. Every part ends with an Annotated Bibliography pointing to the primary papers with notes on why each one matters.
    * ML practitioners will learn how to build orchestrated systems.
    * Researchers will find a comprehensive literature review of tool use and compound AI through the lens of one well-executed paper.
    * Technical leaders will get concrete cost and performance trade-offs for evaluating orchestration architectures.
    * Curious minds can understand where AI is heading without needing a PhD to follow along.

    Prerequisites

    This series assumes familiarity with machine learning basics like loss functions and gradient descent, neural network fundamentals including attention and transformers, and Python programming sufficient to read pseudocode. If you’re newer to these topics, Parts 02 and 10 include appendices covering RL and agency fundamentals. Start with Issue 1 for the core thesis, then jump to Issue 4 for strategic implications.
    If you’re purely interested in business implications, Part 12 has the CTO decision tree and unit economics.

    Part 4: The ToolScale Dataset

    This part dissects ToolScale, the synthetic data pipeline used to train ToolOrchestra. It attempts to solve the “Ground Truth Bottleneck”: the fact that we don’t know the optimal way to solve most problems. Human labeling is too expensive and slow, while wild data is too noisy. The authors must manufacture data.

    The Ground Truth Bottleneck

    [!NOTE] System Auditor’s Log: In GenAI, data is the new code. The biggest bottleneck for training agents is not compute; it’s the lack of verifiable trajectory data. We have petabytes of text (CommonCrawl), but almost zero logs of “optimal” tool-use sequences. Humans don’t write down their thought processes when they use Google.

    The Synthetic Pipeline

    The pipeline operates in two phases, creating a closed loop of generation and verification. First, in Phase 1 (Environment Synthesis), they generate the “world.” Instead of letting the agent interact with the live internet, which is unpredictable, they generate thousands of virtual APIs and databases. An LLM creates a SQL database schema (e.g., “Library Management System”), fills that database with fake, consistent rows, and generates Python functions to query this database.
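    A hand-written miniature of that synthesized world, using an in-memory SQLite database in place of the LLM-generated schema, rows, and query functions:

```python
import sqlite3

# Miniature of Phase 1 (Environment Synthesis). In ToolScale an LLM
# generates the schema, the fake rows, and the query functions; here
# they are hand-written stand-ins.

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE loans (book TEXT, borrower TEXT)")
db.execute("INSERT INTO loans VALUES ('The Great Gatsby', 'Alice')")

def who_borrowed(title):
    # One of the generated tool functions the agent will later call.
    row = db.execute(
        "SELECT borrower FROM loans WHERE book = ?", (title,)).fetchone()
    return row[0] if row else None

# Because the world is synthetic, ground truth is one query away,
# which is what enables automatic verification at scale.
ground_truth = who_borrowed("The Great Gatsby")
```

    The design payoff is that verification becomes mechanical: any candidate trajectory can be checked against a query the system already knows how to execute.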
    Then, in Phase 2 (Task Synthesis), they generate the “problems.” An LLM looks at the database and asks a question like “Who borrowed ‘The Great Gatsby’?” Because the database was synthesized, the system knows the answer: it can execute the SQL query to get the ground truth. This creates a labeled dataset of (Question, Tool_Call, Correct_Answer) pairs. Because the environment is synthetic, the system knows the ground truth, enabling automatic verification at scale without human labelers.

    The “Pass@K” Proxy

    The critical innovation, and the potential flaw, is in how they define “success.” In standard supervised learning, we measure Exact Match to see if the model output the exact string we expected. In tool use, this is too rigid because there are many ways to query a database. ToolOrchestra uses a Pass@8 filtering criterion during data generation. They generate 8 different solution paths for a single problem using a strong teacher model like GPT-4.
    * If 0 paths lead to the correct answer, they discard the problem as unsolvable or broken.
    * If 8 paths lead to the correct answer, they keep the most efficient one.
    * If some paths fail, they keep the successful ones as positive reinforcement samples.

        # The Data Filtering Logic
        # We are optimizing for 'Process Fidelity', not just 'Outcome Accuracy'.
        def filter_training_data(problem, candidate_trajectories):
            valid_trajectories = []
            target_answer = problem.ground_truth
            for traj in candidate_trajectories:
                result = execute_trajectory(traj)
                # Verification: the weak link. We assume strict string
                # matching or simple numeric equality is sufficient to
                # verify the "reasoning".
                if verify(result, target_answer):
                    valid_trajectories.append(traj)
            # Selection Bias Risk: we are selectively training on problems
            # that GPT-4 is GOOD at. If GPT-4 has a systematic blindspot,
            # our orchestrator inherits it.
            if len(valid_trajectories) > 0:
                return select_most_efficient(valid_trajectories)
            return None

    The Verification Gap

    From an auditing perspective, this pipeline introduces Synthetic Bias. First, there is Teacher Bias: the orchestrator can never exceed the reasoning capabilities of the teacher model (GPT-4) that generated the trajectories; it can only become more efficient at executing them. Second, there is Triviality Bias. It is easier to generate verifiable questions about “lookups” (What is the capital of X?) than about “reasoning” (Why did the Roman Empire fall?). This pushes the dataset towards factual retrieval, potentially under-training the “complex reasoning” circuits. The “verifiable ground truth” is a gold standard, but it constrains the domain to problems with singular, verifiable answers. Ambiguous, open-ended tasks, which are often the most valuable, are systematically filtered out.

    Annotated Bibliography
    * Chen et al. (2021) – Evaluating Large Language Models Trained on Code (Codex): Introduced the “Pass@k” metric. ToolOrchestra adapts this from code generation to tool-trajectory generation.
    * Wang et al. (2022) – Self-Instruct: Aligning Language Models with Self-Generated Instructions: The blueprint for the teacher-student synthetic data pipeline. ToolScale is essentially Self-Instruct applied to API calls.
    * Gudibande et al. (2023) – The False Promise of Imitating Proprietary LLMs: A critical paper (“The Imitation Game”) arguing that training on synthetic data from stronger models helps with style but not actual reasoning capability.

    Part 5: Benchmarks and Evaluation

    Evaluating an orchestrator is harder than evaluating a chatbot. A chatbot is judged on text quality. An orchestrator is judged on state transitions. ToolOrchestra is tested on three primary datasets: Humanity’s Last Exam (HLE), FRAMES, and τ²-Bench. Each targets a different failure mode.
    Metric Gaming and Benchmark Physics

    [!NOTE] System Auditor’s Log: Goodhart’s Law states: “When a measure becomes a target, it ceases to be a good measure.” In agentic AI, benchmarks like MMLU or GSM8K are now effectively part of the training set. ToolOrchestra introduces new benchmarks to prove its worth, but we must scrutinize what exactly is being measured. Is it intelligence, or is it just efficient retrieval?

    Humanity’s Last Exam (HLE) consists of PhD-level questions. Most LLMs fail these not because they can’t write, but because they lack specific domain computations. The benchmark measures Tool Identification, meaning the orchestrator doesn’t solve the physics equation but correctly identifies that WolframAlpha can solve it. The caveat is that this measures the quality of the tools available as much as the orchestrator. If the tool suite lacks a physics engine, the orchestrator fails regardless of its “intelligence.”

    FRAMES tests multi-hop factual reasoning, such as finding the population difference between the cities where two authors were born. This tests context window management, as the system must retrieve both facts, hold them in memory, and perform arithmetic. The failure mode here is “Distractor Injection.” When retrieving information about Author A, the tool might return 5000 tokens of noise. The benchmark implicitly measures the orchestrator’s ability to filter noise or the robustness of its attention mechanism.

    τ²-Bench simulates user interactions with varying preferences. This is the only benchmark that tests the Utility Function, checking if the model actually respects the “Cost vs. Accuracy” tradeoff. The metric is a Utility Score (u) defined as u = α ⋅ 𝕀(correct) − (1 − α) ⋅ cost. This formula explicitly defines the “exchange rate” between accuracy and dollars.

    The Problem with “Accuracy per Dollar”

    The authors present Accuracy per Dollar as a key metric, but this is potentially misleading.
In many production systems, the value of accuracy is non-linear. For example, 99% accuracy on a medical diagnosis task is worth $1M, while 90% accuracy is worth $0 (or negative, due to liability). A linear “Accuracy per Dollar” metric favors systems t
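    For concreteness, the τ²-Bench utility score can be computed directly. The α value below is an illustrative choice, not the benchmark's actual setting.

```python
# Utility score u = alpha * 1(correct) - (1 - alpha) * cost:
# the explicit exchange rate between accuracy and dollars.
# alpha = 0.8 is an illustrative assumption.

def utility(correct: bool, cost: float, alpha: float = 0.8) -> float:
    reward = alpha * (1.0 if correct else 0.0)
    penalty = (1.0 - alpha) * cost
    return reward - penalty

frugal_right = utility(True, cost=0.10)   # high reward, tiny penalty
lavish_right = utility(True, cost=3.00)   # same answer, most of the
                                          # reward eaten by spend
```

    A linear exchange rate like this is exactly what the Accuracy-per-Dollar critique targets: in production, the value of accuracy is often non-linear, so two systems with identical accuracy can rank very differently once cost enters.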

    37 min

About

Rooted Layers is about AI insights grounded in research. I blog about AI research, agents, the future of deep learning, and cybersecurity. Main publication at https://lambpetros.substack.com/