Unlimited Signals

Unlimited Data Works

Unlimited Signals is a long-form podcast from Unlimited Data Works about research, technology, and applied systems thinking. Each episode starts from papers, tools, or technical ideas and pushes toward signal, mechanism, and practical use.

  1. May 6

    How AI Agents manage other agents

    This episode explores the technical evolution of AI systems from simple task automation to sophisticated multi-agent orchestration. It synthesizes three distinct perspectives: a hands-on builder's guide for local agents, Stripe’s enterprise-scale "blueprint" architecture, and research from Sakana AI on using reinforcement learning to automate the coordination of frontier models. The discussion focuses on how to move past basic prompting into building reliable, autonomous systems that can reason, self-correct, and scale.

    Core Insights and Architecture

    A fundamental distinction exists between classic automation and AI agents. Traditional automation is deterministic and requires every branch to be defined in advance. An agent is characterized by "wiggle room": it uses tools and goals to improvise workarounds when standard paths fail. For individual builders, the "architecture" of an agent is surprisingly simple: it is essentially a folder structure on a local machine. In this model, instructions live in markdown files (like CLAUDE.md), memory is stored in flat text files, and "tools" are simply scripts the agent knows how to execute.

    At the enterprise level, reliability is achieved through a "blueprint" architecture. This approach "sandwiches" AI reasoning between deterministic nodes: fixed code that handles verifiable tasks like parsing, running tests, or linting. By constraining AI to specific reasoning steps while using standard code for validation, organizations like Stripe can submit over 1,300 AI-generated pull requests per week without sacrificing codebase integrity.

    The RL Conductor and Test-Time Scaling

    The most advanced orchestration model presented is the RL Conductor, a 7B parameter model trained via reinforcement learning (specifically GRPO) to act as a meta-orchestrator. Rather than relying on human-designed templates, the Conductor learns to divide complex problems, delegate subtasks to a pool of specialized worker models (like GPT-5 or Gemini Pro), and design communication topologies. Key research findings include:

    Delegation Power: A small 7B model can outperform much larger individual models by effectively leveraging the complementary skills of a worker pool.
    Task Adaptivity: The system dynamically allocates compute, using more steps for harder problems (like competitive coding) and fewer for simple information retrieval.
    Recursive Scaling: By allowing the Conductor to call itself, researchers unlocked "test-time scaling," where the model can adapt its strategy on the fly if an initial attempt fails.

    Practical Takeaways and Design Heuristics

    For engineers building agentic systems, the sources offer several practical "rules of the road":

    Instruction Integrity: You may use AI to draft your agent's instructions, but you must manually read and edit every line. One misguided sentence in a core instruction file can lead to massive "drift" and wasted tokens.
    Model Routing: Efficiency requires sending simple tasks to smaller, cheaper models and reserving high-reasoning models for strategic decisions.
    The Context Game: Context management is the primary constraint. Instructions must be "thinned" regularly to prevent the agent from being weighed down by a "giant anchor" of overhead tokens.
    Security Scoping: Agents should have their own restricted accounts and credentials, rather than sharing the user's personal access, to contain the "blast radius" of potential failures.
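The "blueprint" idea described above can be sketched in a few lines. This is a minimal illustration, not Stripe's implementation: `parse_ticket`, `lint`, `run_blueprint`, and `fake_agent` are hypothetical names, and the lint rules are toy stand-ins for real parsing, test, and lint steps. The only non-deterministic step is the agent call in the middle.

```python
from typing import Callable


def parse_ticket(raw: str) -> dict:
    """Deterministic node: fixed parsing, no model involved."""
    title, _, body = raw.partition("\n")
    return {"title": title.strip(), "body": body.strip()}


def lint(patch: str) -> list[str]:
    """Deterministic node: toy checks standing in for a real linter or test run."""
    problems = []
    if "\t" in patch:
        problems.append("tabs are not allowed")
    if not patch.endswith("\n"):
        problems.append("missing trailing newline")
    return problems


def run_blueprint(
    raw: str,
    agent: Callable[[dict, list[str]], str],
    max_attempts: int = 3,
) -> str:
    """Sandwich the agent between a parsing node and a validation node."""
    task = parse_ticket(raw)
    feedback: list[str] = []
    for _ in range(max_attempts):
        patch = agent(task, feedback)  # the only non-deterministic step
        feedback = lint(patch)         # fixed code decides whether it passes
        if not feedback:
            return patch
    raise RuntimeError(f"validation still failing: {feedback}")


def fake_agent(task: dict, feedback: list[str]) -> str:
    """Hypothetical stand-in for a model call: sloppy first draft, fixed on retry."""
    if not feedback:
        return "\tprint('fix')"
    return "print('fix')\n"


if __name__ == "__main__":
    print(repr(run_blueprint("Fix bug\nDetails here", fake_agent)))
```

The design point is that the loop never trusts the agent's own claim of success; only the deterministic check can end it.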

    37 min
  2. May 4

    Agentic RAG and the autonomous researcher

    This episode explores the transition of AI from static information retrieval to the "autonomous researcher" model. It breaks down how Agentic Retrieval-Augmented Generation (Agentic RAG) moves beyond simple keyword matching to create systems that can plan, reason, and verify their own findings. Using the Feynman technique, the discussion simplifies the complex machinery of multi-agent systems into an understandable framework for building robust AI tools.

    Main Ideas and Strategic Insights

    The core shift discussed is from "Naive RAG" (a simple retrieve-then-read process) to Agentic RAG, where autonomous agents dynamically manage retrieval strategies. The strongest insight is that "context engineering" has become the primary job of AI engineers. This involves managing the "RAM" of the LLM (the context window) by writing, selecting, compressing, and isolating information to prevent "context poisoning" or performance degradation.

    Another major theme is the use of specialized agent roles. Rather than one model doing everything, systems like MA-RAG use a "Planner" to decompose complex queries into sub-tasks and an "Extractor" to filter out noise from retrieved documents. This role-based separation allows smaller models to handle simpler tasks while reserving high-capacity models for final answer synthesis.

    Practical Takeaways and Engineering Best Practices

    Implement "Scratchpads": Use external memory or state objects to store plans and notes. This prevents the agent from losing track of its objective when the context window becomes token-heavy.
    Temporal Awareness: Traditional RAG treats facts as static, but real-world data evolves. Using "Temporal Agents" to extract time-stamped triplets allows the system to answer questions like "What was true in 2021?" versus now.
    Evaluation Loops: Rely on expert-curated "Golden Answers" for ground truth, but use "LLM-as-a-judge" to scale evaluation during development.
    Tool-Integrated Reasoning (TIR): Train models to call tools like search engines or code interpreters as native reasoning steps (the "think-action-observation" loop) rather than just relying on prompt engineering.

    Caveats and Open Questions

    While powerful, these systems come with significant "token overhead." Multi-agent interactions can consume up to 15 times more tokens than a single chat, leading to higher latency and costs. Furthermore, the performance of an agent is heavily dependent on the underlying LLM's capacity; smaller models often struggle with the multi-hop reasoning required for complex planning. A major open question remains how to effectively automate the "invalidation" of outdated facts in a knowledge graph without constant human oversight.

    Solid Claims vs. Speculation

    It is a proven claim that Agentic RAG frameworks (like TempAgent or MA-RAG) significantly outperform Naive RAG on multi-hop benchmarks like HotpotQA and MultiTQ. It is also verified that context management techniques like summarization and trimming are essential for long-running agent trajectories.

    However, the idea that AI can function as a fully "autonomous researcher" in high-stakes fields like medicine or law without a "human-in-the-loop" is still speculative. While systems like Google's "Co-Scientist" show promise in generating hypotheses, the sources emphasize that human review remains crucial for ensuring findings align with real-world requirements.
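The "time-stamped triplets" idea mentioned under Temporal Awareness can be illustrated with a small sketch. This is an assumption about how such facts might be stored, not the schema of any system named in the episode: the `Fact` class, the `facts_at` helper, and the sample company data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """A (subject, relation, object) triplet with a validity interval."""
    subject: str
    relation: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact is still valid


def facts_at(facts: list[Fact], subject: str, relation: str, when: date) -> list[str]:
    """Return the objects that were true for (subject, relation) on a given day."""
    return [
        f.obj
        for f in facts
        if f.subject == subject
        and f.relation == relation
        and f.valid_from <= when
        and (f.valid_to is None or when < f.valid_to)
    ]


# Illustrative, made-up data: closing the old interval is the "invalidation" step.
FACTS = [
    Fact("Acme", "ceo", "Alice", date(2018, 1, 1), date(2022, 6, 1)),
    Fact("Acme", "ceo", "Bob", date(2022, 6, 1)),
]

if __name__ == "__main__":
    print(facts_at(FACTS, "Acme", "ceo", date(2021, 7, 1)))
    print(facts_at(FACTS, "Acme", "ceo", date(2024, 1, 1)))
```

Storing an end date instead of deleting the old triplet is what lets the same store answer both "in 2021" and "now".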

    56 min
  3. May 2

    The Shift Toward Autonomous Self-Evolution in AI Agents

    The Proposer-Solver Framework

    The core of this evolution is a co-evolutionary loop involving two roles: a Proposer (or Challenger) and a Solver. In the Dr. Zero and R-Zero frameworks, both models are initialized from the same base LLM. The Proposer is rewarded for generating tasks at the edge of the Solver's current capabilities, while the Solver is rewarded for successfully navigating these challenges using external tools like search engines. In search-agent contexts, the external search engine acts as a "teacher," providing the objective feedback needed to validate answers without human labels.

    Scaling Context through Recursion

    A major technical insight involves Recursive Language Models (RLMs), which address the "context rot" seen when LLMs handle very long prompts. Rather than feeding a massive document into the model's limited context window, RLMs load the prompt as a variable in a programming environment (REPL). The model then writes code to peek into, decompose, and recursively call itself over small snippets of the data, allowing it to process contexts two orders of magnitude beyond its native limit.

    Efficiency without Backpropagation

    New methods like Training-Free GRPO demonstrate that agent performance can be enhanced without costly parameter updates or gradient-based training. Instead of changing the model's weights, the system distills "experiential knowledge" from successful and failed attempts into a "token prior" or a hierarchical skill library. This knowledge is then injected into the prompt during inference, allowing a frozen model to achieve gains that previously required massive supervised fine-tuning.

    Practical Takeaways

    Small Model Parity: Through structured memory designs like ALMA or skill distillation in SKILLRL, smaller open-source models (e.g., 7B or 8B parameters) can match or exceed the performance of much larger frontier models like GPT-4o on specific tasks.
    Inference-Time Scaling: AI performance can be scaled at "test-time" by allowing the model more time/compute to think recursively and use tools, rather than just relying on pre-trained knowledge.
    Curriculum Generation: Systems that generate their own training data can act as "mid-training" amplifiers, making subsequent fine-tuning on human data significantly more effective.

    The Stability Ceiling: Research indicates that self-evolution is not yet infinite; models often experience a performance plateau or a "model collapse" after several iterations, where they begin to amplify their own biases or lose diversity.
    Data Quality Decay: As the Proposer generates more difficult questions, the Solver's ability to provide accurate pseudo-labels via majority voting decreases, leading to noisier training signals.
    Inference Costs: While "training-free" methods save on GPU hours, recursive calls can lead to high variance in inference costs and latency depending on task complexity.

    It is empirically validated that data-free agents can match supervised performance in constrained domains like competitive math and multi-hop search. However, whether these self-evolutionary dynamics can generalize to open-ended, subjective domains like creative writing or dialogue remains a speculative hurdle for future research. Additionally, while scaling inference compute via recursion shows promise, its long-term stability across diverse real-world task distributions is still being explored.
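The "pseudo-labels via majority voting" step mentioned above can be sketched as follows. This is a generic illustration rather than the procedure of any specific framework named in the episode; `pseudo_label`, `keep_for_training`, and the `min_agreement` threshold are hypothetical. The agreement ratio is one simple way to see the "Data Quality Decay" caveat: as questions get harder, sampled answers disagree more and the label becomes less trustworthy.

```python
from collections import Counter


def pseudo_label(answers: list[str]) -> tuple[str, float]:
    """Pick the most common sampled answer and report how strongly samples agree."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def keep_for_training(answers: list[str], min_agreement: float = 0.6) -> bool:
    """Assumed filter: drop questions whose majority label is too contested."""
    _, agreement = pseudo_label(answers)
    return agreement >= min_agreement


if __name__ == "__main__":
    print(pseudo_label(["42", "42", "41", "42"]))
    print(keep_for_training(["a", "b", "c", "a"]))
```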

    43 min
  4. May 1

    AI EVOLUTION FROM PROMPTS TO SELF-IMPROVING ARCHITECTURES

    This episode explores the transition of Large Language Models (LLMs) from reactive chatbots to autonomous, self-optimizing agents. We synthesize research on automated prompt engineering, the emerging maturity model of context and intent engineering, and the critical reliability gaps that surface during long-term delegation.

    MAIN IDEAS AND INSIGHTS

    The Maturity Pyramid of Agent Engineering: Prompting is evolving from a craft into a structured four-level hierarchy:
    Prompt Engineering (PE): The baseline of individual query formulation.
    Context Engineering (CE): Designing the informational environment (memory, tools, state) in which an agent operates.
    Intent Engineering (IE): Encoding organizational goals and trade-off hierarchies to ensure agents pursue the right outcomes.
    Specification Engineering (SE): Creating machine-readable corporate policies and standards to govern multi-agent systems at scale.

    LLMs as Optimizers: Models can now autonomously refine their own instructions. Tools like Automatic Prompt Engineer (APE) and OPRO demonstrate that LLMs can conduct black-box optimization to find prompts that outperform human-designed baselines. Self-referential systems like Promptbreeder use LLMs to mutate and evolve both task-prompts and the mutation instructions themselves, using natural language as the substrate for improvement.
    The Reliability Gap: While AI can "breed" better instructions, it often fails during extended delegation. The DELEGATE-52 benchmark reveals that even frontier models (e.g., GPT-5.4, Claude 4.6 Opus) corrupt an average of 25% of document content over 20 delegated interactions.
    Sparse Critical Failures: Document degradation is rarely a gradual "death by a thousand cuts." Instead, models maintain near-perfect performance for several rounds before suffering sparse "critical failures": single round-trips where 10-30% of content is suddenly lost or corrupted.
    Structured vs. Natural Language: LLMs are significantly more reliable at manipulating repetitive, structured files (code, JSON, Science & Engineering data) than natural language prose or lexically rich documents.
    Model Scale and Cost: Counter-intuitively, larger models are often more cost-effective for prompt optimization because they generate more concise instructions, which reduces the downstream cost of scoring those prompts.
    The "Goldilocks" Band: Prompt optimization is most effective for models in a specific capability range. If a model is too weak, it cannot follow complex evolved instructions; if it is too strong, it may already be "saturated," meaning bare-seed prompts already match its internal optimal behavior.

    Speculation vs. Claim: The "Four-Level Pyramid" is a proposed framework for managing corporate AI maturity; while independent authors are converging on this taxonomy, it is a management model rather than an established technical law.
    Measurement Deficit: There are currently no standardized metrics for "context relevance" or "intent alignment" without costly expert A/B testing.
    Tool Limitations: Adding a basic agentic harness (tools for file reading/writing) does not necessarily reduce document corruption in delegated tasks and can sometimes increase it due to long-context overhead.

    As AI systems grow more autonomous, the human role is shifting from tactical (writing phrases) to architectural (designing environments and encoding intent). Reliability remains the primary bottleneck for delegated work, requiring a move beyond "prompt art" toward rigorous state engineering.
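The "LLMs as Optimizers" idea above boils down to a propose-score-select loop. The sketch below shows only that loop shape and is not the APE, OPRO, or Promptbreeder algorithm: `optimize_prompt` is a hypothetical helper, and `toy_propose` and `toy_score` are deterministic stand-ins for what would be an LLM proposing rewrites and a benchmark scoring them.

```python
from typing import Callable

HINTS = ["Think step by step.", "Show your work.", "Answer concisely."]


def toy_propose(prompts: list[str]) -> list[str]:
    """Stand-in for an LLM mutation step: append one unused hint to each prompt."""
    return [f"{p} {h}" for p in prompts for h in HINTS if h not in p]


def toy_score(prompt: str) -> float:
    """Stand-in for task accuracy: reward two 'useful' hints, penalize length."""
    hits = sum(h in prompt for h in HINTS[:2])
    return hits - 0.01 * len(prompt.split())


def optimize_prompt(
    seed: str,
    propose: Callable[[list[str]], list[str]],
    score: Callable[[str], float],
    rounds: int = 5,
    keep: int = 3,
) -> str:
    """Black-box search: propose candidates, score them, keep the best few."""
    pool = {seed: score(seed)}
    for _ in range(rounds):
        for candidate in propose(list(pool)):
            pool.setdefault(candidate, score(candidate))
        best = sorted(pool, key=pool.get, reverse=True)[:keep]
        pool = {p: pool[p] for p in best}
    return max(pool, key=pool.get)


if __name__ == "__main__":
    print(optimize_prompt("Solve the problem.", toy_propose, toy_score))
```

The length penalty in `toy_score` is an assumption added so the search has a reason to stop growing the prompt; a real scorer would measure task performance instead.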

    1h 17m
  5. Apr 30

    The Statefulness Revolution: From AI Wrappers to Agentic Infrastructure

    The era of "stateless" AI is ending. For years, developers have struggled with LLMs that "forget" project conventions, hallucinate across long sessions, and buckle under context window limits. But a new wave of research, spanning Oxford, Peking University, and Tencent, is revealing how context is being codified into a persistent, version-controlled, and self-evolving infrastructure.

    In this episode, we break down the fundamental shifts from five papers that move us beyond simple prompt engineering toward Loosely-Structured Software (LSS). We explore how agents are learning to manage their own memory via "sawtooth" context profiles, Git-style version control for reasoning, and three-tier documentation architectures. Whether you are an investor looking for the next layer of the AI stack or an indie developer trying to scale an agentic workforce, these are the new "physics" of software development.

    Key Insights

    Memory as a Navigable Codebase: Advanced frameworks like the Git-Context-Controller (GCC) reframe agent memory as a file system where agents can COMMIT milestones, BRANCH to explore experiments, and MERGE distilled reasoning.
    The "Sawtooth" Context Profile: Models like StateLM maintain high accuracy by proactively pruning their own context: reading data, taking notes, and then "forgetting" the raw tokens to stay within optimal performance limits.
    Meta-Context Engineering (Skill Evolution): Top-performing systems decouple how to learn (meta-level skills) from what is learned (base-level artifacts), allowing agents to evolve their own operational protocols.
    Documentation is Machine Code: In large codebases (100k+ lines), documentation is no longer just for humans; it is the "hard drive" agents require to maintain consistency and follow architectural conventions.
    Managing Runtime Entropy: As multi-agent systems scale, they hit a "complexity ceiling" where coordination overhead outweighs utility; solving this requires Loosely-Structured Software (LSS) design patterns like Semantic Routers and Lenses.

    Actionable Takeaways

    Adopt a Three-Tier Context Architecture: Organize project knowledge into a Hot Memory Constitution (always-loaded rules), Specialized Domain Agents (area experts), and a Cold Memory Knowledge Base (on-demand specifications).
    Implement "Active Forgetting" Tools: Equip agents with tools like deleteContext to manually prune their history once a task is distilled into a persistent note.
    Codify Experience into Specification: If you have to explain a domain rule twice to an agent, codify it into a machine-readable .md spec that specialized agents can retrieve via protocols like MCP.
    Use Semantic Design Patterns: Implement a Semantic Lens to filter information for a specific step and a Mediator to prevent agents from polluting each other's memory during collaboration.

    Maintenance Overhead (Strong Claim): Vasilopoulos reports that maintaining machine-readable specs adds roughly 1–2 hours per week of manual labor for a 100k-line project.
    Risk of Spec Staleness (Strong Claim): Agents trust documentation absolutely; out-of-date specifications lead to "silent failures" where code is syntactically correct but logically conflicting.
    Orchestration Costs (Skeptical View): Introducing Lenses and Routers increases token consumption and latency because the system requires additional agent calls to manage its own context.
    Small Model Fragility (Strong Claim): 8B-scale models are significantly more prone to over-classification (false positives) in safety tasks when compared to larger models, requiring aggressive "early safe return" mechanisms in retrieval.
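The "sawtooth" context profile and the "active forgetting" takeaway can be pictured with a toy context manager. This is a sketch under stated assumptions, not StateLM's mechanism: `SawtoothContext` and `process` are hypothetical, `delete_context` only mirrors the `deleteContext` tool name mentioned above, and word count is used as a crude proxy for tokens.

```python
class SawtoothContext:
    """Toy context: raw chunks are read, distilled into notes, then dropped."""

    def __init__(self) -> None:
        self.notes: list[str] = []    # persistent, compact
        self.raw: list[str] = []      # bulky, short-lived
        self.profile: list[int] = []  # context size after each read/delete

    def size(self) -> int:
        # Word count as a rough stand-in for a token count.
        return sum(len(text.split()) for text in self.notes + self.raw)

    def read(self, chunk: str) -> None:
        self.raw.append(chunk)
        self.profile.append(self.size())

    def take_note(self, note: str) -> None:
        self.notes.append(note)

    def delete_context(self) -> None:
        # "Forget" the raw tokens; only the distilled notes survive.
        self.raw.clear()
        self.profile.append(self.size())


def process(chunks: list[str]) -> SawtoothContext:
    ctx = SawtoothContext()
    for chunk in chunks:
        ctx.read(chunk)
        ctx.take_note(chunk.split()[0])  # stand-in for a model-written summary
        ctx.delete_context()
    return ctx


if __name__ == "__main__":
    ctx = process(["one two three four five six", "seven eight nine ten eleven twelve"])
    print(ctx.profile)  # rises on each read, falls on each delete
```

Plotting `profile` gives the sawtooth: each read spikes the context, each deletion drops it back to the slowly growing size of the notes.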

    56 min
  6. Apr 8

    The Rise of Agentic Intelligence: From Open-Weight Reasoning to Silent Thinking

    This episode explores the fundamental shift in artificial intelligence from passive sequence generators to autonomous agents capable of multi-step reasoning, planning, and tool interaction. We dive into the recent release of OpenAI’s open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, which bring frontier reasoning capabilities to the open-source community. We also examine architectural breakthroughs like Meituan’s LongCat-Flash, which introduces Zero-computation Experts to optimize efficiency. Finally, the episode discusses the technical paradigms of Implicit Reasoning, where AI "thinks" silently in latent space rather than through visible text, and Dynamic Speculative Planning, a framework that accelerates agentic workflows by adaptively predicting future actions.

    Key Takeaways:

    Open-Weight Reasoning Frontier: OpenAI has released gpt-oss-120b and gpt-oss-20b, open-weight models designed for agentic workflows with strong instruction following and tool use. gpt-oss-120b matches or exceeds proprietary models like o3-mini on canonical reasoning and coding benchmarks.
    Architectural Efficiency with MoE: The LongCat-Flash model introduces Zero-computation Experts, allowing the model to dynamically allocate computational resources based on token significance. This allows it to activate an average of 27B parameters out of a 560B total, optimizing both training throughput and inference speed.
    The Paradigm Shift to Agentic RL: Traditional RL focused on single-turn alignment, but Agentic Reinforcement Learning (Agentic RL) reframes LLMs as autonomous decision-makers operating in dynamic, partially observable environments (POMDPs).
    Lossless Acceleration via DSP: Dynamic Speculative Planning (DSP) provides a way to reduce agent latency by having a "draft" model predict multiple future steps that a "target" model verifies in parallel. By using online RL to adjust the number of speculative steps, DSP can reduce total costs by 30% without sacrificing performance.
    Thinking Without Words: Research is shifting from explicit Chain-of-Thought (CoT), which is verbose and resource-intensive, toward Implicit Reasoning. This silent reasoning happens internally within the model’s latent representations, leading to faster inference and more diverse reasoning paths.
    Safety in Open Models: While open-weight models like gpt-oss follow safety policies by default, they present a different risk profile because actors can fine-tune them to bypass refusals. However, evaluations show that even with adversarial fine-tuning, these models do not currently reach "High" capability thresholds for biological or cyber risks.

    Producer's note: This is the final episode of Season 1. Season 2 is coming soon.
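The draft-and-verify idea behind speculative planning can be sketched as below. This is a simplified illustration and not the DSP framework itself: `speculative_steps` is a hypothetical helper, verification runs sequentially here rather than in parallel, and the number of speculative steps `k` is fixed instead of being tuned by online RL. It does show why the approach can be lossless: the kept actions are always the target's own, so the result matches what the target would have done alone.

```python
from typing import Callable

Policy = Callable[[list[int]], int]  # maps a history of actions to the next action


def speculative_steps(state: list[int], draft: Policy, target: Policy, k: int) -> list[int]:
    """Let a cheap draft guess k steps ahead; keep the prefix the target agrees with."""
    # 1. Draft rolls forward on its own guesses.
    proposed: list[int] = []
    rollout = list(state)
    for _ in range(k):
        action = draft(rollout)
        proposed.append(action)
        rollout.append(action)

    # 2. Target checks each guess. In a real system these calls could be issued
    #    concurrently, since every input state is already known from the draft.
    accepted: list[int] = []
    verified = list(state)
    for guess in proposed:
        truth = target(verified)
        accepted.append(truth)  # always keep the target's action
        verified.append(truth)
        if truth != guess:      # first mismatch invalidates later guesses
            break
    return accepted


if __name__ == "__main__":
    target = lambda history: len(history)                               # toy "expensive" policy
    draft = lambda history: len(history) if len(history) < 2 else -1     # right twice, then wrong
    print(speculative_steps([], draft, target, k=4))
```

When the draft is right, several steps are confirmed per round; when it is wrong, the cost is one wasted batch of guesses, which is the trade-off the adaptive step count is meant to manage.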

    56 min
  7. Apr 8

    The Ideas Behind the AI Revolution: Principles of Deep Learning

    This episode explores the core principles and underlying ideas of deep learning. We delve into the fundamental taxonomy of machine learning (supervised, unsupervised, and reinforcement learning) and examine the mechanics of how neural networks use parameters and loss functions to learn from data. The discussion also addresses the "unreasonable effectiveness" of deep learning, explaining why massive, overparameterized networks often perform better than simpler models, and concludes with a critical look at the ethical imperatives regarding bias, transparency, and accountability that every AI practitioner must face.

    Deep Learning is Built on Core Ideas, Not Just Code: The field is centered on understanding the principles that allow models to be applied to novel situations where no existing "recipe" for success currently exists.
    The Supervised Learning Pipeline: Training a model is essentially a search through a family of mathematical equations to find the specific parameters that minimize a "loss function," which quantifies the mismatch between model predictions and real-world data.
    The Advantage of Depth: While both shallow and deep networks can technically approximate any function, deep networks are more efficient, producing significantly more linear regions per parameter and generally achieving better results on complex tasks like image processing.
    The Mystery of Effectiveness: It is scientifically surprising that deep networks work so well; they often have far more parameters than training examples, yet they reliably fit complex functions and generalize to new data rather than simply memorizing the training set.
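The "search for parameters that minimize a loss function" can be made concrete with the smallest possible case: fitting a line by gradient descent on mean squared error. This is a minimal sketch; `fit_line`, the learning rate, the step count, and the sample data are illustrative choices, not taken from the episode.

```python
def mse(w: float, b: float, xs: list[float], ys: list[float]) -> float:
    """Loss: mean squared mismatch between predictions w*x + b and targets y."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs: list[float], ys: list[float], lr: float = 0.05, steps: int = 2000) -> tuple[float, float]:
    """Search the family y = w*x + b for the parameters with the lowest loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


if __name__ == "__main__":
    xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
    w, b = fit_line(xs, ys)
    print(round(w, 3), round(b, 3), mse(w, b, xs, ys))
```

A neural network replaces the two parameters with millions and the line with a much richer family of functions, but the loop of "measure mismatch, nudge parameters downhill" is the same.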

    54 min
  8. Apr 7

    Mastering the Lakehouse: A Deep Dive into MLOps and the Future of LLMOps

    In this episode, we explore the evolving landscape of Machine Learning Operations (MLOps) through the lens of Databricks’ updated "Big Book of MLOps." We break down the essential equation that defines the field, MLOps = DataOps + DevOps + ModelOps, and discuss how a unified, data-centric approach on the Lakehouse platform accelerates business value. From the foundational principles of environment separation to the challenges of productionizing Large Language Models (LLMs), we provide a comprehensive roadmap for building robust, scalable, and efficient AI workflows.

    Takeaways

    The Power of Unified Governance: A central theme is the move toward a unified governance solution for both data and AI assets using Unity Catalog. By managing models, feature tables, and volumes in one place, organizations can ensure consistent access controls, trace lineage from data to model, and significantly improve asset discoverability.
    "Deploy Code" Over "Deploy Models": For most use cases, the sources recommend a "deploy code" approach. In this workflow, code, rather than a static model artifact, is promoted through development, staging, and production environments. This ensures that the entire pipeline is rigorously tested and reproducible in the production environment.
    Real-Time Serving and Monitoring: Modern MLOps requires more than just batch processing. Databricks Model Serving provides a serverless, highly available way to deploy models as REST APIs. To ensure long-term stability, Lakehouse Monitoring is used to automatically detect data drift and model quality degradation, triggering alerts or retraining when performance deviates from expectations.
    The Shift to LLMOps: The arrival of Generative AI introduces new challenges, such as prompt engineering and the need for human feedback in the evaluation process. While LLMOps shares the same modular foundation as traditional MLOps, it focuses more on packaging "chains" or "agents" and managing the unique cost/performance trade-offs of large-scale models.
    Leveraging Proprietary Data with RAG: To overcome the limitations of static training data, the sources highlight Retrieval Augmented Generation (RAG). RAG connects LLMs to real-time, domain-specific data via vector databases, allowing the model to act as a reasoning engine that provides accurate, up-to-date responses without the massive overhead of full pre-training.

    1 hr
