AI Papers: A Deep Dive

paperdive.ai

Long-form deep dives into new research on Artificial Intelligence, AI agents and the engineering practice of building them - one paper per episode. We unpack the motivating problem, how the method actually works, the math that matters, what the experiments do and don't show, and the strongest critique against the result. The goal isn't a five-minute summary; it's the kind of conversation you'd have with a colleague who actually read the paper. Topics span large language models, autonomous agents, agentic coding, reinforcement learning for agent training, evaluation and benchmarks, alignment, and the practical engineering decisions that make agentic systems actually work in production. Most papers are pulled from arXiv, often within days of release. Hosted by AI voices generated with ElevenLabs. Episode scripts are produced by a multi-stage Claude pipeline working from a close reading of the source paper. New episodes daily.

  1. 22 hours ago

    The Bug Where Smart Assistants Read a Fact and Still Forget It

    Source: https://arxiv.org/abs/2606.27472
    Paper was published on June 25, 2026. This episode was AI-generated on June 29, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

    A frontier model can read that you moved to the suburbs and still insist it has no idea where you live - and neither a bigger model nor 24x more memory closes that gap. This paper argues every AI lab that shipped persistent memory in 2026 is treating a behavior problem as a storage problem, and shows the one intervention that actually moves the needle.

    Key Takeaways:
    - Why a model can score 92% reading the full conversation but drop to 77% maintaining the same facts from compressed notes - and why that 'supersession gap' is a maintenance problem, not a comprehension one
    - The 13-to-1 result showing the failure is real and one-directional, not statistical noise
    - Why a bigger model and 24x more memory both fail to close the gap, with the desk-clutter intuition for why extra storage helps and hurts in equal measure
    - How a reinforcement-learning reward that targets which version of a fact is *current* nearly doubles held-out accuracy on a small model (9% to 16.7%)
    - The training curve that 'switches on' exactly when the behavior is learned - the cleanest evidence it's real learning, not luck
    - Why the headline training result is a single-seed proof of mechanism, and the specific cracks (lenient matching, small question counts, one kind of scale) the episode is honest about

    00:00 - A fact it read and lost: The cold open on a model that has the answer in front of it and still says it has no information, and the 15-point gap that frames the episode.
    01:45 - Why memory becomes a sticky note: How real systems compress conversations into a notes field, why the agent must actively overwrite stale facts, and the definition of supersession.
    04:00 - Is the failure even real?: The clever same-questions experiment comparing full-context to bounded-memory, yielding a 13-to-1 result that isolates maintenance from comprehension.
    06:37 - Does a smarter model save you?: Comprehension scales toward solved while the bounded-memory line stalls - proving the skill that scaled isn't the skill that's failing.
    07:38 - Can you just buy more memory?: Pulling apart conversation length from compression ratio reveals 24x more memory recovers exactly nothing, even though all answers changed.
    11:33 - Training the model to keep facts current: Building a reinforcement-learning reward that rewards temporal currency directly, why synthetic data becomes curriculum, and how GRPO scores without a critic.
    16:16 - The curve that switches on: The small model nearly doubles its held-out accuracy, with a training curve that turns on precisely when the behavior is acquired.
    18:06 - Where the result actually lands: The honest reservations - single seed, lenient matching, small counts, one kind of scale - and why the diagnosis is strong while the training claim is a first data point.
    21:44 - Train the habit or change the substrate?: The bigger reframe that current memory is a learned policy, not a side effect of intelligence, and the closing question posed to listeners.

    Recommended Reading:
    - LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory: The benchmark this episode's diagnosis is built on - its knowledge-update questions are exactly what Patel runs under full-context versus bounded-memory conditions. (https://arxiv.org/abs/2410.10813)
    - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models: The paper that introduced GRPO, the critic-free reinforcement learning method whose self-terminating batch behavior the episode leans on as evidence of real learning. (https://arxiv.org/abs/2402.03300)
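For listeners who want the "GRPO scores without a critic" idea in concrete form, here is a minimal sketch. It assumes the group-relative advantage described in the DeepSeekMath paper: sample several answers to the same prompt, then score each one against the group's own mean and spread instead of a learned value network. The function name, the `eps` constant, the use of population standard deviation, and the example rewards are illustrative assumptions, not details taken from the episode's source paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled answer relative to its own group (no critic network).

    rewards: list of scalar rewards, one per sampled answer to the same prompt.
    eps:     hypothetical small constant to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed group: correct answers get positive advantage, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform group: every answer earns the same reward, so every advantage is zero
# and this group contributes no learning signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

The uniform case is one plausible reading of the "self-terminating batch behavior" mentioned above: once every sample in a group gets the same reward, the group stops pushing the policy in any direction.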

    24 min
  2. 22 hours ago

    Why You Can't Fine-Tune Foresight Into an AI Agent

    Source: https://arxiv.org/abs/2606.27483
    Paper was published on June 25, 2026. This episode was AI-generated on June 29, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

    A team taught a language model to forecast the future before acting - and it learned the format flawlessly while learning none of the actual skill, slapping a confident 100% on plans that were vague and contradictory. The unsettling claim: fine-tuning can install a perfect imitation of an ability with nothing underneath it. This episode unpacks the format-capability gap, a three-stage fix that makes a model's self-confidence honest, and exactly where the evidence outruns the story.

    Key Takeaways:
    - Why teaching a model the shape of foresight produces a confident lie instead of an actual plan - the 'format-capability gap' between eliciting an ability and installing one
    - The three-stage apprenticeship that injects the capability first (200B tokens of trajectories), teaches the format second, and makes the confidence number honest third
    - How a 'verbalized Q-value' plus a Brier-score calibration reward keeps the model's self-confidence honest so it can't be gamed
    - Why confidence that drops to 5% on an unsolvable problem is more useful for deployed agents than confidence that stays high and wrong
    - The honest limits: the full fix runs on one proprietary 2B model, the gains over a simpler baseline are about two points, and the calibration evidence is hand-picked rather than measured
    - Why skipping the supervised warm-start collapses the whole pipeline - reward can sharpen a skill but can't conjure one

    00:00 - Perfect form, empty content: The cold open lays out the central failure: a model that learned the foresight format almost perfectly while producing hollow, confidently-wrong plans.
    01:48 - Why the next agents live or die on this: Eric and Cassidy frame why a trustworthy confidence signal matters for long-running agents, and trace the idea of mental rehearsal back to Kenneth Craik's 1943 work.
    02:38 - The format-capability gap: The naive fix - fine-tuning on look-ahead blocks - fails the same way across three models, illustrating that post-training elicits abilities but can't install new ones.
    05:11 - A world model written out loud: The pivot to a folded-in, verbalized world model and the 'Q-value spoken out loud' framing that makes the model's own estimate gradeable against reality.
    07:43 - Build the skill before the report: The three-stage apprenticeship - capability injection via teacher-written future blocks, format teaching, and a reinforcement-learning stage with stacked grounding, calibration, and task rewards.
    12:53 - Watching the confidence wobble: Confidence trajectories that rise, dip on caught errors, and stay low on unsolvable problems - the behavior the design is aiming for, with a flag that these cases are hand-picked.
    14:56 - The headline is smaller than the story: The benchmark results: modest ~two-point gains, real payoff on multi-hop reasoning, near-free efficiency, and a collapse when the warm-start is skipped.
    17:36 - Where the ideas outrun the evidence: The steelman critique - incremental gains, a single unreproducible proprietary model, anecdotal calibration, leaky teacher blocks, and a loosely-used Q-value label - set against the well-documented diagnosis.

    Recommended Reading:
    - Dream to Control: Learning Behaviors by Latent Imagination: The Dreamer line of work the episode contrasts against - a separate latent world-model simulator the agent dreams inside, exactly the 'bolted-on module' framing this paper deliberately refuses. (https://arxiv.org/abs/1912.01603)
    - ReAct: Synergizing Reasoning and Acting in Language Models: The reactive step-see-step paradigm the episode positions as the status quo the paper is trying to move agents beyond. (https://arxiv.org/abs/2210.03629)
    - Language Models (Mostly) Know What They Know: A direct empirical counterpoint to the episode's skepticism about self-graded confidence, asking whether a model's own calibration signal can be trusted across many episodes. (https://arxiv.org/abs/2207.05221)
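To make the Brier-score calibration reward from the takeaways concrete, here is a minimal sketch. It assumes a binary outcome (the task succeeded or it did not) and a reward of one minus the squared error between stated confidence and outcome. The function names and that exact reward shape are assumptions for illustration; the notes above do not say how the paper combines this term with its grounding and task rewards.

```python
def brier_score(confidence, succeeded):
    """Squared error between stated confidence (0..1) and the binary outcome."""
    outcome = 1.0 if succeeded else 0.0
    return (confidence - outcome) ** 2


def calibration_reward(confidence, succeeded):
    """Hypothetical reward shape: higher is better, 1.0 is a perfect forecast."""
    return 1.0 - brier_score(confidence, succeeded)


def expected_reward(reported, true_p):
    """Average reward if the task truly succeeds with probability true_p."""
    return (true_p * calibration_reward(reported, True)
            + (1.0 - true_p) * calibration_reward(reported, False))


# A confident 100% on a plan that fails earns nothing.
overconfident = calibration_reward(1.0, False)

# Reporting 5% on an unsolvable problem is rewarded almost fully.
honest_low = calibration_reward(0.05, False)
```

Because the Brier score is a proper scoring rule, `expected_reward` is highest when the reported confidence equals the true success probability, which is why a reward of this shape is hard to game by always answering 100%.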

    23 min
  3. 22 hours ago

    How a Tiny Model Too Weak to Plan Cuts a Bigger Agent's Hallucinations by 80%

    Source: https://arxiv.org/abs/2606.27806
    Paper was published on June 26, 2026. This episode was AI-generated on June 29, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

    A neural network with about five thousand parameters - too weak to solve half the tasks on its own - slashes a GPT-4o-mini agent's hallucinations by eighty percent. The trick isn't intelligence, it's checkability: in an agent, one false claim at step three corrupts everything downstream, and a cheap proofreader that catches just the right error stops the cascade.

    Key Takeaways:
    - Why a hallucination inside an agent isn't a single mistake - it's a corruption of the agent's belief that manufactures more errors downstream
    - How GILP staples a weak-but-grounded trained model onto a smart-but-lying LLM, using a consistency gate that only re-prompts the steps worth doubting
    - The 'hallucination contraction' math: under one stated assumption, the agent's error rate can only shrink, never grow, even when the little model is wrong
    - Why a tiny MLP grounds the agent as well as a 99%-accurate graph transformer - the backbone doesn't need to be a good planner to be a useful error signal
    - Where the evidence is thin: the big success-rate jumps come from a simulator fit to just five real runs, and the cross-domain test came back inconclusive

    01:12 - One false word, three broken actions: How a single dropped precondition gets written into the transcript as fact and cascades into a chain of broken actions.
    02:50 - Two world models, opposite ways to fail: The trade-off between a weak trained predictor whose errors are measurable and an LLM world model that plans well but lies confidently.
    04:31 - Staple the dumb model to the brain: How GILP runs a per-step loop that grounds the LLM before it acts and fires a consistency gate when the two predictions diverge.
    07:20 - Why the error rate can only shrink: The hallucination-contraction proof showing the agent's error rate can only go down, under one stated assumption, even with an imperfect backbone.
    09:04 - The eighty percent that actually holds: The live-data result - hallucinated-state rate dropping from 18% to under 4% - plus why the bigger success numbers come from a conservative simulator.
    11:04 - A smarter checker buys nothing: The surprising plateau where a tiny MLP grounds the agent as well as a near-perfect graph transformer, and what that reveals about grounding.
    12:39 - The asterisk Finn held all episode: The honest limits: a simulator fit to five real episodes, easy live tasks that hit 100% both ways, and an inconclusive out-of-domain test.
    14:47 - Stop building bigger verifiers: The bigger reframing - treating hallucination as belief corruption - and the fork builders face between a tiny trained checker and a less-lying big model.

    Recommended Reading:
    - Reasoning with Language Model is Planning with World Model: Introduces using an LLM as its own world model for planning, exactly the 'fluent tour guide who invents landmarks' approach GILP grounds with a trained checker. (https://arxiv.org/abs/2305.14992)
    - ReAct: Synergizing Reasoning and Acting in Language Models: The agent paradigm where the model re-reads its own transcript as fact each step - the 'journal you only plan from' that lets one false token corrupt everything downstream. (https://arxiv.org/abs/2210.03629)
    - Reflexion: Language Agents with Verbal Reinforcement Learning: An alternative to GILP's external checker - having the agent verbally self-critique and revise, a useful contrast for the episode's 'make the big model stop lying' fork. (https://arxiv.org/abs/2303.11366)
    - Survey of Hallucination in Natural Language Generation: Background on the single-answer hallucination framing the episode argues breaks down inside agents, where a hallucination compounds rather than being one isolated mistake. (https://arxiv.org/abs/2202.03629)

    17 min
  4. 22 hours ago

    How to Backpropagate Blame Through a Team of Chatbots — And When It Backfires

    Source: https://arxiv.org/abs/2606.28187
    Paper was published on June 26, 2026. This episode was AI-generated on June 29, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

    Split a strong language model into a team of specialist agents and it can actually do worse than the single model alone. This episode unpacks a method that borrows gradients from deep learning to find exactly which agent dropped the ball - a fix that nearly doubled one system's accuracy, and collapsed another from 71 to 7.

    Key Takeaways:
    - Why a team of specialist agents can underperform a single model - fluent and helpful while getting the user's actual goals wrong
    - How GBC reframes credit assignment by weighting each agent connection by real influence and tracing the error backward to a culprit
    - Why the gradients only do diagnosis (the MRI) while a separate LLM optimizer does the plain-English repair (the surgeon)
    - The empirical reversal: plain 'loudness' beats the theoretically favored value-weighted attribution as a blame signal
    - The core claim that attribution quality predicts optimization quality - the bottleneck is the diagnosis, not the fixer
    - The honest limits: one backbone regressed from 71 to 7, no ablation isolating attribution from a competent optimizer, and token-level precision never directly verified

    00:00 - When the team loses to one model: Sets up the puzzle that a team of specialist agents can score worse than a single model, and previews the doubling-versus-collapse hook.
    01:58 - Whose move actually lost the game?: Frames multi-agent failure as the old reinforcement-learning problem of credit assignment - knowing which step in the relay dropped the baton.
    03:40 - Putting a sensitivity meter on the arrows: Explains how GBC treats the team as a graph, weights each connection by how much one agent influenced the next, and traces blame backward.
    06:30 - Are gradients actually fixing anything?: Clarifies that the gradients only diagnose; a separate LLM optimizer reads the blame report and rewrites prompts in plain English.
    08:30 - Why the fancier metric loses: Walks through the four attribution variants and the surprising finding that raw loudness beats value-weighted attribution as a blame signal.
    12:28 - From worst-in-class to best-in-class: Presents the headline results on MultiWOZ and τ-bench retail where the Qwen team roughly doubled its accuracy and beat the single-agent baseline.
    14:49 - The same fix that gutted Llama: Pulls on the reservations: backbone variance, no ablation isolating the gradient attribution, and the unverified token-level precision.
    18:28 - Maybe the scan, not the surgeon: Lands on the durable principle that the quality of the blame signal may matter more than the optimizer, and poses the fork between sharper diagnosis and end-to-end training.

    Recommended Reading:
    - Why Do Multi-Agent LLM Systems Fail?: The episode cites this paper by name as the diagnosis of why agent teams fall apart - bad hand-offs, information omission, weak verification - the exact failure modes GBC tries to localize. (https://arxiv.org/abs/2503.13657)
    - TextGrad: Automatic 'Differentiation' via Text: The textual-gradient optimizer lineage the episode places GBC against - it works off a global verdict, which is precisely the credit-assignment limitation GBC sets out to fix. (https://arxiv.org/abs/2406.07496)
    - Learning Important Features Through Propagating Activation Differences (DeepLIFT): Shrikumar et al. 2017, the source of the gradient-times-input attribution idea whose value-weighted variants the episode says surprisingly lose to raw loudness. (https://arxiv.org/abs/1704.02685)
    - DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines: Part of the LLM-pipeline-optimization lineage the episode names as pouring effort into the 'fixer' rather than the diagnosis signal GBC argues is the real bottleneck. (https://arxiv.org/abs/2310.03714)

    20 min
  5. 2 days ago

    How DeepSeek Made One User Faster Without Slowing Down the Crowd

    Source: https://raw.githubusercontent.com/deepseek-ai/DeepSpec/main/DSpark_paper.pdf
    Paper was published on June 27, 2026. This episode was AI-generated on June 27, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

    DeepSeek tore out the fast-text part of its flagship model two weeks into running it - and the replacement makes each user's words come back up to 85% faster while serving the same crowd on the same GPUs. The twist: their winning drafter is the 'dumber' one that guesses words blind, and the whole system works partly because a sloppy production shortcut accidentally made the math more correct. By the end you'll understand the two moves that break a trade-off everyone assumed was iron.

    Key Takeaways:
    - Why position-one accuracy carries enormous leverage in speculative decoding - and how a 'tall cliff' parallel drafter beats a flat-but-coherent autoregressive one
    - How DSpark's semi-autoregressive design keeps a deep parallel backbone but adds a tiny cheap correction head to stop the draft's tail from rotting
    - Why aggressive drafting blew up DeepSeek's last production system, and how making draft length a live, load-aware decision fixes the throughput-versus-latency trade
    - The causality trap in load-aware scheduling - and how using stale, two-step-old data accidentally restores the lossless guarantee instead of breaking it
    - The honest critique: the offline quality numbers and the production numbers never meet in one experiment, and the win is partly over a deliberately timid single-token baseline
    - Why the headline isn't one magic multiplier but a better Pareto frontier - more speed and more users on the same hardware

    01:30 - Why the slow part is the bottleneck: Explains why one-token-at-a-time generation wastes a parallel GPU and turns a math monster into a typewriter.
    02:07 - The junior, the expert, and free speed: Walks through speculative decoding and the rejection-sampling math that makes it exactly lossless.
    03:00 - Two camps, both half-right: Lays out autoregressive versus parallel drafters and the 'of problem' multi-modal collision that wrecks parallel accuracy.
    04:45 - When the incoherent drafter won: Introduces position-wise conditional acceptance and the cliff-versus-plateau chart that reveals first-token leverage.
    06:36 - A sliver of memory beats stacked depth: Describes the semi-autoregressive Markov head, why it must stay strictly local, and the accepted-length gains it buys.
    10:07 - The second bottleneck that killed production: Shifts to the verification term and why longer drafts steal batch capacity from other users under heavy load.
    11:51 - Express lanes that open with traffic: Explains the load-aware scheduler, the confidence head, and the temperature-scaling fix for overconfident estimates.
    13:43 - How stale data fixed the cheating problem: Shows how greedy admission collapses the combinatorial problem and how two-step-old estimates accidentally restore losslessness.
    16:46 - Does the whole machine actually hold up?: Presents the production results, the disowned 661% number, and the Pareto frontier as the honest headline.
    19:17 - The catch the paper sometimes blurs: Critiques the gap between offline and production numbers, the timid baseline, the heuristic scheduler, and the dropped RNN head - then names the durable ideas.

    Recommended Reading:
    - Fast Inference from Transformers via Speculative Decoding: The original speculative-decoding paper and the rejection-sampling foundation DSpark builds on without modifying - essential for understanding the lossless guarantee the episode keeps returning to. (https://arxiv.org/abs/2211.17192)
    - EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty: The autoregressive drafter lineage (Eagle3) that DSpark benchmarks its accepted-length gains against, representing the 'coherent but sequential' camp the episode contrasts with parallel drafters. (https://arxiv.org/abs/2401.15077)
    - Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads: The canonical parallel multi-head drafter that exemplifies the 'fire out the whole block at once' camp whose multi-modal collision problem DSpark's semi-autoregressive head is designed to fix. (https://arxiv.org/abs/2401.10774)
    - On Calibration of Modern Neural Networks: Introduces temperature scaling, the exact single-dial calibration fix DSpark applies to its confidence head so the scheduler can trust survival-probability magnitudes, not just their ranking. (https://arxiv.org/abs/1706.04599)
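The "rejection-sampling math that makes it exactly lossless" can be sketched for a single drafted token. This follows the accept/resample rule from the original speculative-decoding paper listed above, not anything specific to DSpark: accept the drafted token with probability min(1, p/q), and on rejection resample from the normalized positive part of p minus q. The function name, the dictionary representation of distributions, and the toy vocabularies are assumptions for illustration.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng=random):
    """Accept or replace one drafted token.

    token:    token proposed by the drafter (sampled from q_draft).
    p_target: dict token -> probability under the large target model.
    q_draft:  dict token -> probability under the small draft model.
    Returns (accepted, output_token).
    """
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # the drafter sampled this token, so q > 0
    if rng.random() < min(1.0, p / q):
        return True, token

    # Rejected: resample from the residual distribution max(0, p - q), normalized.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]


# Drafter and target agree exactly: the draft is always kept.
same = verify_draft_token("a", {"a": 0.5, "b": 0.5}, {"a": 0.5, "b": 0.5})

# Target gives the drafted token zero probability: always rejected and replaced.
fixed = verify_draft_token("a", {"a": 0.0, "b": 1.0}, {"a": 1.0})
```

The point of the residual step is that accepted and resampled tokens together follow the target distribution, which is the lossless guarantee the episode keeps returning to.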

    23 min
  6. 4 days ago

    Why Raw Profiler Data Made an AI Worse at Writing GPU Code

    Source: https://arxiv.org/abs/2606.26453
    Paper was published on June 24, 2026. This episode was AI-generated on June 26, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

    Feeding a language model detailed hardware measurements about its GPU code made the code slower than telling it nothing at all - and that counterintuitive result is the foundation for a system that wrote a kernel from scratch beating the human experts who hand-tuned the production version. The fix wasn't more data; it was a deterministic layer that pre-digests measurements into expert-style diagnoses. You'll learn why interpretation beats raw access, and exactly where the headline claims hold up and where they're thinner than they look.

    Key Takeaways:
    - Why raw hardware counters produced a smaller speedup (1.8x) than giving the model no profiling data at all (3.3x) - and why that gap is the paper's most confident result
    - How KernelPro splits 'reading the profiler' from 'writing the code,' encoding 15 expert heuristics as deterministic tools that output diagnoses, not numbers
    - Why the SASS disassembly tool caught 37 kernels silently falling back to slow scalar code that no utilization metric could have detected
    - How the Monte Carlo Tree Search uses log-scaled rewards and a hard correctness wall to avoid being seduced by easy wins on garbage code
    - The production case study where a from-scratch kernel climbed from 14x slower to 1.23x faster than expert engineers over 18 iterations - and why the skeptic calls it an N-of-one result
    - Where the claims weaken: speedups measured against unoptimized PyTorch, unfair cross-system comparisons, and a 'headline' search-memory feature that didn't clear significance

    00:45 - How can information make you worse?: The hosts establish the central paradox - raw profiler data underperforming silence - and why writing fast GPU code is the bottleneck under all of modern AI.
    01:56 - What actually makes a kernel fast?: A primer on GPU kernels, the memory hierarchy, and why the expert's scarce skill is diagnostic reasoning, not reading numbers off a profiler.
    04:02 - The category error everyone was making: Why jamming interpretation and creative code-writing into one step fails, and how KernelPro's 15 micro-profiling tools encode expert heuristics as trigger-analysis-prescription rules.
    07:56 - Checking the receipt against the kitchen: How three profilers each see something the others can't, and why reading the literal compiled machine instructions caught 37 silent scalar fallbacks.
    10:35 - The search that refuses to quit: How the tree search treats each node as a full compiled kernel, uses asymmetric branching, and log-scales rewards with a hard correctness wall to stay patient through repeated failure.
    14:46 - Does it actually hold up?: The KernelBench results and ablation ladder, followed by the skeptic's caveats about PyTorch-eager baselines and unfair cross-system comparisons.
    17:26 - Beating the humans, once: The production case study where KernelPro wrote a from-scratch CUDA kernel that edged past expert engineers - and a careful debate over what a single 1.23x result really proves.
    20:46 - Same speed, less power: A preliminary energy-aware experiment cutting power by 12% with no speed cost, plus an honest accounting of which of the paper's own features underperformed.
    22:54 - Diagnose first, then prescribe: The takeaway reframe - that raw data without interpretation misleads - and the open question of whether hand-coded heuristics are the future or a crutch.

    Recommended Reading:
    - KernelBench: Can LLMs Write Efficient GPU Kernels?: The standard 250-task benchmark across three difficulty tiers that this episode's KernelPro system is evaluated on - including the PyTorch-eager baseline caveat the hosts flagged. (https://arxiv.org/abs/2502.10517)
    - Mastering the Game of Go with Deep Neural Networks and Tree Search: The AlphaGo paper that popularized the Monte Carlo Tree Search algorithm KernelPro adapts - useful for understanding the explore-versus-exploit framing the hosts spent time on. (https://doi.org/10.1038/nature16961)

    25 min
  7. 4 days ago

    How an AI Reviewer Learned to Stop Going Easy on AI Writing

    Source: https://arxiv.org/abs/2606.26294
    Paper was published on June 24, 2026. This episode was AI-generated on June 26, 2026. The script was written by an AI language model and the host voices were synthesized by ElevenLabs. The producer is not affiliated with Anthropic or ElevenLabs.

    An AI paper-reviewer was caught accepting machine-written papers nearly twice as often as human ones - and the researchers found a mechanical recipe to train that bias right out. The trick is letting the test itself evolve alongside the thing it grades, without the measurements turning to nonsense. It's a concrete proposal for how self-improving AI might escape the tiny island of coding and math where clean, fixed scoring exists.

    Key Takeaways:
    - Why recursive self-improvement only works where there's a cheap, trustworthy way to score output - and why a moving judge normally breaks that
    - The 'controlled utility evolution' trick: freeze the judge inside an epoch, swap only at boundaries against fixed real-world 'anchor' data
    - Why erasing old scores when a new judge takes over is the load-bearing step - without it, a stricter judge changes essentially nothing
    - How the system trained self-preference bias out of an AI reviewer by trapping it with the exact papers that fooled it earlier
    - The surprise that the proof grader's biggest gain came from getting less strict - learning calibration, not cruelty
    - Where the skeptic wins: the whole framework is only as good as its imperfect anchor, and the writing results were never checked by a human

    00:00 - The judge who changes nothing: A gymnastics-judging analogy sets up the paper's central trick - a stricter standard only counts if you wipe the old scores.
    01:28 - The island self-improvement is stuck on: Why systems that improve themselves only work where there's a clean, cheap, trustworthy way to score output.
    02:14 - Why a frozen judge fails you: The breeding-program setup explains stationarity, the Red Queen idea, and the three ways a fixed test goes wrong.
    05:29 - Move the judge, break the stopwatch?: How epochs, held-out anchors, and conservative scoring let the judge change without destroying the ability to measure progress.
    08:17 - Wipe the board, re-rank the winners: Selective erasure is shown to be the entire mechanism - without deleting stale scores, a stricter judge reshuffles nothing.
    10:56 - Does any of it make code better?: Co-evolving a code reviewer alongside a coder yields higher success and fewer tokens, with improvements that help both roles at once.
    12:51 - Catching the reviewer that gets fooled: Self-preference bias is measured cold, then trained out with an adversarial trap built across an epoch boundary.
    15:55 - When the judge develops taste: The evaluators evolve their own rubrics from a one-line prompt - and the grader's biggest gain came from getting less strict, not harsher.
    17:13 - Where the skeptic wins: The honest limits: everything rides on imperfect anchors, the writing was never read by humans, the proof results are thin, and the long-run guarantees are absent.
    20:54 - Where the hard problem now lives: The reframe to take away - the test is just another agent that can be improved or biased - and the open fork on whether to let evaluators move at all.

    Recommended Reading:
    - Gödel Agent: A Self-Referential Framework for Agents Recursively Self-Improvement: The self-improving agent lineage this episode's Red Queen Gödel Machine extends, where an agent rewrites its own code to get better. (https://arxiv.org/abs/2410.04444)
    - Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents: The breeding-program-of-code framing the episode describes, where variants are scored, kept, and bred - the static-judge predecessor this paper reacts against. (https://arxiv.org/abs/2505.22954)
    - LLM Evaluators Recognize and Favor Their Own Generations: Documents the self-preference bias that is the episode's central villain - AI judges going easy on AI-written text. (https://arxiv.org/abs/2404.13076)
    - Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena: The foundational LLM-as-a-judge paper behind the evaluator agents the episode's framework co-evolves and the biases it tries to train out. (https://arxiv.org/abs/2306.05685)

    23 min
