Deep Dive

How LLM inference actually works. Why the Strait of Hormuz could move oil prices 40 percent. What happens when AI starts automating AI research. Each episode picks one topic — usually tech, AI, or geopolitics — and goes deep. 30+ primary sources, every claim confidence-tagged, ~18 minutes per topic. For listeners tired of takes without numbers. Also on YouTube: youtube.com/@DeepDiveAIShow

  1. The Walls That Breathe: How the Backrooms Aesthetic Became AI Generation's Killer App

    19 HR AGO

    Kane Parsons spent 160 hours hand-crafting nine minutes of Backrooms found footage in 2022. A solo creator with Veo 3 in 2026 produces nine comparable minutes in an afternoon for under a hundred dollars. On May 29, A24 releases Backrooms, directed by Parsons, age 20 — the youngest director A24 has ever financed. While that was being shot, a parallel economy of AI-generated Backrooms videos surged what appears to be 4,550 percent in four weeks. This episode is about why the alignment isn't lucky. Six structural properties make the Backrooms aesthetic uniquely positioned for AI generation. No faces. No hands. Repetitive modular geometry that sits on the manifold the model was trained on — fluorescent lights, drop ceilings, drywall, carpet. A narrow color palette inside roughly ten colors. Mood-based audio with no narrative dialogue. And the load-bearing one — the aesthetic embraces low fidelity. AnimateDiff temporal-coherence failures, the "walls that breathe" meme, perspective drift. Every other AI video genre is fighting the model's artifacts. The Backrooms turns them into features. Then the tooling stack. Three years after Stable Diffusion 1.5 shipped, the creator community is still on it — not SDXL, not Flux, not Sora. SD 1.5 plus AnimateDiff plus ControlNet won because the LoRA ecosystem matured here first, AnimateDiff was built architecturally for SD 1.5, and SD 1.5 runs on four gigabytes of VRAM. Sora 2 has higher fidelity and more physically consistent video. And OpenAI just announced its discontinuation. Why Sora didn't win this niche is itself a lesson — three reasons, all about what the long-tail creator actually wants. The 4,550 percent surge has four overlapping triggers in the February through May window. The A24 marketing cycle. Sora's vacated tier opening to Veo 3 Lite at five cents per second. Five-times month-over-month growth in AI-video order volume. YouTube's January AI-content enforcement wave that wiped 16 channels with 35 million subscribers — and explicitly spared aesthetic-AI content. Then the bifurcation. Kane Pixels on one side — 3 million subscribers, A24 distribution, Chiwetel Ejiofor in the cast. The AI long tail on the other — thousands of faceless channels, 20 billion aggregate TikTok views on hashtag Backrooms, network operators clearing 40 to 60 thousand dollars a month at 85 to 89 percent margins. Hand-crafted in 2022: 17 hours of labor per finished minute. Local ComfyUI plus AnimateDiff today: 6 cents of electricity per minute. Every major Backrooms wiki has banned AI submissions while AI uploads dominate by volume. The A24 film is the consolidating moment. Plus three internet-IP precedents on what happens when canon-keepers lose control, and five predictions on what happens next.
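    For the curious: a minimal sketch of the SD 1.5 + AnimateDiff stack described above, using Hugging Face's open-source diffusers library. The checkpoint and motion-adapter IDs are real public releases; the prompt, seed, and the commented-out LoRA path are illustrative stand-ins, not any specific creator's pipeline, and most production setups drive this through ComfyUI graphs rather than raw Python.

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# SD 1.5 checkpoint plus the AnimateDiff motion adapter, both public releases.
adapter = MotionAdapter.from_pretrained(
    "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
)
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, clip_sample=False, timestep_spacing="linspace",
    beta_schedule="linear", steps_offset=1,
)
# pipe.load_lora_weights("backrooms_level0_lora.safetensors")  # hypothetical CivitAI LoRA path
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()  # keeps peak VRAM in the ~4 GB range

frames = pipe(
    prompt=("empty office hallway, yellow wallpaper, fluorescent lights, "
            "damp carpet, liminal space, VHS found footage, grainy"),
    negative_prompt="people, faces, hands, text, watermark",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
    generator=torch.Generator("cpu").manual_seed(0),
).frames[0]
export_to_gif(frames, "backrooms_level0.gif")
```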
    CHAPTERS
    00:00 Cold open — walls that breathe
    01:55 Show intro and roadmap
    02:56 The 4chan post that started the Backrooms
    04:08 Six properties that align with AI generation
    06:42 The tooling stack — SD 1.5 + AnimateDiff
    08:52 Why Sora didn't win this niche
    10:37 The 4,550 percent surge — four triggers
    12:43 Two ecosystems that barely overlap
    15:07 The closing canon — wikis ban AI
    16:01 Three internet-IP precedents
    17:14 The A24 film and the consolidating moment
    18:41 Predictions and closing

    SOURCES
    A24 Backrooms press materials + Variety/Deadline coverage
    Kane Parsons / Kane Pixels — YouTube channel, January 2022 onward
    Backrooms Wikidot canon submission rules (Nov 2024 revision)
    Backrooms Wiki on Fandom — AI content policy
    CivitAI — Liminal Space + Backrooms Level 0 LoRA pages
    Stability AI — Stable Diffusion 1.5 release (Oct 2022)
    Guo et al. 2023 — AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models
    OpenAI — Sora discontinuation announcement (web/app April 26, 2026; API Sept 2026)
    Google Trends — AI Backrooms search volume (May 2026)
    YouTube — January 2026 AI-content enforcement wave coverage
    Adavia Davis — AI YouTube network revenue disclosures

    21 min
  2. The Pipeline Is the Package: SLSA Provenance Failed Its First Real Test

    20 HR AGO

    May 11, 2026. Between 19:20 and 19:26 UTC. Six minutes. An attacker published 84 malicious versions across 42 TanStack packages on npm — including React Router with 12.7 million weekly downloads. Every malicious version was signed with valid SLSA build provenance. Every check that npm performs passed. The build pipeline that produced the malicious artifacts was the real TanStack pipeline. The attestation wasn't lying. This episode walks through what happened in those six minutes, what SLSA was supposed to prevent versus what it actually prevents, the six-year arc from SolarWinds Orion to TanStack that built today's trust architecture, and the AI cyber trilogy of the last twelve weeks that makes this the wrong moment for any of it to fail. The attack chain, reconstructed minute by minute: a Pwn Request via a pull_request_target workflow. Cache poisoning across the workflow boundary. OIDC token theft from the GitHub Actions runner. Eighty-four npm publishes in sixty seconds, every one producing valid SLSA provenance — because the publishes really came from the official TanStack workflow on the main branch on a hardened build platform. SLSA L3 verified everything it was designed to verify. It just doesn't verify that the inputs to the build script were the intended inputs. Then the lineage. SolarWinds. Codecov. node-ipc. The xz utils Jia Tan incident, caught by a Microsoft engineer who noticed half a second of SSH latency on a weekend. tj-actions. Shai-Hulud. Each attack moved the trust failure up the stack. And the AI cyber trilogy of the last twelve weeks. Hagendorff in Nature: LRMs jailbreaking LRMs at 97 percent attack success. UK AISI on GPT-5.5 at 71.4 percent on expert cyber tasks. Google Threat Intelligence Group confirming the first criminal AI-built zero-day on the same day TanStack got hit. Frontier cyber offense capability doubling every 3.4 months. Defensive architecture in six layers: SLSA provenance, Sigstore, SBOMs, OIDC trusted publishing across npm/PyPI/RubyGems/crates.io, pre-publish package analysis (the layer that actually caught TanStack), runtime detection. What an engineer can do this week: audit pull_request_target workflows, pin third-party actions to commit SHAs, namespace caches by workflow. Plus regulation (U.S. weaker after EO 14306, EU stronger via the Cyber Resilience Act), the maintainer economics problem nobody is fixing, and five predictions for the next twelve months. The thesis: the frameworks help. The tools help. What actually catches the next one is somebody paying attention to a five-times tarball-size anomaly.

    CHAPTERS
    00:00 Cold open — the 6-minute TanStack window
    00:47 Show intro and roadmap
    01:18 Callback — LiteLLM, the same pattern weeks earlier
    01:58 The attack chain, minute by minute
    05:56 SLSA — what it actually guarantees
    08:12 The 6-year lineage — SolarWinds to TanStack
    11:51 The AI cyber trilogy of the last 12 weeks
    15:18 Defensive architecture, six layers
    17:42 What an engineer can do this week
    18:44 Regulation — U.S. weaker, EU stronger
    20:25 The maintainer economics problem
    22:23 Predictions and closing

    SOURCES
    TanStack incident postmortem (May 11, 2026)
    Snyk + Socket — TanStack 42-package compromise analysis
    SLSA v1.0 specification (slsa.dev)
    Sigstore project documentation
    Executive Orders 14028, 14144, 14306
    EU Cyber Resilience Act (Regulation 2024/2847)
    Sonatype — 2025 State of the Software Supply Chain
    Hagendorff et al. — Nature 2026 (LRM-on-LRM jailbreak)
    UK AISI — GPT-5.5 evaluation (May 7, 2026)
    Google Threat Intelligence Group — first criminal AI-built 0-day (May 11, 2026)
    Andres Freund — xz utils backdoor discovery (oss-security)
    Tidelift — 2024 State of the Open Source Maintainer Report
    Verizon — 2025 DBIR
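    In the spirit of the "what an engineer can do this week" list above, a rough self-audit sketch (a heuristic, not a vetted security tool): walk a repository's workflow files and flag pull_request_target triggers and third-party actions pinned to mutable tags rather than full commit SHAs. The regexes are deliberately simple and will miss YAML edge cases.

```python
import re
import sys
from pathlib import Path

# Heuristic audit for two of the workflow risks in the TanStack chain:
# pull_request_target triggers (Pwn Request exposure) and actions pinned
# to mutable tags instead of full 40-character commit SHAs.
# Only "uses: name@ref" forms are checked; local "./" actions have no ref.
USES = re.compile(r"uses:\s*([^\s@]+)@(\S+)")
FULL_SHA = re.compile(r"[0-9a-f]{40}")

def audit(repo_root: str) -> int:
    findings = 0
    for wf in Path(repo_root, ".github", "workflows").glob("*.y*ml"):
        text = wf.read_text(encoding="utf-8", errors="replace")
        if "pull_request_target" in text:
            print(f"[warn] {wf.name}: pull_request_target trigger -- review what the PR can reach")
            findings += 1
        for action, ref in USES.findall(text):
            if not FULL_SHA.fullmatch(ref):
                print(f"[warn] {wf.name}: {action}@{ref} not pinned to a commit SHA")
                findings += 1
    return findings

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    sys.exit(1 if audit(root) else 0)
```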

    25 min
  3. How RAG Actually Works (and Why Most Production Systems Are Broken)

    1 DAY AGO

    Retrieval-Augmented Generation is in every production LLM application now. Most of them fail in similar, specific ways — and the fixes are mostly not about the LLM. This episode walks through the pipeline layer by layer, from chunking to embeddings to vector indexes to hybrid retrieval to reranking, with empirical numbers from two production RAG systems built for this show — including the one that caught two real factual errors in an already-published episode. The thesis: the 80/20 of RAG quality lives in retrieval, not in the language model at the end. Anthropic's Contextual Retrieval reduced retrieval failure rate by 67 percent without touching the LLM. That's the shape of the problem. What's actually covered. The Lewis et al. 2020 paper that named RAG, and how modern production diverges from it. Why your cosine-similarity thresholds are probably wrong (empirical distribution on text-embedding-3-small: off-topic 0.10 to 0.25, narrative match 0.50 to 0.65, sequel-grade overlap 0.65 to 0.70 — set thresholds from observed distribution, not textbook defaults). HNSW, IVF, Product Quantization — when each wins at scale, and why a billion-vector index needs 6 terabytes of RAM at full precision. Hybrid retrieval with BM25 plus dense embedding, plus reranking — Anthropic's 5.7 to 1.9 percent failure cascade as the cleanest published demonstration. Then the production failure modes. Junk retrieval. Missing context. Hallucination on grounded generation. Stale data. Multi-document reasoning failures. Lost in the middle. And the seventh: wrong-topic evidence retrieval. The "Cheng versus Costello" pattern — the verify-claims-rag system flagged a script claim as wrong, citing evidence about a different study. The retrieval surfaced a related-but-different paper and the judge couldn't tell. Demonstrated live on the script for this episode. RAG versus long context. Claude 4.7 at 1 million tokens. GPT-5.5 at 1 million. Gemini 2 at 2 million. The 2024 question — is RAG obsolete — has a clearer 2026 answer. No. But the line moved. RULER showed the headline 1 million-token context claims drop to roughly 60 percent effective recall on real long-document tasks even when Needle in a Haystack says 99 percent. The 2026 default architecture is compound: long context for cross-document reasoning, RAG for fresh data and citation, light fine-tuning for output format. Plus five predictions on where the field is going through end of 2026. Companion to the show's "How LLM Inference Actually Works" — same shape, different layer.

    CHAPTERS
    00:00 Cold open — two errors caught in a published episode
    01:16 Today's pipeline
    03:11 Chunking
    05:57 Embeddings and the threshold table
    08:57 Vector indexes
    11:35 Hybrid retrieval and reranking
    14:12 What breaks in production
    17:51 Cheng-vs-Costello pattern + EP34 catches
    19:26 RAG vs long context
    21:17 The frontier and predictions
    24:43 Closing — the trust layer

    SOURCES
    Lewis et al. 2020 — RAG (arXiv 2005.11401)
    Karpukhin et al. 2020 — Dense Passage Retrieval (arXiv 2004.04906)
    Malkov & Yashunin 2016 — HNSW (arXiv 1603.09320)
    Khattab & Zaharia 2020 — ColBERT (arXiv 2004.12832)
    Cormack et al. 2009 — Reciprocal Rank Fusion
    Anthropic — Introducing Contextual Retrieval (anthropic.com/news/contextual-retrieval)
    Asai et al. 2023 — Self-RAG (arXiv 2310.11511)
    Yan et al. 2024 — Corrective RAG (arXiv 2401.15884)
    Edge et al. 2024 — GraphRAG (arXiv 2404.16130)
    Liu et al. 2023 — Lost in the Middle (arXiv 2307.03172)
    Hsieh et al. 2024 — RULER (arXiv 2404.06654)
    Databricks — Long Context RAG Capabilities (Oct 2024)
    Cheng et al. 2025 — sycophancy follow-up, N=1,604 (arXiv 2510.01395)
    Costello et al. 2024 — DebunkBot, Science, N=2,190
    Notion — Turbopuffer migration architecture
    Klarna — 2024 AI customer-support announcement + 2025 walkback coverage
    MongoDB — Voyage AI acquisition press
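    A minimal sketch of the fusion step in hybrid retrieval, implementing Reciprocal Rank Fusion from the Cormack et al. 2009 paper cited in the sources above; k = 60 is the paper's default constant. The document IDs are toy placeholders, and a production stack would hand the fused list to a reranker next.

```python
from collections import defaultdict

# Side note on the index-size claim above:
# 1e9 vectors * 1536 dims * 4 bytes (float32) is roughly 6.1 TB of RAM at full precision.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion (Cormack et al. 2009): each document scores
    sum(1 / (k + rank)) across every ranked list that retrieved it."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# Toy doc IDs: BM25 (lexical) and a dense-embedding ranking disagree;
# RRF promotes the documents both retrievers consider relevant.
bm25 = ["doc_a", "doc_b", "doc_c", "doc_d"]
dense = ["doc_c", "doc_a", "doc_e", "doc_b"]
print(reciprocal_rank_fusion([bm25, dense]))  # doc_a and doc_c lead
```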

    25 min
  4. The Mandate That Couldn't Be Met: A Palo Alto CVE and What It Says About Federal Cybersecurity

    2 DAYS AGO

    CVE-2026-0300. Unauthenticated remote code execution as root on Palo Alto firewalls. CVSS 9.3. Disclosed May 6, 2026. CISA added it to the Known Exploited Vulnerabilities catalog the same day and set a federal civilian patch deadline of May 9. The first patch batch ships May 13. The federal mandate predates the patch by four days. This is a structural problem. Binding Operational Directive 22-01 — issued November 2021 — gives federal civilian agencies two paths to KEV compliance: apply the vendor patch, or remove the product from the network. Mitigations are explicitly temporary. When CISA used the standard KEV instrument here, agencies inherited an impossible deadline. The closest historical analog is Ivanti Connect Secure in January 2024 — but there, CISA issued Emergency Directive 24-01, which explicitly accepts mitigation as compliance. Standard KEV doesn't. And the market response is the counterintuitive twist. Palo Alto Networks stock went up after disclosure: +5.63% on May 7, +3.79% on May 8. PANW closed near 450 dollars. Three analysts raised price targets the same week. The 25 to 28 percent drop that month in 2024 was a platformization guidance cut, not the CVE that followed in April. The market has learned to price critical edge-device CVEs as routine. This episode walks through what CVE-2026-0300 actually is (the User-ID Authentication Portal — not enabled by default, only when admins turn it on for BYOD or guest SSO; Shodan finds 67 instances exposed on port 6081 versus 225 to 263 thousand total PAN-OS deployments), what an attacker does with root on a firewall (network pivot, SSL forward-proxy key extraction, persistence that survives reset and upgrade), how Unit 42 frames attribution as CL-STA-1132 — likely state-sponsored, with EarthWorm tool reuse as inference toward Chinese-nexus actors but never confirmation — and the live policy collision: National Cyber Director Sean Cairncross and acting CISA chief Nick Andersen are debating a permanent move to three-day standard KEV deadlines at the exact moment this CVE demonstrates three-day deadlines cannot work when patches need seven or more. Plus the pattern. Edge appliances are now the number one attack vector for state actors. Trend Micro reports the edge-device share of all exploitation incidents went from 3 percent to 22 percent in a single year — roughly a sevenfold increase. PAN-OS in 2024 and again in 2026. Ivanti twice, in 2024 and 2025. Cisco IOS XE in 2023. Fortinet across three years. Citrix NetScaler in 2023. Same architecture. Same outcome. When the mandate is impossible and the market doesn't care, the only thing that gets fixed is the next CVE.
    CHAPTERS
    00:00 Cold open — the impossible sequence
    01:15 Intro
    01:35 The CVE itself
    07:24 What attackers do with root on a firewall
    10:00 Attribution — CL-STA-1132
    14:10 The mandate-then-patch gap
    17:19 The market response — PANW stock went UP
    19:44 The pattern — edge appliances as #1 attack vector
    21:12 What defenders should do this week
    22:27 Three signals to watch
    23:30 Closing thesis

    SOURCES
    Palo Alto Security Advisory CVE-2026-0300 (security.paloaltonetworks.com)
    Unit 42 Threat Brief — CL-STA-1132
    CISA Known Exploited Vulnerabilities catalog
    CISA Binding Operational Directive 22-01 (Nov 2021)
    FBI IC3 advisories on edge-device exploitation
    Reuters + SC Media + CSO Online — Cairncross / Andersen 3-day default reporting
    Shadowserver Foundation — global PAN-OS exposure scans
    Wiz + Help Net Security + BleepingComputer — CVE-2026-0300 coverage
    Trend Micro — 2026 edge-device exploitation share
    VulnCheck — 2024 zero-day catalog (75 zero-days, ~1 in 3 network/security appliances)
    Five Eyes Joint Cybersecurity Advisory — Feb 2025, edge-device default-compromised posture
    Palo Alto Networks SEC filings — FY26 guidance, market share, customer base

    24 min
  5. The Last Independent: Why Cerebras IPOs at $30 Billion This Tuesday

    2 DAYS AGO

    NVIDIA bought Groq for about $20 billion on Christmas Eve 2025 — Jonathan Ross and roughly 90 percent of Groq's engineering team moved into NVIDIA in what was structured as a licensing deal. SambaNova raised a down round in February 2026 at roughly $2.2 billion, off a 2021 peak of $5.1 billion. Cerebras Systems prices its IPO this Tuesday evening at roughly a $30 billion implied valuation, trading Wednesday on Nasdaq under ticker CBRS — the largest AI-infrastructure listing since Arm in September 2023. That makes Cerebras the last independent fast-inference pure-play in AI chips on the public markets. The bet investors are pricing is simple. Reasoning models — OpenAI's o-series, DeepSeek R1, extended-thinking Claude — generate 10 to 100 times more tokens per query than ChatGPT-3.5 did. Memory bandwidth, not compute, becomes the binding constraint. A wafer-scale chip built around 44 GB of on-die SRAM, running Llama 70B at 2,100 tokens per second versus 30 to 100 tokens per second on H100, exploits that shift in a way no GPU cluster can mechanically match. That's why OpenAI signed a $20 billion-plus capacity agreement. The risk is harder. Cerebras's biggest committed customer is also the customer building the chip designed to replace it. OpenAI's Titan accelerator, co-developed with Broadcom on TSMC 3nm, enters mass production in the second half of 2026 — about six months after the IPO. The 180-day insider lockup expires around November 2026, coinciding with the Titan production ramp. This episode is the structural argument: why wafer-scale matters now, how 84 dies become one chip via custom scribe-line lithography, what the $237.8 million GAAP net income actually means (driven by a $363 million non-cash forward-contract gain — operating loss was $145.9 million), the 86 percent UAE customer concentration that migrated rather than disappeared, the circular OpenAI deal that makes the IPO possible, the Graphcore precedent ($2.8 billion peak to $500 million SoftBank sale), and three signals to watch — first-day open versus offering, Q2 earnings, and OpenAI Titan production timing. Cerebras is being IPO'd as AI infrastructure. It may end up trading as a single-customer business. November 2026 is when we find out which.

    CHAPTERS
    00:00 Cold open — last independent
    01:20 Intro
    01:40 Why now — reasoning models break GPU economics
    04:36 The chip — wafer-scale architecture
    06:48 The IPO — what the numbers say
    09:12 Customer concentration — 86 percent UAE
    11:04 The OpenAI deal
    11:55 The circular financing structure
    13:50 The existential bet — OpenAI Titan
    15:17 The cautionary frame — Graphcore precedent
    16:15 Three signals to watch
    17:30 Closing thesis

    SOURCES
    Cerebras S-1 (April 2026, SEC EDGAR)
    Bloomberg + Yahoo Finance + CNBC — IPO mechanics + Groq acquisition
    TechCrunch — "OpenAI's cozy partner Cerebras"
    The Information — OpenAI $20B+ Cerebras MRA
    Tom's Hardware — OpenAI Titan + Broadcom 10GW
    Cerebras Hot Chips 2024 — WSE-3 architecture
    Artificial Analysis — independent inference benchmarks (2,100 tok/s Llama 70B)
    arXiv 2402.16363 — LLM inference roofline analysis
    arXiv 2503.11698 — independent academic WSE-3 vs H100/B200 comparison
    Reuters + CNBC + SiliconANGLE — CFIUS clearance March 2025 + G42 context
    Sacra + TSGInvest — SambaNova Series E down round
    CNBC — NVIDIA acquires Groq for ~$20B (Christmas Eve 2025)
    TechCrunch — Graphcore peak valuation $2.77B Series E (Dec 2020)
    Constellation Research + SemiAnalysis — hyperscaler captive silicon (Trainium3, TPU v7, Maia)
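    The memory-bandwidth argument above, reduced to arithmetic: a first-order roofline sketch, not a benchmark. It ignores KV-cache traffic, batching, and kernel overlap; the bandwidth figure is the published H100 SXM spec, and model sizes are rounded.

```python
def decode_tps_ceiling(params_b: float, bytes_per_param: float, bw_tb_s: float) -> float:
    """First-order roofline for batch-1 autoregressive decode: each new
    token must stream every weight through memory once, so tokens/second
    is capped at bandwidth / model-size-in-bytes. KV cache ignored."""
    return (bw_tb_s * 1e12) / (params_b * 1e9 * bytes_per_param)

# H100 SXM HBM3: ~3.35 TB/s. Llama 70B: 70e9 parameters.
print(f"H100, FP16 weights: ~{decode_tps_ceiling(70, 2, 3.35):.0f} tok/s ceiling")  # ~24
print(f"H100, FP8  weights: ~{decode_tps_ceiling(70, 1, 3.35):.0f} tok/s ceiling")  # ~48
# On-die SRAM bandwidth on a wafer-scale part is orders of magnitude higher,
# which is the mechanical basis for the 2,100 tok/s figure cited in the episode.
```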

    18 min
  6. Platform Engineering at AI-Native Companies: What's Actually Different

    5 DAYS AGO

    Meta's Llama 3.1 training run, 405 billion parameters, used 16,384 H100 GPUs for 54 days. Over those 54 days, the cluster experienced 419 unexpected interruptions — roughly one failure every three hours. And that's the run Meta calls a success. They hit 90 percent effective training time. This is the substrate platform engineers at AI-native companies are operating on. This episode is what's actually different about platform engineering at companies like OpenAI and Anthropic, compared to the traditional shape — Stripe, Netflix, Block, Google. Engineering tone, not hype. The verified primary-source view: OpenAI's two Kubernetes scaling posts at 2,500 and 7,500 nodes (5 API servers, 5 etcd, 70 GB heap per API server, 200,000 IPs in use at peak, MPI gang scheduling via the Coscheduling plugin). OpenAI's Postgres scaled to 800 million ChatGPT users on a single primary plus 50 read replicas. Anthropic's September 2025 postmortem disclosing three serving platforms (first-party, Bedrock, Vertex), three hardware backends (Trainium, NVIDIA, TPU), sticky routing, tens of chips per request. The compute portfolios: Anthropic with roughly 7 gigawatts disclosed across AWS Project Rainier (~500K Trainium2 chips), Google plus Broadcom (up to 1M TPUs), Microsoft-NVIDIA ($30B / 1 GW Grace Blackwell + Vera Rubin), and SpaceX Colossus 1 (220K NVIDIA GPUs / 300 MW). OpenAI's Stargate at $500B / 10 GW. The new problem classes: training cluster reliability (Meta cluster MTTF goes from 47.7 days at 8 GPUs to 14 minutes at 131,072 GPUs — reliability collapses non-linearly). NCCL collectives. Gang scheduling primitives (Kueue versus Volcano, properly distinguished). Inference at p99 (PagedAttention, RadixAttention, continuous batching — three independent optimizations). Prefill versus decode disaggregation. Heterogeneous fleets across H100, H200, B200, GB200, Trainium2, TPU v5p, Ironwood. HBM and U.S. energy as the binding constraints, not GPU FLOPS. What stays the same: the reliability discipline. SLOs, error budgets, on-call, blameless postmortems, observability. Anthropic's September 2025 postmortem reads like a Google SRE Book chapter. What doesn't transfer: substrate-specific tooling. You can't canary a 16,000-GPU job mid-flight. Three platforms inside one company. Training is a batch-scheduler problem. Inference is a request/response problem. Agents are a durable-workflow problem. Above all three, a chip-portability layer. Same craft. Different physics.
    CHAPTERS
    00:00 Cold open — Llama 3.1 reliability data
    00:33 Intro
    00:59 The traditional platform charter
    02:24 What's disclosed at OpenAI + Anthropic
    04:40 Anthropic infrastructure deep dive
    07:10 Team structure (OpenAI by workload, Anthropic by portability)
    07:48 The new problem classes
    08:20 Training cluster reliability + Meta MTTF curve
    09:52 Gang scheduling — Kueue vs Volcano
    10:26 Training frameworks — DeepSpeed, FSDP, Megatron
    11:15 Inference at p99 — PagedAttention, RadixAttention
    11:58 Prefill vs decode disaggregation
    12:38 Heterogeneous fleets
    13:14 Capacity planning + HBM as the binding constraint
    14:28 What stays the same
    15:46 Why "more load-bearing"
    16:59 Closing thesis

    SOURCES
    OpenAI Kubernetes posts (2018, 2021) + Postgres scaling
    Anthropic September 2025 postmortem
    Anthropic Managed Agents + Code Execution with MCP
    AWS Project Rainier, Google-Broadcom, MS-NVIDIA, SpaceX Colossus
    OpenAI Stargate (Jan 2025)
    Llama 3.1 paper + Meta cluster MTTF (arXiv 2410.21680)
    DeepSeek V3 paper
    vLLM PagedAttention (SOSP 2023)
    SGLang RadixAttention
    Latent Space — NVIDIA Dynamo team (prefill/decode disaggregation)
    Google Borg paper
    Netflix Tech Blog (Spinnaker, Atlas, Eureka)
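    The MTTF collapse quoted above follows from a first-order reliability identity: with independent failures, cluster MTTF is roughly the per-component MTTF divided by component count. A sketch with a hypothetical per-GPU MTTF chosen for illustration (not Meta's measured figure), to show the shape of the curve:

```python
def cluster_mttf_hours(per_gpu_mttf_hours: float, n_gpus: int) -> float:
    """Independent exponential failures: cluster MTTF = component MTTF / N.
    Real clusters also see correlated failures (switches, power, storage),
    but 1/N is the shape that turns days into minutes."""
    return per_gpu_mttf_hours / n_gpus

PER_GPU_MTTF_H = 50_000  # hypothetical ~5.7-year per-GPU MTTF, for illustration only
for n in (8, 1_024, 16_384, 131_072):
    h = cluster_mttf_hours(PER_GPU_MTTF_H, n)
    print(f"{n:>7} GPUs -> ~{h/24:.1f} days" if h >= 48 else f"{n:>7} GPUs -> ~{h*60:.0f} minutes")
```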

    18 min
  7. How LLMs Got 3× Faster Without Getting Smarter: Speculative Decoding, Explained

    5 DAYS AGO

    Two language models running side by side are faster than one. A 60-million-parameter model drafting tokens for an 11-billion-parameter model gave Google a 2-to-3× speedup with mathematically guaranteed identical output. The smaller model is wrong about a third of the time. The bigger model only verifies in parallel. And somehow you come out ahead. That's speculative decoding. The original paper landed the same day as ChatGPT — November 30, 2022. Today it runs inside Google Search, vLLM, TensorRT-LLM, and every major LLM serving stack on the planet. This episode is the sequel to "How LLM Inference Actually Works." The mechanism. The four-line proof that says you cannot lose quality, ever. The Leviathan formula — three numbers (acceptance rate, draft length, cost ratio) that determine the speedup. Plug them in and you get the answer. The architecture progression: small-LLM drafts (2022) → MEDUSA (2024, prediction heads on the target) → EAGLE (2024, predict feature vectors) → EAGLE-3 (2025, multi-layer feature fusion, 3.0-6.5×) → Lookahead Decoding (no draft model at all). Block Verification (ICLR 2025) — the original inventor still evolving the algorithm. The honest production reality. Research papers say 5-6×. vLLM at production concurrency reports 1.2 to 2.5×. The Red Hat gpt-oss-120B benchmark hits +9.5 to 20.7 percent throughput improvement, not 3×. Acceptance rate below 0.55 turns the technique net-negative. Math at 0.518 actively hurts; code above 0.8 hits 6×+. Two case studies: Cursor's 13× speedup from using the file you're editing as the draft (not a draft model but a structural prior). Morph Fast Apply at 10,500 tokens per second on a 7B model. The whole AI-code-editor category runs on this trick. MagicDec — counterintuitive long-context exception where speculative decoding helps MORE at larger batch. Five testable predictions. Closing thesis: two LLMs running together are faster than one. The math is as old as ChatGPT itself. And it is the reason your AI is faster every six months.

    CHAPTERS
    00:00 Cold open — Two LLMs faster than one
    01:10 EP2 recap — memory-bound inference
    02:04 The mechanism — draft + verify
    04:20 The four-line proof — why it's lossless
    06:03 The Leviathan formula
    07:26 Architecture progression: small-LLM → MEDUSA → EAGLE → EAGLE-3 → Lookahead
    09:33 Block Verification (ICLR 2025)
    10:07 Production reality — research vs serving
    11:15 SpecDecode-Bench falloff + MagicDec exception
    12:41 The α=0.55 floor + domain spread
    13:12 Cursor 13× (file-as-draft) + Morph 10,500 tps
    14:28 What spec decoding enabled (Realtime Voice, AI-code-editor)
    14:59 The frontier — SSD, DFlash, speculative cascades
    16:30 Five predictions
    17:49 Closing thesis

    SOURCES
    Nov 30 2022 — Leviathan, Kalman, Matias (Google) "Fast Inference from Transformers via Speculative Decoding"
    Feb 2 2023 — Chen et al. (DeepMind) "Accelerating LLM Decoding with Speculative Sampling"
    Jan 2024 — MEDUSA paper (multiple decoding heads)
    Jan 2024 — EAGLE paper (feature-level autoregression)
    Mar 2025 — EAGLE-3 (NeurIPS 2025, multi-layer feature fusion)
    Nov 2023 — Lookahead Decoding (LMSYS / Hao AI Lab)
    ICLR 2025 — Block Verification (Leviathan co-authored)
    Aug 2024 — MagicDec long-context paper
    ICLR 2026 — Speculative Speculative Decoding
    Dec 2025 — Google DFlash (block-diffusion on TPU v5p)
    Apr 2026 — Red Hat gpt-oss-120B production benchmark on H200
    Oct 2024 — vLLM speculative decoding blog (2.8× CNN/DailyMail at QPS=1)
    May 2024 — Cursor "Editing files at 1000 tokens/sec"
    Apr 2026 — Anthropic "Code execution with MCP" (98.7% token reduction)
    Berkeley EECS-2025-224 — Liu, "Efficient LLM System with Speculative Decoding"
    Google Research 2025 — "Looking back at speculative decoding"
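    The Leviathan formula from the 2022 paper in the sources above, as a worked example: expected walltime improvement from acceptance rate alpha, draft length gamma, and draft-to-target cost ratio c. The alpha values below echo the domain spread quoted in the episode; note this is the idealized single-stream bound, which is one reason production-concurrency numbers come in lower.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected walltime improvement from Leviathan et al. (2022):
    (1 - alpha^(gamma+1)) / ((1 - alpha) * (gamma*c + 1)), where
    alpha = acceptance rate, gamma = tokens drafted per step,
    c = draft-model cost relative to the target model."""
    return (1 - alpha ** (gamma + 1)) / ((1 - alpha) * (gamma * c + 1))

# Domain spread quoted in the episode, with an illustrative gamma=5 and a
# cheap draft at c=0.05. At production batch sizes the break-even alpha
# climbs toward the ~0.55 floor the episode describes.
for domain, alpha in [("code", 0.80), ("prose", 0.70), ("math", 0.518)]:
    print(f"{domain:>5}: alpha={alpha:.3f} -> {expected_speedup(alpha, 5, 0.05):.2f}x")
```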

    19 min
  8. The Regulation Anthropic Asked For: How a Withheld Model Triggered Trump's AI Pre-Release Vetting EO

    6 MAY

    April 29, 2026: the Trump White House drafts an executive order to bring Anthropic back for federal use. May 4: the same White House is now considering a separate executive order — mandatory pre-release government vetting of frontier AI models, routed through NSA, ONCD, and DNI. Five days. The catalyst, every outlet agreed, was Mythos — the model Anthropic withheld in April. Anthropic spent three years asking for AI regulation. They got it. From the administration that doesn't trust them. This episode is what that paradox means. The lobbying record is real. Anthropic publicly endorsed SB 1047. Detailed support on the company blog, August 2024. Lobbying disclosures show systematic engagement with the Biden EO, NIST AI RMF, and AISI's voluntary testing framework. The advocacy was for a specific kind of regulation — public-facing, NIST-administered, voluntary, transparent. What's getting drafted is a different thing. NSA evaluations are classified. The Office of the National Cyber Director is not a research lab. Mandatory pre-release vetting routed through national security agencies inverts the accountability surface from public-and-adversarial to classified-and-deferential. On the UK AI Safety Institute's capture-the-flag cyber benchmark, Mythos hit 73 percent — up from Opus 4.6 at 16 percent. RSP v3.0 dropped cyber operations from the formal Responsible Scaling framework five weeks before Mythos's preview. The withholding decision was a real product call against a real capability surface — and the catalyst for an EO Anthropic almost certainly didn't want. Five testable predictions. Closing thesis: the regulation Anthropic asked for got built. The administration that doesn't trust Anthropic is now writing it.

    CHAPTERS
    00:00 Cold open — April 7, April 29, May 4
    00:27 The Trump pre-release vetting EO
    00:59 Recap for returning listeners
    01:35 Anthropic asked for regulation. They got it.
    02:24 Mythos capabilities — UK AISI cyber benchmark
    03:29 Anthropic's lobbying history (SB 1047, AI EO, AISI, NIST)
    04:48 RSP v3.0 — dropping cyber ops five weeks before Mythos
    07:53 Pre-release vetting — what classified evals change
    09:27 National-security-flavored regulation vs civilian framework
    11:37 First-access vs blocking — why the design matters
    14:11 Five predictions
    15:16 Closing thesis

    SOURCES
    May 4 — NYT broke story (Trump considering mandatory pre-release vetting EO)
    May 4-5 — Axios, Bloomberg, Tom's Hardware, USNews, MSN syndication
    April 29 — Draft EO to bring Anthropic back for federal use
    April 27 — Dean Ball, WBUR On Point + Techdirt follow-up
    April 8 — Project Glasswing launch
    April 7 — Anthropic Mythos Preview announcement
    February 24 — Anthropic Responsible Scaling Policy v3.0
    August 2024 — Anthropic public endorsement of SB 1047
    Trump Day 1 — Biden AI EO rescinded
    Past episodes — Mythos (EP14), Mythos Bifurcation (EP24)

    16 min
