Inference Time Tactics

A podcast exploring the emerging field of inference-time compute—the next frontier in AI performance. Hosted by the Neurometric team, we unpack how models reason, make decisions, and perform at runtime. For developers, researchers, and operators building AI infrastructure.

  1. 3 days ago

    Voice Intelligence at Scale: From Call of Duty to Fraud Detection with Modulate AI

    Every day, billions of voice conversations happen across games, customer service calls, and financial transactions. Almost none of them are understood by machines. In this episode of Inference Time Tactics, Calvin Cooper and Yash Sharma sit down with Carter Huffman, CTO and co-founder of Modulate, to explore the AI systems that can finally understand voice conversations in real time.

    Modulate’s model Velma 2.0 powers voice intelligence across industries. From moderating voice chat in games like Call of Duty to detecting fraud in financial calls and analyzing customer support conversations, their system uses ensembles of specialized models to capture tone, intent, emotion, and conversational dynamics. Instead of relying on giant foundation models, Velma orchestrates over 100 specialized models to deliver higher accuracy at dramatically lower cost. (A toy sketch of this subset-selection idea follows this entry.)

    We talked about:
    - The challenge of processing a trillion hours of annual global voice traffic.
    - Scaling real-time moderation for massive platforms like Call of Duty.
    - Capturing nuance, tone, and sarcasm beyond basic text transcripts.
    - Ensemble architecture utilizing over 100 specialized models.
    - Orchestration layers that trim compute costs by identifying optimal model subsets.
    - Achieving order-of-magnitude cost savings compared to large foundation models.
    - Applying "exploration vs. exploitation" optimization to shifting conversation data.
    - Future development of "context graphs" to map participant intent and causality.

    Resources Mentioned:
    NeuroMetric Audio Leaderboard: https://leaderboard.neurometric.ai/?leaderboard=audio

    Connect with Modulate:
    Website: https://www.modulate.ai/
    LinkedIn: https://www.linkedin.com/in/carter-huffman-a9aba05b
    Velma: https://www.modulate.ai/velma

    Connect with Neurometric:
    Website: https://www.neurometric.ai/
    Substack: https://neurometric.substack.com/
    X: https://x.com/neurometric/
    Bluesky: https://bsky.app/profile/neurometric.bsky.social

    Hosts:
    Calvin Cooper
    https://x.com/cooper_nyc_
    https://www.linkedin.com/in/coopernyc
    Yash Sharma
    https://x.com/yash_j_sharma
    https://www.linkedin.com/in/yashjsharma/

    32 min
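    The orchestration idea above can be made concrete with a toy model. Below is a minimal Python sketch of epsilon-greedy subset selection; the specialist names, the scoring stub, and the reward function are all hypothetical illustrations, not Modulate's actual Velma architecture:

```python
import random

# Hypothetical specialist analyzers; stand-ins, not Modulate's actual models.
SPECIALISTS = ["toxicity", "sarcasm", "emotion", "intent", "fraud_cues"]

def run_subset(subset, clip_id):
    """Placeholder for running each specialist on an audio clip.
    A real system would do inference; here we simulate scores."""
    return {name: random.random() for name in subset}

class SubsetOrchestrator:
    """Epsilon-greedy selection over candidate model subsets: mostly reuse
    the subset with the best observed reward, occasionally explore another."""

    def __init__(self, candidates, epsilon=0.1):
        self.candidates = candidates
        self.epsilon = epsilon
        # Per-subset running totals: [sum of rewards, number of trials].
        self.stats = {tuple(c): [0.0, 0] for c in candidates}

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(self.candidates)  # explore
        def mean_reward(c):
            total, n = self.stats[tuple(c)]
            return total / n if n else float("inf")  # try untested subsets first
        return max(self.candidates, key=mean_reward)  # exploit

    def update(self, subset, reward):
        entry = self.stats[tuple(subset)]
        entry[0] += reward
        entry[1] += 1

# Candidate subsets trade coverage (more specialists) against compute cost.
candidates = [SPECIALISTS[:2], SPECIALISTS[:3], SPECIALISTS]
orchestrator = SubsetOrchestrator(candidates)
for clip_id in range(100):
    subset = orchestrator.choose()
    scores = run_subset(subset, clip_id)
    # Toy reward: best specialist score minus a per-model cost penalty.
    orchestrator.update(subset, max(scores.values()) - 0.05 * len(subset))
```

    The design point: tracking reward per subset lets the orchestrator keep adapting as the conversation mix shifts, which is the "exploration vs. exploitation" framing from the episode.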
  2. 16 Jan. 2026

    From GPU Scarcity to GPU Waste: Solving the Utilization Crisis

    In this episode of Inference Time Tactics, Cooper and Byron sit down with Charlie and Anil from Rapt AI to tackle one of the industry's most expensive problems: GPU underutilization. With half a trillion dollars invested in GPU infrastructure running at just 20-30% utilization, Rapt AI is building AI-powered orchestration that automatically analyzes workloads and matches them to the right compute resources—no guesswork required. (A back-of-the-envelope sketch of the utilization arithmetic follows this entry.)

    We talked about:
    - Why half a trillion dollars in GPU infrastructure runs at only 20-30% utilization—and how a 5% drop costs $200,000 per $2M investment.
    - How Rapt AI's platform continuously analyzes workloads and auto-optimizes GPU allocation, letting customers run 4-14 models per GPU.
    - Real results: moving workloads from H100s to A100s at 40% of the cost, and reducing GPU footprints from 184 to under 50 while improving performance.
    - Why 2026 becomes the year of inference as agentic workloads create unprecedented infrastructure chaos.
    - The shift from supply problems to optimization problems—and why abstraction layers matter across multi-vendor environments.
    - Power as the next crisis: tokens-per-watt emerging as the critical metric alongside tokens-per-dollar.
    - How intelligent orchestration frees up data scientists and ML ops teams from infrastructure tuning to focus on AI innovation.

    Connect with Rapt AI:
    Website: https://www.rapt.ai/
    LinkedIn (Anil Ravindranath): https://www.linkedin.com/in/anilravindranath
    LinkedIn (Charlie Leeming): https://www.linkedin.com/in/charlieleeming/

    Connect with Neurometric:
    Website: https://www.neurometric.ai/
    Substack: https://neurometric.substack.com/
    X: https://x.com/neurometric/
    Bluesky: https://bsky.app/profile/neurometric.bsky.social

    Hosts:
    Calvin Cooper
    https://x.com/cooper_nyc_
    https://www.linkedin.com/in/coopernyc
    Byron Galbraith
    https://x.com/bgalbraith
    https://www.linkedin.com/in/byrongalbraith

    40 min
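    The tokens-per-dollar and tokens-per-watt framing in this episode reduces to simple arithmetic. Here is a back-of-the-envelope sketch; every throughput, price, and power figure is a made-up placeholder, not a number quoted by Rapt AI:

```python
# Back-of-the-envelope GPU economics. Every figure below is an illustrative
# placeholder, not a number quoted in the episode or published by Rapt AI.

def idle_capital(investment_usd, utilization):
    """Capital effectively sitting idle at a given utilization rate."""
    return investment_usd * (1.0 - utilization)

def tokens_per_dollar(tokens_per_sec, cost_per_hour):
    return tokens_per_sec * 3600.0 / cost_per_hour

def tokens_per_watt(tokens_per_sec, watts):
    return tokens_per_sec / watts

# At 25% utilization, three quarters of a fleet's value is idle.
print(f"Idle share of $2M at 25% util: ${idle_capital(2_000_000, 0.25):,.0f}")

# Hypothetical comparison of a pricier, faster GPU vs. a cheaper, slower one.
gpus = {
    "H100 (placeholder specs)": {"tps": 2400.0, "usd_hr": 4.00, "watts": 700.0},
    "A100 (placeholder specs)": {"tps": 1100.0, "usd_hr": 1.60, "watts": 400.0},
}
for name, g in gpus.items():
    print(f"{name}: "
          f"{tokens_per_dollar(g['tps'], g['usd_hr']):,.0f} tokens/$, "
          f"{tokens_per_watt(g['tps'], g['watts']):.2f} tokens/W")
```

    Under these invented numbers the slower GPU wins on tokens-per-dollar while the faster one wins on tokens-per-watt, which is exactly why workload-to-hardware matching is an optimization problem rather than a ranking.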
  3. 22 Dec. 2025

    Lessons from the Leading Edge: What 420 AI Deployments Reveal About Enterprise Success

    In this episode of Inference Time Tactics, Rob, Cooper, and Byron sit down with Shawn Rogers, CEO of BARC US, to unpack fresh data from 421 organizations actively deploying AI in production. Shawn shares what separates the 20% of organizations that are AI leaders from everyone else, why cost surprises are hitting harder than expected, and how the pressure to "just do AI" is causing companies to skip critical foundations—often to their detriment.

    We talked about:
    - Why multi-model strategies and small language models are becoming essential for enterprise AI.
    - The seven foundational areas that help AI leaders deploy twice as many projects as everyone else.
    - Why 51% of deployments face unexpected cost overruns—and which expenses hit hardest.
    - Data quality jumping to the #1 challenge, affecting 44% of production deployments.
    - The IT satisfaction paradox: top resource at the start, lowest satisfaction scores at scale.
    - How responsible AI priorities shifted as human-in-the-loop oversight dropped from 36% to 21%.

    Resources Mentioned:
    Lessons from the Leading Edge: Successful Delivery of AI/GenAI: https://barc.com/research/successful-ai-genai-delivery/

    Connect with BARC:
    Website: https://barc.com/
    LinkedIn (Shawn Rogers): https://www.linkedin.com/in/shawnrogers/

    Connect with Neurometric:
    Website: https://www.neurometric.ai/
    Substack: https://neurometric.substack.com/
    X: https://x.com/neurometric/
    Bluesky: https://bsky.app/profile/neurometric.bsky.social

    Hosts:
    Rob May
    https://x.com/robmay
    https://www.linkedin.com/in/robmay
    Calvin Cooper
    https://x.com/cooper_nyc_
    https://www.linkedin.com/in/coopernyc
    Byron Galbraith
    https://x.com/bgalbraith
    https://www.linkedin.com/in/byrongalbraith

    44 min
  4. 16 Dec. 2025

    The Thinking Algorithm Leaderboard: Why No Single Model Wins

    In this episode of Inference Time Tactics, Cooper and Byron break down NeuroMetric's Thinking Algorithm Leaderboard and what it reveals about building production-ready AI agents. They share why prompt engineering with a single model won't cut it for enterprise use cases, explore the impact of inference-time compute strategies, and discuss what they learned from testing 10 models across real CRM tasks—from surprising token inefficiency to catastrophic failures in SQL generation. (A minimal voting-ensemble sketch follows this entry.)

    We talked about:
    - Why NeuroMetric built the first leaderboard combining models with inference-time compute strategies.
    - How Salesforce's CRMArena-Pro reflects real multi-step business tasks better than pure reasoning benchmarks.
    - The jagged frontier: no single model or technique dominates across all tasks.
    - Why GPT 20B was surprisingly token inefficient—twice as slow as GPT 120B for similar accuracy.
    - How GPT-5 nano's conversational style broke SQL generation tasks completely.
    - Trading accuracy for speed: two-model ensembles versus five, and saving 20+ seconds per task.
    - Throughput constraints as a hidden bottleneck when scaling to production volumes.
    - Future directions: LLM-guided search, task clustering, and compression to specialized small models.

    Resources Mentioned:
    CRMArena-Pro from Salesforce: https://www.salesforce.com/blog/crmarena-pro/
    Thinking Algorithm Leaderboard: https://leaderboard.neurometric.ai/

    Connect with Neurometric:
    Website: https://www.neurometric.ai/
    Substack: https://neurometric.substack.com/
    X: https://x.com/neurometric/
    Bluesky: https://bsky.app/profile/neurometric.bsky.social

    Host:
    Calvin Cooper
    https://x.com/cooper_nyc_
    https://www.linkedin.com/in/coopernyc

    Guest:
    Byron Galbraith
    https://x.com/bgalbraith
    https://www.linkedin.com/in/byrongalbraith

    29 min
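    The two-versus-five-model tradeoff discussed above has a simple shape: with parallel calls, a majority-vote ensemble's wall-clock latency is its slowest member's, so each added model buys robustness at the cost of tokens and tail latency. A minimal sketch, with a hypothetical call_model stub standing in for real inference endpoints (this is not the leaderboard's actual harness):

```python
from collections import Counter
import random

def call_model(model, task):
    """Hypothetical stub for an inference call: returns (answer, latency_s).
    Simulated here; a real system would hit a model endpoint."""
    latency = random.uniform(2.0, 12.0)
    answer = random.choice(["A", "A", "A", "B"])  # biased toward one answer
    return answer, latency

def ensemble_answer(models, task):
    """Majority vote across models. With parallel calls, wall-clock latency
    is the slowest member's, so larger ensembles cost tokens and tail latency."""
    results = [call_model(m, task) for m in models]
    votes = Counter(answer for answer, _ in results)
    best_answer, _ = votes.most_common(1)[0]  # ties resolve arbitrarily
    wall_clock = max(latency for _, latency in results)
    return best_answer, wall_clock

two_models = ["model-a", "model-b"]
five_models = two_models + ["model-c", "model-d", "model-e"]
for ensemble in (two_models, five_models):
    answer, latency = ensemble_answer(ensemble, "categorize this CRM case")
    print(f"{len(ensemble)} models -> answer={answer}, wall-clock={latency:.1f}s")
```

    With five members the vote is more stable but the expected maximum of the latency samples grows, which is the "saving 20+ seconds per task" effect of dropping to two members.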
  5. 5 Nov. 2025

    Benchmarking Generalization: How AI Learns Beyond Training Data

    In this episode of Inference Time Tactics, Rob and Cooper from Neurometric sit down with Yash Sharma, an AI researcher whose work is reshaping how we understand model generalization. Yash recently completed his PhD at the Max Planck Institute for Intelligent Systems and has held research roles at Google Brain, Meta AI, Amazon, Borealis AI, and IBM Research. His studies on compositional generalization, adversarial robustness, and long-tail benchmarks reveal when and why models succeed—or fail—at reasoning beyond their training data. If you’re designing inference-time systems, building agents that need reliability, or just want to understand what “generalization” actually means in practice, this conversation bridges deep theory with actionable insight—clear, technical, and strategically grounded.

    Key Topics:
    - What it really means for AI systems to generalize beyond their training data
    - Why large language models still fail in novel or unpredictable scenarios
    - How inference-time compute can both amplify and reveal generalization limits
    - What these limits mean for building reliable, agentic AI systems
    - How to benchmark generalization in real-world settings (a minimal sketch follows this entry)
    - Yash’s “Let It Wag!” benchmark for testing long-tail and under-represented concepts
    - Why genuine scientific breakthroughs (like curing cancer) require more than scaling test-time compute

    Connect with Yash Sharma:
    Yash Sharma
    Let It Wag! Benchmark
    Paper: Pretraining Frequency Predicts Compositional Generalization of CLIP (NeurIPS 2024 Workshop)

    Connect with Neurometric:
    Website: https://www.neurometric.ai/
    Substack: https://neurometric.substack.com/
    X: https://x.com/neurometric/
    Bluesky: https://bsky.app/profile/neurometric.bsky.social

    Hosts:
    Rob May
    https://x.com/robmay
    https://www.linkedin.com/in/robmay
    Calvin Cooper
    https://x.com/cooper_nyc_
    https://www.linkedin.com/in/coopernyc

    37 min
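    One concrete reading of "benchmarking generalization in real-world settings," in the spirit of the frequency analysis in the paper above: bucket evaluation items by how often their concept appears in pretraining data, then compare accuracy per bucket. A minimal sketch over simulated records (not the actual Let It Wag! code):

```python
from collections import defaultdict
import random

random.seed(0)

# Simulated eval records: (concept pretraining frequency, model was correct).
# In a real study, frequencies are counted from the pretraining corpus.
records = []
for _ in range(5000):
    freq = random.randint(1, 10**6)
    # Simulate accuracy that improves with concept frequency.
    correct = random.random() < 0.35 + 0.09 * len(str(freq))
    records.append((freq, correct))

def bucket(freq):
    """Order-of-magnitude bucket; long-tail concepts land in the low buckets."""
    return len(str(freq)) - 1  # 0 covers [1, 9], 5 covers [100000, 999999]

hits, totals = defaultdict(int), defaultdict(int)
for freq, correct in records:
    b = bucket(freq)
    totals[b] += 1
    hits[b] += correct

for b in sorted(totals):
    print(f"freq ~10^{b}: accuracy {hits[b] / totals[b]:.2f} (n={totals[b]})")
```

    A flat curve across buckets would suggest genuine generalization; accuracy that collapses in the low-frequency buckets is the long-tail failure mode the episode describes.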
  6. 3 Oct. 2025

    Solving the Cold Start Problem in AI Inference

    In this episode of Inference Time Tactics, Rob, Cooper, and Byron sit down with Prashanth Velidandi, co-founder of InferX, to explore how serverless inference is tackling the AI “cold start problem.” They dig into why 90% of the model lifecycle happens at inference—not training—and how cold starts and idle GPUs are crippling efficiency. Prashanth explains InferX’s snapshot technology, what it takes to deliver sub-second cold starts, and why inference infrastructure—not just models—will define the next era of AI. (A toy cold-versus-warm timing sketch follows this entry.)

    We talked about:
    - Why inference represents 90% of the model lifecycle, despite the industry’s focus on training.
    - How cold starts and idle GPUs create massive inefficiencies in AI infrastructure.
    - InferX’s snapshot technology, which enables sub-second model loading and higher GPU utilization.
    - The challenges of explaining and selling deeply technical infrastructure to the market.
    - Why enterprises care about inference efficiency, cost, and reliability more than model size.
    - How serverless inference abstracts away infrastructure complexity for developers.
    - The coming explosion of multi-agent systems and billions of specialized models.
    - Why sustainable innovation in AI will come from inference infrastructure.

    Connect with InferX:
    Prashanth Velidandi
    https://inferx.net
    https://x.com/pmv_inferx
    https://www.linkedin.com/in/prashanth-velidandi-98629b115

    Connect with Neurometric:
    Website: https://www.neurometric.ai/
    Substack: https://neurometric.substack.com/
    X: https://x.com/neurometric/
    Bluesky: https://bsky.app/profile/neurometric.bsky.social

    Hosts:
    Rob May
    https://x.com/robmay
    https://www.linkedin.com/in/robmay
    Calvin Cooper
    https://x.com/cooper_nyc_
    https://www.linkedin.com/in/coopernyc
    Byron Galbraith
    https://x.com/bgalbraith
    https://www.linkedin.com/in/byrongalbraith

    35 min
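    The cold-start gap is easy to feel in miniature: a cold path rebuilds serving state from scratch, while a snapshot path restores pre-serialized state. The toy harness below assumes nothing about InferX's actual snapshot implementation; real cold starts also pay for container boot, weight download, and GPU initialization:

```python
import math
import pickle
import time

def build_model_state(n=2_000_000):
    """Toy stand-in for a cold start: recompute all serving state from
    scratch. Real cold starts are dominated by container boot, weight
    download, and GPU initialization, which this sketch omits."""
    return [math.sqrt(i) for i in range(n)]

# One-time cost: serialize ready-to-serve state, as a snapshot would.
snapshot = pickle.dumps(build_model_state())

start = time.perf_counter()
cold_state = build_model_state()        # cold path: rebuild everything
cold_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
warm_state = pickle.loads(snapshot)     # snapshot path: restore ready state
warm_ms = (time.perf_counter() - start) * 1000

print(f"cold rebuild: {cold_ms:.0f} ms, snapshot restore: {warm_ms:.0f} ms")
```

    The same asymmetry, scaled to multi-gigabyte weights and GPU memory, is why restoring a captured snapshot can turn multi-minute cold starts into sub-second ones.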
  7. 22 Sep. 2025

    Drag, Drop, and Deploy: Rethinking How We Build AI Systems

    In this episode of Inference Time Tactics, Rob, Cooper, Byron, and Dave share product updates for Neurometric’s Inference Time Compute Studio and what they reveal about the shift from single models to full AI systems. They discuss why wiring models together at scale is so challenging, how a drag-and-drop interface can make experimenting with inference strategies easier, and why open source, benchmarking, and community feedback are key to building the next generation of composable AI systems. (A minimal pipeline-composition sketch follows this entry.)

    We talked about:
    - Why AI is shifting from single models to full systems, and what that means for builders.
    - The challenges of wiring multiple models together at scale and running them in production.
    - How Neurometric’s drag-and-drop interface simplifies testing inference strategies without code.
    - Why open-source models are becoming increasingly competitive with commercial solutions.
    - The lack of standardization in AI stacks and why the industry still feels like the “early web” era.
    - How inference-time compute can balance performance, cost, and latency across different tasks.
    - Why benchmarks alone are insufficient and how domain-specific evaluations can fill the gap.
    - The role of community feedback in shaping priorities for benchmarks and new primitives.

    Connect with Neurometric:
    Website: https://www.neurometric.ai/
    Substack: https://neurometric.substack.com/
    X: https://x.com/neurometric/
    Bluesky: https://bsky.app/profile/neurometric.bsky.social

    Hosts:
    Rob May
    https://x.com/robmay
    https://www.linkedin.com/in/robmay
    Calvin Cooper
    https://x.com/cooper_nyc_
    https://www.linkedin.com/in/coopernyc

    Guests:
    Byron Galbraith
    https://x.com/bgalbraith
    https://www.linkedin.com/in/byrongalbraith
    Dave Rauchwerk
    https://x.com/elevenarms
    https://www.linkedin.com/in/dave-rauchwerk-0ba82822

    20 min
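    "Wiring models together" is, at its simplest, composition over stages that each transform the running context, which is what a drag-and-drop canvas edits visually. A minimal sketch with hypothetical node names (not Neurometric's Studio API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    """One stage in an inference pipeline: a name plus a text -> text step.
    A drag-and-drop canvas is, conceptually, an editor for this structure."""
    name: str
    run: Callable[[str], str]

def run_pipeline(nodes, text):
    """Execute stages in order, threading each output into the next input."""
    for node in nodes:
        text = node.run(text)
        print(f"[{node.name}] -> {text!r}")
    return text

# Hypothetical stages; real nodes would call models, retrievers, or tools.
pipeline = [
    Node("rewrite_query", lambda t: t.strip().lower()),
    Node("draft_answer", lambda t: f"draft answer for: {t}"),
    Node("self_check", lambda t: t + " (verified)"),
]
run_pipeline(pipeline, "  Summarize Q3 churn  ")
```

    Reordering, adding, or swapping nodes changes the inference strategy without touching the models themselves, which is the composability argument the episode makes.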
