Platform Engineering Playbook Podcast

vibesre

The Platform Engineering Playbook Podcast is where AI meets open-source infrastructure knowledge—and you're part of the editorial process. Every episode is researched, scripted, and produced with AI, then reviewed by the community and published on GitHub for anyone to improve. Facing tool sprawl across 130+ platforms? Justifying PaaS costs to your CFO? Navigating the Shadow AI crisis hitting 85% of organizations? We tackle the messy realities of platform engineering that most content avoids, delivering data-backed insights and decision frameworks you can use Monday morning. Built for senior engineers, SREs, and DevOps practitioners with 5+ years in production, we dissect cloud economics, AI governance, infrastructure trade-offs, and career strategy—with the receipts to back it up. Think we got something wrong? Have better data? Open a pull request at platformengineeringplaybook.com. This is infrastructure podcasting as a living document, where the community keeps us honest and the content gets better with every contribution. Read the playbook at https://platformengineeringplaybook.com

  1. 4 HR AGO

    AI Agents Are About to Break Kubernetes — Unless We Standardize Now

    What happens when hundreds of AI agents start running in your Kubernetes cluster but can't communicate with each other? In 2026, this is no longer a hypothetical problem; it's the reality platform engineers are facing right now. In this episode of Platform Engineering Playbook, we dive deep into the CNCF's new cloud-native agentic standards and what they mean for your infrastructure. We'll break down why these standards exist, how they solve critical interoperability challenges, and, most importantly, what you need to implement today to stay ahead.

    **What You'll Learn:**
    • How to prepare your platform for the AI agent explosion
    • CNCF's new agentic workflow validation requirements
    • Why IBM, Red Hat, and Google just donated their LLM inference blueprint
    • The latest enterprise networking developments with multi-cloud SD-WAN
    • How the cloud-native community reached 19.9 million developers

    **Episode Timestamps:**
    0:00 Cold Open - The AI Agent Problem
    2:30 Platform Engineering News Roundup
    8:15 Deep Dive: Cloud Native Agentic Standards
    15:45 Analysis: What These Standards Actually Mean

    Whether you're managing existing Kubernetes workloads or planning your AI strategy, this episode gives you the practical insights to build resilient, future-ready platforms.

    **Sources & References:**
    • CNCF Cloud Native Agentic Standards: https://www.cncf.io/blog/2026/03/23/cloud-native-agentic-standards/
    • Colt Multi-Cloud SD-WAN Launch: https://totaltele.com/colt-targets-enterprise-digital-transformation-with-multi-cloud-sd-wan-launch/
    • CNCF Developer Community Report: https://cloudnativenow.com/kubecon-cloudnativecon-europe-2026/cncf-and-slashdata-report-finds-cloud-native-developer-community-has-reached-19-9-million/
    • Kubernetes LLM Inference Blueprint: https://thenewstack.io/llm-d-cncf-kubernetes-inference/
    • CNCF AI Platform Certifications: https://cloudnativenow.com/kubecon-cloudnativecon-europe-2026/cncf-nearly-doubles-certified-kubernetes-ai-platforms-adds-agentic-workflow-validation/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    20 min
  2. 1 DAY AGO

    How to Monitor LLMs in Production Before They Drain Your Budget

    **Are you burning through your LLM budget with zero visibility into why?** You're not alone - 73% of production deployments are facing this exact problem right now. In today's Platform Engineering Playbook, we tackle the monitoring crisis plaguing AI infrastructure and break down five game-changing developments reshaping how we deploy and secure production systems.

    **🎯 What You'll Learn:**
    • How to implement proper LLM observability using Grafana Cloud, OpenLIT, and OpenTelemetry
    • Step-by-step rollout strategies that won't break your production environment
    • Why Teleport's new Beams could revolutionize AI agent security in your infrastructure

    **📺 Chapters:**
    0:00 - Cold Open: The LLM Budget Crisis
    2:15 - Today's Platform Engineering News
    8:30 - Deep Dive: LLM Monitoring in Production
    15:45 - Implementation Walkthrough

    **🔥 This Week's News:**
    • Teleport Beams: Trusted runtimes for AI agents
    • Metal3 joins CNCF incubation at KubeCon Europe 2026
    • Cloudflare's Gen 13 servers deliver 2x edge compute performance
    • New AI-compatible certification frameworks
    • Pi-Hole deployment strategies for network-wide ad blocking

    Perfect for platform engineers, DevOps teams, and infrastructure leaders dealing with AI workloads in production.

    **Sources & References:**
    • https://grafana.com/blog/ai-observability-llms-in-production/
    • https://cloudnativenow.com/kubecon-cloudnativecon-europe-2026/teleport-launches-beams-to-provide-trusted-runtimes-for-ai-agents-in-production-infrastructure/
    • https://www.cncf.io/blog/2026/03/23/metal3-at-kubecon-cloudnativecon-europe-2026-meet-the-cncfs-freshly-incubated-bare-metal-project/
    • https://blog.cloudflare.com/gen13-launch/
    • https://letsdatascience.com/news/made-in-usa-introduces-ai-compatible-certification-framework-b0b018ac
    • https://thenewstack.io/pihole-docker-network-adblocking/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    20 min
  3. 2 DAYS AGO

    Helm Security Is Broken. WebAssembly Fixes It.

    **What if 94% of Helm chart vulnerabilities could be prevented with one unexpected technology?** Today's Platform Engineering Playbook dives deep into the surprising intersection of WebAssembly and Kubernetes security, plus breaking news that every platform engineer needs to know.

    **What You'll Learn:**
    • How WebAssembly is revolutionizing Helm chart security (spoiler: it's not replacing Kubernetes)
    • Why Trivy is under attack again and what it means for your CI/CD pipelines
    • Critical findings from a security audit of 22,511 AI coding skills
    • Whether AI will evolve code or make it extinct
    • GrapheneOS's bold stance against age verification laws

    **Episode Breakdown:**
    0:00 - Cold Open: The 94% Helm vulnerability stat that will shock you
    2:30 - Today's platform engineering headlines
    5:15 - Deep Dive: WebAssembly + Kubernetes security analysis
    15:45 - Practical implementation strategies for platform teams

    Perfect for platform engineers, DevOps professionals, and anyone building resilient cloud-native infrastructure. Get the technical depth you need with practical insights you can implement immediately.

    **Sources & References:**
    • WebAssembly & Helm Security: https://thenewstack.io/helm-webassembly-kubernetes-security/
    • Trivy Attack Analysis: https://socket.dev/blog/trivy-under-attack-again-github-actions-compromise
    • AI Coding Skills Audit: https://thenewstack.io/ai-agent-skills-security/
    • AI Programming Future: https://thenewstack.io/ai-programming-languages-future/
    • GrapheneOS News: https://www.tomshardware.com/software/operating-systems/grapheneos-refuses-to-comply-with-age-verification-laws

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    15 min
  4. 5 DAYS AGO

    The Kubernetes AI Pattern That Cuts GPU Costs

    **87% of AI workloads are sitting idle on GPUs right now** - yet companies keep buying more hardware. What if the problem isn't capacity, but how we're running AI on Kubernetes? In today's Platform Engineering Playbook, we tackle the massive inefficiencies plaguing AI infrastructure at scale. You'll discover why traditional Kubernetes patterns break down with AI workloads, what's actually happening under the hood when you try to serve ML models in production, and concrete strategies to fix GPU utilization without throwing more money at the problem.

    **What You'll Learn:**
    • Why current Kubernetes-native AI patterns are failing at scale
    • The hidden bottlenecks destroying your GPU efficiency
    • Runtime security developments from Grafana Labs and Miggo
    • Amazon ECR's new pull-through cache support for Chainguard
    • How to evolve from Kubernetes Gatekeeper to full-stack governance with OPA

    **Timestamps:**
    0:00 Cold Open - The AI Infrastructure Crisis
    2:15 Today's Platform Engineering News
    8:30 Deep Dive: Kubernetes + AI at Scale
    15:45 Under the Hood Analysis
    22:10 Actionable Takeaways

    Whether you're scaling AI workloads or just trying to understand why your GPU bills keep growing while performance stays flat, this episode gives you the platform engineering perspective you need.

    **Sources & References:**
    • Building Kubernetes-native AI infrastructure: https://thenewstack.io/kubernetes-native-ai-infrastructure/
    • Grafana Cloud and Miggo runtime protection: https://grafana.com/blog/grafana-cloud-and-miggo-for-runtime-protection/
    • Amazon ECR Chainguard support: https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-ecr-pull-through-cache-chainguard/
    • AWS Cloud 20 years retrospective: https://aws.amazon.com/blogs/aws/20-years-in-the-aws-cloud-how-time-flies/
    • LLM Compressor v0.10: https://developers.redhat.com/articles/2026/03/18/llm-compressor-010-faster-compression-distributed-gptq
    • Kubernetes Gatekeeper to OPA governance: https://www.pulumi.com/blog/kubernetes-gatekeeper-full-stack-governance-opa/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    23 min
  5. 6 DAYS AGO

    You’re Monitoring the Wrong Kubernetes Metrics

    **Are 73% of Kubernetes clusters really flying blind?** According to recent industry reports, most K8s deployments are drowning in meaningless metrics while missing the signals that actually matter for performance and cost optimization. In today's Platform Engineering Playbook, we tackle the Kubernetes observability crisis head-on. You'll discover why traditional monitoring approaches are failing platform teams and learn actionable strategies to build metrics that drive real business value.

    **What You'll Learn:**
    • Why most K8s metrics collection strategies are fundamentally broken
    • How to identify and implement performance indicators that actually matter
    • Practical frameworks for establishing effective observability in your clusters
    • Real-world approaches to turning metrics into cost savings and performance gains

    **Episode Breakdown:**
    00:00 - Cold Open: The K8s Observability Crisis
    02:30 - Industry News Roundup
    08:45 - Deep Dive: Fixing Kubernetes Metrics (Part 1)

    **Today's News:** Container security innovations from Chainguard, Grafana's new cost optimization tools, custom metrics scaling strategies, and the latest observability trends including AI integration challenges.

    Perfect for platform engineers, DevOps teams, and engineering leaders looking to move beyond vanity metrics to actionable observability.

    **Sources & References:**
    - CNCF Kubernetes Metrics Best Practices: https://www.cncf.io/blog/2026/03/18/understanding-kubernetes-metrics-best-practices-for-effective-monitoring/
    - Grafana Cost Optimization Guide: https://grafana.com/blog/from-signals-to-savings-optimizing-cloud-costs-with-grafana-assistant-and-mcp-servers/
    - Chainguard Container Security Analysis: https://thenewstack.io/chainguard-os-packages-containers/
    - Datadog Custom Metrics Scaling: https://www.datadoghq.com/blog/autoscaling-custom-metrics/
    - Grafana Observability Standards Report: https://grafana.com/blog/observability-survey-OSS-open-standards-2026/
    - AI in Observability Survey: https://grafana.com/blog/observability-survey-AI-2026/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    18 min
  6. 18 MAR

    The AI Security Hole Your Red Team Is Missing

    **87% of enterprise AI deployments have a critical security vulnerability that red teams aren't even testing for.** Are you one of them? In today's Platform Engineering Playbook, we expose the massive security hole plaguing enterprise AI systems and dive deep into prompt injection attacks that are slipping past traditional security measures. Plus, we cover the latest platform engineering news that's reshaping how enterprises build and deploy.

    **What You'll Learn:**
    • The hidden AI security vulnerability affecting 9 out of 10 enterprise deployments
    • Step-by-step breakdown of how prompt injection attacks work in production
    • Actionable security strategies for platform engineers deploying AI agents
    • Microsoft's aggressive PostgreSQL push and what it means for your data strategy
    • Cloudflare's evolution from legacy architecture to modern SASE solutions

    **Timestamps:**
    0:00 Cold Open - The 87% Problem
    1:30 Introduction
    3:00 Deep Dive: The AI Security Crisis
    8:45 How Prompt Injection Attacks Actually Work
    15:20 Platform Engineer Action Items

    Whether you're currently deploying AI systems or planning your enterprise AI strategy, this episode delivers the security insights and platform engineering intelligence you need to stay ahead of emerging threats.

    **Sources & References:**
    • AI Security Research: https://thenewstack.io/red-teaming-enterprise-ai-agents/
    • PostgreSQL on Azure: https://azure.microsoft.com/en-us/blog/from-legacy-to-leadership-how-postgresql-on-azure-powers-enterprise-agility-and-innovation/
    • Cloudflare SASE Evolution: https://blog.cloudflare.com/legacy-to-agile-sase/
    • AI Tooling Survey: https://newsletter.pragmaticengineer.com/i/189777574/2-most-used-ai-tools
    • Azure DevOps MCP Server: https://devblogs.microsoft.com/devops/azure-devops-remote-mcp-server-public-preview/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    19 min
  7. 17 MAR

    Your Kubernetes Monitoring Is Blind to AI Attacks

    **Is your Kubernetes cluster blind to AI model poisoning attacks?** 73% of companies running AI workloads can't detect when their models are compromised - and traditional monitoring tools are completely useless against these threats. In today's Platform Engineering Playbook, we dive deep into why AI workloads are breaking traditional Kubernetes observability strategies and what platform teams need to do about it. Plus, we cover the latest developments shaking up the cloud native ecosystem.

    **What You'll Learn:**
    ✅ Why traditional Kubernetes monitoring fails with AI workloads
    ✅ How to detect AI model poisoning in production environments
    ✅ Critical AWS security vulnerabilities affecting managed services
    ✅ New authentication strategies for Kubernetes registry mirrors
    ✅ Latest developments from the cloud native community

    **Timestamps:**
    0:00 Cold Open - The AI observability crisis
    1:30 Today's Platform Engineering News
    8:45 Deep Dive: AI Workloads vs Traditional Monitoring
    15:20 The Real-World Impact on Autoscaling

    Whether you're running AI workloads today or planning for tomorrow, this episode gives you the strategies and tools to maintain visibility and security in your Kubernetes environments.

    **Sources & References:**
    - Why AI workloads are breaking traditional Kubernetes observability strategies: https://thenewstack.io/ai-kubernetes-observability-practices/
    - AWS Launches Managed OpenClaw on Lightsail Amid Critical Security Vulnerabilities: https://www.infoq.com/news/2026/03/aws-lightsail-openclaw-security/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=global
    - LLM Architecture Gallery: https://sebastianraschka.com/llm-architecture-gallery/
    - Cursor built a fleet of security agents to solve a familiar frustration: https://thenewstack.io/cursor-open-sources-security-agents/
    - Registry Mirror Authentication with Kubernetes Secrets: https://www.cncf.io/blog/2026/03/16/registry-mirror-authentication-with-kubernetes-secrets-2/
    - KubeCon + CloudNativeCon Europe 2026 Co-located Event Deep Dive: Open Sovereign Cloud Day: https://www.cncf.io/blog/2026/03/16/kubecon-cloudnativecon-europe-2026-co-located-event-deep-dive-open-sovereign-cloud-day/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    18 min
  8. 16 MAR

    The 6 Types of AI Cloud Infrastructure

    **87% of AI companies are burning cash on the wrong cloud infrastructure - and they have no idea.** In this episode of Platform Engineering Playbook, we expose the costly mistakes plaguing AI infrastructure and reveal the framework that's helping platform teams save millions while scaling smarter.

    **What You'll Learn:**
    • The 6 categories of AI cloud infrastructure that matter in 2026
    • How to transform inference from dedicated resources into efficient multi-tenant services
    • A battle-tested evaluation framework from dozens of real-world AI platform implementations
    • Critical security vulnerabilities in AWS's new Managed OpenClaw service that could impact your infrastructure

    **Episode Breakdown:**
    00:00 Cold Open - The 87% cash burn crisis
    02:30 Today's Platform Engineering News
    08:15 Deep Dive: AI Cloud Infrastructure Fundamentals

    **Breaking News Covered:**
    - AWS Lightsail OpenClaw security situation
    - New LLM Architecture Gallery release
    - MCP production roadmap updates
    - Linux's game-changing performance breakthrough

    Whether you're architecting AI platforms or optimizing existing infrastructure, this episode delivers actionable insights to help you avoid the expensive mistakes that are crushing 87% of AI companies.

    **Sources & References:**
    - AI Cloud Taxonomy 2026: https://thenewstack.io/ai-cloud-taxonomy-2026/
    - AWS Lightsail OpenClaw Security: https://www.infoq.com/news/2026/03/aws-lightsail-openclaw-security/
    - LLM Architecture Gallery: https://sebastianraschka.com/blog/2026/llm-architecture-gallery.html
    - MCP Production Roadmap: https://thenewstack.io/model-context-protocol-roadmap-2026/
    - Linux Performance Feature: https://www.iowaparkleader.com/linux-finally-catches-up-to-windows-with-a-game-changing-performance-feature/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    18 min
