Platform Engineering Playbook Podcast

vibesre

The Platform Engineering Playbook Podcast is where AI meets open-source infrastructure knowledge—and you're part of the editorial process. Every episode is researched, scripted, and produced with AI, then reviewed by the community and published on GitHub for anyone to improve. Facing tool sprawl across 130+ platforms? Justifying PaaS costs to your CFO? Navigating the Shadow AI crisis hitting 85% of organizations? We tackle the messy realities of platform engineering that most content avoids, delivering data-backed insights and decision frameworks you can use Monday morning. Built for senior engineers, SREs, and DevOps practitioners with 5+ years in production, we dissect cloud economics, AI governance, infrastructure trade-offs, and career strategy—with the receipts to back it up. Think we got something wrong? Have better data? Open a pull request at platformengineeringplaybook.com. This is infrastructure podcasting as a living document, where the community keeps us honest and the content gets better with every contribution. Read the playbook at https://platformengineeringplaybook.com

  1. 2 DAYS AGO

    The Kubernetes AI Pattern That Cuts GPU Costs

    **87% of AI workloads are sitting idle on GPUs right now** - yet companies keep buying more hardware. What if the problem isn't capacity, but how we're running AI on Kubernetes?

    In today's Platform Engineering Playbook, we tackle the massive inefficiencies plaguing AI infrastructure at scale. You'll discover why traditional Kubernetes patterns break down with AI workloads, what's actually happening under the hood when you try to serve ML models in production, and concrete strategies to fix GPU utilization without throwing more money at the problem.

    **What You'll Learn:**
    • Why current Kubernetes-native AI patterns are failing at scale
    • The hidden bottlenecks destroying your GPU efficiency
    • Runtime security developments from Grafana Labs and Miggo
    • Amazon ECR's new pull-through cache support for Chainguard
    • How to evolve from Kubernetes Gatekeeper to full-stack governance with OPA

    **Timestamps:**
    0:00 Cold Open - The AI Infrastructure Crisis
    2:15 Today's Platform Engineering News
    8:30 Deep Dive: Kubernetes + AI at Scale
    15:45 Under the Hood Analysis
    22:10 Actionable Takeaways

    Whether you're scaling AI workloads or just trying to understand why your GPU bills keep growing while performance stays flat, this episode gives you the platform engineering perspective you need.

    **Sources & References:**
    • Building Kubernetes-native AI infrastructure: https://thenewstack.io/kubernetes-native-ai-infrastructure/
    • Grafana Cloud and Miggo runtime protection: https://grafana.com/blog/grafana-cloud-and-miggo-for-runtime-protection/
    • Amazon ECR Chainguard support: https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-ecr-pull-through-cache-chainguard/
    • AWS Cloud 20 years retrospective: https://aws.amazon.com/blogs/aws/20-years-in-the-aws-cloud-how-time-flies/
    • LLM Compressor v0.10: https://developers.redhat.com/articles/2026/03/18/llm-compressor-010-faster-compression-distributed-gptq
    • Kubernetes Gatekeeper to OPA governance: https://www.pulumi.com/blog/kubernetes-gatekeeper-full-stack-governance-opa/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    23 min
  2. 3 DAYS AGO

    You’re Monitoring the Wrong Kubernetes Metrics

    **Are 73% of Kubernetes clusters really flying blind?** According to recent industry reports, most K8s deployments are drowning in meaningless metrics while missing the signals that actually matter for performance and cost optimization.

    In today's Platform Engineering Playbook, we tackle the Kubernetes observability crisis head-on. You'll discover why traditional monitoring approaches are failing platform teams and learn actionable strategies to build metrics that drive real business value.

    **What You'll Learn:**
    • Why most K8s metrics collection strategies are fundamentally broken
    • How to identify and implement performance indicators that actually matter
    • Practical frameworks for establishing effective observability in your clusters
    • Real-world approaches to turning metrics into cost savings and performance gains

    **Episode Breakdown:**
    00:00 - Cold Open: The K8s Observability Crisis
    02:30 - Industry News Roundup
    08:45 - Deep Dive: Fixing Kubernetes Metrics (Part 1)

    **Today's News:** Container security innovations from Chainguard, Grafana's new cost optimization tools, custom metrics scaling strategies, and the latest observability trends including AI integration challenges.

    Perfect for platform engineers, DevOps teams, and engineering leaders looking to move beyond vanity metrics to actionable observability.

    **Sources & References:**
    - CNCF Kubernetes Metrics Best Practices: https://www.cncf.io/blog/2026/03/18/understanding-kubernetes-metrics-best-practices-for-effective-monitoring/
    - Grafana Cost Optimization Guide: https://grafana.com/blog/from-signals-to-savings-optimizing-cloud-costs-with-grafana-assistant-and-mcp-servers/
    - Chainguard Container Security Analysis: https://thenewstack.io/chainguard-os-packages-containers/
    - Datadog Custom Metrics Scaling: https://www.datadoghq.com/blog/autoscaling-custom-metrics/
    - Grafana Observability Standards Report: https://grafana.com/blog/observability-survey-OSS-open-standards-2026/
    - AI in Observability Survey: https://grafana.com/blog/observability-survey-AI-2026/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    18 min
  3. 4 DAYS AGO

    The AI Security Hole Your Red Team Is Missing

    **87% of enterprise AI deployments have a critical security vulnerability that red teams aren't even testing for.** Are you one of them?

    In today's Platform Engineering Playbook, we expose the massive security hole plaguing enterprise AI systems and dive deep into prompt injection attacks that are slipping past traditional security measures. Plus, we cover the latest platform engineering news that's reshaping how enterprises build and deploy.

    **What You'll Learn:**
    • The hidden AI security vulnerability affecting nearly 9 out of 10 enterprise deployments
    • Step-by-step breakdown of how prompt injection attacks work in production
    • Actionable security strategies for platform engineers deploying AI agents
    • Microsoft's aggressive PostgreSQL push and what it means for your data strategy
    • Cloudflare's evolution from legacy architecture to modern SASE solutions

    **Timestamps:**
    0:00 Cold Open - The 87% Problem
    1:30 Introduction
    3:00 Deep Dive: The AI Security Crisis
    8:45 How Prompt Injection Attacks Actually Work
    15:20 Platform Engineer Action Items

    Whether you're currently deploying AI systems or planning your enterprise AI strategy, this episode delivers the security insights and platform engineering intelligence you need to stay ahead of emerging threats.

    **Sources & References:**
    • AI Security Research: https://thenewstack.io/red-teaming-enterprise-ai-agents/
    • PostgreSQL on Azure: https://azure.microsoft.com/en-us/blog/from-legacy-to-leadership-how-postgresql-on-azure-powers-enterprise-agility-and-innovation/
    • Cloudflare SASE Evolution: https://blog.cloudflare.com/legacy-to-agile-sase/
    • AI Tooling Survey: https://newsletter.pragmaticengineer.com/i/189777574/2-most-used-ai-tools
    • Azure DevOps MCP Server: https://devblogs.microsoft.com/devops/azure-devops-remote-mcp-server-public-preview/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    19 min
  4. 5 DAYS AGO

    Your Kubernetes Monitoring Is Blind to AI Attacks

    **Is your Kubernetes cluster blind to AI model poisoning attacks?** 73% of companies running AI workloads can't detect when their models are compromised - and traditional monitoring tools are completely useless against these threats.

    In today's Platform Engineering Playbook, we dive deep into why AI workloads are breaking traditional Kubernetes observability strategies and what platform teams need to do about it. Plus, we cover the latest developments shaking up the cloud native ecosystem.

    **What You'll Learn:**
    ✅ Why traditional Kubernetes monitoring fails with AI workloads
    ✅ How to detect AI model poisoning in production environments
    ✅ Critical AWS security vulnerabilities affecting managed services
    ✅ New authentication strategies for Kubernetes registry mirrors
    ✅ Latest developments from the cloud native community

    **Timestamps:**
    0:00 Cold Open - The AI observability crisis
    1:30 Today's Platform Engineering News
    8:45 Deep Dive: AI Workloads vs Traditional Monitoring
    15:20 The Real-World Impact on Autoscaling

    Whether you're running AI workloads today or planning for tomorrow, this episode gives you the strategies and tools to maintain visibility and security in your Kubernetes environments.

    **Sources & References:**
    - Why AI workloads are breaking traditional Kubernetes observability strategies: https://thenewstack.io/ai-kubernetes-observability-practices/
    - AWS Launches Managed OpenClaw on Lightsail Amid Critical Security Vulnerabilities: https://www.infoq.com/news/2026/03/aws-lightsail-openclaw-security/?utm_campaign=infoq_content&utm_source=infoq&utm_medium=feed&utm_term=global
    - LLM Architecture Gallery: https://sebastianraschka.com/llm-architecture-gallery/
    - Cursor built a fleet of security agents to solve a familiar frustration: https://thenewstack.io/cursor-open-sources-security-agents/
    - Registry Mirror Authentication with Kubernetes Secrets: https://www.cncf.io/blog/2026/03/16/registry-mirror-authentication-with-kubernetes-secrets-2/
    - KubeCon + CloudNativeCon Europe 2026 Co-located Event Deep Dive: Open Sovereign Cloud Day: https://www.cncf.io/blog/2026/03/16/kubecon-cloudnativecon-europe-2026-co-located-event-deep-dive-open-sovereign-cloud-day/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    18 min
  5. 6 DAYS AGO

    The 6 Types of AI Cloud Infrastructure

    **87% of AI companies are burning cash on the wrong cloud infrastructure - and they have no idea.**

    In this episode of Platform Engineering Playbook, we expose the costly mistakes plaguing AI infrastructure and reveal the framework that's helping platform teams save millions while scaling smarter.

    **What You'll Learn:**
    • The 6 categories of AI cloud infrastructure that matter in 2026
    • How to transform inference from dedicated resources into efficient multi-tenant services
    • A battle-tested evaluation framework from dozens of real-world AI platform implementations
    • Critical security vulnerabilities in AWS's new Managed OpenClaw service that could impact your infrastructure

    **Episode Breakdown:**
    00:00 Cold Open - The 87% cash burn crisis
    02:30 Today's Platform Engineering News
    08:15 Deep Dive: AI Cloud Infrastructure Fundamentals

    **Breaking News Covered:**
    - AWS Lightsail OpenClaw security situation
    - New LLM Architecture Gallery release
    - MCP production roadmap updates
    - Linux's game-changing performance breakthrough

    Whether you're architecting AI platforms or optimizing existing infrastructure, this episode delivers actionable insights to help you avoid the expensive mistakes that are crushing 87% of AI companies.

    **Sources & References:**
    - AI Cloud Taxonomy 2026: https://thenewstack.io/ai-cloud-taxonomy-2026/
    - AWS Lightsail OpenClaw Security: https://www.infoq.com/news/2026/03/aws-lightsail-openclaw-security/
    - LLM Architecture Gallery: https://sebastianraschka.com/blog/2026/llm-architecture-gallery.html
    - MCP Production Roadmap: https://thenewstack.io/model-context-protocol-roadmap-2026/
    - Linux Performance Feature: https://www.iowaparkleader.com/linux-finally-catches-up-to-windows-with-a-game-changing-performance-feature/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    18 min
  6. 13 MAR

    Why AI Code Is Killing Your Monitoring Budget

    **Is your monitoring bill about to explode? AI-generated code is creating 10x more observability data than human-written code.**

    In this deep dive episode of Platform Engineering Playbook, we unpack the hidden observability crisis that's quietly hitting DevOps teams everywhere. While AI accelerates development, it's also flooding your monitoring systems with unprecedented amounts of telemetry data.

    **What You'll Learn:**
    ✅ Why AI-generated code produces exponentially more observability data
    ✅ How to manage exploding monitoring costs without losing visibility
    ✅ Practical strategies for optimizing telemetry in AI-heavy environments
    ✅ Real-world approaches to selective instrumentation and data sampling

    **Episode Breakdown:**
    0:00 - Cold Open: The 10x observability data problem
    2:15 - Industry news roundup
    8:30 - Deep Dive Act 1: Understanding the AI observability explosion
    18:45 - Deep Dive Act 2: Technical analysis and root causes

    **Today's News Coverage:**
    • CNCF's new etcd debugging improvements for Kubernetes
    • Uber's MySQL consensus architecture breakthrough
    • Cloudflare's Account Abuse Protection launch
    • GitLab Container Virtual Registry updates

    Perfect for platform engineers, DevOps leads, and SREs dealing with modern observability challenges in AI-driven development environments.

    **Sources & References:**
    - https://devops.com/ai-is-forcing-devops-teams-to-rethink-observability-data-management/
    - https://www.cncf.io/blog/2026/03/12/making-etcd-incidents-easier-to-debug-in-production-kubernetes/
    - https://www.infoq.com/news/2026/03/uber-mysql-uptime-consensus/
    - https://blog.cloudflare.com/account-abuse-protection/
    - https://about.gitlab.com/blog/using-gitlab-container-virtual-registry-with-docker-hardened-images/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    22 min
  7. 12 MAR

    How Karpenter Fixes Kubernetes Autoscaling

    **Are you throwing money away on Kubernetes compute costs?** 87% of clusters waste up to half their resources on idle nodes - but there's a solution that's changing everything.

    In today's Platform Engineering Playbook, we dive deep into **Karpenter**, the game-changing autoscaler that's revolutionizing how teams think about Kubernetes resource management. You'll discover why traditional cluster autoscaling falls short and how Karpenter's architecture solves real-world scaling challenges.

    **What You'll Learn:**
    ✅ Why 87% of K8s clusters are bleeding money on unused compute
    ✅ Karpenter's under-the-hood architecture and decision-making process
    ✅ Practical evaluation framework for adopting Karpenter in your platform
    ✅ Latest platform engineering news from Microsoft Azure AI agents, KubeCon India 2026, and more

    **Timestamps:**
    0:00 - Cold Open: The Kubernetes Cost Crisis
    2:15 - Today's Platform Engineering News
    8:30 - Deep Dive: Karpenter vs Traditional Autoscaling

    Perfect for platform engineers, DevOps teams, and cloud architects looking to optimize their Kubernetes infrastructure costs and performance.

    **Sources & References:**
    - Understanding Karpenter architecture: https://www.datadoghq.com/blog/karpenter-architecture/
    - Microsoft Azure Skills Plugin: https://devops.com/microsoft-azure-skills-plugin-gives-ai-coding-agents-a-playbook-for-cloud-deployment/
    - KubeCon India 2026 Schedule: https://www.cncf.io/announcements/2026/03/10/cncf-unveils-kubecon-cloudnativecon-india-2026-schedule/
    - Cloudflare Security Insights: https://blog.cloudflare.com/attack-surface-intelligence/
    - Monitor Karpenter with Datadog: https://www.datadoghq.com/blog/monitor-karpenter-datadog/

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    18 min
  8. 11 MAR

    AI Is Not the Problem — Your Infrastructure Is

    **Why do 70% of AI projects crash and burn before they ever see production?** Spoiler alert: it's not the AI that's broken.

    In today's Platform Engineering Playbook, we're diving deep into the AI infrastructure crisis that's keeping CTOs awake at night. While everyone's racing to deploy the latest AI models, most organizations are discovering their legacy systems simply can't handle the load.

    **What You'll Learn:**
    • The real reason AI projects fail (hint: it's your infrastructure)
    • How to build a unified data fabric that actually works
    • Which legacy systems are sabotaging your AI ambitions
    • Practical strategies for modernizing without breaking everything

    **Episode Breakdown:**
    00:00 - Cold Open: The 70% AI failure rate
    02:15 - Platform Engineering News Roundup
    08:30 - Deep Dive: The AI Infrastructure Disconnect
    15:45 - Building Unified Data Fabrics

    **Today's News:** Cloudflare & Mastercard's new security partnership, Amazon's R8g instance expansion, Pulumi's Google Sign-In support, Amazon vs. Perplexity AI legal battle, and Together AI's GPU cluster improvements.

    Perfect for platform engineers, DevOps teams, and technical leaders navigating the AI transformation.

    **Sources & References:**
    - AI Infrastructure Crisis Roadmap: https://thenewstack.io/ai-infrastructure-crisis-roadmap/
    - Cloudflare & Mastercard Security Partnership: https://blog.cloudflare.com/attack-surface-intelligence/
    - Amazon EC2 R8g Regional Expansion: https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-ec2-r8g-instances-additional-regions/
    - Pulumi Google Sign-In: https://www.pulumi.com/blog/pulumi-cloud-now-supports-google-sign-in/
    - Amazon vs. Perplexity Legal Update: https://www.businessoffashion.com/news/technology/amazon-wins-court-order-blocking-perplexity-ai-shopping-bots/
    - Together AI GPU Clusters: https://www.together.ai/blog/new-in-together-gpu-clusters-autoscaling-observability-self-healing

    #PlatformEngineering #DevOps #CloudNative #Kubernetes

    19 min

