Meta's Llama 3 training run, 405 billion parameters, used 16,384 H100 GPUs for 54 days. Over those 54 days, the cluster experienced 419 unexpected interruptions — roughly one failure every three hours. And that's the run Meta calls a success: 90 percent effective training time. This is the substrate platform engineers at AI-native companies are operating on.

This episode covers what's actually different about platform engineering at companies like OpenAI and Anthropic, compared to the traditional shape — Stripe, Netflix, Block, Google. Engineering tone, not hype.

The verified primary-source view: OpenAI's two Kubernetes scaling posts, at 2,500 and 7,500 nodes (5 API servers, 5 etcd nodes, 70 GB heap per API server, 200,000 IPs in use at peak, MPI gang scheduling via the Coscheduling plugin). OpenAI's Postgres, scaled for 800 million ChatGPT users on a single primary plus 50 read replicas. Anthropic's September 2025 postmortem, disclosing three serving platforms (first-party, Bedrock, Vertex), three hardware backends (Trainium, NVIDIA, TPU), sticky routing, and tens of chips per request.

The compute portfolios: Anthropic, with roughly 7 gigawatts disclosed across AWS Project Rainier (~500K Trainium2 chips), Google plus Broadcom (up to 1M TPUs), and Microsoft-NVIDIA ($30B / 1 GW of Grace Blackwell and Vera Rubin). xAI's Colossus 1 (220K NVIDIA GPUs / 300 MW). OpenAI's Stargate at $500B / 10 GW.

The new problem classes: training cluster reliability (Meta's measured cluster MTTF falls from 47.7 days at 8 GPUs to 14 minutes at 131,072 GPUs — reliability collapses non-linearly with scale). NCCL collectives. Gang scheduling primitives (Kueue versus Volcano, properly distinguished). Inference at p99 (PagedAttention, RadixAttention, continuous batching — three independent optimizations). Prefill versus decode disaggregation. Heterogeneous fleets across H100, H200, B200, GB200, Trainium2, TPU v5p, and Ironwood. HBM supply and U.S. energy as the binding constraints, not GPU FLOPS.
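The reliability numbers above invite a quick back-of-envelope check. A minimal sketch, assuming independent, memoryless failures (pure 1/N MTTF scaling — a simplification; the cited study measures real clusters, which behave somewhat differently):

```python
# Back-of-envelope reliability math from the numbers quoted above.
# Assumes failures are independent and memoryless, so cluster MTTF
# scales as per-component MTTF divided by component count. This is a
# sketch, not the methodology of the Llama 3.1 paper or arXiv 2410.21680.

HOURS_PER_DAY = 24

# Llama 3 405B run: 419 interruptions over 54 days on 16,384 H100s.
run_days, interruptions = 54, 419
hours_between_failures = run_days * HOURS_PER_DAY / interruptions
print(f"Mean time between interruptions: {hours_between_failures:.1f} h")
# → roughly one failure every ~3 hours, as quoted.

# Independent-failure scaling: if an 8-GPU cluster shows an MTTF of
# 47.7 days, the implied per-GPU MTTF is 8 * 47.7 days. Project that
# to 131,072 GPUs:
per_gpu_mttf_days = 8 * 47.7
mttf_131k_minutes = per_gpu_mttf_days / 131_072 * HOURS_PER_DAY * 60
print(f"Naive 1/N MTTF at 131,072 GPUs: {mttf_131k_minutes:.1f} min")
# Pure 1/N scaling gives ~4 min; the cited measurement reports ~14 min,
# i.e. real clusters do somewhat better than naive scaling predicts —
# but either way, MTTF collapses from weeks to minutes.
```

Either number makes the episode's point: at frontier scale, failure is the steady state, not the exception.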
What stays the same: the reliability discipline. SLOs, error budgets, on-call, blameless postmortems, observability. Anthropic's September 2025 postmortem reads like a Google SRE Book chapter. What doesn't transfer: substrate-specific tooling. You can't canary a 16,000-GPU job mid-flight. Three serving platforms inside one company. Training is a batch-scheduler problem. Inference is a request/response problem. Agents are a durable-workflow problem. Above all three, a chip-portability layer. Same craft. Different physics.

CHAPTERS
00:00 Cold open — Llama 3.1 reliability data
00:33 Intro
00:59 The traditional platform charter
02:24 What's disclosed at OpenAI + Anthropic
04:40 Anthropic infrastructure deep dive
07:10 Team structure (OpenAI by workload, Anthropic by portability)
07:48 The new problem classes
08:20 Training cluster reliability + Meta MTTF curve
09:52 Gang scheduling — Kueue vs Volcano
10:26 Training frameworks — DeepSpeed, FSDP, Megatron
11:15 Inference at p99 — PagedAttention, RadixAttention
11:58 Prefill vs decode disaggregation
12:38 Heterogeneous fleets
13:14 Capacity planning + HBM as the binding constraint
14:28 What stays the same
15:46 Why "more load-bearing"
16:59 Closing thesis

SOURCES
OpenAI Kubernetes scaling posts (2018, 2021) + OpenAI Postgres scaling
Anthropic September 2025 postmortem
Anthropic Managed Agents + Code Execution with MCP
AWS Project Rainier · Google-Broadcom · Microsoft-NVIDIA · xAI Colossus
OpenAI Stargate (Jan 2025)
Llama 3.1 paper + Meta cluster MTTF study (arXiv 2410.21680)
DeepSeek V3 paper · vLLM PagedAttention (SOSP 2023) · SGLang RadixAttention
Latent Space — NVIDIA Dynamo team (prefill/decode disaggregation)
Google Borg paper · Netflix Tech Blog (Spinnaker, Atlas, Eureka)