CloudChat

Carl and Brandon

Conversations about building software and designing architecture in the cloud natively.

  1. Rolling, Rolling, Rolling…

    3D AGO

    Episode 0032 - Rolling, Rolling, Rolling… Logs are ground truth — high-fidelity, event-level data that anchor observability alongside metrics and traces. Carl and Brandon argue the biggest mistake teams make is treating "more logs" as "better logs." If everything is logged, nothing is useful, and they both share recent troubleshooting sessions where verbose, unstructured output forced them into KQL gymnastics just to find the actual error. Brandon walks through a 503 that turned out to be a database fault hidden one layer down, and Carl recounts a customer whose "unplanned" VM reboots were actually planned Kubernetes node maintenance — a story you can only untangle by correlating infrastructure and platform logs. Along the way they cover the six log sources worth thinking about (application, infrastructure, platform/managed service, security, audit, and access logs), with a detour into a customer whose minute-long latency vanished once infra logs revealed a VPN routing New York users through Texas. The middle of the episode is a clinic on log hygiene. Carl walks through log levels — debug/verbose, info, warn, error, fatal — and the distinction Brandon draws between an exception (a code construct) and an error (a log level): a caught exception is an error, an uncaught one becomes fatal. They make the case for structured logging into stores like Kusto or via OpenTelemetry so keys can be projected, indexed, and fed directly into dashboards, and Brandon's tip on not pre-computing expensive log arguments is a reminder that a disabled verbose call still costs CPU if you build its message eagerly. Centralized logging pipelines beat rolling your own helper class — log4-anything frameworks exist for a reason — and UTC alone won't save you when scaled-out instances drift apart in time. Correlation and trace IDs, especially parent/child IDs from OpenTelemetry, are the thread that stitches a single user's journey back together across microservices. 
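The tip about eagerly built log arguments, and the case for structured output with a correlation ID, can be sketched with Python's standard `logging` module. This is a minimal illustration rather than anything shown in the episode: the logger name, the `trace_id` field, and `expensive_summary` are hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so keys can be projected and indexed."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            # "trace_id" is a hypothetical correlation field passed via `extra=`.
            "trace_id": getattr(record, "trace_id", None),
        })


calls = {"count": 0}


def expensive_summary():
    """Hypothetical stand-in for a costly computation, such as dumping a large object."""
    calls["count"] += 1
    return "large state dump"


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("cloudchat.sketch")  # hypothetical logger name
logger.handlers.clear()
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # production-style level: DEBUG is disabled
logger.propagate = False

# Eager: the argument is computed before debug() can discard the record.
logger.debug("state: %s", expensive_summary())

# Guarded: the expensive call is skipped entirely when DEBUG is off.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("state: %s", expensive_summary())

# Structured: the correlation ID travels as a key, not as free text.
logger.error("database fault behind 503", extra={"trace_id": "abc123"})
```

Only the eager call pays for `expensive_summary`, and the single emitted record is a JSON object that a log store could query by `trace_id`.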
    Carl and Brandon close on cost and discipline. Logging budgets balloon fast, so production should not run at verbose level, retention should be tiered (a month of exceptions is plenty once the fix ships), duplicate destinations like Log Analytics plus Event Hubs plus a storage account should be consolidated into one source of truth, and Application Insights-style sampling can collapse repetitive traffic into representative events. Compliance logs that sit for years belong in cold or frozen storage tiers where the access pattern actually matches the cost. Their do's and don'ts land on a simple posture: log with intent, redact secrets and connection strings, standardize across teams, and, especially if AI agents are writing your code, make sure the logging conventions travel with the work. Point an agent at a recent run and ask where the gaps and noise are; it's a fast way to audit whether your logs are actually doing their job.

    Links
    Observability and logging concepts: OpenTelemetry; OpenTelemetry traces and spans; W3C Trace Context (correlation IDs); Structured logging overview (Microsoft Learn); Log levels in .NET (LogLevel enum)
    Logging frameworks: log4j (Apache); log4net (Apache); Serilog (structured logging for .NET)
    Azure platform logging: Azure Monitor Logs / Log Analytics; Azure diagnostic settings; Azure Application Insights sampling; Kusto Query Language (KQL); Azure Event Hubs; Azure Blob Storage access tiers (hot/cool/cold/archive)
    Security and supply chain: XZ Utils backdoor (CVE-2024-3094); Veritasium: "The Internet Was Weeks Away From Disaster and No One Knew"
    Related CloudChat episodes: Episode 0024 — Operating Excellently; Episode 0025 — The Sound of Security; Episode 0026 — Are Your Cloud Costs Too Damn High?

    Visit us at: twitter.com/CloudChatTech discord.cloudchat.tech cloudchatpodcast@gmail.com linkedin.com/company/cloudchat

    1h 19m
  2. AI All the Things?

    APR 6

    Episode 0031 - AI All the Things? Traditional sprint ceremonies start getting in the way when AI-assisted development outpaces the cadence they were built for. Carl and Brandon unpack why that happens and what to do about it — starting with the basics. Brandon defines context windows, distinguishes original vibe coding from the sloppy way the term is used today, and walks through the software factory model where requirements, source code, and tests live in separate repos. Carl shares how he continuously refines his Copilot instructions file, instructs the agent to detect and document recurring patterns, and leans on intent-based prompting over tactical step-by-step descriptions — a three-sentence prompt describing preset themes and macOS Focus Mode integration wrote his Swift UI code nearly flawlessly. Both hosts dig into context management: plan mode to review before implementing, the "Ralph Wiggum" pattern of starting fresh sessions with just the plan, and Architectural Decision Records that give future sessions a trail to follow. Different models suit different jobs — Claude for architecture, Codex for implementation — and MCP servers let those models reach Git and GitHub without a copy-paste workflow. Brandon argues AI is a tool like the Internet — some roles will shift, but learning and adapting has always been the core tech-industry skill. Carl backs that up with a study showing senior engineers only see productivity gains when they change their process, not when they bolt AI onto the old one. On the junior side, Carl mentors a developer to focus on data structures and algorithms — not for the implementation details, but for knowing when to apply them. An MIT study pegs realistic job displacement at around 11.7 percent, and cases like Box's layoffs look more like post-COVID overcorrection than proof that AI is replacing everyone. 
    Links
    AI-Assisted Development: GitHub Copilot; Anthropic Claude; OpenAI Codex; Model Context Protocol (MCP); T3 Chat — Compare LLM Outputs
    Development Concepts: Strangler Fig Pattern (Martin Fowler); Test-Driven Development (TDD); Behavior-Driven Development (BDD)
    Tools Mentioned: Swift UI (Apple Developer); Draw.io

    1h 21m
  3. Local‑First Lifeboats: Architecting for Post‑EOL Usability

    FEB 2

    Episode 0030 - Local‑First Lifeboats: Architecting for Post‑EOL Usability This episode is about designing for the last day, not just the launch day. Carl kicks off with the Bose SoundTouch situation: a vendor moves toward EOL on a cloud-tethered API, users push back, and the outcome (at least in spirit) becomes a blueprint we wish was more common: keep the hardware useful by enabling local control paths and leaning on protocols that already work without your cloud. From there we broaden the conversation to the bigger problem: products and services that do something totally reasonable in a LAN suddenly need a round trip to the internet just to respond to a button press. Carl and Brandon talk through concrete "this actually happened" examples and what good looks like. Belkin's Wemo sunset email is a solid reference: clear dates, repeated notices, and a reality check that local APIs and ecosystems like HomeKit and Matter can keep devices working even when a vendor endpoint is shut off. We contrast that with the messier side of the industry: thermostats and other home gear that still function locally, but lose their main value when the cloud connection is removed, and cloud-only platforms like Stadia where "no backend" means "hard stop" (with the one bright spot being things like refunds and a final firmware update to unlock a controller for normal Bluetooth use). On the builder side, we get practical about how to retire things without surprising your users. We cover technical signaling (Deprecation and Sunset headers), the need for human-friendly comms beyond "put it in the docs," and the architecture patterns that make "minimum viable offline" real: local-first state, local discovery and control surfaces, and fallbacks that do not require re-pairing or re-auth when identity systems go away. 
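The technical signaling mentioned above can be sketched as a small helper that builds the two response headers. This is an illustrative sketch under stated assumptions: the function name, the dates, and the policy URL are hypothetical, and the header formats follow RFC 9745 (Deprecation) and RFC 8594 (Sunset) as cited in the episode links.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def lifecycle_headers(deprecated_at, sunset_at, policy_url):
    """Build response headers announcing a deprecation and a later shutdown.

    Both datetimes must be timezone-aware UTC values.
    """
    return {
        # RFC 9745: a Structured Fields Date, "@" followed by Unix seconds.
        "Deprecation": f"@{int(deprecated_at.timestamp())}",
        # RFC 8594: an HTTP-date in GMT.
        "Sunset": format_datetime(sunset_at, usegmt=True),
        # Link relations "deprecation" and "sunset" point at human-readable docs.
        "Link": f'<{policy_url}>; rel="deprecation", <{policy_url}>; rel="sunset"',
    }


# Hypothetical dates and URL, chosen only for illustration.
headers = lifecycle_headers(
    deprecated_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
    sunset_at=datetime(2026, 7, 1, tzinfo=timezone.utc),
    policy_url="https://example.com/api/v1/retirement",
)
```

Headers like these only reach clients that look for them, which is why the hosts pair them with human-friendly communication.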
    We also touch on SaaS escrow and continuity as a way to build trust (especially for startups) and close with a simple gut check: if your cloud disappeared tonight, what can your users still do tomorrow morning?

    Links
    News and examples we discussed: Bose is open-sourcing its old smart speakers instead of bricking them | The Verge; Belkin Wemo cloud service end-of-support notice; Google Stadia - Strategy change and shutdown (2021–2023) | Wikipedia; Google Stadia controller Bluetooth mode help article
    API deprecation and shutdown mechanics: Deprecation HTTP response header (RFC 9745); Sunset HTTP response header (RFC 8594)
    Smart-home protocols and "local-first" connectivity: Matter (Connectivity Standards Alliance); Thread protocol overview (Thread Group); Multicast DNS (mDNS) (RFC 6762)
    Tools and patterns: Local-first software (Ink & Switch); Strangler Fig Application pattern (Martin Fowler); Automerge (CRDT) - GitHub; Yjs (CRDT) - GitHub
    Contracts and continuity: SaaS escrow overview (Escrow London); SaaS escrow overview (PRAXIS Escrow); Software escrow overview (EscrowTech)
    Other links of interest: Microsoft Modern Lifecycle Policy; EU Right to Repair overview (European Commission)

    1h 3m
  4. New Year's ☁️ Resolutions

    JAN 5

    Episode 0029 - New Year's ☁️ Resolutions "In 2026, your cloud is not allowed to have the same incidents for the same reasons as last year." Carl and Brandon treat this episode like a retrospective (the kind any good agile team would run), but instead of talking about sprint tickets, they write a New Year's resolution list on behalf of your cloud team. The format is simple: Stop, Start, Keep. Small, opinionated constraints that change day-to-day habits, not vague wishes about "better reliability, security, and cost."

    The Stop list hits the repeat-incident patterns: single-region "global" apps, treating infrastructure-as-code as optional (and living in the portal), mystery ownership with no clear tags or escalation path, one-off production fix scripts that never get documented, dashboards that are always green while users are hurting, and "temporary" exceptions that turn into permanent risk.

    The Start list is the muscle-building: run realistic failover/incident drills, measure change and recovery (DORA-style signals and MTTR, not just uptime), budget reliability and cost together, treat internal platforms like products with golden paths, standardize secrets and identity, and add a regular "delete day" so old environments and artifacts do not drag into the new year.

    The Keep list is what compounds: automate repetitive toil, invest in observability tied to real user flows, keep blameless postmortems with concrete follow-ups, and keep platform/SRE work visible so it does not get squeezed out by features.

    We hope you and your team are able to embrace some of these resolutions in the coming year, and hope that listening to more CloudChat is at the top of your list. Happy New Year everybody!

    Links: DORA: What is DevOps?; Site Reliability Engineering (SRE Book); Azure Well-Architected Framework; AWS Well-Architected Framework; Google Cloud Architecture Framework; Azure Bicep documentation; Terraform documentation; Azure Key Vault overview

    1h 3m
  5. Respect My (DNS) Awe-Thor-Ih-TAY!!

    12/01/2025

    Episode 0028 - Respect My (DNS) Awe-Thor-Ih-TAY!! Your cloud is humming along, then an edge breaks. What lever do you actually still have to steer users? In this episode, Carl and Brandon dig into DNS as a control plane and why "it is always DNS" keeps being true in 2025. DNS was designed for a slower internet with long TTLs and infrequent changes, but we now treat it like a real-time steering wheel for global failover. That mismatch shows up in outages where the backend is fine but nobody can resolve the hostname that front doors, CDNs, and APIs live behind. We unpack how TTL and caching really work (including negative caching and serve-stale), why modern edge products like Azure Front Door and Cloudflare can still turn into global single points of failure, and how DNS-based load balancers actually behave when you flip weights or priorities. From there we move into patterns and mitigations. We walk through hub-and-spoke vs mesh topologies and where public vs private DNS sit in each, plus concrete strategies for what to do when your edge is broken: bypass patterns, equivalent services, and multi-product designs that let you route around a failing front door. We also hit the observability side so "it is DNS" becomes a graph and an alert instead of a guess in a war room. We close with a look at emerging record types like SVCB/HTTPS and how they may help you advertise alternate endpoints and protocol hints without building another fragile tower of CNAMEs. 
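The interaction between TTL expiry and serve-stale can be sketched as a toy cache. This is a simplified model, assuming a resolver callback that raises `OSError` when upstream is unreachable; the class name, hostnames, addresses, and timings are hypothetical, and real resolvers following RFC 8767 handle many more cases (negative caching is not modeled here).

```python
class StaleCache:
    """Toy DNS-style cache: fresh within the TTL, optionally stale after it."""

    def __init__(self, max_stale_seconds):
        self.max_stale = max_stale_seconds
        self._entries = {}  # name -> (address, expires_at)

    def store(self, name, address, ttl, now):
        self._entries[name] = (address, now + ttl)

    def lookup(self, name, now, resolve):
        entry = self._entries.get(name)
        if entry and now < entry[1]:
            return entry[0], "fresh"  # still inside the TTL: no upstream query
        try:
            address, ttl = resolve(name)  # TTL expired: ask upstream again
        except OSError:
            # Upstream failed: fall back to the expired answer if it is recent enough.
            if entry and now < entry[1] + self.max_stale:
                return entry[0], "stale"
            raise
        self.store(name, address, ttl, now)
        return address, "refreshed"


def failing_resolver(name):
    """Hypothetical upstream that is down for the whole demo."""
    raise OSError("upstream unreachable")


cache = StaleCache(max_stale_seconds=300)
cache.store("app.example.com", "203.0.113.10", ttl=60, now=0)

fresh = cache.lookup("app.example.com", now=30, resolve=failing_resolver)
stale = cache.lookup("app.example.com", now=90, resolve=failing_resolver)
```

The same mechanics explain why a failover record change is not seen immediately: clients inside the TTL never ask upstream at all.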
    Links
    DNS Fundamentals: RFC 1034: Domain Names - Concepts and Facilities; RFC 1035: Domain Names - Implementation and Specification; RFC 2308: Negative Caching of DNS Queries; RFC 8767: Serving Stale Data to Improve DNS Resiliency
    DNS Load Balancing and Edge Services: Azure Traffic Manager documentation; Azure DNS alias records; Amazon Route 53 health checks and failover; Cloudflare Load Balancing; Akamai Global Traffic Management
    Azure, AWS, and Cloudflare Outage Reading: Azure Front Door service documentation; AWS DynamoDB and Route 53 service health history; Cloudflare status history
    Architectures and Private DNS: Azure Private DNS zones; Azure DNS Private Resolver; Azure Virtual WAN DNS guidance
    Emerging DNS Records and HTTP/3: Service binding (SVCB) and HTTPS resource records

    1h 5m
  6. Whoops, No VM's!!!

    11/03/2025

    Episode 0027 - Whoops, No VM's!!! You've planned for redundancy, scaling, and failover, but what happens when the cloud itself runs out of space? In this episode, Carl and Brandon untangle capacity (what the provider physically or logically has available in a region or zone) versus quota (the soft limit on what you can consume). Mixing the two leads to painful surprises during scale events and failovers. We talk through how capacity shortfalls show up in real life (zones that are full, SKUs that vary by location, and limited supply for GPU-heavy instances) and the patterns that help: design for multiple zones and regions, add retry and fallback logic with flexible SKUs, balance spot with on-demand, and hold a baseline with reservations or time-bound commitments. We close on the business side: the price of headroom, when commitments make sense, and simple pipeline and monitoring checks so "no capacity" errors fail fast instead of 30 minutes into a deploy.

    Links: AWS Auto Scaling allocation strategies; AWS EC2 Capacity Reservations; AWS insufficient capacity guidance; AWS Savings Plans; AWS Service Quotas; Azure On-demand Capacity Reservations; Azure quotas overview; Azure region pairs; Azure subscription and service limits; Azure VM allocation failures; Azure VM Scale Sets orchestration modes (Flexible); GCP Compute Engine Reservations; GCP quota alerts and monitoring; GCP Regional Managed Instance Groups; GCP resource availability errors; Google Cloud quotas overview
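The retry-and-fallback pattern with flexible SKUs from this episode can be sketched as a loop over preference-ordered options. This is a provider-neutral illustration: `CapacityError`, `allocate_with_fallback`, the SKU names, and the zone names are all hypothetical, and a real deployment would call a cloud SDK where `fake_allocate` stands.

```python
class CapacityError(Exception):
    """Hypothetical stand-in for a provider's 'no capacity' allocation error."""


def allocate_with_fallback(allocate, zones, skus):
    """Try each (sku, zone) pair in preference order; return the first success.

    Returns the allocation result plus the list of pairs that had no capacity.
    """
    failed = []
    for sku in skus:
        for zone in zones:
            try:
                return allocate(sku, zone), failed
            except CapacityError:
                failed.append((sku, zone))
    # Fail fast with a clear message instead of retrying the same pair forever.
    raise CapacityError(f"no capacity after {len(failed)} attempts")


# Hypothetical inventory: only the fallback SKU has room, and only in zone-2.
available = {("sku-fallback", "zone-2")}


def fake_allocate(sku, zone):
    if (sku, zone) not in available:
        raise CapacityError(f"{sku} unavailable in {zone}")
    return f"vm-{sku}-{zone}"


vm, failed = allocate_with_fallback(
    fake_allocate,
    zones=["zone-1", "zone-2"],
    skus=["sku-preferred", "sku-fallback"],
)
```

Note that a quota limit would need different handling: retrying another zone does not help when the subscription itself is capped.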

    51 min
  7. Are Your Cloud Costs Too Damn High???

    10/06/2025

    Episode 0026 - Are Your Cloud Costs Too Damn High??? Cloud cost optimization is about designing systems that perform efficiently without wasting money. In this episode, Carl and Brandon break down how AWS, Azure, and Google Cloud help teams rightsize compute, manage storage tiers, and control networking costs. They talk through savings plans, spot instances, lifecycle management, and data transfer strategies that keep performance high and waste low. The discussion then moves into monitoring, automation, and FinOps culture, where budgets, policies, and shared accountability make optimization stick. They cover dashboards, tagging, auto-shutdown routines, and partner-led programs that unlock funding and deeper discounts. Real-world stories from enterprises and startups highlight one key truth: cost management is not a cleanup exercise; it is an ongoing habit that keeps cloud architectures both efficient and sustainable.

    Links
    AWS: Well-Architected Framework – Cost Optimization pillar; How to Use AWS Well-Architected with Trusted Advisor for Cost Optimization; AWS Savings Plans; Amazon EC2 Spot Instances
    Azure: Microsoft Cost Management + Billing (overview); Quickstart: Start using Cost Analysis; Common cost analysis uses in Cost Management; Control Azure spending and manage bills (learning path)
    GCP: Create, edit, or delete budgets and budget alerts (Cloud Billing); Cloud Billing Budget API overview; Committed Use Discounts (Compute); Understand your bill – pricing & billing (Google Developers)

    59 min
  8. The Sound of Security

    09/08/2025

    Episode 0025 - The Sound of Security Security is more than a feature; it's a pillar of the Well-Architected Framework. In this episode, Carl and Brandon explore how AWS, Azure, and GCP approach security across identity and access, infrastructure defense, data protection, monitoring, governance, and the shared responsibility model. They compare tools and practices like IAM, RBAC, and conditional access; network firewalls, WAFs, and DDoS protection; encryption at rest and in transit; and incident detection and automated remediation. The conversation also dives into security testing, drift detection with IaC, compliance posture, and how policy enforcement differs across the big three. The episode closes with a reminder that cloud security is always shared and never finished.

    Links
    AWS: Well-Architected Framework – Security pillar; Identity and Access Management (IAM); AWS Shield and WAF; Amazon Macie; Amazon GuardDuty; AWS Config
    Azure: Azure Well-Architected Framework – Security; Microsoft Entra ID (Azure AD); Azure Role-Based Access Control (RBAC); Azure Key Vault; Defender for Cloud; Microsoft Sentinel
    Google Cloud: Google Cloud Architecture Framework – Security; IAM overview; Cloud Armor; Cloud KMS; Data Loss Prevention (DLP) API; Security Command Center; Assured Workloads

    1h 7m
