Please support this podcast by checking out our sponsors:
- Lindy is your ultimate AI assistant that proactively manages your inbox - https://try.lindy.ai/tad
- Discover the Future of AI Audio with ElevenLabs - https://try.elevenlabs.io/tad
- SurveyMonkey, Using AI to surface insights faster and reduce manual analysis time - https://get.surveymonkey.com/tad

Support The Automated Daily directly:
Buy me a coffee: https://buymeacoffee.com/theautomateddaily

Today's topics:

RL training data quality control - Sean Cai argues many reinforcement-learning datasets sold to frontier labs fail internal QC, wasting data budget and training compute. Key keywords: RL data, intake review, active testing, reward hacking, contamination.

Agents that persist across sessions - New agent workflows emphasize continuity and clear success criteria, with Codex CLI’s /goal persisting objectives across restarts and long pauses. Key keywords: Codex CLI, /goal, runtime continuation, long-horizon agents.

Token costs in CI agents - GitHub details how agentic CI workflows can silently burn tokens, and how proxy-level telemetry plus automated audits can cut spend materially. Key keywords: CI, LLM tokens, observability, MCP, Effective Tokens.

Consumer agents inside social apps - Meta’s rumored “Hatch” agent points to assistants embedded directly in Instagram and Facebook, built for socially grounded discovery and commerce. Key keywords: Meta, Hatch, autonomous agent, social graphs, waitlist.

Interpreting hidden model intentions - Anthropic’s Natural Language Autoencoders translate internal activations into readable text, helping auditors spot hidden planning or evaluation awareness, while warning about cost and hallucinations. Key keywords: interpretability, NLAs, activations, auditing, alignment.

Realtime voice, translation, transcription - OpenAI’s new realtime audio models aim to make voice apps more capable: reasoning during live speech, streaming transcription, and live translation. Key keywords: Realtime API, voice agents, speech-to-text, translation, tool use.

Kernel-level GPU inference speedups - PyTorch engineers show In-Kernel Broadcast Optimization can remove costly tensor replication in recommender inference, boosting throughput and cutting latency on GPUs. Key keywords: PyTorch, IKBO, recommender systems, H100, kernels.

Local long-context inference on Mac - A new open-source engine targets DeepSeek V4 Flash on Apple Metal, pushing fast local inference with disk-persisted KV state for long context sessions. Key keywords: DeepSeek, Metal, local inference, KV cache, long context.

AI and modern vulnerability disclosure - A Linux “quiet fix” embargo broke when others inferred the security impact from public commits, an example of AI accelerating diff analysis and shrinking disclosure windows. Key keywords: Linux security, embargo, AI scanning, coordinated disclosure.

Where AI value really accrues - A critique of the ‘first to AGI wins’ story argues intelligence is commoditizing, and durable value will come from distribution, proprietary workflows, and customer relationships. Key keywords: AGI moat, commoditization, applications, data, workflows.

DeepMind’s algorithm-discovery push - DeepMind says AlphaEvolve is delivering gains across science and infrastructure and is moving toward broader business use, while also investing in EVE Online’s studio as a complex AI testbed. Key keywords: AlphaEvolve, algorithm discovery, TPU, EVE Online, simulation.

Public backlash to AI imagery - Commentary suggests AI-generated images often trigger immediate negative reactions and can harm credibility, highlighting the social cost beyond technical quality. Key keywords: AI images, trust, credibility, perception, content creation.

- Essay Calls for Lab-Grade Quality Control Standards for RL Training Data
- Codex CLI Adds Persisted /goal Sessions That Automatically Resume After Pauses
- CData and Microsoft Outline Blueprint for Enterprise AI Agents Focused on Data Connectivity
- Meta’s ‘Hatch’ Autonomous AI Agent Nears Launch With Waitlist and Deep Instagram/Facebook Integration
- PyTorch Introduces In-Kernel Broadcast Optimization to Speed Up RecSys Inference
- antirez releases ds4.c, a Metal-only local inference engine for DeepSeek V4 Flash
- Essay Challenges the ‘First to AGI Wins’ Narrative as AI Models Commoditize
- OpenAI Adds ‘Trusted Contact’ Alerts in ChatGPT for Serious Self-Harm Risk
- GitHub details how it cut LLM token spend in agentic CI workflows
- Perplexity Brings Its ‘Personal Computer’ AI Agent System to a New Mac App
- Oura to Detail How Member Feedback and AI Support Shape Its Product in Upcoming Webinar
- DeepMind details AlphaEvolve’s growing impact on genomics, grids, TPUs, and commercial optimization
- Temporal and Grid Dynamics to Host Webinar on Production-Grade AI Agent Harness Engineering
- AI Makes Both Quiet Fixes and Long Vulnerability Embargoes Harder to Sustain
- OpenAI Adds Direct Chrome Support for Codex on macOS and Windows
- DeepMind Invests in EVE Online Developer to Use the MMO as an AI Research Sandbox
- Inside China’s AI Labs: Cultural Advantages, Student Talent, and Chip Constraints
- OpenAI launches GPT‑Realtime‑2, Realtime Translate, and Realtime Whisper for live voice apps
- Writer Warns AI Art Signals Low Social Literacy and Can Hurt Your Reputation
- Ramp Labs Trains RL-Powered Qwen Subagent to Speed Up Spreadsheet Retrieval
- Anthropic Unveils Natural Language Autoencoders to Translate AI Activations into Text
- re_gent Launches as ‘Git for AI Agents’ to Audit Prompts, Tool Calls, and Code Changes
- Developer Says Clients Now Demand AI Chatbots Like Past Web Fads

Episode Transcript

RL training data quality control

Let’s start with a reality
check on how frontier labs buy training data. In a May 2026 essay, Sean Cai argues that a lot of off-the-shelf reinforcement learning datasets simply don’t survive internal quality control at top AI labs. The punchline is practical: bad data doesn’t just waste the purchase order; it wastes the most expensive part of the pipeline, the training compute that chews through it.

Cai describes a two-stage QC mindset. First, an “intake” pass to see whether the dataset is even testable and hard to game. Then “active testing,” meaning small training runs designed to flush out failure modes like reward hacking, sycophancy, alignment-faking, and forgetting. The bigger implication is market pressure: vendors increasingly win renewals by shipping audit artifacts (things like false-positive rates, per-skill regressions, and failure triage) rather than vague stories about metrics improving.

Agents that persist across sessions

Staying with the theme of agents that actually hold up in the real world, OpenAI’s Codex tooling is leaning hard into continuity. Codex CLI version 0.128.0 adds a /goal feature that persists the agent’s objective across restarts, laptop sleep, and long pauses. What’s new is that Codex doesn’t just remember context; it proactively resumes by injecting a developer message when you return, instead of waiting for you to re-prompt. The write-up frames this as a workflow shift: you stop “babysitting an AI session” and instead write a spec-like contract upfront with success criteria and guardrails. That matters because as agent runtimes stretch from minutes to hours, the real bottleneck becomes clarity and control, not raw model capability.

Codex is also moving closer to the browser, which is where a lot of real work happens. OpenAI says Codex can now operate inside Google Chrome on macOS and Windows, including working across multiple tabs and running in the background without constantly hijacking your window focus. If this works as advertised, it’s a meaningful step toward in-browser automation that feels less like a demo and more like a daily tool, especially for tasks that live in web apps: admin consoles, dashboards, forms, and multi-step workflows.

Token costs in CI agents

As agents spread into automation pipelines, one unglamorous topic is becoming unavoidable: token spend. GitHub shared how agentic workflows running in CI can rack up large costs quietly, especially when they trigger on every pull request. Their approach is refreshingly operational: capture normalized token telemetry at a proxy layer, emit an artifact that’s easy to analyze, then run daily “meta” jobs to flag anomalies and open issues with concrete fixes.

Two big lessons stood out. First, tool definitions can silently bloat every call, so pruning unused registrations saves money immediately. Second, not every step needs an LLM: deterministic commands can fetch context before the agent ever speaks. The broader point is that “agent reliability” now includes budget reliability, not just correctness.

Consumer agents inside social apps

On the consumer side, Meta appears to be preparing a new autonomous agent, reportedly codenamed “Hatch.” New traces in Meta’s codebase suggest active rollout work and a waitlist-style launch. The rumored direction is a socially grounded agent that can generate media, help with shopping-style workflows, and support research, while leaning on Instagram and Facebook for discovery and commerce.

If Meta ships an agent inside the social feed experience, it raises the competitive stakes in a very different way than yet another standalone chat app. The advantage isn’t just model quality; it’s being embedded where people already spend time, with built-in context from social graphs and creator ecosystems.

Interpreting hidden model intentions

Now to the story we teased at the top: interpretability that tries to translate what’s happening inside a model into plain language.
Anthropic introduced Natural Language Autoencoders, or NLAs—an approach that turns internal activations into readable explanations, then checks itself