Two months ago I wrote about ripping Notion out of my workflow and replacing it with OpenClaw—a self-hosted AI agent framework running on my Mac Studio. No cloud. No subscription. No black box.

Last weekend I shut it down. Disabled 38 cron jobs. Moved 23 LaunchAgents into a _retired-openclaw/ quarantine folder. Killed the Ollama daemon. Archived the directory with a 30-day deletion timer.

Everything in that original article still reads as true. Local-first is still right. Data ownership is still right. The critique of SaaS "well-enough" software is still right. What I got wrong was believing OpenClaw was the right vehicle for any of it.

This is the post-mortem and the replacement: an agent OS I built on top of the Claude Agent SDK called ClaudeClaw Mission Control. Thirteen themed agents. One daemon. A scheduler I can actually see into. Zero silent failures slipping past me for a week before I notice.

Let me explain how I got here.

The Setup

OpenClaw was doing real work. 38 cron jobs. Morning briefings. Evening summaries. A content pipeline that pulled research from web sources, structured it, scored it, and queued articles for ASTGL. An email triage pass. A model-usage monitor. A nerve-health monitor watching the other monitors.

On paper: impressive. In practice: I had no idea if any of it was working. The system was so noisy that when something broke, I learned about it four days later when I noticed my morning briefing hadn't arrived. Or I didn't learn about it at all, because the cron job was exiting 0 while the script inside it was crash-looping.

That last one is the killer. Let me show you what I mean.

What's Actually Going On

Three failure modes hit me in a 48-hour window, and each one was invisible to the system watching the system.

Failure one: successful exits, 100% broken payload. My content pipeline was ingesting URLs, and a regression introduced a trailing-slash bug that made example.com/foo and example.com/foo/ look like different URLs to the dedup layer. Every new article hit a UNIQUE constraint violation inside a subprocess. The outer wrapper caught the error, logged it to a file nobody was reading, and exited 0. For two weeks the cron appeared green while 100% of structuring runs were crashing.

Failure two: PATH-resolved Node. I had the daemon running Node 24 (absolute path, explicit). A subagent it spawned inherited a PATH that fell through to Homebrew's Node 25. One of the native modules (better-sqlite3) was compiled against 24, so every subagent invocation crashed with ERR_DLOPEN_FAILED and a MODULE_VERSION mismatch. The smoke test I'd written passed because it ran from the daemon's shell. The actual production path failed every time.

Failure three: auth expiry with no escape hatch. OpenClaw stored some credentials in pass (the Unix password store). When my GPG key timed out, the daemon couldn't start. Which meant the health monitor couldn't start. Which meant the thing that would have told me about the outage was the thing that was out. OpenClaw had no watcher that lived outside the daemon it was watching.

None of these are OpenClaw-specific bugs in the upstream sense. They're pattern problems that emerge anywhere you have:

1. A monolithic daemon responsible for its own monitoring.
2. Flat-file state (HEARTBEAT.md, LEARNINGS.md) that gets appended to rather than queried.
3. Exit codes treated as truth when the real signal is in stderr.
4. No separation between "Did it run?" and "Did it work?"
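Patterns 3 and 4 are the easiest ones to make concrete. Here's a minimal sketch of the shift, using hypothetical names (structure-article.js and the crash signatures are illustrative, not my actual pipeline code): the wrapper's exit code has to be derived from what the subprocess produced, not from whether it merely returned.

```js
// Hypothetical wrapper around one pipeline step -- names are illustrative, not the real code.
const { spawnSync } = require('node:child_process');

const url = process.argv[2] ?? 'https://example.com/foo/';
const run = spawnSync('node', ['structure-article.js', url], { encoding: 'utf8' });

// "Did it run?" -- the only thing the old wrapper checked before exiting 0.
const ran = run.status === 0;

// "Did it work?" -- the real signal was in stderr and the payload all along.
const crashShapes = /UNIQUE constraint failed|ERR_DLOPEN_FAILED|MODULE_VERSION/;
const worked =
  ran && !crashShapes.test(run.stderr ?? '') && (run.stdout ?? '').trim().length > 0;

// Exit nonzero when the work didn't actually happen, so cron (or a watcher) sees red.
process.exit(worked ? 0 : 1);
```

The hidden-failures probe described later in this post applies the same idea after the fact: re-derive health from what actually got recorded instead of trusting a status flag.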
OpenClaw was built for a different job. It was a personal automation gateway—great at "kick off this script at 6:30 AM." It wasn't built to be an agent OS with observability. I was using a shovel to drive screws.

I also couldn't ignore the security posture. February's disclosures—135,000 exposed instances, 15,000 vulnerable to RCE, the ClawHavoc plugin-registry incident, nine CVEs—had pushed me to patch hard and lock down. But every week I spent hardening OpenClaw was a week I wasn't building what I actually wanted: themed agents that owned workstreams, could be reasoned about individually, and failed loudly.

The Fix

ClaudeClaw Mission Control is a Node.js daemon built on the Claude Agent SDK. It runs as a single LaunchAgent (com.claudeclaw.app), owns a SQLite store at store/claudeclaw.db, polls a scheduled_tasks table every 60 seconds, and dispatches due tasks to agents by ID.

The interesting part isn't the daemon. It's the agents. I set up thirteen of them, themed after the small council of a certain fictional kingdom, because if I'm going to stare at this UI every day, I'd rather it amused me.

Thirteen themed agents, each owning a workstream. STEWARD drives my mornings and evenings. MAESTER runs the ASTGL content pipeline. WATCHMAN watches the whole system from outside it.

Each agent lives in its own directory at agents/<name>/, with an agent.yaml (model, personality, cwd, MCP servers) and a CLAUDE.md system prompt. A scheduled task carries an agentId column in the DB, and the dispatcher routes like this:

```js
if (shouldRouteViaAgent(task.agentId, listAgentIds())) {
  const result = await delegateToAgent(task.agentId, task.prompt, {
    fromAgent: SCHEDULER_FROM_AGENT,
    chatId: task.chatId,
  });
  return result.text ?? '(empty response)';
}
```

Adding a new agent is now: drop a folder under agents/, write a CLAUDE.md, run schedule reassign <agent>. No source changes. The dispatcher picks it up on next tick.

That's the piece I kept trying and failing to get with OpenClaw—modular ownership. In OpenClaw, everything was "the daemon." In ClaudeClaw, MAESTER owning the content pipeline means if content alerts stop firing, the log line says maester: task failed instead of openclaw-gateway: subprocess exited nonzero. Attribution is free.

The Watchman probes

WATCHMAN runs every hour at :05. It has seven probes, each targeting a failure mode that burned me on OpenClaw; a sketch of the first and sixth follows the list.

1. Failed tasks. status='failed' in the DB. Trivial.
2. Stuck tasks. status='running' with a last_run stale enough that the task should have finished long ago.
3. Missed slots. status='active' with a next_run already in the past: the scheduler should have fired it and didn't.
4. Daemon liveness. launchctl print gui/$UID/com.claudeclaw.app—does launchd still have it?
5. Content-pipeline health. Tails the structured log file, parses the JSON, checks for crash shapes.
6. Hidden failures. Scans the last_result text column for ERR_DLOPEN_FAILED, MODULE_VERSION, Traceback, and other "the job exited zero but it sure didn't work" signals. This is the probe that would have caught my trailing-slash bug in an hour instead of two weeks.
7. Delegation crashes. inter_agent_tasks WHERE status='failed' — on-demand agent invocations that blew up.
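Here's a minimal sketch of probes 1 and 6 against the scheduled_tasks table, assuming better-sqlite3 and the columns described above (id, agentId, status, last_result); the real WATCHMAN code and exact schema may differ.

```js
// Sketch of two probes. Column names follow the description above -- treat as illustrative.
const Database = require('better-sqlite3');
const db = new Database('store/claudeclaw.db', { readonly: true });

// Probe 1: failures the scheduler already knows about.
const failed = db
  .prepare(`SELECT id, agentId FROM scheduled_tasks WHERE status = 'failed'`)
  .all();

// Probe 6: hidden failures -- crash shapes buried in results recorded as successes.
const crashShapes = /ERR_DLOPEN_FAILED|MODULE_VERSION|Traceback|UNIQUE constraint/;
const hidden = db
  .prepare(`SELECT id, agentId, last_result FROM scheduled_tasks WHERE status != 'failed'`)
  .all()
  .filter((row) => crashShapes.test(row.last_result ?? ''));

if (failed.length || hidden.length) {
  // The real probe posts to the Alerts forum topic; logging stands in here.
  console.error({ failed, hidden });
}
```

The design choice that matters is in probe 6: it never trusts the status column alone; it re-reads what each task actually wrote.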
On top of that, there's a separate LaunchAgent running a healthcheck every 30 minutes that lives outside the main daemon and uses a keychain-backed Telegram token. If the daemon is dead, the healthcheck still delivers the alert. That's the lesson from failure three: the watcher cannot share fate with the watched.

Memory v2

OpenClaw's memory was HEARTBEAT.md and LEARNINGS.md—flat files I appended to. Eventually they got long enough that the agent stopped reading them usefully, and I had no query surface to pull just the relevant bits.

ClaudeClaw's Memory v2 is a five-layer context stack:

1. Semantic recall—cosine similarity against stored memory embeddings, top 5 by score, chat-scoped.
2. Recent high-importance memories—memories with importance >= 0.7 written in the last 7 days.
3. Consolidation insights—a 30-minute loop that summarizes the short-term buffer into durable notes.
4. Cross-agent hive—stubbed for now; eventually lets MAESTER peek at something STEWARD noted this morning.
5. Conversation history—last N turns.

Layers dedupe by memory ID. The whole thing is safe to drop into the SDK's systemPrompt option. It's not magic. It's just queryable instead of append-only, which is the delta between "context I can use" and "a log file I'll never re-read."

Forum-topic routing instead of bot-per-agent

A small but satisfying piece. All thirteen agents post to one Telegram bot, into one supergroup, but each agent has a dedicated forum topic:

Alerts → thread 22 (WATCHMAN)
ASTGL → thread 23 (MAESTER)
Council → thread 24
Steward → thread 25
Whisperers → thread 26
War Room - Security → thread 40 (WAR)

One token. One chat. Threaded conversations per domain. The ergonomics are dramatically better than 13 separate bots with 13 separate tokens, which is the architecture I almost built before I remembered that Telegram supergroups have forum topics now.

Why This Matters

A few things I want to flag for anyone planning something similar.

Build the rollback before you build the new thing. I wrote scripts/retire-openclaw.sh with explicit --rollback semantics before I disabled a single cron job. Plists get moved (not deleted) into _retired-openclaw/. Cron jobs get flipped to enabled: false with a timestamped backup (jobs.json.bak.pre-retire-20260419). The OpenClaw directory sits untouched for 30 days with a calendar reminder to delete it. If ClaudeClaw had cratered on day two, I was one shell command away from being back on the old system in under a minute.

Silent success is worse than loud failure. The design principle I pulled from this whole experience: every job in the system needs someone whose job it is to doubt that