As The Geek Learns

The Geek

Tools and training for IT professionals. Join James Cruce—a systems engineer with 25+ years managing enterprise VMware infrastructure—for practical PowerCLI tutorials, automation tips, and lessons from the trenches. Much to learn, there always is. astgl.com

  1. 4d ago ·  Bonus

    5 Questions to Ask Before You Build the AI Project Your CEO Just Pitched

    5 Questions to Ask Before You Build the AI Project Your CEO Just Pitched You know the email. It shows up Tuesday morning, forwarded with a few lines of enthusiasm and a ChatGPT-drafted proposal attached. "Saw this and thought of us. Can we do this?" The PDF has a logo, bullet points, and exactly zero integration requirements. It also has a six-week timeline and a budget that assumes nothing goes wrong. You have somewhere between 24 and 72 hours before your CEO follows up asking what you think. If you say yes, you're on the hook for a project you didn't scope. If you say no, you're the person who kills ideas. Neither answer is actually available to you. What you need is a third path: a structured evaluation that produces a defensible, professional response in the time it takes to drink your morning coffee. That's what the Technical Reality Check is. Five questions. One page. Every answer points directly at a commitment your organization will have to honor if this project moves forward. Here it is in full. The Technical Reality Check: 5 Questions That Surface What the Proposal Left Out Question 1: What specific business outcome does this solve, and how will we measure success? AI tools generate confident-sounding proposals that describe solutions, not problems. A proposal for "an AI-powered IT ticketing system" describes a technology. It doesn't describe what's broken right now, how broken it is, or what "fixed" looks like in measurable terms. Before any conversation about implementation, you need an answer to: what does success look like in six months, and how will we know we hit it? Ticket resolution time down 30%? First-contact resolution rate up 20%? Those are real answers. "Things will be more efficient" is not. Unmeasurable projects never officially fail. Which means they never stop consuming resources. This question isn't about being difficult. It's about making sure the organization is buying an outcome, not a technology. The red flag: Any proposal where the only success metric is "we deployed it." Question 2: Who owns the ongoing maintenance, security patching, and vendor relationship? Vendor proposals describe launch day. They are almost entirely silent about year two. Every new system creates a permanent maintenance obligation: patching, credential rotation, user access reviews, API deprecations, contract renewals, and a support relationship with a vendor whose incentives are not aligned with yours. If that obligation doesn't have a named owner before the project starts, IT inherits it by default. Forever. Without headcount. This question forces the conversation about operational reality before anyone has signed a contract. The answer also tells you a lot about how seriously the proposal was thought through. If nobody has asked "who maintains this?", nobody has thought past the demo. The red flag: "The vendor handles everything." Vendors handle their system. You handle the integration, the credentials, the user provisioning, the data pipeline, and the 2 AM alert when something breaks between their system and yours. Question 3: What happens to our existing systems, data, and processes? New systems don't exist in a vacuum. They touch your directory, your ticketing system, your identity provider, your backup scope, your audit logs. Each of those integration points is a potential failure mode, a migration cost, or a compliance question. AI-generated proposals routinely skip integration complexity. This isn't because the AI is being deceptive. It's because the AI generating the proposal doesn't know your stack. The proposal was written in a context-free environment. Your environment is anything but. Before committing, you need to know: what does this touch, and what has to move or change for it to work? And who does that work? Data migration alone can turn a "simple" deployment into a multi-month project. Asking this question early is how you find out. The red flag: "It integrates easily with your existing tools." That's a sales phrase, not an engineering estimate. "Easy" is undefined until your systems engineer has looked at the API docs. As The Geek Learns is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Question 4: What's the realistic timeline and resource cost, not the optimistic one? Vendor timelines assume clean data, available staff, smooth approvals, and nothing else on the backlog. Your timeline accounts for your actual team, their current commitments, the security review cycle, the change management process, and the three things nobody predicted. The gap between those two numbers is usually where projects go sideways. Not because the technology failed, but because the plan never accounted for reality. This question also surfaces a common pattern: the timeline was set before IT was consulted. Any timeline that precedes a technical assessment is a guess dressed up as a schedule. You're the one who'll be explaining the delay when the guess turns out to be wrong. The red flag: A go-live date in the proposal. That's not a plan, it's a target somebody made up. Ask who set it and what it was based on. Question 5: What's the exit strategy if this doesn't work as expected? Every vendor says their product works. You need a plan for when it doesn't. When the pricing doubles at renewal. When the company gets acquired and support degrades. When a compliance requirement changes and the product doesn't keep up. Data portability, rollback procedures, and contractual exit terms are not pessimism. They're the difference between a manageable failure and a situation where you're paying for a system that doesn't work because migrating off it is too expensive to contemplate. This question also signals organizational maturity. IT teams that ask exit questions before they sign contracts don't get held hostage. IT teams that don't ask end up managing a five-year sunset project for a tool they stopped believing in three years ago. The red flag: "We can always just stop using it." Can you migrate your data? In what format? At what cost? How long does it take? If nobody has asked those questions, stopping isn't as simple as it sounds. The Checklist in Practice: Walking Through a Real Scenario Here's what a Technical Reality Check pass looks like when you actually run it. Your CEO forwards a ChatGPT-drafted proposal on a Monday morning. The subject line is "AI Agent for IT Ticket Triage." The proposal is two pages. It describes an AI system that reads incoming IT tickets, categorizes them by priority and type, routes them to the right team, and drafts first-response emails automatically. There's a mockup screenshot. There's a line about "easy integration with your existing ITSM." There's a timeline: six weeks to deployment. You open the Technical Reality Check. Q1: What specific business outcome does this solve? The proposal says "reduce response times and improve IT efficiency." No baseline. No metric. You check your current ITSM data: average first response is 4.2 hours, your SLA target is 2 hours, you're meeting it 71% of the time. Now you have a problem worth solving. You write it down: "We need first-response SLA compliance above 85%. Current state: 71%." That's the outcome. If the AI system can't demonstrate a path to that specific number, the conversation is premature. Q2: Who owns maintenance and the vendor relationship? Nobody is named in the proposal. You have a team of four. One of them is already carrying the ITSM admin role. You note: this needs a named owner and a rough estimate of ongoing hours before it can go to planning. You also flag the API integration dependency: your ITSM has a rate-limited API that's caused problems before. Someone needs to read the vendor's API docs before "easy integration" gets treated as a fact. Q3: What happens to existing systems and data? Your ticketing data includes ticket histories, customer records, and some attachments. The proposal doesn't mention data handling. You note two questions: where does ticket data go once the AI processes it, and what are the data residency requirements given that you handle some HIPAA-adjacent systems? That second question alone could be a blocker. You don't know yet, but you know to ask. Q4: What's the realistic timeline and resource cost? Six weeks assumes nothing else is happening. Your team is currently in the middle of a server migration that runs through the end of the month. Realistically, this project can't start until mid-next-month, and your most experienced engineer (the one who'd need to own the integration) is at 90% utilization. You write down: "Realistic start: six weeks out. Realistic deployment: 12-16 weeks from proposal receipt. Not 6." Q5: What's the exit strategy? The proposal doesn't mention it. You note: before any contract, you need to know the data export format, the contract term length, and what happens to stored ticket data at offboarding. That's it. You've just done a Technical Reality Check. Total time: 15 minutes. Now you can write a response. Not "no." Not "yes." Something like: "I've done a preliminary review. Before we can assess feasibility, I need answers to five specific questions. Here they are. Happy to set up 30 minutes to walk through them together." You've moved the conversation from enthusiasm to decision-ready. You've protected the organization without being obstructionist. And you have a written record of the questions you asked, which matters if the project later goes sideways without those answers ever being provided. That's the whole point of the Technical Reality Check. It's not a rejection letter. It's the question set that separates proposals worth pursuing from proposals worth deferring. What the Rest of the Toolkit Covers The Technical Reality Check is the first thing you run. It gets you to a defensible position in 15 minutes. But the full response (the one that protects your career, your team's credibility, and the

    14 min
  2. Anthropic Shipped an AI Security Scanner. Here's the Per-PR Cost Math.

    6d ago ·  Bonus

    Anthropic Shipped an AI Security Scanner. Here's the Per-PR Cost Math.

    The first time my manager asked, “Are we using AI to scan PRs for vulnerabilities yet?" I said I'd look into it. Then I spent four hours reading docs, pricing pages, and GitHub issues before I had a number I trusted enough to put in a Slack message. That should have taken twenty minutes. The number exists. The math is straightforward. Nobody had written it down in one place where a platform engineer could find it. Anthropic quietly shipped `anthropics/claude-code-security-review` as a first-party GitHub Action. You add a workflow file, point it at a secret, and it posts a findings comment on every pull request. The scanner reasons about code rather than matching signatures, which means it catches things like logic-level injection paths that a regex-based tool would miss. It also means the false-positive profile is different from what you're used to, and you need a triage process before you wire it to branch protection. This article gives you the cost math and the triage playbook in full. Both are things you'd need even if you built this yourself. Why "Just Run It" Isn't a Strategy Adding a CI step that calls an LLM API isn't free, and it isn't free to manage. There are two failure modes I see teams hit. The first is budget surprise. Someone adds the scanner, it runs for a month, the cloud bill shows up, and the conversation gets uncomfortable because nobody did the math upfront. The scanner doesn't cost a lot, but "not a lot" needs a number attached to it before you walk into a budget conversation. The second failure mode is alert fatigue. The scanner finds something on every PR. Engineers start skimming the findings comment the same way they skim Dependabot. One day there's a real SQL injection in a PR, it's buried in a list of five findings, and it merges. The triage process is what keeps findings meaningful instead of noise. Both problems are solvable. The math takes ten minutes. The triage rubric takes one meeting to agree on. Neither requires buying anything yet. As The Geek Learns is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. What This Costs Per PR (The Real Numbers) Claude bills per token. One token is roughly four characters of text. A PR diff gets converted to tokens and sent to the model as input. The model's findings comment is output tokens. The formula is simple: Cost = (input_tokens × input_rate) + (output_tokens × output_rate) For Claude Sonnet 4.6, the rates are approximately $3 per million input tokens and $15 per million output tokens. (Verify current pricing at platform.anthropic.com before your next budget conversation. Rates change.) Scenario 1: A 200-Line PR Diff A focused bug fix or small feature. Maybe three files changed. Component Tokens Rate Cost ----------------------------------------------------------------------- System prompt + workflow context (input) 2,000 $3.00 / 1M $0.006 PR diff, ~200 lines (input) 1,300 $3.00 / 1M $0.004 Findings output, 1-2 findings (output) 600 $15.00 / 1M $0.009 ----------------------------------------------------------------------- Total per PR ~$0.019 Call it two cents. For a small PR, this is a rounding error. Scenario 2: A 2,000-Line PR Diff A refactor, a new feature, a dependency upgrade touching multiple services. Component Tokens Rate Cost ------------------------------------------------------------------------ System prompt + workflow context (input) 2,000 $3.00 / 1M $0.006 PR diff, ~2,000 lines (input) 13,000 $3.00 / 1M $0.039 Findings output, 2-4 findings (output) 1,500 $15.00 / 1M $0.023 ------------------------------------------------------------------------ Total per PR ~$0.068 Seven cents. Still noise for a single PR. Monthly Back-of-the-Envelope The question your manager will ask isn't “What does one PR cost?" It's “What does this cost per month?" If your team merges 80 PRs a month (about 4 per business day), with a mix of small and medium diffs averaging around $0.04 per scan: 80 PRs × $0.04 = $3.20/month Even if your average PR runs larger, say closer to the 2,000-line scenario at $0.07 each: 80 PRs × $0.07 = $5.60/month A busy multi-team repo at 400 PRs a month at $0.07 each is $28/month. That's less than one developer's Spotify subscription. The cost math isn't the obstacle here. The obstacle is having a triage process in place before you flip it on. One practical note: output token count varies with how many findings the scanner generates. Zero findings produces shorter output and costs less. Ten findings costs a bit more. The estimates above assume one to three findings per PR, which is realistic for an established codebase with existing security hygiene. The 3-Tier Triage Rubric Every finding the scanner posts needs to land in one of three buckets. Here's the decision framework. REAL: Block the merge. Fix it. A finding is REAL when it describes an exploitable path with proof. The scanner should show you the specific line, and explain how an attacker would reach it, and the explanation should hold up when you read the code yourself. SQL injection via string concatenation in a request handler is REAL. Hardcoded credentials that actually ship to production are REAL. The discriminator: "If an attacker had this codebase and five minutes, could they demonstrate this?" If yes, it's REAL. Block the PR and fix it before merge. PROBABLE: Human review required. A finding is PROBABLE when the pattern is plausible, but context matters. The scanner can see the diff, not the full runtime environment. A finding might flag a code path that looks injectable, but your framework wraps every database call with prepared statements at a layer the scanner can't see. Or the flagged code only runs in a context that requires prior authentication the scanner doesn't know about. The discriminator: "This could be real, but I need someone who knows this codebase to confirm." Don't block the PR automatically. Route it to the PR author or a senior engineer. Give it a two-hour resolution window before it escalates. DISCARD: Suppress it with a documented rule. A finding is DISCARD when it's structurally a false positive. The scanner flagged test code that never runs in production. It flagged a generated file you don't own. It flagged a template placeholder in an IaC file that gets substituted at deploy time. It flagged a public API URL as a hardcoded credential because the word "key" appeared in the variable name. The discriminator: "Would an attacker gain anything by knowing this?" If no, it's a DISCARD. The important part is that you document why. Suppressing without a comment is how you end up silently ignoring real findings six months later when the context is gone. A Worked Example: The SQLAlchemy False Positive Here's the kind of finding that will show up on your team in the first two weeks if you use any ORM. A PR adds a new search endpoint. Somewhere in the diff, there's code like this: def search_users(search_term: str): results = db.session.query(User).filter( User.name.ilike(f"%{search_term}%") ).all() return results The scanner flags it as a potential SQL injection vulnerability. The finding explains that `search_term` appears to be user-controlled input and is being interpolated into a query string. Severity: HIGH. A human reading this would notice a few things. The code uses SQLAlchemy's ORM layer. The `.ilike()` method is a SQLAlchemy query construct, not a raw SQL string. SQLAlchemy sends the query to the database as a parameterized statement with the value bound separately, which is exactly the defense against SQL injection. The `f"%{search_term}%"` is constructing the pattern string in Python, but that pattern gets passed as a bound parameter by the driver. This is a DISCARD. The scanner saw string interpolation near a database call and correctly identified that as a pattern worth flagging. It couldn't see that the ORM handles parameterization automatically. The suppression note you'd document reads something like: SQLAlchemy ORM calls via `.filter()`, `.ilike()`, `.like()`, and similar query methods use parameterized queries automatically. String interpolation to construct pattern values (e.g., for LIKE clauses) does not create injection risk when using these methods. Do not flag SQLAlchemy ORM filter calls as SQL injection. That note goes into a filter file your workflow references. The same class of finding stops appearing on every PR that touches a database query. Two things to notice about this example. First, the scanner wasn't wrong to flag it. Without ORM context, string interpolation near a SQL-like method call is exactly what a good scanner should notice. Second, the suppression is better than just dismissing it, because the documented rule now covers every future PR using the same pattern. You pay the triage cost once. What the Full Kit Covers The cost math and triage rubric are the foundation, but they don't tell you how to wire any of this into GitHub. The full guide covers the GitHub Actions workflow YAML itself (the one that calls `anthropics/claude-code-security-review` and handles the findings response), how to set up branch protection so that HIGH findings actually block merges instead of just posting a comment, and the in-workflow automation that runs the REAL/PROBABLE/DISCARD classification before the comment lands on the PR. There's also a head-to-head with GPT-4o as a second-opinion scanner. They're not equivalent tools. The Anthropic action is purpose-built for this job. The GPT-4o path is a chat completions API call with a security prompt, which costs about seven times less per PR but produces more variable results. The comparison matri

    12 min
  3. 6d ago

    Stop Paying for Cloud APIs: Building a Local AI Stack on Mac Studio

    Running LLMs locally usually feels like a compromise. You either get tiny, fast models that can't think or massive models that crawl at one word per minute. But with the right hardware, you can break that trade-off and replace your cloud billing entirely. The Setup The dilemma most developers face is a choice between two bad options. On one side, you have cloud APIs like OpenAI or Anthropic. They are easy to use and incredibly smart, but they come with a heavy "API tax" and privacy concerns. If you're processing proprietary code or sensitive customer data, sending that information to a third-party server is a massive risk. On the other side, you have traditional local setups. Usually, you're limited by the VRAM on your GPU. If you have a standard consumer card with 12 GB or 24 GB of VRAM, you're stuck with small models. You can't run the heavy-hitters that actually compete with GPT-5. This creates a wall where local AI is only good for "toy" problems, while production workloads stay in the cloud. The Hardware Math The real secret to breaking this wall is Apple Silicon's unified memory. On a Mac Studio with an M3 Ultra, the 256 GB of memory is shared between the CPU and the GPU. This eliminates the VRAM bottleneck that kills most local setups. You aren't limited by a tiny slice of video memory; you're limited by the total pool of system memory. When you look at the actual numbers, the math becomes very clear. Here is how I structure my model loading on this machine: If you load all of these concurrently, you're using roughly 107 GB of memory. That leaves about 149 GB for the macOS, your browser, your IDE, and everything else. This allows you to run a 32B model for writing, a 72B for research, and an 8B for quick checks all at the same time. The economics are just as compelling. A Mac Studio setup costs anywhere from $4,000 to $7,000 as a one-time purchase. If your production workflows are costing you $200 to $500 per month in cloud tokens, the hardware pays for itself in 12 to 18 months. After that, the "cost" of running a massive model is basically just the electricity it uses. Plus, you finally own your data. Temperature Is a Randomness Dial, Not a Quality Dial I see a lot of tutorials that suggest using a temperature of 0.7 for every single prompt. That is a mistake. Temperature doesn't make a model "smarter" or "better." It is simply a randomness dial. It controls how much the model is allowed to deviate from the most likely next word. If you use the same temperature for everything, your pipeline will fail. For tasks requiring high precision, a high temperature will introduce hallucinations. For creative tasks, a low temperature will make the output feel robotic and repetitive. In my production newsletter pipeline, I use a specific routing table to manage this: There are two key takeaways here. First, for fact-checking, you want the temperature at 0.1. This makes claim extraction repeatable and ensures your verdicts are consistent every time you run the script. Second, setting the temperature to 0.8 for "humanization" might seem counterintuitive, but it works. A higher temperature allows the model to make less predictable word choices, which actually produces more natural, less "AI-sounding" prose. The OpenAI Compatibility Trick As The Geek Learns is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. The best part about using Ollama for this setup is that you don't have to rewrite your entire codebase. Ollama exposes an OpenAI-compatible API at `localhost:11434/v1`. This means any tool, library, or SDK that respects the `OPENAI_BASE_URL` environment variable can be redirected to your local machine with almost zero effort. You can point your existing Python scripts or LangChain agents to your local Mac by simply setting these variables in your terminal: export OPENAI_BASE_URL=http://localhost:11434/v1 export OPENAI_API_KEY=ollama # Any value works; Ollama doesn't check this If you are working within a configuration file, such as a JSON config for a custom agent, it looks like this: { "model": "openai/qwen3:32b-fast", "openai_base_url": "http://localhost:11434/v1", "openai_api_key": "ollama" } Every LangChain chain, every summarization script, and every SDK that follows the OpenAI protocol becomes a free local-model call. You can migrate an entire project from GPT-4 to your local M3 Ultra in about 30 seconds. Why This Pattern Matters This isn't just about saving money on API credits. It is about architectural sovereignty. When you move your core intelligence layer to local hardware, you remove the dependency on a single vendor's uptime, pricing changes, and content filtering policies. The pattern of using unified memory to host multiple specialized models at different temperatures allows you to build a "factory" of intelligence. You have a high-speed 8B model for sorting, a balanced 32B model for drafting, and a heavy 70B model for deep reasoning, all running in the same memory space. This is how you build a production-grade AI stack that is private, permanent, and incredibly cost-effective. ( This cost calculation was based on 6-month-ago pricing when I bought my Mac Studio. Since then the availability of Mac Studios with large amounts of unified memory has evaporated. This has driven up pricing. Hopefully this is temporary. ) Quick Reference Key Commands * Set local base URL: `export OPENAI_BASE_URL=http://localhost:11434/v1` * Check running models: `ollama ps` Temperature Cheat Sheet * 0.1 to 0.3: Extraction, coding, fact-checking, and structured data (JSON). * 0.7: General purpose, drafting, and summarization. * 0.8 to 1.0: Creative writing, brainstorming, and persona simulation. Found this useful? I share practical lessons from my systems engineering and AI journey at As The Geek Learns Get full access to As The Geek Learns at astgl.com/subscribe

    11 min
  4. May 25

    I Built a Self-Improving AI Swarm. After 100 Runs It Was No Better Than Run One.

    I spent twelve hours watching a leaderboard that refused to move. The setup was simple: six AI agents tasked with writing technical articles. They were designed to be a closed loop. The drafter would write, the grader would score, and the agents would then "evolve" their own configs to chase a higher score. I hit "go" on my Mac Studio, went to bed, and woke up to a flat line. After 100 iterations, the average score had crawled from 63.0 to 63.9. The all-time peak was 69.0 at iteration 79, but the system never stayed there. It was a C-minus. Indistinguishable from noise. I had fallen for the Autonomy Fallacy. I assumed that if I gave a swarm of LLMs the right knobs—temperature, max_tokens, and the ability to append "prompt additions" to their system prompts—they would naturally drift toward quality. I was wrong. As The Geek Learns is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. When I opened config/agents/drafter.yaml to see what the agent had "learned," I found a disaster. The prompt_additions list had evolved into five overlapping phrases of pure SEO buzzword soup. It was telling itself to be "semantically rich," "data-dense," and to "enhance semantic alignment by including keyword-integrated background information." The drafter hadn't learned how to write a better article. It had learned how to trick the grader. The Smoking Gun The smoking gun was the model choice. I was using qwen3:8b as the grader to judge the output of qwen3:32b-fast. I had a smaller, weaker model acting as the quality gate for a larger, smarter one. The 8B model couldn't tell the difference between a nuanced technical insight and a paragraph full of "semantically rich context." To the grader, the buzzwords looked like "density." The agents converged on what the grader liked, not on what a human would actually publish. This wasn't self-improvement; it was reward hacking. To make it worse, the first twenty iterations were a total wash. I had a silent JSON parse failure in the config-evolution logic: Expecting value: line 1 column 1 (char 0). The agents were trying to mutate their configs and failing, but the loop kept running. By the time I pushed the fix in commit c28a611, the system had already drifted into a local maximum of corporate-speak. I realized that self-improvement requires an external pull. You cannot have a system where the performer and the judge are of the same pedigree, or worse, where the judge is the weaker link. The Rebuild I tore the architecture down and built v2. First, I moved the "brain" of the operation. The performance stayed local. I used gemma4:31b on the Mac Studio to generate the text, but I moved the judging to the cloud. I plugged in Sonnet 4.6. I decided the cheapest place to spend API tokens wasn't on generating 2,000-word drafts, but on grading them. Second, I killed the "single-shot mutation" approach. In v1, the agent changed its prompt, ran once, and if the score went up, the change stuck. That's too much noise. I replaced it with a tournament. Now, the system samples three different prompt templates from a versioned library. The performer generates three candidates. Sonnet ranks them using a structured rubric and a single API call. Then I implemented an Elo system for the templates. # src/prompt_library.py (excerpt) def record_tournament(self, ranking: list[str]) -> dict: for i in range(len(ranking) - 1): winner = self.templates[ranking[i]] loser = self.templates[ranking[i + 1]] expected_w = 1 / (1 + 10 ** ((loser.elo - winner.elo) / 400)) delta = ELO_K_FACTOR * (1 - expected_w) winner.elo += delta loser.elo -= delta self._maybe_retire_losers() # Templates below Elo 1300 are deleted The templates that consistently win the tournament climb the leaderboard; the ones that produce buzzword soup are automatically retired. What Happened Next The difference was immediate. On the very first run of v2, the drafter scored 81.45. That's twelve points higher than v1's all-time best. Over 25 pinned verification runs, the mean score was 82.67 with a standard deviation of 2.18. The worst draft in that run scored 75.4—still above v1's ceiling of 69.0. The most satisfying part was the judge's feedback. When the system tested the v1-baseline template, Sonnet didn't just give it a low score. It wrote: "The headline 'The Rust Revolution' is pure SEO-speak and the opening paragraph is a textbook AI tell... it's the kind of breathless corporate copy that kills trust immediately." That is exactly the failure mode the local 8B grader had been blind to for 100 iterations. The cost is roughly four cents per tournament. For the price of a coffee, I can run 125 iterations and actually trust that the line on the graph is moving upward. What I'd Tell Myself a Week Ago If you're building a self-improving loop, don't trust the autonomy. You need three things: * A judge stronger than the performer. If the judge is weaker, you aren't optimizing for quality; you're optimizing for the judge's biases. * Tournament selection. Single-shot mutation is just a random walk. You need multi-candidate comparisons to clear the noise floor. * A human-review gate. No automated judge is calibrated forever. Build in a pause where you manually pick the winner and anchor the next round. Stop trying to make the agents smarter. Just buy a better mirror. Improvement isn't about the engine—it’s about the feedback loop. Get full access to As The Geek Learns at astgl.com/subscribe

    7 min
  5. May 16

    Managing Anthropic Agent SDK Costs: A Post-June 15 Billing Playbook

    Your background agents are about to run out of money. Anthropic's new credit pool system means your automation could die in a single week. Here is how I re-engineered my stack to stay under budget without breaking my workflows. The Setup You've built a small fleet of agents. They sort your mail, watch your repos, file your daily briefings. My current setup before the June 15th cutover: Then May 13 lands, and Anthropic announces the change: on June 15, every programmatic Claude call moves into a metered monthly credit pool. $100 a month on Max 5x. No rollover. Run the math against your actual schedule. If you've got anything polling on the order of minutes (cron pipelines, hourly digests, watchdog sweeps), that pool drains in 7 to 10 days. And here's the kicker. Your interactive Claude Code keeps working. Your headless automation just stops. You wake up to a dead pipeline, a drained pool, and a subscription that still says active. What's Actually Going On This isn't just a random pricing tweak. There is a clear economic driver here. Throughout early 2026, many third-party tools used the Agent SDK at a $20 Pro subscription rate to run workloads that would cost hundreds at standard API rates. It was essentially compute arbitrage at scale. As The Geek Learns is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Anthropic started cracking down in April, but the May 13 announcement is the structural fix. They are moving to dedicated monthly credit pools to restore access under metered billing. The reality is that most agentic operating systems are built directly on the Agent SDK. Because these agents lack a human in the loop to throttle their usage, they are now metered by default. Interactive sessions stay on the flat-rate subscription because the human provides the natural brake. Programmatic agents do not. The Fix I implemented a two-phase mitigation to deploy before the June 15 deadline. Phase 1 was a hot patch designed to provide immediate protection. I added a BILLING_MODE environment variable with three states: unmetered, metered, and paused. The paused state blocks every programmatic call across all providers, while metered enforces a strict cap on the Anthropic route. I also added a file-backed JSON ledger at store/billing-ledger.json to track monthly costs. It uses a write-then-rename pattern to ensure crash safety during updates. To handle errors, I introduced a BillingCapExceeded error class. I used the same instanceof pattern as my KillSwitchRefusal logic so a typo in a message cannot accidentally trigger a retry loop. The logic lives in a single chokepoint: runAgent() in src/agent.ts. The pre-call gate checks the cap, and the post-call gate records result.totalCostUsd from the SDK, firing a Telegram alert if a threshold is crossed. As a final safety measure, I cut the cadence on my two highest-frequency tasks: the pipeline-advance cron moved from 15 minutes to hourly, and I paused the council-evening task entirely under metered mode. // src/config.ts — tri-state env that gates programmatic agent calls export const BILLING_MODE = optional('BILLING_MODE', 'unmetered'); export const BILLING_CAP_USD = number('BILLING_CAP_USD', 80); // src/agent.ts — pre-call gate in the dispatcher function assertBillingAllowed(provider: Provider): void { if (BILLING_MODE === 'paused') { throw new BillingCapExceeded( 'BILLING_MODE=paused — programmatic agent calls are disabled.', ); } if (provider === 'anthropic' && BILLING_MODE === 'metered') { const total = getMonthlyTotal(); if (total >= BILLING_CAP_USD) { throw new BillingCapExceeded( `Anthropic monthly credit cap reached: $${total.toFixed(2)} >= $${BILLING_CAP_USD.toFixed(2)}.`, ); } } } export async function runAgent(opts: AgentOptions): Promise { assertEnabled('AGENTS_ENABLED'); const provider: Provider = opts.provider ?? 'anthropic'; assertBillingAllowed(provider); if (provider === 'ollama') return runOllamaAgent(opts); if (provider === 'codex') return runCodexAgent(opts); return runAnthropicAgent(opts); } Phase 2 focuses on the long-term router infrastructure. I promoted runAgent() from a direct SDK caller to a dispatcher that can route across anthropic, ollama, and codex providers. I also extended the agent.yaml schema with provider: and local_model: fields. I shipped a single-turn Ollama runner that wraps the local-LLM client. It returns totalCostUsd: 0 and a model tag like ollama:llama4:scout. I deliberately avoided tool calls in this initial version to keep the scope small. # agents//agent.yaml — new fields, validated at load id: scout name: SCOUT model: claude-sonnet-4-6 provider: anthropic # default. flip to 'ollama' to route locally. # local_model: llama4:scout # used when provider: ollama To be honest, I did not actually flip any agents to Ollama in this specific PR. The agents I need to move, like STEWARD or WATCHMAN, execute Bash and SQLite queries. A local runner without tool-call support would break them silently. Building a proper tool-call shim takes a few more days, but the cadence reduction and the billing breaker alone are enough to keep my spend under $80 per month. Why This Matters Every person using an agent OS is in the same boat. Whether you use ClaudeClaw, Cline, Aider, or Roo Code, the underlying SDK is the same, and the June 15 cliff is approaching. The playbook I used generalizes: you need one chokepoint, one ledger, and one way to audit your cadence. We also need to be honest about workload requirements. Tasks like editorial review or complex code deliberation still justify the Sonnet price tag. However, simple tasks like classification, routing, or summarization run perfectly fine on a local model with zero metered cost. The router infrastructure makes this migration a simple config flip rather than a massive code refactor. Finally, this reflects where the industry is heading. OpenAI has used usage-based pricing for a long time, and GitHub Copilot is moving toward credit pools. In the next year, more vendors will split consumption between interactive flat-rate plans and programmatic metered usage. Building this abstraction now means you won't have to scramble the next time a vendor changes their terms. Quick Reference * Single Chokepoint: Ensure every agent call flows through one function. This turned a three-week refactor into a one-week job. * Cadence over Architecture: Reducing task frequency (e.g., 15m to 1h) cuts spend faster than migrating to local models. * Ship the Breaker First: Implement the cost ledger and the BillingCapExceeded error as insurance before you attempt the complex provider migration. # The cutover, June 14: flip the env, restart, reseed, smoke-test BILLING_MODE=metered BILLING_CAP_USD=80 # then launchctl kickstart -k gui/$(id -u)/com.claudeclaw.app npm run pipeline -- schedule-advance npm run schedule -- pause council-evening Found this useful? I share practical lessons from my systems engineering journey at As The Geek Learns Get full access to As The Geek Learns at astgl.com/subscribe

    7 min
  6. May 8

    ChatGPT Just Invented an Entirely Fake Version of My MCP Server

    I asked ChatGPT to tell me about my own MCP server. It returned about a thousand words of confident, beautifully formatted, completely fabricated nonsense. Tables. Comparisons. A made-up acronym. A "thinking substrate" that sits above data and below agents. None of it is real, and that's the part worth talking about. The Setup My project is called `mcp-astgl-knowledge`. It's an MCP server with 15 tools for searching my newsletter articles, backed by sqlite-vec and Ollama. The whole thing fits on a laptop. ASTGL stands for "As The Geek Learns," which is the name of this newsletter. I wrote it. I shipped it. There is a public GitHub repo and a public package.json. So when a friend asked me what the MCP server actually does, I figured I'd see how each big AI assistant explained it. ChatGPT was first up. I typed in "ASTGL MCP Knowledge" and hit enter. What I got back wasn't an answer. It was a hallucination wearing the suit of an answer. "ASTGL (Abstract Semantic Task Graph Layer) MCP Knowledge Server is an emerging MCP server focused on structured knowledge representation and reasoning... it turns knowledge into graph-based, machine-reasonable structures that agents can query and evolve." That paragraph alone has three fabrications: the acronym expansion (made up), the "graph-based, machine-reasonable structures" (the server stores text chunks with vector embeddings, no graph), and "evolve" (the index is static, refreshed every six hours by a cron job, agents do not edit it). Then it kept going. A four-row "MCP stack" table positioning ASTGL as "the thinking substrate" between data and agents. A comparison matrix against fictional products called "Totem" and "SwarmClaw" that don't exist. A capabilities list including "task decomposition" and "reasoning over structure." Use cases. "Real-world examples." A confident sign-off: "If AST-grep is about seeing code better, then ASTGL is about thinking better." Every word of it written with the calm, structured, lightly-emoji'd authority that makes ChatGPT sound right by default. As The Geek Learns is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. What's Actually Going On When you ask an LLM about a topic it doesn't have indexed, it has two options: say "I don't know," or fill in the gap with something plausible. In practice, models default to the second one. They're trained to be helpful, and "I don't know" reads as unhelpful. So the gap gets filled. The result is what I'd call a fluency hallucination. The output has no factual grounding, but the writing is structured well enough that a casual reader can't tell. There are bullet points. There are tables. There's a "👉 In plain terms" callout. The rhetorical scaffolding looks like a real explainer because it's been pattern-matched to one. The contents underneath are pure fiction. This is a worse failure mode than search engines have. When Google doesn't know about you, you don't appear in results, and the user can see the gap. When an LLM doesn't know about you, the user gets a beautifully written description of someone the LLM made up, and your real work is still missing, but now there's a fake version sitting in front of it. For under-indexed creators (which, right now, is most of us), this is the default. Not the edge case. The Fix There's no quick patch for this on the engine side. The model isn't broken. It's doing what it was trained to do. The only handle I have is on my own side: make sure my real content reaches the retrieval surface, and measure whether it's working. So I built a citation tester. It's a small TypeScript script that hits Perplexity, Claude, and ChatGPT through their APIs, asks each one twenty target questions tied to articles I've already published, and parses the cited URLs from the response. If `astgl.ai` shows up, that's a hit. If it doesn't, that's the data. The point isn't that the floor is bad. I knew it would be. The point is that without a number, "improve our AEO" is a vibe, not a project. Every Monday at 9am the script runs again, writes a fresh row to a SQLite table, and tells me whether the floor moved. When it does move, I'll know which engine moved first, on which questions, and at what citation position. That's the actual feedback loop. Same root cause as the hallucination: my content isn't reaching the retrieval surface. Same fix: get it there. Different observability. Why This Matters If you write online and you care whether AI assistants represent you accurately, this is the thing to internalize: the alternative to being cited is not being silent. It's being replaced. Replaced by a confident summary of work you didn't do, opinions you don't hold, and product features you'd never ship. People who ask an LLM about your work and read its answer don't know they're reading fiction. They walk away with a model of you that you didn't write. The traditional AEO playbook talks about ranking, authority, and citation rate. All real, all worth measuring. But there's a tier underneath that, and it's the one most independent creators are stuck on right now: existence. Until your content is in the index, ranking doesn't apply. You aren't competing with anyone. You're competing with the LLM's imagination of you. Measurement is the cheapest part of fixing it, and it's the part most people skip. Quick Reference Four things that matter, in order: 1. Pick 20 questions your articles should answer. Tie each one to a specific URL on your site. 2. Hit each engine via API weekly. Perplexity returns a `citations[]` array. Claude returns search results in `web_search_tool_result` blocks. OpenAI returns `url_citation` annotations on `output_text` items. 3. Record the result to a small database, not a spreadsheet. You want trend data, not a snapshot. 4. Look at the floor first. Zero is a fine starting number as long as you're tracking it. The full script I'm using, including the gotcha where Node's `--env-file` silently dropped my Anthropic key on a fresh keypair, is in the repo. The article about the Anthropic key bug is coming separately. Found this useful? I share practical lessons from my systems engineering journey at As The Geek Learns Get full access to As The Geek Learns at astgl.com/subscribe

    8 min
  7. May 6

    The Ollama Model-Swap Death Spiral That Killed Every Cron at Once

    3 a.m. Every cron job on the Mac Studio failed inside the same 90-second window. No code changes. No model updates. No new jobs. Just a wall of timeout errors that lit up every channel I had wired to alerts. The culprit was hiding in plain sight: a fallback chain doing exactly what I told it to. The Setup One Mac Studio. One Ollama daemon. A handful of cron jobs each calling the local LLM for different tasks: code review, log summarization, doc indexing, a nightly digest. Each cron specified a preferred model. Each one inherited a "be resilient" fallback chain from the task router: try the preferred model, fall back to a smaller one, fall back to a tiny one if both fail. It looked clean on paper. Big model for the smart stuff, smaller model when the big one chokes, tiny model as a safety net. Classic graceful degradation. The kind of pattern you'd put in a "production-ready" checklist without thinking twice. The models on disk ranged from 4GB to 22GB. Loading the big one into VRAM took roughly 60 seconds cold. Generation, once warm, took 5 to 10 seconds. Guess which number I used to set the timeout. What's Actually Going On Here's the cascade. Cron A fires at 3:00:00 and asks for `qwen2.5-coder:32b`. The model isn't loaded. Ollama spends the entire 30-second timeout just paging the weights into VRAM. It never gets to generation. The request fails. The fallback chain kicks in and asks for `qwen2.5-coder:14b`. Ollama evicts the half-loaded 32b, starts loading the 14b. Another 30 seconds gone. Fallback again. Tiny model loads, finally generates. Cron A "succeeds" with degraded output. Meanwhile, Cron B fires at 3:00:15 expecting the 32b model that Cron A's first attempt was loading. Now there's a tiny model in VRAM instead. Cron B starts the same dance from a different starting point. Cron C lands on top of that. Within 90 seconds, every cron is waiting on a model swap that the next cron is about to invalidate. The fallback chain wasn't degrading gracefully. It was thrashing the VRAM and guaranteeing nobody finished. Every safety net I'd added was making the failure worse. As The Geek Learns is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. The Fix Two changes. No clever code. Just operational discipline. First, pin one model in VRAM with `keep_alive: 24h`. This is a request-level option that tells Ollama to stop evicting the model after the response. Default behavior is to unload after 5 minutes of idle. That's the eviction that lets the next caller's load attempt thrash everything. # Pin model in VRAM with keep_alive curl -s http://localhost:11434/api/generate -d '{ "model": "qwen2.5-coder:32b", "prompt": "test", "keep_alive": "24h" }' Second, force every frequent cron to use that same pinned model. Kill the fallback chain for hot-path workloads. Fallback is fine for one-off scripts you run by hand. It's poison when three crons fire in parallel against shared VRAM. To make sure the model is loaded before any cron fires, I added a LaunchAgent that runs the warm-up curl on boot: Label com.local.ollama-warmup RunAtLoad ProgramArguments /usr/bin/curl -s http://localhost:11434/api/generate -d {"model":"qwen2.5-coder:32b","prompt":"warmup","keep_alive":"24h"} Load it with `launchctl load ~/Library/LaunchAgents/ollama-warmup.plist`. Now the model is hot before login completes. Every cron hits a warm model and finishes in the 5-to-10-second window the timeouts were designed for. Result: zero model-swap thrashing since the change. Crons that used to fail intermittently now run consistently. Why This Matters The lesson isn't about Ollama. It's about cold-load math. Anytime your "graceful degradation" path is slower than your timeout, every retry makes the next caller's situation worse. Fallback chains assume the fallback is fast. Model loads aren't fast. Database failovers aren't fast. Cold containers aren't fast. Operational discipline beats clever code here. One hot model, no swaps, every cron pointed at the same target. The "less resilient" design is actually more reliable because it removes the failure mode entirely. If you're running local LLMs on shared hardware, assume VRAM is a single resource that gets thrashed under parallelism. Pin what matters. Warm it before it's needed. Don't trust fallback chains during peak hours. Quick Reference * Cold model load on a 20GB+ model: roughly 60 seconds * Warm generation: 5 to 10 seconds * Default Ollama eviction: 5 minutes of idle * Pin a model: `keep_alive: 24h` in the API request body * Warm-up on boot: LaunchAgent (macOS) or systemd unit (Linux) * Hot path rule: one model, no fallback, same model across every concurrent caller * Reserve fallback chains for interactive, single-caller use If you found this article useful, you can find more articles like this at: As The Geek Learns Get full access to As The Geek Learns at astgl.com/subscribe

    7 min
  8. May 2

    I Killed OpenClaw and Built ClaudeClaw Mission Control

    Two months ago I wrote about ripping Notion out of my workflow and replacing it with OpenClaw—a self-hosted AI agent framework running on my Mac Studio. No cloud. No subscription. No black box. Last weekend I shut it down. Disabled 38 cron jobs. Moved 23 LaunchAgents into a _retired-openclaw/ quarantine folder. Killed the Ollama daemon. Archived the directory with a 30-day deletion timer. Everything in that original article still reads as true. Local-first is still right. Data ownership is still right. The critique of SaaS “well-enough” software is still right. What I got wrong was believing OpenClaw was the right vehicle for any of it. This is the post-mortem and the replacement: an agent OS I built on top of the Claude Agent SDK called ClaudeClaw Mission Control. Thirteen themed agents. One daemon. A scheduler I can actually see into. Zero silent failures slipping past me for a week before I notice. Let me explain how I got here. The Setup OpenClaw was doing real work. 38 cron jobs. Morning briefings. Evening summaries. A content pipeline that pulled research from web sources, structured it, scored it, and queued articles for ASTGL. An email triage pass. A model-usage monitor. A nerve-health monitor watching the other monitors. On paper: impressive. In practice: I had no idea if any of it was working. The system was so noisy that when something broke, I learned about it four days later when I noticed my morning briefing hadn’t arrived. Or I didn’t learn about it at all, because the cron job was exiting 0 while the script inside it was crash-looping. That last one is the killer. Let me show you what I mean. What’s Actually Going On Three failure modes hit me in a 48-hour window, and each one was invisible to the system watching the system. As The Geek Learns is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber. Failure one: successful exits, 100% broken payload. My content pipeline was ingesting URLs, and a regression introduced a trailing-slash bug that made example.com/foo and example.com/foo/ look like different URLs to the dedup layer. Every new article hit a UNIQUE constraint violation inside a subprocess. The outer wrapper caught the error, logged it to a file nobody was reading, and exited 0. For two weeks the cron appeared green while 100% of structurings were crashing. Failure two: PATH-resolved Node. I had the daemon running Node 24 (absolute path, explicit). A subagent it spawned inherited a PATH that fell through to Homebrew’s Node 25. One of the native modules (better-sqlite3) was compiled against 24, so every subagent invocation crashed with ERR_DLOPEN_FAILED and MODULE_VERSION mismatch. The smoke test I’d written passed because it ran from the daemon’s shell. The actual production path failed every time. Failure three: auth expiry with no escape hatch. OpenClaw stored some credentials in pass (the Unix password store). When my GPG key timed out, the daemon couldn’t start. Which meant the health monitor couldn’t start. Which meant the thing that would have told me about the outage was the thing that was out. OpenClaw had no watcher that lived outside the daemon it was watching. None of these are OpenClaw-specific bugs in the upstream sense. They’re pattern problems that emerge anywhere you have: 1. A monolithic daemon responsible for its own monitoring. 2. Flat-file state (HEARTBEAT.md, LEARNINGS.md) that gets appended to rather than queried. 3. Exit codes treated as truth when the real signal is in stderr. 4. No separation between “Did it run?” and “Did it work?” OpenClaw was built for a different job. It was a personal automation gateway—great at “kick off this script at 6:30 AM.” It wasn’t built to be an agent OS with observability. I was using a shovel to drive screws. I also couldn’t ignore the security posture. February’s disclosures—135,000 exposed instances, 15,000 vulnerable to RCE, the ClawHavoc plugin-registry incident, nine CVEs—had pushed me to patch hard and lock down. But every week I spent hardening OpenClaw was a week I wasn’t building what I actually wanted: themed agents that owned workstreams, could be reasoned about individually, and fail loudly. The Fix ClaudeClaw Mission Control is a Node.js daemon built on the Claude Agent SDK. It runs as a single LaunchAgent (com.claudeclaw.app), owns a SQLite store at store/claudeclaw.db, polls a scheduled_tasks table every 60 seconds, and dispatches due tasks to agents by ID. The interesting part isn’t the daemon. It’s the agents. I set up thirteen of them, themed after the small council of a certain fictional kingdom, because if I’m going to stare at this UI every day, I’d rather it amused me. Thirteen themed agents, each owning a workstream. STEWARD drives my mornings and evenings. MAESTER runs the ASTGL content pipeline. WATCHMAN watches the whole system from outside it. Each agent lives in its own directory at agents//, with an agent.yaml (model, personality, cwd, MCP servers) and a CLAUDE.md system prompt. A scheduled task carries an agentId column in the DB, and the dispatcher routes like this: if (shouldRouteViaAgent(task.agentId, listAgentIds())) { const result = await delegateToAgent(task.agentId, task.prompt, { fromAgent: SCHEDULER_FROM_AGENT, chatId: task.chatId, }); return result.text ?? '(empty response)'; } Adding a new agent is now: drop a folder under agents/, write a CLAUDE.md, run schedule reassign . No source changes. The dispatcher picks it up on next tick. That’s the piece I kept trying and failing to get with OpenClaw—modular ownership. In OpenClaw, everything was “the daemon.” In ClaudeClaw, MAESTER owning the content pipeline means if content alerts stop firing, the log line says maester: task failed instead of openclaw-gateway: subprocess exited nonzero. Attribution is free. Adding a new agent is now: drop a folder under agents/, write a CLAUDE.md, run schedule reassign . No source changes. The dispatcher picks it up on next tick. That’s the piece I kept trying and failing to get with OpenClaw—modular ownership. In OpenClaw, everything was “the daemon.” In ClaudeClaw, MAESTER owning the content pipeline means if content alerts stop firing, the log line says maester: task failed instead of openclaw-gateway: subprocess exited nonzero. Attribution is free. The Watchman probes WATCHMAN runs every hour at :05. It has seven probes, each targeting a failure mode that burned me on OpenClaw: 1. Failed tasks. status=’failed’ in the DB. Trivial. 2. Stuck tasks. status=’running’ AND last_run 3. Missed slots. status=’active’ AND next_run 4. Daemon liveness. launchctl print gui/$UID/com.claudeclaw.app—does launchd still have it? 5. Content-pipeline health. Tails the structured log file, parses the JSON, checks for crash shapes. 6. Hidden failures. Scans the last_result text column for ERR_DLOPEN_FAILED, MODULE_VERSION, Traceback, and other “the job exited zero but it sure didn’t work” signals. This is the probe that would have caught my trailing-slash bug in an hour instead of two weeks. 7. Delegation crashes. inter_agent_tasks WHERE status=’failed’ — on-demand agent invocations that blew up. On top of that, there’s a separate LaunchAgent running a healthcheck every 30 minutes that lives outside the main daemon and uses a keychain-backed Telegram token. If the daemon is dead, the healthcheck still delivers the alert. That’s the lesson from failure three: the watcher cannot share fate with the watched. Memory v2 OpenClaw’s memory was HEARTBEAT.md and LEARNINGS.md—flat files I appended to. Eventually they got long enough that the agent stopped reading them usefully, and I had no query surface to pull just the relevant bits. ClaudeClaw’s Memory v2 is a five-layer context stack: 1. Semantic recall—cosine similarity against stored memory embeddings, top 5 by score, chat-scoped. 2. Recent high-importance memories—memories with importance >= 0.7 written in the last 7 days. 3. Consolidation insights—a 30-minute loop that summarizes the short-term buffer into durable notes. 4. Cross-agent hive—stubbed for now; eventually lets MAESTER peek at something STEWARD noted this morning. 5. Conversation history—last N turns. Layers dedupe by memory ID. The whole thing is safe to drop into the SDK’s systemPrompt option. It’s not magic. It’s just queryable instead of append-only, which is the delta between “context I can use” and “a log file I’ll never re-read.” Forum-topic routing instead of bot-per-agent A small but satisfying piece. All thirteen agents post to one Telegram bot, into one supergroup, but each agent has a dedicated forum topic: Alerts → thread 22 (WATCHMAN) ASTGL → thread 23 (MAESTER) Council → thread 24 Steward → thread 25 Whisperers → thread 26 War Room - Security → thread 40 (WAR) One token. One chat. Threaded conversations per domain. The ergonomics are dramatically better than 13 separate bots with 13 separate tokens, which is the architecture I almost built before I remembered that Telegram supergroups have forum topics now. Why This Matters A few things I want to flag for anyone planning something similar. Build the rollback before you build the new thing. I wrote scripts/retire-openclaw.sh with explicit --rollback semantics before I disabled a single cron job. Plists get moved (not deleted) into _retired-openclaw/. Cron jobs get flipped enabled: false with a timestamped backup (jobs.json.bak.pre-retire-20260419). The OpenClaw directory sits untouched for 30 days with a calendar reminder to delete it. If ClaudeClaw had cratered on day two, I was one shell command away from being back on the old system in under a minute. Silent success is worse than loud failure. The design principle I pulled from this whole experience: every job in the system needs someone whose job it is to doubt that

    17 min

About

Tools and training for IT professionals. Join James Cruce—a systems engineer with 25+ years managing enterprise VMware infrastructure—for practical PowerCLI tutorials, automation tips, and lessons from the trenches. Much to learn, there always is. astgl.com