## Introduction: The Context Length Divide

When you look at the landscape of large language models today, there’s a striking gap that separates the leaders from the rest. Open-source models often hit a wall around 32,000 tokens, while industry giants like OpenAI and Anthropic casually discuss 128K and even million-token context windows.

It’s tempting to assume this is simply a matter of having deeper pockets and more GPUs. But the reality is far more interesting—and more fundamental. This divide isn’t about resources. It’s about solving an architectural challenge that has plagued transformer models since their inception: the O(n²) bottleneck.

If you’re preparing for AI/ML interviews or building systems at scale, understanding this bottleneck—and how it’s being overcome—is essential.

## The Memory Misconception

Here’s a common assumption: when a model struggles with long inputs, it must be running out of memory (RAM). That seems logical, right? More text means more data to store, so naturally you’d need more memory.

But that’s not where the real problem lies. The actual bottleneck is computational, and it’s deeply embedded in the transformer architecture itself. Specifically, it lives in the self-attention mechanism.

## Understanding the O(n²) Problem

### What is Self-Attention?

At the heart of every transformer is the self-attention mechanism. This is what allows the model to understand relationships between different parts of the input.

Here’s how it works: for every single token in your input sequence, the model calculates an attention score with every other token. If your sequence has length n, that means:

- Token 1 must attend to tokens 1, 2, 3, ... n
- Token 2 must attend to tokens 1, 2, 3, ... n
- Token 3 must attend to tokens 1, 2, 3, ... n
- And so on...

The total number of computations? n × n = n². This is what engineers call **O(n²) complexity** or quadratic scaling. And this is the wall.

### Putting Numbers to the Pain

Let’s make this concrete with real examples.

**At 8,000 tokens:**

- Attention computations needed: 8,000 × 8,000 = 64 million

**At 1 million tokens:**

- You might think: “That’s about 125x longer, so maybe 125x more compute?”
- Actually: 1,000,000 × 1,000,000 = **1 trillion computations**, roughly 15,625x more than the 8K case

The cost doesn’t scale linearly—it explodes quadratically. This isn’t just slower; it becomes mathematically impractical.

### It’s Not Just Compute Time

The problem extends beyond processing speed. The system must also construct and store an n×n attention matrix in memory—a massive grid holding scores for every possible token-to-token relationship. For that million-token example, you’re trying to allocate space for and calculate **one trillion floating-point values** just to map out the attention patterns.

This is why most implementations using standard methods get stuck around 32K tokens. Beyond that point, it becomes both computationally and physically prohibitive.

## Breaking the Quadratic Wall: Sliding Window Attention

So how are the industry leaders achieving 128K or even million-token contexts? They’re not “cheating” the math exactly—they’re using clever architectural innovations. The primary technique is called **Sliding Window Attention (SWA)**.

### The Core Idea

Instead of forcing every token to attend to the entire sequence, Sliding Window Attention restricts each token to only attend to a fixed-size local window of nearby tokens.
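To make the contrast concrete before we quantify it, here is a minimal NumPy sketch. The function names and toy sizes are illustrative choices rather than any library’s API, and for clarity the demo still builds the full score matrix and masks it afterwards, which a real implementation would avoid:

```python
import numpy as np

def full_attention_scores(q, k):
    """Standard self-attention scores: every token against every other token.
    For a sequence of length n this materializes an n x n matrix, the O(n^2) cost."""
    d = q.shape[-1]
    return q @ k.T / np.sqrt(d)  # shape: (n, n)

def sliding_window_mask(n, window):
    """True where token i may attend to token j, i.e. |i - j| <= window.
    A causal variant would additionally mask out future tokens."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= window  # shape: (n, n)

n, d, window = 16, 8, 4
rng = np.random.default_rng(0)
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))

scores = full_attention_scores(q, k)                        # 16 x 16 = 256 scores
masked = np.where(sliding_window_mask(n, window), scores, -np.inf)

# Each row keeps at most 2*window + 1 = 9 finite entries instead of n = 16,
# and that per-token count stays fixed no matter how long the sequence gets.
print(np.isfinite(masked).sum(axis=1))
```

The detail to notice is the per-row count printed at the end: with the window in place, each token scores against a bounded number of neighbors instead of the whole sequence, and that bound does not grow as the sequence gets longer.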
If we call this window size W, the transformation is dramatic:

- **Before (standard attention):** O(n²)
- **After (sliding window):** O(n·W)

The complexity becomes linear in sequence length! Since W is a constant (typically a few hundred to a few thousand tokens), the scaling problem is fundamentally solved.

### The Global Understanding Challenge

But wait—if token 1 can only see the first 50 or so tokens, and token 1000 can only see tokens 950-1050, how does the model build a global understanding of a long document? How does information from the beginning reach the end?

This is where the elegance of the solution becomes apparent.

### Learning from Computer Vision

The insight comes from Convolutional Neural Networks (CNNs) used in image processing. A CNN starts by looking at tiny patches of pixels—very local views. But as information flows through dozens of layers, these local features combine and recombine, eventually building a complete understanding of the entire image.

Sliding Window Attention works the same way:

1. **Layer 1:** Each token attends to its local window
2. **Layer 2:** Information has already propagated one hop, effectively doubling the receptive field
3. **Layer 3:** Information reaches even further
4. **Layer 30+:** With deep enough networks (modern transformers often have 30-40+ layers), local information has propagated across the entire sequence

Information from the start gradually influences the end through this cascade of local interactions across many layers.

## Practical Implementation: Real-World Trade-offs

Theory is elegant, but building actual systems requires pragmatic compromises.

### The Chunking Optimization

To make Sliding Window Attention run efficiently on real hardware (GPUs), implementations typically split the query and key matrices into overlapping chunks, then perform attention within those chunks. This approach might compute slightly more than the theoretical minimum—potentially about 2x the absolute minimum operations. But here’s the critical trade-off:

**Without this optimization (computing only the exact in-window scores, token by token):**

- Theoretically minimal operations
- Irregular, scattered memory access that GPUs execute poorly
- Impractically slow at the context lengths we actually care about

**With this optimization:**

- Slightly more operations than the theoretical minimum
- Actually runnable on a single high-end GPU
- Scales to million-token contexts

This is systems design in action: finding the practical compromise that makes the impossible possible.
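To see what that compromise looks like in code, here is a minimal NumPy sketch of the chunked approach. The function name, the default chunk size, and the bidirectional window are simplifying assumptions for illustration, not the exact scheme any production kernel uses:

```python
import numpy as np

def chunked_sliding_window_attention(q, k, v, window, chunk=128):
    """Sliding-window attention computed one query chunk at a time.

    Instead of an n x n score matrix, each chunk of queries only touches the
    slice of keys/values its window can reach, so peak memory is roughly
    chunk x (chunk + 2*window) regardless of sequence length n.
    (Bidirectional window for simplicity; a causal version would also mask
    out future positions.)
    """
    n, d = q.shape
    out = np.zeros_like(q)

    for start in range(0, n, chunk):
        end = min(start + chunk, n)
        # Keys/values that any query in this chunk could possibly see.
        k_lo, k_hi = max(0, start - window), min(n, end + window)

        scores = q[start:end] @ k[k_lo:k_hi].T / np.sqrt(d)

        # Mask positions outside each query's +/- window. Some of the scores
        # just computed get thrown away here: that is the "slightly more than
        # the theoretical minimum" overhead traded for dense, GPU-friendly
        # blocks of work.
        q_pos = np.arange(start, end)[:, None]
        k_pos = np.arange(k_lo, k_hi)[None, :]
        scores = np.where(np.abs(q_pos - k_pos) <= window, scores, -np.inf)

        # Softmax over the small local slice, then mix the values.
        scores -= scores.max(axis=-1, keepdims=True)
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:end] = weights @ v[k_lo:k_hi]

    return out

# Toy usage: 4,096 tokens with a 256-token window; a 4096 x 4096 score matrix
# is never allocated.
rng = np.random.default_rng(0)
n, d = 4096, 64
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
print(chunked_sliding_window_attention(q, k, v, window=256).shape)  # (4096, 64)
```

Production GPU kernels fuse these steps and tile far more aggressively, but the design choice is the same: accept a little redundant arithmetic in exchange for dense, regular blocks of work that keep memory and compute linear in sequence length.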
## Key Takeaways for Interviews and Practice

If you’re preparing for AI/ML systems interviews or building LLM applications, here’s what you need to know:

1. **The Bottleneck is Architectural, Not Resource-Based**
   - It’s not primarily about having more GPUs or RAM
   - The O(n²) self-attention mechanism is the fundamental barrier
   - Solutions require architectural innovations, not just scaling hardware
2. **Understand the Math**
   - Standard attention: O(n²) complexity
   - Sliding Window Attention: O(n·W) complexity
   - Know how to calculate and explain the difference
   - Be ready to discuss real numbers (8K vs 1M tokens)
3. **Local to Global Information Flow**
   - SWA achieves global understanding through layer depth
   - Similar to how CNNs build from local to global features
   - Information propagates across the sequence through multiple layers
4. **Practical vs Theoretical Optimization**
   - Real implementations make pragmatic trade-offs
   - 2x theoretical operations might enable 1000x practical speedup
   - Hardware efficiency often requires chunking and overlap strategies
5. **Interview-Ready Explanation**
   - Be able to explain: “The O(n²) bottleneck comes from self-attention requiring every token to attend to every other token. Sliding Window Attention solves this by restricting attention to local windows, reducing complexity to O(n·W). Through deep layer stacking, local information propagates to achieve global understanding while maintaining linear scaling.”

## The Provocative Question

We’ve spent this entire article understanding how models can achieve million-token context windows. The architectural innovations are real, and the barriers have been broken. But now for the thought-provoking question that every systems designer should consider:

**Now that we can break the ceiling—do we actually need a million tokens for most real-world applications?**

There’s a tension between what’s technically possible and what’s practically necessary. Yes, we’ve solved how to scale to enormous context windows. But consider:

- What percentage of actual use cases require >100K tokens?
- Are we solving for the 1% case or the 99% case?
- What are the costs (latency, compute, energy) of maintaining these massive windows?
- Could smarter retrieval mechanisms achieve better results at lower cost?

Understanding the capability is crucial. But understanding when to use it—and when not to—is equally important for building practical systems.

## Conclusion

The great divide in LLM context lengths isn’t about money or hardware—it’s about understanding and solving the O(n²) bottleneck through architectural innovation. Sliding Window Attention represents a fundamental breakthrough: using local attention with deep layer propagation to achieve linear scaling while maintaining global understanding.

For AI practitioners and interview candidates, this isn’t just a theoretical curiosity. It’s a core systems design principle that separates those who can build scalable LLM applications from those who can’t.

The next time you see a model advertising million-token context, you’ll know exactly what mathematical barrier they overcame to get there—and more importantly, you’ll know whether you actually need it.

Get full access to Agentic Engineering Weekly at shreyassk.substack.com/subscribe