
/ writing

Notes on building AI products.

Essays on the parts that don't show up in the launch tweet. Eval rubrics, prompt design, agent loops, the math of running a model at scale, and the difference between a demo and something a user can rely on.

53 essays · RSS

/ prompts

Prompts as API contracts

The words an LLM is allowed to use are themselves an interface. Treat them like one.

10 essays
  1. May 3 The AI's vocabulary is a hidden API contract Every word your LLM is allowed to say imposes obligations on the systems beneath it. Treat the prompt's vocabulary like an interface or pay for it later.
  2. Apr 28 Prompts as type signatures The quickest mental model improvement for prompt engineering: stop thinking of prompts as instructions, start thinking of them as type signatures for the model's output.
  3. Apr 27 System prompts that age well A system prompt is shipped code. It needs the same discipline. Here are the patterns that survive a year of model upgrades, prompt edits, and team turnover.
  4. Apr 26 Prompt versioning that doesn't suck Versioning prompts is harder than versioning code because the artifact is a string and the test suite is fuzzy. Here's the workflow that ships.
  5. Apr 25 When to ship a prompt change The decision rule that separates teams who ship prompt changes confidently from teams who hover their finger over the button.
  6. Apr 24 Few-shot design: the prompt technique that's underused in 2026 Few-shot examples are the most reliable way to shape model behavior. Most production prompts use them badly or skip them entirely. Here's how to use them well.
  7. Apr 23 Schema-first prompts: stop asking the model nicely Constrained generation, structured output APIs, and JSON schema have made prompt engineering more like API design and less like creative writing. Lean in.
  8. Apr 22 Personas in prompts: useful or theatre? Almost every system prompt starts with 'You are a helpful assistant.' Most personas in prompts are decorative. Here's when they actually move the needle, and when they're padding.
  9. Apr 21 Debugging LLM apps: the trace-everything approach You cannot debug what you cannot replay. The single highest-leverage habit in LLM engineering is making every model call inspectable after the fact.
  10. Apr 19 System, user, developer: which message goes where Modern LLM APIs distinguish between system, user, developer, and assistant roles. The rules for which content goes in which slot aren't intuitive. Here's the working model.

/ evals

LLM eval engineering

Test benches for AI features that catch the bugs you actually care about.

10 essays
  1. May 3 Human-in-the-loop evals: where it's still essential in 2026 Automated evals can do a lot, but not everything. Here's where humans still beat any LLM judge, and how to set up the human review loop without breaking the bank.
  2. May 2 What an LLM eval bench actually needs to do Most eval frameworks measure whether the model returned a string. Production eval benches measure whether shipping the change is safe. The gap is everything.
  3. May 2 Adversarial evals: what to break before users do The friendly cases will tell you the model usually works. The adversarial cases will tell you what happens when things go wrong. Most teams don't have enough of the second kind.
  4. May 1 Eval drift: when your bench stops measuring what you care about An eval bench can pass with flying colors while production quality declines. The gap is called eval drift, and it's the most common silent failure in LLM ops.
  5. Apr 30 The hidden cost of evals (and how to keep them affordable) Eval pipelines are easy to start and expensive to run at scale. Here's where the cost actually comes from and how to keep it under control without losing the safety net.
  6. Apr 29 Eval datasets that hold up over time Most eval datasets rot. The cases drift, the rubrics get stale, the bench becomes a museum piece. Here's how to build one that stays useful for years.
  7. Apr 28 Three kinds of evals: continuous, deep, and shadow Most teams treat 'evals' as one thing. The teams shipping reliable AI products run three distinct eval loops at different cadences. Here's the breakdown.
  8. Apr 27 The eval rubric is the work Most teams treat the eval rubric as paperwork. The teams shipping reliable LLM products treat the rubric as the actual product specification. Here's the difference.
  9. Apr 26 LLM-as-judge: what actually works in 2026 Using one LLM to grade another LLM's output is the most over-deployed and under-evaluated eval pattern in production. Here's when it works, when it fails, and how to use it well.
  10. Apr 25 The minimum viable eval bench (and why most teams skip it) Most LLM teams ship without a real eval bench. The reason isn't that benches are hard. It's that the first one feels too small to matter. Here's the smallest useful one.

/ agents

Agent architecture

Loops, tool calls, state, observability. The patterns that ship in production.

11 essays
  1. May 13 Agent cost control: where the money actually goes An agent that costs $0.10 per run becomes a $30K monthly bill at meaningful traffic. Here's where the cost concentrates and which controls keep it sustainable.
  2. May 12 Observability for agents: what to instrument from day one An agent without observability is a black box that occasionally produces output. Here's what to instrument, what to alert on, and what to keep out of your dashboards.
  3. May 11 Agent latency: where the seconds actually go An agent that takes 30 seconds to answer is unusable for most product surfaces. Here's where the time actually goes and which optimizations move the needle.
  4. May 10 Tool permissions for agents: the principle of least privilege An agent with the wrong tool permissions is a security incident waiting to happen. Here's the permission model that keeps agents capable without giving them the keys to everything.
  5. May 9 Evaluating agents: trajectory matters as much as outcome Eval frameworks for single-prompt LLM features don't translate cleanly to agents. Agents have process. The bench needs to grade the process, not just the result.
  6. May 8 Multi-agent vs single-agent: when the orchestra is worth it Multi-agent architectures look elegant in diagrams. In production, they're more often a tax than a benefit. Here's when the orchestra actually beats the soloist.
  7. May 7 The five most common agent failure modes (and how to fix each) Production agents fail in predictable ways. Knowing the patterns saves weeks of debugging. Here are the five I see most often and what actually fixes them.
  8. May 6 When to use an agent (and when not to) The 'agent' label has been applied to almost every LLM feature. Most of them shouldn't be agents. Here are the actual decision criteria.
  9. May 5 Agent state management: the part nobody writes about Most agent tutorials skip past the question of where state lives. In production, state management is half the work. Here's the model that scales.
  10. May 4 Tool design for agents: APIs the model can actually use An agent is only as good as the tools you give it. Most teams design tools the way they design APIs for other engineers, and pay for it. Here's the difference that matters.
  11. May 1 Agent loops are just function-call graphs Strip away the agent terminology and you're left with a graph of function calls with conditional edges. The patterns that ship treat them that way.

/ economics

The napkin math of AI in production

Cost per request, latency budgets, when caching wins. The numbers nobody publishes.

11 essays
  1. May 23 Optimizing LLM spend after the bill is already big Most cost-optimization advice assumes you're starting from scratch. What if you already have a $100K/month bill and need to bring it down without breaking the product? Here's the order of operations.
  2. May 22 Batch vs realtime LLM workloads: pick the right surface Many LLM workloads that run synchronously in production should be running asynchronously, and vice versa. The cost and reliability difference is large. Here's the framing.
  3. May 21 Cost attribution for LLM features: knowing where your bill comes from An aggregate API bill tells you nothing about which features, users, or queries drive cost. Without attribution, you can't optimize. Here's the model that works.
  4. May 20 LLM build vs buy: the questions that actually matter Should you build your own model, fine-tune, host open-source, or call APIs? The decision depends on a few specific questions, and the answer is usually 'call APIs.'
  5. May 19 LLM rate limits: budgeting for the throughput you actually need Provider rate limits constrain what you can ship more often than they should. Most teams hit the limits at the wrong time and don't have a plan. Here's the planning framework.
  6. May 18 The cost of context: why bigger windows aren't free Long context windows let you stuff more into a prompt. They don't let you do it for free. The cost scales superlinearly with context size in ways that surprise teams.
  7. May 17 Pricing tiers for AI features: matching limits to economics Flat-rate AI pricing leaves you exposed to the heavy users. Pure pay-per-use is hostile to most users. The middle ground is tiers with clear limits, designed around your cost distribution.
  8. May 16 Model routing: spending the right amount of intelligence Not every request needs the frontier model. Routing requests to the right model tier is one of the highest-leverage cost optimizations and one of the most underused.
  9. May 15 Prompt caching: the optimization most teams underuse Modern LLM APIs let you cache the static parts of your prompt. Most teams enable it, then design prompts that defeat it. Here's how to get the actual savings.
  10. May 14 LLM unit economics: the math your CFO will eventually ask about Unit economics for LLM features look different from regular software unit economics. The variable costs are real, the gross margins can flip with usage patterns, and the questions are coming. Here's how to think about them.
  11. Apr 30 The economics of running an LLM agent at scale Napkin math for the unit cost of an AI feature: tokens, latency, caching, model routing, and the surprising line items nobody publishes.

/ product

AI product engineering

Building things people use, not demos that wow. Substance first, surface second.

11 essays
  1. Jun 2 Setting the quality bar for AI features: how good is good enough AI features are non-deterministic. They will make mistakes. The product question is how often, on which inputs, with what user-visible consequences. Here's the framework.
  2. Jun 1 Versioning AI products: who pays when behavior changes AI product behavior changes when models change. Users notice. The versioning model determines who absorbs the change. Get this wrong and users feel like the product changes at random.
  3. May 31 From AI demo to production: the gap is bigger than it looks A working AI demo is maybe 20% of the work. The other 80% is everything that makes it survive contact with real users. Here's the punch list.
  4. May 30 Team shapes for AI products: who owns what Building AI products requires combinations of skills that don't fit traditional team structures. Here's the team shape that actually works and the dysfunction patterns to avoid.
  5. May 29 Roadmapping AI products: planning for a moving foundation Traditional roadmaps assume the technology underneath is stable. AI products live on a substrate that changes every few months. Here's the planning approach that adapts.
  6. May 28 Building user trust in AI features AI features have a trust problem most software features don't. Users have learned to be skeptical. The features that earn trust do specific things. Here's the list.
  7. May 27 Measuring AI product success: which metrics actually mean something Most AI product dashboards track the wrong things. Engagement is misleading; AI-feature usage is decoration. Here are the metrics that actually tell you whether your AI feature is working.
  8. May 26 Feature flags for AI features: rolling out the unpredictable AI features fail differently from regular features. Standard rollout patterns leave you exposed to model regressions and traffic-driven failures. Here's the gating model that fits.
  9. May 25 Onboarding for AI products: setting expectations the model can meet First-touch experience determines whether users come back. AI products have a unique onboarding problem: managing expectations the model may or may not meet. Here's the playbook.
  10. May 24 AI features that disappear (and why that's the goal) The best AI features in 2026 don't have an 'AI' label. They're invisible improvements to existing flows. Here's why most AI-branded features fail and the disappearing ones succeed.
  11. Apr 29 Build the substance, then the surface Most AI product failures are LLM wrappers shipped before there's anything underneath worth wrapping. The hard part of an AI product is almost never the prompt.