
/ writing

Notes on building AI products.

Essays on the parts that don't show up in the launch tweet. Eval rubrics, prompt design, agent loops, the math of running a model at scale, and the difference between a demo and something a user can rely on.

88 essays · RSS

/ start here

Two capstones to find your way in.

New here? These two pieces map the rest. Pick the one that matches how you read.

/ prompts

Prompts as API contracts

The words an LLM is allowed to use are themselves an interface. Treat them like one.

10 essays
  1. May 3 The AI's vocabulary is a hidden API contract Every word your LLM is allowed to say imposes obligations on the systems beneath it. Treat the prompt's vocabulary like an interface or pay for it later.
  2. Apr 28 Prompts as type signatures The quickest mental model improvement for prompt engineering: stop thinking of prompts as instructions, start thinking of them as type signatures for the model's output.
  3. Apr 27 System prompts that age well A system prompt is shipped code. It needs the same discipline. Here are the patterns that survive a year of model upgrades, prompt edits, and team turnover.
  4. Apr 26 Prompt versioning that doesn't suck Versioning prompts is harder than versioning code because the artifact is a string and the test suite is fuzzy. Here's the workflow that ships.
  5. Apr 25 When to ship a prompt change The decision rule that separates teams who ship prompt changes confidently from teams who hover their finger over the button.
  6. Apr 24 Few-shot design: the prompt technique that's underused in 2026 Few-shot examples are the most reliable way to shape model behavior. Most production prompts use them badly or skip them entirely. Here's how to use them well.
  7. Apr 23 Schema-first prompts: stop asking the model nicely Constrained generation, structured output APIs, and JSON schema have made prompt engineering more like API design and less like creative writing. Lean in.
  8. Apr 22 Personas in prompts: useful or theatre? Almost every system prompt starts with 'You are a helpful assistant.' Most personas in prompts are decorative. Here's when they actually move the needle, and when they're padding.
  9. Apr 21 Debugging LLM apps: the trace-everything approach You cannot debug what you cannot replay. The single highest-leverage habit in LLM engineering is making every model call inspectable after the fact.
  10. Apr 19 System, user, developer: which message goes where Modern LLM APIs distinguish between system, user, developer, and assistant roles. The rules for which content goes in which slot aren't intuitive. Here's the working model.

/ evals

LLM eval engineering

Test benches for AI features that catch the bugs you actually care about.

10 essays
  1. May 3 Human-in-the-loop evals: where it's still essential in 2026 Automated evals can do a lot, but not everything. Here's where humans still beat any LLM judge, and how to set up the human review loop without breaking the bank.
  2. May 2 What an LLM eval bench actually needs to do Most eval frameworks measure whether the model returned a string. Production eval benches measure whether shipping the change is safe. The gap is everything.
  3. May 2 Adversarial evals: what to break before users do The friendly cases will tell you the model usually works. The adversarial cases will tell you what happens when things go wrong. Most teams don't have enough of the second kind.
  4. May 1 Eval drift: when your bench stops measuring what you care about An eval bench can pass with flying colors while production quality declines. The gap is called eval drift, and it's the most common silent failure in LLM ops.
  5. Apr 30 The hidden cost of evals (and how to keep them affordable) Eval pipelines are easy to start and expensive to run at scale. Here's where the cost actually comes from and how to keep it under control without losing the safety net.
  6. Apr 29 Eval datasets that hold up over time Most eval datasets rot. The cases drift, the rubrics get stale, the bench becomes a museum piece. Here's how to build one that stays useful for years.
  7. Apr 28 Three kinds of evals: continuous, deep, and shadow Most teams treat 'evals' as one thing. The teams shipping reliable AI products run three distinct eval loops at different cadences. Here's the breakdown.
  8. Apr 27 The eval rubric is the work Most teams treat the eval rubric as paperwork. The teams shipping reliable LLM products treat the rubric as the actual product specification. Here's the difference.
  9. Apr 26 LLM-as-judge: what actually works in 2026 Using one LLM to grade another LLM's output is the most over-deployed and under-evaluated eval pattern in production. Here's when it works, when it fails, and how to use it well.
  10. Apr 25 The minimum viable eval bench (and why most teams skip it) Most LLM teams ship without a real eval bench. The reason isn't that benches are hard. It's that the first one feels too small to matter. Here's the smallest useful one.

/ agents

Agent architecture

Loops, tool calls, state, observability. The patterns that ship in production.

11 essays
  1. May 13 Agent cost control: where the money actually goes An agent that costs $0.10 per run becomes a $30K monthly bill at meaningful traffic. Here's where the cost concentrates and which controls keep it sustainable.
  2. May 12 Observability for agents: what to instrument from day one An agent without observability is a black box that occasionally produces output. Here's what to instrument, what to alert on, and what to keep out of your dashboards.
  3. May 11 Agent latency: where the seconds actually go An agent that takes 30 seconds to answer is unusable for most product surfaces. Here's where the time actually goes and which optimizations move the needle.
  4. May 10 Tool permissions for agents: the principle of least privilege An agent with the wrong tool permissions is a security incident waiting to happen. Here's the permission model that keeps agents capable without giving them the keys to everything.
  5. May 9 Evaluating agents: trajectory matters as much as outcome Eval frameworks for single-prompt LLM features don't translate cleanly to agents. Agents have process. The bench needs to grade the process, not just the result.
  6. May 8 Multi-agent vs single-agent: when the orchestra is worth it Multi-agent architectures look elegant in diagrams. In production, they're more often a tax than a benefit. Here's when the orchestra actually beats the soloist.
  7. May 7 The five most common agent failure modes (and how to fix each) Production agents fail in predictable ways. Knowing the patterns saves weeks of debugging. Here are the five I see most often and what actually fixes them.
  8. May 6 When to use an agent (and when not to) The 'agent' label has been applied to almost every LLM feature. Most of them shouldn't be agents. Here are the actual decision criteria.
  9. May 5 Agent state management: the part nobody writes about Most agent tutorials skip past the question of where state lives. In production, state management is half the work. Here's the model that scales.
  10. May 4 Tool design for agents: APIs the model can actually use An agent is only as good as the tools you give it. Most teams design tools the way they design APIs for other engineers, and pay for it. Here's the difference that matters.
  11. May 1 Agent loops are just function-call graphs Strip away the agent terminology and you're left with a graph of function calls with conditional edges. The patterns that ship treat them that way.

/ retrieval

Retrieval and RAG

The unsexy half of every AI product. Indexing, chunking, reranking, hybrid search.

12 essays
  1. Jun 14 Freshness in RAG: keeping the index in sync with the world A RAG system that returns yesterday's data on questions about today's reality is a liability. Keeping the index fresh is harder than it sounds. Here are the patterns.
  2. Jun 13 RAG with permissions: keeping users out of each other's data A multi-tenant RAG system has to enforce permissions at retrieval time, not after. Get this wrong and you have a data leak. Here's the architecture that holds up.
  3. Jun 12 Long context vs RAG: when to retrieve and when to stuff Modern models support 200K+ token contexts. Some say RAG is dead. The reality is more nuanced. Here's the framing for when each approach actually wins.
  4. Jun 11 Document preprocessing for RAG: garbage in, garbage out RAG systems are downstream of your document preprocessing. Bad text extraction, lost structure, broken tables: each one degrades retrieval. Here's the pipeline that matters.
  5. Jun 10 Choosing a vector database: the criteria that actually matter Vector DB choice gets discussed at length and decided poorly. Most teams pick by feature checklist; the actual tradeoffs are different. Here's the framework.
  6. Jun 9 Query rewriting: the underused RAG optimization User queries are not optimal retrieval queries. Rewriting the query before retrieval, often with an LLM, can dramatically improve recall. Most teams don't do it.
  7. Jun 8 Evaluating RAG: separating retrieval quality from answer quality Most teams evaluate the final answer their RAG system produces. That's necessary but not sufficient. Without evaluating retrieval separately, you can't tell what to fix.
  8. Jun 7 Reranking: the second-stage retrieval most teams skip First-pass retrieval is fast and noisy. A reranker on top cleans up the order in tens of milliseconds. Skipping it leaves quality on the table.
  9. Jun 6 Choosing an embedding model: the decision that compounds Your embedding model decision affects retrieval quality, cost, and the cost of every future migration. Most teams pick by leaderboard. Here's the decision that actually fits your product.
  10. Jun 5 Hybrid search: why pure vector retrieval isn't enough Vector search is great until it isn't. The cases it misses are the ones BM25 catches. Combining both is the right default for most production RAG, and it's not as hard as it looks.
  11. Jun 4 Chunking strategies that hold up in production How you split documents for retrieval is one of the highest-leverage RAG decisions and one of the most under-discussed. Here's the chunking playbook that actually works.
  12. Jun 3 Retrieval is the unsexy half of every AI product Generative AI gets the attention. Retrieval does the work. The teams shipping reliable AI products spend most of their effort on the indexing, chunking, and ranking that nobody writes about.

/ safety

AI safety and guardrails

Prompt injection, jailbreaks, content moderation. Engineering discipline, not philosophy.

11 essays
  1. Jun 25 Abuse detection for AI products: spotting bad actors at scale Some users will try to abuse your AI product. The volume of normal usage hides the abusive patterns until they're costly. Here's how to detect abuse without spying on legitimate users.
  2. Jun 24 Incident response for AI features: the playbook AI incidents look different from regular software incidents. The playbook is similar but with AI-specific steps. Here's the runbook I've seen teams use successfully.
  3. Jun 23 Audit trails for AI: who decided what, when When something goes wrong with an AI system, the audit trail is what tells you what happened. Most AI systems don't have one. Here's the structure that holds up under investigation.
  4. Jun 22 Designing refusal: how AI says no without alienating users Refusing user requests is part of every safe AI product. How the refusal is communicated determines whether users tolerate the limit or abandon the product. Here's the design.
  5. Jun 21 Hallucination mitigation: not 'fewer hallucinations' but 'no harmful ones' Eliminating hallucination is unrealistic. Preventing hallucinations from causing harm is achievable. Here's the reframing and the patterns that work.
  6. Jun 20 PII handling in LLM products: where the data actually goes AI products handle user data. Most teams don't have a clear picture of where PII flows in their stack. Here's the audit and the patterns that actually keep data safe.
  7. Jun 19 Jailbreak resistance: how production systems hold up Jailbreaks are attempts to make the AI ignore its constraints. They keep evolving. Defending against them requires more than the model's built-in resistance. Here's how.
  8. Jun 18 Content moderation for AI: the pipeline that holds up Models can produce content you don't want users to see. A moderation pipeline catches it before it reaches them. Here's the architecture and the patterns that work.
  9. Jun 17 Red-teaming your own AI: how to break it before users do The cheapest safety incident is the one you found yourself. Most teams don't red-team their AI products. Here's how to do it without a dedicated security team.
  10. Jun 16 Prompt injection: the actual threat model Prompt injection gets discussed as a generic risk. The actual threats are specific and the defenses are specific. Here's the threat model and the defenses that work.
  11. Jun 15 AI safety as engineering discipline, not philosophy Most AI safety conversations stay abstract. The teams shipping reliable AI products treat safety as concrete engineering: architecture, eval, instrumentation. Here's the discipline.

/ infra

AI infrastructure

Serving, deployment, MCP, GPU economics. The plumbing that decides if you scale.

10 essays
  1. Jul 5 Deploying AI changes safely: rollouts that don't surprise users AI deployments have unique risks. Standard CI/CD patterns leave gaps. Here's the rollout discipline that catches problems before they reach all users.
  2. Jul 4 Load testing AI features: what breaks first under load AI features fail differently under load than regular APIs. Standard load tests miss the failure modes that matter. Here's the load testing approach that finds real problems.
  3. Jul 3 Multi-region AI deployment: latency, residency, and reliability Once your AI product has users worldwide, single-region deployment hurts. Multi-region adds complexity but solves real problems. Here's the architecture that works.
  4. Jul 2 LLM caching layers: prompt cache, response cache, semantic cache Caching for LLM products has more layers than caching for regular APIs. Each layer has different tradeoffs. Here's the stack and the patterns that compound.
  5. Jul 1 Streaming LLM responses: the UX win that's harder than it looks Streaming the model's tokens to the user as they're generated dramatically improves perceived latency. The implementation has more gotchas than tutorials suggest.
  6. Jun 30 GPU economics for AI inference: where the money actually goes Self-hosting LLMs means renting GPUs. The cost calculation isn't just $/hour. Utilization, batching, quantization, and cold starts all change the picture. Here's the real math.
  7. Jun 29 Inference serving in 2026: vLLM, TGI, SGLang, and the choice that matters If you're self-hosting LLMs, the inference server is one of the highest-leverage choices. Here's the landscape and the criteria that actually drive the decision.
  8. Jun 28 Model Context Protocol (MCP): what it actually is and why it matters MCP is the protocol decoupling AI models from the tools and data they use. In 2026 it's becoming a baseline. Here's what it is and what to actually do about it.
  9. Jun 27 The LLM gateway pattern: one API for all your AI Calling LLM APIs directly from product code is fine until it isn't. The gateway pattern centralizes the cross-cutting concerns. Here's how to build one without overengineering.
  10. Jun 26 AI infrastructure: the boring layer that decides if you scale Prompts and models get attention. Infrastructure decides whether the product survives. Here's the infrastructure thinking that separates teams that scale from teams that don't.

/ economics

The napkin math of AI in production

Cost per request, latency budgets, when caching wins. The numbers nobody publishes.

11 essays
  1. May 23 Optimizing LLM spend after the bill is already big Most cost-optimization advice assumes you're starting from scratch. What if you already have a $100K/month bill and need to bring it down without breaking the product? Here's the order of operations.
  2. May 22 Batch vs realtime LLM workloads: pick the right surface Many LLM workloads that run synchronously in production should be running asynchronously, and vice versa. The cost and reliability difference is large. Here's the framing.
  3. May 21 Cost attribution for LLM features: knowing where your bill comes from An aggregate API bill tells you nothing about which features, users, or queries drive cost. Without attribution, you can't optimize. Here's the model that works.
  4. May 20 LLM build vs buy: the questions that actually matter Should you build your own model, fine-tune, host open-source, or call APIs? The decision depends on a few specific questions, and the answer is usually 'call APIs.'
  5. May 19 LLM rate limits: budgeting for the throughput you actually need Provider rate limits constrain what you can ship more often than they should. Most teams hit the limits at the wrong time and don't have a plan. Here's the planning framework.
  6. May 18 The cost of context: why bigger windows aren't free Long context windows let you stuff more into a prompt. They don't let you do it for free. The cost scales superlinearly with context size in ways that surprise teams.
  7. May 17 Pricing tiers for AI features: matching limits to economics Flat-rate AI pricing leaves you exposed to the heavy users. Pure pay-per-use is hostile to most users. The middle ground is tiers with clear limits, designed around your cost distribution.
  8. May 16 Model routing: spending the right amount of intelligence Not every request needs the frontier model. Routing requests to the right model tier is one of the highest-leverage cost optimizations and one of the most underused.
  9. May 15 Prompt caching: the optimization most teams underuse Modern LLM APIs let you cache the static parts of your prompt. Most teams enable it, then design prompts that defeat it. Here's how to get the actual savings.
  10. May 14 LLM unit economics: the math your CFO will eventually ask about Unit economics for LLM features look different from regular software unit economics. The variable costs are real, the gross margins can flip with usage patterns, and the questions are coming. Here's how to think about them.
  11. Apr 30 The economics of running an LLM agent at scale Napkin math for the unit cost of an AI feature: tokens, latency, caching, model routing, and the surprising line items nobody publishes.

/ product

AI product engineering

Building things people use, not demos that wow. Substance first, surface second.

13 essays
  1. Jul 7 Shipping AI products in 2026: the playbook by phase If I were building a new AI product today, here's the order I'd do it in. Phase by phase, with the specific decisions that matter at each stage.
  2. Jul 6 The AI engineering stack in 2026: a map of the discipline AI product engineering has become a real discipline with its own stack. Here's the map I've ended up with after 18 months of writing about each piece, and where each piece fits.
  3. Jun 2 Setting the quality bar for AI features: how good is good enough AI features are non-deterministic. They will make mistakes. The product question is how often, on which inputs, with what user-visible consequences. Here's the framework.
  4. Jun 1 Versioning AI products: who pays when behavior changes AI product behavior changes when models change. Users notice. The versioning model determines who absorbs the change. Get this wrong and your users feel like the product is randomly different.
  5. May 31 From AI demo to production: the gap is bigger than it looks A working AI demo is maybe 20% of the work. The other 80% is everything that makes it survive contact with real users. Here's the punch list.
  6. May 30 Team shapes for AI products: who owns what Building AI products requires combinations of skill that don't fit traditional team structures. Here's the team shape that actually works and the dysfunction patterns to avoid.
  7. May 29 Roadmapping AI products: planning for a moving foundation Traditional roadmaps assume the technology underneath is stable. AI products live on a substrate that changes every few months. Here's the planning approach that adapts.
  8. May 28 Building user trust in AI features AI features have a trust problem most software features don't. Users have learned to be skeptical. The features that earn trust do specific things. Here's the list.
  9. May 27 Measuring AI product success: which metrics actually mean something Most AI product dashboards track the wrong things. Engagement is misleading; AI-feature usage is decoration. Here are the metrics that actually tell you whether your AI feature is working.
  10. May 26 Feature flags for AI features: rolling out the unpredictable AI features fail differently from regular features. Standard rollout patterns leave you exposed to model regressions and traffic-driven failures. Here's the gating model that fits.
  11. May 25 Onboarding for AI products: setting expectations the model can meet First-touch experience determines whether users come back. AI products have a unique onboarding problem: managing expectations the model may or may not meet. Here's the playbook.
  12. May 24 AI features that disappear (and why that's the goal) The best AI features in 2026 don't have an 'AI' label. They're invisible improvements to existing flows. Here's why most AI-branded features fail and the disappearing ones succeed.
  13. Apr 29 Build the substance, then the surface Most AI product failures are LLM wrappers shipped before there's anything underneath worth wrapping. The hard part of an AI product is almost never the prompt.