● ADVISORY LABS FIELD NOTES FIELD: APPLIED AI 05/19/2026
// GENERAL · PLAYBOOK

The 7-specialist review pipeline that keeps AI features from breaking in production

A reusable playbook: seven structured review passes before any AI feature ships. Catches the failures that pilot-stage AI code carries into production.

Isai Guerrero May 4, 2026 9 min read

The most common failure mode I see in shipped AI features is not “the model is wrong.” It’s “the model was right in the demo, and then something around the model failed in production.” Research on LLM system failure modes identifies fifteen distinct classes — version drift, context-boundary degradation, incorrect tool invocation — that standard benchmarks miss entirely.

The way to catch those failures is structured review. Not “give it a once-over.” A specific, repeatable pass with seven distinct lenses, each one looking for the failure modes the others won’t catch.

Here’s the pipeline I run on every AI feature before merge.

The seven specialists

Each specialist is a focused review prompt with a checklist. I run them as Claude subagents from the main session — each gets the diff plus the relevant context (schema, config, related code), and returns a written review.
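One way to picture a specialist is as a small data record plus a prompt builder. The sketch below is a guess at the shape, not the actual implementation: `Specialist`, `ReviewInput`, and `buildReviewPrompt` are hypothetical names, and the prompt wording is illustrative only.

```typescript
// Hypothetical shapes; none of these names come from a real library.
interface Specialist {
  name: string;        // e.g. "security"
  checklist: string[]; // the questions this lens asks
}

interface ReviewInput {
  diff: string;                    // the PR diff under review
  context: Record<string, string>; // label -> content (schema, config, related code)
}

// Build the prompt a single specialist subagent would receive.
function buildReviewPrompt(s: Specialist, input: ReviewInput): string {
  const checklist = s.checklist.map((q, i) => `${i + 1}. ${q}`).join("\n");
  const context = Object.keys(input.context)
    .map((label) => `### ${label}\n${input.context[label]}`)
    .join("\n\n");
  return [
    `You are the ${s.name} reviewer. Review only through this lens.`,
    `Checklist:\n${checklist}`,
    `Context:\n${context}`,
    `Diff:\n${input.diff}`,
    "Finish with exactly one verdict: PASS, ADDRESS_AND_MERGE, or BLOCK.",
  ].join("\n\n");
}
```

Keeping the checklist as data rather than burying it in one long prompt string makes it easy to version and to edit per stack.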

They’re called specialists because their attention is narrow. A generalist review misses the thing only an observability-trained eye sees. In my experience, a pile of generalist reviews catches roughly half the issues a single specialist catches in its own domain.

1. Architecture

Does the change fit the existing layers? Is logic in the right place — in the model adapter, in the agent, in the orchestrator, in the UI? Does it introduce a new layer when an existing one would do?

The trap this catches: AI-feature PRs that smuggle business logic into prompt strings, where refactors and grep searches never find it.
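A minimal sketch of the alternative, assuming an eligibility-style rule: the decision lives in typed code and the prompt only explains it. The rule values and the names `EligibilityRules`, `isEligible`, and `renderExplainPrompt` are invented for illustration.

```typescript
// Hypothetical rule values, chosen only for illustration.
interface EligibilityRules {
  maxAmount: number;
  minAgeYears: number;
}

const RULES: EligibilityRules = { maxAmount: 5000, minAgeYears: 18 };

// The decision is made in code, so tests, grep, and refactors all cover it.
function isEligible(amount: number, ageYears: number, rules: EligibilityRules): boolean {
  return amount <= rules.maxAmount && ageYears >= rules.minAgeYears;
}

// The prompt only explains a decision that code already made.
function renderExplainPrompt(amount: number, ageYears: number, rules: EligibilityRules): string {
  const outcome = isEligible(amount, ageYears, rules) ? "eligible" : "not eligible";
  return [
    `The applicant is ${outcome}.`,
    `Limits in force: maximum amount ${rules.maxAmount}, minimum age ${rules.minAgeYears}.`,
    "Explain this outcome to the applicant in two sentences. Do not change the outcome.",
  ].join("\n");
}
```

An architecture reviewer would flag the inverse: a prompt that says "approve if the amount is under 5000" with no matching constant anywhere in the codebase.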

2. Data model

Does the migration preserve referential integrity? Are new columns nullable with safe defaults? Are indexes added for new query paths? Are the types in the DB, the API schema, and the TS interface aligned?

The trap this catches: AI features that work in dev because the dev DB has 50 rows and collapse in prod because they introduced an unindexed full-table scan.
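The type-alignment question can be partly mechanized. The sketch below assumes you can describe both the DB columns and the API fields in one simple shape; `Shape`, `FieldSpec`, and `findDrift` are hypothetical names, and a real check would read these from your migration tool and schema definitions.

```typescript
type FieldKind = "string" | "number" | "boolean";

interface FieldSpec {
  kind: FieldKind;
  nullable: boolean;
}

type Shape = Record<string, FieldSpec>;

function has(shape: Shape, name: string): boolean {
  return Object.prototype.hasOwnProperty.call(shape, name);
}

// List every place where the DB description and the API description disagree.
function findDrift(db: Shape, api: Shape): string[] {
  const problems: string[] = [];
  for (const name of Object.keys(db)) {
    if (!has(api, name)) {
      problems.push(`${name}: in DB but not in API`);
      continue;
    }
    const d = db[name];
    const a = api[name];
    if (d.kind !== a.kind) problems.push(`${name}: DB is ${d.kind}, API is ${a.kind}`);
    if (d.nullable !== a.nullable) problems.push(`${name}: nullability differs`);
  }
  for (const name of Object.keys(api)) {
    if (!has(db, name)) problems.push(`${name}: in API but not in DB`);
  }
  return problems;
}
```

This covers only shape drift. Index coverage and lock behavior on large tables still need a reviewer who reads the migration itself.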

3. AI correctness

Are the prompts versioned? Is there a contract test that the model returns the expected JSON shape? Is there an anti-fabrication rule? What’s the behavior when the model returns “unknown” or refuses? What’s the fallback when the model is down?

The trap this catches: features that pass golden-path tests and fail catastrophically on the 5% of inputs where the model returns malformed output.
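A contract check for the JSON shape can be a few lines of validation in front of the caller. This is a sketch under assumptions: the `Enrichment` contract and the fallback value are hypothetical, and a real project might use a schema library instead of hand-written checks.

```typescript
// Hypothetical output contract for an enrichment feature.
interface Enrichment {
  summary: string;
  confidence: number; // expected range 0..1
}

// Validate raw model text against the contract instead of trusting it.
// Returns null for refusals, malformed JSON, or out-of-contract values.
function parseEnrichment(raw: string): Enrichment | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // refusal text, truncated output, stray prose
  }
  if (typeof data !== "object" || data === null) return null;
  const obj = data as Record<string, unknown>;
  const summary = obj.summary;
  const confidence = obj.confidence;
  if (typeof summary !== "string" || summary.length === 0) return null;
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) return null;
  return { summary, confidence };
}

// Explicit fallback so callers never see malformed output.
const FALLBACK: Enrichment = { summary: "unknown", confidence: 0 };

function enrichOrFallback(raw: string): Enrichment {
  return parseEnrichment(raw) ?? FALLBACK;
}
```

The contract test then feeds this function recorded bad outputs (refusals, truncated JSON, wrong types) as well as the golden path.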

4. Security

Are user-controlled fields properly escaped before being passed to the model? Are model outputs treated as untrusted before being rendered, executed, or stored? Are there authentication / authorization checks on the agent’s tool calls? Are PII fields properly redacted from logs?

The trap this catches: prompt injection from user inputs, model outputs interpreted as trusted SQL or shell commands, leaked PII in transcript logs.
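Two of these checks are easy to sketch: a deny-by-default gate on tool calls and a redaction pass before logging. Everything named here is an assumption for illustration: the tool names, the role model, and the two regex patterns, which are far from complete PII coverage.

```typescript
// Hypothetical tool-call shape; real agent frameworks differ.
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}

interface Caller {
  userId: string;
  roles: string[];
}

// Tool name -> role required to invoke it. Anything not listed is denied.
const TOOL_POLICY: Record<string, string> = {
  lookup_eligibility: "agent",
  issue_refund: "admin",
};

// The model proposes a tool call; the caller's permissions decide whether it runs.
function authorizeToolCall(call: ToolCall, caller: Caller): boolean {
  // hasOwnProperty guards against names like "constructor" matching inherited keys.
  if (!Object.prototype.hasOwnProperty.call(TOOL_POLICY, call.tool)) return false;
  return caller.roles.indexOf(TOOL_POLICY[call.tool]) !== -1;
}

// Illustrative redaction: emails and US-style SSNs only.
function redactForLog(text: string): string {
  return text
    .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "[email]")
    .replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[ssn]");
}
```

The point of the gate is that authorization depends on who the user is, never on what the model asked for.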

5. Performance

What’s the worst-case latency? What’s the request budget per user action? Are model calls in a critical render path, or async? Are there caches, and are they invalidated correctly? What happens at 10× the current traffic?

The trap this catches: features whose latency spikes under load because every render triggers a 4-second model call.
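For the cache question, the reviewer mostly wants to see an explicit expiry and a way to test it. Below is a minimal time-to-live wrapper, assuming a synchronous loader and an injectable clock; `cached` is a hypothetical helper, and a production version would also need async handling and per-key storage.

```typescript
// Wrap an expensive loader so calls within ttlMs reuse the last value.
// `now` is injectable so expiry can be tested without waiting.
function cached<T>(load: () => T, ttlMs: number, now: () => number = Date.now): () => T {
  let entry: { value: T; expiresAt: number } | null = null;
  return () => {
    const t = now();
    if (entry !== null && t < entry.expiresAt) {
      return entry.value; // still fresh: skip the expensive call
    }
    entry = { value: load(), expiresAt: t + ttlMs };
    return entry.value;
  };
}
```

Usage would look like `const getCaps = cached(loadCaps, 60_000)`, where `loadCaps` is whatever slow call you are protecting. Making the TTL a visible argument forces the question of how stale the value is allowed to be.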

6. Observability

What’s logged? Are model inputs and outputs sampled to a transcript store? Is there a dashboard for token spend, latency, error rate? Are alerts wired to PagerDuty for silent-failure modes (model returns success but garbage)?

The trap this catches: features that “work” for six weeks before someone notices the output quality has been silently degrading.
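Two building blocks cover much of this checklist: deterministic transcript sampling and a rate of invalid outputs to alert on. This is a sketch with hypothetical names (`shouldSample`, `invalidRate`); the hash is a simple string hash chosen for brevity, and the alert threshold is yours to pick.

```typescript
// Deterministic sampling: hash the request id so the same request
// always gets the same sampling decision, including on retries.
function shouldSample(requestId: string, rate: number): boolean {
  let h = 0;
  for (let i = 0; i < requestId.length; i++) {
    h = (h * 31 + requestId.charCodeAt(i)) >>> 0; // keep within unsigned 32 bits
  }
  return h / 4294967296 < rate; // h / 2^32 falls in [0, 1)
}

// Share of outputs that fail a validity check. The model call "succeeded"
// for all of them, so this is the signal for silent failures.
function invalidRate(outputs: string[], isValid: (output: string) => boolean): number {
  if (outputs.length === 0) return 0;
  let bad = 0;
  for (const output of outputs) {
    if (!isValid(output)) bad += 1;
  }
  return bad / outputs.length;
}
```

An alert rule would then compare `invalidRate` over a recent window against a threshold, using the same validator the contract tests use.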

7. Rollback

What does rollback look like? Is the migration reversible? Is the prompt change feature-flagged? Can the new agent be turned off without redeploying? What’s the escalation path during an active incident?

The trap this catches: prompt changes that ship to all users at once with no kill switch, requiring an emergency revert PR at 11 p.m.
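A kill switch for a prompt change can be as small as the selector below. It assumes the flags are read at request time from some runtime config store, which is what makes the switch work without a redeploy; `PromptFlags`, `bucket`, and `selectPrompt` are hypothetical names.

```typescript
interface PromptFlags {
  killSwitch: boolean;    // true -> every user gets the stable prompt
  rolloutPercent: number; // 0..100, share of users on the candidate prompt
}

// Stable bucket in [0, 100) so a user stays on the same prompt version
// from one request to the next.
function bucket(userId: string): number {
  let h = 0;
  for (let i = 0; i < userId.length; i++) {
    h = (h * 31 + userId.charCodeAt(i)) >>> 0;
  }
  return h % 100;
}

function selectPrompt(userId: string, flags: PromptFlags): "stable" | "candidate" {
  if (flags.killSwitch) return "stable"; // rollback path: flip one flag
  return bucket(userId) < flags.rolloutPercent ? "candidate" : "stable";
}
```

With this in place, rollback during an incident is a flag flip, and a staged rollout is a matter of raising `rolloutPercent`.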

How the pass works

Concretely, in the workflow:

  1. PR is opened with the AI feature changes.
  2. Pre-commit hook runs the unit tests and type-checks.
  3. Author runs pnpm review — kicks off the seven specialists in parallel, each one reviewing the diff with its specialized checklist.
  4. Outputs land as PR comments. Each specialist returns either ✅ Pass, ⚠ Address-and-merge, or ❌ Block.
  5. Any ❌ blocks the merge until addressed. ⚠ items get addressed in the same PR if cheap, or filed as follow-up issues if not.
  6. Author addresses the feedback. Re-runs the pass if any specialist initially blocked.
  7. On all-green, the PR is mergeable.

The whole pass takes 4–8 minutes wall-clock for a typical feature PR. Author wait-time is roughly the time it takes to grab coffee.
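The merge rule in steps 4 through 7 can be sketched as a pure function over the returned verdicts. This is one reading of the rules above, not the actual tooling: the type names are hypothetical, and treating a specialist that never reported as blocking is an added assumption.

```typescript
type Verdict = "pass" | "address" | "block";

interface Review {
  specialist: string;
  verdict: Verdict;
  notes: string;
}

interface Decision {
  mergeable: boolean;
  blockers: string[];  // specialists that returned a block
  followUps: string[]; // address-and-merge items: fix in this PR or file an issue
  missing: string[];   // expected specialists that never reported
}

function decide(expected: string[], reviews: Review[]): Decision {
  const reported = reviews.map((r) => r.specialist);
  const missing = expected.filter((name) => reported.indexOf(name) === -1);
  const blockers = reviews.filter((r) => r.verdict === "block").map((r) => r.specialist);
  const followUps = reviews.filter((r) => r.verdict === "address").map((r) => r.specialist);
  return {
    // A crashed or timed-out specialist must not count as a silent pass.
    mergeable: blockers.length === 0 && missing.length === 0,
    blockers,
    followUps,
    missing,
  };
}
```

Because the specialists run in parallel and independently, the decision step only needs their final verdicts, which keeps it easy to test.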

Why it’s seven, not three or fifteen

Three is too few — the lenses overlap enough that you miss whole categories. Fifteen is too many — diminishing returns and review fatigue set in. Seven covers the major failure modes for AI features specifically, with minimal overlap between specialists.

If your stack has unusual concerns (a heavily regulated vertical, real-time systems, multi-tenant data isolation), swap one of the seven for a domain-specific specialist or add an eighth: a compliance specialist for fintech, a tenant-isolation specialist for multi-tenant SaaS. The seven are the floor, not the ceiling.

What it has caught for me

A non-exhaustive list from the last twelve months:

  • A voice agent that quoted stale eligibility caps because its tool was reading from a 12-minute cache (perf + AI correctness).
  • A prompt change that doubled token spend per session because of a debugging instruction the author forgot to remove (observability).
  • An enrichment endpoint that allowed user-controlled property addresses to be passed unescaped into the model, enabling prompt injection (security).
  • A migration that worked on the dev DB but would have locked the prod users table for ~90 seconds on apply (data model).

None of those would have been caught by unit tests. Most wouldn’t have been caught by a generalist code review either. All of them were caught by the specialist whose narrow attention was looking for exactly that class of failure.

How to start

Copy the seven names. Write a one-page checklist for each one yourself — what does “architecture” mean in your stack, what does “observability” mean in your stack. Wire it into your pre-merge workflow.

The pipeline you start with won’t be the pipeline you keep. Mine isn’t. You’ll iterate on the checklists as you learn what your stack’s specific failure modes look like.

That’s the playbook.

Want this pipeline embedded inside your team’s PR flow? Send a brief and we’ll set up a working session.

// FAQ
Is this a process for managers or for the engineer authoring the PR?
The engineer authoring the PR runs it. The seven specialists are focused Claude prompts with checklists; the author kicks them off via `pnpm review` and addresses feedback before requesting human review. The author owns quality.

What if my stack has unusual concerns?
The seven are the floor, not the ceiling. Swap one of them for a domain specialist (compliance for fintech, tenant-isolation for multi-tenant SaaS) or add an eighth. The categories evolved from production failures we kept seeing; yours will too.

How long does it take to set up?
A working version, with placeholder checklists: a day. A version tuned to your stack's actual failure modes: a quarter of running it. The first iteration of the pipeline is never the one you keep.