The Wolfpack: An Open-Source QA Pipeline for AI Agents

AI coding assistants are incredible at generating code. They're terrible at knowing when that code is wrong.

I've been building PawPIMS, a veterinary practice management platform, with Claude, Mistral, and Gemini as my primary engineering team for months. The output is impressive. The quality control problem nearly killed the project twice. An AI agent will confidently ship a tax calculation that silently returns zero on missing data, or refactor a modal that breaks every click handler downstream, or "simplify" a database query into an N+1 nightmare that takes down the page.

The problem isn't the models. The problem is that there was no structured process for reviewing their output before it shipped.

What We Built

The Wolfpack is a multi-agent pipeline that structures AI code changes into a reviewable, certifiable process. Instead of one agent doing everything in a single session (plan, code, test, commit, ship), the Wolfpack breaks the work into six roles that run in separate sessions with no shared context:

Alpha plans the feature and scores its complexity
Bloodhound adversarially reviews the plan before anyone writes code
Shepherd implements the plan (code only, no tests)
Pointer adversarially reviews the actual code diff
Tracker writes and runs automated tests
Watchdog certifies the full package and scores execution quality

Each role runs as a fresh session. Information flows between them only through files: plan documents, review findings, test logs, certification reports. No context bleed, no "I already looked at this earlier" shortcuts.

Human in the Loop

Each phase handoff is manual. You run /clear, switch to the next model, and invoke the next slash command. This is partly a technical limitation (there's no way to automate model switching between sessions yet), but it's also a deliberate design choice.

You see the output of every phase before the next one starts. You read the plan before the Bloodhound reviews it. You read the code review before the Shepherd rewrites. You read the test results before the Watchdog certifies. At any point, you can stop the pipeline, override a decision, or redirect the work. The agents propose; you approve.

This matters because AI agents are confident even when they're wrong. An unsupervised pipeline that plans, codes, tests, and ships without human checkpoints will eventually ship something that looks correct to every agent in the chain but is fundamentally broken in a way only a human would catch. The manual handoffs are the safety net.

Why Separate Roles Matter

The key insight: an AI reviewing its own work catches less than a different AI reviewing it.

When Claude Opus plans a feature and then Claude Opus reviews that plan, it has the same blind spots both times. When Mistral reviews an Opus plan, it brings genuinely different assumptions. We enforce this with a cross-model adversarial rule: the reviewer must always be a different model family from the role it's reviewing.

This isn't theoretical. On a recent Red-tier hunt (our highest complexity rating), the Opus Shepherd made a "simpler" fix that broke a framework convention. The Mistral Pointer caught it in code review. On the same hunt, the Pointer found two MEDIUM-severity issues that the Shepherd dismissed as optional. We caught that behavioral gap and tightened the rules so all findings are mandatory on rewrites.

Five Tiers of Ceremony

Not every change needs six agents and three rounds of review. A typo fix is not a billing system rewrite. The Wolfpack scales ceremony to complexity:

Green — typo, config change. The Shepherd implements, the Watchdog rubber-stamps. Done in 5 minutes.
Blue — small feature, polish. One-shot code review and testing. No looping.
Yellow — standard feature. Full review cycles, tests can trigger rewrites.
Orange — multi-component work, API changes. Two review rounds, performance audits.
Red — compliance-sensitive, architectural. Full ceremony, security lens, manual smoke testing.

The Alpha scores the task on seven dimensions and the tier is computed automatically. A developer can override, but the default is usually right.

The Pedigree: The Pipeline Learns

Every completed hunt gets scored by the Watchdog on execution quality: did the plan get followed? Was the code clean? Did tests pass? Did the code reviewer catch real bugs?

We also score process value-add: did the extra pipeline phases actually earn their overhead? If the Pointer never catches anything on Blue-tier hunts, maybe Blue should skip the Pointer like Green does. If a specific model consistently fails at test writing, the pedigree data tells the Alpha to avoid assigning it to that role.

This feedback loop means the pipeline gets better at model selection over time. Early on, you're guessing which model works best for each role. After 20 hunts, you have data.

What It Costs

Time. Each additional role adds a few minutes. A Green-tier hunt takes 5 minutes. A Red-tier hunt with full ceremony takes 60 to 90 minutes. The old approach (one agent, one session, ship it) was faster and produced bugs that took hours to diagnose in smoke testing.

The pipeline doesn't slow you down. It moves the debugging from "post-ship panic" to "pre-ship review." That's a net win on anything above trivial complexity.

Using It

The Wolfpack works with Claude Code and is compatible with Mistral and Gemini via symlinks. It's framework-agnostic: you configure your project's test commands, review checklists, compliance rules, and deployment steps in a single wolfpack-config.md file.

Installation is copying the skill files into your project's .claude/ directory and filling in the config. The README has the full setup guide.

/hunt my-feature "Add user profile editing"

Each phase finishes by telling you exactly what to do next. No memorizing a pipeline. Just follow the instructions.

Try It

The repo is at github.com/MorsCordis/wolfpack. MIT licensed. If you're using AI agents to write production code and want more confidence in what ships, give it a look.

If you build something with it, I'd love to hear about it. And if you find a gap, the pipeline is built to evolve. Every behavioral rule in the Wolfpack exists because a real bug made it through and we tightened the process to prevent the next one.

That's how engineering should work.