Endless Galaxy Studios
8 independent reviewers found the same bug, but none of them talked to each other.

I tested Claude Code's agent teams but completely missed the point

Yovarni Yearwood

7 min read

A first look at Claude Code's agent teams on a real code review. 8 agents but zero team semantics - turns out I completely missed how the team features work.


Quick Background#

LLMs are exciting - they’re the new hotness, and every day there are tons of new tools and technologies to try out. Honestly, it can be hard to keep up. I’ve put off trying agent teams for the longest time because they’re such an architectural change, and they’re personal to the way you develop - everybody has their own development process.

At my day job, one of our recent challenges has been figuring out how to better integrate AI into our workflows. My forays into an SDLC-driven agentic development process have been a journey, and quite a fun one at that. My latest stab at a development process tackles the thing I’ve put on the back burner for a long time - Claude Code’s agent teams.

You can find the SDLC and relevant skills mentioned in this blog post on my GitHub.

Why agent teams?#

  • Subagents are fire-and-forget. You Agent({...}), the subagent runs in its own context, returns a summary, you never see it again. No inter-agent communication. No persistence across turns.
  • Team members are persistent and addressable. You TeamCreate(...) once, then spawn members into it. Each member is a running agent with a name. You — and they — can SendMessage({to: "some-member", ...}) to any of them. Messages arrive in their inbox between turns. They stay alive across your turns until you explicitly shut them down.
  • Teams share a task list. All members see the same TaskList. Any member can TaskCreate, TaskUpdate. Real coordination primitive, not just a single-agent todo scratchpad.
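
The distinction is easy to model. The sketch below is not the real Claude Code API - just a toy Python model of the semantics the bullets above describe: persistent named members, per-member inboxes, and one shared task list.

```python
from collections import defaultdict

class Team:
    """Toy model of team semantics: persistent named members,
    per-member inboxes, and one shared task list."""

    def __init__(self):
        self.members = set()
        self.inboxes = defaultdict(list)  # member name -> pending messages
        self.task_list = []               # visible to every member

    def spawn(self, name):
        # Unlike a fire-and-forget subagent, a member stays alive
        # and addressable until explicitly shut down.
        self.members.add(name)

    def send_message(self, to, payload):
        assert to in self.members, f"unknown member: {to}"
        self.inboxes[to].append(payload)  # delivered between turns

    def task_create(self, description):
        self.task_list.append({"desc": description, "status": "open"})

team = Team()
team.spawn("code-reviewer")
team.spawn("frontend-developer")
team.send_message("frontend-developer",
                  {"from": "code-reviewer", "finding": "missing group class"})
team.task_create("re-check hover-reveal navigation")
```

A subagent, by contrast, would be a single call that returns a summary and leaves nothing behind - no inbox, no name, no shared state.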

So in theory a team of agents can debate each other, hand off tasks, watch each other’s work, and maintain shared context across a long multi-phase workflow. The skill I tested (/review-team) is built on top of all of that — in theory it spawns a team of specialists to review code together, argue out disagreements, and produce a unified report.

What I tested#

I ran /review-team against three recent commits in one of my personal projects. The change set was intentionally cross-domain:

  • Visual/UI work (animations, rendering effects)
  • Marketing copy / metadata
  • A new interactive component with keyboard navigation and accessibility implications

Cross-domain was the point — I wanted work where different specialist reviewers would genuinely have different things to say about the same code.

The skill dispatched 8 reviewer agents:

  • code-reviewer - generic code quality lens
  • frontend-developer - React/Next.js correctness
  • ui-ux-designer - interaction patterns, layout
  • performance-engineer - rendering cost, animations
  • brand-officer - visual/voice consistency
  • marketing-strategist - SEO, metadata, messaging
  • software-architect - abstractions, component boundaries
  • accessibility-auditor - keyboard nav, ARIA, focus

Plus a software-architect dispatched as the debate mediator.

What happened#

The parallel review produced useful convergence#

Five agents independently flagged the same issue: a missing group class on one of the components that broke hover-reveal navigation. When multiple specialists converge on the same finding from different angles, confidence goes way up. That convergence was the one clear win of the run.

But it’s worth being precise about why it worked - this is just how multi-agent dispatch behaves in general. Spin up several specialist subagents in parallel and aggregate their findings, and you get the same convergence pattern. It isn’t specific to agent teams. A simpler multi-subagent skill would have produced the same result at a fraction of the cost (like cc-sdlc’s /review-commit).

Total: 21 findings (2 critical, 12 major, 7 minor).
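
The aggregation itself is trivial, which is the point: collect findings from N independent reviewers and treat multi-reviewer convergence as the confidence signal. A sketch with illustrative finding IDs, mirroring the five-way convergence above:

```python
from collections import Counter

# Each reviewer independently returns a list of finding IDs
# (illustrative data, not the actual review output).
reviews = {
    "code-reviewer":         ["missing-group-class", "unused-import"],
    "frontend-developer":    ["missing-group-class"],
    "ui-ux-designer":        ["missing-group-class"],
    "performance-engineer":  ["missing-group-class", "animation-jank"],
    "accessibility-auditor": ["missing-group-class", "focus-trap"],
}

# Count how many independent reviewers flagged each finding.
convergence = Counter(f for findings in reviews.values() for f in findings)

# Findings flagged by 3+ reviewers earn a confidence boost.
high_confidence = [f for f, n in convergence.items() if n >= 3]
```

No SendMessage, no shared task list, no persistence required - which is exactly why a plain parallel-subagent skill gets the same signal.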

The “debate” phase didn’t actually debate#

The skill’s docs describe a multi-round protocol where conflicting agents see each other’s findings, respond with evidence, and the mediator only escalates to the user when agents can’t resolve it.

What actually happened: the mediator was spawned with a prompt that said “analyze findings, produce a synthesized report” — so it just wrote a summary document. Zero SendMessage calls between agents. No debate. Solo arbitration.

For example: code-reviewer flagged the missing group class as critical. frontend-developer flagged the same thing as major. The mediator picked major with a reasonable-sounding rationale — but the two original agents never actually argued. The mediator just picked a winner.

The difference matters: the whole premise of a debate protocol is that domain experts surface evidence for their position. A single architect making unilateral calls without that exchange is just a subagent with extra steps.
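
To make the difference concrete, here is what a debate round could look like, sketched with stub agents (the `Agent` class, severities, and concession rule are all hypothetical, purely to show the shape of the protocol): disagreeing agents see each other's evidence and revise, and escalation happens only when they can't converge.

```python
class Agent:
    """Hypothetical stub reviewer: holds a severity plus an evidence
    weight, and concedes when shown a better-evidenced position."""

    def __init__(self, name, severity, evidence_weight):
        self.name, self.severity, self.weight = name, severity, evidence_weight

    def assess(self, finding):
        return {"severity": self.severity, "weight": self.weight}

    def respond(self, finding, others):
        # Revise toward the position with the strongest evidence.
        strongest = max(others.values(), key=lambda p: p["weight"])
        if strongest["weight"] > self.weight:
            self.severity, self.weight = strongest["severity"], strongest["weight"]
        return {"severity": self.severity, "weight": self.weight}

def debate(finding, agents, max_rounds=3):
    positions = {a.name: a.assess(finding) for a in agents}
    for _ in range(max_rounds):
        if len({p["severity"] for p in positions.values()}) == 1:
            return positions, "consensus"    # agents converged on their own
        for agent in agents:
            others = {n: p for n, p in positions.items() if n != agent.name}
            positions[agent.name] = agent.respond(finding, others)
    return positions, "escalate-to-user"     # unresolved: don't force a winner

agents = [Agent("code-reviewer", "critical", evidence_weight=2),
          Agent("frontend-developer", "major", evidence_weight=3)]
positions, outcome = debate("missing-group-class", agents)
```

What the skill actually did was skip anything like `debate` entirely and run a single summarize-the-positions pass.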

The follow-up fix skill spawned brand-new agents#

After the review, I ran the companion /review-fix skill. It spawned 17 more fresh agents across several fix/review rounds. None of them reused the 8 team members from the review. The review team was sitting idle while the fix skill spun up entirely new agents with no context about what had just been reviewed.

8 + 17 = 27 total agent spawns for one 3-commit review-and-fix cycle. Each spawn starts cold, re-reads project context, reloads knowledge. Massive redundancy.

Cost#

Rough estimate: ~560K tokens for the full review + fix cycle.

A standard subagent-dispatch review on the same commits would have cost ~100K. So this pattern came in at roughly 5–6× the cost of the cheaper alternative.
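
The arithmetic holds up under a crude per-spawn model, assuming (generously) that all of the extra cost is redundant spawn overhead - a back-of-envelope attribution, not a measurement:

```python
# Observed figures from the run described above.
team_run_tokens = 560_000   # full review + fix cycle with the team skill
baseline_tokens = 100_000   # standard subagent-dispatch review
spawns = 8 + 17             # review team + fresh fix-phase agents

ratio = team_run_tokens / baseline_tokens
overhead_per_spawn = (team_run_tokens - baseline_tokens) / spawns

print(f"{ratio:.1f}x the baseline")                           # 5.6x the baseline
print(f"~{overhead_per_spawn:,.0f} extra tokens per spawn")   # ~18,400 extra tokens per spawn
```

Nearly 20K tokens of cold-start context per spawn is where "each spawn re-reads project context" shows up on the bill.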

What I took away#

1. The convergence value came from parallelism, not from team semantics. Five specialists converging on the same finding is a quality signal you can’t easily manufacture with a single reviewer. But any parallel multi-subagent dispatch produces that convergence. The team skill didn’t create this value — it just happened to run inside a team. A simpler parallel-subagent skill would produce the same outcome at a fraction of the cost.

2. This implementation was just parallel subagents with a synthesis step. The things that make a team different from parallel subagents — SendMessage exchanges, persistent context, hand-offs — weren’t being used. So the skill ran at team prices and delivered parallel-subagent-dispatch value.

3. The review skill and fix skill didn’t share context. Two skills, one workflow. Should have been one team serving both phases. Wasn’t.

4. Protocol compliance needs to be enforced, not just described. The skill’s README laid out a 6-step debate protocol. Two of the six actually ran. When a skill’s behaviour drifts from its documentation like that, you can’t trust what you’re buying.

Verdict on this version of the skill#

Not worth the cost for day-to-day work. The parallel-review value could be captured much more cheaply with a standard parallel-subagent dispatch, and the debate protocol that was supposed to be the differentiator didn’t actually run.

That said, the concept is solid — the implementation just hadn’t caught up to the design. I came up with nine concrete recommendations. The main ones:

  • Make the debate actually orchestrate inter-agent exchanges instead of solo synthesis
  • Let agents confirm, challenge, or suggest severity changes on each other’s findings directly, before routing everything through a mediator — puts agency with the domain experts
  • Validate agent types have the tools they need before spawning (one of the agents got spawned read-only and couldn’t run git show — silent failure)
  • When debate genuinely doesn’t resolve a conflict (both positions have equal evidence), escalate it to the user as an explicit decision rather than forcing the mediator to pick arbitrarily just to close the loop
  • Let the fix skill reuse existing team members instead of spawning fresh ones where possible
  • Emit a protocol-compliance checklist at the end of each run so skipped steps are visible instead of silent
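
That last recommendation is cheap to implement: track which documented steps actually executed and print the delta at the end of every run. A sketch, with illustrative step names standing in for the skill's 6-step protocol:

```python
# Step names are illustrative, not taken from the skill's README.
PROTOCOL_STEPS = [
    "parallel-review", "findings-aggregation", "conflict-detection",
    "inter-agent-debate", "mediator-synthesis", "final-report",
]

def compliance_report(executed):
    """Render a checklist so skipped protocol steps are visible, not silent."""
    lines = [f"[{'x' if s in executed else ' '}] {s}" for s in PROTOCOL_STEPS]
    done = sum(s in executed for s in PROTOCOL_STEPS)
    lines.append(f"compliance: {done}/{len(PROTOCOL_STEPS)}")
    return "\n".join(lines)

# The run in this post executed only two of the six documented steps.
print(compliance_report({"parallel-review", "mediator-synthesis"}))
```

Had the skill emitted something like this, the missing debate phase would have been obvious from the first run.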

Why this matters#

  • Agent teams are a real thing worth watching. The multi-agent pattern has legitimate value, especially for cross-domain work where multiple specialist lenses surface different things. The concept isn’t wrong.
  • But “agent teams” aren’t automatically better than parallel subagent dispatches. The extra cost only pays back when the team semantics are actually being exercised — inter-agent messages, persistent shared state, coordinated hand-offs. A skill that just spawns N agents in parallel without inter-agent communication is charging team prices for subagent work.
  • When evaluating AI tools — verify the implementation matches the pitch. The gap between this skill’s described protocol and its actual behaviour was wide, and nothing in the output warned us of it. We’re going to see this pattern a lot.

I ran a follow-up test three days later on a revised version of the skill. There’ll be a new blog post soon covering that. The short version: most of these problems got fixed, but new ones showed up.