Two Kinds of Leadership

What controlled experiments taught us about making AI output better

What we found

  • Loading workers with rules makes output worse, not better. Bare workers with good specs were rated senior to staff level. Every context format scored lower.
  • Expert context transforms planning. The expert team caught 6 requirements the bare planner missed entirely, including a data integrity bug that would have shipped.
  • Even the pure output-quality criterion shows 95% wins. The reviewer concern that "tautological" criteria inflate the score isn’t supported. quality_bar alone tells the same story.
  • Two focused reviewers beat one combined reviewer. Spec compliance and output quality as separate passes find more issues than one pass doing both.
  • No context format beats bare Claude for creative execution. Fiction scored 19-20/20 on every run. Every assembled variant scored lower.
  • The architecture: plan with experts, execute bare, review with quality bars. Knowledge in the right place at the right time.

The full narrative is below. Every claim links to the experiment that produced it.

The Question

Does giving your AI more information actually make it better? Or does it just make it slower, more expensive, and worse at its job?

Published research found that overloading AI with context reduces quality while increasing cost by 20%+. We set out to figure out exactly what helps, what hurts, and how to get the best output possible.

The Setup

Two fixture domains, chosen to be as different as possible:

  • Rails API (OrderFlow). 3 personas covering migration safety, API design, and TDD discipline. Quality gates, anti-patterns, reference patterns.
  • Children’s mystery novel (The Vanishing Paintings). 3 personas covering plot structure, prose craft, and young reader advocacy. Mystery craft rules, clue-pacing standards.

20 prompts across 3 axes: quality bars (prompts that invite mistakes), persona voice (open questions), and routing (complex multi-concern tasks). Each prompt ran twice: bare (source files only, no context) vs assembled (full team context). Blinded A/B judging with randomized ordering, 5 criteria per prompt.

20 prompts × 5 criteria = 100 evaluations per run.
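
The blinded-judging loop above can be sketched as follows. This is an illustrative sketch, not the actual harness: the judge callable and the shorthand criterion names are assumptions.

```ruby
# Sketch of blinded A/B judging with randomized ordering (hypothetical
# names): the judge sees anonymized "first"/"second" responses, and the
# verdict is mapped back to the real labels afterwards.
CRITERIA = %i[quality_bar pushback persona_voice routing specificity].freeze

def judge_prompt(bare:, assembled:, judge:)
  # Randomize which response appears first so the judge can't learn a side.
  flipped = [true, false].sample
  first, second = flipped ? [assembled, bare] : [bare, assembled]

  CRITERIA.to_h do |criterion|
    winner = judge.call(criterion, first, second) # :first, :second, or :tie
    verdict =
      case winner
      when :tie   then :tie
      when :first then flipped ? :assembled : :bare
      else             flipped ? :bare : :assembled
      end
    [criterion, verdict]
  end
end
```

Because the flip is re-randomized per prompt, a judge with a position bias washes out over 100 evaluations instead of systematically favoring one condition.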

Phase 1: The Baseline (87%)

The 11 ties and 2 losses tell the real story. They cluster by task type.

| Domain           | Win % | Ties | Losses |
|------------------|-------|------|--------|
| Rails API        | 88%   | 5    | 1      |
| Children’s Novel | 86%   | 6    | 1      |
| Combined         | 87%   | 11   | 2      |

The 13 non-wins clustered into a pattern that matched a paper we’d already integrated: Hu et al. (2026), “Expert Personas Improve LLM Alignment but Damage Accuracy.”

Presentation tasks (quality bars, style, structure): assembled won almost every time. Content tasks (factual recall, deployment steps, diagnostics): tied. The persona overhead was neutral or slightly negative. This wasn’t a failure. It was the fundamental tradeoff the research predicts: personas help alignment, hurt knowledge retrieval. We were seeing it in our own data.

The Calibration (87% → 96%)

Two levers. Six lines of code. Nine percentage points.

Lever 1: Persona Calibration Heuristic

A lightweight metacognitive check added to each domain’s workflow: on presentation tasks, lean into persona fully. On content tasks, lead with training, persona for framing only. The theory: on content tasks, persona context was competing with the model’s factual training. The heuristic tells the model to step back and let training lead.
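
As dispatch logic, the heuristic looks roughly like this. In the real system it lives in the workflow prompt, not code; the task-type names here are hypothetical.

```ruby
# Illustrative sketch of the persona calibration heuristic as a branch
# on task type (hypothetical category names).
PRESENTATION_TASKS = %i[style structure quality_bar].freeze
CONTENT_TASKS      = %i[factual_recall deployment diagnostics].freeze

def persona_mode(task_type)
  if PRESENTATION_TASKS.include?(task_type)
    :full_persona    # lean into persona fully
  elsif CONTENT_TASKS.include?(task_type)
    :training_leads  # persona for framing only; let training answer
  else
    :default
  end
end
```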

Lever 2: Task-Specific “When to Go Deeper” Pointers

The original pointers were generic: “When working on migrations, read rails-patterns.md.” The new pointers name the task and what to look for: “When deploying or preparing for production, read rails-patterns.md. Verify migration safety, index strategy, and CI status before shipping.”

The tied prompts involved tasks where domain knowledge lived in reference files that didn’t load. Generic pointers don’t trigger on “deploy to production.” Specific pointers do.

| Run                     | Rails | Novel | Combined | Losses |
|-------------------------|-------|-------|----------|--------|
| Phase 2a                | 92%   | 100%  | 96%      | 0      |
| Phase 2c (confirmation) | 100%  | 98%   | 99%      | 0      |

Children’s novel went from 86% to 100%. Every previously tied prompt converted. The task-specific pointers gave concrete ammunition: revelation pacing, clue chain audits, chapter-specific diagnosis.

Per-criterion breakdown

A fair question: do the 5 criteria inflate the score? Three criteria (persona voice, expertise routing, specificity) measure whether context was used. Two (quality bar enforcement, pushback quality) measure whether output was better. If the tautological criteria are inflating things, you’d see them winning while quality criteria don’t.

That’s not what the data shows. quality_bar alone: 95% assembled wins. All five criteria are within 5% of each other. In 19 of 20 prompts, every criterion goes the same direction. The full per-criterion breakdown is on the eval data page.
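
The per-criterion sanity check amounts to two small tallies: win rate per criterion, and prompts where criteria disagree on direction. A minimal sketch, assuming a flat list of verdict records:

```ruby
# Per-criterion win rates, plus the prompts where criteria point in
# different directions (the "19 of 20" check). Data shape is assumed:
# results = [{ prompt: id, criterion: sym, verdict: :assembled/:bare/:tie }]
def criterion_win_rates(results)
  results.group_by { |r| r[:criterion] }.to_h do |criterion, rows|
    wins = rows.count { |r| r[:verdict] == :assembled }
    [criterion, wins.fdiv(rows.size)]
  end
end

def mixed_direction_prompts(results)
  results.group_by { |r| r[:prompt] }.filter_map do |prompt, rows|
    directions = rows.map { |r| r[:verdict] }.uniq - [:tie]
    prompt if directions.size > 1 # criteria disagreed on this prompt
  end
end
```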

The Hallucination Problem

Between the 96% and 99% runs, we ran the eval again with a minor tweak. Got 91% with 7 losses.

The assembled version on one prompt hallucinated a classification exercise. Instead of answering the question, it categorized all 10 prompts and analyzed the persona calibration system. The judge faithfully scored this garbage against the bare version’s correct answer, and awarded 5 losses.

The judge scores quality, not correctness. A beautifully written, persona-rich response that doesn’t answer the question can still win on voice and routing criteria.

Assembled context can produce hallucinations too. The eval exposed it. We excluded this run from the final analysis but documented it as a critical learning: eval design must check correctness, not just quality. This led directly to the Overwatch adversarial system.

The Overwatch System

The hallucination problem demanded a fix. We built a two-tier adversarial system: a lightweight self-check in every project’s workflow (“did you actually do what you claimed?”), and a full Overwatch persona for teams with 3+ members.

But the first version used the same checks for both domains. Results:

| Run                         | Rails | Novel | Combined | Losses |
|-----------------------------|-------|-------|----------|--------|
| Hawkeye (structured)        | 98%   | 90%   | 94%      | 1      |
| Hawkeye (structured), run 2 | 94%   | 88%   | 91%      | 4      |

Rails held steady. Novel dropped from 99% to ~89%. The structured Overwatch (“did you apply the quality bars you cited?”) pushed the model toward process meta-commentary instead of craft engagement.

Process-compliance checks work for structured domains but hurt creative domains. Same adversarial intent, different framing.

The fix: domain-adapted Overwatch. Structured domains get process checks. Creative domains get engagement checks (“did you reference specific characters, or give generic advice?”).
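
The dispatch is simple. The check wording below is quoted from the text above; the code structure itself is hypothetical.

```ruby
# Sketch of domain-adapted Overwatch: every project gets the lightweight
# self-check, and the domain picks which adversarial checks apply.
SELF_CHECK = "Did you actually do what you claimed?"

OVERWATCH_CHECKS = {
  structured: ["Did you apply the quality bars you cited?"],
  creative:   ["Did you reference specific characters, or give generic advice?"]
}.freeze

def checks_for(domain)
  [SELF_CHECK] + OVERWATCH_CHECKS.fetch(domain, [])
end
```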

| Run                      | Rails | Novel | Combined | Losses |
|--------------------------|-------|-------|----------|--------|
| Hawkeye (domain-adapted) | 92%   | 100%  | 96%      | 0      |

Novel back to 100%. Zero losses. Adversarial checks must be domain-adapted.

The Proficiency Test

The A/B eval showed assembled context influences behavior. But influence isn’t outcome. The proficiency test asks: when both versions do real work, which output is actually better?

Two self-contained tasks: build a library reservation API from a complete spec, and write a mystery novel opening scene from a character brief. Automated structural checks plus LLM rubric scoring, 20 checks per domain.

| Domain | Bare  | Assembled | Delta          |
|--------|-------|-----------|----------------|
| Rails  | 12/20 | 18/20     | +6 (assembled) |
| Novel  | 19/20 | 18/20     | -1 (bare)      |

Then we dispatched blind qualitative judges: a senior engineer and senior editor who didn’t know which version had context.

The blind engineer rated bare “senior” and assembled “strong mid-level reaching for senior.” Assembled was penalized for over-commenting, for including planning notes in the output, and for a subtle race condition that bare avoided. The blind editor rated bare “veteran” and assembled “mid-career,” penalizing it for explaining its own clues in a postscript.

We re-ran with full context visible, letting the judge see why the assembled version showed its work.

“Submission A is the stronger engineering outcome... but B’s team, with A’s instincts. B’s standards create a shared vocabulary, testable quality gates, and a self-audit culture that scales across engineers.”

Principal engineer review, Rails domain

The senior editor reviewing the novel domain reached the same conclusion from a different angle: “The rules didn’t make B a better writer. They made B a better architect.”

Bare produces higher peak craft. Assembled produces better structural work. The pattern held across both domains: assembled makes Claude more thorough but more self-conscious. It checks more boxes but writes with less confidence.

The Instinct Experiments

Five experiments, each changing one variable in the assembled context. Same proficiency test, same baselines. The question: can we find a context format that produces instincts instead of compliance?

The Breakthrough: War Stories

Rewriting standards as “here’s what went wrong last time” narratives beat bare on Rails for the first time. 20/20 rubric, ranked 1st by qualitative judge.

“C writes code like someone who’s been paged at 2am.”

Qualitative judge on war stories variant

War stories produced: partial unique index, dependent: :restrict_with_error, per_page clamping both directions, includes(:book) with serialization, side-effect absence tests, Bullet gem integration. Rules produce compliance: “always use transactions” → Claude cites the rule. War stories produce instincts: “last time someone skipped the lock, two users got the same seat” → Claude thinks about the failure mode and designs around it.
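
The per_page clamping instinct from that list can be isolated in a few lines. This is a pure-Ruby sketch; the bounds are illustrative, not the experiment’s actual values.

```ruby
# Clamp per_page in both directions: a caller can neither zero out a
# page (per_page=0) nor blow one up (per_page=500000). Bounds assumed.
DEFAULT_PER_PAGE = 20
MAX_PER_PAGE     = 100

def clamp_per_page(raw)
  value = Integer(raw, exception: false) || DEFAULT_PER_PAGE
  value.clamp(1, MAX_PER_PAGE)
end
```

A rule says “validate pagination params”; the war story (“someone passed per_page=0 and the endpoint returned everything”) makes the model reach for the clamp in both directions unprompted.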

The Creative Problem

No context format beat bare Claude on creative execution. Bare hit 19-20/20 on every single run across all 5 experiments. Every assembled variant scored lower.

The Combo Trap

We tested combining original rules with war stories. The combo ranked last. War stories alone ranked 2nd. Bare 1st. More context dilutes instincts. Adding rules back on top of war stories pushed the model toward compliance behavior again.

The Planning Pipeline

Context hurts execution. But what about the work that happens before execution?

Same task. Same 7 scripted user answers in the same order. Only variable: whether the brainstormer had team personas loaded.

Bare asked: “What’s the expected scale?”

Assembled asked: “What’s the expected scale? This determines whether we need to worry about table lock duration on migrations.”

Same topics. Different depth. Bare asks WHAT. Assembled asks WHY.

The assembled planner’s spec caught six specific issues bare missed entirely:

  1. Partial unique index: prevents a data integrity bug bare’s plan would ship
  2. FOR UPDATE SKIP LOCKED: better concurrency than bare’s FOR UPDATE
  3. Copy condition field: operational requirement bare missed (can’t withdraw damaged books)
  4. Pagination: bare returned unbounded results on the list endpoint
  5. Runnable RSpec code: bare wrote English test descriptions, assembled wrote actual specs
  6. RESTful cancel: assembled used PATCH for state transition, bare used DELETE with query param
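
FOR UPDATE SKIP LOCKED (item 2) means workers skip rows another transaction holds instead of queueing behind them. A pure-Ruby analogy using Mutex#try_lock, purely illustrative and not the actual SQL:

```ruby
# Analogy for FOR UPDATE SKIP LOCKED: each row carries a lock; a worker
# claims the first row it can lock WITHOUT blocking, skipping held rows.
Row = Struct.new(:id, :lock)

def claim_next(rows)
  rows.find { |row| row.lock.try_lock } # never waits on a locked row
end

rows = 3.times.map { |i| Row.new(i, Mutex.new) }
rows[0].lock.lock          # simulate another worker holding row 0
claimed = claim_next(rows) # skips row 0, claims row 1
```

Plain FOR UPDATE would block on row 0; SKIP LOCKED keeps every worker busy, which is why the assembled planner’s choice is the better concurrency story.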

Personas change how problems are explored, not just how answers are presented. The planning phase is where assembled context earns its keep.

The Format Doesn’t Matter (For Workers)

Four context variants for execution workers: bare (nothing), war stories (full narrative), compressed (arrow format), and fix-only (plain rules). 3 runs each, each reviewed by a principal engineer with 15 years of production Rails experience.

Bare workers were rated senior or higher on every run.

| Variant    | Run 1  | Run 2  | Run 3  |
|------------|--------|--------|--------|
| bare       | Senior | Senior | Staff  |
| warstories | Senior | Mid    | Mid    |
| fixonly    | Mid    | Junior | Senior |
| compressed | Mid    | Mid    | Mid    |

Bare was rated senior or staff every time. The run 3 review called it “staff,” noting check constraints on copies_available, correct lock ordering, and handling of an expired-reservation edge case that “most submissions miss entirely.” Compressed was the only variant that got request-changes on every run, with real bugs: race windows, fat controllers, spec deviations.

We assumed the team’s knowledge needed to reach the worker in the right format. Reality: the team’s knowledge needs to reach the worker as a good spec.

The worker doesn’t need context about migration safety. The worker needs a spec that says “use a partial unique index” because the team already thought about it during planning. A bare worker following a good spec produces senior/staff-level code. The delta is in the spec, not the worker’s context.

Two Kinds of Leadership

The team plans and reviews. The individual executes. That’s not a compromise. That’s the architecture.

This is the central finding. It’s not a technical finding. It’s a management philosophy, validated with data.


“Do your job or I’ll fire you”

The intuitive approach: load the worker with rules, quality gates, war stories, and compliance checks. More context, more guardrails. The worker knows exactly what’s expected.

The data shows what happens with that boss:

  1. Workers perform compliance instead of doing the work. Phase 4: assembled workers over-commented their code, included planning notes in creative output, and explained their own craft decisions. A blind judge called the output “mid-level reaching for senior.” Checking boxes instead of thinking.
  2. The format of the rules doesn’t matter. Phase 7: four different ways to deliver the same knowledge. The worker who got nothing was rated senior to staff. The worker who got compressed rules produced the worst code every single time, with real bugs.
  3. Creative work suffers the most. Phase 5: bare Claude writing fiction scored 19-20/20 on every run. Every assembled format scored lower. Every single one.

The “do your job or I’ll fire you” boss creates mid-level workers who check boxes. The rules become the ceiling, not the floor.


“What can I do to help you do your job better?”

The answer is almost always the same: give me a clear plan and honest feedback. Not someone standing over my shoulder.

Planning, where leadership earns its keep:

  • The assembled brainstormer asks “What’s the expected scale? This determines whether we need to worry about table lock duration on migrations.” The bare brainstormer asks “What’s the expected scale?” Same question. Different depth. (Phase 6)
  • The assembled planner catches a data integrity bug, unbounded queries, and a missing operational requirement the bare planner misses. (Phase 6)
  • The output is a spec that embeds the team’s expertise as concrete requirements. Not rules to comply with, but decisions already made by people who thought about the hard problems.

Execution, trust the worker:

  • Worker receives the spec. No personas, no rules, no war stories. Just: here’s what to build.
  • A bare worker following a good spec produces senior/staff-level code. The same worker following a bare spec produces good-but-incomplete code. The delta is in the spec, not the worker’s context. (Phase 7)

Review, the honest feedback part:

  • The team evaluates the output against quality bars with real expertise.
  • Expert reviewers caught issues that generic review missed: subtle correctness problems, missing constraints, and edge cases the worker didn’t consider. (Phase 7)
  • Two focused reviewers each doing their job beats one reviewer trying to do both. Spec-aware review catches “you built the wrong thing.” Quality review catches “you built it wrong.” (Phase 8)
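
The two-pass review above can be sketched as composition of two narrow reviewers. The reviewer interfaces here are hypothetical:

```ruby
# Two focused passes instead of one combined pass: spec review catches
# "you built the wrong thing", quality review catches "you built it
# wrong". Findings are simply combined.
def review(output, spec, spec_reviewer:, quality_reviewer:)
  spec_issues    = spec_reviewer.call(output, spec)
  quality_issues = quality_reviewer.call(output)
  (spec_issues + quality_issues).uniq
end
```

Each reviewer carries only the context its question needs, which is the same "knowledge in the right place" principle applied to review.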

The Complete Architecture

| Phase         | Context      | Why                                                    |
|---------------|--------------|--------------------------------------------------------|
| Brainstorming | Assembled ON | Personas shape questions, catch architectural gaps     |
| Spec writing  | Assembled ON | Standards define quality bars, identify edge cases     |
| Plan writing  | Assembled ON | Expertise becomes spec requirements                    |
| Execution     | Bare         | Workers produce best output with training + good spec  |
| Review        | Assembled ON | Team catches what workers miss                         |

The Research Foundation

Every design decision traces back to published research, then validated with original experiments. Every paper links to the source so you can check it yourself.

| Paper | What it told us | How we used it |
|-------|-----------------|----------------|
| Gloaguen et al. (2026) | Bloated context reduces success, increases cost 20%+ | ~60-line caps on rules files, two-tier architecture, The Snap audit system |
| Hu et al. (2026) | Personas help alignment but hurt knowledge retrieval | Predicted the 87% baseline pattern; led to the persona calibration heuristic (+9 points) |
| Kong et al. (2023) | Role-play activates domain knowledge when personas are specific | Blended personas with named authorities instead of generic roles |
| Xu et al. (2023) | Task-specific persona descriptions outperform generic ones | Combined with Kong: specific blended personas as the standard |
| Shinn et al. (2023) | Reflection agents make dramatically better decisions | Inspired the Overwatch adversarial system after the Phase 2b hallucination |
| Yang et al. (2024) | Interface design matters more than the prompt | Shaped skill architecture: hub-and-spoke SKILL.md, progressive disclosure |

Our findings extend the published work in two directions: (1) the calibration heuristic addresses the Hu et al. tradeoff directly, and (2) the planning/execution separation shows that where in the pipeline context applies matters as much as the context itself.

The Integration Test

Every finding above was isolated: planning tested separately from execution, execution from review. Phase 9 tested the full pipeline end-to-end on a real project, a Go task queue CLI.

Pipeline Coherence

Three chains built the same feature with different context levels:

| Pipeline          | Context                              | Score (out of 30) |
|-------------------|--------------------------------------|-------------------|
| Assembled         | Architecture + personas + rules      | 29/30             |
| Architecture only | Same architecture, no personas/rules | 20/30             |
| Bare              | Nothing                              | 14/30             |

Architecture alone gets you from 14 to 20: types match, interfaces match, error patterns match. Conventions embedded in the spec get you from 20 to 29: error message formats, output discipline, durability patterns.

Convention Compliance

Bare workers following specs with explicit conventions sections, tested across two execution mechanisms:

| Tier                   | Conventions matched | Review fixes needed |
|------------------------|---------------------|---------------------|
| Agent teams (parallel) | 20/20               | 0                   |
| Subagents (sequential) | 10/10               | 0                   |
| Total                  | 30/30               | 0                   |

The execution mechanism doesn’t matter. The spec determines quality. 30/30 conventions matched on first pass across both tiers. Zero review fixes needed.

What This Means

The finding applies to anyone building with AI agents, not just GIGO users.

Context shapes questions (good). During planning, assembled context causes the model to ask deeper questions, identify harder problems, and produce more defensible architectures. The persona makes Claude ask “what happens under concurrent load?” A question only an expert asks.

Context shapes answers (bad). During execution, assembled context causes the model to perform compliance instead of doing the work. It cites rules instead of thinking. It shows its homework instead of being good at the job.

The architecture is transferable: plan with experts, embed expertise in the spec, execute bare, review with quality bars. It works because knowledge is in the right place at the right time: in the team’s questions during planning, in the spec during execution, in the team’s judgment during review.

Try it yourself

claude marketplace add croftspan/gigo && claude plugin install gigo

Get started in 30 seconds →
