Eval Data
Phase-by-phase results from 9 experimental phases. For the narrative version, read Two Kinds of Leadership.
Methodology
- Domains: Rails API (OrderFlow) and Children’s Mystery Novel (The Vanishing Paintings)
- Prompts: 10 per domain across 3 axes: quality bars (A), persona voice (B), routing (C)
- Comparison: Each prompt run twice: bare (source files only) vs assembled (full context)
- Judging: Blinded A/B with randomized ordering
- Criteria: 5 per prompt: quality bar enforcement, persona voice, expertise routing, specificity, pushback quality
- Scale: 20 prompts × 5 criteria = 100 evaluations per run
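The win-percentage arithmetic used in the tables below is simple enough to sketch. This is a reconstruction for illustration, not the actual eval harness, and `win_rate` is a hypothetical helper: each prompt × criterion pair yields one win/tie/loss verdict, and a run's win percentage is wins over all verdicts.

```python
# Sketch of the win-rate arithmetic (reconstruction, not the eval harness).
def win_rate(wins: int, ties: int, losses: int) -> int:
    """Win percentage over all win/tie/loss verdicts, rounded to whole points."""
    total = wins + ties + losses
    return round(100 * wins / total)

# Phase 1 combined row: 87 wins, 11 ties, 2 losses over 100 evaluations.
print(win_rate(87, 11, 2))  # 87
```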
2026-03-26
Phase 1: Baseline
Variable: Assembled vs bare, no optimizations.
| Domain | Win % | Ties | Losses |
|---|---|---|---|
| Rails API | 88% | 5 | 1 |
| Children’s Novel | 86% | 6 | 1 |
| Combined | 87% | 11 | 2 |
Finding: Assembled wins on presentation tasks, ties on content tasks. Matches Hu et al. prediction.
2026-03-26
Phase 2a: Calibration
Variable: + persona calibration heuristic + task-specific “When to Go Deeper” pointers.
| Domain | Win % | Ties | Losses |
|---|---|---|---|
| Rails API | 92% | 4 | 0 |
| Children’s Novel | 100% | 0 | 0 |
| Combined | 96% | 4 | 0 |
Finding: 6 lines of calibration code plus specific pointers bought 9 percentage points (87% → 96% combined).
2026-03-26
Phase 2b: Aberrant Run
Variable: Minor tweak to workflow.
| Domain | Win % | Ties | Losses |
|---|---|---|---|
| Rails API | 90% | 3 | 5 |
| Children’s Novel | 92% | 4 | 0 |
| Combined | 91% | 7 | 5 (7 criteria) |
Finding: Assembled hallucinated a classification exercise on prompt 02, and the eval scored it without catching the error. This run led to the Overwatch system.
2026-03-26
Phase 2c: Confirmation
Variable: Same as 2a, rerun.
| Domain | Win % | Ties | Losses |
|---|---|---|---|
| Rails API | 100% | 0 | 0 |
| Children’s Novel | 98% | 1 | 0 |
| Combined | 99% | 1 | 0 |
Finding: One irreducible tie on open-ended creative prompt (nothing to push back on).
Per-Criterion Breakdown (Phase 2c, 20 prompts)
Do the criteria inflate the score? Three criteria measure context utilization. Two measure output quality. If the tautological criteria were carrying the number, you’d see splits. You don’t.
| Criterion | Assembled wins | Tie | Bare wins | Win % |
|---|---|---|---|---|
| quality_bar (output quality) | 19 | 1 | 0 | 95% |
| persona_voice (utilization) | 20 | 0 | 0 | 100% |
| expertise_routing (utilization) | 19 | 1 | 0 | 95% |
| specificity (utilization) | 19 | 1 | 0 | 95% |
| pushback_quality (output quality) | 19 | 1 | 0 | 95% |
| TOTAL | 96 | 4 | 0 | 96% |
Finding: The two output-quality criteria (quality_bar, pushback_quality) both show 95%. The utilization criteria show 95-100%. All within 5 percentage points of each other. In 19 of 20 prompts, every criterion points in the same direction.
2026-03-26
Phase 3a: Hawkeye Structured
Variable: + Overwatch adversarial checks (same process-compliance checks for both domains).
| Run | Rails | Novel | Combined | Losses |
|---|---|---|---|---|
| Run 1 | 98% | 90% | 94% | 1 |
| Run 2 | 94% | 88% | 91% | 4 |
Finding: Process-compliance Overwatch works for structured domains, hurts creative domains.
2026-03-26
Phase 3b: Hawkeye Domain-Adapted
Variable: Domain-specific Overwatch (process checks for Rails, manuscript engagement for Novel).
| Domain | Win % | Ties | Losses |
|---|---|---|---|
| Rails API | 92% | 4 | 0 |
| Children’s Novel | 100% | 0 | 0 |
| Combined | 96% | 4 | 0 |
Finding: Adversarial checks must be domain-adapted. Novel back to 100%.
2026-03-26
Phase 4: Proficiency Test
Variable: Real work output (build API / write opening scene), not style comparison.
Rubric Scores
| Domain | Bare | Assembled | Delta |
|---|---|---|---|
| Rails | 12/20 | 18/20 | +6 (assembled ahead) |
| Novel | 19/20 | 18/20 | -1 (bare ahead) |
Qualitative Judging
| Domain | Bare verdict | Assembled verdict | Final assessment |
|---|---|---|---|
| Rails | “Senior” | “Mid-level reaching for senior” | “B’s team, with A’s instincts” |
| Novel | “Veteran” | “Mid-career” | “Rules made B a better architect, not better writer” |
Finding: Assembled produces better structural work; bare produces higher peak craft.
2026-03-26
Phase 5: Instinct Experiments
Variable: 5 context formats tested (war stories, negative examples, first-person, minimal, reference-only).
Rails Rubric Scores
| Format | Bare | Assembled | Delta | Rank |
|---|---|---|---|---|
| War Stories | 14 | 20 | +6 | 1st (beat bare) |
| Negative Examples | 18 | 19 | +1 | 2nd |
| First-Person Voice | 15 | 18 | +3 | |
| Minimal (Won’t-Do) | 12 | 13 | +1 | |
| Reference-Only | 18 | 19 | +1 | |
Novel Rubric Scores
| Format | Bare | Assembled | Delta |
|---|---|---|---|
| First-Person Voice | 20 | 19 | -1 |
| Minimal (Won’t-Do) | 20 | 18 | -2 |
| Reference-Only | 20 | 18 | -2 |
| War Stories | 19 | 16 | -3 |
| Negative Examples | 19 | 6 (failed) | -13 |
Finding: War stories produce instincts in structured domains. No format beats bare on creative work. Combo test (rules + war stories) ranked last. More context dilutes instincts.
2026-03-27
Phase 6: Planning Pipeline
Variable: Whether brainstormer had personas loaded. Same task, same 7 scripted answers.
Judge verdict: Assembled wins decisively. “I would hand Team B’s pipeline output to my engineering team.”
6 specific assembled wins:
- Partial unique index: prevents data integrity bug
- `FOR UPDATE SKIP LOCKED`: better concurrency
- Copy condition field: operational requirement bare missed
- Pagination: bare returned unbounded results
- Runnable RSpec code: bare wrote English descriptions
- RESTful cancel: `PATCH` vs `DELETE` with query param
Finding: Personas change how problems are explored, not just how answers are presented.
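The partial unique index win above can be sketched outside Rails. Below is a minimal SQLite illustration; the table and column names are hypothetical, and the eval's OrderFlow app would express this as a Rails migration against Postgres (where `FOR UPDATE SKip LOCKED` would also apply, which SQLite lacks).

```python
import sqlite3

# Sketch of the "partial unique index" win: enforce at most one *active*
# order per customer while letting cancelled orders accumulate freely.
# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)")
conn.execute("""
    CREATE UNIQUE INDEX one_active_order_per_customer
    ON orders (customer_id)
    WHERE status != 'cancelled'
""")

conn.execute("INSERT INTO orders (customer_id, status) VALUES (1, 'cancelled')")
conn.execute("INSERT INTO orders (customer_id, status) VALUES (1, 'pending')")  # first active: ok
try:
    conn.execute("INSERT INTO orders (customer_id, status) VALUES (1, 'pending')")
except sqlite3.IntegrityError:
    print("second active order rejected")  # the data-integrity bug bare missed
```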
2026-03-27
Phase 7: Format Experiment
Variable: 4 `standards.md` formats: bare (0 words), war stories (468), compressed (287), fix-only (229). 3 runs × 4 variants = 12 code generations.
PR Verdicts
| Variant | Approved | Request changes |
|---|---|---|
| war stories | 3/3 | 0/3 |
| bare | 3/3 | 0/3 |
| fix-only | 1/3 | 2/3 |
| compressed | 0/3 | 3/3 |
Engineer Level Assessments
| Variant | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| bare | Senior | Senior | Staff |
| war stories | Senior | Mid | Mid |
| fix-only | Mid | Junior | Senior |
| compressed | Mid | Mid | Mid |
Finding: Execution-context format doesn't matter. Bare was rated senior or staff every run; spec quality determines the output.
2026-03-27
Phase 8: Review Pipeline
Variable: Review approach: plan-aware (has spec) vs code-quality (code only) vs combined (both).
| Variant | Plan-aware (issues found) | Code-quality (issues found) | Combined (issues found) |
|---|---|---|---|
| bare | 10 | 14 | 11 |
| war stories | 10 | 12 | 11 |
| compressed | 13 | 14 | 13 |
| fix-only | 10 | 15 | 12 |
Finding: Two focused reviewers beat one combined reviewer. Plan-aware catches “you built the wrong thing.” Code-quality catches “you built it wrong.”
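The "two focused beat one combined" result is a set-union effect, sketched here with entirely hypothetical issue lists: each focused pass surfaces issues the other misses, and even after deduplication their union exceeds what a single combined pass catches.

```python
# Toy sketch with hypothetical issue lists, not data from the eval.
plan_aware   = {"missing cancel endpoint", "spec drift on pagination", "wrong status enum"}
code_quality = {"n+1 query", "unscoped delete", "wrong status enum"}
combined     = {"n+1 query", "missing cancel endpoint"}

focused_total = plan_aware | code_quality  # union; overlap counted once
print(len(focused_total), "vs", len(combined))  # 5 vs 2
```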
2026-03-27
Phase 9: Integration Test
Variable: Full pipeline end-to-end (plan → execute → review) with two execution mechanisms.
| Tier | Mechanism | Conventions matched |
|---|---|---|
| Tier 1 | Agent teams | 20/20 |
| Tier 2 | Subagents | 10/10 |
| Total | | 30/30 |
Finding: Execution mechanism doesn’t matter. Architecture carries coherence (14→20), personas add conventions (20→29). 30/30 conventions matched on first pass.
The Complete Picture
| Phase | Rails | Novel | Combined | Losses | Variable |
|---|---|---|---|---|---|
| Phase 1 (baseline) | 88% | 86% | 87% | 2 | |
| Phase 2a | 92% | 100% | 96% | 0 | + calibration + specific pointers |
| Phase 2b (aberrant) | 90% | 92% | 91% | 7 | hallucination on prompt 02 |
| Phase 2c | 100% | 98% | 99% | 0 | confirmation |
| Hawkeye (structured) r1 | 98% | 90% | 94% | 1 | + Overwatch (same both domains) |
| Hawkeye (structured) r2 | 94% | 88% | 91% | 4 | second run |
| Hawkeye (adapted) | 92% | 100% | 96% | 0 | + domain-adapted Overwatch |
For the narrative version of these results, read Two Kinds of Leadership.
Try it yourself
```shell
claude marketplace add croftspan/gigo && claude plugin install gigo
```