Eval Data

Phase-by-phase results from nine experiments. For the narrative version, read Two Kinds of Leadership.

Methodology

  • Domains: Rails API (OrderFlow) and Children’s Mystery Novel (The Vanishing Paintings)
  • Prompts: 10 per domain across 3 axes: quality bars (A), persona voice (B), routing (C)
  • Comparison: Each prompt run twice. Bare (source files only) vs assembled (full context)
  • Judging: Blinded A/B with randomized ordering
  • Criteria: 5 per prompt: quality bar enforcement, persona voice, expertise routing, specificity, pushback quality
  • Scale: 20 prompts × 5 criteria = 100 evaluations per run
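The blinding step above can be sketched in a few lines. This is a hypothetical illustration of the protocol, not the actual harness (which this report doesn't include): the function names are invented, and the judge is assumed to return "A", "B", or "tie".

```python
import random

# Randomize which condition appears as "A" so the judge can't tell which
# output came from the assembled context, then map the pick back afterwards.
def blind_pair(bare_output, assembled_output, rng):
    if rng.random() < 0.5:
        return {"A": bare_output, "B": assembled_output, "A_is": "bare"}
    return {"A": assembled_output, "B": bare_output, "A_is": "assembled"}

def unblind(pair, judge_pick):
    """judge_pick is 'A', 'B', or 'tie'."""
    if judge_pick == "tie":
        return "tie"
    other = {"bare": "assembled", "assembled": "bare"}
    return pair["A_is"] if judge_pick == "A" else other[pair["A_is"]]

rng = random.Random(0)  # seeded so the ordering is reproducible
pair = blind_pair("bare response", "assembled response", rng)
```

Whatever the randomized order, unblinding a non-tie pick always recovers exactly one of the two conditions.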

2026-03-26

Phase 1: Baseline

Variable: Assembled vs bare, no optimizations.

| Domain | Win % | Ties | Losses |
| --- | --- | --- | --- |
| Rails API | 88% | 5 | 1 |
| Children’s Novel | 86% | 6 | 1 |
| Combined | 87% | 11 | 2 |

Finding: Assembled wins on presentation tasks, ties on content tasks. Matches Hu et al. prediction.

2026-03-26

Phase 2a: Calibration

Variable: + persona calibration heuristic + task-specific “When to Go Deeper” pointers.

| Domain | Win % | Ties | Losses |
| --- | --- | --- | --- |
| Rails API | 92% | 4 | 0 |
| Children’s Novel | 100% | 0 | 0 |
| Combined | 96% | 4 | 0 |

Finding: 6 lines of calibration code plus specific pointers bought 9 percentage points over baseline.

2026-03-26

Phase 2b: Aberrant Run

Variable: Minor tweak to workflow.

| Domain | Win % | Ties | Losses |
| --- | --- | --- | --- |
| Rails API | 90% | 3 | 5 |
| Children’s Novel | 92% | 4 | 0 |
| Combined | 91% | 7 | 5 (7 criteria) |

Finding: Assembled hallucinated a classification exercise on prompt 02, and the eval scored it without catching the error. This failure led to the Overwatch system.

2026-03-26

Phase 2c: Confirmation

Variable: Same as 2a, rerun.

| Domain | Win % | Ties | Losses |
| --- | --- | --- | --- |
| Rails API | 100% | 0 | 0 |
| Children’s Novel | 98% | 1 | 0 |
| Combined | 99% | 1 | 0 |

Finding: One irreducible tie on open-ended creative prompt (nothing to push back on).

Per-Criterion Breakdown (Phase 2c, 20 prompts)

Do the criteria inflate the score? Three criteria measure context utilization and two measure output quality. If the near-tautological utilization criteria were carrying the number, you’d see a split between the two groups. You don’t.

| Criterion | Assembled wins | Tie | Bare wins | Win % |
| --- | --- | --- | --- | --- |
| quality_bar (output quality) | 19 | 1 | 0 | 95% |
| persona_voice (utilization) | 20 | 0 | 0 | 100% |
| expertise_routing (utilization) | 19 | 1 | 0 | 95% |
| specificity (utilization) | 19 | 1 | 0 | 95% |
| pushback_quality (output quality) | 19 | 1 | 0 | 95% |
| TOTAL | 96 | 4 | 0 | 96% |

Finding: The two output-quality criteria (quality_bar, pushback_quality) both show 95%. The utilization criteria show 95-100%. All within 5 percentage points of each other. In 19 of 20 prompts, every criterion goes the same direction.
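The Win % column above reduces to simple arithmetic: wins over all 20 prompts, with ties and losses counting against. A quick check (the scoring code itself isn't shown in this report, so the function name is illustrative):

```python
# Reproduce the per-criterion Win % column: wins / (wins + ties + losses).
def win_pct(wins, ties, losses):
    return round(100 * wins / (wins + ties + losses))

assert win_pct(19, 1, 0) == 95   # quality_bar, expertise_routing, specificity, pushback_quality
assert win_pct(20, 0, 0) == 100  # persona_voice
assert win_pct(96, 4, 0) == 96   # TOTAL row
```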

2026-03-26

Phase 3a: Hawkeye Structured

Variable: + Overwatch adversarial checks (same process-compliance checks for both domains).

| Run | Rails | Novel | Combined | Losses |
| --- | --- | --- | --- | --- |
| Run 1 | 98% | 90% | 94% | 1 |
| Run 2 | 94% | 88% | 91% | 4 |

Finding: Process-compliance Overwatch works for structured domains, hurts creative domains.

2026-03-26

Phase 3b: Hawkeye Domain-Adapted

Variable: Domain-specific Overwatch (process checks for Rails, manuscript engagement for Novel).

| Domain | Win % | Ties | Losses |
| --- | --- | --- | --- |
| Rails API | 92% | 4 | 0 |
| Children’s Novel | 100% | 0 | 0 |
| Combined | 96% | 4 | 0 |

Finding: Adversarial checks must be domain-adapted. Novel back to 100%.

2026-03-26

Phase 4: Proficiency Test

Variable: Real work output (build API / write opening scene), not style comparison.

Rubric Scores

| Domain | Bare | Assembled | Delta |
| --- | --- | --- | --- |
| Rails | 12/20 | 18/20 | +6 (assembled) |
| Novel | 19/20 | 18/20 | -1 (bare) |

Qualitative Judging

| Domain | Bare verdict | Assembled verdict | Final assessment |
| --- | --- | --- | --- |
| Rails | “Senior” | “Mid-level reaching for senior” | “B’s team, with A’s instincts” |
| Novel | “Veteran” | “Mid-career” | “Rules made B a better architect, not better writer” |

Finding: Assembled produces better structural work; bare produces higher peak craft.

2026-03-26

Phase 5: Instinct Experiments

Variable: 5 context formats tested (war stories, negative examples, first-person, minimal, reference-only).

Rails Rubric Scores

| Format | Bare | Assembled | Delta | Rank |
| --- | --- | --- | --- | --- |
| War Stories | 14 | 20 | +6 | 1st (beat bare) |
| Negative Examples | 18 | 19 | +1 | 2nd |
| First-Person Voice | 15 | 18 | +3 | |
| Minimal (Won’t-Do) | 12 | 13 | +1 | |
| Reference-Only | 18 | 19 | +1 | |

Novel Rubric Scores

| Format | Bare | Assembled | Delta |
| --- | --- | --- | --- |
| First-Person Voice | 20 | 19 | -1 |
| Minimal (Won’t-Do) | 20 | 18 | -2 |
| Reference-Only | 20 | 18 | -2 |
| War Stories | 19 | 16 | -3 |
| Negative Examples | 19 | 6 (failed) | -13 |

Finding: War stories produce instincts in structured domains. No format beats bare on creative work. Combo test (rules + war stories) ranked last. More context dilutes instincts.

2026-03-27

Phase 6: Planning Pipeline

Variable: Whether the brainstormer had personas loaded. Same task, same 7 scripted answers.

Judge verdict: Assembled wins decisively. “I would hand Team B’s pipeline output to my engineering team.”

6 specific assembled wins:

  1. Partial unique index: prevents data integrity bug
  2. FOR UPDATE SKIP LOCKED: better concurrency
  3. Copy condition field: operational requirement bare missed
  4. Pagination: bare returned unbounded results
  5. Runnable RSpec code: bare wrote English descriptions
  6. RESTful cancel: PATCH vs DELETE with query param
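Win 1 is worth a concrete illustration. The report doesn't include OrderFlow's actual schema, so the table and column names below are hypothetical, and SQLite stands in for Postgres (both support partial indexes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)")

# Partial unique index: uniqueness applies only to rows matching the WHERE
# clause, so a customer may accumulate cancelled orders but can hold at most
# one active order at a time.
conn.execute("""
    CREATE UNIQUE INDEX one_active_order_per_customer
    ON orders (customer_id) WHERE status != 'cancelled'
""")

conn.execute("INSERT INTO orders (customer_id, status) VALUES (1, 'pending')")
try:
    # Second active order for the same customer: the data integrity bug.
    conn.execute("INSERT INTO orders (customer_id, status) VALUES (1, 'processing')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True  # blocked at the database level

# Cancelled rows fall outside the index, so this insert still succeeds.
conn.execute("INSERT INTO orders (customer_id, status) VALUES (1, 'cancelled')")
```

An application-level check can race; the partial index makes the invariant hold no matter how many processes write concurrently.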

Finding: Personas change how problems are explored, not just how answers are presented.

2026-03-27

Phase 7: Format Experiment

Variable: 4 standards.md formats: bare (0 words), war stories (468), compressed (287), fix-only (229). 3 runs × 4 variants = 12 code generations.

PR Verdicts

| Variant | Approved | Request changes |
| --- | --- | --- |
| war stories | 3/3 | 0/3 |
| bare | 3/3 | 0/3 |
| fix-only | 1/3 | 2/3 |
| compressed | 0/3 | 3/3 |

Engineer Level Assessments

| Variant | Run 1 | Run 2 | Run 3 |
| --- | --- | --- | --- |
| bare | Senior | Senior | Staff |
| war stories | Senior | Mid | Mid |
| fix-only | Mid | Junior | Senior |
| compressed | Mid | Mid | Mid |

Finding: Execution-context format doesn’t matter: bare was rated senior or staff on every run. Spec quality determines output quality.

2026-03-27

Phase 8: Review Pipeline

Variable: Three review approaches: plan-aware (has the spec), code-quality (code only), and combined (both).

| Variant | Plan-aware (issues found) | Code-quality (issues found) | Combined (issues found) |
| --- | --- | --- | --- |
| bare | 10 | 14 | 11 |
| war stories | 10 | 12 | 11 |
| compressed | 13 | 14 | 13 |
| fix-only | 10 | 15 | 12 |

Finding: Two focused reviewers beat one combined reviewer. Plan-aware catches “you built the wrong thing.” Code-quality catches “you built it wrong.”
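The split can be sketched as two independent passes whose findings are concatenated. Everything below (function names, the toy spec, the TODO heuristic) is illustrative, not the actual review pipeline:

```python
# Plan-aware pass: compare the code against the spec ("the wrong thing").
def plan_aware_review(spec, code_lines):
    return [req for req in spec if not any(req in line for line in code_lines)]

# Code-quality pass: inspect the code on its own terms ("built it wrong").
def code_quality_review(code_lines):
    return [line for line in code_lines if "TODO" in line]

spec = ["pagination", "cancel endpoint", "copy condition field"]
code = [
    "index: pagination via cursor",
    "PATCH cancel endpoint",
    "TODO: handle copy condition",
]

# Each pass has one job; their union is the full review.
issues = plan_aware_review(spec, code) + code_quality_review(code)
```

Here the plan-aware pass flags the unmet spec item while the quality pass flags the placeholder line; a single reviewer asked to do both at once tended to find fewer of each in the Phase 8 runs.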

2026-03-27

Phase 9: Integration Test

Variable: Full pipeline end-to-end (plan → execute → review) with two execution mechanisms.

| Tier | Mechanism | Conventions matched |
| --- | --- | --- |
| Tier 1 | Agent teams | 20/20 |
| Tier 2 | Subagents | 10/10 |
| Total | | 30/30 |

Finding: Execution mechanism doesn’t matter. Architecture carries coherence (14→20), personas add conventions (20→29). 30/30 conventions matched on first pass.

The Complete Picture

| Phase | Rails | Novel | Combined | Losses | Variable |
| --- | --- | --- | --- | --- | --- |
| Phase 1 (baseline) | 88% | 86% | 87% | 2 | |
| Phase 2a | 92% | 100% | 96% | 0 | + calibration + specific pointers |
| Phase 2b (aberrant) | 90% | 92% | 91% | 7 | hallucination on prompt 02 |
| Phase 2c | 100% | 98% | 99% | 0 | confirmation |
| Hawkeye (structured) r1 | 98% | 90% | 94% | 1 | + Overwatch (same both domains) |
| Hawkeye (structured) r2 | 94% | 88% | 91% | 4 | second run |
| Hawkeye (adapted) | 92% | 100% | 96% | 0 | + domain-adapted Overwatch |

For the narrative version of these results, read Two Kinds of Leadership.

Try it yourself

claude marketplace add croftspan/gigo && claude plugin install gigo

Get started in 30 seconds →