Eval Data
Phase-by-phase results from 9 experimental phases. For the narrative version, read Two Kinds of Leadership.
Methodology
- Domains: Rails API (OrderFlow) and Children’s Mystery Novel (The Vanishing Paintings)
- Prompts: 10 per domain across 3 axes: quality bars (A), persona voice (B), routing (C)
- Comparison: Each prompt run twice: bare (source files only) vs assembled (full context)
- Judging: Blinded A/B with randomized ordering
- Criteria: 5 per prompt: quality bar enforcement, persona voice, expertise routing, specificity, pushback quality
- Scale: 20 prompts × 5 criteria = 100 evaluations per run
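The win-percentage arithmetic used in the tables below is simple enough to sketch. This is a reconstruction for illustration, not the actual eval harness, and `win_rate` is a hypothetical helper: each prompt × criterion pair yields one win/tie/loss verdict, and a run's win percentage is wins over all verdicts.

```python
# Sketch of the win-rate arithmetic (reconstruction, not the eval harness).
def win_rate(wins: int, ties: int, losses: int) -> int:
    """Win percentage over all win/tie/loss verdicts, rounded to whole points."""
    total = wins + ties + losses
    return round(100 * wins / total)

# Phase 1 combined row: 87 wins, 11 ties, 2 losses over 100 evaluations.
print(win_rate(87, 11, 2))  # 87
```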
2026-03-26
Phase 1: Baseline
Variable: Assembled vs bare, no optimizations.
| Domain | Win % | Ties | Losses |
|---|---|---|---|
| Rails API | 88% | 5 | 1 |
| Children’s Novel | 86% | 6 | 1 |
| Combined | 87% | 11 | 2 |
Finding: Assembled wins on presentation tasks, ties on content tasks. Matches Hu et al. prediction.
2026-03-26
Phase 2a: Calibration
Variable: + persona calibration heuristic + task-specific “When to Go Deeper” pointers.
| Domain | Win % | Ties | Losses |
|---|---|---|---|
| Rails API | 92% | 4 | 0 |
| Children’s Novel | 100% | 0 | 0 |
| Combined | 96% | 4 | 0 |
Finding: 6 lines of calibration code plus specific pointers bought 9 percentage points (87% → 96% combined).
2026-03-26
Phase 2b: Aberrant Run
Variable: Minor tweak to workflow.
| Domain | Win % | Ties | Losses |
|---|---|---|---|
| Rails API | 90% | 3 | 5 |
| Children’s Novel | 92% | 4 | 0 |
| Combined | 91% | 7 | 5 (7 criteria) |
Finding: Assembled hallucinated a classification exercise on prompt 02, and the eval scored it without catching the error. This run led to the Overwatch system.
2026-03-26
Phase 2c: Confirmation
Variable: Same as 2a, rerun.
| Domain | Win % | Ties | Losses |
|---|---|---|---|
| Rails API | 100% | 0 | 0 |
| Children’s Novel | 98% | 1 | 0 |
| Combined | 99% | 1 | 0 |
Finding: One irreducible tie on open-ended creative prompt (nothing to push back on).
Per-Criterion Breakdown (Phase 2c, 20 prompts)
Do the criteria inflate the score? Three criteria measure context utilization. Two measure output quality. If the tautological criteria were carrying the number, you’d see splits. You don’t.
| Criterion | Assembled wins | Tie | Bare wins | Win % |
|---|---|---|---|---|
| quality_bar (output quality) | 19 | 1 | 0 | 95% |
| persona_voice (utilization) | 20 | 0 | 0 | 100% |
| expertise_routing (utilization) | 19 | 1 | 0 | 95% |
| specificity (utilization) | 19 | 1 | 0 | 95% |
| pushback_quality (output quality) | 19 | 1 | 0 | 95% |
| TOTAL | 96 | 4 | 0 | 96% |
Finding: The two output-quality criteria (quality_bar, pushback_quality) both show 95%. The utilization criteria show 95-100%. All within 5 percentage points of each other. In 19 of 20 prompts, every criterion points in the same direction.
2026-03-26
Phase 3a: Hawkeye Structured
Variable: + Overwatch adversarial checks (same process-compliance checks for both domains).
| Run | Rails | Novel | Combined | Losses |
|---|---|---|---|---|
| Run 1 | 98% | 90% | 94% | 1 |
| Run 2 | 94% | 88% | 91% | 4 |
Finding: Process-compliance Overwatch works for structured domains, hurts creative domains.
2026-03-26
Phase 3b: Hawkeye Domain-Adapted
Variable: Domain-specific Overwatch (process checks for Rails, manuscript engagement for Novel).
| Domain | Win % | Ties | Losses |
|---|---|---|---|
| Rails API | 92% | 4 | 0 |
| Children’s Novel | 100% | 0 | 0 |
| Combined | 96% | 4 | 0 |
Finding: Adversarial checks must be domain-adapted. Novel back to 100%.
2026-03-26
Phase 4: Proficiency Test
Variable: Real work output (build API / write opening scene), not style comparison.
Rubric Scores
| Domain | Bare | Assembled | Delta |
|---|---|---|---|
| Rails | 12/20 | 18/20 | +6 (assembled ahead) |
| Novel | 19/20 | 18/20 | -1 (bare ahead) |
Qualitative Judging
| Domain | Bare verdict | Assembled verdict | Final assessment |
|---|---|---|---|
| Rails | “Senior” | “Mid-level reaching for senior” | “B’s team, with A’s instincts” |
| Novel | “Veteran” | “Mid-career” | “Rules made B a better architect, not better writer” |
Finding: Assembled produces better structural work; bare produces higher peak craft.
2026-03-26
Phase 5: Instinct Experiments
Variable: 5 context formats tested (war stories, negative examples, first-person, minimal, reference-only).
Rails Rubric Scores
| Format | Bare | Assembled | Delta | Rank |
|---|---|---|---|---|
| War Stories | 14 | 20 | +6 | 1st (beat bare) |
| Negative Examples | 18 | 19 | +1 | 2nd |
| First-Person Voice | 15 | 18 | +3 | |
| Minimal (Won’t-Do) | 12 | 13 | +1 | |
| Reference-Only | 18 | 19 | +1 | |
Novel Rubric Scores
| Format | Bare | Assembled | Delta |
|---|---|---|---|
| First-Person Voice | 20 | 19 | -1 |
| Minimal (Won’t-Do) | 20 | 18 | -2 |
| Reference-Only | 20 | 18 | -2 |
| War Stories | 19 | 16 | -3 |
| Negative Examples | 19 | 6 (failed) | -13 |
Finding: War stories produce instincts in structured domains. No format beats bare on creative work. Combo test (rules + war stories) ranked last. More context dilutes instincts.
2026-03-27
Phase 6: Planning Pipeline
Variable: Whether brainstormer had personas loaded. Same task, same 7 scripted answers.
Judge verdict: Assembled wins decisively. “I would hand Team B’s pipeline output to my engineering team.”
6 specific assembled wins:
- Partial unique index: prevents data integrity bug
- `FOR UPDATE SKIP LOCKED`: better concurrency
- Copy condition field: operational requirement bare missed
- Pagination: bare returned unbounded results
- Runnable RSpec code: bare wrote English descriptions
- RESTful cancel: `PATCH` vs `DELETE` with query param
Finding: Personas change how problems are explored, not just how answers are presented.
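The partial unique index win above can be sketched outside Rails. Below is a minimal SQLite illustration; the table and column names are hypothetical, and the eval's OrderFlow app would express this as a Rails migration against Postgres (where `FOR UPDATE SKip LOCKED` would also apply, which SQLite lacks).

```python
import sqlite3

# Sketch of the "partial unique index" win: enforce at most one *active*
# order per customer while letting cancelled orders accumulate freely.
# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, status TEXT)")
conn.execute("""
    CREATE UNIQUE INDEX one_active_order_per_customer
    ON orders (customer_id)
    WHERE status != 'cancelled'
""")

conn.execute("INSERT INTO orders (customer_id, status) VALUES (1, 'cancelled')")
conn.execute("INSERT INTO orders (customer_id, status) VALUES (1, 'pending')")  # first active: ok
try:
    conn.execute("INSERT INTO orders (customer_id, status) VALUES (1, 'pending')")
except sqlite3.IntegrityError:
    print("second active order rejected")  # the data-integrity bug bare missed
```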
2026-03-27
Phase 7: Format Experiment
Variable: 4 `standards.md` formats: bare (0 words), war stories (468), compressed (287), fix-only (229). 3 runs × 4 variants = 12 code generations.
PR Verdicts
| Variant | Approved | Request changes |
|---|---|---|
| war stories | 3/3 | 0/3 |
| bare | 3/3 | 0/3 |
| fix-only | 1/3 | 2/3 |
| compressed | 0/3 | 3/3 |
Engineer Level Assessments
| Variant | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| bare | Senior | Senior | Staff |
| war stories | Senior | Mid | Mid |
| fix-only | Mid | Junior | Senior |
| compressed | Mid | Mid | Mid |
Finding: Execution-context format doesn't matter. Bare was rated senior or staff every run; spec quality determines the output.
2026-03-27
Phase 8: Review Pipeline
Variable: Review approach: plan-aware (has spec) vs code-quality (code only) vs combined (both).
| Variant | Plan-aware (issues found) | Code-quality (issues found) | Combined (issues found) |
|---|---|---|---|
| bare | 10 | 14 | 11 |
| war stories | 10 | 12 | 11 |
| compressed | 13 | 14 | 13 |
| fix-only | 10 | 15 | 12 |
Finding: Two focused reviewers beat one combined reviewer. Plan-aware catches “you built the wrong thing.” Code-quality catches “you built it wrong.”
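The "two focused beat one combined" result is a set-union effect, sketched here with entirely hypothetical issue lists: each focused pass surfaces issues the other misses, and even after deduplication their union exceeds what a single combined pass catches.

```python
# Toy sketch with hypothetical issue lists, not data from the eval.
plan_aware   = {"missing cancel endpoint", "spec drift on pagination", "wrong status enum"}
code_quality = {"n+1 query", "unscoped delete", "wrong status enum"}
combined     = {"n+1 query", "missing cancel endpoint"}

focused_total = plan_aware | code_quality  # union; overlap counted once
print(len(focused_total), "vs", len(combined))  # 5 vs 2
```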
2026-03-27
Phase 9: Integration Test
Variable: Full pipeline end-to-end (plan → execute → review) with two execution mechanisms.
| Tier | Mechanism | Conventions matched |
|---|---|---|
| Tier 1 | Agent teams | 20/20 |
| Tier 2 | Subagents | 10/10 |
| Total | | 30/30 |
Finding: Execution mechanism doesn’t matter. Architecture carries coherence (14→20), personas add conventions (20→29). 30/30 conventions matched on first pass.
The Complete Picture
| Phase | Rails | Novel | Combined | Losses | Variable |
|---|---|---|---|---|---|
| Phase 1 (baseline) | 88% | 86% | 87% | 2 | |
| Phase 2a | 92% | 100% | 96% | 0 | + calibration + specific pointers |
| Phase 2b (aberrant) | 90% | 92% | 91% | 7 | hallucination on prompt 02 |
| Phase 2c | 100% | 98% | 99% | 0 | confirmation |
| Hawkeye (structured) r1 | 98% | 90% | 94% | 1 | + Overwatch (same both domains) |
| Hawkeye (structured) r2 | 94% | 88% | 91% | 4 | second run |
| Hawkeye (adapted) | 92% | 100% | 96% | 0 | + domain-adapted Overwatch |
For the narrative version of these results, read Two Kinds of Leadership.
Try it yourself
```shell
claude marketplace add croftspan/gigo && claude plugin install gigo
```