How your project stays sharp

Four pillars that prevent the rot every AI project hits.

The problem every AI project hits

Context files grow monotonically. Every session discovers something worth remembering, so rules files grow. Adding, never removing. Within weeks: 200+ lines of overlapping rules, outdated guidance, reference-depth detail loading on every conversation.

Performance degrades. Sessions take longer, make more mistakes, sometimes ignore rules entirely. There’s too much to attend to.

Then Gloaguen et al. (2026) published the first rigorous study confirming it: bloated context files reduce task success rates while increasing cost by 20%+. You pay more and get worse results.

Deep expertise without the bloat

| | Rules | References |
|---|---|---|
| Location | `.claude/rules/` | `.claude/references/` |
| Loading | Auto-loaded every conversation | On-demand only |
| Token cost | Paid every time | Zero when unused |
| Content | Philosophy, quality bars, workflow, universal patterns | Deep knowledge, technique catalogs, authority profiles |
| Cap | ~60 lines per file, ~300 total | No cap |
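The loading split can be sketched in a few lines of Python. This is an illustration of the economics in the table above, not the project's actual loader; the directory layout is the only assumption taken from the source.

```python
from pathlib import Path

RULES_DIR = Path(".claude/rules")        # always loaded, paid every conversation
REFS_DIR = Path(".claude/references")    # loaded on demand, free when unused

def load_rules() -> str:
    """Concatenate every rules file; this token cost is paid on every conversation."""
    return "\n\n".join(p.read_text() for p in sorted(RULES_DIR.glob("*.md")))

def load_reference(name: str) -> str:
    """Pull a single reference file into context only when a task triggers it."""
    return (REFS_DIR / f"{name}.md").read_text()
```

Rules ride along in every session; a reference costs zero tokens until a pointer fires and the agent reads it.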

The key innovation: “When to Go Deeper” pointers. Rules files don’t just point to references. They tell the agent when to read them. Not generic “see also” links but task-specific triggers:

“When tuning cooperative balance, read .claude/references/cooperative-balance.md.”

This makes the system task-aware: deep context loads only when relevant. Phase 2a validated the approach: task-specific pointers took the children’s novel from 86% to 100%.
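In a rules file, such pointers might look like the following sketch (the file name and the second pointer are illustrative, not taken from an actual project):

```markdown
<!-- .claude/rules/game-design.md (illustrative) -->
## When to Go Deeper
- When tuning cooperative balance, read `.claude/references/cooperative-balance.md`.
- When profiling save/load performance, read `.claude/references/persistence.md`.
```

The trigger clause ("When tuning cooperative balance...") is what makes the pointer task-specific rather than a generic "see also" link.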

Less is more (and we can measure it)

A 30-line file that nails the essentials outperforms a 100-line file that dilutes attention. Every rules file stays under ~60 lines. Total budget across all rules: ~300 lines.

When a file approaches the cap, the essential stays in rules/ and the depth moves to references/. Your project never pays for knowledge it doesn’t need right now.
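The caps are mechanical enough to check automatically. A minimal sketch, assuming rules live as markdown files under `.claude/rules/` (the function name and return shape are illustrative):

```python
from pathlib import Path

PER_FILE_CAP = 60    # ~60 lines per rules file
TOTAL_BUDGET = 300   # ~300 lines across all rules files

def over_budget(rules_dir: str = ".claude/rules") -> list[str]:
    """Return a warning for each rules file over its cap, plus the total budget."""
    warnings = []
    total = 0
    for f in sorted(Path(rules_dir).glob("*.md")):
        n = len(f.read_text().splitlines())
        total += n
        if n > PER_FILE_CAP:
            warnings.append(f"{f.name}: {n} lines (cap ~{PER_FILE_CAP})")
    if total > TOTAL_BUDGET:
        warnings.append(f"total: {total} lines (budget ~{TOTAL_BUDGET})")
    return warnings
```

An empty return means the budget holds; anything else names exactly which file needs its depth moved to references/.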

Only write what matters

If Claude can figure it out by reading your project, don’t write a rule for it. Directory structure, file organization, standard patterns: the agent finds those on its own. Writing them down actually hurts. It wastes attention on information the agent already has.

What’s worth writing: philosophy, quality bars, blended authorities, anti-patterns. The things your project cares about that aren’t obvious from reading the files.

Automatic cleanup every session

Named after Tony’s snap, not Thanos’s: not indiscriminate wiping, but sacrificing what has to go so what matters survives.

Every session ends with a 10-point audit:

  1. Line check: any file approaching ~60 lines?
  2. Derivability check: can the agent figure this out from the code?
  3. Overlap check: rules saying the same thing in different words?
  4. Staleness check: rules that no longer apply?
  5. Cost check: worth loading on every conversation?
  6. Persona calibration check: domain knowledge in rules that belongs in references?
  7. Total budget check: approaching ~300 lines across all files?
  8. Coverage check: project outgrown the team?
  9. Overwatch check: adversarial self-check system present?
  10. Pipeline check: three phases (plan, execute, review) still intact?
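Most of these checks are judgment calls, but check 3 (overlap) can be roughly approximated with string similarity. A crude sketch, not the project's actual audit; the threshold and length filter are arbitrary assumptions:

```python
import difflib
from itertools import combinations
from pathlib import Path

def overlapping_rules(rules_dir: str = ".claude/rules", threshold: float = 0.8):
    """Flag pairs of rule lines that say nearly the same thing (audit check 3)."""
    lines = []
    for f in sorted(Path(rules_dir).glob("*.md")):
        for line in f.read_text().splitlines():
            line = line.strip()
            if len(line) > 20:          # skip headings and short fragments
                lines.append((f.name, line))
    pairs = []
    for (fa, a), (fb, b) in combinations(lines, 2):
        if difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            pairs.append((fa, a, fb, b))
    return pairs
```

Flagged pairs are candidates for merging, not automatic deletions; the agent (or you) still decides which wording survives.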

Projects get sharper over time, not bigger.

Expert teams that actually have opinions

Not “a senior developer.” That’s generic. Your team members are specific: Andrew Kane’s database-ops pragmatism + Sandi Metz’s “small objects that talk to each other” + DHH’s convention-over-configuration. Two, three, or more real authorities per team member, each bringing something the others don’t.

Quality bars and approach stay in the always-on rules. Deep knowledge (technique catalogs, authority profiles) loads on demand. Your team is present in every conversation without making it slow.

Research found that expert personas help alignment but can hurt knowledge retrieval. GIGO addresses this with a calibration heuristic that tells the model when to lean into expertise and when to lead with training. This single adjustment added 9% in testing.

An independent reviewer that assumes you’re wrong

Your work gets reviewed by something that isn’t trying to be nice. Not a yes-man. Not a rubber stamp. Three layers of review, each catching different things, each running before your work ships.

Before you commit to a design

The fact-checker reads your design brief and checks every assumption against the actual project. “You’re proposing a new event system but one already exists.” “The file you reference was renamed last week.” “Your insertion point is wrong. It appends after the final step instead of inserting between steps.”

This runs before you even approve the design. The assumptions that feel obvious are the ones that ship bugs.

Before you write a line

The Challenger reviews your specs and plans in two passes. First pass is blind: it doesn’t know what you asked for. It reads the spec cold and judges it purely on whether it will work. Will the dependency graph hold? Will a worker get stuck on an underspecified step? Does this actually fit the project as it exists today?

Second pass: does this solve the problem you actually stated? Drift, scope creep, misinterpretation. The Challenger catches the gap between what you meant and what the spec says.

Honest scores. When we used it on our own spec, it found a constraint that would have crashed at runtime. That bug would have hit the first user to trigger the feature.

After every task

Two focused reviewers run after execution. One checks spec compliance: did you build the right thing? The other checks craft quality: is the work good? Two passes find more than one pass trying to do both, because they’re asking different questions with different expertise.

Issues get triaged automatically. Some the worker fixes. Some need your input. Some get noted for later. You don’t wade through a wall of findings. You deal with what matters.

Expertise where it helps, freedom where it doesn’t

  1. Plan: experts shape the questions.
  2. Execute: workers follow the spec.
  3. Review: the team catches what’s missed.

During planning, your expert team asks the questions you’d only think of at 2am. They catch gaps, surface edge cases, and embed their expertise as concrete requirements in the spec.

During execution, workers get the spec and produce high-quality output. No extra rules, no hand-holding. The quality is in the spec, not the worker’s context.

During review, two focused passes catch what one pass misses. Did you build the right thing? Is the work good? Different questions, different lenses, more issues found.

For the full data and narrative, read Two Kinds of Leadership.