CLAUDE.md
Three files, three scopes. Global <30 lines, project source-controlled, local gitignored. Read as context, not enforced as config.
Skip it → the agent forgets what your project is on every other turn.
Most agent failures aren't the model being stupid. They're the harness being thin: missing context, missing tools, missing feedback, missing guardrails. This is the kit for the harness. Opinionated. Evidence-tested. Built backwards from real failures rather than forwards from wishful features.
For most of 2026 we kept blaming the model. The agent forgot context, so the model wasn't smart enough. The agent ran the wrong tool, so the model wasn't smart enough. The agent deleted a file it shouldn't have touched, so the model wasn't smart enough.
It was never the model. It was the configuration around it.
The single most important claim in the field, from Addy Osmani: a decent model with a great harness beats a great model with a bad harness. Once you have read that line, the work of the next ten years rearranges itself. The model is intelligence. The harness is everything else, and everything else is where the work has quietly been the whole time.
HARNESS is the part of that work made public. Seven layers, each with concrete files. Two enforced primitives, so discipline becomes part of the environment instead of a hope. Built backwards from real practitioner failures. Stress-tested against three independent deep-research providers the day it was written. Twelve corrections folded back, on the record, in VALIDATION.md.
§02 · The architecture
Birgitta Böckeler's framework: every harness control is either a guide (feedforward, steers before the agent acts) or a sensor (feedback, observes after). Build both. Guides without sensors is hope. Sensors without guides is thrashing.
Three files, three scopes. Global <30 lines, project source-controlled, local gitignored. Read as context, not enforced as config.
Skip it → the agent forgets what your project is on every other turn.
Loaded on-demand by trigger phrase. Build the second time you do something. Seven good skills beat seventy half-built ones.
Skip it → you teach the agent the same workflow every Monday morning.
Plan, Explore, code-reviewer, general-purpose. Route to the cheapest model that can do the job; isolate context budgets so the main thread stays clean.
Skip it → one giant context that costs more, remembers worse, and crashes louder.
Per-project enablement for money, email, customer data. Auth-scope minimisation everywhere. Treat each server as a supply-chain edge, not a feature.
Skip it → the agent can talk to your data but never act on it, or worse, acts with the wrong scope.
PreToolUse fires before any permission check. The one place where you get guaranteed behaviour. Build the fs-guard and the secret-scan hook before anything else.
Skip it → the agent will eventually run the destructive command you assumed was safe.
Deny rules survive --dangerously-skip-permissions. Build the deny list first; promote frequently-asked commands to allow. Full pipeline: Hooks → Deny → Allow → Ask → Mode.
Skip it → you spend your days clicking "Allow" instead of writing code.
Show model, context-remaining, cost, mode. Prevents the silent footgun of running in the wrong mode for an hour without noticing.
Skip it → three hours into a session you discover you've been on the wrong model the whole time.
Load-bearing primitives that compose with the seven. The decomposition is pedagogical; the inventory keeps growing as the platform does.
Skip it → you reinvent each of these the long way, on a Friday afternoon, in a panic.
§03 · The artefacts
Every deliverable is MIT-licensed and designed to be lifted directly into your ~/.claude/ directory. Working files, not reference documentation. Read the theory if you want to understand. Take the starter kit if you want to start.
Cross-checked against Böckeler, Osmani, Trivedy, Anthropic. The conceptual ground floor.
THEORY.md →Seven layers, five archetypes, the 80/20 hacks, the anti-patterns, the capability test.
HANDBOOK.md →Each week leaves you measurably more capable than the last. No mystery boxes.
GETTING-STARTED.md →CLAUDE.md template, settings.json with deny-list, two PreToolUse hooks, four subagents, INDEX.md.
Browse starter-kit/ →Printable. Pin to your wall. Every move on one screen, including the deny-list seed.
CHEAT-SHEET.html →41 falsifiable hypotheses, three deep-research providers, twelve corrections folded back. Run your own pass.
VALIDATION.md →§04 · The stress test
41 hypotheses, each falsifiable, dispatched to three independent providers with the explicit instruction to surface disconfirming evidence as eagerly as confirming. Where they agreed, the verdict held. Where they disagreed, primary sources won. Twelve corrections were folded back. Full appendix in VALIDATION.md.
Four arXiv papers cited. Two substantive corrections found. The most rigorous of the three.
Strong tables. Independently confirmed the missing sixth permission mode. Surfaced one false-negative hypothesis.
Thinnest of the three. Confirmed the defensible core; weaker on long-tail claims. Useful triangulation.
Corrections folded back into the handbook. None of these were small.
A sixth permission mode (auto) shipped 2026-03-24. The original draft listed only five.
The full control pipeline is Hooks → Deny → Allow → Ask → Mode, not just Deny → Ask → Allow.
50% token reduction from model routing → 40–70% on research tasks; can be negative ROI on small mechanical tasks.
Project-revenue claim retired. Single-practitioner data, not a base rate.
Anthropic devcontainer added as the canonical safe sandbox for --dangerously-skip-permissions.
Skill and MCP supply-chain risk. Anthropic's own skills docs warn explicitly about malicious skills.
The Ratchet Principle Every observed failure becomes a tighter rule in the next release. The kit only ever gets quieter and harder to break.
§06 · One last thing
Twenty-five minutes of reading, one afternoon of setup, and you stop running an LLM and start running an agent. The model has never been the bottleneck.