Version v0.1 · Dated 16 May 2026 · Licence MIT · open

The model is intelligence. The harness is the work.

Most agent failures aren't the model being stupid. They're the harness being thin: missing context, missing tools, missing feedback, missing guardrails. This is the kit for the harness. Opinionated. Evidence-tested. Built backwards from real failures rather than forwards from wishful features.

Close-up of a traditional leather horse harness with polished brass buckles, lit dramatically against a dark backdrop
The literal article. The harness is what the animal cannot run without. Nobody photographs it.
Stress-tested the day it was written
Anthropic: ~50 URLs · Google: 25 sources · OpenAI: 24 footnotes · 12 corrections folded back

Why this exists

For most of 2026 we kept blaming the model. The agent forgot context, so the model wasn't smart enough. The agent ran the wrong tool, so the model wasn't smart enough. The agent deleted a file it shouldn't have touched, so the model wasn't smart enough.

It was never the model. It was the configuration around it.

The single most important claim in the field, from Addy Osmani: a decent model with a great harness beats a great model with a bad harness. Once you have read that line, the work of the next ten years rearranges itself. The model is intelligence. The harness is everything else, and everything else is where the work has quietly been the whole time.

HARNESS is the part of that work made public. Seven layers, each with concrete files. Two enforced primitives, so discipline becomes part of the environment instead of a hope. Built backwards from real practitioner failures. Stress-tested against three independent deep-research providers the day it was written. Twelve corrections folded back, on the record, in VALIDATION.md.

§02 · The architecture

Seven layers. One closed loop.

Birgitta Böckeler's framework: every harness control is either a guide (feedforward, steers before the agent acts) or a sensor (feedback, observes after). Build both. Guides without sensors is hope. Sensors without guides is thrashing.

01 Guide Memory

CLAUDE.md

Three files, three scopes. Global <30 lines, project source-controlled, local gitignored. Read as context, not enforced as config.

Skip it → the agent forgets what your project is on every other turn.
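
The three scopes can be checked mechanically. A minimal sketch, assuming the files live at ~/.claude/CLAUDE.md, ./CLAUDE.md and ./CLAUDE.local.md (confirm those paths against your own installation). The names memory_report and GLOBAL_LINE_BUDGET are illustrative and not part of any tool, and counting only non-blank lines is our own reading of the "<30 lines" rule.

```python
from pathlib import Path

GLOBAL_LINE_BUDGET = 30  # the "<30 lines" rule of thumb from this section


def count_lines(path: Path) -> int:
    """Return the number of non-blank lines, or 0 if the file is absent."""
    if not path.is_file():
        return 0
    text = path.read_text(encoding="utf-8")
    return sum(1 for line in text.splitlines() if line.strip())


def memory_report(home: Path, project: Path) -> dict[str, str]:
    """Summarise the three memory files. The paths are assumed conventions."""
    files = {
        "global": home / ".claude" / "CLAUDE.md",
        "project": project / "CLAUDE.md",
        "local": project / "CLAUDE.local.md",
    }
    report = {}
    for scope, path in files.items():
        n = count_lines(path)
        if n == 0:
            report[scope] = "missing or empty"
        elif scope == "global" and n >= GLOBAL_LINE_BUDGET:
            report[scope] = f"too long ({n} lines)"
        else:
            report[scope] = f"ok ({n} lines)"
    return report
```

Calling memory_report(Path.home(), Path.cwd()) from a project root gives a one-line status per scope. It only measures the files; nothing here enforces them, which matches the point that memory is read as context.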

02 Guide Skills

Reusable recipes

Loaded on-demand by trigger phrase. Build the second time you do something. Seven good skills beat seventy half-built ones.

Skip it → you teach the agent the same workflow every Monday morning.

03 Guide · Orchestration Subagents

Isolated parallel sessions

Plan, Explore, code-reviewer, general-purpose. Route to the cheapest model that can do the job; isolate context budgets so the main thread stays clean.

Skip it → one giant context that costs more, remembers worse, and crashes louder.
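
"Route to the cheapest model that can do the job" reduces to a small selection rule. A sketch under stated assumptions: the tier names, capability scores and relative costs are invented for illustration and are not real model identifiers or prices, and the difficulty assigned to each subagent role is our guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str             # hypothetical tier label, not a real model id
    capability: int       # higher means it can handle harder tasks
    relative_cost: float  # illustrative units, not real pricing


TIERS = [
    Tier("small", capability=1, relative_cost=1.0),
    Tier("medium", capability=2, relative_cost=5.0),
    Tier("large", capability=3, relative_cost=25.0),
]

# Assumed difficulty for the subagent roles named in this section.
REQUIRED_CAPABILITY = {
    "Explore": 1,
    "general-purpose": 2,
    "code-reviewer": 2,
    "Plan": 3,
}


def route(role: str) -> Tier:
    """Pick the cheapest tier whose capability meets the role's requirement."""
    needed = REQUIRED_CAPABILITY.get(role, 3)  # unknown roles get the strongest tier
    eligible = [t for t in TIERS if t.capability >= needed]
    return min(eligible, key=lambda t: t.relative_cost)
```

Unknown roles default to the strongest tier on purpose: under-routing fails quietly, over-routing only costs money.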

04 Tools MCP

4–6 servers, always on

Per-project enablement for money, email, customer data. Auth-scope minimisation everywhere. Treat each server as a supply-chain edge, not a feature.

Skip it → the agent can talk to your data but never act on it, or worse, acts with the wrong scope.
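
Per-project enablement and scope minimisation can both be written as set arithmetic. A rough sketch: every server name below is hypothetical, and the always-on versus sensitive split is an assumption about how you would classify your own servers.

```python
# Hypothetical server names; classify your own servers before reusing this.
ALWAYS_ON = {"docs", "search", "git", "issues"}
SENSITIVE = {"payments", "email", "crm"}  # money, email, customer data


def enabled_servers(project_opt_in: set[str]) -> set[str]:
    """Always-on servers plus only the sensitive ones this project opted into."""
    unknown = project_opt_in - SENSITIVE - ALWAYS_ON
    if unknown:
        # An unreviewed server is a new supply-chain edge, so refuse it.
        raise ValueError(f"unreviewed servers: {sorted(unknown)}")
    return ALWAYS_ON | (project_opt_in & SENSITIVE)


def excess_scopes(granted: set[str], needed: set[str]) -> set[str]:
    """Scopes a server holds beyond what the workflow needs. Ideally empty."""
    return granted - needed
```

A project that never opts in gets only the always-on set. A non-empty result from excess_scopes is a prompt to re-issue the credential with fewer scopes.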

05 Sensor · Enforced Hooks

Deterministic shell scripts on lifecycle events

PreToolUse fires before any permission check. The one place where you get guaranteed behaviour. Build the fs-guard and the secret-scan hook before anything else.

Skip it → the agent will eventually run the destructive command you assumed was safe.
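
The fs-guard and secret-scan checks can share one decision function. This is a sketch, not the kit's actual hooks: it assumes the hook receives a JSON payload on stdin with tool_name and tool_input fields, and that exit code 2 blocks the call (check both against the current hooks documentation). The patterns are deliberately small and illustrative; a real deny list needs review.

```python
import json
import re
import sys

# Illustrative patterns only, not a complete deny list.
DANGEROUS_COMMANDS = [
    re.compile(r"\brm\s+-[a-z]*(rf|fr)", re.IGNORECASE),
    re.compile(r"\bgit\s+push\b.*--force"),
]
PROTECTED_PATHS = [
    re.compile(r"(^|/)\.env(\.|$)"),
    re.compile(r"(^|/)\.ssh/"),
]
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # shape of an AWS access key id
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def decide(payload: dict) -> tuple[bool, str]:
    """Return (allowed, reason) for one PreToolUse payload."""
    tool = payload.get("tool_name", "")
    args = payload.get("tool_input", {})
    if tool == "Bash":
        command = args.get("command", "")
        for pattern in DANGEROUS_COMMANDS:
            if pattern.search(command):
                return False, f"fs-guard: blocked command matching {pattern.pattern!r}"
    path = args.get("file_path", "")
    if any(p.search(path) for p in PROTECTED_PATHS):
        return False, f"fs-guard: {path} is protected"
    text = " ".join(str(v) for v in args.values())
    if any(p.search(text) for p in SECRET_PATTERNS):
        return False, "secret-scan: payload looks like it contains a credential"
    return True, "ok"


def main() -> int:
    """Read one payload from stdin and translate the decision to an exit code."""
    allowed, reason = decide(json.load(sys.stdin))
    if not allowed:
        print(reason, file=sys.stderr)
        return 2  # assumed blocking exit code
    return 0
```

A hook script would end with raise SystemExit(main()). Keeping decide free of I/O means the rules can be unit-tested without launching an agent.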

06 Sensor · Enforced Permissions

Six modes, deny rules survive everything

Deny rules survive --dangerously-skip-permissions. Build the deny list first; promote frequently requested commands to allow. Full pipeline: Hooks → Deny → Allow → Ask → Mode.

Skip it → you spend your days clicking "Allow" instead of writing code.
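
The ordering Hooks → Deny → Allow → Ask → Mode can be read as a first-match evaluator. A toy model only: the evaluate function, the boolean hook signature and the glob-style rules are assumptions made for illustration, and the real rule syntax and matching semantics may differ.

```python
from fnmatch import fnmatch
from typing import Callable

Hook = Callable[[str], bool]  # returns True when the hook wants to block


def evaluate(command: str, hooks: list[Hook], deny: list[str],
             allow: list[str], ask: list[str], mode_default: str) -> str:
    """Walk Hooks -> Deny -> Allow -> Ask -> Mode and return the first decision."""
    for hook in hooks:                           # 1. hooks run first
        if hook(command):
            return "deny"
    if any(fnmatch(command, p) for p in deny):   # 2. deny rules
        return "deny"
    if any(fnmatch(command, p) for p in allow):  # 3. allow rules
        return "allow"
    if any(fnmatch(command, p) for p in ask):    # 4. ask rules
        return "ask"
    return mode_default                          # 5. the mode decides the rest
```

Because the mode is consulted last, a permissive mode_default of "allow" never reaches a command that a deny rule already matched. In this toy model, that is all "deny rules survive everything" means.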

07 Observability Statusline

Statusline + output styles

Show model, context-remaining, cost, mode. Prevents the silent footgun of running in the wrong mode for an hour without noticing.

Skip it → three hours into a session you discover you've been on the wrong model the whole time.
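
The four fields fit on one line. A small sketch of the formatting step alone, assuming a state dictionary with the hypothetical keys model, context_remaining_pct, cost_usd and mode; the real statusline input may use different names.

```python
def render_statusline(state: dict) -> str:
    """Format model, context remaining, cost and mode as one line."""
    model = state.get("model", "?")
    mode = state.get("mode", "?")
    ctx = state.get("context_remaining_pct")
    cost = state.get("cost_usd")
    # Missing values render as "?" so a gap is visible rather than silent.
    ctx_text = f"{ctx:.0f}% ctx" if ctx is not None else "ctx ?"
    cost_text = f"${cost:.2f}" if cost is not None else "$?"
    return " | ".join([model, ctx_text, cost_text, mode])
```

Unknown values show up as question marks instead of being dropped, since the point of the line is to make a wrong or missing setting obvious.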

+ Composes

Plugins · slash commands · git worktrees

Load-bearing primitives that compose with the seven. The decomposition is pedagogical; the inventory keeps growing as the platform does.

Skip it → you reinvent each of these the long way, on a Friday afternoon, in a panic.

The loop. Guides set up the agent so it acts correctly. Sensors observe what actually happened and feed it back as memory and as enforcement. Hooks → Deny → Allow → Ask → Mode is the control pipeline; everything you write goes through it whether you wrote it that way or not.

§04 · The stress test

We sent it to three deep-research providers the day we wrote it. We told them to break it.

41 hypotheses, each falsifiable, dispatched to three independent providers with the explicit instruction to surface disconfirming evidence as eagerly as confirming. Where they agreed, the verdict held. Where they disagreed, primary sources won. Twelve corrections were folded back. Full appendix in VALIDATION.md.
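
The adjudication rule in that paragraph is short enough to write down. A sketch, assuming each provider returns a verdict string per hypothesis and that a primary-source verdict is supplied by hand when one exists. The function name and the "unresolved" fallback are ours, not taken from VALIDATION.md.

```python
from typing import Optional


def adjudicate(verdicts: list[str], primary_source: Optional[str] = None) -> str:
    """Unanimous verdicts hold; otherwise the primary source wins."""
    if verdicts and len(set(verdicts)) == 1:
        return verdicts[0]
    if primary_source is not None:
        return primary_source
    return "unresolved"  # disagreement with no primary source to settle it
```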

Anthropic Deep Research
~50 URLs

Four arXiv papers cited. Two substantive corrections found. The most rigorous of the three.

Google Deep Research
25 sources

Strong tables. Independently confirmed the missing sixth permission mode. Surfaced one false-negative hypothesis.

OpenAI Deep Research
24 footnotes

Thinnest of the three. Confirmed the defensible core; weaker on long-tail claims. Useful triangulation.

Corrections folded back into the handbook. None of these were small.

Factual error fixed

A sixth permission mode (auto) shipped 2026-03-24. The original draft listed only five.

Factual error fixed

The full control pipeline is Hooks → Deny → Allow → Ask → Mode, not just Deny → Ask → Allow.

Overclaim qualified

The claimed 50% token reduction from model routing became 40–70% on research tasks; routing can have negative ROI on small mechanical tasks.

Overclaim qualified

Project-revenue claim retired. Single-practitioner data, not a base rate.

Missing primitive added

Anthropic devcontainer added as the canonical safe sandbox for --dangerously-skip-permissions.

Missing anti-pattern added

Skill and MCP supply-chain risk. Anthropic's own skills docs warn explicitly about malicious skills.

The Ratchet Principle. Every observed failure becomes a tighter rule in the next release. The kit only ever gets quieter and harder to break.

§05 · Signed

Four agents. One panel. One human director.

The kit was authored by a panel of AI agents working in parallel sessions, then reviewed against each other before publishing. Human direction throughout. First names only; the work is what matters.

Framing · Academic rigour

Lars

Led the architecture of the handbook. Insisted on the capability test. Wrote the closing argument.

Engineering · Hooks

Theo

Built the starter kit. Wrote the two PreToolUse hooks. Drafted the settings.json deny rules.

Security review

Felix

Locked the deny list. Reviewed the secret-scan patterns. Surfaced the auth-scope minimisation principle.

Positioning · Copy

Rian

Shaped the public narrative, the cheat sheet, and this landing page. Made the kit findable, not just usable.

§06 · One last thing

Stop blaming the model. Build the harness.

Twenty-five minutes of reading, one afternoon of setup, and you stop running an LLM and start running an agent. The model has never been the bottleneck.