Version v0.1 · Dated 16 May 2026 · Licence MIT · open

The model is intelligence. The harness is the work.

Most agent failures aren't the model being stupid. They're the harness being thin: missing context, missing tools, missing feedback, missing guardrails. This is the kit for the harness. Opinionated. Evidence-tested. Built backwards from real failures rather than forwards from wishful features.

Read the handbook · View on GitHub · Cheat sheet

The literal article. The harness is what the animal cannot run without. Nobody photographs it.

Stress-tested the day it was written

Anthropic: ~50 URLs · Google: 25 sources · OpenAI: 24 footnotes · 12 corrections folded back

For most of 2026 we kept blaming the model. The agent forgot context, so the model wasn't smart enough. The agent ran the wrong tool, so the model wasn't smart enough. The agent deleted a file it shouldn't have touched, so the model wasn't smart enough.

It was never the model. It was the configuration around it.

The single most important claim in the field, from Addy Osmani: a decent model with a great harness beats a great model with a bad harness. Once you have read that line, the work of the next ten years rearranges itself. The model is intelligence. The harness is everything else, and everything else is where the work has quietly been the whole time.

HARNESS is the part of that work made public. Seven layers, each with concrete files. Two enforced primitives, so discipline becomes part of the environment instead of a hope. Built backwards from real practitioner failures. Stress-tested against three independent deep-research providers the day it was written. Twelve corrections folded back, on the record, in VALIDATION.md.

§02 · The architecture

Seven layers. One closed loop.

Birgitta Böckeler's framework: every harness control is either a guide (feedforward, steers before the agent acts) or a sensor (feedback, observes after). Build both. Guides without sensors is hope. Sensors without guides is thrashing.

01 Guide Memory

CLAUDE.md

Three files, three scopes. Global <30 lines, project source-controlled, local gitignored. Read as context, not enforced as config.

Skip it → the agent forgets what your project is on every other turn.
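
A minimal sketch of how the "Global <30 lines" budget could be checked. The file location, the function names, and the choice to count only non-blank lines are assumptions made for this example, not part of the kit.

```python
from pathlib import Path

GLOBAL_LINE_BUDGET = 30  # from "Global <30 lines" in the Memory layer


def count_content_lines(text: str) -> int:
    """Count non-blank lines. Ignoring blank lines is a choice made for this sketch."""
    return sum(1 for line in text.splitlines() if line.strip())


def within_budget(text: str, budget: int = GLOBAL_LINE_BUDGET) -> bool:
    """True when the text stays strictly under the line budget."""
    return count_content_lines(text) < budget


if __name__ == "__main__":
    # Assumed location of the global memory file; adjust to your own setup.
    global_file = Path.home() / ".claude" / "CLAUDE.md"
    if global_file.exists():
        text = global_file.read_text(encoding="utf-8")
        status = "ok" if within_budget(text) else "over budget"
        print(f"{global_file}: {count_content_lines(text)} lines ({status})")
    else:
        print(f"{global_file} not found")
```

Run it in a pre-commit step or by hand; either way the budget stops being a good intention and becomes a number.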

02 Guide Skills

Reusable recipes

Loaded on demand by trigger phrase. Build a skill the second time you do something. Seven good skills beat seventy half-built ones.

Skip it → you teach the agent the same workflow every Monday morning.

03 Guide · Orchestration Subagents

Isolated parallel sessions

Plan, Explore, code-reviewer, general-purpose. Route to the cheapest model that can do the job; isolate context budgets so the main thread stays clean.

Skip it → one giant context that costs more, remembers worse, and crashes louder.
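
"Route to the cheapest model that can do the job" can be sketched as a lookup. Everything here is illustrative: the tier names, the capability scores, the relative costs, and the difficulty assigned to each role are assumptions, not measurements from the kit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str             # hypothetical tier label, not a real model id
    capability: int       # higher means it can handle harder tasks
    relative_cost: float  # cost relative to the cheapest tier


# Illustrative tiers only; substitute the models you actually have access to.
TIERS = [
    Tier("small", capability=1, relative_cost=1.0),
    Tier("medium", capability=2, relative_cost=5.0),
    Tier("large", capability=3, relative_cost=25.0),
]

# Assumed difficulty for each subagent role named in the layer description.
REQUIRED_CAPABILITY = {
    "Explore": 1,
    "general-purpose": 2,
    "code-reviewer": 2,
    "Plan": 3,
}


def route(role: str) -> Tier:
    """Return the cheapest tier whose capability meets the role's requirement."""
    needed = REQUIRED_CAPABILITY[role]
    candidates = [tier for tier in TIERS if tier.capability >= needed]
    return min(candidates, key=lambda tier: tier.relative_cost)


if __name__ == "__main__":
    for role in REQUIRED_CAPABILITY:
        print(f"{role} -> {route(role).name}")
```

The point of the table is that the routing decision is written down once, where it can be reviewed, instead of being re-made per session.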

04 Tools MCP

4–6 servers, always on

Per-project enablement for money, email, customer data. Auth-scope minimisation everywhere. Treat each server as a supply-chain edge, not a feature.

Skip it → the agent can talk to your data but never act on it, or worse, acts with the wrong scope.

05 Sensor · Enforced Hooks

Deterministic shell scripts on lifecycle events

PreToolUse fires before any permission check. The one place where you get guaranteed behaviour. Build the fs-guard and the secret-scan hook before anything else.

Skip it → the agent will eventually run the destructive command you assumed was safe.
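
A sketch of what an fs-guard PreToolUse hook might look like. It is written in Python rather than shell so the decision logic can be tested. It assumes the hook receives the pending tool call as JSON on stdin with `tool_name` and `tool_input` fields, and that exit code 2 blocks the call and returns stderr to the agent; check both against the current hooks reference before relying on them. The patterns and protected paths are hypothetical seeds, not the kit's reviewed list.

```python
import json
import re
import sys

# Hypothetical seed patterns; a real guard needs a reviewed, project-specific list.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-[a-zA-Z]*r"),           # recursive rm, e.g. rm -rf, rm -r
    re.compile(r"\bgit\s+push\b.*--force"),      # force push
]

# Hypothetical path fragments the agent should never write to.
PROTECTED_PATH_PARTS = (".env", ".git/", ".ssh/")


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, message). 0 lets the call proceed, 2 blocks it."""
    tool = event.get("tool_name", "")
    tool_input = event.get("tool_input") or {}

    if tool == "Bash":
        command = tool_input.get("command", "")
        for pattern in DESTRUCTIVE_PATTERNS:
            if pattern.search(command):
                return 2, f"fs-guard: blocked command matching {pattern.pattern!r}"

    if tool in {"Write", "Edit"}:
        path = tool_input.get("file_path", "")
        for part in PROTECTED_PATH_PARTS:
            if part in path:
                return 2, f"fs-guard: blocked write to protected path {path!r}"

    return 0, ""


def main() -> int:
    """Read one event from stdin, report any block reason on stderr."""
    event = json.load(sys.stdin)
    code, message = decide(event)
    if message:
        print(message, file=sys.stderr)
    return code
```

To install it as a hook script, end the file with `sys.exit(main())`. Keeping `decide` free of I/O means the guard can be unit-tested before it is trusted with a real session.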

06 Sensor · Enforced Permissions

Six modes, deny rules survive everything

Deny rules survive --dangerously-skip-permissions. Build the deny list first; promote frequently-asked commands to allow. Full pipeline: Hooks → Deny → Allow → Ask → Mode.

Skip it → you spend your days clicking "Allow" instead of writing code.
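
The order Hooks → Deny → Allow → Ask → Mode can be sketched as a first-match evaluator. This is a simplified model of the pipeline, not the platform's implementation: the glob-style rules, the hook signature, and the `mode_fallback` parameter are assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from typing import Callable, Iterable, Sequence


def matches(command: str, patterns: Iterable[str]) -> bool:
    """True if the command matches any glob-style pattern."""
    return any(fnmatchcase(command, pattern) for pattern in patterns)


def evaluate(
    command: str,
    *,
    hooks: Sequence[Callable[[str], bool]],
    deny: Sequence[str],
    allow: Sequence[str],
    ask: Sequence[str],
    mode_fallback: str,
) -> str:
    """Walk the pipeline in order and return the first decision reached."""
    # 1. Hooks run first; a hook returning True blocks the call outright.
    for hook in hooks:
        if hook(command):
            return "blocked-by-hook"
    # 2. Deny rules come before anything permissive, so no mode can override them.
    if matches(command, deny):
        return "deny"
    # 3. Allow rules skip the prompt for commands already promoted.
    if matches(command, allow):
        return "allow"
    # 4. Ask rules force a prompt.
    if matches(command, ask):
        return "ask"
    # 5. Nothing matched: the active mode decides.
    return mode_fallback


if __name__ == "__main__":
    rules = dict(deny=["rm -rf *"], allow=["git status*"], ask=["git push*"])
    for cmd in ["rm -rf /", "git status", "git push origin main", "make test"]:
        print(cmd, "->", evaluate(cmd, hooks=[], mode_fallback="allow", **rules))
```

Note what happens to `rm -rf /` when `mode_fallback` is `"allow"`: the deny rule still wins, because the mode is only consulted last.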

07 Observability Statusline

Statusline + output styles

Show model, context-remaining, cost, mode. Prevents the silent footgun of running in the wrong mode for an hour without noticing.

Skip it → three hours into a session you discover you've been on the wrong model the whole time.
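
A sketch of a statusline renderer showing the four fields named above: model, context remaining, cost, mode. The dictionary keys are hypothetical; the data your statusline command actually receives will have its own field names.

```python
def render_statusline(state: dict) -> str:
    """Format model, context remaining, cost, and mode on a single line."""
    model = state.get("model", "unknown")
    mode = state.get("mode", "unknown")
    used = state.get("context_used_tokens", 0)
    limit = state.get("context_limit_tokens", 0)
    cost = state.get("cost_usd", 0.0)

    # Guard against a missing or zero limit rather than dividing by zero.
    remaining_pct = round(100 * (limit - used) / limit) if limit else 0
    return f"{model} | ctx {remaining_pct}% left | ${cost:.2f} | {mode}"


if __name__ == "__main__":
    example = {
        "model": "example-model",       # placeholder, not a real model id
        "mode": "plan",
        "context_used_tokens": 50_000,
        "context_limit_tokens": 200_000,
        "cost_usd": 1.5,
    }
    print(render_statusline(example))
```

Putting the mode last keeps it at the edge of the line, where a change is easiest to spot.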

+ Composes

Plugins · slash commands · git worktrees

Load-bearing primitives that compose with the seven. The decomposition is pedagogical; the inventory keeps growing as the platform does.

Skip it → you reinvent each of these the long way, on a Friday afternoon, in a panic.

The loop. Guides set up the agent so it acts correctly. Sensors observe what actually happened and feed it back as memory and as enforcement. Hooks → Deny → Allow → Ask → Mode is the control pipeline; everything you write goes through it whether you wrote it that way or not.

§03 · The artefacts

Take what you need. It is all open.

Every deliverable is MIT-licensed and designed to be lifted directly into your ~/.claude/ directory. Working files, not reference documentation. Read the theory if you want to understand. Take the starter kit if you want to start.

Theory · 12 min

What a harness is, and why it matters.

Cross-checked against Böckeler, Osmani, Trivedy, Anthropic. The conceptual ground floor.

THEORY.md →

Long-form · 25 min

The full reference handbook.

Seven layers, five archetypes, the 80/20 hacks, the anti-patterns, the capability test.

HANDBOOK.md →

Build path · 8 min

Six weeks. One layer per week.

Each week leaves you measurably more capable than the last. No mystery boxes.

GETTING-STARTED.md →

Files · drop-in

The starter kit.

CLAUDE.md template, settings.json with deny-list, two PreToolUse hooks, four subagents, INDEX.md.

Browse starter-kit/ →

One page · 2 min

The cheat sheet.

Printable. Pin to your wall. Every move on one screen, including the deny-list seed.

CHEAT-SHEET.html →

Methodology · 8 min

How the kit was stress-tested.

41 falsifiable hypotheses, three deep-research providers, twelve corrections folded back. Run your own pass.

VALIDATION.md →

§04 · The stress test

We sent it to three deep-research providers the day we wrote it. We told them to break it.

41 hypotheses, each falsifiable, dispatched to three independent providers with the explicit instruction to surface disconfirming evidence as eagerly as confirming. Where they agreed, the verdict held. Where they disagreed, primary sources won. Twelve corrections were folded back. Full appendix in VALIDATION.md.
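
The adjudication rule in that paragraph is small enough to write down. A minimal sketch, assuming each provider returns one verdict string per hypothesis; the verdict labels and the `"unresolved"` fallback are assumptions for this example, and the actual procedure is the one recorded in VALIDATION.md.

```python
from typing import Optional


def adjudicate(verdicts: dict[str, str], primary_source: Optional[str] = None) -> str:
    """Resolve one hypothesis from per-provider verdicts.

    If all providers agree, their verdict holds. If they disagree, the
    verdict supported by primary sources wins. With no primary source to
    hand, the hypothesis stays unresolved.
    """
    distinct = set(verdicts.values())
    if len(distinct) == 1:
        return distinct.pop()
    if primary_source is not None:
        return primary_source
    return "unresolved"


if __name__ == "__main__":
    agreed = {"anthropic": "confirmed", "google": "confirmed", "openai": "confirmed"}
    split = {"anthropic": "refuted", "google": "confirmed", "openai": "confirmed"}
    print(adjudicate(agreed))
    print(adjudicate(split, primary_source="refuted"))
    print(adjudicate(split))
```

Note that a two-to-one majority does not settle a split: only a primary source does.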

Anthropic Deep Research

~50 URLs

Four arXiv papers cited. Two substantive corrections found. The most rigorous of the three.

Google Deep Research

25 sources

Strong tables. Independently confirmed the missing sixth permission mode. Surfaced one false-negative hypothesis.

OpenAI Deep Research

24 footnotes

Thinnest of the three. Confirmed the defensible core; weaker on long-tail claims. Useful triangulation.

Corrections folded back into the handbook. None of these were small.

Factual error fixed

A sixth permission mode (auto) shipped 2026-03-24. The original draft listed only five.

Factual error fixed

The full control pipeline is Hooks → Deny → Allow → Ask → Mode, not just Deny → Ask → Allow.

Overclaim qualified

The claimed 50% token reduction from model routing was revised to 40–70% on research tasks; routing can be negative ROI on small mechanical tasks.

Overclaim qualified

Project-revenue claim retired. Single-practitioner data, not a base rate.

Missing primitive added

Anthropic devcontainer added as the canonical safe sandbox for --dangerously-skip-permissions.

Missing anti-pattern added

Skill and MCP supply-chain risk. Anthropic's own skills docs warn explicitly about malicious skills.

The Ratchet Principle. Every observed failure becomes a tighter rule in the next release. The kit only ever gets quieter and harder to break.

§05 · Signed

Four agents. One panel. One human director.

The kit was authored by a panel of AI agents working in parallel sessions, then reviewed against each other before publishing. Human direction throughout. First names only; the work is what matters.

Framing · Academic rigour

Lars

Led the architecture of the handbook. Insisted on the capability test. Wrote the closing argument.

Engineering · Hooks

Theo

Built the starter kit. Wrote the two PreToolUse hooks. Drafted the settings.json deny rules.

Security review

Felix

Locked the deny list. Reviewed the secret-scan patterns. Surfaced the auth-scope minimisation principle.

Positioning · Copy

Rian

Shaped the public narrative, the cheat sheet, and this landing page. Made the kit findable, not just usable.

§06 · One last thing

Stop blaming the model. Build the harness.

Twenty-five minutes of reading, one afternoon of setup, and you stop running an LLM and start running an agent. The model has never been the bottleneck.
