- AI-native
- Agents
- Spec-driven
- Claude Code
One prompt in. A team of agents out.
Field notes from Prompt Night in Florianópolis: the harness that turns one prompt into a fleet of agents, and how to teach it your craft.
My phone shipped a smart contract while I was asleep. Base mainnet, 4:17 in the morning, receipt in 8 minutes. No one was at the keyboard.
That push notification is the whole argument. 3 years ago I typed code into an editor. Last year I typed prompts into a chat. Now I review work from a fleet of agents while I sleep. Same person, 3 layers up.
The bottleneck moved.
You're not prompting. You're engineering a harness. The model is one box. Everything around it is the part you build: the contract you write, the skills you teach it, the tests that catch it. Birgitta Böckeler calls this harness engineering, and the name fits.
Stripe's coding agents merge more than 1,300 pull requests a week. Around 70% land unmodified. The model didn't get that much smarter in a year. The harness did.
Spec first. The guardrail AI can't fake.
We ran the same ticket 3 ways: same model, 3 git worktrees, a 25-minute cap. Add reply-rate by hour of day to campaign analytics. One prompt, 3 disciplines.
Vibe-coding put a chart on the page in 18 minutes with 0 tests. It also shipped a silent bug. The numerator counted replies by reply hour. The denominator counted sends by send hour. A 9am reply to a 3am send landed in 2 different buckets. The metric was meaningless, and nothing caught it.
Spec-driven took 8 minutes and 10 tests. The spec named that exact bug in plain English on line 1, before any code ran. Discipline shipped faster than the shortcut. Vibe pays its tax mid-flight, in debug loops you can't see coming.
- vibe-code · 0 tests · wrong
- 18m
- spec-driven · 10 tests · correct
- 8m
- where the spec named the bug
- line 1
The rule we use:
- Vibe for throwaway spikes that live under an hour.
- Plan for solo work that stays in your head.
- Spec for anything a second person or agent will touch, or anything shipping to production.
Failing test second.
A red test proves the prompt was understood. A green test proves the work is done. In between sits a one-line diff and an atomic commit a reviewer reads in 12 seconds.
test("rejects expired session cookie", async () => {
const res = await app.request("/api/me", {
headers: { cookie: `sid=${expiredSid}` },
});
expect(res.status).toBe(401);
});
// RED: received 200 → the prompt landed
// FIX: one guard clause, then GREEN
// git: fix(auth): reject expired session cookieSkills are the new function calls.
A skill is a markdown file with 2 required fields and a body. The frontmatter is an index: a name and a description. The body is the runbook, and it loads only when the description matches the task. 200 skills sit dormant and cost almost nothing until one fires.
This very talk was scoped with a brainstorming skill. The audit skills that reviewed our contracts were the same shape. Once you see it, you stop writing prompts and start writing skills.
Teach Claude your craft.
My colleague Julia Buratto gave the companion talk: teach Claude one thing exactly the way you do it, live, in your own repo. 5 steps. Install it, write the contract, set up one expert, fire it, iterate.
The contract is CLAUDE.md: short, always on, what the agent must never get wrong. Rules tell it what. Skills tell it how. 6 lines beat 6 pages.
Then match the leash to the task. A fragile step like a deploy gets exact instructions. An open-ended one like naming or voice gets direction and trust.
A skill that grades your harness.
Julia also built harness-doctor, a skill that audits the thing it's made of. Point it at any repo and it grades 7 dimensions: contract, skills, guides, sensors, hooks, memory, and the improvement loop. It prints a scorecard and the cheapest fixes first.
npx skills add jubscodes/harness-doctorIt's a worked example of a good skill: a lean orchestrator, with the depth one hop away, loaded only when needed.
4 minutes, not 16.
Map a 200,000-line repo with one agent. It reads everything, then loses the thread by token 80,000. Fan the work out instead. One conductor, 4 specialists: a codebase mapper, a dependency scanner, a history reader, an external search. Each returns a structured report, not a raw dump. You pay for the slowest agent, not the sum.
Subagents are MapReduce: spawn helpers, collect reports, no cross-talk. Agent teams are a small crew that argues, with a shared task list and direct messages. Same physics, different token bill. Run them in separate git worktrees so 3 agents don't fight over one index.
Cost is a knob, not a verdict.
3 dials keep a $40 weekend from becoming $400. Split the models: Opus plans and reviews, Sonnet executes. Watch the context: compact long sessions, start fresh when the cache window slips. Compress the chatter: a caveman hook drops articles and filler and takes 30 to 60% off a long run.
OutreachAI proved it. 5 specs and a failing test on Friday night. Agents ran while I reviewed pull requests from the beach on my phone. By Sunday: 26 specs, 2,200 TypeScript files, about $40 spent. Each teammate got an agent in their own voice.
Receipts, not demos.
For ipê-city we ran the same product through 4 frameworks in parallel: Spec Kit, GSD, Superpowers, and a Ralph loop. 2 of them generated 14× the code of the winner and never merged. The Ralph branch shipped 13,800 lines, 570 tests, 93.66% measured coverage. It's the only one I trusted on main.
The agent audited its own contracts with 4 security skills, fixed a real integer overflow on milestone math, and deployed 6 contracts to Base. Every evaluation writes a transaction hash a user can click. Type-checks lie. Unit tests lie. An on-chain assertion doesn't.
The agent audited its own contracts, fixed the overflow, and opened the PR. No human in the verify loop.
That 4:17 smart contract from the top? It was a cron job in vlad-cockpit, the panel I'm building to run a dozen agents at once. The receipt was real. The demo was asleep.
Then write down what you fixed twice and make it a skill. Ship the receipt, not the demo.