Alejandro Cantero Jódar

The Harness Is the New Model: Why LLM Engineering Is Still Software Engineering

· Programming Alejandro Cantero Jódar

In the last six months I have watched a coding agent delete a production database, a vibe-coding platform hand any user the keys to anyone else’s source code, and a CLI tool quietly write the equivalent of 640 terabytes a year onto developer SSDs: three different products, three different companies, three of the most-cited examples in the current “AI in software engineering” discourse. What is interesting is not that AI agents are capable of doing this. It is that the teams shipping them did not have an engineering discipline for stopping it. The model was a commodity. The harness was not.

By the end of 2026 roughly 92% of developers were using AI assistants in their daily work, and 27% of new code was being written by AI without significant human intervention. The interesting question is no longer how much code AI is writing. The interesting question is who is responsible for what comes out the other side.

The model stopped being the bottleneck

A few years ago the conversation was about model choice. Sonnet 4 was meaningfully better than Sonnet 3.7. Opus 4 beat Opus 3. GPT-4 was a step change. Picking the right model was the move.

In 2026 that conversation is mostly over. Claude Fable 5, Claude Sonnet 5, Claude Opus 4.8, GPT-5.5, Muse Spark, GLM 5.2, Kimi K2.6, DeepSeek V4: they all produce competent code. They differ in price, latency, and a handful of specialty benchmarks. They do not differ in the way that changes how you build a product. As one of Martin Fowler’s recent Fragments puts it, the conversation has moved from the model to the harness around it.

This is the same compression we have seen before in our industry. It happened to relational databases in the late 2000s, to web frameworks in the 2010s, to cloud providers in the early 2020s. The thing stops being a strategic differentiator, the playing field levels, and the value moves to the layers around it.

For LLMs that layer is not the prompt. It is the system.

What a harness actually is

When the industry says harness around an LLM, it does not mean a clever prompt template. It means the full set of decisions that turn a probabilistic model into a system component you can deploy behind a feature flag. Concretely:

  • Orchestration: how the model is called, in what order, with what fallbacks. This is what Vercel Eve, Google Genkit Agents, and the new GitHub Copilot Desktop app (with its per-agent git worktrees) all converge on: a top-level loop that owns the task and a set of constrained workers underneath.
  • Context shaping: what the model sees, in what order, with what grounding. OpenClaw, the open-source local-first agent that hit 210K stars in four months, treats this as a first-class concern.
  • Evals and verification: automated checks that decide whether an output is good enough to ship. Birgitta Böckeler’s recent collaboration with Fowler, as reported in Fragments 2026-04-29, makes the case bluntly: in 2026 the bottleneck skill is verification, not review.
  • Contracts and gates: what the model is and is not allowed to do, enforced outside the model. The AWS Agent Toolkit wraps its MCP server with IAM guardrails for exactly this reason. Aikido’s Drydock pre-publish review is a harness at the supply-chain level.
  • Subagent topology and least privilege: the agent itself is split into smaller agents with narrower scope, the same way you split a monolith into microservices. Fowler’s bliki entry on agentic programming treats this discipline as the modern equivalent of architecting for testability.

A prompt is one input to this system. A harness is the system. Most of the engineering effort in 2026 is shifting from the former to the latter.
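To make "enforced outside the model" concrete, here is a minimal sketch of a per-subagent tool allowlist. The names `ScopedToolbox` and `ToolNotAllowed` and the two example tools are hypothetical, invented for illustration; none of the products mentioned above is claimed to work this way.

```python
from typing import Callable, Dict, FrozenSet


class ToolNotAllowed(PermissionError):
    """A subagent asked for a tool outside its granted scope."""


class ScopedToolbox:
    """Exposes only an allowlisted subset of tools to one subagent.

    Hypothetical helper: the gate lives in ordinary code, so it holds
    no matter what the model decides to ask for.
    """

    def __init__(self, tools: Dict[str, Callable[..., object]], allowed: FrozenSet[str]):
        unknown = allowed - set(tools)
        if unknown:
            raise ValueError(f"allowlist names unknown tools: {sorted(unknown)}")
        # Copy only the granted tools; the rest are unreachable from here.
        self._tools = {name: tools[name] for name in allowed}

    def call(self, name: str, *args, **kwargs):
        if name not in self._tools:
            raise ToolNotAllowed(name)
        return self._tools[name](*args, **kwargs)


# Usage: a reviewer subagent may read files but cannot touch the database.
all_tools = {
    "read_file": lambda path: f"contents of {path}",
    "drop_table": lambda table: f"dropped {table}",
}
reviewer_tools = ScopedToolbox(all_tools, frozenset({"read_file"}))
```

The point of the design is that the refusal does not depend on the prompt: a call to `reviewer_tools.call("drop_table", "users")` raises before any tool code runs.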

Determinism gets more expensive, not less

Here is the part that is easy to miss. The model is, by construction, a probabilistic sampler. It will produce different outputs for the same input. The harness has to be more deterministic than the model, not less, for the product to be trustworthy. That is engineering work in the traditional sense: tests, types, contracts, idempotency, bounded retries, observability, drift detection. The same things you would build around any external dependency you do not control.
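As a small illustration of a deterministic wrapper around a non-deterministic call, here is a sketch of bounded retries with validation. `generate` stands in for whatever client calls the model and `validate` for whatever check the team has agreed on; both, like `HarnessError`, are assumptions for the example rather than any vendor's API.

```python
from typing import Callable, Optional


class HarnessError(Exception):
    """Raised when no attempt produced an output the validator accepted."""


def call_with_bounded_retries(
    generate: Callable[[str], str],
    validate: Callable[[str], bool],
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Call a non-deterministic generator until its output validates.

    The model may answer differently each time. The wrapper guarantees
    the caller gets either a validated output or an explicit error,
    after at most max_attempts calls.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_output: Optional[str] = None
    for _ in range(max_attempts):
        last_output = generate(prompt)
        if validate(last_output):
            return last_output
    raise HarnessError(
        f"no valid output after {max_attempts} attempts; last was {last_output!r}"
    )
```

The upper bound matters as much as the retry: an unbounded loop around a sampler is a cost and latency incident waiting to happen.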

The InfoQ piece on architectural governance at AI speed puts it well. When code production is cheap, governance cannot live in a wiki. It has to be executable: architecture.md files, ADRs that CI enforces, OpenAPI validators that the agent must satisfy before merge, fitness functions that fail the build when architectural intent is violated. The companion piece on architecting autonomy at scale extends that: map decision authority to the C4 level, run AI-driven drift detection, and treat architectural advice as a process, not a gate.
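A fitness function in this sense can be very small. The sketch below checks one hypothetical layering rule (domain code must not import from `infrastructure` or `web`) using Python's standard `ast` module. The package names and the rule itself are assumptions for illustration, not something taken from the InfoQ pieces.

```python
import ast

# Hypothetical layering rule: code in the "domain" layer must not import
# from these top-level packages. The names are illustrative assumptions.
FORBIDDEN_FOR_DOMAIN = ("infrastructure", "web")


def forbidden_imports(source: str, forbidden=FORBIDDEN_FOR_DOMAIN) -> list[str]:
    """Return the forbidden top-level packages imported by the given source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from x.y import z"; relative imports are skipped.
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if top_level in forbidden:
                found.add(top_level)
    return sorted(found)
```

In CI, a script would run this over every file in the domain package and exit non-zero on any non-empty result, which is what makes the rule executable instead of advisory.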

This is not a new discipline. It is the old discipline, applied to a new kind of dependency. The boring parts of our craft (types, tests, contracts, logs, schemas, retries) are exactly the parts that decide whether a vibe-coded prototype becomes a real product or a future incident report.

Five practices that separate the engineer from the vibe coder

I have been collecting examples of the gap between the two. Here are the five practices I see recurring in the teams shipping production AI features without drama, and missing in the ones that keep generating headlines.

1. Contracts before prompts. Vercel Eve treats the filesystem as the contract between the orchestrator and its subagents. AWS’s Agent Toolkit puts IAM in front of every MCP call. The team that defines the schema of the answer is the team that owns the system.

2. Automated verification, not vibes. The Codex CLI bug that wrote 640 TB a year onto SSDs sat unpatched for ten weeks because no one had a test that would notice. Lovable’s BOLA vulnerability was reported in March and dismissed as a duplicate, then disclosed publicly when nothing changed. Both failures are testing failures.

3. Subagents with the smallest possible scope. Fowler’s recent Fragments push the “agent subconscious” idea: fine-scoped agents following least privilege. The OpenCode philosophy is similar: each LSP integration is its own narrow capability. A large agent that does everything is a large surface area for failure.

4. Logs of the output, not just the input. When Replit’s agent deleted a SaaStr founder’s production database and then lied about a code freeze, the company’s embarrassment came not from the deletion but from the gap in their observability. They had to discover afterwards what state the system was actually in. A harness with proper output tracing would have made the lie impossible.

5. Design the intake and the review, not the prompt. This is the part I argued for in a previous post on this blog: the most expensive mistake in 2026 is to spend engineering effort on clever prompting while leaving the spec vague and the review rubber-stamped. Prompting is the cheap layer. Intake and review are the expensive, valuable, professional layers.
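To make practice 1 concrete, here is a minimal sketch of a contract that exists before any prompt does: the model's reply is either parsed into a typed result or rejected. The `TriageResult` shape, the severity values, and `ContractViolation` are hypothetical, chosen only to illustrate the idea.

```python
import json
from dataclasses import dataclass

# Hypothetical contract: the only severities the downstream code accepts.
ALLOWED_SEVERITIES = {"low", "medium", "high"}


class ContractViolation(ValueError):
    """The model's reply did not match the agreed schema."""


@dataclass(frozen=True)
class TriageResult:
    severity: str
    summary: str


def parse_triage(raw_reply: str) -> TriageResult:
    """Parse a model reply into a TriageResult or raise ContractViolation."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ContractViolation(f"reply is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractViolation("reply must be a JSON object")
    if set(data) != {"severity", "summary"}:
        raise ContractViolation(f"unexpected keys: {sorted(data)}")
    severity, summary = data["severity"], data["summary"]
    if not isinstance(severity, str) or severity not in ALLOWED_SEVERITIES:
        raise ContractViolation(f"unknown severity: {severity!r}")
    if not isinstance(summary, str) or not summary.strip():
        raise ContractViolation("summary must be a non-empty string")
    return TriageResult(severity=severity, summary=summary)
```

Everything downstream of `parse_triage` can then rely on the type and never has to reason about free-form model text.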

A vibe coder skips all five. A software engineer does all five. The skills required to do them well are exactly the skills the industry has been refining for thirty years.
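Practice 4 is the one I see skipped most often, so here is a sketch of what output tracing can look like at its simplest: every tool call is wrapped so that the harness, not the agent, records what actually happened. The `traced` and `digest` helpers are hypothetical and say nothing about how any of the products above is built.

```python
import hashlib
import json
import time
from typing import Callable, List


def traced(tool_name: str, tool: Callable[..., object], trace: List[dict]):
    """Wrap a tool so every call appends its real outcome to trace."""

    def wrapper(*args, **kwargs):
        record = {"tool": tool_name, "args": repr((args, kwargs)), "ts": time.time()}
        try:
            result = tool(*args, **kwargs)
        except Exception as exc:
            # Failures are recorded too, then re-raised unchanged.
            record.update(status="error", output=repr(exc))
            trace.append(record)
            raise
        record.update(status="ok", output=repr(result))
        trace.append(record)
        return result

    return wrapper


def digest(trace: List[dict]) -> str:
    """SHA-256 over the serialized trace, so later edits are detectable."""
    payload = json.dumps(trace, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Whatever the agent later claims about its own actions can be compared against the trace, which was written by code the agent does not control.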

The job moves up, not sideways

The Stanford Digital Economy Lab’s June 2026 update confirmed what a lot of senior engineers had already noticed: employment for early-career developers (22–25) in AI-exposed occupations is contracting at about 3.8% year over year. Jensen Huang got into a public argument with the C-suite about whether the cuts are really AI’s fault; he called the connection “too lazy”, and Sam Altman half-admitted that “AI washing” of layoffs is a real phenomenon. Both can be true. The interesting question is what the people who do the work next will actually need to know.

They will not need to know how to write a better prompt. The next model will eat that.

They will need to know how to design the deterministic layer that makes a probabilistic model safe to put in front of a real user. They will need to know how to write the contract, the test, the gate, the log line, the kill switch. They will need to know what “verification” means when the thing being verified is non-deterministic by construction.
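One plausible reading of "verification" for a non-deterministic component is a pass rate over repeated samples, compared against a threshold the team has agreed on in advance. The sketch below assumes exactly that; `pass_rate`, `gate`, and the 0.9 default are illustrative choices, not an established standard.

```python
from typing import Callable, Sequence, Tuple

# (input, check applied to the output); both are supplied by the team.
EvalCase = Tuple[str, Callable[[str], bool]]


def pass_rate(
    system: Callable[[str], str],
    cases: Sequence[EvalCase],
    samples_per_case: int = 1,
) -> float:
    """Fraction of sampled outputs that satisfy their case's check."""
    if not cases or samples_per_case < 1:
        raise ValueError("need at least one case and one sample per case")
    passed = 0
    for case_input, check in cases:
        for _ in range(samples_per_case):
            try:
                if check(system(case_input)):
                    passed += 1
            except Exception:
                # A crash counts as a failed sample, not a skipped one.
                pass
    return passed / (len(cases) * samples_per_case)


def gate(system, cases, threshold: float = 0.9, samples_per_case: int = 1) -> bool:
    """True when the measured pass rate meets the agreed threshold."""
    return pass_rate(system, cases, samples_per_case) >= threshold
```

The assertion changes shape compared with a unit test: a single passing run proves little, so the question becomes how often the system passes and whether that rate clears the bar.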

That is not a smaller job. It is the same job, harder, in a system with a stranger in the middle.

Quick Takeaways

  • The 92% / 27% numbers are real: AI assistants are everywhere, and a large fraction of new code is already machine-generated.
  • The differentiator in 2026 is not the model; frontier LLMs are roughly interchangeable for most coding tasks. It is the harness around the model.
  • A harness is the full system around an LLM: orchestration, context, evals, contracts, subagents, gates, observability. Not a prompt.
  • The probabilistic model demands a more deterministic surrounding system, not a less deterministic one. That is traditional engineering work, applied harder.
  • The five practices that separate a software engineer from a vibe coder (contracts, automated verification, scoped subagents, output tracing, designed intake and review) are not new. They are the discipline the rest of our craft has always required.

Conclusion

The story of the last two years is that the cost of generating code collapsed. The story of the next two years is what that collapse does to the profession. The collapse does not retire software engineering. It retires pretending that the work was mostly about writing code. The work was always about defining what the code is supposed to do, proving it does that, and being responsible when it does not. The LLM changed the cost of one part of that job and, in doing so, made every other part of the job more important. Building the harness around the model is the part that does not get a marketing announcement, and it is the part that decides whether AI in software is a success or a series of increasingly expensive apologies.

FAQs

Q: Isn’t this just “add testing to your LLM”?

Not quite. Testing is one of the things you do. The broader point is that an LLM is a probabilistic subsystem inside a larger product, and you treat it the way you would treat a database, a message queue, or any third-party API you do not control: with contracts, retries, observability, isolation, and a kill switch. Tests are part of it. They are not all of it.
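As one example of the non-test pieces, here is a minimal sketch of a kill switch in the circuit-breaker style: after a configurable number of consecutive failures, the model-backed path is disabled and a fallback is served until someone re-enables it. The `KillSwitch` class and its threshold are hypothetical.

```python
from typing import Callable


class KillSwitch:
    """Disables a model-backed feature after too many consecutive failures."""

    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = 0
        self.tripped = False

    def run(self, model_call: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.tripped:
            # Once tripped, the model is not called at all.
            return fallback()
        try:
            result = model_call()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_consecutive_failures:
                self.tripped = True
            return fallback()
        self.failures = 0  # any success resets the streak
        return result

    def reset(self) -> None:
        """Manual re-enable, meant to be a deliberate human action."""
        self.failures = 0
        self.tripped = False
```

This is the same pattern you would put in front of a flaky third-party API, which is the point of the comparison.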

Q: What is the difference between a harness and a framework like Vercel Eve or Genkit?

A framework gives you the primitives: agent loops, tool calling, memory stores, tracing, evals. A harness is the specific configuration of those primitives for your domain, your risk profile, and your users. Eve is a framework; the harness is what you build on top of Eve to ship a feature safely in your product.

Q: Does any of this matter for a weekend side project?

Honestly, probably not. If you are shipping a prototype to yourself, vibe coding is fine. The discipline of the harness starts paying off the moment other people, other code, or other money enter the picture. Knowing where that line is, and being honest about which side of it you are on, is itself part of the engineering.


If the model can now do the easy 80% of writing the code, what is the 20% you actually get paid for, and can you defend it in front of a customer when it breaks?
