Harness Engineering for Coding Agent Users: Building Trust and Quality in AI-Assisted Development

The concept of a "harness" in the context of AI agents, particularly those designed for coding, has evolved from a broad definition encompassing all components beyond the core model to a more nuanced understanding. As articulated by experts in the field, the equation Agent = Model + Harness highlights that the intelligence of an AI agent is amplified by the supporting infrastructure around it. For coding agents, this infrastructure, or harness, is crucial for maximizing their effectiveness and reliability. While a foundational harness is often built into the agent through system prompts, code retrieval mechanisms, and sophisticated orchestration systems, users can and should build an outer harness tailored to their specific use cases and existing systems.

This outer harness serves two critical objectives: it significantly increases the probability of the agent producing correct outputs from the outset and establishes a feedback loop that enables self-correction, catching and rectifying numerous issues before they ever reach human review. Ultimately, this leads to a reduction in review toil, an improvement in system quality, and a more efficient use of computational resources, evidenced by fewer wasted tokens.
The Evolution of AI Agent Infrastructure
The term "harness" in AI development has gained prominence as a way to describe the external mechanisms that guide and constrain AI models, ensuring their outputs align with desired outcomes. Initially, the term was used broadly to denote anything that wasn’t the core AI model itself. However, as AI agents, particularly coding agents, become more sophisticated and integrated into development workflows, a more refined definition is necessary.
The development of coding agents can be traced back to early attempts at automated code generation and assistance. Systems like GitHub Copilot, launched in 2021, marked a significant leap, demonstrating the potential of large language models (LLMs) to suggest code snippets and entire functions. However, the inherent probabilistic nature of LLMs means that generated code can sometimes be incorrect, insecure, or deviate from project standards. This is where the concept of the "harness" becomes paramount.
The "Agent = Model + Harness" paradigm posits that a powerful AI model, when coupled with a robust harness, can achieve far greater reliability and utility. The harness acts as a control mechanism, providing context, enforcing rules, and enabling feedback. For coding agents, this translates into a system that not only understands code but also adheres to project-specific conventions, architectural principles, and functional requirements.

Feedforward and Feedback Mechanisms: The Pillars of Harness Design
Effective harness engineering for coding agents hinges on two fundamental mechanisms: feedforward and feedback. Feedforward mechanisms proactively guide the agent towards desirable outputs by providing it with relevant information and constraints before it generates code. This can include detailed instructions, architectural guidelines, coding conventions, and access to relevant documentation. The goal is to increase the likelihood of correct output from the initial generation.
Conversely, feedback mechanisms allow the agent to self-correct by evaluating its generated output and identifying deviations from expected standards or requirements. These feedback mechanisms, or sensors, act as quality gates, flagging errors or areas for improvement. When combined, feedforward and feedback create a synergistic system. A feedforward-only approach might encode rules but has no way to verify they were followed, so violations slip through unnoticed. A feedback-only approach might detect errors, but without guidance encoded up front, the agent is likely to keep making the same kinds of mistakes.
The integration of both feedforward and feedback is essential for a self-improving system. For instance, a feedforward guide might include a comprehensive set of coding conventions. If the agent consistently violates a specific convention, a feedback sensor (like a linter or a review agent) can detect this violation. This detected issue then informs an iterative refinement of the feedforward guides, perhaps by making the convention more explicit or by providing a concrete example of its correct application. This continuous loop of guidance and correction is the hallmark of a well-engineered harness.
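This guidance-and-correction loop can be sketched in a few lines of Python. The agent callable, sensor functions, and the "promote a twice-violated rule into a guide" policy below are all hypothetical stand-ins chosen for illustration, not an established API:

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """Minimal feedforward/feedback loop: guides shape the prompt,
    sensors check the output, and repeated failures refine the guides."""
    guides: list[str] = field(default_factory=list)        # feedforward
    failure_counts: dict[str, int] = field(default_factory=dict)

    def run(self, agent, sensors, task, max_attempts=3):
        for _ in range(max_attempts):
            # Feedforward: every guide is passed as context with the task.
            output = agent(task, context=self.guides)
            # Feedback: run each sensor; collect non-empty violation messages.
            violations = [msg for sensor in sensors if (msg := sensor(output))]
            if not violations:
                return output
            for msg in violations:
                self.failure_counts[msg] = self.failure_counts.get(msg, 0) + 1
                # Steering: a rule violated repeatedly becomes an explicit guide.
                if self.failure_counts[msg] >= 2 and msg not in self.guides:
                    self.guides.append(msg)
        return output  # best effort after max_attempts
```

The key design choice is that sensor messages feed back into the guide list, so the harness itself accumulates the conventions the agent keeps missing.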
Computational vs. Inferential Controls: A Dual Approach
Within the feedforward and feedback loops, controls can be categorized as either computational or inferential. Computational controls leverage deterministic tools and logic to guide and evaluate code. These are typically fast, reliable, and inexpensive to run. Examples include static analysis tools (linters, formatters), architectural constraint checkers, and script-based code transformations. They excel at enforcing well-defined rules and detecting syntactical or structural issues.
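A computational sensor can be as small as a script over the syntax tree. The sketch below flags calls to functions a project forbids; the banned-function list is an assumed example rule, not a standard:

```python
import ast

BANNED_CALLS = {"eval", "exec"}  # assumed project rule, for illustration

def banned_call_sensor(source: str) -> list[str]:
    """Deterministic sensor: flag calls to functions the project forbids.
    Returns one message per violation, including the line number."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            violations.append(
                f"line {node.lineno}: call to banned function '{node.func.id}'")
    return violations
```

Because it is pure parsing, a check like this is cheap enough to run on every generation or commit.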
Inferential controls, on the other hand, rely on more sophisticated, often non-deterministic, reasoning capabilities, typically powered by LLMs. While more expensive and less predictable than computational controls, inferential mechanisms can provide richer guidance and perform semantic judgment. They can understand nuanced instructions, assess code quality based on broader context, and even infer intent. Inferential sensors can add a layer of semantic understanding to the review process, identifying issues that purely computational tools might miss.

The strategic interplay between these two types of controls is key. Computational guides, such as code generation templates or linters configured with strict rules, increase the probability of good results with deterministic tooling. Computational sensors, like pre-commit hooks running structural tests, are fast and cheap enough to be integrated into every change. Inferential guides, such as detailed explanations of desired code behavior or examples of best practices, offer rich guidance. Inferential sensors, such as AI-powered code review agents, can provide additional semantic judgment, increasing trust, especially when paired with capable LLMs.
| Regulation Dimension | Direction | Computational / Inferential | Example Implementations |
|---|---|---|---|
| Coding Conventions | Feedforward | Inferential | AGENTS.md (documentation), Skills (instructional modules) |
| Project Bootstrapping | Feedforward | Both | Skill with bootstrapping instructions and an accompanying script |
| Code Transformation | Feedforward | Computational | Tools like OpenRewrite for automated code refactoring |
| Structural Tests | Feedback | Computational | Pre-commit hooks running ArchUnit tests to enforce module boundaries |
| Review Instructions | Feedback | Inferential | Skills providing detailed guidance on how to review code |
| Semantic Code Duplication | Feedback | Inferential | AI agents identifying conceptually similar but differently written code sections |
| Over-engineered Solutions | Feedback | Inferential | AI agents flagging code that appears unnecessarily complex for its task |
| Misunderstood Instructions | Feedback | Inferential | AI agents identifying generated code that deviates from the core intent of the prompt |
The Steering Loop: Human Oversight and Iterative Improvement
The ultimate goal of harness engineering is not to eliminate human developers but to augment their capabilities and focus their efforts where they are most impactful. The "steering loop" represents the ongoing process of human oversight and iterative refinement of the harness. When an issue is repeatedly encountered, it signals an opportunity to improve the feedforward guides or feedback sensors. This iterative process ensures that the harness evolves alongside the codebase and the development team’s understanding.
Moreover, AI itself can play a role in improving the harness. As coding agents become more capable, they can be leveraged to build more sophisticated controls. This includes assisting in the creation of structural tests, generating draft rules from observed coding patterns, scaffolding custom linters, or even deriving how-to guides from codebase archaeology. This creates a virtuous cycle where AI assists in building better AI tools.
Timing: Keeping Quality Left in the Development Lifecycle
A core principle in modern software development, particularly with practices like continuous integration and continuous delivery (CI/CD), is to "keep quality left." This means identifying and rectifying issues as early as possible in the development lifecycle, where they are significantly cheaper and easier to fix. The cost of fixing a bug found during development is orders of magnitude lower than fixing one found in production.
Harness components, both feedforward and feedback, must be strategically distributed across the change lifecycle. Early stages of a code change might involve lightweight computational checks like linting and formatting. As the change progresses, more complex computational tests, such as unit and integration tests, are run. Inferential checks, which might be more time-consuming or resource-intensive, can be deployed at various stages, including as part of code review processes or in more comprehensive pipeline stages.

Feedforward and Feedback in the Change Lifecycle Examples:
- Initial Generation: Language Server Protocols (LSPs) for real-time suggestions, architectural documentation (architecture.md), instructional skills (/how-to-test skill), agent configuration files (AGENTS.md), knowledge management integrations (MCP server), and API documentation skills (/xyz-api-docs skill) all serve as feedforward inputs.
- First Self-Correction Loop: Feedback sensors like code review prompts (/code-review), linters (npx eslint), static analysis tools (semgrep), code coverage tools (npm run coverage), and dependency analysis tools (npm run dep-cruiser) operate on the initial generation.
- Human Review: This remains a critical feedback sensor, providing a layer of semantic understanding and contextual awareness that automated tools may lack.
- Integration and Pipeline: Post-integration, all previous sensors are rerun. More expensive or time-consuming sensors, such as architectural review skills (/architecture-review skill), detailed review skills (/detailed-review skill), and mutation testing, can be employed in the CI/CD pipeline. Feedback from these stages can then trigger new commits by agents or humans.
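The cost ordering implied by these stages can be made explicit in a staged pipeline: cheap computational checks run first, and expensive inferential checks only run on changes that survive them. The stage names and sensor functions below are hypothetical placeholders:

```python
def run_staged_sensors(change, stages):
    """Run sensor stages in cost order; stop at the first failing stage
    so expensive checks never run on changes that fail cheap ones.
    `stages` is a list of (stage_name, [sensor, ...]); each sensor
    returns a list of findings (empty list means pass)."""
    report = {}
    for name, sensors in stages:
        findings = [f for sensor in sensors for f in sensor(change)]
        report[name] = findings
        if findings:
            break  # keep quality left: fail fast before costlier stages
    return report

# Hypothetical sensors of increasing cost:
def lint(change):        return [] if change["formatted"] else ["format violation"]
def unit_tests(change):  return [] if change["tests_pass"] else ["unit test failure"]
def ai_review(change):   return []  # stand-in for an expensive inferential sensor
```

A failed lint stage in this sketch means the AI review stage never runs, which is exactly the token-saving behavior the lifecycle distribution is meant to produce.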
Continuous Drift and Health Sensors:
Beyond the immediate change lifecycle, continuous monitoring of the codebase and runtime environment provides ongoing feedback.
- Continuous Drift Detection: Sensors like /find-dead-code, /code-coverage-quality, and automated dependency updates (e.g., Dependabot) monitor for gradual degradation in code health.
- Continuous Runtime Feedback: Monitoring of latency, error rates, and Service Level Objectives (SLOs) can trigger suggestions from coding agents for performance improvements. AI judges can also analyze /response-quality-sampling and /log-anomalies to identify systemic issues.
Regulation Categories: Defining the Scope of Harnesses
To effectively manage the complexity of AI agent harnesses, it is useful to categorize them based on what they are designed to regulate. This allows for more precise language and targeted development. The following categories offer a framework for understanding the diverse applications of harness engineering:
Maintainability Harness
This category encompasses guides and sensors focused on improving internal code quality and maintainability. This is often the easiest type of harness to implement due to the abundance of pre-existing tooling. Computational sensors reliably catch structural issues like duplicate code, high cyclomatic complexity, insufficient test coverage, architectural drift, and style violations. These are typically cheap, proven, and deterministic.
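One of the structural issues listed, high cyclomatic complexity, illustrates how cheap such sensors can be. The sketch below approximates a McCabe count by walking the syntax tree; the branch-node set and threshold are simplifying assumptions, and nested functions are counted toward their enclosing function:

```python
import ast

# Nodes that add a decision branch (a simplified McCabe count).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.And, ast.Or,
                 ast.ExceptHandler, ast.IfExp)

def complexity_sensor(source: str, threshold: int = 10) -> list[str]:
    """Flag functions whose approximate cyclomatic complexity exceeds
    the threshold. Complexity = 1 + number of branch points."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(n, _BRANCH_NODES) for n in ast.walk(node))
            if 1 + branches > threshold:
                findings.append(f"{node.name}: complexity {1 + branches} > {threshold}")
    return findings
```

In practice a maintainability harness would reach for an existing tool, but the point stands: this class of sensor is deterministic and nearly free to run.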

LLMs can assist with issues requiring semantic judgment, such as identifying semantically duplicate code, redundant tests, or over-engineered solutions. However, these inferential approaches are more expensive and probabilistic, making them less suitable for every commit. Critically, neither computational nor inferential sensors can reliably catch high-impact problems like misdiagnosed issues, overengineering with unnecessary features, or misunderstood instructions without significant human oversight. The ultimate correctness of the output remains dependent on the clarity of the initial human specification.
Architecture Fitness Harness
This category focuses on guides and sensors that define and enforce the architectural characteristics of an application. Essentially, these are "Fitness Functions" for code architecture. They ensure that the codebase adheres to predefined architectural principles, patterns, and constraints.
Examples include:
- Architectural Constraint Rules: Defining rules for module dependencies, component interactions, and technology stack usage.
- Fitness Functions: Implementing checks that verify adherence to specific architectural qualities, such as performance, security, or scalability.
- Design Pattern Enforcement: Guiding agents to implement code using specific design patterns and verifying their correct application.
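A minimal fitness function for the first example, architectural constraint rules, might check that imports respect a layering order. The layer names and ordering below are assumptions for illustration (tools like ArchUnit do this for Java at much greater depth):

```python
import ast

# Assumed layering rule: a module may import only from its own layer
# or lower ones (domain < service < api).
LAYER_ORDER = {"domain": 0, "service": 1, "api": 2}

def layering_fitness(module_layer: str, source: str) -> list[str]:
    """Fitness function: flag imports that reach into a higher layer."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([node.module] if isinstance(node, ast.ImportFrom) and node.module
                     else [alias.name for alias in node.names])
            for name in names:
                top = name.split(".")[0]
                if top in LAYER_ORDER and LAYER_ORDER[top] > LAYER_ORDER[module_layer]:
                    violations.append(
                        f"{module_layer} module imports from higher layer '{top}'")
    return violations
```

Run as a pre-commit hook, a check like this turns an architectural principle into a concrete, self-correcting feedback signal for the agent.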
Behavior Harness
This is arguably the most challenging category: how to guide and sense if an application functionally behaves as intended. Current approaches often involve AI-generated tests, which, while useful, are not yet reliable enough to fully replace human oversight. Some teams are exploring the "approved fixtures" pattern, where a set of predefined, accepted outputs serves as a baseline for testing. However, this approach is not universally applicable and faces challenges in maintaining and updating fixtures.
Significant work remains to develop robust behavior harnesses that can instill sufficient confidence to reduce supervision and manual testing. The current reliance on AI-generated tests places a considerable burden of trust on the AI’s ability to comprehensively understand and test functional requirements.

Harnessability: The Codebase as a Foundation
The amenability of a codebase to harnessing, or "harnessability," is a critical factor. Codebases written in strongly typed languages, for instance, naturally benefit from type-checking as an inherent sensor. Clearly defined module boundaries facilitate the implementation of architectural constraint rules. Frameworks that abstract away complexities implicitly increase an agent’s chances of success by reducing the surface area of potential errors. Conversely, codebases lacking these properties present greater challenges for harness development.
This distinction is particularly pronounced between greenfield and legacy projects. Greenfield projects offer the opportunity to bake harnessability into the foundation from day one through deliberate technology and architecture choices. Legacy systems, especially those burdened with technical debt, present a more difficult scenario: the harness is often most needed where it is hardest to build and implement.
Harness Templates: Standardizing AI-Assisted Development
In mature engineering organizations, common service topologies (e.g., data dashboards, event processing services, API-driven business services) are often codified in service templates. These templates can evolve into "harness templates" – pre-packaged bundles of guides and sensors designed to align a coding agent with the specific structure, conventions, and technology stack of a particular topology. This could lead to teams selecting technologies and architectural patterns partially based on the availability of robust harness templates.
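A harness template could be declared as a small, versioned data structure bundling guides and sensors per topology. Everything in this sketch, from the field names to the example entries, is hypothetical rather than an established format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HarnessTemplate:
    """A versioned bundle of guides and sensors for one service topology."""
    topology: str
    version: str
    guides: tuple[str, ...]    # feedforward: docs, skills, conventions
    sensors: tuple[str, ...]   # feedback: checks run against each change

EVENT_SERVICE = HarnessTemplate(
    topology="event-processing-service",
    version="1.2.0",
    guides=("AGENTS.md", "architecture.md", "/how-to-test skill"),
    sensors=("npx eslint", "npm run coverage", "/architecture-review skill"),
)
```

Making the bundle explicit and versioned is what would let teams track drift from upstream template improvements, the same problem service templates already face.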
While promising, harness templates would face similar challenges to service templates, including versioning, contribution management, and the potential for teams to fall out of sync with upstream improvements. The non-deterministic nature of some guides and sensors could exacerbate these issues, making testing and maintenance more complex.
The Indispensable Role of the Human Developer
Human developers bring an invaluable, albeit often implicit, harness to every codebase. Their experience imbues them with an understanding of conventions, an intuition for complexity, and a sense of ownership. They possess organizational alignment, understanding team goals, acceptable levels of technical debt, and context-specific definitions of "good." This human expertise is applied incrementally, allowing for reflection and adaptation.

Coding agents, by contrast, lack this intrinsic understanding. They do not possess social accountability, aesthetic judgment for code quality, or the intuitive "we don’t do it that way here" sense. They have no organizational memory and cannot discern between load-bearing conventions and mere habits. Crucially, they cannot inherently know whether a technically correct solution aligns with the team’s strategic objectives.
Harnesses are an attempt to externalize and formalize this human expertise. However, they can only go so far. Building a comprehensive system of guides, sensors, and self-correction loops is resource-intensive. Therefore, the primary goal of a good harness should not be to eliminate human input entirely but to intelligently direct it to the most critical areas where human judgment and experience are indispensable.
Open Questions and Future Directions
The mental model of harness engineering, encompassing feedforward and feedback, computational and inferential controls, and categorized regulation dimensions, provides a valuable framework for understanding and discussing AI agent development. It aims to elevate the conversation beyond individual features to the strategic design of control systems that foster genuine confidence in AI-generated outputs.
However, numerous open questions remain. How can harnesses be maintained coherently as they grow, ensuring guides and sensors remain synchronized and non-contradictory? To what extent can we trust AI agents to make sensible trade-offs when conflicting instructions and feedback signals arise? If sensors rarely trigger, does this indicate high quality or inadequate detection mechanisms? There is a clear need for evaluation metrics for harness coverage and quality, analogous to code coverage and mutation testing for software tests.
Currently, feedforward and feedback controls are often scattered across the development lifecycle. There is significant potential for tooling that facilitates the configuration, synchronization, and holistic reasoning about these controls as an integrated system. Ultimately, building this outer harness is emerging not as a one-time configuration task but as an ongoing engineering practice, integral to the evolution of AI-assisted software development. The continued exploration and refinement of these harness strategies will be critical in unlocking the full potential of coding agents while mitigating their inherent risks.



