The Anatomy of an Agent Harness
https://blog.langchain.com/the-anatomy-of-an-agent-harness
By Vivek Trivedy
TLDR: Agent = Model + Harness. Harness engineering is how we build systems around models to turn them into work engines. The model contains the intelligence and the harness makes that intelligence useful. We define what a harness is and derive the core components today’s and tomorrow’s agents need.
Can Someone Please Define a “Harness”?
Agent = Model + Harness
If you’re not the model, you’re the harness.
A harness is every piece of code, configuration, and execution logic that isn’t the model itself. A raw model is not an agent. But it becomes one when a harness gives it things like state, tool execution, feedback loops, and enforceable constraints.
Concretely, a harness includes things like:
- System Prompts
- Tools, Skills, MCPs, and their descriptions
- Bundled Infrastructure (filesystem, sandbox, browser)
- Orchestration Logic (subagent spawning, handoffs, model routing)
- Hooks/Middleware for deterministic execution (compaction, continuation, lint checks)
There are many messy ways to split the boundaries of an agent system between the model and the harness. But in my opinion, this is the cleanest definition because it forces us to think about designing systems around model intelligence.
The rest of this post walks through core harness components and derives why each piece exists working backwards from the core primitive of a model.

Why Do We Need Harnesses…From a Model’s Perspective
There are things we want an agent to do that a model cannot do out of the box. This is where a harness comes in. Models (mostly) take in data like text, images, audio, and video, and they output text. That's it. Out of the box they cannot:
- Maintain durable state across interactions
- Execute code
- Access realtime knowledge
- Set up environments and install packages to complete work
These are all harness-level features. The structure of LLMs requires machinery that wraps them to do useful work. For example, to get a product UX like "chatting", we wrap the model in a while loop that tracks previous messages and appends new user messages. Everyone reading this has already used this kind of harness. The main idea is that we want to convert a desired agent behavior into an actual feature in the harness.
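That chat harness really is just a few lines of code. A minimal sketch, with `call_model` standing in for any LLM API call (stubbed to echo here so the sketch is self-contained):

```python
def call_model(messages):
    # Placeholder for a real LLM API call; echoes the last message for illustration.
    return f"You said: {messages[-1]['content']}"

def chat_harness(user_inputs):
    """A minimal chat harness: the model is stateless, so the harness
    tracks history and replays the full transcript on every turn."""
    messages = []
    replies = []
    for text in user_inputs:
        messages.append({"role": "user", "content": text})
        reply = call_model(messages)  # model sees the whole history
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies
```

The "memory" the user experiences lives entirely in the harness's `messages` list, not in the model.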
Working Backwards from Desired Agent Behavior to Harness Engineering
Harness Engineering helps humans inject useful priors to guide agent behavior. And as models have gotten more capable, harnesses have been used to surgically extend and correct models to complete previously impossible tasks.
We won’t go over an exhaustive list of every harness feature. The goal is to derive a set of features from the starting point of helping models do useful work. We’ll follow a pattern like this:
Behavior we want (or want to fix) → Harness Design to help the model achieve this.

Filesystems for Durable Storage and Context Management
We want agents to have durable storage to interface with real data, offload information that doesn’t fit in context, and persist work across sessions.
Models can only directly operate on knowledge within their context window. Before filesystems, users had to copy/paste content directly to the model; that's clunky UX and doesn't work for autonomous agents. The world was already using filesystems to do work, so models were naturally trained on billions of tokens of how to use them. The natural solution became:
Harnesses ship with filesystem abstractions and tools for fs-ops.
The filesystem is arguably the most foundational harness primitive because of what it unlocks:
- Agents get a workspace to read data, code, and documentation.
- Work can be incrementally added and offloaded instead of holding everything in context. Agents can store intermediate outputs and maintain state that outlasts a single session.
- The filesystem is a natural collaboration surface. Multiple agents and humans can coordinate through shared files. Architectures like Agent Teams rely on this.
Git adds versioning to the filesystem so agents can track work, roll back errors, and branch experiments. We revisit the filesystem below, because it turns out to be a key harness primitive for other features we need.
Bash + Code as a General Purpose Tool
We want agents to autonomously solve problems without humans needing to pre-design every tool.
The main agent execution pattern today is a ReAct loop, where a model reasons, takes an action via a tool call, observes the result, and repeats in a while loop. But harnesses can only execute the tools they have logic for. Instead of forcing users to build tools for every possible action, a better solution is to give agents a general purpose tool like bash.
Harnesses ship with a bash tool so models can solve problems autonomously by writing & executing code.
Bash + code exec is a big step towards giving models a computer and letting them figure out the rest autonomously. The model can design its own tools on the fly via code instead of being constrained to a fixed set of pre-configured tools.
Harnesses still ship with other tools, but code execution has become the default general-purpose strategy for autonomous problem solving.
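A minimal ReAct loop with a bash tool might look like this sketch. The `model` callable and its dict-based action format are assumptions for illustration; real harnesses use structured tool-calling APIs:

```python
import subprocess

def run_bash(command, timeout=10):
    """General-purpose tool: execute a shell command and return its output."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

def react_loop(model, goal, max_steps=5):
    """Minimal ReAct harness: the model proposes an action, the harness
    executes it and feeds the observation back, until the model stops."""
    transcript = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = model(transcript)  # e.g. {"tool": "bash", "arg": "ls"}
        if action.get("tool") == "done":
            return action.get("arg")  # final answer
        observation = run_bash(action["arg"])
        transcript.append(f"Ran `{action['arg']}` -> {observation.strip()}")
    return None  # step budget exhausted
```

The harness only needs logic for one general tool; the model decides what commands to compose inside it.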
Sandboxes and Tools to Execute & Verify Work
Agents need an environment with the right defaults so they can safely act, observe results, and make progress.
We’ve given models storage and the ability to execute code, but all of that needs to happen somewhere. Running agent-generated code locally is risky and a single local environment doesn’t scale to large agent workloads.
Sandboxes give agents safe operating environments. Instead of executing locally, the harness can connect to a sandbox to run code, inspect files, install dependencies, and complete tasks. This creates secure, isolated execution of code. For more security, harnesses can allow-list commands and enforce network isolation. Sandboxes also unlock scale because environments can be created on demand, fanned out across many tasks, and torn down when the work is done.
Good environments come with good default tooling. Harnesses are responsible for configuring tooling so agents can do useful work. This includes pre-installing language runtimes and packages, CLIs for git and testing, browsers for web interaction and verification.
Tools like browsers, logs, screenshots, and test runners give agents a way to observe and analyze their work. This helps them create self-verification loops where they can write application code, run tests, inspect logs, and fix errors.
The model doesn’t configure its own execution environment out of the box. Deciding where the agent runs, what tools are available, what it can access, and how it verifies its work are all harness-level design decisions.
Memory & Search for Continual Learning
Agents should remember what they’ve seen and access information that didn’t exist when they were trained.
Models have no additional knowledge beyond their weights and what’s in their current context. Without access to edit model weights, the only way to “add knowledge” is via context injection.
For memory, the filesystem is again a core primitive. Harnesses support memory file standards like AGENTS.md which get injected into context on agent start. As agents add and edit this file, harnesses load the updated file into context. This is a form of continual learning where agents durably store knowledge from one session and inject that knowledge into future sessions.
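The AGENTS.md pattern can be sketched in a few lines. `MEMORY_FILE`, `remember`, and `build_initial_context` are illustrative names, not a specific harness's API:

```python
from pathlib import Path

MEMORY_FILE = Path("AGENTS.md")  # memory file in the agent's workspace

def build_initial_context(system_prompt):
    """On agent start, the harness injects the memory file into context.
    Anything written in a previous session is now 'remembered'."""
    context = [{"role": "system", "content": system_prompt}]
    if MEMORY_FILE.exists():
        context.append({"role": "system",
                        "content": f"Project memory:\n{MEMORY_FILE.read_text()}"})
    return context

def remember(note):
    """The agent (or a hook) appends durable knowledge for future sessions."""
    with MEMORY_FILE.open("a") as f:
        f.write(note + "\n")
```

No weights change; the "learning" is just a file the harness reliably loads on every start.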
Knowledge cutoffs mean that models can't directly access new data without the user providing it. Web Search and MCP tools like Context7 help agents access information beyond the knowledge cutoff, like updated library versions or current data that didn't exist when training stopped.
Web Search and tools for querying up-to-date context are useful primitives to bake into a harness.
Battling Context Rot
Agent performance shouldn’t degrade over the course of work.
Context Rot describes how models become worse at reasoning and completing tasks as their context window fills up. Context is a precious and scarce resource, so harnesses need strategies to manage it.
Harnesses today are largely delivery mechanisms for good context engineering.
Compaction addresses what to do when the context window is close to filling up. Without compaction, what happens when a conversation exceeds the context window? One option is that the API errors out, which isn't good. The harness has to handle this case with some strategy, so compaction intelligently offloads and summarizes the existing context window so the agent can continue working.
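One way to sketch compaction, with `summarize` standing in for a cheap model call and a rough 4-characters-per-token estimate (both assumptions, not a specific harness's policy):

```python
def compact(messages, summarize, max_tokens=100_000, keep_recent=4):
    """When the transcript nears the context limit, replace older messages
    with a summary and keep only the most recent turns verbatim."""
    def count(msgs):
        # Crude token estimate: ~4 characters per token.
        return sum(len(m["content"]) // 4 for m in msgs)
    if count(messages) <= max_tokens:
        return messages  # still fits; nothing to do
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # e.g. a cheap model call; stubbed in tests
    return [{"role": "system",
             "content": f"Summary of earlier work:\n{summary}"}] + recent
```

The design choice is where to draw the verbatim/summarized line; recent turns carry the most actionable detail, so they survive untouched.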
Tool call offloading helps reduce the impact of large tool outputs that can noisily clutter the context window without providing useful information. For tool outputs above a threshold number of tokens, the harness keeps the head and tail and offloads the full output to the filesystem so the model can access it if needed.
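A head-and-tail offload might look like this sketch; the threshold, sizes, and `tool_outputs/` directory are arbitrary choices for illustration, not a specific harness's defaults:

```python
import hashlib
from pathlib import Path

OFFLOAD_DIR = Path("tool_outputs")  # assumed scratch directory

def offload_tool_output(output, threshold=2000, head=500, tail=500):
    """Keep only the head and tail of a large tool output in context;
    write the full output to the filesystem for on-demand retrieval."""
    if len(output) <= threshold:
        return output  # small outputs stay in context verbatim
    OFFLOAD_DIR.mkdir(exist_ok=True)
    name = hashlib.sha256(output.encode()).hexdigest()[:12] + ".txt"
    path = OFFLOAD_DIR / name
    path.write_text(output)
    return (output[:head]
            + f"\n... [{len(output)} chars total, full output at {path}] ...\n"
            + output[-tail:])
```

The pointer left in context tells the model where to `cat` or grep the full output if it actually needs it.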
Skills address the issue of too many tools or MCP servers being loaded into context on agent start, which degrades performance before the agent can begin working. Skills are a harness-level primitive that solve this via progressive disclosure. The model didn't choose to have Skill front-matter loaded into context on start, but the harness can support this to protect the model against context rot.
Long Horizon Autonomous Execution
We want agents to complete complex work, autonomously, correctly, over long time horizons.
Autonomous software creation is the holy grail for coding agents. But today’s models suffer from early stopping, issues decomposing complex problems, and incoherence as work stretches across multiple context windows. A good harness has to design around all of this.
This is where the earlier harness primitives start to compound. Long-horizon work requires durable state, planning, observation, and verification to keep working across multiple context windows.
Filesystems and git for tracking work across sessions. Agents produce millions of tokens over a long task so the filesystem durably captures work to track progress over time. Adding git allows new agents to quickly get up to speed on the latest work and history of the project. For multiple agents working together, the filesystem also acts as a shared ledger of work where agents can collaborate.
Ralph Loops for continuing work. The Ralph Loop is a harness pattern that intercepts the model’s exit attempt via a hook and reinjects the original prompt in a clean context window, forcing the agent to continue its work against a completion goal. The filesystem makes this possible because each iteration starts with fresh context but reads state from the previous iteration.
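A Ralph loop can be sketched as a plain loop at the harness level. `run_agent` and `check_done` are placeholders for a real agent invocation and a completion check (e.g., a passing test suite):

```python
def ralph_loop(run_agent, check_done, original_prompt, max_iterations=10):
    """Ralph-loop sketch: each iteration starts a *fresh* context with the
    original prompt; durable state lives on the filesystem, not in context.
    The loop intercepts the agent's exit and restarts until the goal passes."""
    for i in range(max_iterations):
        run_agent(original_prompt)  # fresh context; reads/writes workspace files
        if check_done():            # e.g. the test suite is green
            return i + 1            # iterations used
    return None                     # budget exhausted without completion
```

Note that the loop itself carries no memory between iterations; everything the next iteration needs must have been written to disk, which is why the filesystem is the enabling primitive.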
Planning and self-verification to stay on track. Planning is when a model decomposes a goal into a series of steps. Harnesses support this via good prompting and by injecting reminders on how to use a plan file in the filesystem. After completing each step, agents benefit from checking the correctness of their work via self-verification. Hooks in harnesses can run a pre-defined test suite and loop back to the model on failure with the error message, or models can be prompted to self-evaluate their code independently. Verification grounds solutions in tests and creates a feedback signal for self-improvement.
The Future of Harnesses
The Coupling of Model Training and Harness Design
Today's agent products like Claude Code and Codex are post-trained with their models and harnesses in the loop. This helps models improve at actions the harness designers think they should be natively good at, like filesystem operations, bash execution, planning, or parallelizing work with subagents.
This creates a feedback loop. Useful primitives are discovered, added to the harness, and then used when training the next generation of models. As this cycle repeats, models become more capable within the harness they were trained in.
But this co-evolution has interesting side effects for generalization. It shows up in ways like changing tool logic leading to worse model performance. A good example is described in the Codex-5.3 prompting guide with the apply_patch tool logic for editing files. A truly intelligent model should have little trouble switching between patch methods, but training with a harness in the loop creates this overfitting.
But this doesn't mean that the best harness for your task is the one a model was post-trained with. The Terminal Bench 2.0 Leaderboard is a good example: Opus 4.6 in Claude Code scores far below Opus 4.6 in other harnesses. In a previous blog, we showed how we improved our coding agent from Top 30 to Top 5 on Terminal Bench 2.0 by only changing the harness. There's a lot of juice to be squeezed out of optimizing the harness for your task.

Where Harness Engineering is Going
As models get more capable, some of what lives in the harness today will get absorbed into the model. Models will get better at planning, self-verification, and long-horizon coherence natively, requiring less context injection, for example.
That suggests harnesses should matter less over time. But just as prompt engineering continues to be valuable today, it’s likely that harness engineering will continue to be useful for building good agents.
It’s true that harnesses today patch over model deficiencies, but they also engineer systems around model intelligence to make them more effective. A well-configured environment, the right tools, durable state, and verification loops make any model more efficient regardless of its base intelligence.
Harness engineering is a very active area of research that we use to improve our harness building library deepagents at LangChain. Here are a few open and interesting problems we’re exploring today:
- orchestrating hundreds of agents working in parallel on a shared codebase
- agents that analyze their own traces to identify and fix harness-level failure modes
- harnesses that dynamically assemble the right tools and context just-in-time for a given task instead of being pre-configured
This blog was an exercise in defining what a harness is and how it’s shaped by the work we want models to do.
The model contains the intelligence and the harness is the system that makes that intelligence useful.
To more harness building, better systems, and better agents.
What Is a Harness?
Agent = Model + Harness
If you're not the model, you're the harness.
That sounds absolute, but it captures the key point. A harness is essentially everything outside the model: the code, configuration, and execution logic. The model is only the source of capability; it becomes an agent only when a harness wires state, tool calls, feedback loops, and constraint mechanisms together.
Concretely, a harness usually includes:
- System prompts: define the model's role and goals
- Tools, Skills, MCPs: external capabilities the model can invoke
- Infrastructure: runtime environments such as the filesystem, sandbox, and browser
- Orchestration logic: subagents, task decomposition, model routing
- Hooks/middleware: deterministic processes such as compaction, continuation, and lint checks
Why carve up the system along the "model vs. harness" line?
Because it is the cleanest boundary. Definitions of an agent tend to get fuzzy, but this framing forces you to answer one question: what is the model responsible for, and what must the surrounding system supply?
Starting from this definition, the rest of this post breaks down the core harness components, reasoning backwards from "what can the model do" to "why does each design exist."

Why Do We Need a Harness?
The reason is simple: there are things we want an agent to do that the model itself cannot.
That sounds like a truism, but the key is right there: first pin down the model's boundaries. Most models take text, images, or audio as input and produce text as output. That's it. A model is fundamentally just an input → output function.
Out of the box, it cannot:
- Maintain state across multi-turn interactions
- Execute code
- Access real-time information
- Operate an environment (e.g., install dependencies, run programs)
None of these capabilities live inside the model; they are all supplied from outside. That is the harness.
Take the most common example: chat. "Chatting" feels natural, but the model doesn't actually know how to chat.
To deliver that experience, you need at least to:
- Maintain a conversation history
- Splice that history into the context on every request
- Keep looping over user inputs and model outputs
At its core, it's a simple loop wrapped around the model. The key point is a single sentence: every capability you want the agent to exhibit ultimately has to be implemented in the harness.
That is the core idea of harness engineering: instead of "coaxing the model into doing something," flip the direction: first decide what you want it to do, then build those capabilities into the harness one by one.

Filesystems: Durable Storage and Context Management
What we want from agents is straightforward: work with real data, move content that doesn't fit out of context, and save work for later.
Models can only process what's in the current context window. Before filesystems, users had to keep copying and pasting information into the model. That's tedious for a human, let alone for an agent working autonomously.
In the real world, we organize all of our work around filesystems, and models effectively "learned" this from vast amounts of data. So a very natural conclusion follows: the harness should provide a filesystem abstraction plus the corresponding read/write operations (fs-ops).
With a filesystem, many capabilities become real:
- The agent gets its own workspace to read and write data, code, and documentation
- Information can be loaded on demand instead of dumped wholesale into context
- Intermediate results can be written to disk, and state can persist across sessions
- Files themselves are a collaboration interface: humans and multiple agents can work around the same content
Go one step further and add version control (e.g., Git), and the picture gets more complete:
- Every change is recorded
- Problems can be rolled back
- Branches support different experiments
Seen this way, the filesystem is not an "add-on capability" but one of the most fundamental harness primitives. Many later capabilities (state management, collaboration, task decomposition) depend on it.
Bash + Code Execution: A General-Purpose Problem-Solving Tool
What we really want is for the agent to solve problems by itself, rather than pre-designing a tool for its every step.
In reality, though, most agents still run a fixed pattern:
- Think one step (reason)
- Call a tool (act)
- Look at the result (observe)
- Repeat the loop
The problem: the agent can only use the tools you handed it in advance. That imposes a very practical limit: you can't enumerate every tool up front. So a more direct approach is: don't hand over a pile of tools, hand over "a machine that can do work." That is: provide Bash + code execution inside the harness.
Once that capability exists, things change:
- The model can write its own scripts to solve problems
- It can "build tools" on the fly instead of relying on predefined interfaces
- It can compose existing capabilities into new workflows
In essence, you're no longer "designing a tool list"; you're providing a general-purpose execution environment. The harness can of course still ship ready-made tools, but in many scenarios code execution becomes the default strategy.
Sandboxes and Tools: Safe Execution and Work Verification
Giving the agent "storage" and "execution" isn't enough. It also needs a place where it can work without worry.
Code has to run in some environment. Executing model-generated code directly on a local machine is risky, and a single local environment can hardly sustain multi-task, concurrent agent workloads.
The more sensible approach: put execution inside a sandbox.
Sandboxes solve two core problems:
1. Safety
- Isolated execution that can't affect the local system
- Commands can be restricted, networking disabled, permissions controlled
- Even failures stay contained inside the sandbox
2. Scalability
- Environments can be created on demand
- Multiple tasks can run in parallel
- Environments are destroyed after use, leaving no state pollution
But an "environment" alone isn't enough; it has to be usable out of the box. That's the other harness job here: preparing a sensible set of default tools, for example:
- Language runtimes and common dependencies
- CLIs such as Git and test runners
- A browser (for page interaction and verification)
The value of these tools isn't just that they "work"; they let the agent observe the results of its own work:
- Read logs
- Run tests
- Screenshot pages
- Inspect outputs
Once those abilities are in place, a crucial loop forms: write code → run → observe → fix → run again. A simple but effective self-verification loop.
So the real point here isn't "provide a runtime environment." It's deciding what environment the agent works in, what tools it can use, what results it can see, and how it judges whether it got things right. All of that is the harness's responsibility.
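One of the safety mechanisms above, command restriction, can be sketched concretely. A minimal allow-list guard in Python; the `ALLOWED` set and `guard_command` name are illustrative, not a real sandbox API:

```python
import shlex

# Example allow-list: only these executables may run in the sandbox.
ALLOWED = {"ls", "cat", "git", "python", "pytest"}

def guard_command(command):
    """Harness-level guard: reject any command whose executable is not
    on the allow-list, *before* it ever reaches the shell."""
    parts = shlex.split(command)
    if not parts or parts[0] not in ALLOWED:
        raise PermissionError(f"command not allowed: {command!r}")
    return parts  # safe to pass to the executor
```

Real sandboxes layer this with process isolation and network policy; a pre-execution allow-list is just the cheapest first line of defense.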
Memory & Search: Continual Learning
We want agents that aren't just "smart in the moment" but can remember things and look up new information.
The model alone can't do that. Its knowledge comes from only two places:
- What it learned during training (the weights)
- What's provided in the current context
Beyond that, there is no "memory," and it can't proactively update its knowledge. So the question reduces to one sentence: how do we get "new knowledge" into the model? There's really only one answer: context injection. And here, the filesystem once again becomes the underlying infrastructure.
A common approach is to have the harness maintain "memory files" (e.g., AGENTS.md):
- The agent can write information into them as it runs
- On the next startup, that content is loaded back into context
- When the file updates, the context updates with it
This is a very plain form of "learning": write it down → save it → pick it up next time. The model's weights never change, but experience still accumulates across sessions.
One problem remains: the model doesn't know "what's happening right now." For example:
- Newly released library versions
- The latest API changes
- Real-time data
None of that is in the training data.
This calls for another class of capability: search and external knowledge retrieval, such as:
- Web Search
- Context-lookup tools (MCPs) like Context7
Their job is direct: pull information the model "can't see" into the context.
Battling Context Rot: Smart Compaction Strategies
We don't want agents that get "dumber" the longer they run.
In practice, though, as the context grows longer, model performance tends to degrade.
This is what's called Context Rot:
- Information grows, but the share of useful information drops
- Key signals get buried
- Reasoning becomes unstable
The root cause is simple: context is a finite resource, and it's easily wasted. So the question becomes: how does an agent keep a "clean" context throughout long-running work? That's exactly what the harness has to solve. You can understand much of today's harness work as one thing: turning context management into engineering.
The core techniques:
1. Compaction
When the context is nearly full, you can't just "keep piling on"; the existing content has to be dealt with. Common moves:
- Summarize the conversation so far
- Retain the key information
- Move the details out of context
The agent can then keep working without losing critical information.
2. Tool Output Offloading
Tool outputs are often the biggest source of trouble:
- Logs are long
- Results are messy
- Truly useful information is scarce
A more sensible strategy:
- Keep only the head and tail (the key signals)
- Write the full content to the filesystem
- Read it back only when needed
In one sentence: don't let "noise" occupy the context.
3. Skills and Lazy Loading
One more common problem: the moment the agent starts, it crams every tool description and MCP spec into context. The context is polluted before any work begins.
The better approach is on-demand loading (progressive disclosure). That is:
- Start with the minimum necessary information
- Pull in the relevant content only when a capability is needed
You can think of Skills as a "lazy-loading mechanism" for tools and capabilities.
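Progressive disclosure can be sketched as a tiny registry. `SKILLS`, its entries, and both function names are hypothetical; real Skill implementations keep front-matter and full instructions in files:

```python
# Hypothetical skill registry: name -> (one-line summary, full instructions).
SKILLS = {
    "pdf": ("Extract text from PDFs", "Full instructions: use pdftotext on ..."),
    "deploy": ("Deploy the app", "Full instructions: run the release script ..."),
}

def startup_context():
    """At agent start, only the lightweight front-matter enters context."""
    return "\n".join(f"- {name}: {summary}"
                     for name, (summary, _) in SKILLS.items())

def load_skill(name):
    """When the model asks for a skill, the harness discloses the full doc."""
    return SKILLS[name][1]
```

The startup cost is one line per skill; the heavy instructions only spend context tokens when a task actually needs them.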
Long-Horizon Autonomous Execution
What we really want is an agent that can carry a complex task from start to finish.
Reality is still far from that. Common failure modes in today's models:
- Stopping early (quitting before the work is done)
- Struggling to decompose complex tasks
- Becoming incoherent once work spans multiple context windows
So the question isn't "can it write code" but "can it keep the work moving forward." This is one of the core problems a harness must solve: making work continue across time and across context windows. And it isn't a single capability; it's the compound result of several.
Filesystem + Git: writing the process down
A long task inevitably produces a large volume of intermediate results; context alone can't hold it all.
So the work has to be externalized:
- The filesystem records the current state
- Git records the history and the changes
- A new agent can quickly pick up existing progress
When multiple agents collaborate, this setup is essentially a shared notebook.
Ralph loops: preventing "stopping halfway"
Models readily stop once things "look about done."
The Ralph loop's approach is direct:
- Intercept the "I'm finished" signal
- Give the model a fresh, clean context
- Make it keep pushing toward the goal
The key insight: context can be reset, but state must not be lost. That's why the filesystem is a prerequisite.
Planning + self-verification: keeping the process on track
Being able to keep going isn't enough; the work also has to be correct. Two key mechanisms here:
1. Planning
- Break the goal into steps
- Write them to a file
- Keep the file updated
Every step then has a "reference point," making it harder to drift off course.
2. Self-Verification
After each step, check:
- Run the tests
- Read the logs
- Inspect the output
On failure:
- Feed the error message back to the model
- Keep fixing
This forms a stable loop: execute → check → feedback → correct.
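The execute → check → feedback → correct loop can be written as a harness hook. A sketch, with `apply_fix` standing in for a model call that patches the code and `test_command` for whatever verifies the work (both assumptions):

```python
import subprocess

def verify_and_fix(apply_fix, test_command="pytest -q", max_rounds=3):
    """Post-step hook: run the verification command; on failure, feed the
    error output back to the model and loop until green or out of budget."""
    for _ in range(max_rounds):
        result = subprocess.run(test_command, shell=True,
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True  # verification passed
        apply_fix(result.stdout + result.stderr)  # model sees the failure
    return False  # still failing after the fix budget
```

The exit code is the ground truth; the model only ever sees failures it can act on, which is what makes the feedback signal useful.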
The Co-Evolution of Models and Harnesses
Today's agent products, such as Claude Code and Codex, are the result of models and harnesses evolving together.
During training, the model learns not just to generate text but also to make better use of the tools and workflows the harness provides, for example:
- Filesystem operations
- Bash execution
- Task planning
- Working in parallel with subagents
This creates a feedback loop:
- The harness provides primitives and operational capabilities
- The model learns how to use those primitives
- The training results feed back into the next generation of models
- The model performs better and better inside the same harness environment
This co-evolution makes models more capable within a specific harness, but it has side effects:
- The model may "overfit" to particular tools or logic
- Performance can drop in a different harness environment
One example comes from the Codex-5.3 prompting guide: the apply_patch tool used for editing files. If a model only ever encountered one patching approach during training, switching patch methods can cause problems.
This also shows that the harness best suited to your task isn't necessarily the one the model was trained with.
Terminal Bench 2.0 demonstrates the point: Opus 4.6 inside Claude Code scores far below Opus 4.6 in other harnesses. And by optimizing only the agent's runtime environment (things like documentation structure, verification loops, and tracing), LangChain's coding agent climbed from 30th to 5th on the same benchmark, with its score improving from 52.8% to 66.5%.

Closing Thoughts
As models grow more capable, some of what the harness carries today may be absorbed into the model itself. Models will become more reliable at planning, self-verification, and long-horizon coherence, and so will depend less on context injection.
That seems to imply harnesses will matter less. But just as prompt engineering remains valuable today, harness engineering will very likely remain key to building effective agents.
The reason is simple:
- A harness doesn't just compensate for the model's shortcomings
- It is also a way of designing systems so the model can complete tasks more effectively
- A well-configured environment, the right tools, durable state, and verification loops let any model operate at its best
Think of it as the relationship between a stage and its actors:
- The harness is the stage and the backstage control systems
- The agent is the actor on the stage
However brilliant the actors, without a stage and its rules they can't perform at their full potential.