
Harness Markdown Table Templates

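This document does exactly one thing:

It collects the tables most often used for Harness runs, comparisons, and regression checks into Markdown versions you can copy directly.

If you already understand Harness as a "run and verification scaffold", you can use this page purely as a template sheet.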

1. Task run manifest table

| Field | Meaning | Example |
| --- | --- | --- |
| `Run Batch ID` | ID of this run batch | `batch-2026-04-29-a` |
| `Task Set Name` | Name of the task set | `research-core-set` |
| `Compared Version` | Current version under test | `agent-v5` |
| `Baseline Version` | Baseline version to compare against | `agent-v4` |
| `Task Count` | Number of tasks | `25` |
| `Run Status` | Run status | `finished` |
| `Notes` | Notes | `includes 4 historical failure samples` |

2. Single-task trace summary table

| Step | Goal | Action | Tool | Observation | State Update |
| --- | --- | --- | --- | --- | --- |
| 1 | | | | | |
| 2 | | | | | |
| 3 | | | | | |

This table is suited to:

  • Reviewing a single failure case after the fact
  • Comparing the execution paths of two versions

3. Regression comparison table

| Task ID | Baseline Result | Current Result | Status | Notes |
| --- | --- | --- | --- | --- |
| `task-001` | pass | pass | unchanged | |
| `task-002` | fail | pass | improved | |
| `task-003` | pass | fail | regressed | |
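
The Status column can be derived mechanically from the two result columns. The following is a minimal sketch, assuming results are stored as plain `pass` / `fail` strings keyed by task ID; the function names `classify` and `regression_table` are hypothetical helpers, not part of any Harness API.

```python
from typing import Dict


def classify(baseline: str, current: str) -> str:
    """Map a (baseline, current) result pair to a Status value."""
    if baseline == current:
        return "unchanged"
    if baseline == "fail" and current == "pass":
        return "improved"
    if baseline == "pass" and current == "fail":
        return "regressed"
    # Assumption: any other pair (e.g. "error", "skipped") is only flagged.
    return "changed"


def regression_table(baseline: Dict[str, str], current: Dict[str, str]) -> str:
    """Render the regression comparison table as Markdown.

    Only tasks present in both runs are compared, in sorted order.
    """
    lines = [
        "| Task ID | Baseline Result | Current Result | Status | Notes |",
        "| --- | --- | --- | --- | --- |",
    ]
    for task_id in sorted(baseline.keys() & current.keys()):
        b, c = baseline[task_id], current[task_id]
        lines.append(f"| `{task_id}` | {b} | {c} | {classify(b, c)} | |")
    return "\n".join(lines)


if __name__ == "__main__":
    old = {"task-001": "pass", "task-002": "fail", "task-003": "pass"}
    new = {"task-001": "pass", "task-002": "pass", "task-003": "fail"}
    print(regression_table(old, new))
```

Run against the three sample tasks, this reproduces the rows of the table above. Tasks that exist in only one of the two runs are skipped here; a real setup may prefer to list them with their own status.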

4. Failure distribution table

| Failure Type | Count | Example Task | Notes |
| --- | --- | --- | --- |
| Wrong Tool | | | |
| Wrong Argument | | | |
| Missing Evidence | | | |
| Looping | | | |
| Premature Stop | | | |
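
The Count and Example Task columns can be filled from a list of labelled failures. The following is a rough sketch, assuming each failure has already been tagged with one of the five types above; the `(task_id, failure_type)` input shape and the name `failure_table` are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Row order follows the template above.
FAILURE_TYPES = [
    "Wrong Tool",
    "Wrong Argument",
    "Missing Evidence",
    "Looping",
    "Premature Stop",
]


def failure_table(failures: Iterable[Tuple[str, str]]) -> str:
    """Render the failure distribution table from (task_id, failure_type) pairs.

    The first task seen for each type is used as the example task.
    Types outside FAILURE_TYPES are counted but not rendered in this sketch.
    """
    counts: Counter = Counter()
    examples: Dict[str, str] = {}
    for task_id, failure_type in failures:
        counts[failure_type] += 1
        examples.setdefault(failure_type, task_id)

    lines = [
        "| Failure Type | Count | Example Task | Notes |",
        "| --- | --- | --- | --- |",
    ]
    for failure_type in FAILURE_TYPES:
        example = f"`{examples[failure_type]}`" if failure_type in examples else ""
        lines.append(f"| {failure_type} | {counts[failure_type]} | {example} | |")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        ("task-003", "Wrong Tool"),
        ("task-007", "Looping"),
        ("task-009", "Wrong Tool"),
    ]
    print(failure_table(sample))
```

Keeping the row order fixed, even for types with a count of zero, makes tables from different batches easy to compare line by line.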

5. Release regression checklist table

| Check | Result | Notes |
| --- | --- | --- |
| Core Task Set Pass | | |
| Historical Failure Set Pass | | |
| No New Critical Regression | | |
| Cost Within Budget | | |
| Latency Within SLA | | |
| High-risk Flow Reviewed | | |
| Release Decision | | |
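
One way to fill in this table is to treat every check as a boolean gate. The sketch below assumes the strictest possible rule, namely that the release decision is `release` only when every check passes and `hold` otherwise; that rule, the `pass` / `fail` cell values, and the name `release_table` are assumptions for illustration, and teams may well weight the checks differently.

```python
from typing import Dict

# Check names copied from the template above, excluding the final decision row.
CHECKS = [
    "Core Task Set Pass",
    "Historical Failure Set Pass",
    "No New Critical Regression",
    "Cost Within Budget",
    "Latency Within SLA",
    "High-risk Flow Reviewed",
]


def release_table(results: Dict[str, bool]) -> str:
    """Render the release checklist; a missing check is treated as failed."""
    lines = ["| Check | Result | Notes |", "| --- | --- | --- |"]
    all_ok = True
    for check in CHECKS:
        ok = results.get(check, False)
        all_ok = all_ok and ok
        lines.append(f"| {check} | {'pass' if ok else 'fail'} | |")
    # Assumed rule: release only if every single check passed.
    decision = "release" if all_ok else "hold"
    lines.append(f"| Release Decision | {decision} | |")
    return "\n".join(lines)


if __name__ == "__main__":
    results = {check: True for check in CHECKS}
    results["Latency Within SLA"] = False
    print(release_table(results))
```

Treating a missing check as a failure is a deliberately conservative choice: a check that was never run cannot silently let a release through.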

6. High-risk task monitoring table

| Task ID | Risk Level | Human Review Needed | Final Decision | Notes |
| --- | --- | --- | --- | --- |
| `risk-001` | high | yes | hold | |
| `risk-002` | medium | no | release | |

7. Run issue postmortem table

| Issue ID | Trigger Task | Symptom | Root Cause | Fix | Added to Regression |
| --- | --- | --- | --- | --- | --- |
| `issue-001` | | | | | |

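8. Minimal Harness checklist

[ ] I have a fixed task set
[ ] I can record batch run results
[ ] I can view a per-task trace summary
[ ] I can run a version regression comparison
[ ] I can tally failure types
[ ] I can run a pre-release check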

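9. Usage in one sentence

If you only want a minimal Harness to start with, copy these 3 tables first:

  1. Regression comparison table
  2. Failure distribution table
  3. Release regression checklist table

For many teams, these 3 tables are enough to support a first version of the engineering feedback loop.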