Agent Failure Triage

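Once an Agent system is in production, the hard part is often not that it failed, but working out at which layer it failed.

What looks like the same bug can turn out to be entirely different problems:

- the route was wrong
- the context was insufficient
- the handoff dropped information
- a tool failed
- an approval was not picked up
- state was polluted
- the specialist itself made a wrong judgment

This document is about failure triage.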

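1. Don't rush to change the prompt

When an Agent system misbehaves, the first reaction is often to change the prompt.

That step usually comes too early, because the same bad result can have its root cause in entirely different layers:

- input layer
- route layer
- state / memory layer
- tool layer
- coordination layer
- output layer

The first step of failure triage is not to fix anything, but to assign the failure to a layer.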

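2. A minimal triage diagram

The diagram is not meant to cover every case. Its purpose is to keep you from blaming every problem on the model right away.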

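3. The 6 most common failure types

3.1 Route / Triage failures

Typical symptoms:

- a specialist clearly received the wrong task
- a handoff happened when it should not have
- the agent executed directly when it should have asked a follow-up question
- a high-risk task went down a low-risk path

For this type, start by checking:

- the triage prompt
- the route policy
- specialist boundaries
- route evals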
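A route eval can be as small as a list of labeled cases and a confusion count. The following is a minimal sketch, not a prescribed harness: the case data, the route names (`billing`, `support`, `ask_clarification`), and the `route_eval` function are all hypothetical.

```python
from collections import Counter

# Hypothetical labeled cases: (user_input, expected_route, actual_route).
ROUTE_CASES = [
    ("refund for duplicate charge", "billing", "billing"),
    ("app crashes on login", "support", "support"),
    ("charge failed twice", "billing", "support"),
    ("delete my account", "ask_clarification", "support"),
]

def route_eval(cases):
    """Return routing accuracy and a count of (expected, actual) confusions."""
    confusions = Counter(
        (expected, actual) for _, expected, actual in cases if expected != actual
    )
    correct = len(cases) - sum(confusions.values())
    return correct / len(cases), confusions

accuracy, confusions = route_eval(ROUTE_CASES)
print(f"route accuracy: {accuracy:.2f}")
for (expected, actual), n in confusions.items():
    print(f"expected {expected}, got {actual}: {n}")
```

The confusion pairs are usually more useful than the accuracy number, because they point at which specialist boundary is blurry.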

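3.2 Context / State failures

Typical symptoms:

- information stated earlier is lost later on
- a summary overwrote a key condition
- old preferences leaked in from memory
- the context shrank after a handoff

For this type, start by checking:

- the actual context sent to the model
- the serialized state
- the memory write policy
- the handoff payload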
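One cheap way to check whether context shrank at a handoff is to diff the upstream state against the payload that was actually sent. This is a sketch under assumptions: both sides are plain dicts, and the field names and the `missing_keys` helper are made up for illustration.

```python
def missing_keys(state: dict, payload: dict) -> list[str]:
    """List keys that carry a value upstream but are absent or empty in the payload."""
    return sorted(k for k, v in state.items() if v and not payload.get(k))

# Hypothetical upstream state and the payload handed to the next agent.
upstream_state = {
    "user_id": "u_42",
    "task": "change plan",
    "constraints": ["no price increase"],
    "locale": "en-US",
}
handoff_payload = {"user_id": "u_42", "task": "change plan", "constraints": []}

print(missing_keys(upstream_state, handoff_payload))
```

Here the user constraint was emptied and the locale was dropped, which is the kind of loss that later looks like a reasoning error.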

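3.3 Tool failures

Typical symptoms:

- the tool schema does not match
- parameters are missing
- the wrong tool was called
- the tool's return value is unstable
- the tool succeeded, but its result was not injected back properly

For this type, start by checking:

- tool call parameters
- tool success rate
- tool output format
- where the tool result is injected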
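Checking tool call parameters can start with a plain comparison against the tool's declared fields. The sketch below assumes a deliberately simplified schema shape (`required` plus a `types` map); it is not any particular framework's schema format, and `check_tool_call` is a hypothetical helper.

```python
def check_tool_call(schema: dict, args: dict) -> list[str]:
    """Compare call arguments to a simplified schema and list the problems found.

    Assumed schema shape: {"required": [...], "types": {field_name: python_type}}.
    """
    problems = []
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        expected = schema.get("types", {}).get(name)
        if expected is None:
            problems.append(f"unexpected field: {name}")
        elif not isinstance(value, expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems

# Hypothetical refund tool.
refund_schema = {
    "required": ["order_id", "amount"],
    "types": {"order_id": str, "amount": float},
}
print(check_tool_call(refund_schema, {"order_id": "o_1", "amount": "12.50"}))
print(check_tool_call(refund_schema, {"amount": 12.5}))
```

Running this over logged tool calls separates schema problems from runtime problems before anyone touches the prompt.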

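3.4 Coordination failures

These mainly show up in multi-Agent setups.

Typical symptoms:

- the handoff target is wrong
- the payload is incomplete
- several Agents did the same work
- the manager and the worker understand the task differently
- A2A returned a task status, but the orchestrator did not handle it properly

For this type, start by checking:

- the handoff trace
- differences between specialist inputs
- the coordination payload
- role boundary definitions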

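3.5 Runtime / Reliability failures

Typical symptoms:

- timeouts
- too many retries
- no resume after an interrupt
- approvals that hang
- run cost spiking

For this type, start by checking:

- latency distribution
- retry / timeout statistics
- interrupt / resume records
- approval queue status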
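The runtime checks above reduce to a few aggregates over run records. A minimal sketch, assuming each run is logged as a dict with the hypothetical fields `run_id`, `latency_s`, `retries`, and `timed_out`; the `retry_limit` threshold is an arbitrary illustration.

```python
from statistics import median

# Hypothetical run records; the field names are assumptions.
runs = [
    {"run_id": "r1", "latency_s": 2.0, "retries": 0, "timed_out": False},
    {"run_id": "r2", "latency_s": 3.0, "retries": 1, "timed_out": False},
    {"run_id": "r3", "latency_s": 40.0, "retries": 5, "timed_out": True},
    {"run_id": "r4", "latency_s": 4.0, "retries": 0, "timed_out": False},
]

def runtime_summary(runs, retry_limit=3):
    """Summarize latency, timeout rate, and runs that retried more than retry_limit times."""
    return {
        "median_latency_s": median(r["latency_s"] for r in runs),
        "timeout_rate": sum(r["timed_out"] for r in runs) / len(runs),
        "retry_heavy_runs": [r["run_id"] for r in runs if r["retries"] > retry_limit],
    }

print(runtime_summary(runs))
```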

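3.6 Reasoning / Output failures

This is the type most people are familiar with.

Typical symptoms:

- the final answer is wrong
- the format is wrong
- steps are skipped
- groundedness is insufficient
- the specialist itself made a wrong judgment

Even so, this layer is best investigated last.

If the earlier layers have not been cleared, changing the reasoning directly tends to make things messier with every change.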

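4. How to tell a route problem from a specialist problem

Start with three questions:

- If the same input is given directly to the correct specialist, does it do the job well?
- Was the context the specialist received already distorted?
- Did the problem occur before the request even reached the specialist?

If:

- the correct specialist works fine when run on its own
- but the actual chain produces a bad result

then it is more likely a route / handoff / payload problem.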
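The comparison between an isolated specialist run and the full chain can be written as a small replay helper. This is a sketch with toy stand-ins: `billing_specialist`, `broken_chain`, and the `is_good` check are hypothetical, and a real setup would call the actual specialist and the actual pipeline.

```python
from typing import Callable

def isolate(
    specialist: Callable[[str], str],
    chain: Callable[[str], str],
    user_input: str,
    is_good: Callable[[str], bool],
) -> str:
    """Run the same input through the specialist alone and through the full chain."""
    isolated_ok = is_good(specialist(user_input))
    chain_ok = is_good(chain(user_input))
    if chain_ok:
        return "no failure reproduced"
    if isolated_ok:
        return "suspect route / handoff / payload"
    return "suspect specialist"

# Toy stand-ins for illustration only.
def billing_specialist(text: str) -> str:
    return f"refund issued for: {text}"

def broken_chain(text: str) -> str:
    # Simulates a handoff that truncates the context before the specialist sees it.
    return billing_specialist(text[:10])

verdict = isolate(
    billing_specialist,
    broken_chain,
    "refund please for order 1234",
    is_good=lambda out: "order 1234" in out,
)
print(verdict)
```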

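5. How to tell a tool problem from a reasoning problem

Look at these things:

- Did the tool actually execute successfully?
- Is the tool's return value complete?
- Did the model read the tool result correctly?
- Without calling the tool, is the model capable of answering correctly on its own?

If the tool itself failed, do not rush to change the reasoning.

If the tool succeeded and its return value is fine, but the model still misreads the result, then it is more likely a reasoning / output problem.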
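The first three checks can be applied in order to a single recorded tool step. The sketch below is a heuristic under assumptions: the `ToolStep` record and its fields are hypothetical, and `answer_matches_result` is taken to be something a reviewer or grader fills in.

```python
from dataclasses import dataclass

@dataclass
class ToolStep:
    # Hypothetical summary of one tool call taken from a trace.
    succeeded: bool
    result: dict
    required_fields: tuple
    answer_matches_result: bool

def tool_or_reasoning(step: ToolStep) -> str:
    """Return the layer to investigate first for this tool step."""
    if not step.succeeded:
        return "tool: call failed"
    missing = [f for f in step.required_fields if f not in step.result]
    if missing:
        return f"tool: incomplete result, missing {missing}"
    if not step.answer_matches_result:
        return "reasoning: model misread a valid result"
    return "unclear: look at other layers"

step = ToolStep(
    succeeded=True,
    result={"balance": 120.0, "currency": "USD"},
    required_fields=("balance", "currency"),
    answer_matches_result=False,
)
print(tool_or_reasoning(step))
```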

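6. How to tell a state problem from a memory problem

A simple distinction:

- state is closer to the working context of the current task while it runs
- memory is closer to preferences and long-term information kept across turns

If the problem appears only in the current run, it is more likely state.

If the problem keeps recurring across runs and always carries old content, it is more likely memory.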
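That rule of thumb can be expressed over a set of failure records. A rough sketch only: the record fields `run_id` and `has_stale_content` are assumptions, and the result is a hint about where to look, not a diagnosis.

```python
def state_or_memory(failures: list[dict]) -> str:
    """Guess whether a group of similar failures points at state or at memory."""
    run_ids = {f["run_id"] for f in failures}
    if len(run_ids) > 1 and all(f["has_stale_content"] for f in failures):
        return "memory"  # recurs across runs and always carries old content
    if len(run_ids) == 1:
        return "state"  # confined to a single run
    return "unclear"

# Hypothetical failure records for the same symptom.
recurring = [
    {"run_id": "r1", "has_stale_content": True},
    {"run_id": "r7", "has_stale_content": True},
]
print(state_or_memory(recurring))
```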

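7. Which trace signals to look at first

During failure triage, the parts of a trace most worth looking at first are usually:

- route decisions
- handoff records
- tool parameters
- tool return values
- interrupt / resume records
- the final context assembled for the model
- token / latency anomalies

Get a clear view of these before deciding whether to go back to the prompt or the model layer.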

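8. A minimal set of failure labels

So that production failure samples can accumulate into something reusable, it is best to settle on a set of labels first.

For example: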

- route_error
- missing_context
- stale_memory
- tool_schema_error
- tool_runtime_error
- handoff_payload_error
- approval_flow_error
- timeout_or_retry_spike
- reasoning_error
- output_format_error

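The point of these labels is not academic precision. It is to make failure samples something you can archive, search, and use for regression.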
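Keeping the labels in one enum makes typos fail loudly and makes counting straightforward. A minimal sketch: the `FailureTag` class and the sample records are hypothetical, while the label values are the ones listed above.

```python
from collections import Counter
from enum import Enum

class FailureTag(str, Enum):
    ROUTE_ERROR = "route_error"
    MISSING_CONTEXT = "missing_context"
    STALE_MEMORY = "stale_memory"
    TOOL_SCHEMA_ERROR = "tool_schema_error"
    TOOL_RUNTIME_ERROR = "tool_runtime_error"
    HANDOFF_PAYLOAD_ERROR = "handoff_payload_error"
    APPROVAL_FLOW_ERROR = "approval_flow_error"
    TIMEOUT_OR_RETRY_SPIKE = "timeout_or_retry_spike"
    REASONING_ERROR = "reasoning_error"
    OUTPUT_FORMAT_ERROR = "output_format_error"

def tag_counts(samples):
    """Count samples per tag; an unknown tag string raises ValueError."""
    return Counter(FailureTag(s["tag"]) for s in samples)

# Hypothetical archived failure samples.
samples = [
    {"case_id": "f_001", "tag": "route_error"},
    {"case_id": "f_002", "tag": "tool_schema_error"},
    {"case_id": "f_005", "tag": "route_error"},
]
counts = tag_counts(samples)
for tag, n in counts.most_common():
    print(tag.value, n)
```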

9. A minimal triage table

| case_id | failure_layer | sub_type | route_ok | tool_ok | handoff_ok | state_ok | final_ok | notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f_001 | route | wrong_specialist | no | - | - | yes | no | billing task sent to support |
| f_002 | tool | schema_error | yes | no | - | yes | no | missing required field |
| f_003 | coordination | payload_loss | yes | yes | no | partial | no | missing user constraints in handoff |
| f_004 | runtime | timeout | yes | partial | yes | yes | no | interrupt resumed too late |

This table can be used directly alongside Multi-Agent-Evaluations.md.
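If the table is kept in code rather than in a spreadsheet, each row maps naturally onto a small record. The sketch below mirrors the columns of the table; the `TriageCase` class and the `failing_checks` helper are illustrative assumptions, with `-` read as "not applicable".

```python
from dataclasses import dataclass, fields

@dataclass
class TriageCase:
    case_id: str
    failure_layer: str
    sub_type: str
    route_ok: str
    tool_ok: str
    handoff_ok: str
    state_ok: str
    final_ok: str
    notes: str = ""

def failing_checks(case: TriageCase) -> list[str]:
    """Return the *_ok columns that are neither 'yes' nor not applicable ('-')."""
    return [
        f.name
        for f in fields(case)
        if f.name.endswith("_ok") and getattr(case, f.name) not in ("yes", "-")
    ]

f_003 = TriageCase(
    "f_003", "coordination", "payload_loss",
    route_ok="yes", tool_ok="yes", handoff_ok="no", state_ok="partial", final_ok="no",
    notes="missing user constraints in handoff",
)
print(failing_checks(f_003))
```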

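10. A practical investigation order

A fairly reliable order is usually:

- first check whether the route was correct
- then check whether context / state is complete
- then check whether tool / handoff / approval behaved normally
- only at the end look at reasoning and output

The benefits of doing it this way:

- fewer mistaken prompt changes
- faster root-cause identification
- a regression set that is easier to build up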
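The order above can be encoded as an ordered list of checks that stops at the first layer that looks unhealthy. A minimal sketch: the trace summary keys (`route_ok`, `context_complete`, and so on) are hypothetical, and in practice each check would inspect real trace data rather than a precomputed boolean.

```python
# Ordered checks: each takes a trace summary dict and returns True when the layer looks healthy.
CHECKS = [
    ("route", lambda t: t["route_ok"]),
    ("context / state", lambda t: t["context_complete"]),
    ("tool / handoff / approval",
     lambda t: t["tool_ok"] and t["handoff_ok"] and t["approval_ok"]),
    ("reasoning / output", lambda t: t["final_ok"]),
]

def first_failing_layer(trace: dict):
    """Walk the layers in order and return the first one that fails, or None."""
    for layer, healthy in CHECKS:
        if not healthy(trace):
            return layer
    return None

# Hypothetical trace summary for one failed run.
trace = {
    "route_ok": True,
    "context_complete": True,
    "tool_ok": False,
    "handoff_ok": True,
    "approval_ok": True,
    "final_ok": False,
}
print(first_failing_layer(trace))
```

Stopping at the first failing layer is the point: a wrong final answer here is reported as a tool-layer issue, not as a reasoning issue.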

11. Relationship to the existing topics

The documents best read together with this one are:

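12. Summary

The key to Agent failure triage is not to fix first, but to put the failure back into the correct layer first:

- route
- context / state
- tool / coordination
- runtime
- reasoning / output

Once the layers are clearly separated, the fixes stop being haphazard.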
