Research Agent Code Implementation Example

If the Minimal Agent Code Implementation Example mostly answers:

What should the code skeleton of your first Agent look like?

then this document tackles a question closer to research-style tasks:

How should you actually organize the code for a Research Agent that keeps investigating, gathering material, and converging on evidence?

Many people understand the concept of a Research Agent, yet still get stuck on these points once they start writing code:

  • Should a research task be broken into dimensions first?
  • What exactly belongs in the state?
  • How should search and reading tools be layered?
  • When should the system keep retrieving, and when should it stop?
  • What should the system do when it finds information gaps or conflicting sources?
  • How do you finally converge from a pile of material into a well-supported conclusion?

So this document is not about a "universal research system", nor about high-risk automated execution. It focuses on one thing only:

giving you a reasonably complete TypeScript pseudocode skeleton built around a research loop.

The point is not fancy code, but helping you see the core loop of a research-style Agent:

research goal -> plan research dimensions -> retrieve -> read -> extract evidence -> assess gaps/conflicts -> keep exploring or stop -> form conclusions

1. What scenarios does this code skeleton fit?

Not every task needs a Research Agent.

If the user asks a very direct question, such as:

Why is this endpoint returning 500?

then what you probably need is a Debug Agent, not a Research Agent.

A Research Agent is better suited to tasks like these:

  • They require synthesizing multiple sources rather than a single fact
  • They require comparison across several dimensions
  • The first round of information is usually not enough, so more material must be gathered
  • The final output is not "excerpted material" but "a formed judgment"

For example:

  • Investigating whether an AI framework is suitable for an internal knowledge assistant
  • Comparing several vector databases for a specific business scenario
  • Analyzing whether a new model fits into an existing workflow
  • Researching the cost, constraints, and risk boundaries of a technical approach

What these tasks have in common is:

The value lies not in finding content, but in organizing the exploration process.

2. First, define a research-style task

To make the code structure concrete, let's fix an example task:

Investigate whether a given AI framework is suitable for building an internal enterprise knowledge assistant.

By nature this goal is not a single-point question but a composite judgment.

The system needs to answer at least:

  • What problems does it solve?
  • What scenarios does it fit?
  • What scenarios does it not fit?
  • How large is the integration cost for the team?
  • What are the known limitations and risks?

In other words, the output of a Research Agent is not just "information found", but:

  • Structured conclusions
  • The evidence behind each conclusion
  • The uncertainty that still remains

3. Plan research dimensions first, instead of searching immediately

The first mistake a Research Agent easily makes is:

searching a lot right away, then going in circles inside the results.

A steadier approach is to first break the research task into several dimensions.

In this example, the task can be split into the following categories:

  • capability_fit: whether capability coverage meets the requirements
  • integration_cost: the cost of integration and engineering changes
  • operational_risk: stability, observability, and governance boundaries
  • known_limitations: known limitations, gaps, and unsuitable scenarios

The benefit is not "looking complete"; it gives the later loop clear targets:

  • which dimension this round of retrieval is filling in
  • which dimension already has enough evidence
  • which dimension still has weak evidence
  • which sources are actually discussing the same thing

So a first-version Research Agent is well worth keeping an explicit "research dimension plan", like the sketch below.
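A minimal sketch of such a plan as plain data that the loop reads and updates (the ids match the dimension types formalized in the next section; the question wording is illustrative):

const researchPlan = [
  {
    id: "capability_fit",
    question: "Does the framework cover the required capabilities?",
    priority: "high",
  },
  {
    id: "integration_cost",
    question: "How much integration and maintenance work does it require?",
    priority: "high",
  },
  {
    id: "operational_risk",
    question: "What are the stability, observability, and governance risks?",
    priority: "high",
  },
  {
    id: "known_limitations",
    question: "Which limitations and unsuitable scenarios keep coming up?",
    priority: "medium",
  },
];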

4. How should the state be designed?

The state of a Research Agent usually matters more than in a minimal Agent.

Because what a research task fears most is not "finding nothing", but:

  • what was already found gets scattered
  • gaps are never recorded
  • conflicting sources get overwritten
  • only a vague conclusion remains at the end

Below is a state structure well suited to a first implementation:

type ResearchDimensionId =
  | "capability_fit"
  | "integration_cost"
  | "operational_risk"
  | "known_limitations";

type ResearchDimension = {
  id: ResearchDimensionId;
  question: string;
  priority: "high" | "medium" | "low";
  done: boolean;
  confidence: number;
};

type Evidence = {
  id: string;
  dimensionId: ResearchDimensionId;
  claim: string;
  sourceTitle: string;
  sourceUrl: string;
  snippet: string;
  sourceType: "official_doc" | "blog" | "benchmark" | "community";
  supportLevel: "strong" | "medium" | "weak";
  extractedAtStep: number;
};

type InformationGap = {
  dimensionId: ResearchDimensionId;
  question: string;
  reason: "not_enough_evidence" | "conflict_detected" | "too_generic";
};

type ConflictRecord = {
  dimensionId: ResearchDimensionId;
  claimA: string;
  claimB: string;
  sourceA: string;
  sourceB: string;
  resolutionStatus: "open" | "resolved" | "needs_human_review";
  note: string;
};

type ToolCallRecord = {
  step: number;
  toolName: string;
  args: Record<string, unknown>;
  summary: string;
};

type ResearchState = {
  goal: string;
  dimensions: ResearchDimension[];
  evidence: Evidence[];
  gaps: InformationGap[];
  conflicts: ConflictRecord[];
  notes: string[];
  toolCalls: ToolCallRecord[];
  currentStep: number;
  maxSteps: number;
  done: boolean;
  stopReason?: string;
};

Compared with a minimal Agent, this structure adds several genuinely critical pieces of information:

  • dimensions: tells the system which facets this research covers
  • evidence: stores not raw text but "citable evidence units"
  • gaps: explicitly records what is still unclear
  • conflicts: explicitly records claims that contradict each other

These fields directly determine whether the research loop can make steady progress.

5. Why evidence cannot just be raw text

Many first implementations store evidence as:

const evidence: string[] = [];

This runs, of course, but two problems show up quickly:

  • you don't know which dimension a passage supports
  • you don't know where it came from, so you cannot compare credibility

For a Research Agent, it is better to store evidence as "structured claims".

That is, each piece of evidence should answer at least:

  • what judgment it supports
  • where it comes from
  • whether it is strong or weak
  • which research dimension it belongs to

Only then can the system answer, during synthesis:

  • which dimension already has enough evidence
  • which conclusion rests only on weak evidence
  • which conflict is merely caused by differing source tiers

In other words:

A Research Agent does not stuff material into an array; it organizes material into comparable evidence objects.
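A minimal sketch of one such entry, using the Evidence type defined above (all concrete values here are illustrative, not drawn from a real source):

const exampleEvidence: Evidence = {
  id: "capability_fit-3",
  dimensionId: "capability_fit",
  claim: "The framework supports tool orchestration but requires explicit state management.",
  sourceTitle: "Framework official docs",
  sourceUrl: "https://example.com/docs",
  snippet: "The framework requires explicit state management for multi-step workflows.",
  sourceType: "official_doc",
  supportLevel: "strong",
  extractedAtStep: 3,
};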

6. Tool design: separate retrieval from read

In a first implementation, it is tempting to write one catch-all tool, such as:

search_and_read_everything

A tool like that usually causes two problems:

  • the decision layer doesn't know whether it is "finding sources" or "reading content"
  • the logs make it hard to see how the research loop is progressing

A clearer approach is to split the tools into two layers:

  1. retrieval: find candidate sources
  2. read: read the detailed content of a chosen source

For example:

type ToolResult<T> = {
  ok: boolean;
  data?: T;
  error?: string;
};

type SearchHit = {
  title: string;
  url: string;
  snippet: string;
  sourceType: "official_doc" | "blog" | "benchmark" | "community";
};

type ToolDefinition<TArgs, TResult> = {
  name: string;
  description: string;
  run: (args: TArgs) => Promise<ToolResult<TResult>>;
};

const searchSourcesTool: ToolDefinition<
  { query: string; dimensionId: ResearchDimensionId },
  { hits: SearchHit[] }
> = {
  name: "search_sources",
  description: "Search candidate sources for a research dimension",
  async run(args) {
    return {
      ok: true,
      data: {
        hits: [
          {
            title: "Framework official docs",
            url: "https://example.com/docs",
            snippet: "Architecture, features, and deployment model",
            sourceType: "official_doc",
          },
        ],
      },
    };
  },
};

const readSourceTool: ToolDefinition<
  { url: string },
  { title: string; content: string }
> = {
  name: "read_source",
  description: "Read the full content of a selected source",
  async run(args) {
    return {
      ok: true,
      data: {
        title: "Framework official docs",
        content: "Detailed content here...",
      },
    };
  },
};

This layering has several direct benefits:

  • clearer decisions: find first, then read
  • clearer tool logs: you can see whether the research is expanding sources or deepening understanding
  • more controllable resources: it won't read every long document up front

The feel of a research loop often lives in exactly this rhythm of "widen the search first, then read the key sources in depth"; the usage sketch below shows the two layers working together.
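A minimal sketch of that rhythm using the two tools defined above (exploreDimension is a hypothetical helper; error handling is deliberately thin):

async function exploreDimension(dimensionId: ResearchDimensionId, query: string) {
  // Widen first: collect candidate sources for the dimension.
  const search = await searchSourcesTool.run({ query, dimensionId });
  if (!search.ok || !search.data || search.data.hits.length === 0) {
    return null;
  }

  // Then deepen: read only the most promising hit instead of everything.
  const topHit = search.data.hits[0];
  const read = await readSourceTool.run({ url: topHit.url });
  if (!read.ok || !read.data) {
    return null;
  }

  return { hit: topHit, content: read.data.content };
}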

7. What should the decision layer decide?

The decision layer of a first-version Research Agent does not need to be sophisticated.

At its core it only needs to decide three things:

  • which dimension to research next
  • whether to search or to read next
  • whether the current information is enough to converge

In other words, the action types of a first version can be very simple:

type ResearchAction =
  | {
      type: "search";
      dimensionId: ResearchDimensionId;
      query: string;
      reason: string;
    }
  | {
      type: "read";
      dimensionId: ResearchDimensionId;
      url: string;
      reason: string;
    }
  | {
      type: "finish";
      reason: string;
    };

Notice that there are deliberately no high-risk actions here.

Because the point of this example is not automatically executing system changes, but:

continuously finding evidence around the research task, filling gaps, and deciding when to converge.

8. When to keep retrieving, and when to keep reading

A particularly important question inside a research loop is:

Should the next step be more searching, or reading the existing sources in depth first?

A fairly reliable rule of thumb:

  • if a dimension has almost no sources yet, search first
  • if a dimension already has candidate sources but the claims are still unclear, read first
  • if a conflict has already appeared, prioritize reading higher-quality sources
  • if the evidence is all too generic, search again with more specific questions

This judgment can be written as very simple heuristic rules:

function chooseNextAction(state: ResearchState): ResearchAction {
  const openGap = state.gaps[0];
  if (!openGap) {
    return {
      type: "finish",
      reason: "No explicit gaps remain and evidence is sufficient across dimensions",
    };
  }

  const dimensionEvidence = state.evidence.filter(
    (item) => item.dimensionId === openGap.dimensionId
  );

  if (dimensionEvidence.length === 0) {
    return {
      type: "search",
      dimensionId: openGap.dimensionId,
      query: openGap.question,
      reason: "Need initial sources for this dimension",
    };
  }

  const hasStrongEvidence = dimensionEvidence.some(
    (item) => item.supportLevel === "strong"
  );

  if (!hasStrongEvidence) {
    return {
      type: "read",
      dimensionId: openGap.dimensionId,
      url: pickBestSourceUrl(state, openGap.dimensionId),
      reason: "Need deeper reading to strengthen evidence quality",
    };
  }

  return {
    type: "finish",
    reason: "Current gaps are sufficiently addressed",
  };
}

What matters most here is not how "smart" the rules are, but that:

whether to keep retrieving or keep reading should be driven by explicit state.

9. When to continue, and when to stop

A Research Agent fears two extremes most:

  • stopping too early: summarizing after the first round turns up a bit of material
  • never stopping: retrieving endlessly while research quality no longer improves

So the stop conditions are also best made explicit.

Common stop conditions include:

  • every high-priority dimension has at least 1 to 2 usable pieces of evidence
  • no critical conflicts remain unresolved
  • the information gained from another round of retrieval is already marginal
  • the maximum step count or budget has been reached

This can be written as a separate function:

function shouldStop(state: ResearchState): { stop: boolean; reason: string } {
  const highPriorityDimensions = state.dimensions.filter(
    (dimension) => dimension.priority === "high"
  );

  const uncovered = highPriorityDimensions.filter((dimension) => {
    const count = state.evidence.filter(
      (item) => item.dimensionId === dimension.id
    ).length;
    return count < 2;
  });

  const openCriticalConflicts = state.conflicts.filter(
    (conflict) => conflict.resolutionStatus === "open"
  );

  if (state.currentStep >= state.maxSteps) {
    return { stop: true, reason: "Reached max research steps" };
  }

  if (uncovered.length > 0) {
    return {
      stop: false,
      reason: "Some high-priority dimensions still lack enough evidence",
    };
  }

  if (openCriticalConflicts.length > 0) {
    return {
      stop: false,
      reason: "Important conflicts remain unresolved",
    };
  }

  return {
    stop: true,
    reason: "Coverage is sufficient and key conflicts are resolved",
  };
}

The value of doing this:

  • why the research loop ended is no longer a black box
  • the logs can record an explicit stop reason
  • during later evals, you can check whether the system often stops too early

10. A fairly complete TypeScript pseudocode implementation

The code below is not a production implementation, but it is enough to express a complete research loop.

Focus on the structure; don't get hung up on specific interface details.

type ResearchDimensionId =
  | "capability_fit"
  | "integration_cost"
  | "operational_risk"
  | "known_limitations";

type ResearchDimension = {
  id: ResearchDimensionId;
  question: string;
  priority: "high" | "medium" | "low";
  done: boolean;
  confidence: number;
};

type Evidence = {
  id: string;
  dimensionId: ResearchDimensionId;
  claim: string;
  sourceTitle: string;
  sourceUrl: string;
  snippet: string;
  sourceType: "official_doc" | "blog" | "benchmark" | "community";
  supportLevel: "strong" | "medium" | "weak";
  extractedAtStep: number;
};

type InformationGap = {
  dimensionId: ResearchDimensionId;
  question: string;
  reason: "not_enough_evidence" | "conflict_detected" | "too_generic";
};

type ConflictRecord = {
  dimensionId: ResearchDimensionId;
  claimA: string;
  claimB: string;
  sourceA: string;
  sourceB: string;
  resolutionStatus: "open" | "resolved" | "needs_human_review";
  note: string;
};

type ToolCallRecord = {
  step: number;
  toolName: string;
  args: Record<string, unknown>;
  summary: string;
};

type SearchHit = {
  title: string;
  url: string;
  snippet: string;
  sourceType: "official_doc" | "blog" | "benchmark" | "community";
};

type ResearchAction =
  | {
      type: "search";
      dimensionId: ResearchDimensionId;
      query: string;
      reason: string;
    }
  | {
      type: "read";
      dimensionId: ResearchDimensionId;
      url: string;
      reason: string;
    }
  | {
      type: "finish";
      reason: string;
    };

type ResearchState = {
  goal: string;
  dimensions: ResearchDimension[];
  evidence: Evidence[];
  gaps: InformationGap[];
  conflicts: ConflictRecord[];
  notes: string[];
  toolCalls: ToolCallRecord[];
  currentStep: number;
  maxSteps: number;
  done: boolean;
  stopReason?: string;
};

type ToolResult<T> = {
  ok: boolean;
  data?: T;
  error?: string;
};

const searchSourcesTool = {
  name: "search_sources",
  description: "Search candidate sources for a research dimension",
  async run(args: {
    query: string;
    dimensionId: ResearchDimensionId;
  }): Promise<ToolResult<{ hits: SearchHit[] }>> {
    return {
      ok: true,
      data: {
        hits: [
          {
            title: "Official docs overview",
            url: "https://example.com/docs/overview",
            snippet: "Capabilities and system architecture",
            sourceType: "official_doc",
          },
          {
            title: "Community implementation notes",
            url: "https://example.com/community/post",
            snippet: "Practical integration experience",
            sourceType: "community",
          },
        ],
      },
    };
  },
};

const readSourceTool = {
  name: "read_source",
  description: "Read full content for a selected source",
  async run(args: { url: string }): Promise<ToolResult<{ title: string; content: string }>> {
    return {
      ok: true,
      data: {
        title: "Official docs overview",
        content:
          "The framework supports tool orchestration, retrieval integration, and evaluation hooks, but requires explicit state management for multi-step workflows.",
      },
    };
  },
};

function initializeResearchState(goal: string): ResearchState {
  return {
    goal,
    dimensions: [
      {
        id: "capability_fit",
        question: "Does the framework cover the core capabilities needed for an internal knowledge assistant?",
        priority: "high",
        done: false,
        confidence: 0,
      },
      {
        id: "integration_cost",
        question: "What is the integration and maintenance cost for an existing engineering team?",
        priority: "high",
        done: false,
        confidence: 0,
      },
      {
        id: "operational_risk",
        question: "What are the observability, safety, and reliability risks in production use?",
        priority: "high",
        done: false,
        confidence: 0,
      },
      {
        id: "known_limitations",
        question: "What limitations or unsuitable scenarios are repeatedly mentioned across sources?",
        priority: "medium",
        done: false,
        confidence: 0,
      },
    ],
    evidence: [],
    gaps: [],
    conflicts: [],
    notes: [],
    toolCalls: [],
    currentStep: 0,
    maxSteps: 8,
    done: false,
  };
}

function rebuildGaps(state: ResearchState) {
  const newGaps: InformationGap[] = [];

  for (const dimension of state.dimensions) {
    const relatedEvidence = state.evidence.filter(
      (item) => item.dimensionId === dimension.id
    );

    if (relatedEvidence.length < 2) {
      newGaps.push({
        dimensionId: dimension.id,
        question: dimension.question,
        reason: "not_enough_evidence",
      });
      continue;
    }

    const onlyWeakEvidence = relatedEvidence.every(
      (item) => item.supportLevel === "weak"
    );

    if (onlyWeakEvidence) {
      newGaps.push({
        dimensionId: dimension.id,
        question: `${dimension.question} Use stronger and more direct sources.`,
        reason: "too_generic",
      });
    }
  }

  for (const conflict of state.conflicts) {
    if (conflict.resolutionStatus === "open") {
      newGaps.push({
        dimensionId: conflict.dimensionId,
        question: `Resolve conflict between ${conflict.sourceA} and ${conflict.sourceB}`,
        reason: "conflict_detected",
      });
    }
  }

  state.gaps = newGaps;
}

function pickBestSourceUrl(state: ResearchState, dimensionId: ResearchDimensionId): string {
  // Prefer a URL selected by an earlier search call for this same dimension.
  const preferred = state.toolCalls
    .filter(
      (call) =>
        call.toolName === "search_sources" && call.args.dimensionId === dimensionId
    )
    .map((call) => String(call.args.selectedUrl ?? ""))
    .find(Boolean);

  return preferred || "https://example.com/docs/overview";
}

function chooseNextAction(state: ResearchState): ResearchAction {
  rebuildGaps(state);

  const stopCheck = shouldStop(state);
  if (stopCheck.stop) {
    return { type: "finish", reason: stopCheck.reason };
  }

  const nextGap = state.gaps[0];
  const evidenceForDimension = state.evidence.filter(
    (item) => item.dimensionId === nextGap.dimensionId
  );

  if (evidenceForDimension.length === 0) {
    return {
      type: "search",
      dimensionId: nextGap.dimensionId,
      query: nextGap.question,
      reason: "Need initial candidate sources",
    };
  }

  return {
    type: "read",
    dimensionId: nextGap.dimensionId,
    url: pickBestSourceUrl(state, nextGap.dimensionId),
    reason: "Need deeper evidence from a selected source",
  };
}

function shouldStop(state: ResearchState): { stop: boolean; reason: string } {
  if (state.currentStep >= state.maxSteps) {
    return { stop: true, reason: "Reached max research steps" };
  }

  const highPriorityDimensions = state.dimensions.filter(
    (dimension) => dimension.priority === "high"
  );

  const missingCoverage = highPriorityDimensions.some((dimension) => {
    const evidenceCount = state.evidence.filter(
      (item) => item.dimensionId === dimension.id
    ).length;
    return evidenceCount < 2;
  });

  if (missingCoverage) {
    return {
      stop: false,
      reason: "High-priority dimensions still need more coverage",
    };
  }

  const openConflicts = state.conflicts.some(
    (conflict) => conflict.resolutionStatus === "open"
  );

  if (openConflicts) {
    return {
      stop: false,
      reason: "Conflicts remain unresolved",
    };
  }

  return {
    stop: true,
    reason: "Coverage and evidence quality are sufficient for synthesis",
  };
}

function extractEvidenceFromRead(
  state: ResearchState,
  dimensionId: ResearchDimensionId,
  url: string,
  title: string,
  content: string
): Evidence[] {
  return [
    {
      id: `${dimensionId}-${state.currentStep}`,
      dimensionId,
      claim: "The framework supports multi-step orchestration but requires explicit workflow and state design.",
      sourceTitle: title,
      sourceUrl: url,
      snippet: content.slice(0, 160),
      sourceType: "official_doc",
      supportLevel: "strong",
      extractedAtStep: state.currentStep,
    },
  ];
}

function detectConflicts(state: ResearchState) {
  const capabilityEvidence = state.evidence.filter(
    (item) => item.dimensionId === "capability_fit"
  );

  const mentionsEasy = capabilityEvidence.find((item) =>
    item.claim.includes("easy")
  );
  const mentionsHard = capabilityEvidence.find((item) =>
    item.claim.includes("requires explicit workflow")
  );

  if (!mentionsEasy || !mentionsHard) {
    return;
  }

  // Avoid recording the same conflict again on later iterations.
  const alreadyRecorded = state.conflicts.some(
    (conflict) =>
      conflict.claimA === mentionsEasy.claim && conflict.claimB === mentionsHard.claim
  );
  if (alreadyRecorded) {
    return;
  }

  state.conflicts.push({
    dimensionId: "capability_fit",
    claimA: mentionsEasy.claim,
    claimB: mentionsHard.claim,
    sourceA: mentionsEasy.sourceUrl,
    sourceB: mentionsHard.sourceUrl,
    resolutionStatus: "open",
    note: "Need narrower evidence about what is easy by default versus what still requires custom engineering.",
  });
}

function updateDimensionStatus(state: ResearchState) {
  for (const dimension of state.dimensions) {
    const relatedEvidence = state.evidence.filter(
      (item) => item.dimensionId === dimension.id
    );
    const strongCount = relatedEvidence.filter(
      (item) => item.supportLevel === "strong"
    ).length;

    dimension.confidence = Math.min(1, strongCount / 2);
    dimension.done = relatedEvidence.length >= 2 && dimension.confidence >= 0.5;
  }
}

function logStep(
  state: ResearchState,
  event: string,
  extra: Record<string, unknown> = {}
) {
  console.log(
    JSON.stringify({
      step: state.currentStep,
      goal: state.goal,
      event,
      ...extra,
    })
  );
}

function buildFinalReport(state: ResearchState): string {
  const lines: string[] = [];

  lines.push(`# Research Summary`);
  lines.push(`Goal: ${state.goal}`);
  lines.push("");
  lines.push(`## Dimension Status`);

  for (const dimension of state.dimensions) {
    lines.push(
      `- ${dimension.id}: done=${dimension.done}, confidence=${dimension.confidence.toFixed(2)}`
    );
  }

  lines.push("");
  lines.push(`## Key Evidence`);
  for (const item of state.evidence) {
    lines.push(
      `- [${item.dimensionId}] ${item.claim} (${item.sourceTitle}, ${item.supportLevel})`
    );
  }

  lines.push("");
  lines.push(`## Remaining Gaps`);
  if (state.gaps.length === 0) {
    lines.push(`- None`);
  } else {
    for (const gap of state.gaps) {
      lines.push(`- [${gap.dimensionId}] ${gap.question}`);
    }
  }

  lines.push("");
  lines.push(`## Stop Reason`);
  lines.push(`- ${state.stopReason ?? "unknown"}`);

  return lines.join("\n");
}

async function runResearchAgent(goal: string) {
  const state = initializeResearchState(goal);

  rebuildGaps(state);

  while (!state.done) {
    const action = chooseNextAction(state);

    logStep(state, "decision_made", {
      actionType: action.type,
      reason: action.reason,
    });

    if (action.type === "finish") {
      state.done = true;
      state.stopReason = action.reason;
      break;
    }

    if (action.type === "search") {
      const result = await searchSourcesTool.run({
        query: action.query,
        dimensionId: action.dimensionId,
      });

      if (!result.ok || !result.data) {
        state.notes.push(`Search failed for ${action.dimensionId}: ${result.error ?? "unknown error"}`);
        state.currentStep += 1;
        continue;
      }

      const topHit = result.data.hits[0];
      state.toolCalls.push({
        step: state.currentStep,
        toolName: searchSourcesTool.name,
        args: {
          query: action.query,
          dimensionId: action.dimensionId,
          selectedUrl: topHit?.url,
        },
        summary: `Found ${result.data.hits.length} candidate sources`,
      });

      state.notes.push(
        `For ${action.dimensionId}, selected ${topHit?.title ?? "no hit"} for deeper reading.`
      );
    }

    if (action.type === "read") {
      const result = await readSourceTool.run({ url: action.url });

      if (!result.ok || !result.data) {
        state.notes.push(`Read failed for ${action.url}: ${result.error ?? "unknown error"}`);
        state.currentStep += 1;
        continue;
      }

      state.toolCalls.push({
        step: state.currentStep,
        toolName: readSourceTool.name,
        args: { url: action.url },
        summary: `Read ${result.data.title}`,
      });

      const extracted = extractEvidenceFromRead(
        state,
        action.dimensionId,
        action.url,
        result.data.title,
        result.data.content
      );

      state.evidence.push(...extracted);
      detectConflicts(state);
      updateDimensionStatus(state);
      rebuildGaps(state);
    }

    state.currentStep += 1;
  }

  return {
    report: buildFinalReport(state),
    state,
  };
}
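A minimal sketch of invoking this skeleton (the goal string is illustrative):

runResearchAgent(
  "Evaluate whether framework X is suitable for an internal knowledge assistant"
).then(({ report, state }) => {
  console.log(report);
  console.log(`Finished after ${state.currentStep} steps: ${state.stopReason}`);
});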

What is most worth taking away from this pseudocode is not the number of types, but the following main thread:

  • initialize the research dimensions first
  • rebuild the gaps at the start of every round
  • then decide whether to search, read, or finish
  • after reading, organize the content into evidence
  • then run conflict detection, dimension updates, and the stop check

That is a fairly typical research loop.

11. How to handle information gaps

Much of a Research Agent's value comes from whether it can recognize that:

the current information is not enough yet.

In a first implementation you don't need a particularly elaborate reflection module, but the system should at least recognize a few common kinds of gaps:

  • a high-priority dimension has not been covered at all
  • a dimension has material, but all of it is too generic
  • there is only a single source, with no cross-confirmation
  • a conflict has been found, but no higher-quality evidence has resolved it yet

So a practical approach is:

  • rebuild the gaps at the end of every round
  • turn each gap into the query input for the next round
  • clearly distinguish "missing sources" from "missing high-quality evidence"

That way the system isn't looping mechanically; it keeps moving forward around "what is still missing", as the sketch below shows.
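A minimal sketch of turning a gap into the next round's query, assuming the InformationGap type above (gapToQuery is a hypothetical helper and the query suffixes are illustrative):

function gapToQuery(gap: InformationGap): string {
  // "Missing high-quality evidence" and "unresolved conflict" call for
  // more pointed follow-up queries than a plain "missing sources" gap.
  if (gap.reason === "too_generic") {
    return `${gap.question} (look for concrete limits, benchmarks, or official specs)`;
  }
  if (gap.reason === "conflict_detected") {
    return `${gap.question} (prefer primary or official sources to settle the conflict)`;
  }
  return gap.question;
}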

12. How to handle source conflicts

In research-style tasks, source conflicts are not an anomaly; they are the norm.

What really matters is not "avoiding conflicts", but:

once a conflict appears, can the system keep it explicit and keep looking for higher-quality evidence?

A first version can start from a very plain principle:

  • official docs are usually better at answering "what is supported, where are the boundaries"
  • hands-on blog posts are better at answering "what does adoption cost, where are the pitfalls"
  • community discussions are better at exposing "real friction points", but their credibility is often uneven
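A minimal sketch of that principle as code (sourceTypeRank and preferredSideOfConflict are hypothetical names; the numeric weights are illustrative):

const sourceTypeRank: Record<Evidence["sourceType"], number> = {
  official_doc: 3,
  benchmark: 3,
  blog: 2,
  community: 1,
};

// When two claims conflict, prefer to read further into the higher-ranked
// source before deciding; if neither side clearly wins, keep the conflict open.
function preferredSideOfConflict(a: Evidence, b: Evidence): Evidence | null {
  if (sourceTypeRank[a.sourceType] === sourceTypeRank[b.sourceType]) {
    return null;
  }
  return sourceTypeRank[a.sourceType] > sourceTypeRank[b.sourceType] ? a : b;
}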

So when a conflict appears, don't rush to pick one side; do two things first:

  • record which dimension the conflict belongs to
  • keep looking for higher-quality sources that are closer to the primary facts

If it still cannot be resolved, don't pretend the problem has disappeared; keep it in the final report, marked as:

  • needs_human_review
  • or "current evidence is insufficient to conclude"

A Research Agent's credibility often comes precisely from this restraint: not forcing premature convergence.

13. What should be logged

Without logs, a Research Agent is nearly impossible to debug later.

At a minimum, log at least these kinds of events:

  • the current step number
  • the current decision type
  • the research dimension currently in focus
  • a summary of the tool arguments
  • how many pieces of evidence were extracted
  • whether a gap or conflict was found
  • the final stop reason

For example, each step can output:

logStep(state, "decision_made", {
  actionType: action.type,
  dimensionId:
    action.type === "finish" ? undefined : action.dimensionId,
  reason: action.reason,
});

The value of these logs is not entertainment; they help you answer several key questions:

  • why did this task stop at step 3
  • why does this dimension never accumulate evidence
  • why does the system keep searching and never move on to reading
  • why does the final conclusion look like guesswork

Much of the time, research quality problems eventually surface as loop or logging problems.

14. How to do a minimal eval

A first-version Research Agent does not need a big, comprehensive evaluation from day one.

A more practical approach is to start with a minimal eval set; even 5 to 10 fixed research topics are already valuable.

Focus on these aspects:

14.1 Coverage

Were the high-priority research dimensions covered?

14.2 Evidence Quality

Are the conclusions mainly built on high-quality sources?

14.3 Groundedness

Can the final conclusions be traced back to specific evidence?

14.4 Conflict Handling

When conflicts appear, does the system ignore them, or keep them and continue verifying?

14.5 Efficiency

Did it spend many steps without a clear gain in information?

A minimal eval record like this is enough:

type EvalResult = {
  taskId: string;
  finished: boolean;
  totalSteps: number;
  coveredHighPriorityDimensions: number;
  totalHighPriorityDimensions: number;
  openConflicts: number;
  groundedConclusion: boolean;
};
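A minimal sketch of filling this record from a finished ResearchState (buildEvalResult is a hypothetical helper; the groundedness check is a placeholder heuristic, not a real grader):

function buildEvalResult(taskId: string, state: ResearchState): EvalResult {
  const highPriority = state.dimensions.filter((d) => d.priority === "high");
  const covered = highPriority.filter((d) =>
    state.evidence.some((e) => e.dimensionId === d.id)
  );

  return {
    taskId,
    finished: state.done,
    totalSteps: state.currentStep,
    coveredHighPriorityDimensions: covered.length,
    totalHighPriorityDimensions: highPriority.length,
    openConflicts: state.conflicts.filter((c) => c.resolutionStatus === "open").length,
    // Placeholder: treat the conclusion as grounded if every dimension has some evidence.
    groundedConclusion: state.dimensions.every((d) =>
      state.evidence.some((e) => e.dimensionId === d.id)
    ),
  };
}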

In a first version, don't rush to make the "research score" precise. First make sure these problems can be reliably detected:

  • stopping too early
  • missing a dimension
  • evidence that is too weak
  • conflicts being ignored
  • many steps without quality improving

Once these phenomena are visible, your system has reached an iterable state.

15. The principles most worth sticking to in a first implementation

Finally, a few very practical principles.

15.1 Make the research process clear first, before making it "smarter"

Don't start by adding complex reflection, multi-Agent collaboration, or automatic task-tree expansion.

For a first-version Research Agent, what matters more is:

  • the research dimensions are clear
  • the evidence structure is clear
  • gaps and conflicts are visible
  • the stop conditions are clear

15.2 Layering search and read is steadier than "one do-everything tool"

A research loop depends heavily on rhythm.

Only with the layering can you clearly see:

  • when you are widening the pool of sources
  • when you are reading in depth
  • when you are strengthening the evidence

15.3 Allow uncertainty to exist

Research-style tasks don't always produce a fully consistent answer.

So good output is not "always being certain", but:

  • which conclusions already have sufficient support
  • where conflicts remain
  • where humans still need to judge

15.4 Build a minimal eval early; don't wait until the system is large

Without a minimal eval, a Research Agent easily grows more and more complex while you have no idea whether it is actually getting better.

16. One-sentence summary

The point of implementing a Research Agent is not wrapping "search"; it is expressing, in a stable way, the research loop of exploring a goal over multiple steps, discovering gaps, organizing evidence, and converging on conclusions.