Research Agent: A Code Implementation Example
If the minimal Agent code example mostly answers:
what the skeleton of your first Agent should look like
then this document tackles a different question, one closer to research-style tasks:
how should the code for a Research Agent, one that keeps investigating, gathering material, and converging on evidence, actually be organized?
Many people understand the concept of a Research Agent but still get stuck on these points once they start writing code:
- Should a research task be broken into dimensions first?
- What should actually be stored in state?
- How should search and read tools be layered?
- When should retrieval continue, and when should it stop?
- How should the system handle information gaps or conflicting sources?
- How do you finally converge from a pile of material into a well-grounded conclusion?
So this document does not cover a "universal research system" or high-risk automated execution. It focuses on one thing only:
a reasonably complete TypeScript pseudocode skeleton built around the research loop.
The point is not flashy code, but making the core loop of a research-style Agent visible:
research goal -> plan research dimensions -> search -> read -> extract evidence -> assess gaps/conflicts -> keep exploring or stop -> form conclusions
1. When this code skeleton fits
Not every task needs a Research Agent.
If the user asks a very direct question, such as:
Why is this endpoint returning 500?
then what you probably want is a Debug Agent, not a Research Agent.
A Research Agent fits tasks like these:
- They require synthesizing multiple sources, not a single fact
- They involve comparison across several dimensions
- The first round of information is usually not enough, so follow-up research is needed
- The final output is not "excerpted material" but "a formed judgment"
For example:
- Investigating whether an AI framework suits an internal knowledge assistant
- Comparing several vector databases for a specific business scenario
- Analyzing whether a new model fits an existing workflow
- Studying the cost, limits, and risk boundaries of a technical approach
What these tasks share is this:
the value lies not in finding content, but in organizing the exploration process.
2. Define a research-style task first
To make the code structure concrete, let's fix one example task:
investigate whether a given AI framework is suitable as an enterprise-internal knowledge assistant.
This goal is inherently a synthesized judgment, not single-point Q&A.
The system must answer at least:
- What problems does it solve?
- What scenarios does it fit?
- What scenarios does it not fit?
- How large is the adoption cost for the team?
- What are its known limitations and risks?
In other words, a Research Agent's output is not just "information found", but:
- Structured conclusions
- The evidence behind each conclusion
- The uncertainty that still remains
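Those three outputs can be sketched as a single result shape. This is only an illustration; the names `ResearchReport` and `Conclusion` are assumptions made for this sketch, not types used in the skeleton later in the document:

```typescript
// Hypothetical shape of a Research Agent's final output:
// structured conclusions, pointers to the evidence behind each one,
// and the uncertainty that still remains, kept explicit.
type Conclusion = {
  dimension: string;     // which research dimension this answers
  verdict: string;       // the structured conclusion itself
  evidenceIds: string[]; // pointers back to supporting evidence
};

type ResearchReport = {
  conclusions: Conclusion[];
  openQuestions: string[]; // remaining uncertainty, kept visible
};

const report: ResearchReport = {
  conclusions: [
    {
      dimension: "capability_fit",
      verdict: "Covers the retrieval and orchestration needs",
      evidenceIds: ["capability_fit-3"],
    },
  ],
  openQuestions: ["Long-term maintenance cost is still unclear"],
};
```

Keeping `evidenceIds` as pointers rather than copied text is what later allows a conclusion to be traced back to its evidence.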
3. Plan research dimensions first; don't search immediately
The first mistake a Research Agent tends to make is:
searching a lot right away, then spinning in circles inside the results.
A steadier approach is to first break the research task into dimensions.
In this case, for example:
- capability_fit: does capability coverage meet the requirements
- integration_cost: integration and engineering-rework cost
- operational_risk: stability, observability, governance boundaries
- known_limitations: known limits, gaps, and unsuitable scenarios
The benefit is not "formal completeness"; it is that the later loop gets explicit targets:
- Which dimension is this round of retrieval filling?
- Which dimension already has enough?
- Which dimension still has weak evidence?
- Which sources are actually discussing the same thing?
So a first-version Research Agent is well worth keeping an explicit "research dimension plan".
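The dimension plan above can simply be kept as data that the loop iterates over. A minimal sketch (the `researchPlan` name is an assumption for illustration; the full skeleton later uses a richer `ResearchDimension` type):

```typescript
// An explicit research-dimension plan matching the four dimensions
// named above. Keeping it as plain data lets every loop iteration
// state which dimension it is trying to fill.
const researchPlan = [
  { id: "capability_fit", question: "Does capability coverage meet the requirements?", priority: "high" },
  { id: "integration_cost", question: "What is the integration and engineering-rework cost?", priority: "high" },
  { id: "operational_risk", question: "What are the stability, observability, and governance risks?", priority: "high" },
  { id: "known_limitations", question: "What known limits and unsuitable scenarios exist?", priority: "medium" },
] as const;

// The loop can now always answer: "which dimension is this round for?"
const highPriority = researchPlan.filter((d) => d.priority === "high");
```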
4. How to design the state
A Research Agent's state usually matters more than a minimal Agent's.
What hurts research tasks most is not "nothing found" but:
- What was already found gets scattered
- Gaps go unrecorded
- Conflicting sources get overwritten
- Only a vague conclusion survives
Here is a state structure well suited to a first implementation:
type ResearchDimensionId =
| "capability_fit"
| "integration_cost"
| "operational_risk"
| "known_limitations";
type ResearchDimension = {
id: ResearchDimensionId;
question: string;
priority: "high" | "medium" | "low";
done: boolean;
confidence: number;
};
type Evidence = {
id: string;
dimensionId: ResearchDimensionId;
claim: string;
sourceTitle: string;
sourceUrl: string;
snippet: string;
sourceType: "official_doc" | "blog" | "benchmark" | "community";
supportLevel: "strong" | "medium" | "weak";
extractedAtStep: number;
};
type InformationGap = {
dimensionId: ResearchDimensionId;
question: string;
reason: "not_enough_evidence" | "conflict_detected" | "too_generic";
};
type ConflictRecord = {
dimensionId: ResearchDimensionId;
claimA: string;
claimB: string;
sourceA: string;
sourceB: string;
resolutionStatus: "open" | "resolved" | "needs_human_review";
note: string;
};
type ToolCallRecord = {
step: number;
toolName: string;
args: Record<string, unknown>;
summary: string;
};
type ResearchState = {
goal: string;
dimensions: ResearchDimension[];
evidence: Evidence[];
gaps: InformationGap[];
conflicts: ConflictRecord[];
notes: string[];
toolCalls: ToolCallRecord[];
currentStep: number;
maxSteps: number;
done: boolean;
stopReason?: string;
};
Compared with a minimal Agent, this structure adds several genuinely important kinds of information:
- dimensions: tells the system which facets this research covers
- evidence: stores not raw text but "citable evidence units"
- gaps: explicitly records what is still unclear
- conflicts: explicitly records contradictory claims
These fields directly determine whether the research loop can advance reliably.
5. Why evidence cannot be just raw text
Many first implementations store evidence as:
const evidence: string[] = [];
This runs, but two problems show up quickly:
- You don't know which dimension a passage supports
- You don't know where it came from, so you cannot compare credibility
A Research Agent should instead store evidence as "structured claims".
That is, each piece of evidence should answer at least:
- What judgment does it support?
- Where does it come from?
- Is it strong or weak?
- Which research dimension does it belong to?
Only then can the synthesis stage answer:
- Which dimension already has enough evidence?
- Which conclusion rests only on weak evidence?
- Which conflict is merely a difference in source tier?
In other words:
a Research Agent does not dump material into an array; it organizes material into comparable evidence objects.
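As a sketch of that idea, here is a hypothetical `toEvidence` helper that wraps a raw passage into a comparable claim. The field set mirrors the `Evidence` type from section 4; the strong-vs-medium default rule is an assumption made purely for illustration:

```typescript
type SourceType = "official_doc" | "blog" | "benchmark" | "community";

type Evidence = {
  id: string;
  dimensionId: string;
  claim: string;
  sourceUrl: string;
  snippet: string;
  sourceType: SourceType;
  supportLevel: "strong" | "medium" | "weak";
};

// Illustrative helper: instead of pushing raw strings into an array,
// record what the passage supports, where it came from, and how strong it is.
function toEvidence(
  dimensionId: string,
  claim: string,
  source: { url: string; type: SourceType; text: string }
): Evidence {
  return {
    id: `${dimensionId}-${Date.now()}`,
    dimensionId,
    claim,
    sourceUrl: source.url,
    snippet: source.text.slice(0, 160), // keep a short citable excerpt
    sourceType: source.type,
    // Crude default: official docs count as strong, everything else medium.
    supportLevel: source.type === "official_doc" ? "strong" : "medium",
  };
}

const item = toEvidence("capability_fit", "Supports tool orchestration", {
  url: "https://example.com/docs",
  type: "official_doc",
  text: "The framework supports tool orchestration and retrieval integration.",
});
```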
6. Tool design: separate retrieval from read
A first implementation easily ends up with one catch-all tool, such as:
search_and_read_everything
A tool like that usually creates two problems:
- The decision layer doesn't know whether it is "finding sources" or "reading content"
- The logs make it hard to see how the research loop is advancing
A clearer design splits tools into two layers:
- retrieval: find candidate sources
- read: fetch the detailed content of a selected source
For example:
type ToolResult<T> = {
ok: boolean;
data?: T;
error?: string;
};
type SearchHit = {
title: string;
url: string;
snippet: string;
sourceType: "official_doc" | "blog" | "benchmark" | "community";
};
type ToolDefinition<TArgs, TResult> = {
name: string;
description: string;
run: (args: TArgs) => Promise<ToolResult<TResult>>;
};
const searchSourcesTool: ToolDefinition<
{ query: string; dimensionId: ResearchDimensionId },
{ hits: SearchHit[] }
> = {
name: "search_sources",
description: "Search candidate sources for a research dimension",
async run(args) {
// Stubbed data: a real implementation would call a search API with args.query.
return {
ok: true,
data: {
hits: [
{
title: "Framework official docs",
url: "https://example.com/docs",
snippet: "Architecture, features, and deployment model",
sourceType: "official_doc",
},
],
},
};
},
};
const readSourceTool: ToolDefinition<
{ url: string },
{ title: string; content: string }
> = {
name: "read_source",
description: "Read the full content of a selected source",
async run(args) {
// Stubbed data: a real implementation would fetch and parse args.url.
return {
ok: true,
data: {
title: "Framework official docs",
content: "Detailed content here...",
},
};
},
};
This layering has several direct benefits:
- Clearer decisions: find first, then read
- Clearer tool logs: you can see whether the research is broadening its sources or deepening its understanding
- Controllable cost: the system won't read every long document up front
The feel of a research loop often lives exactly in this rhythm of "broaden the search surface first, then read the key sources in depth".
7. What the decision layer should decide
A first-version Research Agent's decision layer doesn't need to be complicated.
At its core it only decides three things:
- Which dimension to research next
- Whether to continue with search or with read
- Whether the information is already sufficient to converge
So the first version's action types can be very simple:
type ResearchAction =
| {
type: "search";
dimensionId: ResearchDimensionId;
query: string;
reason: string;
}
| {
type: "read";
dimensionId: ResearchDimensionId;
url: string;
reason: string;
}
| {
type: "finish";
reason: string;
};
Notice that there are deliberately no high-risk actions here.
The point of this example is not automatically executing system changes, but:
continuously finding evidence around a research task, filling gaps, and deciding when to converge.
8. When to keep searching vs. keep reading
A particularly important question inside the research loop:
should the next step be more searching, or deeper reading of the sources already found?
A reasonably robust rule of thumb:
- If a dimension has almost no sources, search first
- If a dimension has candidate sources but the claims are still unclear, read first
- If a conflict has appeared, prioritize a read of a higher-quality source
- If all the evidence is too generic, search again with a more specific question
This judgment can be written as a very simple heuristic:
function chooseNextAction(state: ResearchState): ResearchAction {
const openGap = state.gaps[0];
if (!openGap) {
return {
type: "finish",
reason: "No explicit gaps remain and evidence is sufficient across dimensions",
};
}
const dimensionEvidence = state.evidence.filter(
(item) => item.dimensionId === openGap.dimensionId
);
if (dimensionEvidence.length === 0) {
return {
type: "search",
dimensionId: openGap.dimensionId,
query: openGap.question,
reason: "Need initial sources for this dimension",
};
}
const hasStrongEvidence = dimensionEvidence.some(
(item) => item.supportLevel === "strong"
);
if (!hasStrongEvidence) {
return {
type: "read",
dimensionId: openGap.dimensionId,
url: pickBestSourceUrl(state, openGap.dimensionId),
reason: "Need deeper reading to strengthen evidence quality",
};
}
return {
type: "finish",
reason: "Current gaps are sufficiently addressed",
};
}
What matters most here is not how "smart" the rules are, but that:
whether to keep searching or keep reading is driven by explicit state.
9. When to continue, when to stop
A Research Agent fears two extremes most:
- Stopping too early: summarizing after the first round finds a little material
- Never stopping: retrieving endlessly while research quality stops improving
So the stop conditions are best made explicit too.
Common stop conditions include:
- Every high-priority dimension has at least 1 to 2 usable pieces of evidence
- No critical conflicts remain unresolved
- The information gain from another retrieval round is already low
- The max step count or budget limit is reached
This can be a standalone function:
function shouldStop(state: ResearchState): { stop: boolean; reason: string } {
const highPriorityDimensions = state.dimensions.filter(
(dimension) => dimension.priority === "high"
);
const uncovered = highPriorityDimensions.filter((dimension) => {
const count = state.evidence.filter(
(item) => item.dimensionId === dimension.id
).length;
return count < 2;
});
const openCriticalConflicts = state.conflicts.filter(
(conflict) => conflict.resolutionStatus === "open"
);
if (state.currentStep >= state.maxSteps) {
return { stop: true, reason: "Reached max research steps" };
}
if (uncovered.length > 0) {
return {
stop: false,
reason: "Some high-priority dimensions still lack enough evidence",
};
}
if (openCriticalConflicts.length > 0) {
return {
stop: false,
reason: "Important conflicts remain unresolved",
};
}
return {
stop: true,
reason: "Coverage is sufficient and key conflicts are resolved",
};
}
The value of doing this:
- Why the research loop ended is no longer a black box
- The logs can record an explicit stop reason
- Later evals can check whether the system tends to stop too early
10. A fairly complete TypeScript pseudocode implementation
The code below is not a production implementation, but it is enough to express a complete research loop.
Focus on the structure; don't fixate on interface details.
type ResearchDimensionId =
| "capability_fit"
| "integration_cost"
| "operational_risk"
| "known_limitations";
type ResearchDimension = {
id: ResearchDimensionId;
question: string;
priority: "high" | "medium" | "low";
done: boolean;
confidence: number;
};
type Evidence = {
id: string;
dimensionId: ResearchDimensionId;
claim: string;
sourceTitle: string;
sourceUrl: string;
snippet: string;
sourceType: "official_doc" | "blog" | "benchmark" | "community";
supportLevel: "strong" | "medium" | "weak";
extractedAtStep: number;
};
type InformationGap = {
dimensionId: ResearchDimensionId;
question: string;
reason: "not_enough_evidence" | "conflict_detected" | "too_generic";
};
type ConflictRecord = {
dimensionId: ResearchDimensionId;
claimA: string;
claimB: string;
sourceA: string;
sourceB: string;
resolutionStatus: "open" | "resolved" | "needs_human_review";
note: string;
};
type ToolCallRecord = {
step: number;
toolName: string;
args: Record<string, unknown>;
summary: string;
};
type SearchHit = {
title: string;
url: string;
snippet: string;
sourceType: "official_doc" | "blog" | "benchmark" | "community";
};
type ResearchAction =
| {
type: "search";
dimensionId: ResearchDimensionId;
query: string;
reason: string;
}
| {
type: "read";
dimensionId: ResearchDimensionId;
url: string;
reason: string;
}
| {
type: "finish";
reason: string;
};
type ResearchState = {
goal: string;
dimensions: ResearchDimension[];
evidence: Evidence[];
gaps: InformationGap[];
conflicts: ConflictRecord[];
notes: string[];
toolCalls: ToolCallRecord[];
currentStep: number;
maxSteps: number;
done: boolean;
stopReason?: string;
};
type ToolResult<T> = {
ok: boolean;
data?: T;
error?: string;
};
const searchSourcesTool = {
name: "search_sources",
description: "Search candidate sources for a research dimension",
async run(args: {
query: string;
dimensionId: ResearchDimensionId;
}): Promise<ToolResult<{ hits: SearchHit[] }>> {
// Stubbed data: a real implementation would query a search backend here.
return {
ok: true,
data: {
hits: [
{
title: "Official docs overview",
url: "https://example.com/docs/overview",
snippet: "Capabilities and system architecture",
sourceType: "official_doc",
},
{
title: "Community implementation notes",
url: "https://example.com/community/post",
snippet: "Practical integration experience",
sourceType: "community",
},
],
},
};
},
};
const readSourceTool = {
name: "read_source",
description: "Read full content for a selected source",
async run(args: { url: string }): Promise<ToolResult<{ title: string; content: string }>> {
// Stubbed data: a real implementation would fetch args.url here.
return {
ok: true,
data: {
title: "Official docs overview",
content:
"The framework supports tool orchestration, retrieval integration, and evaluation hooks, but requires explicit state management for multi-step workflows.",
},
};
},
};
function initializeResearchState(goal: string): ResearchState {
return {
goal,
dimensions: [
{
id: "capability_fit",
question: "Does the framework cover the core capabilities needed for an internal knowledge assistant?",
priority: "high",
done: false,
confidence: 0,
},
{
id: "integration_cost",
question: "What is the integration and maintenance cost for an existing engineering team?",
priority: "high",
done: false,
confidence: 0,
},
{
id: "operational_risk",
question: "What are the observability, safety, and reliability risks in production use?",
priority: "high",
done: false,
confidence: 0,
},
{
id: "known_limitations",
question: "What limitations or unsuitable scenarios are repeatedly mentioned across sources?",
priority: "medium",
done: false,
confidence: 0,
},
],
evidence: [],
gaps: [],
conflicts: [],
notes: [],
toolCalls: [],
currentStep: 0,
maxSteps: 8,
done: false,
};
}
function rebuildGaps(state: ResearchState) {
const newGaps: InformationGap[] = [];
for (const dimension of state.dimensions) {
const relatedEvidence = state.evidence.filter(
(item) => item.dimensionId === dimension.id
);
if (relatedEvidence.length < 2) {
newGaps.push({
dimensionId: dimension.id,
question: dimension.question,
reason: "not_enough_evidence",
});
continue;
}
const onlyWeakEvidence = relatedEvidence.every(
(item) => item.supportLevel === "weak"
);
if (onlyWeakEvidence) {
newGaps.push({
dimensionId: dimension.id,
question: `${dimension.question} Use stronger and more direct sources.`,
reason: "too_generic",
});
}
}
for (const conflict of state.conflicts) {
if (conflict.resolutionStatus === "open") {
newGaps.push({
dimensionId: conflict.dimensionId,
question: `Resolve conflict between ${conflict.sourceA} and ${conflict.sourceB}`,
reason: "conflict_detected",
});
}
}
state.gaps = newGaps;
}
function pickBestSourceUrl(state: ResearchState, dimensionId: ResearchDimensionId): string {
// Prefer a URL selected during an earlier search for the same dimension;
// fall back to a default source when none was recorded.
const preferred = state.toolCalls
.filter(
(call) =>
call.toolName === "search_sources" &&
call.args.dimensionId === dimensionId
)
.map((call) => String(call.args.selectedUrl ?? ""))
.find(Boolean);
return preferred || "https://example.com/docs/overview";
}
function chooseNextAction(state: ResearchState): ResearchAction {
rebuildGaps(state);
const stopCheck = shouldStop(state);
if (stopCheck.stop) {
return { type: "finish", reason: stopCheck.reason };
}
const nextGap = state.gaps[0];
const evidenceForDimension = state.evidence.filter(
(item) => item.dimensionId === nextGap.dimensionId
);
if (evidenceForDimension.length === 0) {
return {
type: "search",
dimensionId: nextGap.dimensionId,
query: nextGap.question,
reason: "Need initial candidate sources",
};
}
return {
type: "read",
dimensionId: nextGap.dimensionId,
url: pickBestSourceUrl(state, nextGap.dimensionId),
reason: "Need deeper evidence from a selected source",
};
}
function shouldStop(state: ResearchState): { stop: boolean; reason: string } {
if (state.currentStep >= state.maxSteps) {
return { stop: true, reason: "Reached max research steps" };
}
const highPriorityDimensions = state.dimensions.filter(
(dimension) => dimension.priority === "high"
);
const missingCoverage = highPriorityDimensions.some((dimension) => {
const evidenceCount = state.evidence.filter(
(item) => item.dimensionId === dimension.id
).length;
return evidenceCount < 2;
});
if (missingCoverage) {
return {
stop: false,
reason: "High-priority dimensions still need more coverage",
};
}
const openConflicts = state.conflicts.some(
(conflict) => conflict.resolutionStatus === "open"
);
if (openConflicts) {
return {
stop: false,
reason: "Conflicts remain unresolved",
};
}
return {
stop: true,
reason: "Coverage and evidence quality are sufficient for synthesis",
};
}
function extractEvidenceFromRead(
state: ResearchState,
dimensionId: ResearchDimensionId,
url: string,
title: string,
content: string
): Evidence[] {
return [
{
id: `${dimensionId}-${state.currentStep}`,
dimensionId,
claim: "The framework supports multi-step orchestration but requires explicit workflow and state design.",
sourceTitle: title,
sourceUrl: url,
snippet: content.slice(0, 160),
sourceType: "official_doc",
supportLevel: "strong",
extractedAtStep: state.currentStep,
},
];
}
function detectConflicts(state: ResearchState) {
const capabilityEvidence = state.evidence.filter(
(item) => item.dimensionId === "capability_fit"
);
const mentionsEasy = capabilityEvidence.find((item) =>
item.claim.includes("easy")
);
const mentionsHard = capabilityEvidence.find((item) =>
item.claim.includes("requires explicit workflow")
);
if (mentionsEasy && mentionsHard) {
state.conflicts.push({
dimensionId: "capability_fit",
claimA: mentionsEasy.claim,
claimB: mentionsHard.claim,
sourceA: mentionsEasy.sourceUrl,
sourceB: mentionsHard.sourceUrl,
resolutionStatus: "open",
note: "Need narrower evidence about what is easy by default versus what still requires custom engineering.",
});
}
}
function updateDimensionStatus(state: ResearchState) {
for (const dimension of state.dimensions) {
const relatedEvidence = state.evidence.filter(
(item) => item.dimensionId === dimension.id
);
const strongCount = relatedEvidence.filter(
(item) => item.supportLevel === "strong"
).length;
dimension.confidence = Math.min(1, strongCount / 2);
dimension.done = relatedEvidence.length >= 2 && dimension.confidence >= 0.5;
}
}
function logStep(
state: ResearchState,
event: string,
extra: Record<string, unknown> = {}
) {
console.log(
JSON.stringify({
step: state.currentStep,
goal: state.goal,
event,
...extra,
})
);
}
function buildFinalReport(state: ResearchState): string {
const lines: string[] = [];
lines.push(`# Research Summary`);
lines.push(`Goal: ${state.goal}`);
lines.push("");
lines.push(`## Dimension Status`);
for (const dimension of state.dimensions) {
lines.push(
`- ${dimension.id}: done=${dimension.done}, confidence=${dimension.confidence.toFixed(2)}`
);
}
lines.push("");
lines.push(`## Key Evidence`);
for (const item of state.evidence) {
lines.push(
`- [${item.dimensionId}] ${item.claim} (${item.sourceTitle}, ${item.supportLevel})`
);
}
lines.push("");
lines.push(`## Remaining Gaps`);
if (state.gaps.length === 0) {
lines.push(`- None`);
} else {
for (const gap of state.gaps) {
lines.push(`- [${gap.dimensionId}] ${gap.question}`);
}
}
lines.push("");
lines.push(`## Stop Reason`);
lines.push(`- ${state.stopReason ?? "unknown"}`);
return lines.join("\n");
}
async function runResearchAgent(goal: string) {
const state = initializeResearchState(goal);
rebuildGaps(state);
while (!state.done) {
const action = chooseNextAction(state);
logStep(state, "decision_made", {
actionType: action.type,
reason: action.reason,
});
if (action.type === "finish") {
state.done = true;
state.stopReason = action.reason;
break;
}
if (action.type === "search") {
const result = await searchSourcesTool.run({
query: action.query,
dimensionId: action.dimensionId,
});
if (!result.ok || !result.data) {
state.notes.push(`Search failed for ${action.dimensionId}: ${result.error ?? "unknown error"}`);
state.currentStep += 1;
continue;
}
const topHit = result.data.hits[0];
state.toolCalls.push({
step: state.currentStep,
toolName: searchSourcesTool.name,
args: {
query: action.query,
dimensionId: action.dimensionId,
selectedUrl: topHit?.url,
},
summary: `Found ${result.data.hits.length} candidate sources`,
});
state.notes.push(
`For ${action.dimensionId}, selected ${topHit?.title ?? "no hit"} for deeper reading.`
);
}
if (action.type === "read") {
const result = await readSourceTool.run({ url: action.url });
if (!result.ok || !result.data) {
state.notes.push(`Read failed for ${action.url}: ${result.error ?? "unknown error"}`);
state.currentStep += 1;
continue;
}
state.toolCalls.push({
step: state.currentStep,
toolName: readSourceTool.name,
args: { url: action.url },
summary: `Read ${result.data.title}`,
});
const extracted = extractEvidenceFromRead(
state,
action.dimensionId,
action.url,
result.data.title,
result.data.content
);
state.evidence.push(...extracted);
detectConflicts(state);
updateDimensionStatus(state);
rebuildGaps(state);
}
state.currentStep += 1;
}
return {
report: buildFinalReport(state),
state,
};
}
What's most worth taking from this pseudocode is not the number of types, but this main line:
- Initialize the research dimensions first
- Rebuild gaps at the start of each round
- Then decide between search, read, and finish
- After reading, distill the content into evidence
- Then run conflict detection, dimension updates, and the stop check
That is a fairly typical research loop.
11. Handling information gaps
Much of a Research Agent's value lies in its ability to notice:
the current information is not enough yet.
A first version needs no elaborate reflection module, but the system should at least recognize a few common kinds of gap:
- A high-priority dimension has not been covered yet
- A dimension has material, but all of it is too generic
- Only a single source exists, with no cross-support
- A conflict has been found, but no higher-quality evidence has converged it yet
So a practical approach is:
- Rebuild gaps at the end of every round
- Turn each gap into the next round's query input
- Explicitly distinguish "missing sources" from "missing high-quality evidence"
That way the system is not looping mechanically; it keeps advancing around "what is still missing".
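The "gap becomes the next query" step can be sketched as follows. `gapToQuery` is an illustrative helper, not part of the skeleton above; the point is that the reason for a gap should shape the next search:

```typescript
type InformationGap = {
  dimensionId: string;
  question: string;
  reason: "not_enough_evidence" | "conflict_detected" | "too_generic";
};

// Illustrative: different gap reasons lead to different queries, so
// "missing sources" and "missing strong evidence" are not handled
// by the same mechanical search.
function gapToQuery(gap: InformationGap): string {
  switch (gap.reason) {
    case "not_enough_evidence":
      // No sources yet: search broadly with the dimension question itself.
      return gap.question;
    case "too_generic":
      // Sources exist but are vague: ask for concrete specifics.
      return `${gap.question} concrete limitations and benchmarks`;
    case "conflict_detected":
    default:
      // Conflicting claims: steer toward authoritative sources.
      return `${gap.question} official documentation`;
  }
}

const query = gapToQuery({
  dimensionId: "integration_cost",
  question: "What is the integration cost?",
  reason: "too_generic",
});
```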
12. Handling source conflicts
In research tasks, conflicting sources are the norm, not an anomaly.
What really matters is not "avoiding conflicts" but:
once a conflict appears, can the system keep it explicit and continue looking for higher-quality evidence?
A first version can start from a very plain principle:
- Official docs are usually best for "what is supported and where the boundaries are"
- Practitioner blogs are best for "adoption cost and pitfalls"
- Community discussion best exposes "real friction points", but its credibility is often uneven
So when a conflict appears, don't rush to pick a side. First do two things:
- Record which dimension the conflict belongs to
- Keep looking for higher-quality sources closer to the original facts
If it still cannot be resolved, don't pretend the problem disappeared. Keep it in the final report, marked as:
- needs_human_review
- or "current evidence is insufficient to conclude"
A Research Agent's credibility often comes precisely from this restraint, from not forcing a premature convergence.
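The source-type rules of thumb above can be encoded as a naive ranking used when trying to close a conflict. This is a deliberately crude sketch; `sourceRank` and `tryResolve` are illustrative names, not part of the skeleton:

```typescript
type SourceType = "official_doc" | "blog" | "benchmark" | "community";

// Naive source ranking reflecting the rules of thumb above.
const sourceRank: Record<SourceType, number> = {
  official_doc: 3,
  benchmark: 2,
  blog: 1,
  community: 0,
};

type Claim = { text: string; sourceType: SourceType };

// Illustrative resolution: only prefer one claim when its source
// outranks the other; otherwise keep the conflict open for review.
function tryResolve(a: Claim, b: Claim): Claim | "needs_human_review" {
  if (sourceRank[a.sourceType] > sourceRank[b.sourceType]) return a;
  if (sourceRank[b.sourceType] > sourceRank[a.sourceType]) return b;
  return "needs_human_review";
}
```

Note that a tie does not force a choice; the conflict stays open, which matches the restraint described above.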
13. What the logs should record
Without logs, a Research Agent is nearly impossible to debug.
A minimal log should record at least these kinds of events:
- The current step number
- The current decision type
- The research dimension in focus
- A summary of the tool arguments
- How many pieces of evidence were extracted
- Whether a gap or conflict was found
- The final stop reason
For example, each step could output:
logStep(state, "decision_made", {
actionType: action.type,
dimensionId:
action.type === "finish" ? undefined : action.dimensionId,
reason: action.reason,
});
The value of these logs is not spectacle; they help you answer some very important questions:
- Why did this task stop at step 3?
- Why does this dimension never get any evidence?
- Why does the system keep choosing search but never enter read?
- Why does the final conclusion look like a guess?
Very often, research-quality problems eventually surface as loop or logging problems.
14. What a minimal eval looks like
A first-version Research Agent does not need a comprehensive evaluation suite up front.
It is more practical to start with a minimal eval; even 5 to 10 fixed research questions are valuable.
Focus on these aspects:
14.1 Coverage
Were the high-priority research dimensions covered?
14.2 Evidence Quality
Do the conclusions rest mainly on high-quality sources?
14.3 Groundedness
Can the final conclusions point back to concrete evidence?
14.4 Conflict Handling
When conflicts appear, does the system ignore them, or keep them and continue verifying?
14.5 Efficiency
Did many steps produce little obvious information gain?
A minimal eval record like this is enough:
type EvalResult = {
taskId: string;
finished: boolean;
totalSteps: number;
coveredHighPriorityDimensions: number;
totalHighPriorityDimensions: number;
openConflicts: number;
groundedConclusion: boolean;
};
Don't chase a precise "research score" in the first version. Aim first at reliably detecting these problems:
- Stopping too early
- Missing dimensions
- Evidence that is too weak
- Conflicts being ignored
- Many steps without quality improvement
Once these phenomena are visible, your system is already in an iterable state.
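These problem classes can be flagged mechanically from the minimal `EvalResult` record. A sketch; the step threshold is an assumption, not a tuned value:

```typescript
type EvalResult = {
  taskId: string;
  finished: boolean;
  totalSteps: number;
  coveredHighPriorityDimensions: number;
  totalHighPriorityDimensions: number;
  openConflicts: number;
  groundedConclusion: boolean;
};

// Illustrative checks for the failure modes listed above.
function flagProblems(result: EvalResult): string[] {
  const problems: string[] = [];
  if (!result.finished) problems.push("stopped before finishing");
  if (result.coveredHighPriorityDimensions < result.totalHighPriorityDimensions)
    problems.push("missed high-priority dimensions");
  if (result.openConflicts > 0) problems.push("ignored open conflicts");
  if (!result.groundedConclusion) problems.push("conclusion not grounded in evidence");
  if (result.totalSteps > 10 && !result.groundedConclusion)
    problems.push("many steps without a quality gain");
  return problems;
}
```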
15. Principles most worth keeping in a first implementation
Finally, a few very practical principles.
15.1 Make the research process clear before making it "smarter"
Don't start by adding complex reflection, multi-Agent collaboration, or automatic task-tree expansion.
For a first-version Research Agent, what matters more is:
- Clear research dimensions
- A clear evidence structure
- Visible gaps and conflicts
- Clear stop conditions
15.2 Layered search and read beats one catch-all tool
The research loop depends heavily on rhythm.
Only after splitting the layers can you clearly see:
- When the system is broadening its source surface
- When it is reading in depth
- When it is strengthening evidence quality
15.3 Let uncertainty exist
Research tasks do not always yield a fully consistent answer.
So good output is not "always confident"; it is:
- Which conclusions already have sufficient grounding
- Where conflicts still remain
- Where human judgment is still needed
15.4 Build a minimal eval early; don't wait until the system is large
Without a minimal eval, a Research Agent easily grows ever more complex without you knowing whether it has actually gotten better.
16. One-sentence summary
The point of implementing a Research Agent is not wrapping "search" in code; it is reliably expressing the research loop itself: multi-step exploration around a goal, discovering gaps, organizing evidence, and converging on conclusions.