
Anthropic open-sources its approach to agents!

 

Building effective agents

Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns.

In this post, we share what we’ve learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents.

What are agents?

"Agent" can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as agentic systems, but draw an important architectural distinction between workflows and agents:

  • Workflows are systems where LLMs and tools are orchestrated through predefined code paths.
  • Agents, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.

Below, we will explore both types of agentic systems in detail. In Appendix 1 (“Agents in Practice”), we describe two domains where customers have found particular value in using these kinds of systems.

When (and when not) to use agents

When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense.

When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale. For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough.

When and how to use frameworks

There are many frameworks that make agentic systems easier to implement, including:

  • LangGraph from LangChain;
  • Amazon Bedrock's AI Agent framework;
  • Rivet, a drag and drop GUI LLM workflow builder; and
  • Vellum, another GUI tool for building and testing complex workflows.

These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice.

We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what's under the hood are a common source of customer error.

See our cookbook for some sample implementations.

Building blocks, workflows, and agents

In this section, we’ll explore the common patterns for agentic systems we’ve seen in production. We'll start with our foundational building block—the augmented LLM—and progressively increase complexity, from simple compositional workflows to autonomous agents.

Building block: The augmented LLM

The basic building block of agentic systems is an LLM enhanced with augmentations such as retrieval, tools, and memory. Our current models can actively use these capabilities—generating their own search queries, selecting appropriate tools, and determining what information to retain.

The augmented LLM

We recommend focusing on two key aspects of the implementation: tailoring these capabilities to your specific use case and ensuring they provide an easy, well-documented interface for your LLM. While there are many ways to implement these augmentations, one approach is through our recently released Model Context Protocol, which allows developers to integrate with a growing ecosystem of third-party tools with a simple client implementation.

For the remainder of this post, we'll assume each LLM call has access to these augmented capabilities.

Workflow: Prompt chaining

Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks (see "gate" in the diagram below) on any intermediate steps to ensure that the process is still on track.

The prompt chaining workflow

When to use this workflow: This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.

Examples where prompt chaining is useful:

  • Generating marketing copy, then translating it into a different language.
  • Writing an outline of a document, checking that the outline meets certain criteria, then writing the document based on the outline.
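
The outline example above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `LLM` stands for any hypothetical function that maps a prompt string to a completion string, and `outline_gate` is an assumed criterion (a minimum number of bullet lines) chosen only to show where a programmatic check sits in the chain.

```python
from typing import Callable

# Assumption: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def outline_gate(outline: str, min_sections: int = 3) -> bool:
    """Programmatic check: the outline needs at least `min_sections` bullet lines."""
    bullets = [line for line in outline.splitlines() if line.strip().startswith("-")]
    return len(bullets) >= min_sections


def write_document(topic: str, llm: LLM) -> str:
    """Chain two LLM calls with a gate between them."""
    outline = llm(f"Write a bulleted outline for a document about: {topic}")
    if not outline_gate(outline):
        # Stop early rather than let a weak outline propagate downstream.
        raise ValueError("outline failed the gate check")
    return llm(f"Write the document that follows this outline:\n{outline}")
```

Each call does one easier job, and the gate makes a bad intermediate result fail loudly instead of silently degrading the final document.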

Workflow: Routing

Routing classifies an input and directs it to a specialized followup task. This workflow allows for separation of concerns, and building more specialized prompts. Without this workflow, optimizing for one kind of input can hurt performance on other inputs.

The routing workflow

When to use this workflow: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.

Examples where routing is useful:

  • Directing different types of customer service queries (general questions, refund requests, technical support) into different downstream processes, prompts, and tools.
  • Routing easy/common questions to smaller models like Claude 3.5 Haiku and hard/unusual questions to more capable models like Claude 3.5 Sonnet to optimize cost and speed.
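
A routing step for the customer service example might look like the sketch below. The route labels and system prompts in `ROUTES` are invented for illustration, and `classifier` and `handler` are hypothetical prompt-to-text callables (they could be the same model or different ones).

```python
from typing import Callable, Dict

# Assumption: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]

# Hypothetical categories, each with its own specialized prompt.
ROUTES: Dict[str, str] = {
    "general": "You answer general product questions.",
    "refund": "You handle refund requests; ask for the order number.",
    "technical": "You are a technical support specialist.",
}


def route(query: str, classifier: LLM, handler: LLM) -> str:
    """Classify the query, then answer it with the prompt for that category."""
    label = classifier(
        f"Classify this customer query as one of {sorted(ROUTES)}. "
        f"Reply with the label only.\n\n{query}"
    ).strip().lower()
    if label not in ROUTES:
        # Fall back to a safe default when the classifier answers off-list.
        label = "general"
    return handler(f"{ROUTES[label]}\n\nCustomer query: {query}")
```

Because each category has its own prompt, tuning the refund prompt cannot degrade answers to technical questions.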

Workflow: Parallelization

LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:

  • Sectioning: Breaking a task into independent subtasks run in parallel.
  • Voting: Running the same task multiple times to get diverse outputs.

The parallelization workflow

When to use this workflow: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.

Examples where parallelization is useful:

  • Sectioning:
    • Implementing guardrails where one model instance processes user queries while another screens them for inappropriate content or requests. This tends to perform better than having the same LLM call handle both guardrails and the core response.
    • Automating evals for evaluating LLM performance, where each LLM call evaluates a different aspect of the model’s performance on a given prompt.
  • Voting:
    • Reviewing a piece of code for vulnerabilities, where several different prompts review and flag the code if they find a problem.
    • Evaluating whether a given piece of content is inappropriate, with multiple prompts evaluating different aspects or requiring different vote thresholds to balance false positives and negatives.
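
The voting variant can be sketched with a thread pool from the Python standard library. The function name `flag_by_vote`, the yes/no answer convention, and the `threshold` parameter are assumptions made for this example; `llm` is any hypothetical prompt-to-text callable that is safe to call from several threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Assumption: any thread-safe function that maps a prompt to a completion.
LLM = Callable[[str], str]


def flag_by_vote(content: str, prompts: List[str], llm: LLM, threshold: int) -> bool:
    """Run every reviewer prompt in parallel; flag if at least `threshold` say yes."""

    def ask(prompt: str) -> bool:
        answer = llm(f"{prompt}\n\nContent:\n{content}\n\nAnswer yes or no.")
        return answer.strip().lower().startswith("yes")

    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        votes = list(pool.map(ask, prompts))
    # The aggregation is plain code, not another LLM call.
    return sum(votes) >= threshold
```

Raising `threshold` trades false positives for false negatives. Sectioning would use the same pool, but with a different subtask per call instead of the same content reviewed several times.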

Workflow: Orchestrator-workers

In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.

The orchestrator-workers workflow

When to use this workflow: This workflow is well-suited for complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task). Although it is topologically similar to parallelization, the key difference is its flexibility: subtasks aren't pre-defined, but determined by the orchestrator based on the specific input.

Examples where orchestrator-workers is useful:

  • Coding products that make complex changes to multiple files each time.
  • Search tasks that involve gathering and analyzing information from multiple sources for possible relevant information.
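
A rough sketch of the pattern follows. It assumes the orchestrator can be asked to reply with a JSON array of subtask strings, which is a convention invented for this example; `orchestrator` and `worker` are hypothetical prompt-to-text callables.

```python
import json
from typing import Callable, List

# Assumption: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def orchestrate(task: str, orchestrator: LLM, worker: LLM) -> str:
    """Let the orchestrator pick the subtasks, delegate them, then synthesize."""
    plan = orchestrator(
        "Break this task into subtasks. Reply with a JSON array of strings.\n\n" + task
    )
    subtasks: List[str] = json.loads(plan)  # the list is decided at run time
    results = [worker(f"Task: {task}\nSubtask: {s}") for s in subtasks]
    joined = "\n".join(f"{s}: {r}" for s, r in zip(subtasks, results))
    return orchestrator(f"Synthesize a final answer from these results:\n{joined}")
```

The difference from parallelization is visible in the code: the number and content of the worker calls come from the orchestrator's output rather than from the program.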

Workflow: Evaluator-optimizer

In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.

The evaluator-optimizer workflow

When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.

Examples where evaluator-optimizer is useful:

  • Literary translation where there are nuances that the translator LLM might not capture initially, but where an evaluator LLM can provide useful critiques.
  • Complex search tasks that require multiple rounds of searching and analysis to gather comprehensive information, where the evaluator decides whether further searches are warranted.
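
The generate-evaluate loop can be sketched as below. The `PASS` reply convention and the `max_rounds` cap are assumptions chosen for the example; `generator` and `evaluator` are hypothetical prompt-to-text callables.

```python
from typing import Callable

# Assumption: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def refine(task: str, generator: LLM, evaluator: LLM, max_rounds: int = 3) -> str:
    """Generate a draft, then revise it until the evaluator accepts or rounds run out."""
    draft = generator(task)
    for _ in range(max_rounds):
        feedback = evaluator(
            f"Task: {task}\nDraft: {draft}\nReply PASS or give feedback."
        )
        if feedback.strip().upper().startswith("PASS"):
            break  # the evaluator is satisfied
        draft = generator(f"{task}\nPrevious draft: {draft}\nFeedback: {feedback}")
    return draft
```

The cap on rounds keeps cost bounded when the evaluator never accepts a draft.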

Agents

Agents are emerging in production as LLMs mature in key capabilities: understanding complex inputs, engaging in reasoning and planning, using tools reliably, and recovering from errors. Agents begin their work with either a command from, or interactive discussion with, the human user. Once the task is clear, agents plan and operate independently, potentially returning to the human for further information or judgement. During execution, it's crucial for agents to gain “ground truth” from the environment at each step (such as tool call results or code execution) to assess their progress. Agents can then pause for human feedback at checkpoints or when encountering blockers. The task often terminates upon completion, but it’s also common to include stopping conditions (such as a maximum number of iterations) to maintain control.

Agents can handle sophisticated tasks, but their implementation is often straightforward. They are typically just LLMs using tools based on environmental feedback in a loop. It is therefore crucial to design toolsets and their documentation clearly and thoughtfully. We expand on best practices for tool development in Appendix 2 ("Prompt Engineering your Tools").
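
The loop described above (an LLM using tools based on environmental feedback, with a stopping condition) can be sketched as follows. The `Model` interface here is a deliberate simplification invented for this example: it receives the transcript so far and returns either a tool request or a final answer. Real APIs return richer structures.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model interface: given the transcript so far, return either
# ("tool", tool_name, argument) or ("final", answer, "").
Model = Callable[[List[str]], Tuple[str, str, str]]
Tool = Callable[[str], str]


def run_agent(task: str, model: Model, tools: Dict[str, Tool], max_steps: int = 10) -> str:
    """Let the model choose tools step by step until it answers or the limit is hit."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        kind, name, arg = model(transcript)
        if kind == "final":
            return name
        if name not in tools:
            observation = f"error: unknown tool {name!r}"
        else:
            try:
                observation = tools[name](arg)
            except Exception as exc:
                # Feed the error back as ground truth so the model can recover.
                observation = f"error: {exc}"
        transcript.append(f"{name}({arg}) -> {observation}")
    return "stopped: step limit reached"
```

Tool results, including errors, go back into the transcript so the model can judge its progress, and `max_steps` is the stopping condition that keeps a confused run from looping indefinitely.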

Autonomous agent

When to use agents: Agents can be used for open-ended problems where it’s difficult or impossible to predict the required number of steps, and where you can’t hardcode a fixed path. The LLM will potentially operate for many turns, and you must have some level of trust in its decision-making. Agents' autonomy makes them ideal for scaling tasks in trusted environments.

The autonomous nature of agents means higher costs, and the potential for compounding errors. We recommend extensive testing in sandboxed environments, along with the appropriate guardrails.

Examples where agents are useful:

The following examples are from our own implementations:

  • A coding agent to resolve SWE-bench tasks, which involve edits to many files based on a task description;
  • Our “computer use” reference implementation, where Claude uses a computer to accomplish tasks.

High-level flow of a coding agent

Combining and customizing these patterns

These building blocks aren't prescriptive. They're common patterns that developers can shape and combine to fit different use cases. The key to success, as with any LLM features, is measuring performance and iterating on implementations. To repeat: you should consider adding complexity only when it demonstrably improves outcomes.

Summary

Success in the LLM space isn't about building the most sophisticated system. It's about building the right system for your needs. Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.

When implementing agents, we try to follow three core principles:

  1. Maintain simplicity in your agent's design.
  2. Prioritize transparency by explicitly showing the agent’s planning steps.
  3. Carefully craft your agent-computer interface (ACI) through thorough tool documentation and testing.

Frameworks can help you get started quickly, but don't hesitate to reduce abstraction layers and build with basic components as you move to production. By following these principles, you can create agents that are not only powerful but also reliable, maintainable, and trusted by their users.

Acknowledgements

Written by Erik Schluntz and Barry Zhang. This work draws upon our experiences building agents at Anthropic and the valuable insights shared by our customers, for which we're deeply grateful.

Appendix 1: Agents in practice

Our work with customers has revealed two particularly promising applications for AI agents that demonstrate the practical value of the patterns discussed above. Both applications illustrate how agents add the most value for tasks that require both conversation and action, have clear success criteria, enable feedback loops, and integrate meaningful human oversight.

A. Customer support

Customer support combines familiar chatbot interfaces with enhanced capabilities through tool integration. This is a natural fit for more open-ended agents because:

  • Support interactions naturally follow a conversation flow while requiring access to external information and actions;
  • Tools can be integrated to pull customer data, order history, and knowledge base articles;
  • Actions such as issuing refunds or updating tickets can be handled programmatically; and
  • Success can be clearly measured through user-defined resolutions.

Several companies have demonstrated the viability of this approach through usage-based pricing models that charge only for successful resolutions, showing confidence in their agents' effectiveness.

B. Coding agents

The software development space has shown remarkable potential for LLM features, with capabilities evolving from code completion to autonomous problem-solving. Agents are particularly effective because:

  • Code solutions are verifiable through automated tests;
  • Agents can iterate on solutions using test results as feedback;
  • The problem space is well-defined and structured; and
  • Output quality can be measured objectively.

In our own implementation, agents can now solve real GitHub issues in the SWE-bench Verified benchmark based on the pull request description alone. However, whereas automated testing helps verify functionality, human review remains crucial for ensuring solutions align with broader system requirements.

Appendix 2: Prompt engineering your tools

No matter which agentic system you're building, tools will likely be an important part of your agent. Tools enable Claude to interact with external services and APIs by specifying their exact structure and definition in our API. When Claude responds, it will include a tool use block in the API response if it plans to invoke a tool. Tool definitions and specifications should be given just as much prompt engineering attention as your overall prompts. In this brief appendix, we describe how to prompt engineer your tools.

There are often several ways to specify the same action. For instance, you can specify a file edit by writing a diff, or by rewriting the entire file. For structured output, you can return code inside markdown or inside JSON. In software engineering, differences like these are cosmetic and can be converted losslessly from one to the other. However, some formats are much more difficult for an LLM to write than others. Writing a diff requires knowing how many lines are changing in the chunk header before the new code is written. Writing code inside JSON (compared to markdown) requires extra escaping of newlines and quotes.
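
The escaping cost is easy to see by serializing the same snippet both ways. The sketch below uses only the Python standard library; the helper names are made up for this example, and `escape_count` is only meaningful for ASCII snippets like the one shown.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable


def as_markdown(code: str) -> str:
    """Wrap code in a fenced block: the code itself is copied verbatim."""
    return f"{FENCE}python\n{code}{FENCE}"


def as_json(code: str) -> str:
    """Embed code in a JSON object: newlines and quotes must be escaped."""
    return json.dumps({"code": code})


def escape_count(code: str) -> int:
    """Extra characters JSON string escaping adds to an ASCII snippet."""
    return len(json.dumps(code)) - len(code) - 2  # minus the two surrounding quotes


snippet = 'def greet(name):\n    print("hi, " + name)\n'
print(as_markdown(snippet))
print(as_json(snippet))
print(escape_count(snippet))
```

In the markdown form the model writes the code exactly as it would appear in a file. In the JSON form every newline and double quote needs a backslash, and the model has to get each one right.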

Our suggestions for deciding on tool formats are the following:

  • Give the model enough tokens to "think" before it writes itself into a corner.
  • Keep the format close to what the model has seen naturally occurring in text on the internet.
  • Make sure there's no formatting "overhead" such as having to keep an accurate count of thousands of lines of code, or string-escaping any code it writes.

One rule of thumb is to think about how much effort goes into human-computer interfaces (HCI), and plan to invest just as much effort in creating good agent-computer interfaces (ACI). Here are some thoughts on how to do so:

  • Put yourself in the model's shoes. Is it obvious how to use this tool, based on the description and parameters, or would you need to think carefully about it? If so, then it’s probably also true for the model. A good tool definition often includes example usage, edge cases, input format requirements, and clear boundaries from other tools.
  • How can you change parameter names or descriptions to make things more obvious? Think of this as writing a great docstring for a junior developer on your team. This is especially important when using many similar tools.
  • Test how the model uses your tools: Run many example inputs in our workbench to see what mistakes the model makes, and iterate.
  • Poka-yoke your tools. Change the arguments so that it is harder to make mistakes.

While building our agent for SWE-bench, we actually spent more time optimizing our tools than the overall prompt. For example, we found that the model would make mistakes with tools using relative filepaths after the agent had moved out of the root directory. To fix this, we changed the tool to always require absolute filepaths—and we found that the model used this method flawlessly.
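
The absolute-path fix can be illustrated with a small file-reading tool. This is not the tool used for SWE-bench; the tool name, description text, and path in the example are hypothetical. The definition follows a name / description / JSON Schema layout, which should be checked against the documentation of whichever API you use.

```python
import os

# Hypothetical tool definition: the description states the format requirement
# and gives an example, as a docstring for a junior developer would.
READ_FILE_TOOL = {
    "name": "read_file",
    "description": (
        "Return the full text of one file. `path` must be an absolute path, "
        "for example /repo/src/app.py. Relative paths are rejected."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Absolute path to the file."}
        },
        "required": ["path"],
    },
}


def read_file(path: str) -> str:
    """Tool implementation: reject relative paths instead of guessing the base."""
    if not os.path.isabs(path):
        # The error message tells the model exactly how to correct the call.
        raise ValueError(f"path must be absolute, got {path!r}")
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

Rejecting relative paths removes the dependency on the agent's current directory, so the same call means the same thing at every step of a run.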
