Introduction

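SayCan is a language-processing approach developed by Robotics at Google together with Everyday Robots. It interprets natural-language instructions and proposes responses, and it evaluates how feasible each response really is in the robot's current physical environment, so the robot can help users complete tasks more reliably. SayCan draws on the output of a large language model and pairs it with language-conditioned value functions trained with reinforcement learning. In the reported experiments, SayCan reached a planning success rate of 84% and an execution success rate of 74%, translating language tasks into robot behavior better than the models it was compared with.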

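Since Google proposed the SayCan framework, using large language models to help robots interpret instructions for complex tasks has become a popular research topic. People no longer want robots that simply do what they are told; they want robots that understand the semantics behind a request, uncover latent needs, and carry out the task accordingly. Technologies such as ChatGPT have accelerated this trend, and RoboSDK and ROS provide unified cloud-facing and robot-facing APIs that can handle data collection and task execution across heterogeneous devices.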

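This article sketches one approach: using prompt techniques to implement robot-specific functions, such as turning user input into structured information and building specific commands. Below, it covers the technical background and the design of a robot that can be controlled with ChatGPT.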

Functional framework

What capabilities does a robot controlled by an LLM need?

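Take navigation and grasping by a mobile manipulator, indoors or outdoors, as the test scenario. Given an instruction (for example: "Go to the BYD parked ahead and check whether anything important was left behind."), use an LLM (such as ChatGPT) to drive the robot's perception, planning, and control algorithms.
Validate the whole process in closed loop in a simulation environment.
Make it deployable in real-world scenarios by building the LLM-based robot framework on ROS or RoboSDK.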

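Such a robot opens up many possibilities. For an LLM to control a robot, the robot must first expose the relevant control permissions, for example the control permissions and APIs of a smart vehicle; the LLM then controls the robot by calling those APIs. So how do we use an LLM to make a robot carry out a specific instruction?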

A simple approach

A simple flow for the question above: after receiving the user's request, the LLM decomposes the command internally. Here is a simple example.

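The user sends the instruction "Please go to the kitchen and bring me my water cup". After receiving it, the LLM should first generate a series of plans internally, such as: "I should first find where the kitchen is"; "then I need to move to the kitchen"; "I need to find the water cup in the kitchen"; "I need to move next to the water cup and grasp it with my robot arm"; "I need to return to where I started and hand the water cup to the user."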

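For each task, the LLM should build a task tree. A task may spawn subtasks when it is created, and those subtasks form a subtree of their own. With this mechanism, the LLM returns the result once the final task has succeeded.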
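
To make the task-tree idea concrete, here is a minimal sketch in Python. The `TaskNode` class and its methods are hypothetical names chosen for illustration, not part of any framework mentioned in this article, and the tree is filled in by hand rather than by an LLM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskNode:
    """One node of a hypothetical task tree; a node with children is a composite task."""
    description: str
    children: List["TaskNode"] = field(default_factory=list)
    done: bool = False

    def add(self, description: str) -> "TaskNode":
        """Attach a subtask and return it so it can receive subtasks of its own."""
        child = TaskNode(description)
        self.children.append(child)
        return child

    def leaves(self) -> List["TaskNode"]:
        """Return the executable leaf tasks in depth-first (execution) order."""
        if not self.children:
            return [self]
        result: List["TaskNode"] = []
        for child in self.children:
            result.extend(child.leaves())
        return result

    def is_complete(self) -> bool:
        """A leaf is complete when marked done; a composite when all children are."""
        if not self.children:
            return self.done
        return all(child.is_complete() for child in self.children)


# Hand-written decomposition of the water-cup example.
root = TaskNode("Fetch my water cup from the kitchen")
root.add("Find the kitchen")
root.add("Move to the kitchen")
grab = root.add("Pick up the cup")
grab.add("Locate the cup in the kitchen")
grab.add("Move next to the cup and grasp it with the arm")
root.add("Return to the start and hand the cup to the user")

plan = [leaf.description for leaf in root.leaves()]
```

In a real system the LLM would produce the subtasks, and a leaf would be marked done only after the robot reports success; the root then counts as finished when `is_complete()` returns True.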

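For the first task, "I should first find where the kitchen is", the LLM needs sensor data, for example from a lidar or a depth camera, to determine the kitchen's location, and then uses its control permissions to move the robot there.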

How can the robot navigate autonomously?

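A simple implementation can follow Microsoft's PromptCraft-Robotics. The PromptCraft-Robotics repository is a community space for testing and sharing interesting prompt examples for large language models (LLMs) in robotics. It also provides a sample robot simulator (built on Microsoft AirSim) integrated with ChatGPT so that users can get started.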

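In this repository, the authors extend ChatGPT's capabilities to robotics and use language to intuitively control multiple platforms, such as robot arms, drones, and home assistant robots.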

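Prompting LLMs is a highly empirical science. Through trial and error, the PromptCraft-Robotics authors built a methodology and a set of design principles for writing prompts for robotics tasks: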

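First, define a set of high-level robot APIs or a function library. This library can be specific to a particular robot and should map to existing low-level implementations from the robot's control stack or perception library. It is important to give the high-level APIs descriptive names so that ChatGPT can infer their behavior.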

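Next, write a text prompt for ChatGPT that describes the task goal and explicitly states which functions of the high-level library are available. The prompt can also contain information about task constraints, or about how ChatGPT should form its answers (a specific programming language, auxiliary parsing elements).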

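The user stays in the loop to evaluate ChatGPT's code output, either by direct inspection or with a simulator. If needed, the user gives ChatGPT feedback in natural language on the quality and safety of the answer. Once the user is satisfied with the solution, the final code can be deployed to the robot.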
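
The first two steps can be sketched in a few lines of Python: the function library is declared once, and the prompt text is generated from it so the two cannot drift apart. This is only an illustration under my own assumptions; `build_prompt` is a hypothetical helper, and the two stub functions simply mirror the `turn` and `move` commands of the navigation prompt shown in this section.

```python
import inspect
from typing import Callable, List


def turn(angle: float) -> None:
    """turn the robot by a given number of degrees"""


def move(distance: float) -> None:
    """moves the robot straight forward by a given distance in meters"""


def build_prompt(task: str, api: List[Callable]) -> str:
    """Describe the allowed commands from the functions' names and docstrings."""
    lines = ["You are only allowed to give me the following commands:", ""]
    for fn in api:
        params = ", ".join(inspect.signature(fn).parameters)
        lines.append(f"{fn.__name__}({params}): {inspect.getdoc(fn)}")
    lines += ["", "You should reply with only one command at a time.", f"Task: {task}"]
    return "\n".join(lines)


prompt = build_prompt("go to the chairs", [turn, move])
```

Because the command descriptions come from the docstrings, renaming a function or changing its parameters updates the prompt automatically, which is one way to keep the descriptive names the methodology asks for.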

The simple prompt below shows how this works in practice. It is a prompt for LLM-driven indoor robot navigation in simulation.

Imagine I am a robot equipped with a camera and a depth sensor. I am trying to perform a task, and you should help me by sending me commands. You are only allowed to give me the following commands:

turn(angle): turn the robot by a given number of degrees
move(distance): moves the robot straight forward by a given distance in meters.

On each step, I will provide you with the objects in the scene as a list of <object name, distance, angle in degrees>. You should reply with only one command at a time.
The distance is in meters, and the direction angle in degrees with respect to the robot’s orientation. Negative angles are to the left and positive angles are to the right. If a command is not valid, I will ignore it and ask you for another command. If there is no relevant information in the scene, use the available commands to explore the environment.

Task: go to the chairs
Objects in the scene: <door, 0.53 m, 22>, <chair, 4.84 m, -21>, <chair, 2.12 m, -24>, <window, 6.74 m, -3>, <window, 7.17 m, 6>, <door, 0.61 m, 18>, <stairs, 1.42 m, 0>
Command:

move(1.5) # Move forward by 1.5 meters to explore the environment.

Objects in the scene: <seating, 5.05 m, 21>, <chair, 3.93 m, -25>, <window, 4.55 m, 22>, <window, 5.24 m, -4>, <stairs, 3.49 m, 13>, <window, 5.67 m, 7>, <window, 4.37 m, 23>
Command:

turn(-25) # Turn to the left to face the chair that is 3.93 meters away.

Objects in the scene: <chair, 4.85 m, -14>, <chair, 4.76 m, -9>, <chair, 1.57 m, -23>, <chair, 2.66 m, -24>, <window, 4.96 m, 10>, <window, 4.19 m, 25>
Command:

turn(-10) # Turn to the right to explore the environment further.

Objects in the scene: <door, 5.07 m, -25>, <chair, 5.36 m, -6>, <table, 2.17 m, -24>, <chair, 5.01 m, -1>, <chair, 1.80 m, -16>, <chair, 3.15 m, -16>, <window, 4.30 m, 19>
Command:

turn(-6) # Turn to the right to face the chair that is 5.36 meters away.

Objects in the scene: <counter, 1.24 m, -24>, <door, 5.93 m, -17>, <chair, 5.48 m, 1>, <table, 2.60 m, -14>, <chair, 4.92 m, 6>, <chair, 2.02 m, -7>, <chair, 3.53 m, -7>, <chair, 3.16 m, -18>
Command:

turn(1) # Turn to the left to explore the environment further.

Objects in the scene: <door, 5.07 m, -25>, <chair, 5.36 m, -6>, <table, 2.17 m, -24>, <chair, 5.01 m, -1>, <chair, 1.80 m, -16>, <chair, 3.15 m, -16>, <window, 4.30 m, 19>
Command:

move(5.5) # Move forward towards the chair that is 5.36 meters away.

Objects in the scene: <chair, 0.96 m, -10>, <chair, 0.99 m, 20>, <chair, 1.77 m, -20>
Command:

move(1) # Move forward to approach the chairs.

(Figure: Full Navigation)

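The demo above is fairly self-explanatory, so I will not go through it line by line. In short, the LLM is given a role and a fixed input and output format. After every step, the data detected by the sensors is passed back to the LLM, which uses the updated pose and scene information to adjust and plan the next command. Through this mechanism, the LLM steers the robot to a position next to the chairs.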
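
The observe, ask, act loop behind this demo can be sketched as follows. Everything here is a stand-in: `ask_llm`, `observe`, and `execute` are hypothetical callbacks, the LLM is replaced by a scripted list of replies, and the parser only accepts the two commands defined in the prompt.

```python
import re
from typing import Callable, List, Optional, Tuple

# Accepts "turn(x)" or "move(x)" at the start of a reply; a trailing "# comment" is ignored.
COMMAND_RE = re.compile(r"^\s*(turn|move)\(\s*(-?\d+(?:\.\d+)?)\s*\)")

SceneObject = Tuple[str, float, int]  # (name, distance in meters, angle in degrees)


def parse_command(reply: str) -> Optional[Tuple[str, float]]:
    """Return (command name, argument), or None if the reply is not a valid command."""
    match = COMMAND_RE.match(reply)
    if match is None:
        return None
    return match.group(1), float(match.group(2))


def format_scene(objects: List[SceneObject]) -> str:
    """Render detections in the <name, distance, angle> format used by the prompt."""
    items = ", ".join(f"<{name}, {dist:.2f} m, {angle}>" for name, dist, angle in objects)
    return f"Objects in the scene: {items}\nCommand:"


def run_episode(ask_llm: Callable[[str], str],
                observe: Callable[[], List[SceneObject]],
                execute: Callable[[str, float], None],
                max_steps: int) -> List[Tuple[str, float]]:
    """One command per step: observe, ask the LLM, validate, execute."""
    history = []
    for _ in range(max_steps):
        reply = ask_llm(format_scene(observe()))
        command = parse_command(reply)
        if command is None:
            continue  # invalid command: ignore it and ask again on the next step
        execute(*command)
        history.append(command)
    return history


# Scripted stand-ins for the LLM, the perception stack, and the robot.
scripted = iter(["move(1.5) # explore", "fly(3)", "turn(-25) # face the chair"])
executed = []
history = run_episode(
    ask_llm=lambda prompt: next(scripted),
    observe=lambda: [("chair", 3.93, -25)],
    execute=lambda name, value: executed.append((name, value)),
    max_steps=3,
)
```

The invalid reply `fly(3)` is dropped without reaching the robot, which matches the rule in the prompt that an invalid command is ignored and another one is requested.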

Of course, this is only a very simple example, and many hard problems still need to be solved:

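With GPT, response latency is not guaranteed. How do we handle the unpredictable time it takes for a command to come back?
If the next command takes too long to arrive, what should the robot do between two commands?
How do we detect and handle partial sensor failure?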
…
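
One possible way to approach the first two questions is to put a deadline on every LLM call and fall back to a safe command when the deadline passes. The sketch below uses only the Python standard library; `ask_with_deadline` is a hypothetical helper, and treating `move(0)` as "hold position" is my own assumption, not something defined by PromptCraft-Robotics.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

SAFE_COMMAND = "move(0)"  # assumption: moving 0 m is treated as "hold position"


def ask_with_deadline(ask_llm, prompt, timeout_s, pool):
    """Return the LLM reply, or SAFE_COMMAND if no reply arrives within timeout_s."""
    future = pool.submit(ask_llm, prompt)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a call that is already running cannot be interrupted,
        # so its late reply is simply discarded.
        future.cancel()
        return SAFE_COMMAND


def fast_llm(prompt):
    return "turn(-25)"


def slow_llm(prompt):
    time.sleep(0.3)  # simulates a reply that arrives too late
    return "move(5.5)"


with ThreadPoolExecutor(max_workers=2) as pool:
    fast = ask_with_deadline(fast_llm, "scene", 1.0, pool)
    slow = ask_with_deadline(slow_llm, "scene", 0.05, pool)
```

Discarding the late reply matters because by the time it arrives the scene it was based on is stale; the next step should query the LLM again with fresh observations.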

Overall, this is only a simple demo. For more prompt demos, see https://github.com/microsoft/PromptCraft-Robotics

Building RoboAgent and its companion RoboToolKit

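The example above is only a simple demo. Complex tasks need a more elaborate prompt than simple two-dimensional indoor navigation. As with the task tree introduced at the beginning, a complex task requires the LLM to build a rigorous framework for executing every part of the instruction. An LLM that can carry out complex tasks in this way is usually called an Agent. We do not need to build such an Agent framework from scratch: as LLMs have flourished, the community has already produced work we can build on.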

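Early in the project, we can rely on LangChain and use the agent + tool approach to build a RoboAgent that combines task planning, task analysis, command generation, and task execution. To explain how RoboAgent can handle complex robot tasks, I will first introduce LangChain, agents, tools, ReAct, and related prompt-technique concepts.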

Technical background

LangChain

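If you want to build complex LLM applications, I strongly recommend LangChain (admittedly I am a little biased, since I am one of LangChain's developers). LangChain is a powerful framework designed to help developers build end-to-end applications with language models. It provides a set of tools, components, and interfaces that simplify creating applications powered by large language models (LLMs) and chat models. LangChain makes it easy to manage interactions with language models, chain multiple components together, and integrate additional resources such as APIs and databases.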

ReAct

paper: https://arxiv.org/pdf/2210.03629.pdf

ReAct is short for Reasoning and Acting. The basic idea is to give the model a prompt that breaks a Question down into several steps:

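Thought: given this Question, what should I do next?
Action: perform an action. ReAct defines three actions. The first is Search[entity], which returns the first 5 sentences of the entity's Wikipedia page if it exists, and otherwise returns the top 5 similar entities from the Wikipedia search engine. The second is Lookup[string], which returns the next sentence in the page that contains the string, simulating Ctrl+F in a browser. The third is Finish[answer], which completes the current task with the answer.
Observation: the result returned by the external tool, which is fed back to ChatGPT as new prompt input.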

Here are some ReAct prompt demos:

Question: What is the elevation range for the area that the eastern sector of the
Colorado orogeny extends into?
Thought 1: I need to search Colorado orogeny, find the area that the eastern sector
of the Colorado orogeny extends into, then find the elevation range of the
area.
Action 1: Search[Colorado orogeny]
Observation 1: The Colorado orogeny was an episode of mountain building (an orogeny) in
Colorado and surrounding areas.
Thought 2: It does not mention the eastern sector. So I need to look up eastern
sector.
Action 2: Lookup[eastern sector]
Observation 2: (Result 1 / 1) The eastern sector extends into the High Plains and is called
the Central Plains orogeny.
Thought 3: The eastern sector of Colorado orogeny extends into the High Plains. So I
need to search High Plains and find its elevation range.
Action 3: Search[High Plains]
Observation 3: High Plains refers to one of two distinct land regions
Thought 4: I need to instead search High Plains (United States).
Action 4: Search[High Plains (United States)]
Observation 4: The High Plains are a subregion of the Great Plains. From east to west, the
High Plains rise in elevation from around 1,800 to 7,000 ft (550 to 2,130
m).[3]
Thought 5: High Plains rise in elevation from around 1,800 to 7,000 ft, so the answer
is 1,800 to 7,000 ft.
Action 5: Finish[1,800 to 7,000 ft]

Question: Musician and satirist Allie Goertz wrote a song about the "The Simpsons"
character Milhouse, who Matt Groening named after who?
Thought 1: The question simplifies to "The Simpsons" character Milhouse is named after
who. I only need to search Milhouse and find who it is named after.
Action 1: Search[Milhouse]
Observation 1: Milhouse Mussolini Van Houten is a recurring character in the Fox animated
television series The Simpsons voiced by Pamela Hayden and created by Matt
Groening.
Thought 2: The paragraph does not tell who Milhouse is named after, maybe I can look up
"named after".
Action 2: Lookup[named after]
Observation 2: (Result 1 / 1) Milhouse was named after U.S. president Richard Nixon, whose
middle name was Milhous.
Thought 3: Milhouse was named after U.S. president Richard Nixon, so the answer is
Richard Nixon.
Action 3: Finish[Richard Nixon]

Question: Which documentary is about Finnish rock groups, Adam Clayton Powell or The
Saimaa Gesture?
Thought 1: I need to search Adam Clayton Powell and The Saimaa Gesture, and find which
documentary is about Finnish rock groups.
Action 1: Search[Adam Clayton Powell]
Observation 1: Could not find [Adam Clayton Powell]. Similar: [’Adam Clayton Powell
III’, ’Seventh Avenue (Manhattan)’, ’Adam Clayton Powell Jr. State Office
Building’, ’Isabel Washington Powell’, ’Adam Powell’, ’Adam Clayton Powell
(film)’, ’Giancarlo Esposito’].
Thought 2: To find the documentary, I can search Adam Clayton Powell (film).
Action 2: Search[Adam Clayton Powell (film)]
Observation 2: Adam Clayton Powell is a 1989 American documentary film directed by
Richard Kilberg. The film is about the rise and fall of influential
African-American politician Adam Clayton Powell Jr.[3][4] It was later aired
as part of the PBS series The American Experience.
Thought 3: Adam Clayton Powell (film) is a documentary about an African-American
politician, not Finnish rock groups. So the documentary about Finnish rock
groups must instead be The Saimaa Gesture.
Action 3: Finish[The Saimaa Gesture]

Question: What profession does Nicholas Ray and Elia Kazan have in common?
Thought 1: I need to search Nicholas Ray and Elia Kazan, find their professions, then
find the profession they have in common.
Action 1: Search[Nicholas Ray]
Observation 1: Nicholas Ray (born Raymond Nicholas Kienzle Jr., August 7, 1911 - June 16,
1979) was an American film director, screenwriter, and actor best known for
the 1955 film Rebel Without a Cause.
Thought 2: Professions of Nicholas Ray are director, screenwriter, and actor. I need
to search Elia Kazan next and find his professions.
Action 2: Search[Elia Kazan]
Observation 2: Elia Kazan was an American film and theatre director, producer, screenwriter
and actor.
Thought 3: Professions of Elia Kazan are director, producer, screenwriter, and actor.
So profession Nicholas Ray and Elia Kazan have in common is director,
screenwriter, and actor.
Action 3: Finish[director, screenwriter, actor]

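The examples above are few-shot demonstrations whose purpose is to teach ChatGPT how to use the tools. The final Question is the one we actually want ChatGPT to answer, and ChatGPT answers it using the external tools. The result is as follows: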

Question: Author David Chanoff has collaborated with a U.S. Navy admiral who served as the ambassador to the United Kingdom under which President?
Thought 1: I need to search David Chanoff and the U.S. Navy admiral, find the ambassador to the United Kingdom, then find the President they served under.
Action 1: Search[David Chanoff]
Observation 1: David Chanoff is an American author and journalist. He is the co-author of
several books, including Inside the Mirage: America's Fragile Partnership
with Saudi Arabia (2004) and The Game Player: Confessions of the CIA's
Original Political Operative (1989).
Thought 2: I need to search the U.S. Navy admiral David Chanoff collaborated with.
Action 2: Search[U.S. Navy admiral]
Observation 2: Admiral Charles R. Larson was a United States Navy four-star admiral who
served as the United States Ambassador to the United Kingdom from 1998 to
2001.
Thought 3: Admiral Charles R. Larson served as the United States Ambassador to the
United Kingdom from 1998 to 2001. So the President they served under is
Bill Clinton.
Action 3: Finish[Bill Clinton]

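Of course, this is only a brief outline of the idea behind ReAct. There is more complex logic behind it, such as message truncation and prompt handling, that is not covered here. Still, this introduction shows how a prompt can be structured so that ReAct handles more complex functionality.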
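
The Thought, Action, Observation cycle can be sketched as a small driver loop. This is a simplified illustration rather than the implementation from the ReAct paper or from LangChain: `react_loop` is a hypothetical function, the LLM is replaced by a scripted stub, and the two tools return canned strings taken from the Milhouse example above.

```python
import re

# Matches lines such as "Action 2: Lookup[named after]".
ACTION_RE = re.compile(r"Action \d+: (\w+)\[(.*)\]")


def react_loop(llm, tools, question, max_steps=5):
    """Alternate LLM output and tool observations until the LLM emits Finish[...]."""
    prompt = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        output = llm(prompt)  # expected shape: "Thought n: ...\nAction n: Name[arg]"
        prompt += output + "\n"
        match = ACTION_RE.search(output)
        if match is None:
            break  # the model did not produce a parsable action
        name, arg = match.groups()
        if name == "Finish":
            return arg
        observation = tools[name](arg)
        prompt += f"Observation {step}: {observation}\n"
    return None


def scripted_llm(prompt):
    """Stand-in for ChatGPT: picks the next reply by counting observations so far."""
    replies = [
        "Thought 1: I only need to search Milhouse and find who it is named after.\n"
        "Action 1: Search[Milhouse]",
        "Thought 2: The paragraph does not tell who Milhouse is named after.\n"
        "Action 2: Lookup[named after]",
        "Thought 3: Milhouse was named after Richard Nixon.\n"
        "Action 3: Finish[Richard Nixon]",
    ]
    return replies[prompt.count("Observation")]


tools = {
    "Search": lambda entity: "Milhouse Mussolini Van Houten is a recurring character in The Simpsons.",
    "Lookup": lambda text: "Milhouse was named after U.S. president Richard Nixon.",
}

answer = react_loop(scripted_llm, tools, "Who is the character Milhouse named after?")
```

The important design point is that the prompt only ever grows: every Thought, Action, and Observation is appended, which is also why real implementations need the message truncation mentioned above.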

Agent and Tool

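A ReAct prompt alone is not enough to build the solution for this project. We need a complete framework that can fine-tune the ReAct prompt, tell ChatGPT which tools are available and how to use them, call the right tool based on ChatGPT's output, and use the tool's result for further steps. As the system grows more complex, we need to introduce the concepts of Agent and Tool.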

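In LLM prompt engineering, an Agent is a higher-level executor responsible for scheduling and dispatching complex tasks. After the user gives the Agent a request, the Agent decomposes it through Action Plan Generation into a series of plans. The Agent then executes each plan automatically and uses the ReAct prompting technique to observe the output of each step, draw its own conclusion from that output, and continue the task based on the conclusion, until it judges that it has the result it wants.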

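We can build a ToolKit for the Agent. For each Tool, we provide a prompt containing the tool's name and usage instructions, and we implement the corresponding functionality; for a FileWriteTool, for example, the code must actually write to a file. The Tools can then be injected into the SystemMessage as a system preset when the Agent is initialized, which gives the Agent the ability to call external tools. LangChain already provides this kind of framework, so implementing an Agent is convenient and highly customizable, which lets us tailor RoboAgent in depth.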
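
To show the mechanism without depending on any particular library version, here is a framework-free sketch in plain Python. It deliberately does not use LangChain's classes: `Tool`, `build_system_message`, and `dispatch` are hypothetical names, and the `<path>|<text>` input format of the file-writing tool is an assumption made for this example.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    """A tool is a name and a usage description for the prompt, plus the code behind it."""
    name: str
    description: str
    func: Callable[[str], str]


def write_file(arg: str) -> str:
    """Write text to a file. Input format (an assumption): '<path>|<text>'."""
    path, _, text = arg.partition("|")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return f"wrote {len(text)} characters to {path}"


def build_system_message(tools: List[Tool]) -> str:
    """Produce the system preset that tells the model which tools exist."""
    lines = ["You can use the following tools:"]
    lines += [f"- {tool.name}: {tool.description}" for tool in tools]
    lines.append("To call a tool, reply with: Action: <tool name>[<input>]")
    return "\n".join(lines)


def dispatch(tools: List[Tool], name: str, arg: str) -> str:
    """Run the tool the model asked for and return its observation."""
    registry = {tool.name: tool for tool in tools}
    if name not in registry:
        return f"Unknown tool: {name}"
    return registry[name].func(arg)


tools = [Tool("FileWriteTool", "Write text to a file. Input: '<path>|<text>'.", write_file)]
system_message = build_system_message(tools)

demo_path = os.path.join(tempfile.gettempdir(), "robo_agent_note.txt")
result = dispatch(tools, "FileWriteTool", demo_path + "|hello")
```

Returning an error string for an unknown tool, instead of raising, lets the agent loop pass the problem back to the model as an observation so it can try a different action.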

Building RoboToolKit

To build and validate robot query statements, we can borrow the approach of LangChain's SQLDatabaseToolkit when designing RoboToolKit. Specifically, RoboToolKit can be divided into the following parts.

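RoboQueryTool: looks up the available robot commands.
RoboInfoTool: queries the robot's current state.
RoboActionTool: issues action commands to the robot. This may end up not being a single RoboActionTool but a set of concrete action implementations, such as moving forward or backward.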

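Before a task is passed to RoboToolKit, it may need preprocessing. Specifically, preprocessing converts the task into a format that RoboToolKit can handle, for example by extracting features such as the concrete action information.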
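
A toy version of the three tools might look like the sketch below. It is only a design illustration: `FakeRobot` stands in for a real robot or simulator, the command grammar (`forward <meters>`, `backward <meters>`, `turn <degrees>`) and the counterclockwise-positive heading are assumptions of mine, and nothing here calls RoboSDK or ROS.

```python
import math
from dataclasses import dataclass


@dataclass
class FakeRobot:
    """Stand-in for a real robot; the pose is updated instantly and perfectly."""
    x: float = 0.0
    y: float = 0.0
    heading_deg: float = 0.0  # assumption: counterclockwise-positive, 0 points along +x


class RoboToolKit:
    # The command grammar is an assumption made for this sketch.
    COMMANDS = {
        "forward": "forward <meters>",
        "backward": "backward <meters>",
        "turn": "turn <degrees>",
    }

    def __init__(self, robot: FakeRobot) -> None:
        self.robot = robot

    def query(self) -> str:
        """RoboQueryTool: list the commands the robot accepts."""
        return "; ".join(self.COMMANDS.values())

    def info(self) -> str:
        """RoboInfoTool: report the robot's current pose."""
        r = self.robot
        return f"x={r.x:.2f} m, y={r.y:.2f} m, heading={r.heading_deg:.0f} deg"

    def act(self, command: str) -> str:
        """RoboActionTool: validate a command, then apply it to the robot."""
        parts = command.split()
        if len(parts) != 2 or parts[0] not in self.COMMANDS:
            return f"invalid command: {command!r}"
        try:
            value = float(parts[1])
        except ValueError:
            return f"invalid command: {command!r}"
        r = self.robot
        if parts[0] == "turn":
            r.heading_deg = (r.heading_deg + value) % 360
        else:
            step = value if parts[0] == "forward" else -value
            r.x += step * math.cos(math.radians(r.heading_deg))
            r.y += step * math.sin(math.radians(r.heading_deg))
        return "ok: " + self.info()


kit = RoboToolKit(FakeRobot())
replies = [kit.act(c) for c in ("forward 2", "jump 1", "turn 90", "forward 1")]
```

Validating inside `act` plays the same role as the query checker in a SQL toolkit: a malformed command produced by the LLM is rejected with a message the agent can read, instead of reaching the robot.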

Building RoboAgent

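Build a RoboAgent that can call the functions of RoboToolKit. We need to design a suitable prompt and use ReAct to achieve zero-shot understanding of complex requests; RoboSDK then executes the generated concrete commands and ultimately drives the robot.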

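Prompt design follows three steps: 1) design templates; 2) generate templates; 3) select the best template. Validating prompt effectiveness requires follow-up tests with side-by-side comparison.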
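
The selection step can be sketched as a small evaluation harness that scores every candidate template on the same test cases. All names here are hypothetical, the two templates and two test cases are made up for illustration, and `fake_model` is a deliberately crude stand-in for an LLM, so the scores say nothing about real model behavior.

```python
from typing import Callable, Dict, List, Tuple


def select_best_template(templates: List[str],
                         cases: List[Tuple[str, str]],
                         run_model: Callable[[str], str]) -> Tuple[str, Dict[str, float]]:
    """Score each template by exact-match accuracy on the cases and return the best one."""
    def accuracy(template: str) -> float:
        hits = sum(run_model(template.format(task=task)) == expected
                   for task, expected in cases)
        return hits / len(cases)

    scores = {template: accuracy(template) for template in templates}
    best = max(scores, key=scores.get)
    return best, scores


def fake_model(prompt: str) -> str:
    """Stand-in for an LLM: it only obeys prompts that spell out the output format."""
    if "Reply with exactly one command" in prompt:
        return "forward 1" if "forward" in prompt else "turn 90"
    return "Sure! I will help."


templates = [
    "Task: {task}",
    "Reply with exactly one command.\nTask: {task}",
]
cases = [
    ("go forward one meter", "forward 1"),
    ("rotate by 90 degrees", "turn 90"),
]

best, scores = select_best_template(templates, cases, fake_model)
```

With a real model, exact-match accuracy would probably be replaced by a task-level metric such as success rate in simulation, but the side-by-side structure of the comparison stays the same.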

Building the simulation and validating in closed loop

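After basic functional verification, we need to complete closed-loop validation in simulation, and then debug step by step to improve the robot's performance until it meets the expected targets.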

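After installing the development environment (LangChain, RoboSDK, and so on), unit-test each module of RoboAgent and RoboToolKit, and run tests in both standalone mode and, if possible, on a physical robot to evaluate the results.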

Summary

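This article described how to use LLM capabilities to build a system that controls a robot through complex instructions, and introduced some current frontier research, such as Google's SayCan and Microsoft's PromptCraft. Finally, it outlined my own design approach. 2023 is a year of rapid growth for LLMs, and more LLM-plus-robotics projects and research will surely appear, which is something to look forward to. I also look forward to exchanging ideas with like-minded readers.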
