GPT Models Meet Robotic Applications: Long-Step Robot Control in Various Environments
Introduction
Imagine having a humanoid robot in your household that you can teach household chores through instruction and demonstration, without writing any code. Our team has been developing such a system, which we call Learning-from-Observation.
As part of our effort, we recently released a paper, "ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application," where we provide a specific example of how OpenAI's ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of executable robot actions. Our prompts and the source code for using them are open-source and publicly available at this GitHub repository.
Generating robot programs from natural language is an appealing goal that has drawn considerable interest in the robotics community, and several recent systems are built on top of large language models such as ChatGPT. However, most of them were developed within a limited scope, are hardware-dependent, or lack human-in-the-loop functionality. Additionally, most of these studies rely on a specific dataset, which requires data recollection and model retraining when transferring or extending them to other robotic scenes. From a practical application standpoint, an ideal robotic solution would be one that can be easily applied to other applications or operational settings without requiring extensive data collection or model retraining.
In this paper, we provide a specific example of how ChatGPT can be used in a few-shot setting to convert natural language instructions into a sequence of actions that a robot can execute. In designing the prompts, we tried to ensure that they meet the requirements common to many practical applications while also being structured so that they can be easily customized. The requirements we defined for this paper are:
- Easy integration with robot execution systems or visual recognition programs.
- Applicability to various home environments.
- The ability to provide an arbitrary number of natural language instructions while minimizing the impact of ChatGPT’s token limit.
To meet these requirements, we designed input prompts to encourage ChatGPT to:
- Output a sequence of predefined robot actions with explanations in a readable JSON format.
- Represent the operating environment in a formalized style.
- Infer and output the updated state of the operating environment, which can be reused as the next input, allowing ChatGPT to operate based solely on the memory of the latest operations (see the sketch after this list).
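To make these three points concrete, the following is a minimal, hypothetical sketch of the kind of structured output the prompts are designed to elicit. The field names, action vocabulary, and environment representation shown here are illustrative assumptions, not the exact schema used in our published prompts; please refer to the GitHub repository for the actual definitions.

```python
import json

# Hypothetical example of the structured response we ask ChatGPT to produce:
# a sequence of predefined robot actions with explanations, plus the inferred
# state of the environment after execution, which can be fed back into the
# next prompt. Field names are illustrative, not the exact schema.
example_response = {
    "task_sequence": [
        {"action": "move_hand(near, fridge_handle)",
         "explanation": "Reach toward the fridge handle."},
        {"action": "grasp_object(fridge_handle)",
         "explanation": "Grasp the handle firmly."},
        {"action": "open_door(fridge)",
         "explanation": "Pull the door open."},
    ],
    "environment_after": {
        "objects": ["fridge", "fridge_handle", "juice_bottle"],
        "object_states": {"fridge": "door_open"},
        "robot_hand": "holding(fridge_handle)",
    },
}

print(json.dumps(example_response, indent=2))
```

Keeping the output in a machine-readable form like this is what makes integration with robot execution systems and visual recognition programs straightforward, and the updated environment description can be carried forward into the next prompt in place of the full conversation history.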
We provide a set of prompt templates that structure the entire conversation for input into ChatGPT, enabling it to generate a response. The user’s instructions, as well as a specific explanation of the working environment, are incorporated into the template and used to generate ChatGPT’s response. For the second and subsequent instructions, ChatGPT’s next response is created based on all previous turns of the conversation, allowing ChatGPT to make corrections based on its own previous output and user feedback, if requested. If the number of input tokens exceeds the allowable limit for ChatGPT, we adjust the token size by truncating the prompt while retaining the most recent information about the updated environment.
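As a rough illustration of this bookkeeping, here is a minimal, hypothetical Python sketch of how the conversation could be assembled and truncated. The function names, message roles, token budget, and the crude token counter are all illustrative assumptions rather than the implementation in our released code.

```python
# Hypothetical sketch of conversation management with token-limit handling.
# A real implementation would use an actual tokenizer (e.g., tiktoken) and an
# OpenAI-style chat API; everything here is a simplified stand-in.

def count_tokens(messages):
    # Crude approximation: roughly four characters per token.
    return sum(len(m["content"]) // 4 for m in messages)

def build_messages(system_prompt, latest_environment, history, new_instruction,
                   max_tokens=3000):
    """Assemble the next request, dropping the oldest turns if needed."""
    messages = (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "user",
            "content": f"Environment:\n{latest_environment}\n\n"
                       f"Instruction:\n{new_instruction}"}]
    )
    # Truncate from the oldest conversational turns, but never drop the system
    # prompt or the final message carrying the latest environment description.
    while count_tokens(messages) > max_tokens and len(messages) > 2:
        messages.pop(1)
    return messages
```

Because the latest environment description is regenerated by ChatGPT at every turn, discarding older turns in this way loses little information that is needed for the next action sequence.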
In our paper, we demonstrated the effectiveness of our proposed prompts in inferring appropriate robot actions for multi-stage language instructions in various environments. Additionally, we observed that ChatGPT’s conversational ability allows users to adjust its output with natural language feedback, which is crucial for developing an application that is both safe and robust while providing a user-friendly interface.
Integration with vision systems and robot controllers
Among recent experimental attempts to generate robot manipulation from natural language using ChatGPT, our work is unique in its focus on the generation of robot action sequences (i.e., "what-to-do"), while avoiding redundant language instructions to obtain visual and physical parameters (i.e., "how-to-do"), such as how to grab, how high to lift, and what posture to adopt. Although both types of information are essential for operating a robot in reality, the latter is often better presented visually than explained verbally. Therefore, we have focused on designing prompts for ChatGPT to recognize what-to-do, while obtaining the how-to-do information from human visual demonstrations and a vision system during robot execution.
As part of our efforts to develop a realistic robotic operation system, we have integrated the proposed system with a learning-from-observation system that includes a speech interface [1], [2], a visual teaching interface [3], a reusable library of robot actions [4], and a simulator for testing robot execution [5]. If you are interested, please refer to the respective papers for the results of robot execution. The code for the teaching interface is available at another GitHub repository.
Conclusion
The main contribution of this paper is the provision and publication of generic prompts for ChatGPT that can be easily adapted to meet the specific needs of individual experimenters. The impressive progress of large language models is expected to further expand their use in robotics. We hope that this paper provides practical knowledge to the robotics research community, and we have made our prompts and source code available as open-source material on this GitHub repository.
Bibliography
@ARTICLE{10235949,
  author={Wake, Naoki and Kanehira, Atsushi and Sasabuchi, Kazuhiro and Takamatsu, Jun and Ikeuchi, Katsushi},
  journal={IEEE Access},
  title={ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application},
  year={2023},
  volume={11},
  pages={95060-95078},
  doi={10.1109/ACCESS.2023.3310935}
}
About our research group
Visit our homepage: Applied Robotics Research
Learn more about this project
- [homepage] Learning-from-Observation
- [paper] A Learning-from-Observation Framework: One-Shot Robot Teaching for Grasp-Manipulation-Release Household Operations
- [paper] Interactive Learning-from-Observation through multimodal human demonstration
- [paper] Learning-from-Observation System Considering Hardware-Level Reusability
- [paper] Task-sequencing Simulator: Integrated Machine Learning to Execution Simulation for Robot Manipulation
- [blog] GPT Models Meet Robotic Applications: Co-Speech Gesturing Chat System – Microsoft Research